I was wrong. According to Adobe TN5116, the stream should contain the whole JFIF file, including the APP0 headers etc., not just the raw DCT data.
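
For anyone who wants to check their own streams, here is a rough sketch of how one could test whether a byte array (for example, the bytes of an image stream dumped from the PDF) starts with a JPEG start-of-image marker followed by a JFIF APP0 segment. The class and method names are made up for illustration and are not part of PDFBox; how you extract the stream bytes is left out. It only checks the first few bytes and does not validate the rest of the file.

```java
// Hypothetical helper, not part of PDFBox: inspects raw bytes for JPEG/JFIF markers.
final class JpegStreamCheck {
    private JpegStreamCheck() {}

    /** Returns the offset of the first SOI marker (0xFF 0xD8), or -1 if there is none. */
    static int findStartOfImage(byte[] data) {
        for (int i = 0; i + 1 < data.length; i++) {
            if ((data[i] & 0xFF) == 0xFF && (data[i + 1] & 0xFF) == 0xD8) {
                return i;
            }
        }
        return -1;
    }

    /** True if the data begins with SOI, then an APP0 segment whose identifier is "JFIF\0". */
    static boolean looksLikeJfif(byte[] data) {
        // SOI (2 bytes) + APP0 marker (2) + segment length (2) + "JFIF\0" (5) = 11 bytes
        if (data.length < 11) {
            return false;
        }
        return (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xD8   // SOI
            && (data[2] & 0xFF) == 0xFF && (data[3] & 0xFF) == 0xE0   // APP0
            && data[6] == 'J' && data[7] == 'F'
            && data[8] == 'I' && data[9] == 'F' && data[10] == 0;
    }
}
```

If findStartOfImage returns -1 for one of the streams that shows up blank, that would at least be consistent with the "Could not find start of jpeg data" error from the viewer.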

So, has anyone seen a problem like this? I'm starting to suspect I have a bad JVM/ImageIO; I'm going to try running my code on another system.

On Mon, Feb 13, 2012 at 8:57 AM, Jason Cwik <[email protected]> wrote:

> Hi All,
>
> I'm using pdfbox 1.6 to generate PDF files. These text files contain some
> simple text and JPEG images. The JPEGs are small (~157x200), representing
> thumbnails of other documents.
>
> The problem is, only about half of my images display. The rest have a
> blank box where the image should be. Also, if I run the viewer like
> pdfedit or evince from the command line, you see errors:
>
> jason@butters:~/Desktop$ evince msg4.pdf
> Error: Could not find start of jpeg data
> Error: Could not find start of jpeg data
> Error: Could not find start of jpeg data
> Error: Could not find start of jpeg data
> Error: Could not find start of jpeg data
>
>
> Looking at PDJpeg, it looks like it reads in my JPEG to a BufferedImage,
> and then recompresses it to the stream. The problem is (I think), that if
> you look at the PDF spec it seems that the stream should really be just the
> raw DCT data. However, when you look at the PDFs generated by PDFBox, I
> see the JPEG headers (e.g. 0xff, ... "JFIF") in the stream. It seems like
> the PDF viewers are being lenient and trying to find the DCT data, but
> giving up on some of my images.
>
> Does this sound correct??
>
> Thanks,
> Jason
>
>

--
Jason Cwik
CTO
Connectic, Inc
Cell: 612-217-0442

