Yes, I believe this is a masked image. I did a close reading of the PDF 1.7 spec and I think that's what I have.
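
A minimal sketch of explicit masking as the PDF 1.7 spec describes it, using only java.awt and no PDFBox API. It rests on these assumptions: the base image and its /Mask stencil are both mapped onto the same unit square, so the mask may have a higher resolution than the image; with the default Decode array a mask sample of 0 paints the base image and a sample of 1 leaves the page unchanged; and the mask has already been decoded into a 1-bit BufferedImage. The class and method names are hypothetical, and using a second image as the backdrop is a guess at how the page content stream draws things.

```java
import java.awt.image.BufferedImage;
import java.awt.image.Raster;

public class ExplicitMaskSketch {

    /**
     * Paints 'base' over 'backdrop' through a 1-bit stencil 'mask'.
     * The result has the mask's resolution; base and backdrop are
     * sampled with nearest-neighbour scaling because all three images
     * are assumed to cover the same area of the page.
     */
    public static BufferedImage composite(BufferedImage backdrop,
                                          BufferedImage base,
                                          BufferedImage mask) {
        int w = mask.getWidth();
        int h = mask.getHeight();
        Raster stencil = mask.getRaster();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Assumption: raw sample 0 = paint base, 1 = keep backdrop.
                boolean paint = stencil.getSample(x, y, 0) == 0;
                BufferedImage src = paint ? base : backdrop;
                int sx = x * src.getWidth() / w;
                int sy = y * src.getHeight() / h;
                out.setRGB(x, y, src.getRGB(sx, sy));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny stand-ins: a light backdrop, a dark base, a 3x finer mask.
        BufferedImage backdrop = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        BufferedImage base = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 2; y++) {
            for (int x = 0; x < 2; x++) {
                backdrop.setRGB(x, y, 0xF0F0F0);
                base.setRGB(x, y, 0x202020);
            }
        }
        BufferedImage mask = new BufferedImage(6, 6, BufferedImage.TYPE_BYTE_BINARY);
        // Mask out the right half (sample 1); the left half stays 0 (paint).
        for (int y = 0; y < 6; y++) {
            for (int x = 3; x < 6; x++) {
                mask.getRaster().setSample(x, y, 0, 1);
            }
        }
        BufferedImage out = composite(backdrop, base, mask);
        System.out.println(Integer.toHexString(out.getRGB(0, 0) & 0xFFFFFF));
        System.out.println(Integer.toHexString(out.getRGB(5, 5) & 0xFFFFFF));
    }
}
```

If the mask dictionary carried a Decode array of [1 0], the test on the sample value would be inverted.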

The sample I'm testing with can be found here:

https://dl.dropbox.com/u/20078596/pdfScannedPageWithMaskedImage.pdf

Here are the dictionary entries for the three XObjects in the document:

9 0 obj
<</BitsPerComponent 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length 19570/Name/image_bg0/Subtype/Image/Type/XObject/Width 850>>

10 0 obj
<</BitsPerComponent 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length 8521/Mask 11 0 R/Name/image_fg0/Subtype/Image/Type/XObject/Width 850>>

11 0 obj
<</BitsPerComponent 1/DecodeParms<</Columns 2550/K -1>>/Filter/CCITTFaxDecode/Height 3300/ImageMask true/Length 10266/Name/image_sel/Subtype/Image/Type/XObject/Width 2550>>

So if I understand what this is saying, object 11 is the image mask applied to object 10.

In my test code I made a little StreamEngine that simply reports on all XObjects and writes any PDXObjectImage objects to the file system. This is the output I get on this test document:

processOperator(): objectName="image_bg0"
processOperator(): object type="PDJpeg"
processOperator(): image class=PDJpeg
processOperator(): imageWidth="850"
processOperator(): imageHeight="1100"
Creating file /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_bg0_0.jpg
processOperator(): objectName="image_fg0"
processOperator(): object type="PDJpeg"
processOperator(): image class=PDJpeg
processOperator(): imageWidth="850"
processOperator(): imageHeight="1100"
Creating file /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_fg0_1.jpg

The objectName="..." line is emitted for any XObject of any type, so it looks like the ImageMask object (image_sel) is not being reported as an XObject.

Thanks,

Eliot

On 12/9/12 6:58 AM, "Andreas Lehmkuehler" <[email protected]> wrote:

> Hi,
>
> Am 06.12.2012 18:48, schrieb Eliot Kimber:
>> I am trying to find QR codes on PDFs that are scanned page images. My code
>> works fine for scans produced by my OfficeJet and for page images produced
>> out of Acrobat but scans produced by my client's eCopy ShareScan device
>> (according to the PDF metadata) are not usable.
>>
>> Looking into the PDF data stream, each page is represented by two images, a
>> "bg" image that is what I would expect for the page image, but very faint
>> grey, and a "fg" image that reflects the page content but with lots of grey
>> and ghosting.
> Sounds like masked images, but that's just a guess.
>
>> The PDF renderer must be combining these two images in some way to provide
>> the clear image I see in Acrobat.
>>
>> Is there something I can find in the PDF data stream that will tell me how
>> these images are combined and, if so, can anyone point me in the right
>> direction for processing these images? I am pretty new to Java image
>> processing so I'm not sure where to look or what to look for.
>>
>> The images themselves are reported by PDFBox as PDJpeg objects.
>>
>> I can provide a sample PDF page if it's needed.
> Due to some restrictions you can't attach it to a posting. Please post a
> download link referring to a public location or create an issue on jira [1]
>
>>
>> Thanks,
>>
>> Eliot
>>
>
> BR
> Andreas Lehmkühler
>
> [1] https://issues.apache.org/jira/browse/PDFBOX

--
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/

