Re: Handling Graphics from Scanned PDF

Eliot Kimber Sun, 09 Dec 2012 06:36:47 -0800

Yes, I believe this is a masked image. I did a close reading of the PDF 1.7
spec and I think that's what I have.


The sample I'm testing with can be found here:

https://dl.dropbox.com/u/20078596/pdfScannedPageWithMaskedImage.pdf

Here are the dictionary entries for the three XObjects in the document:

9 0 obj
<</BitsPerComponent 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]
/Height 1100/Length 19570/Name/image_bg0/Subtype/Image/Type/XObject/Width 850>>

10 0 obj
<</BitsPerComponent 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]
/Height 1100/Length 8521/Mask 11 0 R/Name/image_fg0/Subtype/Image/Type/XObject
/Width 850>>

11 0 obj
<</BitsPerComponent 1/DecodeParms<</Columns 2550/K -1>>/Filter/CCITTFaxDecode
/Height 3300/ImageMask true/Length 10266/Name/image_sel/Subtype/Image
/Type/XObject/Width 2550>>

So if I understand what this is saying, object 11 is the image mask applied
to object 10.
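
A minimal sketch of what that combination might look like once all three
images have been decoded, using only java.awt.image and no PDFBox calls. It
rests on assumptions that are not confirmed by the dump above: that the page
paints image_bg0 first and then image_fg0 through its mask, that all three
images cover the same page area (the mask is 2550x3300 against 850x1100, so
exactly 3x in each direction), and that with the default Decode array a mask
sample of 0 means "paint the base image" while 1 means "leave what is already
there", which is how I read the stencil-mask section of the PDF 1.7 spec. The
class and method names are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.awt.image.Raster;
import java.awt.image.WritableRaster;

/** Hypothetical helper: composites a background and a masked foreground. */
final class MaskCompositor {
    private MaskCompositor() {}

    /**
     * Builds a grayscale image at the mask's resolution.
     * Assumes bg and fg are single-band gray images and mask is a 1-bit
     * image whose raster samples are 0 or 1 (e.g. TYPE_BYTE_BINARY).
     */
    static BufferedImage composite(BufferedImage bg, BufferedImage fg,
                                   BufferedImage mask) {
        int w = mask.getWidth();
        int h = mask.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Raster bgR = bg.getRaster();
        Raster fgR = fg.getRaster();
        Raster maskR = mask.getRaster();
        WritableRaster outR = out.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Assumption: sample 0 = paint foreground, 1 = keep background.
                boolean paint = maskR.getSample(x, y, 0) == 0;
                Raster src = paint ? fgR : bgR;
                // Nearest-neighbour lookup into the lower-resolution image.
                int sx = x * src.getWidth() / w;
                int sy = y * src.getHeight() / h;
                outR.setSample(x, y, 0, src.getSample(sx, sy, 0));
            }
        }
        return out;
    }
}
```

If the result comes out looking like a negative of the page, the mask polarity
assumption is probably the wrong way round for this file and the `== 0` test
should be flipped.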

In my test code I made a little StreamEngine that simply reports on all
XObjects and writes any PDXObjectImage objects to the file system. This is
the output I get on this test document:

processOperator(): objectName="image_bg0"
processOperator(): object type="PDJpeg"
processOperator(): image class=PDJpeg
processOperator(): imageWidth="850"
processOperator(): imageHeight="1100"
Creating file /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_bg0_0.jpg
processOperator(): objectName="image_fg0"
processOperator(): object type="PDJpeg"
processOperator(): image class=PDJpeg
processOperator(): imageWidth="850"
processOperator(): imageHeight="1100"
Creating file /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_fg0_1.jpg

The objectName="..." line is emitted for every XObject the engine sees,
whatever its type.

So it looks like the ImageMask object is not being reported as an XObject.

Thanks,

Eliot

On 12/9/12 6:58 AM, "Andreas Lehmkuehler" <[email protected]> wrote:

> Hi,
> 
> Am 06.12.2012 18:48, schrieb Eliot Kimber:
>> I am trying to find QR codes on PDFs that are scanned page images. My code
>> works fine for scans produced by my OfficeJet and for page images produced
>> out of Acrobat but scans produced by my client's eCopy ShareScan device
>> (according to the PDF metadata) are not usable.
>> 
>> Looking into the PDF data stream, each page is represented by two images, a
>> "bg" image that is what I would expect for the page image, but very faint
>> grey, and a "fg" image that reflects the page content but with lots of grey
>> and ghosting.
> Sounds like masked images, but that's just a guess.
> 
>> The PDF renderer must be combining these two images in some way to provide
>> the clear image I see in Acrobat.
>> 
>> Is there something I can find in the PDF data stream that will tell me how
>> these images are combined and, if so, can anyone point me in the right
>> direction for processing these images? I am pretty new to Java image
>> processing so I'm not sure where to look or what to look for.
>> 
>> The images themselves are reported by PDFBox as PDJpeg objects.
>> 
>> I can provide a sample PDF page if it's needed.
> Due to some restrictions you can't attach it to a posting. Please post a
> download link referring to a public location or create an issue on jira [1]
> 
>> 
>> Thanks,
>> 
>> Eliot
>> 
> 
> 
> BR
> Andreas Lehmkühler
> 
> [1] https://issues.apache.org/jira/browse/PDFBOX

-- 
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/
