Hi Maruan,

thank you, I now have a first clue about what is happening. As you suggested, I 
ran the ExtractImages command from the command line, which produced many 
images. These are exactly the images I see on the pages I created with 
convertToImage().

Using the ExtractText command from the command line, I get all the text from 
this PDF. So somehow convertToImage() seems to return only the results of 
"ExtractImages" for this particular PDF.
I also tried PDFToImage with the nonSeq parameter; it produces exactly 
the same semi-empty pages as my Java code.

So I conclude that for some PDFs convertToImage() renders text and images, 
while for others it renders only the images. Is this the expected behaviour?

All PDFs I process have 'real' text, which is selectable and is not 
covered by an image layer of text of some sort (at least I think so).
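
Since every page is expected to contain text, one rough way to flag a 
suspicious rendering afterwards is to measure how much of the BufferedImage 
is not white. The following is only a sketch of that heuristic, not anything 
PDFBox provides: the class name BlankPageCheck, both method names and the 
threshold value are made up for illustration, and it assumes the page is 
rendered on a white background. It uses only java.awt.image.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox: estimates how much of a
// rendered page image is covered by anything darker than near-white.
final class BlankPageCheck {

    // Assumption: channel values at or above this count as "white".
    static final int WHITE_THRESHOLD = 250;

    // Returns the fraction (0.0 to 1.0) of pixels with at least one
    // colour channel below WHITE_THRESHOLD.
    static double inkCoverage(BufferedImage img) {
        int width = img.getWidth();
        int height = img.getHeight();
        long inked = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r < WHITE_THRESHOLD || g < WHITE_THRESHOLD || b < WHITE_THRESHOLD) {
                    inked++;
                }
            }
        }
        return (double) inked / ((long) width * height);
    }

    // True if less than minCoverage of the page is non-white.
    // minCoverage is a tuning value the caller has to choose.
    static boolean looksBlank(BufferedImage img, double minCoverage) {
        return inkCoverage(img) < minCoverage;
    }
}
```

A call such as BlankPageCheck.looksBlank(img, 0.001) on the image returned by 
convertToImage() would only catch fully blank pages. For the semi-empty pages 
described above, comparing the coverage against the amount of text the page 
yields through text extraction might work better, but I have not tried that.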

I uploaded the PDF and the output of PDFToImage to 
https://www.dropbox.com/sh/inkcdahx4da1kzp/13bnj-BrZt

Cheers,

Alex



--
Dr. Alexander G. Klenner
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven, D-53754 Sankt Augustin
Tel.: +49 - 2241 - 14 - 2736
E-mail: [email protected]
Internet: http://www.scai.fraunhofer.de


----- Original Message -----
From: "Maruan Sahyoun" <[email protected]>
To: [email protected]
Sent: Monday, April 8, 2013 9:20:10 AM
Subject: Re: errors with PDPage.convertToImage()

Hi,

unfortunately the attachment didn't make it through.

Could you try the PDF in question using the command line app ExtractImages with 
the -nonSeq parameter, or use the following code:

PDDocument pdDoc = PDDocument.loadNonSeq(…)

The NonSequentialParser gives better results if the document has incremental 
updates. In addition, it's not necessary to create a new PDDocument from the 
cosDoc, as parser.getPDDocument() already returns a PDDocument ….

BR from your neighborhood


Maruan Sahyoun

Am 08.04.2013 um 08:52 schrieb Alexander Klenner 
<[email protected]>:

> Hi all,
> 
> I frequently come across PDFs where the convertToImage() method generates 
> blank or partly blank images. One of those PDFs is attached to this mail. 
> 
> My code for processing: 
> 
> // Parse the file with the sequential parser (PDFBox 1.8)
> PDFParser parser = new PDFParser(new FileInputStream(f));
> parser.parse();
> COSDocument cosDoc = parser.getDocument();
> 
> PDDocument pdDoc = new PDDocument(cosDoc);
> ..
> Iterator<PDPage> it = pdDoc.getDocumentCatalog().getAllPages().iterator();
> PDPage page = it.next();
> ...
> PDRectangle cropBox = page.findCropBox();
> Dimension dimension = cropBox.createDimension();
> ...
> BufferedImage img = page.convertToImage(BufferedImage.TYPE_INT_RGB, 
> ImageParser.PARAM_DPI);
> 
> 
> I am using pdfbox-app-1.8.0.jar.
> 
> So I have two questions: 
> 
> 1. Is there a different way to extract the page as an image that I am not 
> aware of, one that yields the correct image? 
> 2. Or is it possible to detect that this page was not extracted correctly, 
> before or after the extraction?
> 
> At the moment I simply cannot tell when I am dealing with a corrupted image.
> 
> Thanks a lot for any hints,
> 
> Alex
> 
> 
