Hi,

could you also try the PDFToImage command, as Andreas suggested (and as I 
actually meant)? It converts a PDF to images page by page. ExtractImages only 
extracts the images embedded in the page and doesn't deal with text, line art ….

I will take a quick look at the sample you provided.
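
On the second question from the original mail (detecting pages that did not 
render correctly): as far as I know PDFBox has no built-in check for this, so 
the following is only a rough sketch of a heuristic that counts non-white 
pixels in the rendered BufferedImage. The class name BlankPageCheck, the 
method names and the threshold values are made up for illustration, and the 
sketch assumes the renderer paints the page on a white background.

```java
import java.awt.image.BufferedImage;

/** Heuristic check for rendered pages that came out blank or nearly blank. */
final class BlankPageCheck {

    /** Returns the fraction of pixels (0.0 to 1.0) that are darker than the white threshold. */
    static double inkRatio(BufferedImage img, int whiteThreshold) {
        int width = img.getWidth();
        int height = img.getHeight();
        long inked = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // A pixel counts as "ink" if any colour channel is below the threshold.
                if (r < whiteThreshold || g < whiteThreshold || b < whiteThreshold) {
                    inked++;
                }
            }
        }
        return (double) inked / ((long) width * height);
    }

    /** True if less than minInkRatio of the pixels carry any ink (assumed threshold: 250). */
    static boolean looksBlank(BufferedImage img, double minInkRatio) {
        return inkRatio(img, 250) < minInkRatio;
    }

    public static void main(String[] args) {
        // Stand-in for the result of page.convertToImage(...): a white 100x100 image.
        BufferedImage page = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 100; y++) {
            for (int x = 0; x < 100; x++) {
                page.setRGB(x, y, 0xFFFFFF);
            }
        }
        System.out.println("ink ratio, empty page: " + inkRatio(page, 250));

        // Draw a black horizontal bar to simulate some rendered content.
        for (int x = 10; x < 90; x++) {
            page.setRGB(x, 50, 0x000000);
        }
        System.out.println("looks blank after drawing: " + looksBlank(page, 0.001));
    }
}
```

This only catches pages that are empty or almost empty. A page where the 
images are drawn but the text is missing would still pass, so for that case 
one would probably have to combine it with text extraction (for example, 
flag a page that yields text but has a very low ink ratio).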

BR

Maruan Sahyoun

On 08.04.2013 at 10:14, Alexander Klenner 
<[email protected]> wrote:

> Hi Maruan,
> 
> thank you, I now have a first clue about what is happening. As you suggested, 
> I ran the ExtractImages command from the command line, which produces many 
> images. These are exactly the ones I see on the pages I created with 
> convertToImage().
> 
> Using the ExtractText command from the command line, I get all the text from 
> this PDF. So for this particular PDF, convertToImage() somehow seems to 
> return only what "ExtractImages" returns.
> I also tried PDFToImage with the -nonSeq parameter; it returns exactly the 
> same semi-empty pages that my Java code produces.
> 
> So I conclude that for some PDFs convertToImage() renders text and images, 
> while for others it renders only the images. Is this the expected behaviour?
> 
> All the PDFs I process contain 'real', selectable text that is not covered 
> by an image layer of some sort (at least I think so).
> 
> I uploaded the PDF and the output of PDFToImage to 
> https://www.dropbox.com/sh/inkcdahx4da1kzp/13bnj-BrZt
> 
> Cheers,
> 
> Alex
> 
> 
> 
> --
> Dr. Alexander G. Klenner
> Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
> Schloss Birlinghoven, D-53754 Sankt Augustin
> Tel.: +49 - 2241 - 14 - 2736
> E-mail: [email protected]
> Internet: http://www.scai.fraunhofer.de
> 
> 
> ----- Original Message -----
> From: "Maruan Sahyoun" <[email protected]>
> To: [email protected]
> Sent: Monday, April 8, 2013 9:20:10 AM
> Subject: Re: errors with PDPage.convertToImage()
> 
> Hi,
> 
> unfortunately the attachment didn't make it through.
> 
> Could you try the PDF in question using the command line app ExtractImages 
> with the -nonSeq parameter, or use the following code:
> 
> PDDocument pdDoc = PDDocument.loadNonSeq(…)
> 
> The NonSequentialPDFParser gives better results if the document has 
> incremental updates. In addition, it's not necessary to create a new 
> PDDocument from the cosDoc, as parser.getPDDocument() already returns a 
> PDDocument ….
> 
> BR from your neighborhood
> 
> 
> Maruan Sahyoun
> 
> On 08.04.2013 at 08:52, Alexander Klenner 
> <[email protected]> wrote:
> 
>> Hi all,
>> 
>> I frequently come across PDFs where the convertToImage() method is 
>> generating blank or partly blank images. One of those PDFs is attached to 
>> this mail. 
>> 
>> My code for processing: 
>> 
>> PDFParser parser = new PDFParser(new FileInputStream(f));
>> parser.parse();
>> COSDocument cosDoc = parser.getDocument();
>> 
>> PDDocument pdDoc = new PDDocument(cosDoc);
>> ..
>> Iterator<PDPage> it = pdDoc.getDocumentCatalog().getAllPages().iterator();
>> PDPage page = it.next();
>> ...
>> PDRectangle cropBox = page.findCropBox();
>> Dimension dimension = cropBox.createDimension();
>> ...
>> BufferedImage img = page.convertToImage(BufferedImage.TYPE_INT_RGB, 
>> ImageParser.PARAM_DPI);
>> 
>> 
>> I am using pdfbox-app-1.8.0.jar.
>> 
>> So I have two questions: 
>> 
>> 1. Is there a different way to extract the page as an image, one that I am 
>> not aware of, that would give the correct image? 
>> 2. Or is it possible to detect, before or after the extraction, that a page 
>> was not extracted correctly?
>> 
>> At the moment I simply cannot tell when I am dealing with a corrupted image.
>> 
>> Thanks a lot for any hints,
>> 
>> Alex
>> 
>> --
>> Dr. Alexander G. Klenner
>> Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
>> Schloss Birlinghoven, D-53754 Sankt Augustin
>> Tel.: +49 - 2241 - 14 - 2736
>> E-mail: [email protected]
>> Internet: http://www.scai.fraunhofer.de
>> 
