Hi Maruan,

thank you, I now have a first clue about what is happening. As you suggested, I ran the ExtractImages command from the command line. It produces many images, and those are exactly the images I see on the pages created by convertToImage().
Using ExtractText from the command line, I get all the text from this PDF. So somehow convertToImage() seems to return only the results of "ExtractImages" for this particular PDF. I also tried PDFToImage with the nonSeq parameter; it returns exactly the semi-empty pages that my Java code produces.

So I conclude that for some PDFs convertToImage() returns text and images, and for others it returns only images. Is this the expected behaviour? All PDFs I process have 'real' text that is selectable and is not covered by an image layer of text of some sort (at least I think so).

I uploaded the PDF and the output of PDFToImage to https://www.dropbox.com/sh/inkcdahx4da1kzp/13bnj-BrZt

Cheers,

Alex

--
Dr. Alexander G. Klenner
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven, D-53754 Sankt Augustin
Tel.: +49 - 2241 - 14 - 2736
E-mail: [email protected]
Internet: http://www.scai.fraunhofer.de

----- Original Message -----
From: "Maruan Sahyoun" <[email protected]>
To: [email protected]
Sent: Monday, April 8, 2013 9:20:10 AM
Subject: Re: errors with PDPage.convertToImage()

Hi,

unfortunately the attachment didn't make it through. Could you try the PDF in question using the command line app ExtractImages with the -nonSeq parameter, or use the following code:

PDDocument pdDoc = PDDocument.loadNonSeq(…)

The NonSequentialParser gives better results if the document has incremental updates. In addition, it's not necessary to create a new PDDocument from the cosDoc, as parser.getDocument already passes a PDDocument ….

BR from your neighborhood

Maruan Sahyoun

On 08.04.2013 at 08:52, Alexander Klenner <[email protected]> wrote:

> Hi all,
>
> I frequently come across PDFs where the convertToImage() method is generating
> blank or partly blank images. One of those PDFs is attached to this mail.
>
> My code for processing:
>
> PDFParser parser;
> parser = new PDFParser(new FileInputStream(f));
> parser.parse();
> cosDoc = parser.getDocument();
>
> pdDoc = new PDDocument(cosDoc);
> ..
> Iterator<PDPage> it = pdDoc.getDocumentCatalog().getAllPages().iterator();
> PDPage page = it.next();
> ...
> PDRectangle cropBox = page.findCropBox();
> Dimension dimension = cropBox.createDimension();
> ...
> BufferedImage img = page.convertToImage(BufferedImage.TYPE_INT_RGB,
> ImageParser.PARAM_DPI);
>
>
> I am using pdfbox-app-1.8.0.jar.
>
> So I have two questions:
>
> 1. Is there a different way to extract the page as an image that I am not
> aware of to get the correct image?
> 2. Or is it possible to detect that this page was not extracted correctly
> before or after the extraction?
>
> At the moment I just don't know when I am dealing with a corrupted image.
>
> Thanks a lot for any hints,
>
> Alex
>
> --
> Dr. Alexander G. Klenner
> Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
> Schloss Birlinghoven, D-53754 Sankt Augustin
> Tel.: +49 - 2241 - 14 - 2736
> E-mail: [email protected]
> Internet: http://www.scai.fraunhofer.de
>
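
On the second question in the quoted mail (detecting a badly rendered page after the extraction), one rough check that needs nothing from PDFBox is to measure how much of the rendered BufferedImage is not white. The following is only a minimal sketch under the assumption that pages have a white background. The class BlankPageCheck, its methods inkRatio and looksBlank, and the tolerance and threshold values are all hypothetical names and numbers chosen for illustration. A page that legitimately contains very little ink will also score low, so this can flag suspicious pages but cannot prove that rendering failed.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox: estimates how much of a rendered page is "inked".
class BlankPageCheck {

    // Returns the fraction (0.0 to 1.0) of pixels that have at least one
    // colour channel darker than (255 - tolerance), i.e. pixels that are not near-white.
    public static double inkRatio(BufferedImage img, int tolerance) {
        int width = img.getWidth();
        int height = img.getHeight();
        int threshold = 255 - tolerance;
        long inked = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r < threshold || g < threshold || b < threshold) {
                    inked++;
                }
            }
        }
        return (double) inked / ((long) width * height);
    }

    // Assumption: a page counts as "blank" when fewer than minInkRatio of its
    // pixels are non-white. The tolerance of 10 is an arbitrary illustrative value.
    public static boolean looksBlank(BufferedImage img, double minInkRatio) {
        return inkRatio(img, 10) < minInkRatio;
    }

    // Fills the rectangle [x0, x1) x [y0, y1) with a single RGB colour.
    private static void fill(BufferedImage img, int x0, int y0, int x1, int y1, int rgb) {
        for (int y = y0; y < y1; y++) {
            for (int x = x0; x < x1; x++) {
                img.setRGB(x, y, rgb);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the image returned by page.convertToImage(...).
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);

        fill(page, 0, 0, 200, 100, 0xFFFFFF);          // completely white page
        System.out.println(looksBlank(page, 0.001));   // prints "true"

        fill(page, 0, 0, 100, 100, 0x000000);          // left half black
        System.out.println(inkRatio(page, 10));        // prints "0.5"
    }
}
```

In the processing loop from the quoted mail, the img returned by convertToImage() could be passed to looksBlank, and any page that scores as blank although ExtractText finds text on it could be logged for manual inspection.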

