Hi Andreas,

sorry, I was busy uploading the PDFs and writing the mail and didn't see your mail, but I figured PDFToImage might be the correct choice here ;).
I do not get any exceptions, only some info logs:

Apr 8, 2013 10:16:49 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BX
Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BMC
Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Apr 8, 2013 10:16:50 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: DP
Apr 8, 2013 10:16:51 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Apr 8, 2013 10:16:52 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EX

I get those for every page in this document.

Cheers,

Alex

--
Dr. Alexander G. Klenner
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven, D-53754 Sankt Augustin
Tel.: +49 - 2241 - 14 - 2736
E-mail: [email protected]
Internet: http://www.scai.fraunhofer.de

----- Original Message -----
From: "Andreas Lehmkühler" <[email protected]>
To: [email protected]
Sent: Monday, April 8, 2013 9:58:25 AM
Subject: Re: errors with PDPage.convertToImage()

Hi,

Maruan Sahyoun <[email protected]> wrote on 8 April 2013 at 09:20:
> Hi,
>
> unfortunately the attachment didn't make it through, due to some security
> restrictions.
> Could you try the PDF in question using the command line app ExtractImage with
> the -nonSeq parameter or use the following code

I guess there is a misunderstanding. Please use PDFToImage to create one image
for each page [1]. Provide us with any possible exception or log.

> PDDocument pdDoc = PDDocument.loadNonSeq(…)
>
> The NonSequentialParser gives better results if the document has incremental
> updates.
> In addition it's not necessary to create a new PDDocument from the cosDoc as
> parser.getDocument already passes a PDDocument ….

+1, that's an old pattern and should not be used any more.

> BR from your neighborhood

I'm not that far away either ;-)

> Maruan Sahyoun
>
> On 08.04.2013 at 08:52, Alexander Klenner
> <[email protected]> wrote:
>
> > Hi all,
> >
> > I frequently come across PDFs where the convertToImage() method is
> > generating blank or partly blank images. One of those PDFs is attached to
> > this mail.
> >
> > My code for processing:
> >
> > PDFParser parser;
> > parser = new PDFParser(new FileInputStream(f));
> > parser.parse();
> > cosDoc = parser.getDocument();
> >
> > pdDoc = new PDDocument(cosDoc);
> > ..
> > Iterator<PDPage> it = pdDoc.getDocumentCatalog().getAllPages().iterator();
> > PDPage page = it.next();
> > ...
> > PDRectangle cropBox = page.findCropBox();
> > Dimension dimension = cropBox.createDimension();
> > ...
> > BufferedImage img = page.convertToImage(BufferedImage.TYPE_INT_RGB,
> > ImageParser.PARAM_DPI);
> >
> >
> > I am using pdfbox-app-1.8.0.jar.
> >
> > So I have two questions:
> >
> > 1. Is there a different way to extract the page as an image that I am not
> > aware of to get the correct image?
> > 2. Or is it possible to detect that this page was not extracted correctly
> > before or after the extraction?
> >
> > At the moment I just don't know when I am dealing with a corrupted image.
> >
> > Thanks a lot for any hints,
> >
> > Alex
> >
> > --
> > Dr. Alexander G. Klenner
> > Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
> > Schloss Birlinghoven, D-53754 Sankt Augustin
> > Tel.: +49 - 2241 - 14 - 2736
> > E-mail: [email protected]
> > Internet: http://www.scai.fraunhofer.de
>

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/commandlineutilities/PDFToImage.html
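
On question 2 in the quoted mail (detecting a blank page image after extraction), one rough option is to count how many pixels of the rendered BufferedImage differ from the background. The sketch below uses only java.awt.image and is not part of PDFBox. The class name BlankImageCheck, both method names, and the 0.001 threshold are hypothetical, and it assumes the renderer paints an untouched page pure white. It can only flag a completely (or almost completely) empty render; a partly blank page will usually pass this check.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not a PDFBox API: a heuristic for spotting blank renders.
class BlankImageCheck {

    // Fraction of pixels whose RGB value is not pure white (alpha is ignored).
    static double nonWhiteFraction(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        long nonWhite = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if ((img.getRGB(x, y) & 0xFFFFFF) != 0xFFFFFF) {
                    nonWhite++;
                }
            }
        }
        return (double) nonWhite / ((long) w * h);
    }

    // Assumption: a page with fewer than minFraction non-white pixels is "blank".
    static boolean looksBlank(BufferedImage img, double minFraction) {
        return nonWhiteFraction(img) < minFraction;
    }

    public static void main(String[] args) {
        // Stand-in for a rendered page: a 100x100 image filled with white.
        BufferedImage img = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 100; y++) {
            for (int x = 0; x < 100; x++) {
                img.setRGB(x, y, 0xFFFFFF);
            }
        }
        System.out.println("empty page blank?   " + looksBlank(img, 0.001));

        // Paint a 20x20 black square to simulate rendered content.
        for (int y = 10; y < 30; y++) {
            for (int x = 10; x < 30; x++) {
                img.setRGB(x, y, 0x000000);
            }
        }
        System.out.println("painted page blank? " + looksBlank(img, 0.001));
    }
}
```

In the code from the quoted mail, the image returned by page.convertToImage(...) could be passed to looksBlank(img, 0.001) and the page logged for manual inspection when it returns true. The threshold would need tuning against real documents, since a legitimately empty page is also reported as blank.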

