Hi,

could you also try the PDFToImage command, as Andreas suggested (and as I actually meant)? It converts a PDF to images page by page. ExtractImages extracts only the images embedded on the page; it doesn't deal with text, line art ….
I will take a quick look at the sample you provided.

BR
Maruan Sahyoun

On 08.04.2013 at 10:14, Alexander Klenner <[email protected]> wrote:

> Hi Maruan,
>
> thank you, I now have a first clue about what is happening. As you suggested, I
> ran the ExtractImages command from the command line, which produces many
> images. Those are exactly the ones I see on the pages created by
> convertToImage().
>
> Using the ExtractText command from the command line, I get all the text from
> this PDF. So somehow convertToImage() for this particular PDF seems to return
> only the results from "ExtractImages".
> I also tried PDFToImage with the nonSeq parameter; it returns exactly the
> semi-empty pages that my Java code produces.
>
> So I conclude that for some PDFs convertToImage() returns text and images,
> and for others it returns only images. Is this the expected behaviour?
>
> All PDFs I process have 'real' text, which is selectable and is not
> covered by an image layer of text of some sort (at least I think so).
>
> I uploaded the PDF and the output of PDFToImage to
> https://www.dropbox.com/sh/inkcdahx4da1kzp/13bnj-BrZt
>
> Cheers,
>
> Alex
>
> --
> Dr. Alexander G. Klenner
> Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
> Schloss Birlinghoven, D-53754 Sankt Augustin
> Tel.: +49 - 2241 - 14 - 2736
> E-mail: [email protected]
> Internet: http://www.scai.fraunhofer.de
>
> ----- Original Message -----
> From: "Maruan Sahyoun" <[email protected]>
> To: [email protected]
> Sent: Monday, April 8, 2013 9:20:10 AM
> Subject: Re: errors with PDPage.convertToImage()
>
> Hi,
>
> unfortunately the attachment didn't make it through.
>
> Could you try the PDF in question using the command line app ExtractImages
> with the -nonSeq parameter, or use the following code:
>
> PDDocument pdDoc = PDDocument.loadNonSeq(…)
>
> The NonSequentialParser gives better results if the document has incremental
> updates. In addition, it's not necessary to create a new PDDocument from the
> cosDoc, as parser.getPDDocument() already returns a PDDocument ….
>
> BR from your neighborhood
>
> Maruan Sahyoun
>
> On 08.04.2013 at 08:52, Alexander Klenner
> <[email protected]> wrote:
>
>> Hi all,
>>
>> I frequently come across PDFs where the convertToImage() method
>> generates blank or partly blank images. One of those PDFs is attached to
>> this mail.
>>
>> My code for processing:
>>
>> PDFParser parser;
>> parser = new PDFParser(new FileInputStream(f));
>> parser.parse();
>> cosDoc = parser.getDocument();
>>
>> pdDoc = new PDDocument(cosDoc);
>> ..
>> Iterator<PDPage> it = pdDoc.getDocumentCatalog().getAllPages().iterator();
>> PDPage page = it.next();
>> ...
>> PDRectangle cropBox = page.findCropBox();
>> Dimension dimension = cropBox.createDimension();
>> ...
>> BufferedImage img = page.convertToImage(BufferedImage.TYPE_INT_RGB,
>> ImageParser.PARAM_DPI);
>>
>> I am using pdfbox-app-1.8.0.jar.
>>
>> So I have two questions:
>>
>> 1. Is there a different way to extract the page as an image that I am not
>> aware of, to get the correct image?
>> 2. Or is it possible to detect that this page was not extracted correctly,
>> before or after the extraction?
>>
>> At the moment I just can't tell when I am dealing with a corrupted image.
>>
>> Thanks a lot for any hints,
>>
>> Alex
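
Regarding question 2 in the original message (detecting after the fact that a page was not rendered correctly), one possible heuristic is to measure how much of the rendered BufferedImage is non-white. The sketch below is only an illustration under assumptions: BlankPageCheck, inkRatio and looksBlank are hypothetical names and not part of PDFBox, and the brightness threshold and minimum ink ratio are values that would have to be tuned per document set. It uses only the JDK, and it can flag blank or nearly blank pages, but it cannot by itself tell that text is missing from a page that still shows its images.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of PDFBox): estimates how much of a rendered page is non-white. */
final class BlankPageCheck {

    private BlankPageCheck() {}

    /**
     * Returns the fraction of pixels, in [0, 1], whose average brightness
     * is below the given threshold (0..255). These pixels count as "ink".
     */
    static double inkRatio(BufferedImage img, int threshold) {
        int width = img.getWidth();
        int height = img.getHeight();
        long dark = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Simple average brightness; darker than the threshold means "ink".
                if ((r + g + b) / 3 < threshold) {
                    dark++;
                }
            }
        }
        return (double) dark / ((long) width * height);
    }

    /** Assumption: a page with less ink than minInkRatio is treated as suspicious. */
    static boolean looksBlank(BufferedImage img, int threshold, double minInkRatio) {
        return inkRatio(img, threshold) < minInkRatio;
    }
}
```

A call such as BlankPageCheck.looksBlank(img, 250, 0.001) on the image returned by convertToImage() could then be combined with the text extracted for the same page: a page that yields text but shows almost no ink is a candidate for a failed rendering.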

