[ https://issues.apache.org/jira/browse/PDFBOX-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-3037:
------------------------------------
    Attachment: 001131.pdf

> Text extraction decodes image files
> -----------------------------------
>
>                 Key: PDFBOX-3037
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3037
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.0
>
>         Attachments: 001131.pdf
>
>
> I get this with text extraction of file 001131.pdf:
> {code}
> java.io.IOException: Could not read JPEG 2000 (JPX) image
>     at org.apache.pdfbox.filter.JPXFilter.readJPX(JPXFilter.java:90)
>     at org.apache.pdfbox.filter.JPXFilter.decode(JPXFilter.java:59)
>     at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>     at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>     at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:234)
>     at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:145)
>     at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:69)
>     at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:342)
>     at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:50)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:819)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:476)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:448)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
> {code}
> This shouldn't happen: we shouldn't even try to decode images when
> extracting text, because it wastes time and memory.
> The cause is this line in DrawObject:
> {code}
> PDXObject xobject = context.getResources().getXObject(name);
> {code}
> This call results in the object being created and its contents being decoded.
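
To illustrate the general idea behind the report (looking up an object should not force its contents to be decoded), here is a minimal, self-contained sketch of deferred decoding. It is not PDFBox code and not the fix applied for this issue; the names LazyDecodeDemo, Decoder and LazyImage are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class LazyDecodeDemo {

    /** Hypothetical stand-in for an expensive stream filter (for example a JPX decoder). */
    interface Decoder {
        byte[] decode(byte[] encoded) throws IOException;
    }

    /** Hypothetical image object: keeps the encoded bytes and decodes only on first use. */
    static final class LazyImage {
        private final byte[] encoded;
        private final Decoder decoder;
        private byte[] decoded; // stays null until getPixels() is called

        LazyImage(byte[] encoded, Decoder decoder) {
            this.encoded = encoded;
            this.decoder = decoder;
        }

        /** Cheap metadata access; never touches the decoder. */
        int getEncodedLength() {
            return encoded.length;
        }

        /** Decodes on the first call and caches the result for later calls. */
        synchronized byte[] getPixels() throws IOException {
            if (decoded == null) {
                decoded = decoder.decode(encoded);
            }
            return decoded;
        }
    }

    public static void main(String[] args) {
        AtomicInteger decodeCalls = new AtomicInteger();

        // A decoder that always fails, mimicking the unreadable JPX image in the report.
        Decoder failing = data -> {
            decodeCalls.incrementAndGet();
            throw new IOException("Could not read JPEG 2000 (JPX) image");
        };

        // Creating the object and reading metadata does not decode anything,
        // so a consumer that only cares about text never sees the exception.
        LazyImage image = new LazyImage(new byte[] {1, 2, 3}, failing);
        System.out.println("encoded length: " + image.getEncodedLength());
        System.out.println("decode calls so far: " + decodeCalls.get());

        // Only a consumer that really needs the pixels pays the cost (and the failure).
        try {
            image.getPixels();
        } catch (IOException e) {
            System.out.println("decode failed on demand: " + e.getMessage());
        }
    }
}
```

With this arrangement the decoding cost and any decoding error are confined to callers that actually request pixel data. Whether PDFBox should defer decoding inside the image object or skip image objects entirely during text extraction is a separate design question that this sketch does not settle.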