[jira] [Updated] (PDFBOX-3037) Text extraction decodes image files

Tilman Hausherr (JIRA) Tue, 20 Oct 2015 09:15:58 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tilman Hausherr updated PDFBOX-3037:
------------------------------------
    Attachment: 001131.pdf

> Text extraction decodes image files
> -----------------------------------
>
>                 Key: PDFBOX-3037
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3037
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.0
>
>         Attachments: 001131.pdf
>
>
> I get this with text extraction of file 001131.pdf:
> {code}
> java.io.IOException: Could not read JPEG 2000 (JPX) image
>       at org.apache.pdfbox.filter.JPXFilter.readJPX(JPXFilter.java:90)
>       at org.apache.pdfbox.filter.JPXFilter.decode(JPXFilter.java:59)
>       at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>       at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>       at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:234)
>       at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:145)
>       at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:69)
>       at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:342)
>       at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:50)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:819)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:476)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:448)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
> {code}
> This shouldn't happen: we shouldn't even try to decode images when
> extracting text, because doing so wastes time and memory.
> The cause is this line in DrawObject:
> {code}
> PDXObject xobject =  context.getResources().getXObject(name);
> {code}
> This call results in the object being created and its contents being decoded.
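
One possible way to avoid the decode is to look at the XObject's subtype before building the object, and to skip images entirely during text extraction. The sketch below illustrates only that idea. It uses no PDFBox classes, and every name in it (RawXObject, getEager, getForText, sampleResources, decodeCalls) is hypothetical; it is not the PDFBox API and not necessarily how this issue was fixed.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

class LazyXObjectSketch {
    /** Counts how often the (expensive) stream decoder ran. */
    static int decodeCalls = 0;

    /** Hypothetical stand-in for an XObject dictionary plus its encoded stream. */
    static final class RawXObject {
        final String subtype;            // e.g. "Image" or "Form"
        final Supplier<byte[]> decoder;  // expensive decode, run only on demand

        RawXObject(String subtype, Supplier<byte[]> decoder) {
            this.subtype = subtype;
            this.decoder = decoder;
        }
    }

    /** Eager lookup: decodes whatever is asked for, images included. */
    static byte[] getEager(Map<String, RawXObject> resources, String name) {
        RawXObject raw = resources.get(name);
        return raw == null ? null : raw.decoder.get();
    }

    /** Text-extraction lookup: checks the subtype first and leaves images undecoded. */
    static byte[] getForText(Map<String, RawXObject> resources, String name) {
        RawXObject raw = resources.get(name);
        if (raw == null || "Image".equals(raw.subtype)) {
            return null; // nothing to extract text from, so skip the decode
        }
        return raw.decoder.get();
    }

    /** Simulated decoder that records each invocation. */
    static byte[] decode(String payload) {
        decodeCalls++;
        return payload.getBytes(StandardCharsets.UTF_8);
    }

    /** Builds a resource map with one image and one form XObject. */
    static Map<String, RawXObject> sampleResources() {
        Map<String, RawXObject> resources = new LinkedHashMap<>();
        resources.put("Im0", new RawXObject("Image", () -> decode("pixels")));
        resources.put("Fm0", new RawXObject("Form", () -> decode("BT (Hello) Tj ET")));
        return resources;
    }

    public static void main(String[] args) {
        Map<String, RawXObject> resources = sampleResources();

        getEager(resources, "Im0");
        System.out.println("decodes after eager lookup: " + decodeCalls);

        decodeCalls = 0;
        getForText(resources, "Im0"); // image: skipped
        getForText(resources, "Fm0"); // form: decoded, it may contain text
        System.out.println("decodes after text-only lookups: " + decodeCalls);
    }
}
```

Form XObjects still have to be decoded in this sketch, since their content streams can contain text operators; only image XObjects are skipped.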



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
