[jira] [Comment Edited] (PDFBOX-3340) Image decoded twice without a real need

John Hewson (JIRA) Wed, 11 May 2016 12:55:56 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280662#comment-15280662
 ]


John Hewson edited comment on PDFBOX-3340 at 5/11/16 7:54 PM:
--------------------------------------------------------------

>  But for the first time, the image is decompressed in the constructor of 
> PDImageXObject [...] just to allow the filter (CCITTFaxFilter in this case) 
> to provide additional dictionary parameters in case something is missing in 
> the input [...] I think this is a complete waste.

Yes, it is. The JPX filter doesn't get this luxury due to JAI and we're forced 
to parse the entire image in that case. But for CCITT (and others) we can 
indeed skip the decoding. Some possibly helpful information with regard to 
this:

- Repair was a very late addition to PDFBox and its implementation is 
definitely sub-optimal
- Filter instances are singletons and have no local state
- Repair should be non-destructive: we don't want to modify the COSStream 
itself, because that prevents clean round-tripping (hence DecodeResult).
- We can't afford to cache the initial parsed images, as they can be too large 
and persist for too long.
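
To make the constraints above concrete, here is a minimal sketch of a repair 
step that looks only at the dictionary and never decodes the image. It is an 
assumption-laden illustration, not PDFBox code: RepairSketch, SketchFilter and 
CcittLikeFilter are hypothetical names, and a plain Map stands in for the 
stream dictionary. The filter is a stateless singleton, and the repaired 
parameters are returned as a copy so the original dictionary stays untouched.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are PDFBox classes.
final class RepairSketch {

    /** Repair inspects the dictionary only; it never reads the image data. */
    interface SketchFilter {
        Map<String, Object> repair(Map<String, Object> streamDictionary);
    }

    /** Singleton with no fields, so it can be shared safely between callers. */
    enum CcittLikeFilter implements SketchFilter {
        INSTANCE;

        @Override
        public Map<String, Object> repair(Map<String, Object> streamDictionary) {
            // Work on a copy so the caller's dictionary is never modified.
            Map<String, Object> repaired = new HashMap<>(streamDictionary);
            // Assumed default, mirroring the DeviceGray fallback described in the issue.
            repaired.putIfAbsent("ColorSpace", "DeviceGray");
            return Collections.unmodifiableMap(repaired);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> original = new HashMap<>();
        original.put("Width", 1728);
        original.put("Height", 2200);

        Map<String, Object> repaired = CcittLikeFilter.INSTANCE.repair(original);

        System.out.println(original.containsKey("ColorSpace")); // prints false
        System.out.println(repaired.get("ColorSpace"));         // prints DeviceGray
    }
}
```

Returning a separate, read-only view plays roughly the role the comment 
attributes to DecodeResult: the repaired values are available to the caller 
while the source dictionary can still be written back unchanged.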


> Image decoded twice without a real need
> ---------------------------------------
>
>                 Key: PDFBOX-3340
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3340
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Petr Slaby
>            Priority: Minor
>
> Take the pdf from PDFBOX-1708, put a breakpoint into the class 
> CCITTFaxFilter, method decode() and run PDFToImage. You will see the debugger 
> stop twice, even if the pdf contains a single image. 
> The second call arrives when the image is rendered to G2D, this is OK. But 
> for the first time, the image is decompressed in the constructor of 
> PDImageXObject - line 147 
> {noformat}
> this(stream, resources, stream.createInputStream());
> {noformat}
> just to allow the filter (CCITTFaxFilter in this case) to provide additional 
> dictionary parameters in case something is missing in the input (COLORSPACE 
> would be set to DeviceGray if missing here).
> I think this is a complete waste. The filter should be able to fix the 
> dictionary without having to decode the image. As far as I can tell, this 
> could be done by implementing a repair method on COSStream and on 
> implementations of Filter.
> Also, I do not see that the stream created in the above mentioned constructor 
> of PDImageXObject would ever be closed. This seems to be a more general 
> issue. I have put a counter into COSInputStream.create(), at the point where it 
> creates new RandomAccessInputStream(buffer). With the testfile from 
> PDFBOX-1708, I end up with 3 unclosed streams when the program finishes. I am 
> not sure whether this is important, but I guess the unclosed streams are 
> uselessly occupying space in the scratch file.
> Sorry if this is just a lack of understanding of the code on my side, but I 
> could not resist reporting what I see. 
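
The counter experiment described in the issue can be approximated with plain 
JDK classes. This is only a sketch under stated assumptions: StreamCloseSketch 
and CountingStream are hypothetical names, and a ByteArrayInputStream stands in 
for the stream that PDFBox would back with its scratch file. It shows how 
try-with-resources brings the count of open streams back to zero.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: these types are not part of PDFBox.
final class StreamCloseSketch {

    /** Number of streams that were created but not yet closed. */
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Wrapper that counts itself on creation and un-counts on first close. */
    static final class CountingStream extends FilterInputStream {
        private boolean closed;

        CountingStream(InputStream in) {
            super(in);
            OPEN.incrementAndGet();
        }

        @Override
        public void close() throws IOException {
            if (!closed) {
                closed = true;
                OPEN.decrementAndGet();
            }
            super.close();
        }
    }

    /** Reads one byte; try-with-resources closes the stream on every path. */
    static int readFirstByte(byte[] data) {
        try (InputStream in = new CountingStream(new ByteArrayInputStream(data))) {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int first = readFirstByte(new byte[] {42, 7});
        System.out.println(first + " " + OPEN.get()); // prints "42 0"
    }
}
```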



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
