[jira] [Commented] (PDFBOX-2893) Simplify COSStream encoding and decoding

Tilman Hausherr (JIRA) Sat, 15 Aug 2015 13:41:22 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698441#comment-14698441
 ]


Tilman Hausherr commented on PDFBOX-2893:
-----------------------------------------

I don't know whether this is related to the COSStream change or to an earlier 
change, but when I run the merge test on the two files I just committed, using 
this code:
{code}
    public void testPDFMergerUtility() throws IOException
    {
        checkMergeIdentical("PDFBox.GlobalResourceMergeTest.Doc01.pdf",
                "PDFBox.GlobalResourceMergeTest.Doc02.pdf",
                "GlobalResourceMergeTestResult.pdf", 
                false);
        
        // once again, with scratch file
        checkMergeIdentical("PDFBox.GlobalResourceMergeTest.Doc01.pdf",
                "PDFBox.GlobalResourceMergeTest.Doc02.pdf",
                "GlobalResourceMergeTestResult2.pdf", 
                true);
    }
{code}
I get this exception for the test with a scratch file:
{code}
testPDFMergerUtility(org.apache.pdfbox.multipdf.PDFMergerUtilityTest)  Time elapsed: 4.802 sec  <<< ERROR!
java.io.IOException: Buffer already closed
        at org.apache.pdfbox.io.ScratchFileBuffer.checkClosed(ScratchFileBuffer.java:91)
        at org.apache.pdfbox.io.ScratchFileBuffer.seek(ScratchFileBuffer.java:289)
        at org.apache.pdfbox.io.RandomAccessInputStream.restorePosition(RandomAccessInputStream.java:47)
        at org.apache.pdfbox.io.RandomAccessInputStream.read(RandomAccessInputStream.java:78)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.FilterInputStream.read(FilterInputStream.java:107)
        at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:66)
        at org.apache.pdfbox.multipdf.PDFCloneUtility.cloneForNewDocument(PDFCloneUtility.java:117)
        at org.apache.pdfbox.multipdf.PDFCloneUtility.cloneForNewDocument(PDFCloneUtility.java:98)
        at org.apache.pdfbox.multipdf.PDFCloneUtility.cloneForNewDocument(PDFCloneUtility.java:133)
        at org.apache.pdfbox.multipdf.PDFMergerUtility.appendDocument(PDFMergerUtility.java:477)
        at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:196)
        at org.apache.pdfbox.multipdf.PDFMergerUtilityTest.checkMergeIdentical(PDFMergerUtilityTest.java:103)
        at org.apache.pdfbox.multipdf.PDFMergerUtilityTest.testPDFMergerUtility(PDFMergerUtilityTest.java:61)
{code}
So somehow the scratch file seems to be handled differently when a stream is 
Flate-encoded.
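
For illustration only, here is a minimal sketch of the failure mode the trace points at: an input stream that is still open reads through a backing buffer that has already been closed. The classes ClosableBuffer, BufferView and ClosedBufferDemo are hypothetical stand-ins, not PDFBox code, and the sketch makes no claim about where the buffer actually gets closed in the merge path.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a scratch-file-backed buffer (not PDFBox code).
final class ClosableBuffer {
    private final byte[] data;
    private boolean closed;

    ClosableBuffer(byte[] data) {
        this.data = data.clone();
    }

    // Every access checks the closed flag first, as the trace suggests.
    private void checkClosed() throws IOException {
        if (closed) {
            throw new IOException("Buffer already closed");
        }
    }

    // Returns the byte at pos, or -1 past the end of the data.
    int byteAt(int pos) throws IOException {
        checkClosed();
        return pos < data.length ? (data[pos] & 0xFF) : -1;
    }

    void close() {
        closed = true;
    }
}

// Hypothetical stream view that keeps its own position into the shared buffer.
final class BufferView extends InputStream {
    private final ClosableBuffer buffer;
    private int position;

    BufferView(ClosableBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public int read() throws IOException {
        int b = buffer.byteAt(position);
        if (b != -1) {
            position++;
        }
        return b;
    }
}

final class ClosedBufferDemo {
    // Closes the buffer while a view on it is still in use, then reads.
    static String readAfterClose() {
        ClosableBuffer buffer = new ClosableBuffer(new byte[] {1, 2, 3});
        InputStream view = new BufferView(buffer);
        buffer.close();
        try {
            view.read();
            return "no error";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterClose());
    }
}
```

The point of the sketch is only that the stream and the buffer have separate lifetimes, so a copy that starts after the buffer is closed fails on its first read.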

> Simplify COSStream encoding and decoding
> ----------------------------------------
>
>                 Key: PDFBOX-2893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2893
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-2893-2.patch
>
>
> Performance issues and memory usage issues surrounding streams are among the 
> few things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882, 
> PDFBOX-2883).
> Though we've managed to reduce some of the memory used by RandomAccessBuffer 
> and to take advantage of buffering of scratch files, we still have problems 
> with the amount of memory which COSStream holds onto. Changes introduced in 
> 2.0 have resulted in COSStreams having a very complex relationship with 
> classes which hold a lot of memory in complex ways (e.g. the fields: 
> tempBuffer, filteredBuffer, unfilteredBuffer, filteredStream, 
> unFilteredStream, scratchFile). Access to scratch file pages in particular 
> does not seem to be well regulated, especially with regard to multithreading 
> (an avenue we'd at least like to leave open).
> Given recent flux, I'm doubtful that we can ship the current API for 
> COSStream w.r.t. RandomAccess without shipping performance issues or flaws 
> which will be unfixable without breaking changes.
> One of the recent changes to COSStream is that it now exposes a RandomAccess 
> so that PDFStreamParser can parse content streams (as well as other 
> subclasses which handle xref and object streams). However, streams are 
> fundamentally not random access - stream filters are sequential. While the 
> consumer of a stream may wish to buffer the data (in memory or scratch) for 
> random access, COSStream itself does not need to expose such an elaborate API 
> - many pieces of gymnastics are performed inside COSStream to present this 
> illusion, at significant cost. We should remove that.
> But what about providing a RandomAccess for PDFStreamParser, 
> PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those 
> classes don't actually perform random I/O. They perform sequential I/O with a 
> buffer for peek/unread.
> We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I 
> think we should do:
> 1. Split the interfaces for sequential and random I/O
> - Introduce a new SequentialSource interface for sequential I/O, with thin 
> wrappers for RandomAccessRead and InputStream.
> - BaseParser will use SequentialSource rather than RandomAccessRead (this 
> will be inherited by PDFStreamParser, PDFObjectStreamParser, and 
> PDFXrefStreamParser).
> - COSParser will use RandomAccessRead and pass a SequentialSource wrapper to 
> its superclass, BaseParser.
> 2. Remove RandomAccess APIs from COSStream, expose only InputStream and 
> OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser 
> using a wrapper which implements SequentialSource. This will remove 
> tempBuffer, filteredBuffer, and unfilteredBuffer from COSStream, all of which 
> hold memory.
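
As a rough sketch of the proposal quoted above, a sequential interface with peek/unread and a thin wrapper over InputStream could look like the following. Only the name SequentialSource comes from the issue text. The method set, the InputStreamSource wrapper, the demo class and the pushback capacity of 64 bytes are assumptions made for illustration and may not match what ends up in PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Assumed shape of the interface: sequential reads plus peek/unread.
interface SequentialSource extends Closeable {
    // Returns the next byte, or -1 at end of input.
    int read() throws IOException;

    // Returns the next byte without consuming it, or -1 at end of input.
    int peek() throws IOException;

    // Pushes one byte back so the next read returns it again.
    void unread(int b) throws IOException;
}

// Hypothetical thin wrapper over any InputStream, built on the JDK's
// PushbackInputStream. No random access is offered.
final class InputStreamSource implements SequentialSource {
    private final PushbackInputStream in;

    InputStreamSource(InputStream input) {
        // The pushback capacity is an arbitrary choice for this sketch.
        this.in = new PushbackInputStream(input, 64);
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int peek() throws IOException {
        int b = in.read();
        if (b != -1) {
            in.unread(b);
        }
        return b;
    }

    @Override
    public void unread(int b) throws IOException {
        in.unread(b);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

final class SequentialSourceDemo {
    public static void main(String[] args) throws IOException {
        byte[] content = "BT /F1 12 Tf ET".getBytes(StandardCharsets.US_ASCII);
        try (SequentialSource source =
                new InputStreamSource(new ByteArrayInputStream(content))) {
            // Peek does not consume: the following read sees the same byte.
            System.out.println((char) source.peek());
            System.out.println((char) source.read());
        }
    }
}
```

A parser written against such an interface never seeks, so the stream behind it can be a decoded filter chain without any intermediate buffer kept for random access.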



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
