[jira] [Comment Edited] (PDFBOX-2893) Simplify COSStream encoding and decoding

John Hewson (JIRA) Fri, 14 Aug 2015 23:36:00 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698145#comment-14698145
 ]


John Hewson edited comment on PDFBOX-2893 at 8/15/15 6:35 AM:
--------------------------------------------------------------

I agree with that sentiment, Maruan, but the PDF spec doesn't have much to say 
on this matter, as it never really assigns names to those two concepts. Stream 
input and output data are referred to only in passing as "encoded" data and 
"original binary data". After all, a PDF end user is never interested in 
reading the encoded data; they want the decoded data. Working with encoded data 
is a PDFBox abstraction, rather than a PDF abstraction.


was (Author: jahewson):
I agree with that sentiment Maruan but the PDF spec doesn't have much to say on 
this matter, as it never really assigns names to those two concepts. Stream 
input and output data is referred to only in passing as "encoded" data and 
"original binary data".

> Simplify COSStream encoding and decoding
> ----------------------------------------
>
>                 Key: PDFBOX-2893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2893
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-2893-2.patch
>
>
> Performance and memory usage issues surrounding streams are among the few 
> things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882, 
> PDFBOX-2883).
> Though we've managed to reduce some of the memory used by RandomAccessBuffer 
> and to take advantage of buffering of scratch files, we still have problems 
> with the amount of memory which COSStream holds onto. Changes introduced in 
> 2.0 have resulted in COSStreams having a very complex relationship with 
> classes which hold a lot of memory in complex ways (e.g. the fields: 
> tempBuffer, filteredBuffer, unfilteredBuffer, filteredStream, 
> unFilteredStream, scratchFile). Access to scratch file pages in particular 
> does not seem to be well regulated, especially with regard to multithreading 
> (an avenue we'd at least like to leave open).
> Given recent flux, I'm doubtful that we can ship the current API for 
> COSStream w.r.t. RandomAccess without shipping performance issues or flaws 
> which will be unfixable without breaking changes.
> One of the recent changes to COSStream is that it now exposes a RandomAccess, 
> so that PDFStreamParser can parse content streams (as can the other 
> subclasses which handle xref and object streams). However, streams are 
> fundamentally not random access - stream filters are sequential. While the 
> consumer of a stream may wish to buffer the data (in memory or scratch) for 
> random access, COSStream itself does not need to expose such an elaborate API 
> - many pieces of gymnastics are performed inside COSStream to present this 
> illusion, at significant cost. We should remove that.
> But what about providing a RandomAccess for PDFStreamParser, 
> PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those 
> classes don't actually perform random I/O. They perform sequential I/O with a 
> buffer for peek/unread.
> We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I 
> think we should do:
> 1. Split the interfaces for sequential and random I/O
> - Introduce a new SequentialSource interface for sequential I/O, with thin 
> wrappers for RandomAccessRead and InputStream.
> - BaseParser will use SequentialSource rather than RandomAccessRead (this 
> will be inherited by PDFStreamParser, PDFObjectStreamParser, and 
> PDFXrefStreamParser).
> - COSParser will use RandomAccessRead and pass a SequentialSource wrapper to 
> its superclass, BaseParser.
> 2. Remove RandomAccess APIs from COSStream and expose only InputStream and 
> OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser 
> using a wrapper which implements SequentialSource. This will remove 
> tempBuffer, filteredBuffer, and unfilteredBuffer from COSStream, all of which 
> hold memory.
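
The proposal quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patch: the method set on SequentialSource, the class names InputStreamSource and SequentialSourceDemo, the readToken helper, and the 64-byte pushback buffer are all hypothetical choices made for this example. It shows how a parser that only needs sequential reads with peek/unread can be fed from a plain InputStream, without any random access API.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sequential-read interface: forward reads plus peek/unread. */
interface SequentialSource extends Closeable {
    int read() throws IOException;          // next byte, or -1 at end of data
    int peek() throws IOException;          // next byte without consuming it
    void unread(int b) throws IOException;  // push one byte back
    boolean isEOF() throws IOException;     // true once no bytes remain
}

/** Thin wrapper adapting any InputStream via a small pushback buffer. */
final class InputStreamSource implements SequentialSource {
    private final PushbackInputStream input;

    InputStreamSource(InputStream in) {
        // Buffer size is an arbitrary assumption for this sketch.
        this.input = new PushbackInputStream(in, 64);
    }

    @Override
    public int read() throws IOException {
        return input.read();
    }

    @Override
    public int peek() throws IOException {
        int b = input.read();
        if (b != -1) {
            input.unread(b); // put the byte back so the next read sees it
        }
        return b;
    }

    @Override
    public void unread(int b) throws IOException {
        input.unread(b);
    }

    @Override
    public boolean isEOF() throws IOException {
        return peek() == -1;
    }

    @Override
    public void close() throws IOException {
        input.close();
    }
}

final class SequentialSourceDemo {
    /** Reads one space-delimited token using only sequential operations. */
    static String readToken(SequentialSource src) throws IOException {
        while (src.peek() == ' ') {
            src.read(); // skip leading spaces
        }
        StringBuilder token = new StringBuilder();
        int b;
        while ((b = src.read()) != -1 && b != ' ') {
            token.append((char) b);
        }
        return token.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "BT /F1 12 Tf".getBytes(StandardCharsets.US_ASCII);
        try (SequentialSource src =
                 new InputStreamSource(new ByteArrayInputStream(content))) {
            while (!src.isEOF()) {
                System.out.println(readToken(src));
            }
        }
    }
}
```

A similar thin wrapper around RandomAccessRead would let COSParser keep random access for itself while handing its superclass only the sequential view.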



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
