[jira] [Comment Edited] (PDFBOX-2893) Simplify COSStream encoding and decoding

John Hewson (JIRA) Fri, 14 Aug 2015 23:53:09 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698146#comment-14698146 ]


John Hewson edited comment on PDFBOX-2893 at 8/15/15 6:51 AM:
--------------------------------------------------------------

Ha ha, that's how I felt about the old API. I think it's been made to look 
worse than it really is by the JavaDoc being backwards, sorry about that.

It's unfortunate that "raw" is still a somewhat ambiguous name. Encoded and 
decoded seem like such natural choices but result in an equally confusing API, 
as mentioned above. Really, a stream's data with its filters still applied is 
an odd concept, and one which is more an invention of PDFBox than a logical 
concept in PDF (i.e. it's an incidental syntactic element of a PDF file, like 
PDFDocEncoding).

Here's an idea which might work: we could remove createRawInputStream() 
entirely and replace it with createInputStream(List<String>) from PDStream 
(i.e. we will move that method into COSStream). Then to get the encoded data 
("as is") one can call createInputStream(Collections.<String>emptyList()) or 
perhaps simply createInputStream(null) or something similar.
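
To make the idea concrete, here is a toy sketch rather than PDFBox code. 
ToyStream is a hypothetical stand-in for COSStream that holds data encoded 
with a single FlateDecode filter. The sketch assumes the list names the 
filters to decode, which is what makes an empty (or null) list return the 
data "as is"; PDFBox's own method may interpret the list differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Toy stand-in for COSStream, not PDFBox code. It stores bytes that were
// encoded with a single filter, FlateDecode.
final class ToyStream {
    private final byte[] encoded;

    ToyStream(byte[] plain) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the DeflaterOutputStream finishes the compressed data.
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(plain);
        }
        this.encoded = buffer.toByteArray();
    }

    // ASSUMPTION: the list names the filters to decode. An empty or null
    // list therefore decodes nothing and returns the stored bytes as is.
    InputStream createInputStream(List<String> filtersToDecode) {
        InputStream in = new ByteArrayInputStream(encoded);
        if (filtersToDecode != null && filtersToDecode.contains("FlateDecode")) {
            in = new InflaterInputStream(in);
        }
        return in;
    }
}
```

With this shape a single method covers both cases, so no separate "raw" 
accessor is needed: the caller states which filters to undo, and stating 
none yields the encoded bytes.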


> Simplify COSStream encoding and decoding
> ----------------------------------------
>
>                 Key: PDFBOX-2893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2893
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-2893-2.patch
>
>
> Performance and memory usage issues surrounding streams are among the few 
> things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882, 
> PDFBOX-2883).
> Though we've managed to reduce some of the memory used by RandomAccessBuffer 
> and to take advantage of buffering of scratch files, we still have problems 
> with the amount of memory which COSStream holds onto. Changes introduced in 
> 2.0 have resulted in COSStreams having a very complex relationship with 
> classes which hold a lot of memory in complex ways (e.g. the fields: 
> tempBuffer, filteredBuffer, unfilteredBuffer, filteredStream, 
> unFilteredStream, scratchFile). Access to scratch file pages in particular 
> does not seem to be well regulated, especially with regard to multithreading 
> (an avenue we'd at least like to leave open).
> Given recent flux, I'm doubtful that we can ship the current API for 
> COSStream w.r.t. RandomAccess without shipping performance issues or flaws 
> which will be unfixable without breaking changes.
> One of the recent changes to COSStream is that it now exposes a RandomAccess 
> so that PDFStreamParser can parse content streams (as well as other 
> subclasses which handle xref and object streams). However, streams are 
> fundamentally not random access - stream filters are sequential. While the 
> consumer of a stream may wish to buffer the data (in memory or scratch) for 
> random access, COSStream itself does not need to expose such an elaborate API 
> - many pieces of gymnastics are performed inside COSStream to present this 
> illusion, at significant cost. We should remove that.
> But what about providing a RandomAccess for PDFStreamParser, 
> PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those 
> classes don't actually perform random I/O. They perform sequential I/O with a 
> buffer for peek/unread.
> We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I 
> think we should do:
> 1. Split the interfaces for sequential and random I/O
> - Introduce a new SequentialSource interface for sequential I/O, with thin 
> wrappers for RandomAccessRead and InputStream.
> - BaseParser will use SequentialSource rather than RandomAccessRead (this 
> will be inherited by PDFStreamParser, PDFObjectStreamParser, and 
> PDFXrefStreamParser).
> - COSParser will use RandomAccessRead and pass a SequentialSource wrapper to 
> its superclass, BaseParser.
> 2. Remove RandomAccess APIs from COSStream, expose only InputStream and 
> OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser 
> using a wrapper which implements SequentialSource. This will remove 
> tempBuffer, filteredBuffer, and unfilteredBuffer from COSStream, all of which 
> hold memory.
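
The SequentialSource proposal in the quoted description could look roughly 
like the sketch below. The issue names the interface but does not list its 
methods, so the three methods here (read, peek, unread) and the 
InputStreamSource wrapper name are assumptions chosen to match the 
"sequential I/O with a buffer for peek/unread" that the parsers are said to 
need.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical shape for the proposed interface; the issue names it but
// does not list its methods.
interface SequentialSource extends Closeable {
    int read() throws IOException;          // next byte, or -1 at end of data
    int peek() throws IOException;          // next byte without consuming it, or -1
    void unread(int b) throws IOException;  // push one byte back
}

// Thin wrapper over an InputStream; no random access is needed.
final class InputStreamSource implements SequentialSource {
    private final PushbackInputStream in;

    InputStreamSource(InputStream input) {
        // Pushback capacity is an arbitrary choice for this sketch.
        this.in = new PushbackInputStream(input, 1024);
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int peek() throws IOException {
        int b = in.read();
        if (b != -1) {
            in.unread(b);  // put it back so the next read() sees it again
        }
        return b;
    }

    @Override
    public void unread(int b) throws IOException {
        in.unread(b);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

A parser written against this interface only ever moves forward with a small 
pushback buffer, so a stream could hand it a plain decoding InputStream 
without keeping a fully buffered copy for random access.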



