[ https://issues.apache.org/jira/browse/PDFBOX-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934120#comment-14934120 ]
John Hewson edited comment on PDFBOX-2893 at 9/28/15 9:50 PM:
--------------------------------------------------------------

We already discussed those names above, they don't work. I have literally no idea what they do, does createEncodedOutputStream() read the encoded data or does it decode the encoded data? If I want to read _encoded_ data I have to use createDecodedInputStream() instead of createEncodedInputStream(). It's awful, the names are fundamentally ambiguous.

was (Author: jahewson):
    We already discussed those names above, they don't work. I have literally no idea what they do, does createEncodedOutputStream() read the encoded data or does it decode the encoded data? If I want to read _encoded_ data I have to use createDecodedInputStream() instead of createEncodedInputStream(). It's awful.

> Simplify COSStream encoding and decoding
> ----------------------------------------
>
>                 Key: PDFBOX-2893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2893
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-2893-2.patch, PDImage-getStream.patch
>
>
> Performance issues and memory usage issues surrounding streams are one of
> the few things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882,
> PDFBOX-2883).
>
> Though we've managed to reduce some of the memory used by RandomAccessBuffer
> and to take advantage of buffering of scratch files, we still have problems
> with the amount of memory which COSStream holds onto. Changes introduced in
> 2.0 have resulted in COSStreams having a very complex relationship with
> classes which hold a lot of memory in complex ways (e.g. the fields:
> tempBuffer, filteredBuffer, unfilteredBuffer, filteredStream,
> unFilteredStream, scratchFile).
> Access to scratch file pages in particular does not seem to be well
> regulated, especially with regard to multithreading (an avenue we'd at
> least like to leave open).
>
> Given recent flux, I'm doubtful that we can ship the current API for
> COSStream w.r.t. RandomAccess without shipping performance issues or flaws
> which will be unfixable without breaking changes.
>
> One of the recent changes to COSStream is that it now exposes a
> RandomAccess; this is so that PDFStreamParser can parse content streams (as
> well as other subclasses which handle xref and object streams). However,
> streams are fundamentally not random access - stream filters are
> sequential. While the consumer of a stream may wish to buffer the data (in
> memory or scratch) for random access, COSStream itself does not need to
> expose such an elaborate API - many pieces of gymnastics are performed
> inside COSStream to present this illusion, at significant cost. We should
> remove that.
>
> But what about providing a RandomAccess for PDFStreamParser,
> PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those
> classes don't actually perform random I/O. They perform sequential I/O with
> a buffer for peek/unread.
>
> We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I
> think we should do:
>
> 1. Split the interfaces for sequential and random I/O
> - Introduce a new SequentialSource interface for sequential I/O, with thin
> wrappers for RandomAccessRead and InputStream.
> - BaseParser will use SequentialSource rather than RandomAccessRead (this
> will be inherited by PDFStreamParser, PDFObjectStreamParser, and
> PDFXrefStreamParser).
> - COSParser will use RandomAccessRead and pass a SequentialSource wrapper
> to its superclass, BaseParser.
>
> 2. Remove RandomAccess APIs from COSStream, expose only InputStream and
> OutputStream, as we used to do. We can pass an InputStream to
> PDFStreamParser using a wrapper which implements SequentialSource.
> This will remove tempBuffer, filteredBuffer, and unfilteredBuffer from
> COSStream, all of which hold memory.
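To make the proposal in step 1 concrete, here is a minimal Java sketch of what a sequential-only source with peek/unread might look like when wrapped around a plain InputStream. The name SequentialSource comes from the description above; the method names, the InputStreamSource class, and the 64-byte pushback size are assumptions made for illustration and are not taken from the attached patches or from the PDFBox API.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the method set below is an assumption, not the PDFBox API.
interface SequentialSource extends Closeable {
    int read() throws IOException;          // next byte, or -1 at end of data
    int peek() throws IOException;          // next byte without consuming it
    void unread(int b) throws IOException;  // push one byte back
    long getPosition();                     // bytes consumed so far
}

// Thin wrapper over an InputStream; the only buffering is a small pushback buffer.
final class InputStreamSource implements SequentialSource {
    private final PushbackInputStream input;
    private long position;

    InputStreamSource(InputStream in) {
        // 64 bytes of pushback is an arbitrary size chosen for this sketch.
        this.input = new PushbackInputStream(in, 64);
    }

    @Override
    public int read() throws IOException {
        int b = input.read();
        if (b != -1) {
            position++;
        }
        return b;
    }

    @Override
    public int peek() throws IOException {
        int b = input.read();
        if (b != -1) {
            input.unread(b); // put it back so the next read() sees it again
        }
        return b;
    }

    @Override
    public void unread(int b) throws IOException {
        input.unread(b);
        position--;
    }

    @Override
    public long getPosition() {
        return position;
    }

    @Override
    public void close() throws IOException {
        input.close();
    }
}

class SequentialSourceDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = "BT ET".getBytes(StandardCharsets.US_ASCII);
        try (SequentialSource src = new InputStreamSource(new ByteArrayInputStream(data))) {
            int peeked = src.peek();  // looks at 'B' without consuming it
            int first = src.read();   // consumes 'B'
            src.unread(first);        // pushes 'B' back
            int again = src.read();   // consumes 'B' a second time
            System.out.println((char) peeked + " " + (char) first + " " + (char) again
                    + " pos=" + src.getPosition());
        }
    }
}
```

A parser written against an interface of this shape never needs seek(), which is why COSStream could go back to exposing only InputStream and OutputStream. A second thin wrapper around RandomAccessRead would let COSParser hand the same interface to BaseParser.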