[ https://issues.apache.org/jira/browse/PDFBOX-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934090#comment-14934090 ]
ASF subversion and git services commented on PDFBOX-2893:
---------------------------------------------------------

Commit 1705782 from [~jahewson] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1705782 ]

PDFBOX-2893: Remove getStream() from PDImage

> Simplify COSStream encoding and decoding
> ----------------------------------------
>
>                 Key: PDFBOX-2893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2893
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-2893-2.patch, PDImage-getStream.patch
>
>
> Performance and memory usage issues surrounding streams are among the
> few things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882,
> PDFBOX-2883).
> Though we've managed to reduce some of the memory used by RandomAccessBuffer
> and to take advantage of buffering of scratch files, we still have problems
> with the amount of memory which COSStream holds onto. Changes introduced in
> 2.0 have resulted in COSStreams having a very complex relationship with
> classes which hold a lot of memory in complex ways (e.g. the fields:
> tempBuffer, filteredBuffer, unfilteredBuffer, filteredStream,
> unFilteredStream, scratchFile). Access to scratch file pages in particular
> does not seem to be well regulated, especially with regard to multithreading
> (an avenue we'd at least like to leave open).
> Given recent flux, I'm doubtful that we can ship the current API for
> COSStream w.r.t. RandomAccess without shipping performance issues or flaws
> which will be unfixable without breaking changes.
> One of the recent changes to COSStream is that it now exposes a RandomAccess;
> this is so that PDFStreamParser can parse content streams (as well as other
> subclasses which handle xref and object streams). However, streams are
> fundamentally not random access - stream filters are sequential.
> While the consumer of a stream may wish to buffer the data (in memory or
> scratch) for random access, COSStream itself does not need to expose such an
> elaborate API - many pieces of gymnastics are performed inside COSStream to
> present this illusion, at significant cost. We should remove that.
> But what about providing a RandomAccess for PDFStreamParser,
> PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those
> classes don't actually perform random I/O. They perform sequential I/O with a
> buffer for peek/unread.
> We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I
> think we should do:
> 1. Split the interfaces for sequential and random I/O
> - Introduce a new SequentialSource interface for sequential I/O, with thin
> wrappers for RandomAccessRead and InputStream.
> - BaseParser will use SequentialSource rather than RandomAccessRead (this
> will be inherited by PDFStreamParser, PDFObjectStreamParser, and
> PDFXrefStreamParser).
> - COSParser will use RandomAccessRead and pass a SequentialSource wrapper to
> its superclass, BaseParser.
> 2. Remove RandomAccess APIs from COSStream and expose only InputStream and
> OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser
> using a wrapper which implements SequentialSource. This will remove
> tempBuffer, filteredBuffer, and unfilteredBuffer from COSStream, all of which
> hold memory.
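
For illustration only, here is a minimal sketch of what a sequential source with peek/unread and a thin InputStream wrapper might look like. It is not the code from the attached patches: the method set (read, peek, unread), the InputStreamSource class name, the 64-byte pushback buffer, and the demo helpers are all assumptions made for this example. Only the name SequentialSource and the idea of wrapping an InputStream come from the proposal above.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Demo driver: tokenizes input using only sequential reads plus peek. */
final class SequentialSourceDemo {
    /** Reads one whitespace-delimited token, leaving the delimiter unconsumed. */
    static String readToken(SequentialSource src) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = src.peek()) != -1 && !Character.isWhitespace(b)) {
            sb.append((char) src.read());
        }
        return sb.toString();
    }

    /** Convenience helper (hypothetical): first token of an ASCII string. */
    static String firstToken(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        try (SequentialSource src = new InputStreamSource(new ByteArrayInputStream(bytes))) {
            return readToken(src);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstToken("BT /F1 12 Tf"));
    }
}

/** Hypothetical sequential-only read interface; the method names are assumptions. */
interface SequentialSource extends Closeable {
    /** Reads the next byte, or returns -1 at end of stream. */
    int read() throws IOException;

    /** Returns the next byte without consuming it, or -1 at end of stream. */
    int peek() throws IOException;

    /** Pushes one byte back so that the next read() returns it. */
    void unread(int b) throws IOException;
}

/** Thin wrapper adapting a plain InputStream via a small pushback buffer. */
final class InputStreamSource implements SequentialSource {
    private final PushbackInputStream in;

    InputStreamSource(InputStream input) {
        // The buffer size is an arbitrary choice for this sketch.
        this.in = new PushbackInputStream(input, 64);
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int peek() throws IOException {
        int b = in.read();
        if (b != -1) {
            in.unread(b); // put it back so the caller can still read it
        }
        return b;
    }

    @Override
    public void unread(int b) throws IOException {
        in.unread(b);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

The point of the sketch is that a parser written against such an interface never seeks: it needs only forward reads and a small lookahead buffer, so the stream behind it does not have to be held in memory or in a scratch file to simulate random access.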