[
https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139369#comment-14139369
]
John Hewson commented on PDFBOX-2301:
-------------------------------------
In response to Andreas' third problem:
{quote}
find a way to release resources during processing a pdf, e.g. after each page
{quote}
Currently we cache the page resources in PDResources, which belongs to a
specific PDPage. This causes two problems: 1) users who hold many
PDPage objects in memory will see high memory use (though this often happens
by accident*); 2) by caching resources in PDPage we only get to keep that cache
for the lifetime of the page, which e.g. in PDFRenderer is a single page only.
That means that a font which appears on 40 pages has to be parsed 40 times,
which causes slow running times and also memory thrashing, as objects are
destroyed frequently only to be re-created.
What PDFRenderer really needs is not page-wide caching but document-wide
caching, so that it can cache fonts, cmaps, color profiles, etc. only once. But
that won't work for images, because they're too large. What we're beginning to
realise is that caching is use-case specific and probably shouldn't be built
into PDFBox's pdmodel. Instead we should remove resource caching from
PDPage/PDResources and implement custom caching in PDFRenderer and other
downstream classes such as PDFTextStripper. I'll happily volunteer myself. The
existing high-level PDFBox APIs will continue to "just work" and power users
will get a level of control that they appreciate.
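A minimal sketch of the document-wide caching idea, assuming the renderer owns one cache for the whole document and keys it by object reference. The names DocumentResourceCache, get and loadCount are hypothetical and not PDFBox API; the "font" values are plain strings standing in for parsed fonts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical document-scoped cache that a renderer could own.
// Nothing here is PDFBox API; it only illustrates the caching scope.
final class DocumentResourceCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private int loads = 0;

    // Returns the cached value, invoking the loader only on first use of a key.
    V get(K key, Function<K, V> loader) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            cache.put(key, value);
        }
        return value;
    }

    // Number of times a loader actually ran (i.e. real parses).
    int loadCount() {
        return loads;
    }
}

final class FontCacheDemo {
    public static void main(String[] args) {
        DocumentResourceCache<String, String> fonts = new DocumentResourceCache<>();
        // Simulate 40 pages that all reference the same font object.
        for (int page = 0; page < 40; page++) {
            fonts.get("12 0 R", ref -> "parsed font " + ref);
        }
        System.out.println("font parsed " + fonts.loadCount() + " time(s)");
    }
}
```

Large resources such as images would simply bypass a cache like this and be loaded on demand, which is the use-case-specific choice described above.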
This strategy could be enhanced by removing memory-hungry methods on
PDResources such as getFonts() and getXObjects() which force all resources of a
particular type to be loaded, whether or not they are needed, or actually used
in the content stream. They would be replaced by methods to retrieve a single
resource, e.g. getFont(name).
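To illustrate the difference between loading every resource of a type and retrieving a single one by name, here is a rough sketch. LazyFontResources, register and parsedCount are hypothetical names used for illustration only; this is not the PDResources class, and strings stand in for parsed fonts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical model of a resource dictionary with per-name lookup.
// It is not the PDFBox PDResources class.
final class LazyFontResources {
    private final Map<String, Supplier<String>> parsers = new LinkedHashMap<>();
    private int parsed = 0;

    // Registers how to parse a font, without parsing it yet.
    void register(String name, Supplier<String> parser) {
        parsers.put(name, parser);
    }

    // Parses only the requested font; returns null if the name is unknown.
    String getFont(String name) {
        Supplier<String> parser = parsers.get(name);
        if (parser == null) {
            return null;
        }
        parsed++;
        return parser.get();
    }

    // Number of fonts that were actually parsed.
    int parsedCount() {
        return parsed;
    }
}

final class LazyFontDemo {
    public static void main(String[] args) {
        LazyFontResources resources = new LazyFontResources();
        resources.register("F1", () -> "font F1");
        resources.register("F2", () -> "font F2");
        // The content stream only uses F1, so F2 is never parsed.
        System.out.println(resources.getFont("F1"));
        System.out.println("parsed: " + resources.parsedCount());
    }
}
```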
---
\* There probably isn't a legitimate use case for 1) any more: we've solved the
issues which we used to have with image caching (in fact, the clearCache()
method no longer needs to be called by PDFRenderer, though it
currently is). The real problem is that it's easy to accidentally retain PDPage
objects. The PDDocument#getDocumentCatalog().getAllPages() method is dangerous,
as looping over it will cause pages to be retained during processing, like so:
{code}
for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
{
    // ... this is idiomatic in PDFBox 1.8
}
// List returned by getAllPages() kept in scope until here (bad)
{code}
I added a couple of methods a while ago to avoid this by fetching each PDPage
one at a time, and these are now used internally in PDFBox to avoid the memory
problems we used to have:
{code}
for (int i = 0; i < document.getNumberOfPages(); i++)
{
    PDPage page = document.getPage(i);
    // ... this is the new 2.0 way
    // current page falls out of scope here (good)
}
{code}
To solve this problem, we could change getAllPages() so that instead of
returning a List it returns an Iterator<PDPage>, which would provide a nicer
API than getPage(int) and most existing code will continue to work. This is
also an opportunity to fix type safety issues due to PDPageNode and
incorrect handling of the page tree (this is similar to the issue we had with
the acroform field tree).
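As a rough sketch of the lazy page iteration idea: the class below hands out one page per next() call and keeps no reference to pages it has already returned. LazyPages is a hypothetical name, and the sketch assumes an Iterable wrapper rather than a bare Iterator, since Java's enhanced for loop requires an Iterable. Strings stand in for PDPage objects.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Hypothetical lazy view over the pages of a document. Each page is fetched
// by index on demand, so no list of all pages is ever held.
final class LazyPages<P> implements Iterable<P> {
    private final int count;
    private final IntFunction<P> pageAt;

    LazyPages(int count, IntFunction<P> pageAt) {
        this.count = count;
        this.pageAt = pageAt;
    }

    @Override
    public Iterator<P> iterator() {
        return new Iterator<P>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < count;
            }

            @Override
            public P next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Fetch the page now; the iterator does not retain it.
                return pageAt.apply(next++);
            }
        };
    }
}

final class LazyPagesDemo {
    public static void main(String[] args) {
        for (String page : new LazyPages<String>(3, i -> "page " + i)) {
            System.out.println(page);
        }
    }
}
```

With the methods shown earlier, such a view could presumably be built from document.getNumberOfPages() and document.getPage(int), so that existing for-each loops keep their shape.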
> RandomAccessBuffer consumes too much memory.
> --------------------------------------------
>
> Key: PDFBOX-2301
> URL: https://issues.apache.org/jira/browse/PDFBOX-2301
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 1.8.6, 2.0.0
> Reporter: gee
> Assignee: Andreas Lehmkühler
> Fix For: 2.0.0
>
> Attachments: clone.diff, clone2.diff, clone3.diff
>
>
> RandomAccessBuffer holds the uncompressed image during operation, because
> that is exactly what PDFBox's ExtractImages does.
> But holding the uncompressed image in memory instead of the compressed one
> consumes too much memory, and the same goes for the many PDF XObjects that
> can use a filter to compress themselves. It would be good if PDFBox provided
> an option to revert to the COSObject state from just before the RandomAccess
> object was created (the state in which the PDF XObject stream has been parsed
> but the COSDictionary objects have not yet been created, because the user has
> not requested them via a get____() method). This is a crucial feature for
> letting PDFBox analyze huge PDF files (>100MB).
> In the current source, one must close a COSStream unless it is still required
> (and I know a closed stream cannot be reopened).
> Class Name                                                                          | Shallow Heap | Retained Heap
> ------------------------------------------------------------------------------------------------------------------
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                                         |           24 |     8,187,264
> |- <class> class org.apache.pdfbox.cos.COSObject @ 0x58c4020                        |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080                    |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0                           |           32 |     8,187,216
> | |- <class> class org.apache.pdfbox.cos.COSStream @ 0x58c3e00                      |            8 |             8
> | |- items java.util.LinkedHashMap @ 0x5b2a0f0                                      |           56 |           552
> | |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128                       |           48 |     8,186,528
> | | |- <class> class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00            |            8 |             8
> | | |- currentBuffer byte[16384] @ 0x590f360                                        |       16,400 |        16,400
> | | |- bufferList java.util.ArrayList @ 0x5b2e200                                   |           24 |     8,170,080
> | | '- Total: 3 entries                                                             |              |
> | |- filteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0x5b2a158   |           32 |            32
> | |- decodeResult org.apache.pdfbox.filter.DecodeResult @ 0xa65f618                 |           16 |            16
> | |- unFilteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0xa71ab18 |           32 |            32
> | '- Total: 6 entries                                                               |              |
> |- objectNumber org.apache.pdfbox.cos.COSInteger @ 0x5b25ec0                        |           24 |            24
> ------------------------------------------------------------------------------------------------------------------
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)