[
https://issues.apache.org/jira/browse/PDFBOX-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710785#comment-16710785
]
Ben Manes commented on PDFBOX-4396:
-----------------------------------
The memory problems have now shifted to object finalization. There are 4 GB of
java.lang.ref.Finalizer objects queued up; the GC cannot keep up and the JVM
crashes. The referent is ScratchFileBuffer, which calls close() on finalization
and involves I/O. It is not closed by PDDocument because, as an internal comment
indicates, COSStream creates new buffers without closing the old ones. As a
result, an application developer has no way to influence this.
It looks like my next step will be to disable caching, reduce the reuse of
PDDocument instances, and trigger GC manually. I don't think there is much more
that I can do. The library seems to make assumptions based on small and
medium-sized documents, and at this scale those assumptions abuse the garbage
collector; for a large document it breaks down badly. Do you see any fixes that
I can make to help alleviate the problem?
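
For illustration, the "reduce the reuse of PDDocument instances" step could be
sketched roughly as below. This is a sketch under assumptions, not PDFBox code:
PageSource is a hypothetical stand-in for a document handle, and
BatchedRenderer, renderAll and batchSize are names invented here. The idea is
to reopen the document every few pages so that whatever state it accumulates
becomes unreachable at each batch boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a document handle; not a PDFBox type. */
interface PageSource extends AutoCloseable {
    int pageCount();
    String render(int pageIndex); // a real renderer would return an image
    @Override
    void close();                 // narrowed so callers need no checked catch
}

final class BatchedRenderer {
    /**
     * Renders all pages, opening a fresh source for every batch of pages.
     * State held by a source (caches, scratch buffers) is released when the
     * batch ends instead of accumulating over the whole run.
     */
    static List<String> renderAll(Supplier<PageSource> opener, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> rendered = new ArrayList<>();
        int next = 0;
        int total;
        do {
            // try-with-resources closes the source even if rendering throws.
            try (PageSource source = opener.get()) {
                total = source.pageCount();
                int end = Math.min(next + batchSize, total);
                for (; next < end; next++) {
                    rendered.add(source.render(next));
                }
            }
            // A System.gc() call could go here; it is only a hint to the JVM.
        } while (next < total);
        return rendered;
    }
}
```

The batch size is a trade-off: smaller batches bound memory more tightly but
pay the document load cost more often, which matters if loading is slow.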
> Memory leak due to soft reference caching
> -----------------------------------------
>
> Key: PDFBOX-4396
> URL: https://issues.apache.org/jira/browse/PDFBOX-4396
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.12
> Environment: JDK10; G1
> Reporter: Ben Manes
> Priority: Major
> Attachments: #2 - memory leak 2.png, #2 - memory leak.png, memory
> leak 2.png, memory leak.png
>
>
> In a heap dump, it appears that DefaultResourceCache is retaining 5.3 GB of
> memory due to buffered images (via PDImageXObject). I suspect that G1 is not
> collecting soft references across all regions before it throws an
> OutOfMemoryError.
> In PDFBOX-4389, I discovered very slow PDDocument#load times due to a JDK10
> I/O bug. Previously I was loading the document to render each page, but this
> took 1.5 minutes. To work around that bug I reused the document instance
> across pages. This seems to have failed because the pages were cached and not
> cleared by the GC.
> The DefaultResourceCache does not prune its cache entries when the soft
> references are collected. Like WeakHashMap, it should use a ReferenceQueue,
> poll it on every access, and prune accordingly.
> Thankfully PDDocument#setResourceCache exists. For now I am going to reset
> the cache to a new instance after a page has been rendered. The entries
> should no longer be reachable and be GC'd more aggressively. If that doesn't
> work, I'll either replace the cache (e.g. with Caffeine) or disable it by
> setting the instance to null.
> I think the desired fix is to prune the DefaultResourceCache and, ideally,
> reconsider usage of soft references (as they tend to be poor in practice).
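
The pruning described in the issue (register each soft reference with a
ReferenceQueue, poll the queue on every access, and drop the stale entries, as
WeakHashMap does for its keys) could look roughly like the sketch below. It
uses only java.lang.ref and java.util. PruningSoftCache and its Entry class are
hypothetical names for illustration; this is not the DefaultResourceCache
implementation, and it is not thread-safe.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Illustrative soft-value cache that removes entries once the GC clears them. */
final class PruningSoftCache<K, V> {

    /** Soft reference that remembers its key so the map entry can be found again. */
    private static final class Entry<K, V> extends SoftReference<V> {
        final K key;

        Entry(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, Entry<K, V>> map = new HashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    public V get(K key) {
        expungeStale();
        Entry<K, V> entry = map.get(key);
        // The value may have been cleared but not yet enqueued; get() is then null.
        return entry == null ? null : entry.get();
    }

    public void put(K key, V value) {
        expungeStale();
        map.put(key, new Entry<>(key, value, queue));
    }

    public int size() {
        expungeStale();
        return map.size();
    }

    /** Drains the queue and removes the map entries of cleared references. */
    private void expungeStale() {
        Reference<? extends V> ref;
        while ((ref = queue.poll()) != null) {
            @SuppressWarnings("unchecked")
            Entry<K, V> stale = (Entry<K, V>) ref;
            // Two-argument remove: only drop the mapping if it still points at
            // this cleared reference, so a newer value for the same key survives.
            map.remove(stale.key, stale);
        }
    }
}
```

Without the queue, the map would keep one cleared SoftReference per key
indefinitely. Pruning fixes that, but it does not change when the GC decides to
clear soft references, so a size-bounded cache with strong references remains a
separate option.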