[ https://issues.apache.org/jira/browse/PDFBOX-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711955#comment-16711955 ]
Ben Manes commented on PDFBOX-4396:
-----------------------------------
The process completed for one of the large uploads, but I had to disable the
others because they were taking too long (hours). The CPU overhead on the
machine caused bad user-facing latencies, since the scheduler doesn't take CPU
into account and those jobs were being delayed. Our use case has expanded from
the expected 5-10 page documents to many thousands of pages (monthly
historicals), so I think it's no longer a good fit to do the work in a single
process shared with other user-facing work. I think my next step should be to
migrate this use case to a lambda, distribute page ranges, and invoke in
parallel. That could easily be distributed using pdfbox and work great, but
it's probably easier / faster / cheaper to use ghostscript for such a simple
lambda task.
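
To make the page-range idea concrete, here is a minimal sketch of the
splitting step only. The class and method names (PageRanges, split) and the
chunk size are hypothetical, and the actual lambda invocation and rendering
are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
final class PageRanges {
    private PageRanges() {}

    /**
     * Splits pages 1..pageCount into consecutive chunks of at most
     * pagesPerChunk pages. Each element is {firstPage, lastPage}, 1-based
     * and inclusive, so it can be handed to one worker invocation.
     */
    static List<int[]> split(int pageCount, int pagesPerChunk) {
        if (pageCount < 0 || pagesPerChunk <= 0) {
            throw new IllegalArgumentException("invalid page count or chunk size");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int first = 1; first <= pageCount; first += pagesPerChunk) {
            int last = Math.min(first + pagesPerChunk - 1, pageCount);
            ranges.add(new int[] {first, last});
        }
        return ranges;
    }
}
```

Each range would then be sent to its own worker, which renders only those
pages.
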
The documents are not encrypted, so I think that case may not apply. In my code
I often pass around a Guava Closer to accumulate resources across methods, and
then ensure all are closed if they were not closed already. If everything is
associated with a document, it would make sense for a closer to be propagated
from it so that it can close all of the resources (if not closed already).
That could be a custom utility, etc. of course rather than Guava's.
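
A custom utility along those lines can be quite small. The following is a
minimal sketch using only the JDK; ResourceCloser is a hypothetical name, it
is not Guava's Closer and not an existing PDFBox class, and it only
illustrates the accumulate-then-close pattern.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration; not part of PDFBox or Guava.
final class ResourceCloser implements Closeable {
    private final Deque<Closeable> resources = new ArrayDeque<>();

    /** Registers a resource and returns it so the call can wrap an allocation. */
    <C extends Closeable> C register(C resource) {
        if (resource != null) {
            resources.push(resource);
        }
        return resource;
    }

    /**
     * Closes every registered resource, most recently registered first.
     * The first IOException is rethrown after all resources were attempted;
     * later IOExceptions are attached to it as suppressed exceptions.
     */
    @Override
    public void close() throws IOException {
        IOException failure = null;
        while (!resources.isEmpty()) {
            try {
                resources.pop().close();
            } catch (IOException e) {
                if (failure == null) {
                    failure = e;
                } else {
                    failure.addSuppressed(e);
                }
            }
        }
        if (failure != null) {
            throw failure;
        }
    }
}
```

A document could own one such closer, hand it to the methods that open
streams, and close it once when the document itself is closed. Calling close
a second time is a no-op because the deque is already empty.
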
You might also consider using weak / phantom references instead of
finalization. For my application's file I/O (local and s3), I give clients a
session with their own tempdir and reference count downloaded files against a
global cache. The session handles are proxies that clients should close, but
held in a weak keyed cache where the actual implementation is the value. Then
when the proxy is collected, the strong-ref value is explicitly closed. This
acts as a safety net just in case, since we do a lot of I/O and this form of
reference caching is cheap. The same can be done better with phantom
references, but that takes more work than spinning up a weak cache with a
removal listener. From reading the code, it looks like a lot of effort was made to
close resources but it also got really complex with patches for the inevitable
leaks. Of course, you might not be able to change much due to API compatibility
needs.
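
For the phantom reference variant, one option on JDK 9+ is
java.lang.ref.Cleaner, which is built on phantom references. The sketch below
is an assumption about how such a safety net could look, not how my
application or PDFBox does it; SessionHandle and Resources are hypothetical
names, and the counter stands in for real cleanup work.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical handle for illustration: clients should close it, and the
// Cleaner releases the resources if the handle is collected without a close.
final class SessionHandle implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    /**
     * Holds what must be released. It must not reference the handle, or the
     * handle could never become phantom reachable.
     */
    static final class Resources implements Runnable {
        final AtomicInteger releases = new AtomicInteger();

        @Override
        public void run() {
            // Real code would delete temp files, decrement ref counts, etc.
            releases.incrementAndGet();
        }
    }

    private final Cleaner.Cleanable cleanable;

    SessionHandle(Resources resources) {
        this.cleanable = CLEANER.register(this, resources);
    }

    /** Explicit close; the registered action runs at most once overall. */
    @Override
    public void close() {
        cleanable.clean();
    }
}
```

The explicit close stays the normal path. The Cleaner only covers handles
that were dropped without being closed, and when that happens depends on the
garbage collector.
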
I think at this point I'll close this, like the other, as not something
trivially fixable. I do think better resource handling is warranted, but that
requires a thoughtful refactor.
> Memory leak due to soft reference caching
> -----------------------------------------
>
> Key: PDFBOX-4396
> URL: https://issues.apache.org/jira/browse/PDFBOX-4396
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.12
> Environment: JDK10; G1
> Reporter: Ben Manes
> Priority: Major
> Attachments: #2 - memory leak 2.png, #2 - memory leak.png, memory
> leak 2.png, memory leak.png
>
>
> In a heap dump, it appears that DefaultResourceCache is retaining 5.3 GB of
> memory due to buffered images (via PDImageXObject). I suspect that G1 is not
> collecting soft references across all regions before it throws out-of-memory errors.
> In PDFBOX-4389, I discovered very slow PDDocument#load times due to a JDK10
> I/O bug. Previously I was loading the document to render each page, but this
> took 1.5 minutes. To work around that bug I reused the document instance
> across pages. This seems to have failed because the pages were cached and not
> cleared by the GC.
> The DefaultResourceCache does not prune its cache entries when the soft
> references are collected. Like WeakHashMap, it should use a ReferenceQueue,
> poll it on every access, and prune accordingly.
> Thankfully PDDocument#setResourceCache exists. For now I am going to reset
> the cache to a new instance after a page has been rendered. The entries
> should no longer be reachable and be GC'd more aggressively. If that doesn't
> work, I'll either replace the cache (e.g. with Caffeine) or disable it by
> setting the instance to null.
> I think the desired fix is to prune the DefaultResourceCache and, ideally,
> reconsider usage of soft references (as they tend to be poor in practice).
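
The ReferenceQueue pruning suggested in the description could look roughly
like the sketch below. It is an illustration in the style of WeakHashMap, not
the actual DefaultResourceCache code; PruningSoftCache is a hypothetical
class, and it is not thread-safe.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache for illustration; not part of PDFBox. Not thread-safe.
final class PruningSoftCache<K, V> {
    /** Soft reference that remembers its key so the map entry can be removed. */
    private static final class Entry<K, V> extends SoftReference<V> {
        final K key;

        Entry(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, Entry<K, V>> map = new HashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    V get(K key) {
        prune();
        Entry<K, V> entry = map.get(key);
        return entry == null ? null : entry.get();
    }

    void put(K key, V value) {
        prune();
        map.put(key, new Entry<>(key, value, queue));
    }

    int size() {
        prune();
        return map.size();
    }

    /** Drops map entries whose soft references the GC has cleared and enqueued. */
    private void prune() {
        Reference<? extends V> ref;
        while ((ref = queue.poll()) != null) {
            @SuppressWarnings("unchecked")
            Entry<K, V> stale = (Entry<K, V>) ref;
            // Remove only if the key still maps to this exact reference,
            // so a newer value stored under the same key is kept.
            map.remove(stale.key, stale);
        }
    }
}
```

With this shape the map no longer accumulates entries for cleared
references, although the values themselves are still only released when the
collector decides to clear the soft references.
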