[ https://issues.apache.org/jira/browse/PDFBOX-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710391#comment-16710391 ]
Ben Manes commented on PDFBOX-4396:
-----------------------------------

Yes, I agree caching makes sense in general. My case is extreme due to N thousand page documents from scanned paperwork, taking 3-13 seconds per page for PdfBox to render into an image. While I'd appreciate better performance, that's only if it retains stability.

I agree weak references are not a fit here and did not intend to imply otherwise. My point is that the cache held 430k HashMap.Entry objects where many might have null values. This can be pruned by using a ReferenceQueue, something like the code at the end of this comment.

Soft references are problematic and typically chosen because a developer doesn't know a good size. Instead of a strict limit, the decision is left to the JVM. The references are in a global cache, so an inexpensive cache might cause a critical one to be flushed. The collection behavior is GC specific and the penalty is placed in the critical section of the pause time. Many collectors are not aggressive, which increases hit rates, but the resulting memory pressure causes full GCs at short intervals. A collector that is aggressive makes the cache ineffective.

If there is a way to estimate the size, then a bounded cache is preferable. This avoids the above problems, with the potential for higher hit rates, as LRU can easily be polluted. See for example [Caffeine's hit rates|https://github.com/ben-manes/caffeine/wiki/Efficiency], which take frequency into account, or our new [research paper|https://drive.google.com/file/d/1CT2ASkfuG9qVya9Sn8ZUCZjrFSSyjRA_/view?usp=sharing] for an adaptive policy.

If the number of entries or the weight of an entry can be estimated, then a strong reference cache is typically the preferred approach. If that is problematic, usually one has to investigate off-heap caching.

So far resetting the ResourceCache has been effective. I could try amortizing that, e.g. resetting it every N pages, to gain slightly better reuse as you indicated.
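As a rough illustration of the bounded, strong-reference alternative described above, here is a minimal sketch built only on the JDK. The class name BoundedCache and the parameter maximumSize are hypothetical and are not part of PDFBox or Caffeine. It evicts in plain LRU order, which, as noted above, is easy to pollute, so treat it as a stand-in for a frequency-aware policy rather than a recommendation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-bounded cache holding strong references (illustration only). */
final class BoundedCache<K, V> {
  private final Map<K, V> cache;

  public BoundedCache(final int maximumSize) {
    // accessOrder=true keeps entries ordered from least to most recently used
    this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evict the LRU entry once over the bound
        return size() > maximumSize;
      }
    };
  }

  // A read also reorders the map in access-order mode, so it must be synchronized too
  public synchronized V get(K key) {
    return cache.get(key);
  }

  public synchronized void put(K key, V value) {
    cache.put(key, value);
  }

  public synchronized int size() {
    return cache.size();
  }
}
```

Because the bound is explicit, the number of retained entries never depends on how eagerly a particular collector clears soft references. The open question raised in this comment still applies: the bound is only useful if the entry count or weight can be estimated.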
If I had a better sense of the objects being cached, I would switch to a Caffeine-backed version for an explicit bound. Can the ResourceCache be shared across documents or are the entries document specific?

{code:java}
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.Map;

final ReferenceQueue<V> queue;
final Map<K, SoftValueReference<K, V>> cache;

public void put(K key, V value) {
  prune();
  cache.put(key, new SoftValueReference<>(key, value, queue));
}

public V get(K key) {
  prune();
  var ref = cache.get(key);
  return (ref == null) ? null : ref.get();
}

// Drain the queue and drop the map entries whose values were collected
private void prune() {
  Reference<? extends V> ref;
  while ((ref = queue.poll()) != null) {
    @SuppressWarnings("unchecked")
    var reference = (SoftValueReference<K, V>) ref;
    // Remove only if the key still maps to this reference, so that a newer
    // value stored under the same key is not discarded
    cache.remove(reference.getKey(), reference);
  }
}

static final class SoftValueReference<K, V> extends SoftReference<V> {
  private final K key;

  public SoftValueReference(K key, V value, ReferenceQueue<V> queue) {
    super(value, queue);
    this.key = key;
  }

  public K getKey() {
    return key;
  }
}
{code}

> Memory leak due to soft reference caching
> -----------------------------------------
>
>                 Key: PDFBOX-4396
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4396
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.12
>         Environment: JDK10; G1
>            Reporter: Ben Manes
>            Priority: Major
>         Attachments: memory leak 2.png, memory leak.png
>
>
> In a heap dump, it appears that DefaultResourceCache is retaining 5.3 GB of
> memory due to buffered images (via PDImageXObject). I suspect that G1 is not
> collecting soft references across all regions before it throws an
> out-of-memory error.
> In PDFBOX-4389, I discovered very slow PDDocument#load times due to a JDK10
> I/O bug. Previously I was loading the document to render each page, but this
> took 1.5 minutes. To work around that bug I reused the document instance
> across pages. This seems to have failed because the pages were cached and not
> cleared by the GC.
> The DefaultResourceCache does not prune its cache entries when the soft
> references are collected.
> Like WeakHashMap, it should use a ReferenceQueue,
> poll it on every access, and prune accordingly.
> Thankfully PDDocument#setResourceCache exists. For now I am going to reset
> the cache to a new instance after a page has been rendered. The entries
> should no longer be reachable and be GC'd more aggressively. If that doesn't
> work, I'll either replace the cache (e.g. with Caffeine) or disable it by
> setting the instance to null.
> I think the desired fix is to prune the DefaultResourceCache and, ideally,
> reconsider usage of soft references (as they tend to be poor in practice).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org