[
https://issues.apache.org/jira/browse/PDFBOX-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946052#comment-17946052
]
Tilman Hausherr edited comment on PDFBOX-4668 at 4/21/25 6:16 AM:
------------------------------------------------------------------
This happens with 2.0 with file [^002145.pdf]:
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "String.equals(Object)" because the return value of "org.apache.pdfbox.pdmodel.font.PDFont.getName()" is null
    at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:571)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:382)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:308)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:256)
    at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:403)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:300)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> Add ResourceCacheFactory as global setting
> ------------------------------------------
>
> Key: PDFBOX-4668
> URL: https://issues.apache.org/jira/browse/PDFBOX-4668
> Project: PDFBox
> Issue Type: New Feature
> Components: Rendering
> Affects Versions: 3.0.4 PDFBox, 4.0.0
> Reporter: Ben Manes
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 3.0.5 PDFBox, 4.0.0
>
> Attachments: 002145.pdf, Screenshot 2023-03-20 at 18.57.40.png,
> memory.png, threads.png
>
>
> Image rendering is cached by {{DefaultResourceCache}} per-document using soft
> references. As described in the [FAQ|https://pdfbox.apache.org/2.0/faq.html],
> this can lead to an {{OutOfMemoryError}} when, e.g., processing many documents
> in parallel. The configuration of this cache is per-document and it is
> initialized with the default.
> {code}
> // document-wide cached resources
> private ResourceCache resourceCache = new DefaultResourceCache();
> {code}
> This requires all call sites to be modified to disable it, some of which may
> be in third-party code. The ask is for a static factory that configures the
> default globally and returns a new {{DefaultResourceCache}} when called. This
> would let a user specify a different factory, e.g. one that returns a custom
> cache, or {{null}} if caching is disabled.
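
A minimal sketch of what such a global factory could look like. Only the name {{ResourceCacheFactory}} comes from the issue title; the method names ({{setFactory}}, {{create}}) and the use of {{Supplier}} are assumptions for illustration, and the two cache types below are stand-ins for the real PDFBox classes, not their actual definitions.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Stand-in for PDFBox's ResourceCache interface (the real one declares typed get/put methods).
interface ResourceCache {}

// Stand-in for PDFBox's DefaultResourceCache.
class DefaultResourceCache implements ResourceCache {}

// Hypothetical holder for the global default; API shape is assumed, not taken from PDFBox.
final class ResourceCacheFactory {
    // volatile so that a factory installed at startup is visible to all worker threads
    private static volatile Supplier<ResourceCache> factory = DefaultResourceCache::new;

    private ResourceCacheFactory() {}

    // Replace the global default, e.g. with () -> null to disable caching.
    static void setFactory(Supplier<ResourceCache> newFactory) {
        factory = Objects.requireNonNull(newFactory, "newFactory");
    }

    // Would be called once per document; may return null when caching is disabled.
    static ResourceCache create() {
        return factory.get();
    }
}
```

With something like this, the document field quoted above could be initialized from {{ResourceCacheFactory.create()}} instead of {{new DefaultResourceCache()}}, and an application could call {{ResourceCacheFactory.setFactory(() -> null)}} once at startup rather than touching every call site.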
> Soft references are a problematic caching scheme that degrades poorly. It
> is very likely that the many large image fragments cause GC promotion
> (eden=>young=>old), which requires a full GC to collect. Under memory/CPU
> pressure, the GC can devolve into a death spiral of collecting the minimal
> heap space to meet its pause time constraints, leading to repeated GCs due
> to soft reference pollution and an eventual OOME. If caching is enabled, it
> might be preferable for it to be size-based (by rough byte size) and perhaps
> tied into the {{MemoryUsageSetting}} main memory configuration.
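
To illustrate the size-based alternative mentioned above, here is a rough sketch of a byte-bounded LRU cache. The class {{ByteBoundedCache}} is hypothetical and not part of PDFBox; it stores plain {{byte[]}} values to keep the size accounting obvious, whereas a real cache would need to estimate the size of images and other resources, and the budget might be derived from {{MemoryUsageSetting}}.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical byte-bounded LRU cache; an illustration only, not PDFBox code.
final class ByteBoundedCache<K> {
    private final long maxBytes;
    private long currentBytes;
    // accessOrder=true: iteration starts at the least-recently-used entry.
    private final LinkedHashMap<K, byte[]> entries = new LinkedHashMap<>(16, 0.75f, true);

    ByteBoundedCache(long maxBytes) {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        this.maxBytes = maxBytes;
    }

    synchronized byte[] get(K key) {
        return entries.get(key); // also marks the entry as most recently used
    }

    synchronized void put(K key, byte[] value) {
        byte[] previous = entries.remove(key);
        if (previous != null) {
            currentBytes -= previous.length;
        }
        if (value.length > maxBytes) {
            return; // larger than the whole budget: do not cache
        }
        entries.put(key, value);
        currentBytes += value.length;
        // Evict least-recently-used entries until the budget is respected.
        Iterator<Map.Entry<K, byte[]>> it = entries.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            currentBytes -= it.next().getValue().length;
            it.remove();
        }
    }

    synchronized long sizeInBytes() {
        return currentBytes;
    }
}
```

Unlike soft references, entries here are strongly held but capped, so memory use stays within a known budget and eviction does not depend on GC pressure.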