[
https://issues.apache.org/jira/browse/PDFBOX-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051710#comment-18051710
]
Tilman Hausherr commented on PDFBOX-6146:
-----------------------------------------
I committed your proposed changes. I also tried an alternative, which was to
simply return when an entry occurred twice, but then unit tests failed. This
means that there are legitimate fonts that contain duplicate entries. I'm now
running our large regression tests (PDFBOX-6140), which should be done within a
few hours.
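
To illustrate why returning on the first repeated entry can break legitimate fonts, here is a minimal sketch. It is not the committed change and not FontBox code: the class `LookupOffsetSketch`, its methods, and the idea of caching one parsed result per offset are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in FontBox.
final class LookupOffsetSketch {

    // Stand-in for parsing the structure found at a given offset.
    static String parseSubtable(int offset) {
        return "subtable@" + offset;
    }

    // Alternative described in the comment: give up as soon as an offset repeats.
    // Every entry after the first duplicate is lost.
    static List<String> stopOnDuplicate(int[] offsets) {
        Set<Integer> seen = new HashSet<>();
        List<String> parsed = new ArrayList<>();
        for (int offset : offsets) {
            if (!seen.add(offset)) {
                return parsed; // duplicate found: stop reading
            }
            parsed.add(parseSubtable(offset));
        }
        return parsed;
    }

    // Assumed tolerant variant: keep every entry, but parse each distinct
    // offset only once and reuse the result for repeated offsets.
    static List<String> parseEachOnce(int[] offsets) {
        Map<Integer, String> cache = new HashMap<>();
        List<String> parsed = new ArrayList<>();
        for (int offset : offsets) {
            parsed.add(cache.computeIfAbsent(offset, o -> parseSubtable(o)));
        }
        return parsed;
    }

    public static void main(String[] args) {
        int[] offsets = {10, 20, 10, 30}; // offset 10 is listed twice
        System.out.println(stopOnDuplicate(offsets)); // [subtable@10, subtable@20]
        System.out.println(parseEachOnce(offsets).size()); // 4
    }
}
```

With the early return, the entry at offset 30 is never read, which is the kind of data loss that would make tests on valid fonts with repeated entries fail.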
> OutOfMemoryError when trying to extract text from pdf
> -----------------------------------------------------
>
> Key: PDFBOX-6146
> URL: https://issues.apache.org/jira/browse/PDFBOX-6146
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.6 PDFBox
> Environment: Java 17, macOS 26.2
> Reporter: james
> Priority: Blocker
> Attachments: test.pdf
>
>
> I have a PDF file which causes an OutOfMemoryError when trying to extract
> the text. Unfortunately, the file is a customer file which I cannot share
> (it is only about 5 MB). I am willing to work with someone, however, on
> debugging the issue. Before it fails with the OOME, I get the following errors:
> {{16:45:23.418 [main] ERROR o.a.f.ttf.GlyphSubstitutionTable -
> scriptOffsets[1680]: 10084 implausible: data.getCurrentPosition() - offset =
> 10088 ()}}
> {{16:45:23.419 [main] WARN o.a.f.ttf.GlyphSubstitutionTable - FeatureRecord
> array not alphabetically sorted by FeatureTag: S»d» < »d»c ()}}
> The above output is from running on 3.0.6, which seems to have improved
> something relevant. This was originally failing on 3.0.5, in which case I
> got around 40k log entries like:
> {{16:51:52.897 [main] ERROR o.a.f.ttf.GlyphSubstitutionTable - LangSysRecords
> not alphabetically sorted by LangSys tag: PH'd <= PLS» ()}}
> I've tried allocating up to 6 GB to the extraction process without any
> success. I can open the PDF file with macOS Preview without any issues.
> The relevant stack trace is:
> {{java.lang.OutOfMemoryError: Java heap space
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupTable(GlyphSubstitutionTable.java:341)
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLookupList(GlyphSubstitutionTable.java:292)
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:115)
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:409)
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:186)
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:165)
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:66)
> org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:123)
> org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:72)
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:385)
> org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:97)
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:173)
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:170)
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:72)
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:926)
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:559)
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:517)
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:158)
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:153)
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:380)}}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)