I have a PDF that reproduces a similar problem:

java.lang.OutOfMemoryError: Java heap space
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)

To whom can I send the PDF for investigation? It comes from a customer, so I can't share it publicly.
Thanks

On Sun, Nov 3, 2019 at 12:10 PM Slava G <[email protected]> wrote:

> Well, it's not easy to provide those documents: they are customer content,
> and I need the customer's approval to share them. I'll try and will let you know.
> Thanks
>
> On Sun, Nov 3, 2019 at 11:45 AM Tilman Hausherr <[email protected]> wrote:
>
>> Hello,
>>
>> I'd be interested in the OOM exception. The one below aborts the
>> parsing. Can you open a PDFBox issue and attach your PDF? We could just
>> skip the table here instead of failing.
>>
>> Re the OOM we'd also need a PDF.
>>
>> Skipping parsing of embedded ttf will possibly have a negative impact on
>> text extraction.
>>
>> Tilman
>>
>>
>> On 03.11.2019 at 10:38, Slava G wrote:
>> > Hi,
>> > When parsing some PDF files we see different errors related to PDF
>> > parsing. One is an OutOfMemory exception during PDF parsing, and another is:
>> >
>> > WARN - Could not read embedded TTF for font ABCDEE+Segoe UI,BoldItalic
>> > java.io.IOException: Kerning sub-table too short, got 0 bytes, expect 6 or more.
>> >     at org.apache.fontbox.ttf.KerningSubtable.readSubtable0(KerningSubtable.java:191)
>> >     at org.apache.fontbox.ttf.KerningSubtable.read(KerningSubtable.java:70)
>> >     at org.apache.fontbox.ttf.KerningTable.read(KerningTable.java:80)
>> >     at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
>> >     at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
>> >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
>> >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
>> >     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
>> >     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>> >     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>> >     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
>> >     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>> >     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>> >     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
>> >     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
>> >     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>> >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
>> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>> >
>> > How can I skip parsing of embedded TTF inside PDF?
>> >
>> > Thanks
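
As a rough illustration of the idea Tilman mentions in the thread (skip an unreadable table instead of failing the whole font), here is a minimal Java sketch. It is not FontBox code: `LenientTableLoader`, `TableReader`, `loadTables` and `readKerning` are hypothetical names, and treating `kern` as an optional table is an assumption made only for this example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LenientTableLoader {

    /** Hypothetical reader for one font table; not a FontBox interface. */
    interface TableReader {
        void read(byte[] data) throws IOException;
    }

    /**
     * Reads every table that has a registered reader. A failure in an optional
     * table is recorded and skipped; a failure in any other table still aborts.
     */
    static List<String> loadTables(Map<String, byte[]> tables,
                                   Map<String, TableReader> readers,
                                   Set<String> optionalTags) throws IOException {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : tables.entrySet()) {
            String tag = entry.getKey();
            TableReader reader = readers.get(tag);
            if (reader == null) {
                continue; // no reader registered for this tag
            }
            try {
                reader.read(entry.getValue());
            } catch (IOException e) {
                if (!optionalTags.contains(tag)) {
                    throw e; // required table: keep failing as before
                }
                skipped.add(tag); // optional table: remember it and move on
            }
        }
        return skipped;
    }

    /** Mimics the "too short" check from the trace; the 6-byte minimum is taken from that message. */
    static void readKerning(byte[] data) throws IOException {
        if (data.length < 6) {
            throw new IOException("Kerning sub-table too short, got "
                    + data.length + " bytes, expect 6 or more.");
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> tables = new LinkedHashMap<>();
        tables.put("head", new byte[54]); // placeholder bytes, not a real table
        tables.put("kern", new byte[0]);  // empty, like the failing sub-table

        Map<String, TableReader> readers = new LinkedHashMap<>();
        readers.put("head", data -> { }); // stand-in reader that accepts anything
        readers.put("kern", LenientTableLoader::readKerning);

        List<String> skipped = loadTables(tables, readers, Collections.singleton("kern"));
        System.out.println("Skipped tables: " + skipped);
    }
}
```

With this shape the font would still load without kerning data, which matches the caveat in the thread that skipping embedded font data may affect text extraction. It does not address the OutOfMemoryError, which is an `Error` rather than an `IOException` and would need its own investigation.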
