In addition, trying to parse text using ExtractText, got this:

Dec 03, 2019 8:17:30 AM org.apache.pdfbox.filter.FlateFilter decompress
WARNING: FlateFilter: premature end of stream due to a DataFormatException
Dec 03, 2019 8:17:30 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Could not read embedded TTF for font ABCDEE+Calibri-BoldItalic
java.io.IOException: LangSysRecords not alphabetically sorted by LangSys tag: Q <= í$
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:125)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:89)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:61)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:872)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:506)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:480)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:377)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:274)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
Dec 03, 2019 8:17:30 AM org.apache.pdfbox.filter.FlateFilter decompress
WARNING: FlateFilter: premature end of stream due to a DataFormatException
Dec 03, 2019 8:17:30 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Could not read embedded OTF for font ABCDEE+Calibri-BoldItalic
java.io.IOException: LangSysRecords not alphabetically sorted by LangSys tag: Q <= í$
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:125)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
    at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:112)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:65)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:139)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:196)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:61)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:872)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:506)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:480)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:377)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:274)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
Dec 03, 2019 8:17:30 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+32 (32) in font ABCDEE+Calibri-BoldItalic
Dec 03, 2019 8:17:30 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 codeToGID
WARNING: Failed to find a character mapping for 32 in ABCDEE+Calibri-BoldItalic

Thanks

On Tue, Dec 3, 2019 at 8:16 AM Slava G wrote:

> Hi,
> I've tried to run PDFDebugger from the latest PDFBox. What is the normal
> expected result? In my case it just hung after printing:
> Dec 03, 2019 7:58:51 AM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
> INFO: use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
> Dec 03, 2019 7:58:51 AM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
> INFO: or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
>
> Thanks
>
> On Mon, Dec 2, 2019 at 11:13 PM Tilman Hausherr wrote:
>
>> Send it to me, tilman at snafu dot de.
>>
>> (The readLangSysTable problem should be solved in 2.0.17, so make sure
>> you are using that one)
>>
>> Oops I see this is the tika list, so maybe that is a lower version.
>> Please retry with a "freshly downloaded" PDFDebugger of the pdfbox website.
>>
>> Tilman
>>
>> On 02.12.2019 at 16:42, Slava G wrote:
>>
>> I have a PDF that reproduces a similar problem:
>>
>> java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
>>     at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
>>     at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
>>     at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
>>     at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
>>     at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
>>     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
>>     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
>>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
>>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
>>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
>>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
>>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
>>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>>
>> To whom can I send the PDF for investigation? (It's from a customer, so
>> I can't send it publicly.)
>>
>> Thanks
>>
>> On Sun, Nov 3, 2019 at 12:10 PM Slava G wrote:
>>
>>> Well, it's not easy to provide those documents, as they're customer
>>> content and I need the customer's approval for that. I'll try, and
>>> will let you know.
>>> Thanks
>>>
>>> On Sun, Nov 3, 2019 at 11:45 AM Tilman Hausherr wrote:
>>>
>>>> Hello,
>>>>
>>>> I'd be interested in the OOM exception. The one below aborts the
>>>> parsing. Can you open a PDFBox issue and attach your PDF? We could just
>>>> skip the table here instead of failing.
>>>>
>>>> Re the OOM we'd also need a PDF.
>>>>
>>>> Skipping parsing of embedded ttf will possibly have a negative impact
>>>> on text extraction.
>>>>
>>>> Tilman
>>>>
>>>>
>>>> On 03.11.2019 at 10:38, Slava G wrote:
>>>> > Hi,
>>>> > When parsing some PDF files we see different errors. One is an
>>>> > OutOfMemory exception during PDF parsing, and another is:
>>>> >
>>>> > WARN - Could not read embedded TTF for font ABCDEE+Segoe UI,BoldItalic
>>>> > java.io.IOException: Kerning sub-table too short, got 0 bytes, expect 6 or more.
>>>> >     at org.apache.fontbox.ttf.KerningSubtable.readSubtable0(KerningSubtable.java:191)
>>>> >     at org.apache.fontbox.ttf.KerningSubtable.read(KerningSubtable.java:70)
>>>> >     at org.apache.fontbox.ttf.KerningTable.read(KerningTable.java:80)
>>>> >     at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353)
>>>> >     at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
>>>> >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
>>>> >     at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
>>>> >     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198)
>>>> >     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>>>> >     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>>>> >     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>>>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
>>>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
>>>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
>>>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
>>>> >     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>>>> >     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>>>> >     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
>>>> >     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835)
>>>> >     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>>> >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
>>>> >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>>>> >
>>>> > How can I skip parsing of embedded TTF inside PDF?
>>>> >
>>>> > Thanks
