Donald Van den Driessche created CONNECTORS-1593: ----------------------------------------------------
Summary: Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147) Key: CONNECTORS-1593 URL: https://issues.apache.org/jira/browse/CONNECTORS-1593 Project: ManifoldCF Issue Type: Bug Components: Tika extractor Affects Versions: ManifoldCF 2.12 Reporter: Donald Van den Driessche I have created an Issue with fontbox too: When using the internal Tika extractor in a Manifold Job on certain occasions I get an Out of Memory Error. {code:java} Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of memory - shutting down Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: Java heap space Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$MonitoredAddActivityWrapper.sendDocument(IncrementalIngester.java:3471) Mar 16 14:20:06 manifold01 manifoldcf[15747]: at digital.formica.manifold.connector.transformation.fetchwebresource.WebresourceFetchTransformationConnector.addOrReplaceDocumentWithException(WebresourceFetchTransformationConnector.java:118) {code} I've allocated 8g of heap size, Installed the latest version of Tika (1.20) and PDFBOX (2.0.14). But no solutions found. After a heap dump and analyzing this dump, I notice that it is the Integer class that takes about 2.6g of memory. Any suggestions? -- This message was sent by Atlassian JIRA (v7.6.3#76005)