[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798910#comment-16798910 ]
Karl Wright commented on CONNECTORS-1593:
-----------------------------------------

In ManifoldCF we rigorously adhere to a "bounded memory consumption" philosophy: connectors must be written so that they are not sensitive to the size of the data they are indexing. Streams are used, and the data never "hits memory". But if you aren't careful, your custom connector might well put entire documents into memory, and then all it takes is two large documents at the same time and you are hosed.

Can you check your custom connector for that issue? If there is a problem there, you could work around it by limiting the number of custom connector connections to 1. If that works reliably, then you know where the issue is.

> Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---------------------------------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too:
>
> When using the internal Tika extractor in a Manifold Job, on certain occasions I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$MonitoredAddActivityWrapper.sendDocument(IncrementalIngester.java:3471)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at digital.formica.manifold.connector.transformation.fetchwebresource.WebresourceFetchTransformationConnector.addOrReplaceDocumentWithException(WebresourceFetchTransformationConnector.java:118)
> {code}
> I've allocated 8g of heap and installed the latest versions of Tika (1.20) and PDFBox (2.0.14), but found no solution.
> After taking a heap dump and analyzing it, I noticed that the Integer class takes up about 2.6g of memory.
> Any suggestions?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
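
For readers unfamiliar with the "bounded memory consumption" idea in the comment above, here is a minimal sketch of what it means in plain Java. It is not ManifoldCF code: the class `BoundedCopy` and its `copy` method are hypothetical names used only for illustration. The point is that the document is moved through a fixed-size buffer, so memory use depends on the buffer size rather than the document size.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper (not part of ManifoldCF): streams a document through a fixed-size buffer. */
final class BoundedCopy {

    private BoundedCopy() {}

    /**
     * Copies all bytes from in to out while holding at most bufferSize
     * bytes of the document in memory at any one time.
     * Returns the number of bytes copied.
     */
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        // read() returns -1 at end of stream; each pass reuses the same buffer.
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large document. The source and sink are in-memory here
        // only to keep the demo self-contained; a real connector would read from
        // the repository stream and write to the downstream pipeline.
        byte[] document = new byte[1_000_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(document), sink, 8192);
        System.out.println("copied " + copied + " bytes");
    }
}
```

The pattern to look for in a custom connector is the opposite of this: code that reads a whole document into a `byte[]` or `String` before passing it on. With several worker threads doing that concurrently, heap use scales with document size.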