[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949467#comment-16949467 ]

Donald Van den Driessche commented on CONNECTORS-1625:
------------------------------------------------------

After another test, we came to the conclusion that the file is processed correctly after choosing "-- No extraction selected --" instead of "General purpose extraction" for the Boilerpipe parameter. Now I have to estimate the impact of the different Boilerpipe parameters.

> When processing a specific PDF Manifold goes out of memory
> ----------------------------------------------------------
>
> Key: CONNECTORS-1625
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1625
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: abd-serotec-antibodies-uk.pdf
>
> When processing the attached file with ManifoldCF 2.12, we keep getting an out of memory error.
> When just parsing it through Tika 1.18, no issues are found.
> Can anyone look into it?
> Thanks in advance!

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949443#comment-16949443 ]

Donald Van den Driessche commented on CONNECTORS-1625:
------------------------------------------------------

After running the same process (with the same config) locally, we had no issues. So, it might be something with the streams. We've written a custom connector to fetch the files. It might use the wrong way to provide the file to the Tika parser.
[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944414#comment-16944414 ]

Donald Van den Driessche commented on CONNECTORS-1625:
------------------------------------------------------

We are running this PDF as the one and only document. It's ManifoldCF 2.12. We tried to parse it through Tika locally with Tika 1.18 and 1.22 and both succeeded. We've set the heap space to 3G and 5G and still get the same issue. I've now read somewhere that disk space might be used. But since the file is only 21 MB, I don't see how much disk space might be used.
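Since raising the heap to 3G and 5G made no difference, one quick sanity check is whether a JVM started with the same options actually ends up with that heap. The following is a minimal sketch, assuming the heap is configured through the standard `-Xmx` option. `HeapCheck` is a hypothetical helper and not part of ManifoldCF; it reports the heap of the JVM it runs in, so it only says something when launched with the same options as the agents process.

```java
// Hypothetical helper (not part of ManifoldCF): prints the maximum heap
// the current JVM is allowed to use.
final class HeapCheck {

    /** Maximum heap in MiB, as reported by the running JVM. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

On Java 11 or later this can be run directly, for example with `java -Xmx3g HeapCheck.java`. The reported value may be somewhat below the configured `-Xmx`, depending on the garbage collector in use.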
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846635#comment-16846635 ]

Donald Van den Driessche commented on CONNECTORS-1593:
------------------------------------------------------

[~kwri...@metacarta.com] Apparently the problem exists on the server of our client. This incident can be closed.

> Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---------------------------------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
> I have created an Issue with fontbox too:
>
> When using the internal Tika extractor in a Manifold Job on certain occasions I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> Mar 16
[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything
[ https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826755#comment-16826755 ]

Donald Van den Driessche commented on CONNECTORS-1602:
------------------------------------------------------

Karl,

The website we're crawling also requires session-based login. What happens with cookies in a continuous crawl?

> Continuous crawling doesn't recrawl everything
> ----------------------------------------------
>
> Key: CONNECTORS-1602
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1602
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Reporter: Donald Van den Driessche
> Priority: Major
>
> When crawling a website in continuous crawling mode we saw that not all documents are recrawled.
> The site is quite extensive. We figured out that, after crawling, a document/page gets a recrawl timestamp between the recrawl interval and the max recrawl interval.
> But if these values occur within the first crawl, Manifold starts recrawling those, but seems to ignore the rest of the website. Also, sometimes documents get recrawled 5 times while others don't get recrawled at all, apparently due to the same issue.
>
> Is it possible to shed a bit more light on the continuous crawling?
> Is it a good system to use for crawling an (extensive) website?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything
[ https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826727#comment-16826727 ]

Donald Van den Driessche commented on CONNECTORS-1602:
------------------------------------------------------

Thanks. I know it runs continuously, but I'm wondering what happens when the recrawl timestamp is reached for documents. Will it first recrawl and then continue crawling, or continue crawling and then do the recrawl, or crawl and recrawl simultaneously? The last option might slow down the crawling speed.
[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything
[ https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826707#comment-16826707 ]

Donald Van den Driessche commented on CONNECTORS-1602:
------------------------------------------------------

Ok, thanks. That already clears some things up.

How does Manifold know a document doesn't change that often if it isn't crawled?

If a full crawl takes about 8 hours but you make your recrawl intervals smaller than that, will it start recrawling before the job has completed a full run? And if so, might that interfere with the termination of the job, so that it never gets to a full run?
[jira] [Created] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything
Donald Van den Driessche created CONNECTORS-1602:
-------------------------------------------------

Summary: Continuous crawling doesn't recrawl everything
Key: CONNECTORS-1602
URL: https://issues.apache.org/jira/browse/CONNECTORS-1602
Project: ManifoldCF
Issue Type: Bug
Components: Web connector
Reporter: Donald Van den Driessche

When crawling a website in continuous crawling mode we saw that not all documents are recrawled.
The site is quite extensive. We figured out that, after crawling, a document/page gets a recrawl timestamp between the recrawl interval and the max recrawl interval.
But if these values occur within the first crawl, Manifold starts recrawling those, but seems to ignore the rest of the website. Also, sometimes documents get recrawled 5 times while others don't get recrawled at all, apparently due to the same issue.

Is it possible to shed a bit more light on the continuous crawling?
Is it a good system to use for crawling an (extensive) website?
[jira] [Comment Edited] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814330#comment-16814330 ]

Donald Van den Driessche edited comment on CONNECTORS-1593 at 4/10/19 11:00 AM:
--------------------------------------------------------------------------------

Hi [~kwri...@metacarta.com]

The connector is a custom connector. It uses CloseableHttpClient with basic authentication.

When we created a project outside Manifold to download the files, we had the same issues. About 8% of the downloaded docs had different hashes. Only when we put about 3 seconds between the 2 downloads of the same file did we get 0% different hashes.

was (Author: donaldvdd):
Hi [~kwri...@metacarta.com]

The connector is a custom connector. It uses CloseableHttpClient with basic authentication.
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814330#comment-16814330 ]

Donald Van den Driessche commented on CONNECTORS-1593:
------------------------------------------------------

Hi [~kwri...@metacarta.com]

The connector is a custom connector. It uses CloseableHttpClient with basic authentication.
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814179#comment-16814179 ]

Donald Van den Driessche commented on CONNECTORS-1593:
------------------------------------------------------

[~kwri...@metacarta.com]

After further investigation we also figured out that only 2-3 bytes had been replaced by 0x00 bytes, and this was the case in every file that gave an issue.

Because the issue didn't occur in every file, and occurred in different files on each run, we narrowed it down to possible download issues. Another reason is that no manipulation is done to the input before it is passed to the Tika parser.

The current workaround is to download the file twice and compare a hash of the two copies. If the hashes differ, we retry the double download. When the hash is the same, we continue with one of the downloaded files.

This gave us no more PDFParser issues. The speed of the process wasn't really impacted either.
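The double-download workaround described in the comment above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual connector code: the `Fetcher` interface, the `fetchVerified` method, the bounded retry count and the choice of SHA-256 are all hypothetical, since the thread does not say which hash was used or how retries were limited.

```java
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class VerifiedDownload {

    /** Hypothetical stand-in for the connector's download call; not a ManifoldCF API. */
    interface Fetcher {
        byte[] fetch() throws IOException;
    }

    /** SHA-256 digest of the given bytes (hash choice is an assumption). */
    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Downloads the resource twice per attempt and returns the content only
     * when both copies hash to the same value; otherwise retries the pair.
     */
    static byte[] fetchVerified(Fetcher fetcher, int maxAttempts) throws IOException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            byte[] first = fetcher.fetch();
            byte[] second = fetcher.fetch();
            if (MessageDigest.isEqual(sha256(first), sha256(second))) {
                return first;
            }
        }
        throw new IOException("Downloads still differed after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws IOException {
        byte[] good = {0x25, 0x50, 0x44, 0x46}; // "%PDF"
        int[] calls = {0};
        // Simulated flaky source: the very first download has one byte zeroed.
        Fetcher flaky = () -> {
            byte[] copy = good.clone();
            if (calls[0]++ == 0) {
                copy[2] = 0x00;
            }
            return copy;
        };
        byte[] result = fetchVerified(flaky, 3);
        // prints "true after 4 fetches"
        System.out.println(Arrays.equals(result, good) + " after " + calls[0] + " fetches");
    }
}
```

With both copies held in memory, comparing the arrays directly would work just as well; hashing matters more when the copies are streamed to disk and only the digests are kept. Note that two downloads corrupted in exactly the same way would still pass this check.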
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813045#comment-16813045 ]

Donald Van den Driessche commented on CONNECTORS-1593:
------------------------------------------------------

[~kwri...@metacarta.com]

I added some more info in the issue at PDFBOX.
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798815#comment-16798815 ]

Donald Van den Driessche commented on CONNECTORS-1593:
--

[~kwri...@metacarta.com] Yes, all connectors for that job pipeline are set to 2.

The pipeline consists of 4 stages:
* JDBC connector: gets the URLs and some metadata from an MSSQL repository
* Webresource fetch connector (custom): based on the passed URL, connects to an intranet site (basic header authentication) and retrieves a binary file
* Tika extractor: extracts the content of the binary file
* Elasticsearch output connector: sends the data to Elastic

I've re-examined the code for the webresource fetch connector, but I don't think it contains any memory leaks.

!image-2019-03-22-08-57-53-887.png!

> Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too:
>
> When using the internal Tika extractor in a Manifold Job on certain occasions
> I get an Out of Memory Error.
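For readers wondering what a fetch stage that does not accumulate memory might look like, here is a minimal sketch. It is not the custom connector from this thread and uses no ManifoldCF API; the class name `SpoolToDisk` and the method `spool` are hypothetical. It assumes the fetched binary arrives as an `InputStream` and spools it to a temporary file instead of buffering it in a byte array. Note that the reported OutOfMemoryError is raised inside fontbox during parsing, so this only helps rule out the fetch stage as a contributor.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only; not part of ManifoldCF.
final class SpoolToDisk {

    // Copies the stream to a temporary file and returns its path.
    // Files.copy streams the data, so the whole document is never
    // held in the heap at once. The input stream is always closed.
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("fetched-", ".bin");
        try (InputStream source = in) {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }
}
```

The caller would open a fresh stream on the returned path for the next pipeline stage and delete the file once the document has been processed.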
[jira] [Updated] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Donald Van den Driessche updated CONNECTORS-1593:
-
Attachment: image-2019-03-22-08-57-53-887.png

> Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too:
>
> When using the internal Tika extractor in a Manifold Job on certain occasions
> I get an Out of Memory Error.
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796015#comment-16796015 ]

Donald Van den Driessche commented on CONNECTORS-1593:
--

The number of threads in the properties is 100. I have allocated 2 max connections to each connector in the job. This job is the only one running at that time.

> Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
>
> I have created an Issue with fontbox too:
>
> When using the internal Tika extractor in a Manifold Job on certain occasions
> I get an Out of Memory Error.
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795984#comment-16795984 ]

Donald Van den Driessche commented on CONNECTORS-1593:
--

It doesn't seem to be one specific document. The process reads an SQL table and then uses the URL stored in that table to retrieve the document. Each run of the job processes 4100 documents. Sometimes the error occurs after 200 documents, sometimes after 1500, sometimes after 3800, ... And, oddly, sometimes the whole process runs perfectly.

It's a really weird situation. I've tried turning connector logging on, but not much was logged.

> Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
>
> I have created an Issue with fontbox too:
>
> When using the internal Tika extractor in a Manifold Job on certain occasions
> I get an Out of Memory Error.
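Because the failure shows up after a different number of documents on each run, one way to narrow it down is to record heap usage next to the document count and check whether memory climbs steadily or jumps on particular documents. The sketch below only uses the standard `java.lang.Runtime` API; the class `HeapSnapshot`, its methods, and the idea of calling it from a pipeline stage are assumptions for illustration, not something ManifoldCF provides.

```java
// Hypothetical diagnostic helper, for illustration only.
final class HeapSnapshot {

    // Heap currently in use by the JVM, in bytes.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // One log line that ties heap usage to the number of documents seen so far.
    static String describe(int documentsProcessed) {
        long mb = 1024L * 1024L;
        long usedMb = usedHeapBytes() / mb;
        long maxMb = Runtime.getRuntime().maxMemory() / mb;
        return "docs=" + documentsProcessed
                + " usedHeapMB=" + usedMb
                + " maxHeapMB=" + maxMb;
    }

    public static void main(String[] args) {
        // Simulated counter; a real job would pass its own document count.
        for (int docs = 500; docs <= 4000; docs += 500) {
            System.out.println(describe(docs));
        }
    }
}
```

A steady rise across documents would point toward something being retained between documents, while a sudden jump on one document would point toward that document's content, for example an embedded font.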
[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
[ https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795951#comment-16795951 ]

Donald Van den Driessche commented on CONNECTORS-1593:
--

I added it under the external issue. https://issues.apache.org/jira/browse/PDFBOX-4489

> Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
>
> I have created an Issue with fontbox too:
>
> When using the internal Tika extractor in a Manifold Job on certain occasions
> I get an Out of Memory Error.
[jira] [Created] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
Donald Van den Driessche created CONNECTORS-1593:

Summary: Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
Key: CONNECTORS-1593
URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
Project: ManifoldCF
Issue Type: Bug
Components: Tika extractor
Affects Versions: ManifoldCF 2.12
Reporter: Donald Van den Driessche

I have created an Issue with fontbox too:

When using the internal Tika extractor in a Manifold Job on certain occasions I get an Out of Memory Error.
{code:java}
Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of memory - shutting down
Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: Java heap space
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at
[jira] [Commented] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760852#comment-16760852 ]

Donald Van den Driessche commented on CONNECTORS-1579:
--

Thanks for clearing this up.

> Error when crawling a MSSQL table
> -
>
> Key: CONNECTORS-1579
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
> Project: ManifoldCF
> Issue Type: Bug
> Components: JDBC connector
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: 636_bb2.csv
>
>
> When I'm crawling a MSSQL table through the JDBC connector I get the following
> error on multiple rows:
>
> {noformat}
> FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple document primary component dispositions not allowed: document '636'
> java.lang.IllegalStateException: Multiple document primary component dispositions not allowed: document '636'
> at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125) ~[mcf-pull-agent.jar:?]
> at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624) ~[mcf-pull-agent.jar:?]
> at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605) ~[mcf-pull-agent.jar:?]
> at org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944) ~[?:?]
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
> {noformat}
> I looked this error up online, and it was suggested that it might have
> something to do with using the same key for different rows.
> I checked, but I couldn't find any duplicates in any of the fields selected
> by the JDBC queries.
> Here are my queries:
> Seeding query
> {code:java}
> SELECT pk1 as $(IDCOLUMN)
> FROM dbo.bb2
> WHERE search_url IS NOT NULL
> AND mimetype IS NOT NULL AND mimetype NOT IN ('unknown/unknown', 'application/xml', 'application/zip');
> {code}
> Version check query: none
> Access token query: none
> Data query:
>
> {code:java}
> SELECT
> pk1 AS $(IDCOLUMN),
> search_url AS $(URLCOLUMN),
> ISNULL(content, '') AS $(DATACOLUMN),
> doc_id,
> search_url AS url,
> ISNULL(title, '') as title,
> ISNULL(groups,'') as groups,
> ISNULL(type,'') as document_type,
> ISNULL(users, '') as users
> FROM dbo.bb2
> WHERE pk1 IN $(IDLIST);
> {code}
> The attached CSV contains the corresponding row from the table.
> [^636_bb2.csv]
>
> Due to this problem, the whole crawling pipeline is being held up. It keeps
> on retrying this row.
> Could you help me understand this error?
>
> -- This message was sent by Atlassian JIRA (v7.6.3#76005)
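The quoted report mentions the hypothesis that the same key is used for more than one row. A quick way to test that hypothesis outside the database is to export the values of the ID column and look for repeats. The sketch below is a minimal, self-contained illustration; the class `DuplicateKeyCheck`, the method `findDuplicates`, and the sample values are hypothetical, and it makes no claim about what actually caused the error in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of ManifoldCF.
final class DuplicateKeyCheck {

    // Returns every ID that occurs more than once, in first-seen order.
    static List<String> findDuplicates(List<String> ids) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : ids) {
            counts.merge(id, 1, Integer::sum);
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for exported pk1 values.
        List<String> ids = Arrays.asList("635", "636", "637", "636");
        System.out.println(findDuplicates(ids)); // prints [636]
    }
}
```

An empty result for the exported ID column would suggest that duplicate keys in the table are not the explanation and that the cause lies elsewhere.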
[jira] [Updated] (CONNECTORS-1579) Error when crawling a MSSQL table
[ https://issues.apache.org/jira/browse/CONNECTORS-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Donald Van den Driessche updated CONNECTORS-1579:
------------------------------------------------

Description:

When I'm crawling a MSSQL table through the JDBC connector, I get the following error on multiple lines:

{noformat}
FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple document primary component dispositions not allowed: document '636'
java.lang.IllegalStateException: Multiple document primary component dispositions not allowed: document '636'
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125) ~[mcf-pull-agent.jar:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624) ~[mcf-pull-agent.jar:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605) ~[mcf-pull-agent.jar:?]
    at org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944) ~[?:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
{noformat}

I looked this error up on the internet, and it said that it might have something to do with using the same key for different lines.
I checked, but I couldn't find any duplicates that match any of the selected fields in the JDBC queries.

Here are my queries:

Seeding query:
{code:sql}
SELECT pk1 AS $(IDCOLUMN)
FROM dbo.bb2
WHERE search_url IS NOT NULL
  AND mimetype IS NOT NULL
  AND mimetype NOT IN ('unknown/unknown', 'application/xml', 'application/zip');
{code}

Version check query: none
Access token query: none

Data query:
{code:sql}
SELECT
  pk1 AS $(IDCOLUMN),
  search_url AS $(URLCOLUMN),
  ISNULL(content, '') AS $(DATACOLUMN),
  doc_id,
  search_url AS url,
  ISNULL(title, '') AS title,
  ISNULL(groups, '') AS groups,
  ISNULL(type, '') AS document_type,
  ISNULL(users, '') AS users
FROM dbo.bb2
WHERE pk1 IN $(IDLIST);
{code}

The attached CSV is the corresponding line from the table.
[^636_bb2.csv]

Due to this problem, the whole crawling pipeline is being held up. It keeps on retrying this line.
Could you help me understand this error?
[jira] [Created] (CONNECTORS-1579) Error when crawling a MSSQL table
Donald Van den Driessche created CONNECTORS-1579:
-------------------------------------------------

         Summary: Error when crawling a MSSQL table
             Key: CONNECTORS-1579
             URL: https://issues.apache.org/jira/browse/CONNECTORS-1579
         Project: ManifoldCF
      Issue Type: Bug
      Components: JDBC connector
Affects Versions: ManifoldCF 2.12
        Reporter: Donald Van den Driessche
     Attachments: 636_bb2.csv

When I'm crawling a MSSQL table through the JDBC connector, I get the following error on multiple lines:

{noformat}
FATAL 2019-02-05T13:21:58,929 (Worker thread '40') - Error tossed: Multiple document primary component dispositions not allowed: document '636'
java.lang.IllegalStateException: Multiple document primary component dispositions not allowed: document '636'
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.checkMultipleDispositions(WorkerThread.java:2125) ~[mcf-pull-agent.jar:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1624) ~[mcf-pull-agent.jar:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.noDocument(WorkerThread.java:1605) ~[mcf-pull-agent.jar:?]
    at org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.processDocuments(JDBCConnector.java:944) ~[?:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
{noformat}

I looked this error up on the internet, and it said that it might have something to do with using the same key for different lines.
I checked, but I couldn't find any duplicates that match any of the selected fields in the JDBC queries.

Here are my queries:

Seeding query:
{code:sql}
SELECT pk1 AS $(IDCOLUMN)
FROM dbo.bb2
WHERE search_url IS NOT NULL
  AND mimetype IS NOT NULL
  AND mimetype NOT IN ('unknown/unknown', 'application/xml', 'application/zip');
{code}

Version check query: none
Access token query: none

Data query:
{code:sql}
SELECT
  pk1 AS $(IDCOLUMN),
  search_url AS $(URLCOLUMN),
  ISNULL(content, '') AS $(DATACOLUMN),
  doc_id,
  search_url AS url,
  ISNULL(title, '') AS title,
  ISNULL(groups, '') AS groups,
  ISNULL(type, '') AS document_type,
  ISNULL(users, '') AS users
FROM dbo.bb2
WHERE pk1 IN $(IDLIST);
{code}

The attached CSV is the corresponding line from the table.
[^636_bb2.csv]

Could you help me understand this error?
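The duplicate-key check mentioned in the report can also be run outside the database. What follows is a minimal sketch, assuming the table has been exported to CSV with a header row that includes the ID column `pk1`. The layout of the attached 636_bb2.csv is not shown in this thread, so the sample rows and the function name `duplicate_ids` are invented for illustration. The check only detects repeated IDs; it does not establish that duplicates are what triggers the exception.

```python
import csv
import io
from collections import Counter


def duplicate_ids(csv_text, id_column="pk1"):
    """Return {id: count} for every ID that appears on more than one row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample rows; only the column names come from the queries above.
sample = """pk1,search_url,mimetype
635,https://example.org/a,text/html
636,https://example.org/b,text/html
636,https://example.org/c,application/pdf
"""

print(duplicate_ids(sample))  # {'636': 2}
```

An empty result means that no ID value is repeated in the export, which matches what the reporter describes finding.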
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738287#comment-16738287 ]

Donald Van den Driessche commented on CONNECTORS-1562:
------------------------------------------------------

Thanks! I have put the question to the company that provides the sitemap.
I very much appreciate all your effort!

> Documents unreachable due to hopcount are not considered unreachable on cleanup pass
> ------------------------------------------------------------------------------------
>
>             Key: CONNECTORS-1562
>             URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>         Project: ManifoldCF
>      Issue Type: Bug
>      Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
>     Environment: ManifoldCF 2.11
>                  Elasticsearch 6.3.2
>                  Web input connector
>                  Elastic output connector
>                  Job crawls website input and outputs content to Elastic
>        Reporter: Tim Steenbeke
>        Assignee: Karl Wright
>        Priority: Critical
>          Labels: starter
>         Fix For: ManifoldCF 2.12
>     Attachments: Screenshot from 2018-12-31 11-17-29.png, image-2019-01-09-14-20-50-616.png, manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738231#comment-16738231 ]

Donald Van den Driessche commented on CONNECTORS-1562:
------------------------------------------------------

Karl,

After resolving the issues with creating repository connections through the API, I retested our crawling locally, using a Docker setup that contains a ManifoldCF container and an Elasticsearch container.
I used no bandwidth throttles and a max connection count of 25.
The seedmap consists of a single page, our whitelist: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=nl=true]

I still get the "Stream Closed" I/O exception.
Do you have any more ideas on how to keep the connection open, so that the whole whitelist can be processed?

Screenshot of the Simple Report:

!image-2019-01-09-14-20-50-616.png!

Stacktrace:
{code:java}
ERROR 2019-01-09T13:08:37,876 (Worker thread '22') - Exception tossed: Repeated service interruptions - failure processing document: Stream Closed
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Stream Closed
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) [mcf-pull-agent.jar:?]
Caused by: java.io.IOException: Stream Closed
    at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191]
    at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191]
    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?]
    at org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133) ~[?:?]
ERROR 2019-01-09T13:08:37,883 (Worker thread '7') - Exception tossed: Repeated service interruptions - failure processing document: Stream Closed
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Stream Closed
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) [mcf-pull-agent.jar:?]
Caused by: java.io.IOException: Stream Closed
    at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191]
    at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191]
    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?]
    at org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Donald Van den Driessche updated CONNECTORS-1562:
------------------------------------------------

Attachment: image-2019-01-09-14-20-50-616.png
[jira] [Commented] (CONNECTORS-1557) HTML Tag extractor
[ https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733105#comment-16733105 ]

Donald Van den Driessche commented on CONNECTORS-1557:
------------------------------------------------------

Karl,

The metadata adjuster doesn't do what I needed.
I have attached the code.

> HTML Tag extractor
> ------------------
>
>             Key: CONNECTORS-1557
>             URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
>         Project: ManifoldCF
>      Issue Type: New Feature
>        Reporter: Donald Van den Driessche
>        Assignee: Karl Wright
>        Priority: Major
>     Attachments: html-tag-extraction-connector.zip
>
> I wrote an HTML Tag extractor, based on the HTML Extractor.
> I needed to extract specific HTML tags and transfer them to their own field in my output repository.
>
> Input
> * Englobing tag (CSS selector)
> * Blacklist (CSS selector)
> * Fieldmapping (CSS selector)
> * Strip HTML
>
> Process
> * Retrieve Englobing tag
> * Remove blacklist
> * Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + strip HTML (if requested)
> * Englobing tag minus blacklist: strip HTML (if requested) and return as output (content)
>
> How can I best deliver the source code?
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711461#comment-16711461 ]

Donald Van den Driessche commented on CONNECTORS-1562:
------------------------------------------------------

Hi Karl,

I'm a colleague of Tim; we work together on this project.
Thank you for your time. We are interested to see what the results bring.

> Document removal Elastic
> ------------------------
>
>             Key: CONNECTORS-1562
>             URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>         Project: ManifoldCF
>      Issue Type: Bug
>      Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
>     Environment: ManifoldCF 2.11
>                  Elasticsearch 6.3.2
>                  Web input connector
>                  Elastic output connector
>                  Job crawls website input and outputs content to Elastic
>        Reporter: Tim Steenbeke
>        Assignee: Karl Wright
>        Priority: Critical
>          Labels: starter
>     Attachments: Screenshot from 2018-12-05 09-01-46.png
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.
[jira] [Created] (CONNECTORS-1557) HTML Tag extractor
Donald Van den Driessche created CONNECTORS-1557:
-------------------------------------------------

   Summary: HTML Tag extractor
       Key: CONNECTORS-1557
       URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
   Project: ManifoldCF
Issue Type: New Feature
  Reporter: Donald Van den Driessche

I wrote an HTML Tag extractor, based on the HTML Extractor.
I needed to extract specific HTML tags and transfer them to their own field in my output repository.

Input
* Englobing tag (CSS selector)
* Blacklist (CSS selector)
* Fieldmapping (CSS selector)
* Strip HTML

Process
* Retrieve Englobing tag
* Remove blacklist
* Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + strip HTML (if requested)
* Englobing tag minus blacklist: strip HTML (if requested) and return as output (content)

How can I best deliver the source code?
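The process steps listed in the issue can be sketched roughly as follows. This is not the connector's code, which is not shown in this thread; it is a minimal Python illustration under simplifying assumptions. Plain tag names stand in for CSS selectors, HTML is always stripped, and the names `TagExtractor`, `extract`, `clean`, and the sample page are all invented for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are ignored for nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagExtractor(HTMLParser):
    """Collect text inside one enclosing tag, skipping blacklisted subtrees."""

    def __init__(self, englobing, blacklist, field_map):
        super().__init__(convert_charrefs=True)
        self.englobing = englobing        # tag name of the enclosing element
        self.blacklist = set(blacklist)   # tag names whose subtrees are dropped
        self.field_map = field_map        # tag name -> output field name
        self.stack = []                   # currently open tags
        self.content = []                 # text of englobing tag minus blacklist
        self.fields = {name: [] for name in field_map.values()}
        self.collectors = []              # (stack depth when opened, text parts)

    def _wanted(self):
        inside = self.englobing in self.stack
        return inside and not self.blacklist.intersection(self.stack)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in self.field_map and self._wanted():
            parts = []  # one list per match, so repeated tags yield an array
            self.fields[self.field_map[tag]].append(parts)
            self.collectors.append((len(self.stack), parts))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.stack:
            return
        while self.stack.pop() != tag:    # tolerate unclosed inner tags
            pass
        self.collectors = [c for c in self.collectors if c[0] <= len(self.stack)]

    def handle_data(self, data):
        if self._wanted():
            self.content.append(data)
            for _, parts in self.collectors:
                parts.append(data)


def clean(parts):
    """Join text fragments and collapse runs of whitespace."""
    return " ".join(" ".join(parts).split())


def extract(html, englobing, blacklist, field_map):
    parser = TagExtractor(englobing, blacklist, field_map)
    parser.feed(html)
    parser.close()
    fields = {name: [clean(p) for p in found]
              for name, found in parser.fields.items()}
    return clean(parser.content), fields


page = ("<html><body><nav>menu</nav><main><h1>Title</h1><p>First</p>"
        "<aside>ad</aside><p>Second <b>bold</b></p></main></body></html>")
content, fields = extract(page, "main", ["aside"], {"h1": "title", "p": "paragraphs"})
print(content)  # Title First Second bold
print(fields)   # {'title': ['Title'], 'paragraphs': ['First', 'Second bold']}
```

Each mapped tag produces a list, which mirrors the "arrays if multiple finds" behaviour described in the issue. A real implementation would need a proper CSS selector engine and an option to keep the HTML markup instead of stripping it.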
[jira] [Created] (CONNECTORS-1550) HTML Tag mapping
Donald Van den Driessche created CONNECTORS-1550:
-------------------------------------------------

         Summary: HTML Tag mapping
             Key: CONNECTORS-1550
             URL: https://issues.apache.org/jira/browse/CONNECTORS-1550
         Project: ManifoldCF
      Issue Type: Wish
      Components: Elastic Search connector, Tika extractor, Web connector
Affects Versions: ManifoldCF 2.10
        Reporter: Donald Van den Driessche

I’ll be crawling a website with the standard Web connector. I want to extract just certain html tags like , and .
I’ve set up an HTML extractor transformation connector and the internal Tika transformation connector, but I can’t find any place to map these tags to the output.

Do I have to write my own transformation connector to extract the content of these tags? Or is there a built-in solution?