[jira] [Comment Edited] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-04-10 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814330#comment-16814330
 ] 

Donald Van den Driessche edited comment on CONNECTORS-1593 at 4/10/19 11:00 AM:


Hi [~kwri...@metacarta.com]
 The connector is a custom connector. It uses CloseableHttpClient with basic 
authentication.

When we created a project outside Manifold to download the files, we had the 
same issue: about 8% of the downloaded docs had different hashes.
Only when we waited about 3 seconds between the two downloads of the same 
file did we get 0% different hashes.
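
A minimal sketch of the hash comparison described above, assuming both 
downloads are already held in memory as byte arrays and that SHA-256 is the 
hash (the comment does not say which hash was used). The class and method 
names are hypothetical, and the HTTP fetch itself (CloseableHttpClient with 
basic authentication) is deliberately left out:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of ManifoldCF or the custom connector.
class DownloadHashCheck {

    /** Returns the SHA-256 digest of the given bytes as lowercase hex. */
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /** True when two downloads of the same document have identical hashes. */
    static boolean sameContent(byte[] first, byte[] second) {
        return sha256Hex(first).equals(sha256Hex(second));
    }

    public static void main(String[] args) {
        // Stand-ins for the bodies returned by two downloads of one file.
        byte[] firstDownload = "example document body".getBytes(StandardCharsets.UTF_8);
        byte[] secondDownload = "example document bodx".getBytes(StandardCharsets.UTF_8);
        System.out.println("identical: " + sameContent(firstDownload, firstDownload));
        System.out.println("identical: " + sameContent(firstDownload, secondDownload));
    }
}
```

Counting how often sameContent returns false over a batch of documents would 
give the mismatch percentage mentioned above.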


was (Author: donaldvdd):
Hi [~kwri...@metacarta.com]
The connector is a custom connector. It uses CloseableHttpClient with basic 
authentication.

> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> 

[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-04-10 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814330#comment-16814330
 ] 

Donald Van den Driessche commented on CONNECTORS-1593:
--

Hi [~kwri...@metacarta.com]
The connector is a custom connector. It uses CloseableHttpClient with basic 
authentication.

> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> 

[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814326#comment-16814326
 ] 

Karl Wright commented on CONNECTORS-1592:
-

[~goovaertsr] Yes, if you have no intention of doing hopcount filtering ever, 
then disable hop count filtering forever.  It's far easier on the database.

Having said that, I'm pretty sure you have other problems too.


> Found long running query in manifold scheduled job
> --
>
> Key: CONNECTORS-1592
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1592
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.12
>Reporter: Subasini Rath
>Priority: Major
> Attachments: LongRunningWithPlan_thread39.txt, 
> SELECT_blocked_queries.txt, postgresql.conf, properties.xml
>
>
> Hi Karl,
>    I am also facing the above-mentioned issue (similar to Connector-880).
> I am using the ManifoldCF 2.12 binary version, with the Solr output connector 
> and a Web repository connection. ManifoldCF is using the default 
> configuration throughout.
> When I run the jobs manually, they run fine. The same jobs have been 
> scheduled to run every day.
> I am getting the exceptions below, and the job hangs or goes into a waiting 
> state.
> Could you please help me resolve this?
> I am getting the below error -
> Scenario-1
> WARN 2019-03-08T23:58:20,338 (qtp550147359-413) - Found a long-running query 
> (2706114 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,337 (Document delete stuffer thread) - Found a 
> long-running query (2737370 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
>  WARN 2019-03-08T23:58:20,339 (Job reset thread) - Found a long-running query 
> (2770133 ms): [SELECT id FROM jobs WHERE status IN (?,?)]
>  WARN 2019-03-08T23:58:20,386 (Document delete stuffer thread) - Parameter 0: 
> 'e'
>  WARN 2019-03-08T23:58:20,337 (Set priority thread) - Found a long-running 
> query (2732379 ms): [SELECT id,dochash,docid,jobid FROM jobqueue WHERE 
> needpriority=? LIMIT 1000]
>  WARN 2019-03-08T23:58:20,386 (Set priority thread) - Parameter 0: 'T'
>  WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 0: 'I'
>  WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 1: 'i'
>  WARN 2019-03-08T23:58:20,372 (Seeding thread) - Parameter 2: '1552047176062'
>  WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Found a 
> long-running query (2737524 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]
>  WARN 2019-03-08T23:58:20,474 (Document cleanup stuffer thread) - Parameter 
> 0: 'S'
>  WARN 2019-03-08T23:58:20,474 (Finisher thread) - Found a long-running query 
> (2752034 ms): [SELECT id FROM jobs WHERE status IN (?,?,?) FOR UPDATE]
>  WARN 2019-03-08T23:58:20,474 (Finisher thread) - Parameter 0: 'A'
>  WARN 2019-03-08T23:58:20,475 (Finisher thread) - Parameter 1: 'W'
>  WARN 2019-03-08T23:58:20,475 (Finisher thread) - Parameter 2: 'R'
>  WARN 2019-03-08T23:58:20,475 (Delete startup thread) - Found a long-running 
> query (2752036 ms): [SELECT id FROM jobs WHERE status=? FOR UPDATE]
>  WARN 2019-03-08T23:58:20,475 (Delete startup thread) - Parameter 0: 'E'
>  WARN 2019-03-08T23:58:20,483 (qtp550147359-4339) - Found a long-running 
> query (2496641 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: 
> isDistinctSelect=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isGrouped=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isAggregated=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: columns=[ COLUMN: 
> PUBLIC.JOBS.ID not nullable
>  WARN 2019-03-08T23:58:20,492 (qtp550147359-4346) - Found a long-running 
> query (2435908 ms): [SELECT 
> t0.id,t0.description,t0.status,t0.starttime,t0.endtime,t0.errortext FROM jobs 
> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: 
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: ]
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: [range variable 1
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: join type=INNER
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: table=SYSTEM_SUBQUERY
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: cardinality=0
>  WARN 2019-03-08T23:58:20,499 (Finisher thread) - Plan: access=FULL SCAN
>  WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: join condition = 
> [index=SYS_IDX_13329
>  WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: ]
>  WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: ][range variable 2
>  WARN 2019-03-08T23:58:20,500 (Finisher thread) - Plan: join 

[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-10 Thread roel goovaerts (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814313#comment-16814313
 ] 

roel goovaerts commented on CONNECTORS-1592:


The hop count mode was left at this setting because of our requirements. 
But I think I follow you now; I had interpreted it as disabling the whole 
'tab'.
If I understand correctly, by "disabling hop count filtering" you mean 
setting it to "keep unreachable documents, forever"?


[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-04-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814310#comment-16814310
 ] 

Karl Wright commented on CONNECTORS-1593:
-

Hi [~DonaldVdD], what connector is being used to download the files?  What is 
serving them?  Having the data get corrupted is very very odd; I can't imagine 
having code that does that accidentally.



[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-10 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814306#comment-16814306
 ] 

Karl Wright commented on CONNECTORS-1592:
-

{quote}
the largest was 223673ms, the minimum time spent was 172416ms, the others are 
distributed between these extrema
{quote}

I saw a longer-running query than that in the log you posted, some 200ms.  
But the plan was fine.  Once again, locking would have been the only 
explanation.  But if you are seeing no queries running in less than 172416ms, 
then I think you may well have found your problem.  The lion's share of 
Postgresql queries should be executing in well under a second. Times around 
20ms would be typical.  Something is very wrong with your Postgresql 
configuration or installation given that.

{quote}
Just one more question, considering what you said of the hopcount filtering; In 
the "Hop Filters"-tab we have nothing of configuration except for "hop count 
mode" is set to "delete unreachable", which i had interpreted as being the 
default. Is this correct that it is the default, and is there something else we 
could do to disable hop count filtering?
{quote}

That is the default; it's also the most inefficient.  From the manual:

{quote}
On this same tab, you can tell the Framework what to do should there be changes 
in the distance from the root to a document. The choice "Delete unreachable 
documents" requires the Framework to recalculate the distance to every 
potentially affected document whenever a change takes place. This may require 
expensive bookkeeping, however, so you also have the option of ignoring such 
changes. There are two varieties of this latter option - you can ignore the 
changes for now, with the option of turning back on the aggressive bookkeeping 
at a later time, or you can decide not to ever allow changes to propagate, in 
which case the Framework will discard the necessary bookkeeping information 
permanently. This last option is the most efficient.
{quote}
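
To illustrate why "Delete unreachable documents" is the expensive choice, 
here is a minimal sketch of hop-count bookkeeping as a breadth-first search 
from a root document. This is an illustration under stated assumptions and 
not ManifoldCF's actual implementation; the class name, method name and 
link map are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; ManifoldCF keeps this state in the database.
class HopCountSketch {

    /** Breadth-first search: distance in hops from root to each reachable document. */
    static Map<String, Integer> hopCounts(Map<String, List<String>> links, String root) {
        Map<String, Integer> distance = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String doc = queue.poll();
            for (String target : links.getOrDefault(doc, Collections.<String>emptyList())) {
                if (!distance.containsKey(target)) {
                    distance.put(target, distance.get(doc) + 1);
                    queue.add(target);
                }
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("root", Arrays.asList("a", "c"));
        links.put("a", Arrays.asList("b"));
        System.out.println("before: " + hopCounts(links, "root"));

        // One link disappears: every potentially affected distance has to be
        // recomputed, and "b" is no longer reachable from the root at all.
        links.put("a", Collections.<String>emptyList());
        System.out.println("after:  " + hopCounts(links, "root"));
    }
}
```

In the sketch, a single changed link forces the distances to be worked out 
again; the modes that ignore such changes skip that recalculation.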



[jira] [Commented] (CONNECTORS-1592) Found long running query in manifold scheduled job

2019-04-10 Thread roel goovaerts (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814287#comment-16814287
 ] 

roel goovaerts commented on CONNECTORS-1592:


Hi Karl,
 
I have not yet seen any "very long-running" queries. Upon looking at the logs 
(there were a bunch of long-running queries logged about an hour ago), there 
is no 'extreme' maximum of time spent on a query: the largest was 223673ms, 
the minimum time spent was 172416ms, the others are distributed between these 
extrema. From this I suppose this is not really the issue.
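
The minimum and maximum above can be pulled out of the log mechanically. 
A minimal sketch, assuming each warning sits on a single line in the actual 
log file and follows the "Found a long-running query (N ms)" wording quoted 
in this issue; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for summarizing long-running query warnings.
class LongRunningQueryStats {

    // Captures the duration in milliseconds from a warning line.
    private static final Pattern DURATION =
        Pattern.compile("Found a long-running query \\((\\d+) ms\\)");

    /** Collects count, min and max over all matching log lines. */
    static LongSummaryStatistics summarize(List<String> logLines) {
        LongSummaryStatistics stats = new LongSummaryStatistics();
        for (String line : logLines) {
            Matcher matcher = DURATION.matcher(line);
            if (matcher.find()) {
                stats.accept(Long.parseLong(matcher.group(1)));
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        // Sample lines taken from the issue description, unwrapped.
        List<String> lines = Arrays.asList(
            "WARN 2019-03-08T23:58:20,337 (Document delete stuffer thread) - Found a long-running query (2737370 ms): [SELECT id FROM jobs WHERE status=? LIMIT 1]",
            "WARN 2019-03-08T23:58:20,386 (Job reset thread) - Parameter 0: 'I'",
            "WARN 2019-03-08T23:58:20,339 (Job reset thread) - Found a long-running query (2770133 ms): [SELECT id FROM jobs WHERE status IN (?,?)]");
        LongSummaryStatistics stats = summarize(lines);
        System.out.println("queries: " + stats.getCount());
        System.out.println("min ms:  " + stats.getMin());
        System.out.println("max ms:  " + stats.getMax());
    }
}
```

Lines that do not match the pattern, such as the "Parameter" lines, are 
simply skipped.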
 
I of course understand that it's not that evident to commit to a conference 
call, thanks for considering.
 
Just one more question, considering what you said of the hopcount filtering; In 
the "Hop Filters"-tab we have nothing of configuration except for "hop count 
mode" is set to "delete unreachable", which i had interpreted as being the 
default. Is this correct that it is the default, and is there something else we 
could do to disable hop count filtering?
 
We will continue to look for other possible external influences.
There is now a possibility that the Postgres settings were automatically 
reverted to the defaults (which would mean autovacuum is on), so we are 
looking into this now.
Thanks again for the info and the quick replies.
 
Regards,
Roel

> t0 ORDER BY description ASC]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: 
> isDistinctSelect=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isGrouped=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: isAggregated=[false]
>  WARN 2019-03-08T23:58:20,492 (Finisher thread) - Plan: columns=[ COLUMN: 
> PUBLIC.JOBS.ID not nullable
>  WARN 2019-03-08T23:58:20,492 (qtp550147359-4346) - Found a 

[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-04-10 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814179#comment-16814179
 ] 

Donald Van den Driessche commented on CONNECTORS-1593:
--

[~kwri...@metacarta.com]

After further investigation we also found that only 2-3 bytes had been 
replaced by 0x00 bytes, and this was the case in every file that caused an 
issue.

Because the issue didn't occur in every file, and occurred in different files 
on each run, we narrowed it down to possible download issues, also because the 
input is not manipulated before being passed to the Tika parser.
The current workaround is to download the file twice and compare hashes of the 
two copies; if they differ, we retry the double download. When the hashes are 
the same, we continue with one of the downloaded files. This gave us no more 
PDFParser issues.
The speed of the process wasn't really impacted either.
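
As a rough illustration of the workaround, here is a minimal Java sketch. The 
class name DoubleDownload, the method fetchVerified, the Supplier-based 
download callback, the attempt limit and the choice of SHA-256 are all 
assumptions made for illustration; this is not the connector's actual code.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Supplier;

// Hypothetical helper: the real connector code is not shown in this thread.
public class DoubleDownload {

    // SHA-256 digest of the downloaded bytes (hash algorithm is an assumption).
    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Downloads the content twice per attempt and returns it only when both
    // copies hash identically; otherwise the double download is repeated.
    static byte[] fetchVerified(Supplier<byte[]> download, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            byte[] first = download.get();
            byte[] second = download.get();
            if (MessageDigest.isEqual(sha256(first), sha256(second))) {
                return first;
            }
        }
        throw new IllegalStateException(
            "Downloads still differ after " + maxAttempts + " attempts");
    }
}
```

The download callback would wrap whatever HTTP call the connector makes. Note 
that two matching hashes only show that both copies are identical, not that 
they are uncorrupted.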

> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.12
>Reporter: Donald Van den Driessche
>Assignee: Karl Wright
>Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
>