[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-22 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798910#comment-16798910
 ] 

Karl Wright commented on CONNECTORS-1593:
-

In ManifoldCF we rigorously adhere to what is known as the "bounded memory 
consumption" philosophy: connectors must be written so that they are not 
sensitive to the size of the data they are indexing.  Streams are used, and the 
data never "hits memory".  But if you aren't careful, your custom connector 
might well load entire documents into memory, and then all it takes is two 
large documents at the same time and you are hosed.  Can you check your custom 
connector for that issue?

If there is a problem there, you could work around it by limiting the number of 
custom connector connections to 1.  If that works reliably, then you know where 
the issue is.
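
As a rough illustration of the bounded-memory idea above, a connector can pass 
a document along through a fixed-size buffer instead of reading it into one 
large array.  This is a minimal sketch, not ManifoldCF code; the class name 
BoundedCopy and the 8 KB buffer size are assumptions chosen for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of ManifoldCF.
final class BoundedCopy {
    // Assumed buffer size; heap use per copy stays at this size
    // no matter how large the document is.
    private static final int BUFFER_SIZE = 8192;

    private BoundedCopy() {}

    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int read;
        // read() returns -1 at end of stream
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

By contrast, collecting the whole document first (for example in a 
ByteArrayOutputStream) makes heap use grow with document size, which is the 
pattern to look for in a custom connector.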


> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 

RE: Where and how is ManifoldCF used in production?

2019-03-22 Thread Ian Zapczynski
We use ManifoldCF in production, yes, but currently for internal use only – not 
for external clients.  We research many other firms and meet with their 
representatives often, so our users who attend these meetings typically keep 
their meeting notes in Word docs.  These are currently shared and stored on a 
Windows file share.  Our users wanted far better search over those docs 
than what Windows can provide, so we use ManifoldCF to connect the Windows 
share to SOLR.  The results impressed our users – and even the developer – 
considerably.

-Ian

From: James Thomas <james.tho...@linguamatics.com>
Sent: Thursday, March 21, 2019 10:52 AM
To: dev@manifoldcf.apache.org; 
u...@manifoldcf.apache.org
Subject: Where and how is ManifoldCF used in production?


Hi all,

We've been experimenting with and learning about ManifoldCF for a few months 
now and were wondering who is doing what with it in production.

Would any of you be able to share a little about the applications you're 
working on, or know about, please?

Any replies would be much appreciated.

Cheers,
James


--

James Thomas | Head of Testing | Linguamatics, an IQVIA company

324 Cambridge Science Park, Milton Road, CB4 0WG Cambridge, UK

Telephone number: +44 (0)1223 651910

web: www.linguamatics.com

---

Winner of the Frost & Sullivan Market Leadership Award 2016, 2017 and 2018

---

Linguamatics will process your personal data fairly and lawfully in accordance 
with professional standards and the Data Protection Act 2018, General Data 
Protection Regulation (EU) 2016/679 (as applicable) and any other applicable 
laws relating to the protection of personal data and the privacy of 
individuals. Further information is available in our privacy policy: 
https://www.linguamatics.com/privacy.

This email has been sent for the sole use of the intended recipient(s). The 
contents of this email and any attachments are very likely to contain 
confidential or proprietary information. If you received this email in error 
please advise the sender by reply email and immediately delete the message and 
any attachments without copying or disclosing the contents.

Linguamatics Limited is a company incorporated in England and Wales with 
registered number: 4248841


Re: Where and how is ManifoldCF used in production?

2019-03-22 Thread Steph van Schalkwyk
I've been using MCF in PROD for a couple of years now. Mostly Elasticsearch
(5.x onwards) and SOLR.
From a couple of 100k docs to millions of JDBC or PDF/HTML etc.
Very solid. Make sure you apply the DB tuning parameters.
Steph



*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896   st...@remcam.net   http://remcam.net
 Skype: svanschalkwyk




On Fri, Mar 22, 2019 at 9:45 AM Ian Zapczynski 
wrote:

> We use ManifoldCF in production, yes, but currently for internal use only –
> not for external clients.  We research many other firms and meet with
> their representatives often, so our users who attend these meetings
> typically keep their meeting notes in Word docs.  These are currently shared
> and stored on a Windows file share.  Our users wanted far better search over
> those docs than what Windows can provide, so we use ManifoldCF to
> connect the Windows share to SOLR.  The results impressed our users – and
> even the developer – considerably.
>
> -Ian
>
> From: James Thomas <james.tho...@linguamatics.com>
> Sent: Thursday, March 21, 2019 10:52 AM
> To: dev@manifoldcf.apache.org;
> u...@manifoldcf.apache.org
> Subject: Where and how is ManifoldCF used in production?
>
>
> Hi all,
>
> We've been experimenting with and learning about ManifoldCF for a few
> months now and were wondering who is doing what with it in production.
>
> Would any of you be able to share a little about the applications you're
> working on, or know about, please?
>
> Any replies would be much appreciated.
>
> Cheers,
> James


[jira] [Updated] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-22 Thread Donald Van den Driessche (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Donald Van den Driessche updated CONNECTORS-1593:
-
Attachment: image-2019-03-22-08-57-53-887.png

> Memory issue on 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---
>
> Key: CONNECTORS-1593
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: image-2019-03-22-08-57-53-887.png
>
>
> I have created an Issue with fontbox too: 
>  
> When using the internal Tika extractor in a Manifold Job on certain occasions 
> I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of 
> memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: 
> Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at 
> 

[jira] [Commented] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

2019-03-22 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798815#comment-16798815
 ] 

Donald Van den Driessche commented on CONNECTORS-1593:
--

[~kwri...@metacarta.com]
Yes, all connectors for that job pipeline are set to 2.

The pipeline consists of 4 stages:
 * JDBC connector: gets the URLs and some metadata from an MSSQL repository
 * Webresource fetch connector (custom): based on the passed URL, this 
makes a connection to an intranet site (basic header authentication) and 
retrieves a binary file
 * Tika extractor: extracts the content of the binary file
 * Elasticsearch output connector: sends the data to Elasticsearch

 

I've reexamined the code for the webresource fetch connector, but I don't think 
it contains any memory leaks.
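
For reference when reviewing a fetch stage like the one described above: the 
thing to look for is whether the response body is handed on as a stream or 
collected into a byte array first.  The sketch below is hypothetical and is not 
the connector's actual code; the class and method names are made up, and it 
uses the JDK's HttpURLConnection, which may differ from the HTTP client the 
real connector uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch of a streaming fetch with HTTP Basic authentication.
final class IntranetFetch {
    private IntranetFetch() {}

    /** Builds the value of the HTTP Basic "Authorization" header. */
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Opens the resource and returns its body as a stream.
     * Nothing is buffered here beyond what the JDK does internally;
     * the caller reads from the stream and must close it.
     */
    static InputStream openStream(String url, String user, String password)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("Authorization", basicAuthHeader(user, password));
        return conn.getInputStream();
    }
}
```

If the connector instead reads the whole response into memory before passing 
it to the next stage, two large files fetched at the same time could be enough 
to exhaust the heap.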

 

!image-2019-03-22-08-57-53-887.png!
