[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949467#comment-16949467 ]

Donald Van den Driessche commented on CONNECTORS-1625:
------------------------------------------------------

After another test, we concluded that the file is processed correctly when the Boilerpipe parameter is set to "-- No extraction selected --" instead of "General purpose extraction".
Now I have to estimate the impact of the different Boilerpipe parameter settings.

> When processing a specific PDF Manifold goes out of memory
> ----------------------------------------------------------
>
>                 Key: CONNECTORS-1625
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1625
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: abd-serotec-antibodies-uk.pdf
>
>
> When processing the attached file with ManifoldCF 2.12, we keep getting an out of memory error.
> When just parsing it through Tika 1.18, no issues are found.
> Can anyone look into it?
> Thanks in advance!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
Re: [jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
If you call Tika yourself, and you aren't using streams, then that would be an obvious reason why your memory problems occur in that environment.

Karl

On Fri, Oct 11, 2019 at 9:26 AM Donald Van den Driessche (Jira) <j...@apache.org> wrote:

> [ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949443#comment-16949443 ]
>
> Donald Van den Driessche commented on CONNECTORS-1625:
>
> After running the same process (with the same config) locally, we had no issues.
> So, it might be something with the streams.
>
> We've written a custom connector to fetch the files. It might use the wrong way to provide the file to the Tika parser.
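To illustrate the point about streams, here is a minimal sketch using only the Java standard library. It is not ManifoldCF or Tika code: the class `StreamingSketch` and the method `consumeInChunks` are hypothetical names, and the in-memory sample stands in for a file or network stream that a custom connector would open. The idea is that the consumer holds at most one fixed-size buffer at a time instead of reading the whole document into a byte array before handing it on.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of ManifoldCF or Tika.
final class StreamingSketch {

    // Reads the stream in chunks of bufferSize bytes and returns the total
    // number of bytes seen. Memory use is bounded by bufferSize, not by the
    // size of the document.
    static long consumeInChunks(InputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        try {
            while ((read = in.read(buffer)) != -1) {
                // A real connector would pass the stream itself (or each
                // chunk) on to the parser here rather than just counting.
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for a real document stream (assumption: 1,000,000 bytes).
        InputStream in = new ByteArrayInputStream(new byte[1_000_000]);
        System.out.println(consumeInChunks(in, 8192)); // prints 1000000
    }
}
```

Whether the custom connector in this issue actually buffers the whole file is not established in the thread; the sketch only shows the pattern being suggested.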
[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949443#comment-16949443 ]

Donald Van den Driessche commented on CONNECTORS-1625:
------------------------------------------------------

After running the same process (with the same config) locally, we had no issues.
So, it might be something with the streams.

We've written a custom connector to fetch the files. It might use the wrong way to provide the file to the Tika parser.
[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944414#comment-16944414 ]

Donald Van den Driessche commented on CONNECTORS-1625:
------------------------------------------------------

We are running this PDF as the one and only document. It's ManifoldCF 2.12.
We tried to parse it through Tika locally with Tika 1.18 and 1.22 and both succeeded.

We've set the heap space to 3G and 5G and still see the same issues.

I've now read somewhere that disk space might be used. But since the file is only 21MB, I don't see how much disk space might be used.
[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944386#comment-16944386 ]

Karl Wright commented on CONNECTORS-1625:
-----------------------------------------

Also, FWIW, the default Java memory sizes on the example are not guaranteed to allow processing of N simultaneous Tika extractions (one per worker thread) of the sort that require more memory. Memory sizes allocated to the JVM are settable in the start-options files, and the first thing you want to do is increase those values to see if the problem goes away for you.
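After changing the memory settings, it can help to confirm that the new value actually reached the JVM and to get a rough feel for how much heap each worker thread could claim if all of them were extracting at once. The sketch below uses only the standard `Runtime.maxMemory()` call; the class `HeapReport`, its methods, and the default of 10 worker threads are hypothetical placeholders, not ManifoldCF settings. The even split per worker is a crude upper bound, not a model of how Tika allocates memory.

```java
// Hypothetical diagnostic for illustration only; not part of ManifoldCF.
final class HeapReport {

    // Maximum heap the running JVM will try to use, in MiB
    // (reflects -Xmx when that flag is set).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Even share of the heap per worker thread, in MiB (integer division).
    static long perWorkerMiB(long maxHeapMiB, int workerThreads) {
        if (workerThreads <= 0) {
            throw new IllegalArgumentException("workerThreads must be positive");
        }
        return maxHeapMiB / workerThreads;
    }

    public static void main(String[] args) {
        // Placeholder worker count; pass the real value as the first argument.
        int workers = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        long heap = maxHeapMiB();
        System.out.println("Max heap (MiB): " + heap);
        System.out.println("Per worker (MiB): " + perWorkerMiB(heap, workers));
    }
}
```

For example, a 3072 MiB heap shared by 30 workers leaves about 102 MiB each, which may be too little for a document that expands heavily during extraction.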