[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory

2019-10-11 Thread Donald Van den Driessche (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949467#comment-16949467
 ] 

Donald Van den Driessche commented on CONNECTORS-1625:
--

After another test, we concluded that the file is processed 
correctly when "-- No extraction selected --" is chosen instead of "General 
purpose extraction" for the Boilerpipe parameter.

Now I have to estimate the impact of changing the Boilerpipe parameter.

> When processing a specific PDF Manifold goes out of memory
> --
>
> Key: CONNECTORS-1625
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1625
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: abd-serotec-antibodies-uk.pdf
>
>
> When processing the attached file with ManifoldCF 2.12, we keep getting an 
> out-of-memory error.
> When just parsing it through Tika 1.18, no issues are found.
> Can anyone look into it?
> Thanks in advance!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory

2019-10-11 Thread Karl Wright
If you call Tika yourself, and you aren't using streams, then that would be
an obvious reason why your memory problems occur in that environment.
Karl
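
For illustration, a minimal sketch of the distinction above, using only the
Java standard library: the document is passed along as an InputStream and read
through a small fixed-size buffer instead of being loaded into one large byte
array first. The class and method names are hypothetical and are not part of
ManifoldCF or Tika; whether the parser itself buffers more internally is a
separate question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not a ManifoldCF or Tika class.
final class StreamingCopy {

    // Copies 'in' to 'out' through a fixed-size buffer, so this method never
    // holds more than bufferSize bytes of the document at once.
    static long copy(InputStream in, OutputStream out, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for a fetched document; a real connector would hand on
        // the repository's own stream instead of an in-memory array.
        InputStream document = new ByteArrayInputStream(new byte[1_000_000]);

        // Sink that discards its input, standing in for the downstream consumer.
        OutputStream discard = new OutputStream() {
            @Override public void write(int b) { }
            @Override public void write(byte[] b, int off, int len) { }
        };

        long copied = copy(document, discard, 8192);
        System.out.println("copied " + copied + " bytes through an 8192-byte buffer");
    }
}
```

The anti-pattern this avoids is reading the entire file into a byte[] (or a
String) in the connector before handing it to the extractor.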


On Fri, Oct 11, 2019 at 9:26 AM Donald Van den Driessche (Jira) <
j...@apache.org> wrote:

>
> [
> https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949443#comment-16949443
> ]
>
> Donald Van den Driessche commented on CONNECTORS-1625:
> --
>
> After running the same process (with the same config) locally, we had no
> issues.
> So, it might be something with the streams.
>
>
>
> We've written a custom connector to fetch the files. It might use the
> wrong way to provide the file to the Tika parser.
>


[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory

2019-10-11 Thread Donald Van den Driessche (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949443#comment-16949443
 ] 

Donald Van den Driessche commented on CONNECTORS-1625:
--

After running the same process (with the same config) locally, we had no issues.
So, it might be something with the streams.



We've written a custom connector to fetch the files. It might use the wrong way 
to provide the file to the Tika parser.



[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory

2019-10-04 Thread Donald Van den Driessche (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944414#comment-16944414
 ] 

Donald Van den Driessche commented on CONNECTORS-1625:
--

We are running this PDF as the one and only document.

It's ManifoldCF 2.12. We tried to parse the file locally with Tika 1.18 
and 1.22, and both succeeded.

We've set the heap space to 3G and 5G and still get the same issue.

I've now read somewhere that disk space might be used. But since the file is 
only 21MB, I don't see how much disk space could be needed.

 



[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory

2019-10-04 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944386#comment-16944386
 ] 

Karl Wright commented on CONNECTORS-1625:
-

Also, FWIW, the default Java memory sizes in the example are not guaranteed to 
allow processing of N simultaneous Tika extractions (one per worker thread) of 
the sort that require more memory.  Memory sizes allocated to the JVM are 
settable in the start-options files, and the first thing you want to do is 
increase those values to see if the problem goes away for you.
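
As a rough illustration of the sizing argument above, the following sketch
multiplies the number of worker threads by an assumed peak memory per
extraction and compares the result with the heap the running JVM reports. The
class name, the thread count, and the per-extraction figure are all
hypothetical values chosen for the example, not ManifoldCF defaults.

```java
// Hypothetical helper for illustration only; not part of ManifoldCF.
final class HeapBudget {

    static final long MIB = 1024L * 1024L;

    // Rough lower bound: every worker thread may run one extraction at once.
    static long requiredBytes(int workerThreads, long peakBytesPerExtraction) {
        return Math.multiplyExact((long) workerThreads, peakBytesPerExtraction);
    }

    // True if the estimated requirement fits within the given heap size.
    static boolean fits(long maxHeapBytes, int workerThreads, long peakBytesPerExtraction) {
        return requiredBytes(workerThreads, peakBytesPerExtraction) <= maxHeapBytes;
    }

    public static void main(String[] args) {
        // Upper limit the JVM will try to use; reflects -Xmx when it is set.
        long maxHeap = Runtime.getRuntime().maxMemory();

        int workerThreads = 30;              // assumed value for illustration
        long peakPerExtraction = 256 * MIB;  // assumed value for illustration

        System.out.println("max heap (MiB): " + maxHeap / MIB);
        System.out.println("estimated need (MiB): "
                + requiredBytes(workerThreads, peakPerExtraction) / MIB);
        System.out.println("fits: " + fits(maxHeap, workerThreads, peakPerExtraction));
    }
}
```

Running this with the same JVM options as the agent process is one way to
check that a changed heap setting actually took effect.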

