[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944498#comment-16944498 ]

Karl Wright commented on CONNECTORS-1625:
-----------------------------------------

[~DonaldVdD], MCF 2.12 uses Tika 1.19.1. There is no magic in ManifoldCF; it
simply calls Tika's parse API. If that call is blowing up, then it is Tika
that's blowing up.

MCF uses disk storage in some cases for large documents, for example when
there is more than one output in a pipeline. I would imagine that is not the
problem here. Otherwise it streams documents and does not hold them in memory
at all, unless a badly behaved connector is involved. The only connector we
ship with this problem is the Solr Connector, which requires the entire
document to fit into memory if you are not using Solr Cell. That is why we
insist that you set a document size limit when you operate the Solr Connector
in this mode.
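
To illustrate why a size limit matters when a document has to be held
entirely in memory, here is a minimal sketch. It is not the Solr Connector's
or ManifoldCF's actual code; the class BoundedBuffer, the method readFully,
the parameter maxBytes and the exception DocumentTooLargeException are all
hypothetical names made up for this example. It uses only the Java standard
library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of ManifoldCF: buffers a document in memory
// but refuses to grow past a caller-supplied limit.
final class BoundedBuffer {

    // Hypothetical exception type signalling that the limit was exceeded.
    static class DocumentTooLargeException extends IOException {
        DocumentTooLargeException(String message) {
            super(message);
        }
    }

    // Reads the whole stream into a byte array, failing as soon as more than
    // maxBytes have been read, so heap usage stays bounded by roughly
    // maxBytes plus one chunk.
    static byte[] readFully(InputStream in, long maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new DocumentTooLargeException(
                    "document exceeds " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

Without such a check, a single very large document is enough to exhaust the
heap, whereas a streaming path never needs more than one chunk at a time.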

I do recall that Tika 1.19.1 had a specific memory-usage problem with some
kinds of documents; it would probably be worthwhile to try the current
release and see whether it shows the same behavior.

> When processing a specific PDF Manifold goes out of memory
> ----------------------------------------------------------
>
> Key: CONNECTORS-1625
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1625
> Project: ManifoldCF
> Issue Type: Bug
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.12
> Reporter: Donald Van den Driessche
> Assignee: Karl Wright
> Priority: Major
> Attachments: abd-serotec-antibodies-uk.pdf
>
>
> When processing the attached file with ManifoldCF 2.12, we keep getting an
> out-of-memory error.
> When just parsing it through Tika 1.18, no issues are found.
> Can anyone look into it?
> Thanks in advance!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)