Hi, do you mean a content limiter length of 1000000?

I assume you are using the internal Tika transformer?  Are you combining
this with a Solr output connection that is not using the extract handler?

By "manifold crashes" I assume you actually mean it runs out of memory.
The "long running query" concern is a red herring because that does not
cause a crash under any circumstances.

Running out of memory is quite likely if your setup is as I described
above, because when you do not use the Solr extract handler, the entire
content of every document must be loaded into memory.  That is why we
require you to fill in a field on that kind of output connection that
limits the number of bytes.
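
As a rough back-of-the-envelope illustration (the thread count here is
an assumption; check your properties.xml, where
org.apache.manifoldcf.crawler.threads defaults to 30 if I remember
correctly): the worker thread count bounds how many documents are in
flight at once, so with a 10 MB content limit you get roughly

    30 worker threads x 10 MB per document = ~300 MB of raw bytes

resident in the crawler JVM at any one moment, and Tika's in-memory
representation of a parsed document is typically several times the size
of the raw file.  Handing the raw bytes to Solr's extract handler
instead moves all of that parsing out of the ManifoldCF process.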

Karl


On Tue, Feb 16, 2021 at 8:45 AM ritika jain <ritikajain5...@gmail.com>
wrote:

>
> Hi users
>
>
> I am using the manifoldcf 2.14 Fileshare connector to crawl files from an
> smb server which has millions, if not billions, of records to process and
> crawl.
>
> Total system memory is 64 GB, of which 32 GB is allocated to manifold in
> its start-options file.
>
> We have some larger files to crawl, around 30 MB or more than that.
>
> When the size mentioned in the content limiter tab is 100000, that is 1 MB,
> the job works fine, but when it is changed to 10000000, that is 10 MB,
> manifold crashes, with some logs about a long running query.
>
> How can we tune the job specification to process large documents as
> well?
>
> Do I need to increase or decrease the number of connections, or the
> worker thread count, or something else?
>
> Can anybody help me with this, so that larger files, at least up to 10 MB,
> can be crawled too?
>
> Thanks
>
> Ritika
>
