Hi Karl,

I am using Elasticsearch as the output connector, and yes, the internal
Tika extractor; I am not using a Solr output connection.

Also, the Elasticsearch server is hosted on a different machine with a
large memory allocation.
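
For reference, the Elasticsearch heap on that machine is set through the
JVM options file that ships with Elasticsearch (config/jvm.options); a
minimal sketch, with illustrative values rather than our exact ones:

    -Xms16g
    -Xmx16g

Both values are kept equal so the heap is not resized at runtime.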

On Tue, Feb 16, 2021 at 7:29 PM Karl Wright <daddy...@gmail.com> wrote:

> Hi, do you mean content limiter length of 1000000?
>
> I assume you are using the internal Tika transformer?  Are you combining
> this with a Solr output connection that is not using the extract handler?
>
> By "manifold crashes" I assume you actually mean it runs out of memory.
> The "long running query" concern is a red herring because that does not
> cause a crash under any circumstances.
>
> This is quite likely if I have described your setup correctly, because if
> you do not use the Solr extract handler, the entire content of every
> document must be loaded into memory.  That is why we require you to fill
> in a Solr field on that kind of output connection that limits the number
> of bytes.
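>
> To put rough numbers on it (illustrative, assuming the default worker
> thread count of 30):
>
>     30 threads x 10,000,000 bytes/document = ~300 MB of raw content
>     held in heap at once
>
> and that is before counting the structures Tika builds while parsing,
> which can be several times the raw document size.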
>
> Karl
>
>
> On Tue, Feb 16, 2021 at 8:45 AM ritika jain <ritikajain5...@gmail.com>
> wrote:
>
>>
>>
>> Hi users
>>
>>
>> I am using the ManifoldCF 2.14 Fileshare connector to crawl files from an
>> SMB server that has millions of records to process and crawl.
>>
>> Total system memory is 64 GB, of which 32 GB is allocated to ManifoldCF
>> in its start-options file.
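>>
>> The relevant heap flag in that file looks like this (a sketch; the
>> surrounding options are omitted):
>>
>>     -Xmx32g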
>>
>> We have some larger files to crawl, around 30 MB or more.
>>
>> When the size in the Content Limiter tab is set to 100000 (1 MB), the
>> job works fine, but when it is changed to 10000000 (10 MB), ManifoldCF
>> crashes, and the logs show a long-running query.
>>
>> How can we optimise the job specification so that it also processes
>> large documents?
>>
>> Do I need to increase or decrease the number of connections, or the
>> worker thread count?
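>>
>> As I understand it, those are set in ManifoldCF's properties.xml; a
>> minimal sketch with the documented property names (values illustrative,
>> not a recommendation):
>>
>>     <property name="org.apache.manifoldcf.crawler.threads" value="30"/>
>>     <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>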
>>
>> Can anybody help me with this, so that we can crawl larger files too, at
>> least up to 10 MB?
>>
>> Thanks
>>
>> Ritika
>>
>
