The internal Tika extractor is not memory bounded; some transformations
stream, but others load the entire extracted document into memory.
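As a rough illustration (a sketch only, not ManifoldCF's actual code path),
Tika's BodyContentHandler buffers the whole extracted body in memory unless
you give it a write limit:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaLimitSketch {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();

            // A write limit of -1 means "unlimited": the handler buffers the
            // whole extracted body in memory, which is what hurts on big
            // files.  A positive limit aborts the parse (with an exception)
            // once that many characters have been written.
            BodyContentHandler handler =
                new BodyContentHandler(10 * 1024 * 1024);

            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                parser.parse(in, handler, metadata);
            }
            System.out.println("Extracted "
                + handler.toString().length() + " chars");
        }
    }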

You can try the external Tika transformer instead, pointed at a Tika server
instance you run separately; that would likely help.  But you may need to give
that instance lots of memory too.
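For reference, the external server is just tika-server listening on HTTP, and
a client only has to stream the raw document bytes to it.  A minimal sketch
(assuming a server started separately on the default port 9998, e.g. with
"java -jar tika-server-standard-x.y.z.jar"):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class TikaServerClientSketch {
        public static void main(String[] args) throws Exception {
            // PUT the raw document to the /tika endpoint; with
            // Accept: text/plain the server replies with the extracted text.
            URL url = new URL("http://localhost:9998/tika");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Accept", "text/plain");

            // Extraction happens in the tika-server JVM, not in the crawler
            // process, so the crawler's heap stays small.
            try (InputStream in = Files.newInputStream(Paths.get(args[0]));
                 OutputStream out = conn.getOutputStream()) {
                in.transferTo(out);
            }

            try (InputStream resp = conn.getInputStream()) {
                System.out.write(resp.readAllBytes());
                System.out.flush();
            }
        }
    }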

Karl


On Wed, Feb 17, 2021 at 3:50 AM ritika jain <ritikajain5...@gmail.com>
wrote:

> Hi Karl,
>
> I am using Elasticsearch as the output connector, and yes, I am using the
> internal Tika extractor, not a Solr output connection.
>
> Also, the Elasticsearch server is hosted on a different machine with a large
> memory allocation.
>
> On Tue, Feb 16, 2021 at 7:29 PM Karl Wright <daddy...@gmail.com> wrote:
>
>> Hi, do you mean a content limiter length of 1000000?
>>
>> I assume you are using the internal Tika transformer?  Are you combining
>> this with a Solr output connection that is not using the extract handler?
>>
>> By "manifold crashes" I assume you actually mean it runs out of memory.
>> The "long running query" concern is a red herring because that does not
>> cause a crash under any circumstances.
>>
>> This is quite likely if I described your setup correctly above, because if
>> you do not use the Solr extract handler, the entire content of every
>> document must be loaded into memory.  That is why we require you to fill in
>> a Solr field, on those kinds of output connections, that limits the number
>> of bytes.
>>
>> Karl
>>
>>
>> On Tue, Feb 16, 2021 at 8:45 AM ritika jain <ritikajain5...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> Hi users
>>>
>>>
>>> I am using the ManifoldCF 2.14 file share connector to crawl files from an
>>> SMB server that has millions, even billions, of records to process and
>>> crawl.
>>>
>>> Total system memory is 64 GB, of which 32 GB is allocated to ManifoldCF in
>>> its start-options file.
>>>
>>> We have some larger files to crawl, around 30 MB or more.
>>>
>>> When the size in the content limiter tab is 100000, that is 1 MB, the job
>>> works fine, but when it is changed to 10000000, that is 10 MB, ManifoldCF
>>> crashes, with logs showing a long-running query.
>>>
>>> How can we tune the job specification so that it can also process large
>>> documents?
>>>
>>> Do I need to increase or decrease the number of connections, or the worker
>>> thread count, or something else?
>>>
>>> Can anybody help me with this, so we can crawl larger files too, at least
>>> up to 10 MB?
>>>
>>> Thanks
>>>
>>> Ritika
>>>
>>