Re: Job Content Length issue

2021-02-17 Thread Karl Wright
The internal Tika extractor is not memory bounded; some transformations stream,
but others load everything into memory.

You can try using external Tika, with a Tika Server instance you run
separately, and that would likely help.  But you may need to give that instance
plenty of memory too.
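
A minimal sketch of what "external Tika" means in practice: the crawler streams
each document over HTTP to a standalone Tika Server (which has its own heap) and
gets plain text back. This assumes a Tika Server already running on its default
port 9998; the file path below is made up.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class ExternalTikaSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a standalone Tika Server is already running on its default port (9998),
        // started with its own large heap, so extraction memory stays out of the crawler JVM.
        HttpClient client = HttpClient.newHttpClient();

        // Stream the document to Tika's /tika endpoint; the file is not buffered here.
        Path doc = Path.of("/data/sample.pdf");  // hypothetical file path
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9998/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(doc))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Extracted " + response.body().length() + " characters of text");
    }
}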

Karl


Re: Job Content Length issue

2021-02-17 Thread ritika jain
Hi Karl,

I am using Elasticsearch as the output connector, and yes, the internal Tika
extractor; I am not using a Solr output connection.

Also, the Elasticsearch server is hosted on a different machine with a large
memory allocation.


Re: Job Content Length issue

2021-02-16 Thread Karl Wright
Hi, do you mean content limiter length of 100?

I assume you are using the internal Tika transformer?  Are you combining
this with a Solr output connection that is not using the extract handler?
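
For context, "using the extract handler" means posting the raw document bytes to
Solr's /update/extract endpoint so that Solr runs Tika on its own side, rather
than the crawler extracting (and buffering) the text itself. A rough sketch
follows; the URL, core name, and parameters are illustrative, not taken from any
actual configuration.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class SolrExtractSketch {
    public static void main(String[] args) throws Exception {
        // Post the raw file to Solr's extracting request handler; Solr invokes Tika
        // server-side, so the crawler never holds the extracted text in its own heap.
        Path doc = Path.of("/data/sample.pdf");  // hypothetical file path
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8983/solr/documents/update/extract"
                        + "?literal.id=sample-1&commit=true"))  // illustrative core and id
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(doc))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}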

By "manifold crashes" I assume you actually mean it runs out of memory.
The "long running query" concern is a red herring because that does not
cause a crash under any circumstances.

This is quite likely if I have described your setup correctly, because if you do
not use the Solr extract handler, the entire content of every document must be
loaded into memory.  That is why we require you to fill in a field, on those
kinds of output connections, that limits the number of bytes.
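
To put rough numbers on that (these are illustrative assumptions, not read from
your configuration): if each worker thread can have a fully buffered document in
flight at once, worst-case heap pressure scales with thread count times document
size, before counting Tika's own parse structures or the output connector's
copies.

public class BufferEstimateSketch {
    public static void main(String[] args) {
        // Illustrative numbers only: 30 worker threads (a common ManifoldCF default)
        // and 30 MB per document, roughly the larger files mentioned in this thread.
        int workerThreads = 30;
        long docBytes = 30L * 1024 * 1024;

        // Worst case, every thread buffers one whole document at the same time.
        long worstCaseBytes = workerThreads * docBytes;
        System.out.printf("Worst-case buffered content: ~%d MB%n",
                worstCaseBytes / (1024 * 1024));
    }
}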

Karl


On Tue, Feb 16, 2021 at 8:45 AM ritika jain wrote:

>
>
> Hi users
>
>
> I am using the ManifoldCF 2.14 file share connector to crawl files from an SMB
> server, which has millions (if not billions) of records to process and crawl.
>
> Total system memory is 64 GB, of which 32 GB is allocated to ManifoldCF in its
> start-options file.
>
> We have some larger files to crawl, around 30 MB or more.
>
> When the size given in the Content Limiter tab is 10, that is 1 MB, the job
> works fine, but when it is changed to 1000, that is 10 MB, ManifoldCF crashes
> with some logs about a long-running query.
>
> How can we tune the job specification so that large documents are also
> processed?
>
> Do I need to increase or decrease the number of connections, or the worker
> thread count, or something else?
>
> Can anybody help me with crawling larger files too, at least up to 10 MB?
>
> Thanks
>
> Ritika
>