Re: Multiprocess file installation of manifold

2021-02-17 Thread Karl Wright
File synchronization is still supported but is deprecated. We recommend
ZooKeeper synchronization unless you have a very good reason not to.
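For reference, a ZooKeeper-based multiprocess setup is selected in properties.xml along these lines. This is only a sketch: the property names follow the multiprocess-zk example shipped with ManifoldCF releases, and the connect string and timeout values are placeholders you would adjust for your ensemble; check the documentation for your version.

```xml
<!-- properties.xml fragment: switch lock management to ZooKeeper -->
<property name="org.apache.manifoldcf.lockmanagerclass"
          value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
<!-- Comma-separated host:port list for the ZooKeeper ensemble (placeholder) -->
<property name="org.apache.manifoldcf.zookeeper.connectstring"
          value="localhost:2181"/>
<!-- Session timeout in milliseconds (placeholder) -->
<property name="org.apache.manifoldcf.zookeeper.sessiontimeout"
          value="300000"/>
```

Each agents, crawler-ui, and authority-service process then points at the same properties.xml so they coordinate through ZooKeeper rather than through lock files.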

Karl


On Wed, Feb 17, 2021 at 12:26 PM Ananth Peddinti  wrote:

> Hello Team ,
>
>
> I would like to know if someone has already done a multi-process model
> installation of ManifoldCF on a Linux machine. I would like to know the
> process in detail. We are running into issues with the quick-start model.
>
>
>
> Regards
>
> Ananth
> --
> 
> -SECURITY/CONFIDENTIALITY WARNING-
>
> This message and any attachments are intended solely for the individual or
> entity to which they are addressed. This communication may contain
> information that is privileged, confidential, or exempt from disclosure
> under applicable law (e.g., personal health information, research data,
> financial information). Because this e-mail has been sent without
> encryption, individuals other than the intended recipient may be able to
> view the information, forward it to others or tamper with the information
> without the knowledge or consent of the sender. If you are not the intended
> recipient, or the employee or person responsible for delivering the message
> to the intended recipient, any dissemination, distribution or copying of
> the communication is strictly prohibited. If you received the communication
> in error, please notify the sender immediately by replying to this message
> and deleting the message and any accompanying files from your system. If,
> due to the security risks, you do not wish to receive further
> communications via e-mail, please reply to this message and inform the
> sender that you do not wish to receive further e-mail from the sender.
> (LCP301)
> 
>


Multiprocess file installation of manifold

2021-02-17 Thread Ananth Peddinti
Hello Team ,

I would like to know if someone has already done a multi-process model
installation of ManifoldCF on a Linux machine. I would like to know the
process in detail. We are running into issues with the quick-start model.

Regards
Ananth




Re: Job Content Length issue

2021-02-17 Thread Karl Wright
The internal Tika is not memory-bounded; some transformations stream, but
others load everything into memory.

You can try using external Tika, with a Tika instance you run separately;
that would likely help. But you may need to give it lots of memory too.
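As a sketch of the external-Tika route: run a standalone Tika server with a
larger heap, then point ManifoldCF's external Tika transformer at it. The jar
name, heap size, and port below are assumptions; adjust them to the Tika
version you actually download.

```shell
# Launch a standalone Tika server with a large heap (values are examples)
java -Xmx8g -jar tika-server-standard-2.9.1.jar --port 9998

# Quick sanity check that the server is answering
curl http://localhost:9998/tika
```

This moves extraction out of the ManifoldCF JVM, so an oversized document can
exhaust the Tika server's heap without taking the crawler down with it.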

Karl


On Wed, Feb 17, 2021 at 3:50 AM ritika jain 
wrote:

> Hi Karl,
>
> I am using Elasticsearch as the output connector and yes, I am using the
> internal Tika extractor, not a Solr output connection.
>
> Also, the Elasticsearch server is hosted on a different server with a large
> memory allocation.
>
> On Tue, Feb 16, 2021 at 7:29 PM Karl Wright  wrote:
>
>> Hi, do you mean content limiter length of 100?
>>
>> I assume you are using the internal Tika transformer?  Are you combining
>> this with a Solr output connection that is not using the extract handler?
>>
>> By "manifold crashes" I assume you actually mean it runs out of memory.
>> The "long running query" concern is a red herring because that does not
>> cause a crash under any circumstances.
>>
>> This is quite likely if I described your setup above, because if you do
>> not use the Solr extract handler, the entire content of every document must
>> be loaded into memory.  That is why we require you to fill in a Solr field
>> on those kinds of output connections that limits the number of bytes.
>>
>> Karl
>>
>>
>> On Tue, Feb 16, 2021 at 8:45 AM ritika jain 
>> wrote:
>>
>>>
>>>
>>> Hi users
>>>
>>>
>>> I am using the ManifoldCF 2.14 file-share connector to crawl files from
>>> an SMB server, which has millions, if not billions, of records to process
>>> and crawl.
>>>
>>> Total system memory is 64 GB, of which the ManifoldCF start-options file
>>> allocates 32 GB.
>>>
>>> We have some larger files to crawl, around 30 MB or more.
>>>
>>> When the size mentioned in the content limiter tab is 10 (that is, 1 MB),
>>> the job works fine, but when it is changed to 1000 (that is, 10 MB),
>>> manifold crashes, with logs mentioning a long-running query.
>>>
>>> How can we optimise the job specification to process large documents as
>>> well?
>>>
>>> Do I need to increase or decrease the number of connections, or the
>>> worker thread count, or something else?
>>>
>>> Can anybody help me with this, to crawl larger files too, at least up to
>>> 10 MB?
>>>
>>> Thanks
>>>
>>> Ritika
>>>
>>


Re: Job Content Length issue

2021-02-17 Thread ritika jain
Hi Karl,

I am using Elasticsearch as the output connector and yes, I am using the
internal Tika extractor, not a Solr output connection.

Also, the Elasticsearch server is hosted on a different server with a large
memory allocation.
