The internal Tika transformer is not memory-bounded; some transformations stream, but others load the entire document into memory.
You can try using the external Tika, with a Tika instance you run separately; that would likely help. But you may need to give it lots of memory too.

Karl

On Wed, Feb 17, 2021 at 3:50 AM ritika jain <ritikajain5...@gmail.com> wrote:

> Hi Karl,
>
> I am using Elasticsearch as an output connector, and yes, I am using the
> internal Tika extractor, not a Solr output connection.
>
> Also, the Elasticsearch server is hosted on a different server with a huge
> memory allocation.
>
> On Tue, Feb 16, 2021 at 7:29 PM Karl Wright <daddy...@gmail.com> wrote:
>
>> Hi, do you mean a content limiter length of 1000000?
>>
>> I assume you are using the internal Tika transformer? Are you combining
>> this with a Solr output connection that is not using the extract handler?
>>
>> By "manifold crashes" I assume you actually mean it runs out of memory.
>> The "long running query" concern is a red herring, because that does not
>> cause a crash under any circumstances.
>>
>> This is quite likely if I have described your setup correctly, because if
>> you do not use the Solr extract handler, the entire content of every
>> document must be loaded into memory. That is why we require you to fill in
>> a Solr field on those kinds of output connections that limits the number
>> of bytes.
>>
>> Karl
>>
>> On Tue, Feb 16, 2021 at 8:45 AM ritika jain <ritikajain5...@gmail.com>
>> wrote:
>>
>>> Hi users,
>>>
>>> I am using the ManifoldCF 2.14 Fileshare connector to crawl files from
>>> an SMB server which has millions of records to process and crawl.
>>>
>>> Total system memory is 64 GB, of which 32 GB is allocated to ManifoldCF
>>> in its start-options file.
>>>
>>> We have some larger files to crawl, around 30 MB or more.
>>>
>>> When the size mentioned in the Content Limiter tab is 100000, that is
>>> 1 MB, the job works fine, but when it is changed to 10000000, that is
>>> 10 MB, ManifoldCF crashes, with logs mentioning a long-running query.
>>>
>>> How can we optimise the job specification to process large documents as
>>> well?
>>>
>>> Do I need to increase or decrease the number of connections, or the
>>> worker thread count, or something else?
>>>
>>> Can anybody help me crawl larger files too, at least up to 10 MB?
>>>
>>> Thanks,
>>> Ritika
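[Editor's note: for reference, 1000000 bytes is 1 MB and 10000000 bytes is 10 MB, so the 100000 quoted above is roughly 0.1 MB; that discrepancy is presumably why the question about the limiter length comes up. Karl's suggestion of a separately-run Tika instance can be sketched roughly as below. The jar version, heap size, and port here are assumptions, not values from the thread; check the Apache Tika downloads page for the current tika-server artifact, and your ManifoldCF version's documentation for its external Tika transformation connector.]

```shell
# Rough sketch: run a standalone Tika server with its own large heap,
# separate from the ManifoldCF JVM. Jar name/version, heap size, and
# port are illustrative assumptions.
java -Xmx4g -jar tika-server-1.25.jar --host 0.0.0.0 --port 9998 &

# Once it is up, GET /tika should return the server's greeting text:
curl http://localhost:9998/tika
```

Running extraction out of process means a large document that Tika inflates during parsing exhausts the external server's heap at worst, rather than taking the crawler's JVM down with it; the external Tika transformation connection would then be pointed at the host and port above instead of using the in-process transformer.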