I see nothing here that points to any single Tika extraction content type.
The JVM is basically just unhappy with heap pressure and is GC'ing too
frequently.  As an experiment, I would suggest simply increasing the amount
of memory you give the agents process.  This might allow it to succeed.
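
As a concrete sketch of that experiment (the values are illustrative, not a
recommendation): in the single-process example the heap settings live in
start-options.env.unix / start-options.env.win, one JVM option per line, so
it can be as simple as raising both values; -verbose:gc is a standard
HotSpot flag you could add to watch how often collections run.

  -Xms4096m
  -Xmx4096m
  -verbose:gc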

MCF uses the principle of "bounded memory", which means that no connector
may pull whole documents into memory; each must work within a limit.  But
there is no stipulation as to the specific limit each connector must work
within.  Some connectors therefore use a lot more memory than others, and
Tika is one of the ones that can use a lot.  It is still bounded unless
there's a bug, so just try increasing the heap for a start.
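
(To illustrate the bounded-memory principle -- this is not MCF's actual
code, and the class name and 1 MB cap are made up for the example:)

  import java.io.IOException;
  import java.io.InputStream;

  // Illustrative only: read at most maxBytes from a document stream, so
  // memory use stays bounded no matter how large the document is.
  public final class BoundedReader {
      public static final int DEFAULT_CAP = 1024 * 1024; // 1 MB, arbitrary

      public static byte[] readBounded(InputStream in, int maxBytes)
              throws IOException {
          byte[] buf = new byte[maxBytes];
          int off = 0;
          while (off < maxBytes) {
              int n = in.read(buf, off, maxBytes - off);
              if (n < 0) break; // end of stream
              off += n;
          }
          byte[] out = new byte[off];
          System.arraycopy(buf, 0, out, 0, off);
          return out; // never more than maxBytes, whatever the input size
      }
  }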

Karl


On Fri, Aug 16, 2019 at 6:25 AM Priya Arora <pr...@smartshore.nl> wrote:

> Please find the error stack trace below:
>
> ERROR: agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.util.HashMap.newNode(HashMap.java:1750)
>         at java.util.HashMap.putVal(HashMap.java:631)
>         at java.util.HashMap.put(HashMap.java:612)
>         at org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState.noteTag(HTMLParseState.java:51)
>         at org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState.dealWithCharacter(TagParseState.java:638)
>         at org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver.dealWithCharacters(SingleCharacterReceiver.java:51)
>         at org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:48)
>         at org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithoutCharsetDetection(Parser.java:99)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:4918)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3852)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:747)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.nio.ByteBuffer.wrap(ByteBuffer.java:373)
>         at java.nio.ByteBuffer.wrap(ByteBuffer.java:396)
>         at org.apache.commons.compress.archivers.zip.ZipFile.resolveLocalFileHeaderData(ZipFile.java:1059)
>         at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:296)
>         at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:218)
>         at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:201)
>         at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:162)
>         at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:241)
>         at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:173)
>         at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:110)
>         at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>         at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:350)
>         at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:287)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>         at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:280)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> [Thread-491] INFO org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@3a4621bd{HTTP/1.1}{0.0.0.0:8345}
> agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> [Thread-491] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@6a57ae10{/mcf-api-service,file:/tmp/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-3323783172971878700.dir/webapp/,UNAVAILABLE}{/usr/share/manifoldcf/example/./../web/war/mcf-api-service.war}
> agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> agents process ran out of memory - shutting down
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> [Thread-491] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@51c693d{/mcf-authority-service,file:/tmp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-3706951886687463454.dir/webapp/,UNAVAILABLE}{/usr/share/manifoldcf/example/./../web/war/mcf-authority-service.war}
>
> On Fri, Aug 16, 2019 at 3:22 PM Karl Wright <daddy...@gmail.com> wrote:
>
> > Without an out-of-memory stack trace, I cannot definitively point to Tika
> > or say that it's a specific kind of file.  Please send one.
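> >
> > (A hedged aside: if the agents process dies before a trace can be
> > captured, the standard HotSpot flags below, added to that process's JVM
> > options, make the JVM write a heap dump on OOM for offline analysis; the
> > /tmp path is only an example.)
> >
> >   -XX:+HeapDumpOnOutOfMemoryError
> >   -XX:HeapDumpPath=/tmp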
> >
> > Karl
> >
> >
> > On Fri, Aug 16, 2019 at 2:09 AM Priya Arora <pr...@smartshore.nl> wrote:
> >
> > > *Existing threads/connections configuration:*
> > >
> > > How many worker threads do you have? - 15 worker threads have been
> > > allocated (in the properties.xml file).
> > > For the Tika Extractor, 10 connections are defined.
> > >
> > > Do you suggest reducing these numbers further?
> > > If not, what else could be a solution?
> > >
> > > Thanks
> > > Priya
> > >
> > >
> > >
> > > On Wed, Aug 14, 2019 at 5:32 PM Karl Wright <daddy...@gmail.com> wrote:
> > >
> > > > How many worker threads do you have?
> > > > Even if each worker thread is constrained in memory, and they should
> > > > be, you can easily cause things to run out of memory by configuring
> > > > too many worker threads.  Another way to keep Tika's usage constrained
> > > > is to reduce the number of Tika Extractor connections, because that
> > > > effectively limits the number of extractions that can be going on at
> > > > the same time.
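> > > >
> > > > (A hedged sketch, assuming the stock property name from the MCF
> > > > example and an illustrative value of 10: the worker thread count is
> > > > capped in properties.xml as below, while the Tika Extractor connection
> > > > count is edited in the UI on the transformation connection itself.)
> > > >
> > > >   <property name="org.apache.manifoldcf.crawler.threads" value="10"/>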
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Wed, Aug 14, 2019 at 7:23 AM Priya Arora <pr...@smartshore.nl> wrote:
> > > >
> > > > > Yes, I am using the Tika Extractor, and the ManifoldCF version is
> > > > > 2.13.  I am also using Postgres as the database.
> > > > >
> > > > > I have four jobs: one crawls/re-crawls data from a public site, and
> > > > > the other three access intranet sites.  Two of the intranet jobs give
> > > > > correct output without any error; the third, which has more data than
> > > > > the other two, gives me this error.
> > > > >
> > > > > Could this be a site accessibility issue?  Can you please suggest a
> > > > > solution?
> > > > >
> > > > > Thanks and regards
> > > > > Priya
> > > > >
> > > > > On Wed, Aug 14, 2019 at 3:11 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > >
> > > > > > I will need to know more.  Do you have the Tika extractor in your
> > > > > > pipeline?  If so, what version of ManifoldCF are you using?  Tika
> > > > > > has had bugs related to memory consumption in the past; the
> > > > > > out-of-memory exception may be coming from it, and therefore a
> > > > > > stack trace is critical to have.
> > > > > >
> > > > > > Alternatively, you can upgrade to the latest version of MCF (2.13),
> > > > > > which has a newer version of Tika without those problems.  But you
> > > > > > may need to give the agents process more memory.
> > > > > >
> > > > > > Another possible cause is that you're using HSQLDB in production.
> > > > > > HSQLDB keeps all of its tables in memory.  If you have a large
> > > > > > crawl, you do not want to use HSQLDB; see the sketch below for
> > > > > > switching to PostgreSQL.
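> > > > > >
> > > > > > (A hedged sketch of pointing properties.xml at PostgreSQL instead,
> > > > > > assuming the standard property names; the credentials are
> > > > > > placeholders:)
> > > > > >
> > > > > >   <property name="org.apache.manifoldcf.databaseimplementationclass"
> > > > > >             value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
> > > > > >   <property name="org.apache.manifoldcf.dbsuperusername" value="postgres"/>
> > > > > >   <property name="org.apache.manifoldcf.dbsuperuserpassword" value="CHANGEME"/>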
> > > > > >
> > > > > > Thanks,
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 3:41 AM Priya Arora <pr...@smartshore.nl> wrote:
> > > > > >
> > > > > > > Hi Karl,
> > > > > > >
> > > > > > > The ManifoldCF log shows me an error like:
> > > > > > > agents process ran out of memory - shutting down
> > > > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > > >
> > > > > > > I also have -Xms1024m, -Xmx1024m allocated in the
> > > > > > > start-options.env.unix and start-options.env.win files.
> > > > > > > The configuration is:
> > > > > > > 1) Crawler server - 16 GB RAM, 8-core Intel(R) Xeon(R) CPU
> > > > > > > E5-2660 v3 @ 2.60GHz
> > > > > > >
> > > > > > > 2) Elasticsearch server - 48 GB RAM, 1-core Intel(R) Xeon(R) CPU
> > > > > > > E5-2660 v3 @ 2.60GHz; I am using Postgres as the database.
> > > > > > >
> > > > > > > Can you please help me out with what to do in this case?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Priya
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 12:33 PM Karl Wright <daddy...@gmail.com> wrote:
> > > > > > >
> > > > > > > > The error occurs, I believe, as the result of basic connection
> > > > > > > > problems, e.g. the connection is getting rejected.  You can
> > > > > > > > find more information in the simple history, and in the
> > > > > > > > manifoldcf log.
> > > > > > > >
> > > > > > > > I would like to know the underlying cause, since the connector
> > > > > > > > should be resilient against errors of this kind.
> > > > > > > >
> > > > > > > > Karl
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019, 1:46 AM Priya Arora <pr...@smartshore.nl> wrote:
> > > > > > > >
> > > > > > > > > Hi Karl,
> > > > > > > > >
> > > > > > > > > I have a web repository connector (seeds: an intranet site),
> > > > > > > > > and the job is on the production server.
> > > > > > > > >
> > > > > > > > > When I ran the job on PROD, the job stopped itself twice with
> > > > > > > > > the error: Unexpected HTTP result code: -1: null.
> > > > > > > > >
> > > > > > > > > Can you please give me an idea of why this happens?
> > > > > > > > >
> > > > > > > > > Thanks and regards
> > > > > > > > > Priya Arora
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
