Hi Priya,

What are your GC params, and does it throw the error on a particular Zip file?
Check its size and, if it is large, consider limiting the allowed file size to ingest.
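
If you're not sure what the JVM is actually running with, something like the
following shows the effective flags and the GC pressure (the pid lookup is
illustrative; adjust to your setup):

  jcmd <agents-pid> VM.flags        # GC collector and heap flags in effect
  jstat -gcutil <agents-pid> 5000   # back-to-back full GCs with the old gen stuck
                                    # near 100% is what "GC overhead limit exceeded" means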

Kind Regards,
Furkan KAMACI

On Fri, Aug 16, 2019 at 1:40 PM Karl Wright <[email protected]>
wrote:

> I see nothing indicating any single Tika extraction content type.  It's
> basically just unhappy with heap fragmentation and is GC'ing too
> frequently.  I would suggest just increasing the amount of memory you give
> the process for an experiment.  This might allow it to succeed.
>
> MCF uses the principle of "bounded memory", which means that no connector may
> put whole documents into memory; each must be limited.  But there is no
> stipulation as to any specific limit that each connector must work within.
> Some connectors, therefore, use a lot more memory than others, and Tika is one
> of the ones that can use a lot.  But it is still bounded unless there's a bug,
> so just try increasing for a start.
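>
> For the experiment, start-options.env.unix (mentioned below in this thread) is
> where the heap options live; something along these lines, where 4096m is just
> a starting point and not a recommendation:
>
>   -Xms4096m
>   -Xmx4096m
>   -XX:+HeapDumpOnOutOfMemoryError
>
> The heap-dump flag is optional, but it makes the next OOM much easier to
> diagnose.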
>
> Karl
>
>
> On Fri, Aug 16, 2019 at 6:25 AM Priya Arora <[email protected]> wrote:
>
> > Please find the error stack trace below:
> >
> > ERROR: agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >         at java.util.HashMap.newNode(HashMap.java:1750)
> >         at java.util.HashMap.putVal(HashMap.java:631)
> >         at java.util.HashMap.put(HashMap.java:612)
> >         at org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState.noteTag(HTMLParseState.java:51)
> >         at org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState.dealWithCharacter(TagParseState.java:638)
> >         at org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver.dealWithCharacters(SingleCharacterReceiver.java:51)
> >         at org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:48)
> >         at org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithoutCharsetDetection(Parser.java:99)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:4918)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3852)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:747)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >         at java.nio.ByteBuffer.wrap(ByteBuffer.java:373)
> >         at java.nio.ByteBuffer.wrap(ByteBuffer.java:396)
> >         at org.apache.commons.compress.archivers.zip.ZipFile.resolveLocalFileHeaderData(ZipFile.java:1059)
> >         at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:296)
> >         at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:218)
> >         at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:201)
> >         at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:162)
> >         at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:241)
> >         at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:173)
> >         at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:110)
> >         at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
> >         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
> >         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> >         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
> >         at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:350)
> >         at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:287)
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> >         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> >         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
> >         at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:280)
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> >         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> >         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> >         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> >         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> >         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
> >         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > [Thread-491] INFO org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@3a4621bd{HTTP/1.1}{0.0.0.0:8345}
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > [Thread-491] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@6a57ae10{/mcf-api-service,file:/tmp/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-3323783172971878700.dir/webapp/,UNAVAILABLE}{/usr/share/manifoldcf/example/./../web/war/mcf-api-service.war}
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > [Thread-491] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@51c693d{/mcf-authority-service,file:/tmp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-3706951886687463454.dir/webapp/,UNAVAILABLE}{/usr/share/manifoldcf/example/./../web/war/mcf-authority-service.war}
> >
> > On Fri, Aug 16, 2019 at 3:22 PM Karl Wright <[email protected]> wrote:
> >
> > > Without an out-of-memory stack trace, I cannot definitively point to Tika
> > > or say that it's a specific kind of file.  Please send one.
> > >
> > > Karl
> > >
> > >
> > > On Fri, Aug 16, 2019 at 2:09 AM Priya Arora <[email protected]> wrote:
> > >
> > > > *Existing threads/connections configuration:*
> > > >
> > > > How many worker threads do you have? - 15 worker threads have been
> > > > allocated (in the properties.xml file).
> > > > And for the Tika Extractor, 10 connections are defined.
> > > >
> > > > Do you suggest reducing these numbers further?
> > > > If not, what else could be a solution?
> > > >
> > > > Thanks
> > > > Priya
> > > >
> > > >
> > > >
> > > > On Wed, Aug 14, 2019 at 5:32 PM Karl Wright <[email protected]> wrote:
> > > >
> > > > > How many worker threads do you have?
> > > > > Even if each worker thread is constrained in memory, and they should be,
> > > > > you can easily cause things to run out of memory by configuring too many
> > > > > worker threads.  Another way to keep Tika's usage constrained would be to
> > > > > reduce the number of Tika Extractor connections, because that effectively
> > > > > limits the number of extractions that can be going on at the same time.
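> > > > >
> > > > > If you want to try that, the worker thread count is (as I recall) the
> > > > > org.apache.manifoldcf.crawler.threads property in properties.xml, e.g.
> > > > > (value illustrative, not a recommendation):
> > > > >
> > > > >   <property name="org.apache.manifoldcf.crawler.threads" value="10"/>
> > > > >
> > > > > and the Tika Extractor connection count is the maximum-connections
> > > > > setting on the transformation connection itself.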
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Wed, Aug 14, 2019 at 7:23 AM Priya Arora <[email protected]> wrote:
> > > > >
> > > > > > Yes, I am using the Tika Extractor, and the ManifoldCF version is 2.13.
> > > > > > I am also using Postgres as the database.
> > > > > >
> > > > > > I have 4 types of jobs.
> > > > > > One is crawling/re-crawling data from a public site; the other three
> > > > > > access an intranet site.
> > > > > > Two of them give correct output without any error, and the third, which
> > > > > > has more data than the other two, is giving me this error.
> > > > > >
> > > > > > Could this be a site-accessibility issue? Can you please suggest a
> > > > > > solution?
> > > > > > Thanks and regards
> > > > > > Priya
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 3:11 PM Karl Wright <[email protected]> wrote:
> > > > > >
> > > > > > > I will need to know more.  Do you have the Tika extractor in your
> > > > > > > pipeline?  If so, what version of ManifoldCF are you using?  Tika has
> > > > > > > had bugs related to memory consumption in the past; the out-of-memory
> > > > > > > exception may be coming from it, and therefore a stack trace is
> > > > > > > critical to have.
> > > > > > >
> > > > > > > Alternatively, you can upgrade to the latest version of MCF (2.13),
> > > > > > > which has a newer version of Tika without those problems.  But you may
> > > > > > > need to give the agents process more memory.
> > > > > > >
> > > > > > > Another possible cause is that you're using HSQLDB in production.
> > > > > > > HSQLDB keeps all of its tables in memory.  If you have a large crawl,
> > > > > > > you do not want to use HSQLDB.
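> > > > > > >
> > > > > > > For reference, the database implementation is chosen in properties.xml;
> > > > > > > as I recall, pointing it at PostgreSQL looks roughly like the following
> > > > > > > (plus the usual database name/user/password properties):
> > > > > > >
> > > > > > >   <property name="org.apache.manifoldcf.databaseimplementationclass"
> > > > > > >             value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>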
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Karl
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 3:41 AM Priya Arora <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Karl,
> > > > > > > >
> > > > > > > > The ManifoldCF log shows an error like:
> > > > > > > > agents process ran out of memory - shutting down
> > > > > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > > > >
> > > > > > > > Also, I have -Xms1024m, -Xmx1024m memory allocated in the
> > > > > > > > start-options.env.unix and start-options.env.win files.
> > > > > > > > The configuration:
> > > > > > > > 1) Crawler server - 16 GB RAM and 8-core Intel(R) Xeon(R) CPU E5-2660
> > > > > > > > v3 @ 2.60GHz
> > > > > > > >
> > > > > > > > 2) Elasticsearch server - 48 GB RAM and 1-core Intel(R) Xeon(R) CPU
> > > > > > > > E5-2660 v3 @ 2.60GHz. I am using Postgres as the database.
> > > > > > > >
> > > > > > > > Can you please help me figure out what to do in this case?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Priya
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 12:33 PM Karl Wright <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > The error occurs, I believe, as the result of basic connection
> > > > > > > > > problems, e.g. the connection is getting rejected.  You can find
> > > > > > > > > more information in the simple history, and in the manifoldcf log.
> > > > > > > > >
> > > > > > > > > I would like to know the underlying cause, since the connector
> > > > > > > > > should be resilient against errors of this kind.
> > > > > > > > >
> > > > > > > > > Karl
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019, 1:46 AM Priya Arora <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Karl,
> > > > > > > > > >
> > > > > > > > > > I have a Web repository connector (seeds: an intranet site), and
> > > > > > > > > > the job is on a production server.
> > > > > > > > > >
> > > > > > > > > > When I ran the job on PROD, the job stopped itself 2 times with
> > > > > > > > > > the error: Unexpected HTTP result code: -1: null.
> > > > > > > > > >
> > > > > > > > > > Can you please give me an idea of why this happens?
> > > > > > > > > >
> > > > > > > > > > Thanks and regards
> > > > > > > > > > Priya Arora
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
