Hi Priya,

What are your GC params, and does the error occur on a particular Zip file? Check its size; if it is very large, consider limiting the maximum file size allowed for ingestion.
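For concreteness, a sketch of diagnostic JVM options that would answer the GC question. These are standard HotSpot flags on Java 8 and could go into start-options.env.unix (the file mentioned later in this thread); the log paths are assumptions, and the comment lines are explanatory only:

    # Current heap settings from the thread below
    -Xms1024m
    -Xmx1024m
    # Log GC activity so "GC overhead limit exceeded" can be correlated with heap usage
    -verbose:gc
    -XX:+PrintGCDetails
    -Xloggc:/var/log/manifoldcf/gc.log
    # Dump the heap on OOM so the offending document or parser can be identified
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/var/log/manifoldcf/agents.hprof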
Kind Regards,
Furkan KAMACI

On Fri, Aug 16, 2019 at 1:40 PM, Karl Wright <[email protected]> wrote:

> I see nothing indicating any single Tika extraction content type. It's
> basically just unhappy with heap fragmentation and is GC'ing too
> frequently. I would suggest just increasing the amount of memory you give
> the process as an experiment. This might allow it to succeed.
>
> MCF uses the principle of "bounded memory", which means that no connector
> may put whole documents into memory; each must be limited. But there is no
> stipulation as to any specific limit that each connector must work within.
> Some connectors, therefore, use a lot more memory than others, and Tika is
> one of the ones that can use a lot. But it is still bounded unless there's
> a bug, so just try increasing the memory for a start.
>
> Karl
>
> On Fri, Aug 16, 2019 at 6:25 AM Priya Arora <[email protected]> wrote:
>
> > Please find the error stack trace below:
> >
> > ERROR: agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >     at java.util.HashMap.newNode(HashMap.java:1750)
> >     at java.util.HashMap.putVal(HashMap.java:631)
> >     at java.util.HashMap.put(HashMap.java:612)
> >     at org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState.noteTag(HTMLParseState.java:51)
> >     at org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState.dealWithCharacter(TagParseState.java:638)
> >     at org.apache.manifoldcf.connectorcommon.fuzzyml.SingleCharacterReceiver.dealWithCharacters(SingleCharacterReceiver.java:51)
> >     at org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:48)
> >     at org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithoutCharsetDetection(Parser.java:99)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:4918)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3852)
> >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:747)
> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >     at java.nio.ByteBuffer.wrap(ByteBuffer.java:373)
> >     at java.nio.ByteBuffer.wrap(ByteBuffer.java:396)
> >     at org.apache.commons.compress.archivers.zip.ZipFile.resolveLocalFileHeaderData(ZipFile.java:1059)
> >     at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:296)
> >     at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:218)
> >     at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:201)
> >     at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:162)
> >     at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:241)
> >     at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:173)
> >     at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:110)
> >     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
> >     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> >     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
> >     at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:350)
> >     at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:287)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> >     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> >     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
> >     at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:280)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> >     at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> >     at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> >     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> >     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> >     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
> >     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
> >     at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > [Thread-491] INFO org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@3a4621bd{HTTP/1.1}{0.0.0.0:8345}
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > [Thread-491] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@6a57ae10{/mcf-api-service,file:/tmp/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-3323783172971878700.dir/webapp/,UNAVAILABLE}{/usr/share/manifoldcf/example/./../web/war/mcf-api-service.war}
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > agents process ran out of memory - shutting down
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > [Thread-491] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@51c693d{/mcf-authority-service,file:/tmp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-3706951886687463454.dir/webapp/,UNAVAILABLE}{/usr/share/manifoldcf/example/./../web/war/mcf-authority-service.war}
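The second trace above shows the OOM surfacing in Tika's ZipContainerDetector while the Tika Extractor unpacks an archive, which is what motivates the Zip-file-size question at the top of the thread. As a standalone sketch of that size-capping idea (not ManifoldCF's actual pipeline code; the 50 MB byte cap and 1,000,000-character write limit are arbitrary assumptions), one could pre-filter documents before handing them to Tika:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class BoundedExtraction {
        // Assumed cap; tune to the largest document worth indexing
        private static final long MAX_BYTES = 50L * 1024 * 1024;

        public static String extract(Path file) throws Exception {
            // Skip oversized documents entirely rather than letting the
            // parser inflate them on the heap
            if (Files.size(file) > MAX_BYTES) {
                return null;
            }
            // The writeLimit bounds how many extracted characters are buffered;
            // exceeding it aborts the parse with a SAXException instead of
            // growing memory without bound
            BodyContentHandler handler = new BodyContentHandler(1_000_000);
            try (InputStream in = Files.newInputStream(file)) {
                new AutoDetectParser().parse(in, handler, new Metadata());
            }
            return handler.toString();
        }
    }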
> > On Fri, Aug 16, 2019 at 3:22 PM Karl Wright <[email protected]> wrote:
> >
> > > Without an out-of-memory stack trace, I cannot definitively point to
> > > Tika or say that it's a specific kind of file. Please send one.
> > >
> > > Karl
> > >
> > > On Fri, Aug 16, 2019 at 2:09 AM Priya Arora <[email protected]> wrote:
> > >
> > > > Existing threads/connections configuration:
> > > >
> > > > How many worker threads do you have? - 15 worker threads have been
> > > > allocated (in the properties.xml file).
> > > > And for the Tika Extractor connections, 10 connections are defined.
> > > >
> > > > Do you suggest reducing these numbers further?
> > > > If not, what else could be a solution?
> > > >
> > > > Thanks
> > > > Priya
> > > >
> > > > On Wed, Aug 14, 2019 at 5:32 PM Karl Wright <[email protected]> wrote:
> > > >
> > > > > How many worker threads do you have?
> > > > > Even if each worker thread is constrained in memory, and they should
> > > > > be, you can easily cause things to run out of memory by configuring
> > > > > too many worker threads. Another way to keep Tika's usage constrained
> > > > > would be to reduce the number of Tika Extractor connections, because
> > > > > that effectively limits the number of extractions that can be going
> > > > > on at the same time.
> > > > >
> > > > > Karl
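For reference, the worker-thread count Karl mentions is set in properties.xml. A sketch, assuming the standard property name; the value 10 is only an example stepping down from the 15 described above:

    <property name="org.apache.manifoldcf.crawler.threads" value="10"/>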
> > > > > On Wed, Aug 14, 2019 at 7:23 AM Priya Arora <[email protected]> wrote:
> > > > >
> > > > > > Yes, I am using the Tika Extractor, and the ManifoldCF version used
> > > > > > is 2.13. I am also using Postgres as the database.
> > > > > >
> > > > > > I have 4 types of jobs.
> > > > > > One is accessing/re-crawling data from a public site. The other
> > > > > > three access an intranet site.
> > > > > > Two of them give me correct output without any error, and the third
> > > > > > one, which has more data than the other two, gives me this error.
> > > > > >
> > > > > > Is there any possibility of a site accessibility issue? Can you
> > > > > > please suggest a solution.
> > > > > >
> > > > > > Thanks and regards
> > > > > > Priya
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 3:11 PM Karl Wright <[email protected]> wrote:
> > > > > >
> > > > > > > I will need to know more. Do you have the Tika extractor in your
> > > > > > > pipeline? If so, what version of ManifoldCF are you using? Tika
> > > > > > > has had bugs related to memory consumption in the past; the
> > > > > > > out-of-memory exception may be coming from it, and therefore a
> > > > > > > stack trace is critical to have.
> > > > > > >
> > > > > > > Alternatively, you can upgrade to the latest version of MCF
> > > > > > > (2.13), which has a newer version of Tika without those problems.
> > > > > > > But you may need to give the agents process more memory.
> > > > > > >
> > > > > > > Another possible cause is that you're using HSQLDB in production.
> > > > > > > HSQLDB keeps all of its tables in memory. If you have a large
> > > > > > > crawl, you do not want to use HSQLDB.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Karl
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 3:41 AM Priya Arora <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Karl,
> > > > > > > >
> > > > > > > > The ManifoldCF log shows an error like:
> > > > > > > > agents process ran out of memory - shutting down
> > > > > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > > > >
> > > > > > > > I have -Xms1024m and -Xmx1024m allocated in the
> > > > > > > > start-options.env.unix and start-options.env.win files.
> > > > > > > >
> > > > > > > > Also, the configuration is:
> > > > > > > > 1) Crawler server - 16 GB RAM and an 8-core Intel(R) Xeon(R) CPU
> > > > > > > > E5-2660 v3 @ 2.60GHz, and
> > > > > > > > 2) Elasticsearch server - 48 GB RAM and a 1-core Intel(R) Xeon(R)
> > > > > > > > CPU E5-2660 v3 @ 2.60GHz; I am using Postgres as the database.
> > > > > > > >
> > > > > > > > Can you please help me with what to do in this case?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Priya
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 12:33 PM Karl Wright <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > The error occurs, I believe, as the result of basic connection
> > > > > > > > > problems, e.g. the connection is getting rejected. You can
> > > > > > > > > find more information in the simple history, and in the
> > > > > > > > > manifoldcf log.
> > > > > > > > >
> > > > > > > > > I would like to know the underlying cause, since the connector
> > > > > > > > > should be resilient against errors of this kind.
> > > > > > > > >
> > > > > > > > > Karl
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019, 1:46 AM Priya Arora <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Karl,
> > > > > > > > > >
> > > > > > > > > > I have a web repository connector (seeds: an intranet site),
> > > > > > > > > > and the job is on the production server.
> > > > > > > > > >
> > > > > > > > > > When I ran the job on PROD, the job stopped itself 2 times
> > > > > > > > > > with the error: Unexpected HTTP result code: -1: null.
> > > > > > > > > >
> > > > > > > > > > Can you please give me an idea of why this happens?
> > > > > > > > > >
> > > > > > > > > > Thanks and regards
> > > > > > > > > > Priya Arora
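Pulling the thread's two concrete suggestions together as a sketch: raising the agents-process heap above the 1024m mentioned above, and confirming that PostgreSQL rather than HSQLDB backs the crawl. The heap values are examples only, and the property name is assumed to be the standard ManifoldCF one:

    # start-options.env.unix - raise the agents-process heap (4096m is an example)
    -Xms4096m
    -Xmx4096m

    <!-- properties.xml - ensure PostgreSQL, not HSQLDB, is the database -->
    <property name="org.apache.manifoldcf.databaseimplementationclass"
              value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>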
