Re: Problem indexing windows files
Thanks for your help.

I tried looking at the logs but didn't see anything in the Solr or ManifoldCF log files. I don't know where the Tika log file is. I downloaded the Solr 4.4 binary and I am using the example that ships with it.

On Wed, Sep 18, 2013 at 12:02 AM, Furkan KAMACI wrote:

> Firstly,
>
> This may not be a Solr-related problem. Did you check the Solr log file?
> Tika may run into trouble in certain situations. For example, when parsing
> HTML that contains a base64-encoded image it may have problems. If you find
> the right logs you can detect it. On the other hand, check ManifoldCF as
> well; there may be a problem there too.
>
> On Tuesday, 17 September 2013, Yossi Nachum wrote:
>
> > Hi,
> >
> > I am trying to index my Windows PC files with ManifoldCF version 1.3 and
> > Solr version 4.4.
> >
> > I created an output connection and a repository connection and started a
> > new job that scans my E drive.
> >
> > Everything seems to work at first, but after a few minutes Solr stops
> > receiving new files to index. I can see that in the Tomcat log file.
> >
> > In the ManifoldCF crawler UI I see that the job is still running, but
> > after a few minutes I get the following error:
> > "Error: Repeated service interruptions - failure processing document:
> > Read timed out"
> >
> > The Tomcat process constantly consumes 100% of one CPU (I have two CPUs),
> > even after I get the error message from the ManifoldCF crawler UI.
> >
> > I checked the thread dump in the Solr admin UI and saw that the following
> > thread takes the most CPU/user time:
> > "
> > http-8080-3 (32)
> >
> > - java.io.FileInputStream.readBytes(Native Method)
> > - java.io.FileInputStream.read(FileInputStream.java:236)
> > - java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> > - java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> > - java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> > - org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
> > - java.io.FilterInputStream.read(FilterInputStream.java:133)
> > - org.apache.tika.io.TailStream.read(TailStream.java:117)
> > - org.apache.tika.io.TailStream.skip(TailStream.java:140)
> > - org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
> > - org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
> > - org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
> > - org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
> > - org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> > - org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> > - org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> > - org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> > - org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > - org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > - org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> > - org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
> > - org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
> > - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
> > - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
> > - org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> > - org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> > - org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> > - org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> > - org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> > - org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> > - org.apache.catalina.core.StandardEngineValve.invoke(Sta
check which file/document cause solr to work hard
Hi,

I am trying to index my Windows PC files with ManifoldCF version 1.3 and Solr version 4.4.

A few minutes after I start the crawler job, the Tomcat process constantly consumes 100% of one CPU (I have two CPUs).

I checked the thread dump in the Solr admin UI and saw that the following thread takes the most CPU/user time:
"
http-8080-3 (32)

- java.io.FileInputStream.readBytes(Native Method)
- java.io.FileInputStream.read(FileInputStream.java:236)
- java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
- java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
- java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
- java.io.FilterInputStream.read(FilterInputStream.java:133)
- org.apache.tika.io.TailStream.read(TailStream.java:117)
- org.apache.tika.io.TailStream.skip(TailStream.java:140)
- org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
- org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
- org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
- org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
- org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
- org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
- org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
- org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
- org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
- org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
- org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
- org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
- org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
- org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
- org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
- org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
- org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
- org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
- org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
- org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
- org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
- org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
- org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
- org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
- org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
- org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
- java.lang.Thread.run(Thread.java:679)
"

How can I check which file is causing Tika to work so hard? I don't see anything in the log files and I am stuck.

Thanks,
Yossi
Problem indexing windows files
Hi,

I am trying to index my Windows PC files with ManifoldCF version 1.3 and Solr version 4.4.

I created an output connection and a repository connection and started a new job that scans my E drive.

Everything seems to work at first, but after a few minutes Solr stops receiving new files to index. I can see that in the Tomcat log file.

In the ManifoldCF crawler UI I see that the job is still running, but after a few minutes I get the following error:
"Error: Repeated service interruptions - failure processing document: Read timed out"

The Tomcat process constantly consumes 100% of one CPU (I have two CPUs), even after I get the error message from the ManifoldCF crawler UI.

I checked the thread dump in the Solr admin UI and saw that the following thread takes the most CPU/user time:
"
http-8080-3 (32)

- java.io.FileInputStream.readBytes(Native Method)
- java.io.FileInputStream.read(FileInputStream.java:236)
- java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
- java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
- java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
- java.io.FilterInputStream.read(FilterInputStream.java:133)
- org.apache.tika.io.TailStream.read(TailStream.java:117)
- org.apache.tika.io.TailStream.skip(TailStream.java:140)
- org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
- org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
- org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
- org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
- org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
- org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
- org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
- org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
- org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
- org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
- org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
- org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
- org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
- org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
- org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
- org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
- org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
- org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
- org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
- org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
- org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
- org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
- org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
- org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
- org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
- org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
- java.lang.Thread.run(Thread.java:679)
"

Does anyone know what I can do? How do I debug this issue? How can I check which file is causing Tika to work so hard? I don't see anything in the log files and I am stuck.

Thanks,
Yossi
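
A possible starting point, offered only as a rough sketch: the thread dump above shows the busy thread inside org.apache.tika.parser.mp3, so one way to narrow the search is to read the format-specific Tika package out of the dump and then list the files of that type under the crawled drive. The function names below (busy_tika_formats, candidate_files) are hypothetical helpers, not part of Solr, Tika, or ManifoldCF. The sketch also assumes that the package name ("mp3") matches the file extension (".mp3"), which need not hold because Tika detects types by content, and it sorts largest files first purely as a guess about which ones are slow to parse.

```python
import os
import re

# Matches format-specific frames such as
# "org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)".
# Generic frames like "org.apache.tika.parser.CompositeParser.parse(" have
# no sub-package and are deliberately not matched.
TIKA_FRAME = re.compile(r"org\.apache\.tika\.parser\.(\w+)\.\w+\.\w+\(")


def busy_tika_formats(thread_dump):
    """Return the Tika parser sub-packages seen in a dump, in order, without repeats."""
    formats = []
    for package in TIKA_FRAME.findall(thread_dump):
        if package not in formats:
            formats.append(package)
    return formats


def candidate_files(root, extension):
    """Return (size_in_bytes, path) for files under root with the extension, largest first."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(extension.lower()):
                path = os.path.join(dirpath, filename)
                try:
                    matches.append((os.path.getsize(path), path))
                except OSError:
                    # Unreadable or vanished file: skip it.
                    continue
    return sorted(matches, reverse=True)


if __name__ == "__main__":
    dump = (
        "- org.apache.tika.io.TailStream.skip(TailStream.java:140) "
        "- org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160) "
        "- org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71) "
        "- org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)"
    )
    for fmt in busy_tika_formats(dump):
        # "E:\\" is the drive mentioned in the post; adjust as needed.
        for size, path in candidate_files("E:\\", "." + fmt)[:20]:
            print(size, path)
```

The resulting list only gives candidates to try one at a time (for example by temporarily excluding them from the crawl); it does not prove which file is responsible.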