Re: Problem indexing windows files

2013-09-18 Thread Yossi Nachum
Thanks for your help.
I looked at the logs but didn't see anything in the Solr or ManifoldCF log
files.
I don't know where the Tika log file is. I downloaded the Solr 4.4 binary and I
am using the example included in it.


On Wed, Sep 18, 2013 at 12:02 AM, Furkan KAMACI wrote:

> Firstly,
>
> This may not be a Solr-related problem. Did you check the Solr log file?
> Tika may run into trouble in some situations, for example when parsing
> HTML that contains a base64-encoded image. If you find the correct logs
> you can detect it. On the other hand, check ManifoldCF as well; there may
> be a problem there too.
>
> On Tuesday, 17 September 2013, Yossi Nachum wrote:
> > Hi,
> >
> > I am trying to index my Windows PC files with ManifoldCF version 1.3 and
> > Solr version 4.4.
> >
> > I created an output connection and a repository connection and started a
> > new job that scans my E drive.
> >
> > Everything seems to work OK, but after a few minutes Solr stops getting
> > new files to index. I can see that in the Tomcat log file.
> >
> > In the ManifoldCF crawler UI I see that the job is still running, but after
> > a few minutes I get the following error:
> > "Error: Repeated service interruptions - failure processing document:
> > Read timed out"
> >
> > I see that the Tomcat process constantly consumes 100% of one CPU (I
> > have two CPUs), even after I get the error message from the ManifoldCF
> > crawler UI.
> >
> > I checked the thread dump in the Solr admin UI and saw that the following
> > threads take the most CPU/user time:
> > "
> > http-8080-3 (32)
> >
> >    - java.io.FileInputStream.readBytes(Native Method)
> >    - java.io.FileInputStream.read(FileInputStream.java:236)
> >    - java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> >    - java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> >    - java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >    - org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
> >    - java.io.FilterInputStream.read(FilterInputStream.java:133)
> >    - org.apache.tika.io.TailStream.read(TailStream.java:117)
> >    - org.apache.tika.io.TailStream.skip(TailStream.java:140)
> >    - org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
> >    - org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
> >    - org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
> >    - org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
> >    - org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >    - org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >    - org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> >    - org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> >    - org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >    - org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >    - org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> >    - org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
> >    - org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
> >    - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
> >    - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
> >    - org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >    - org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >    - org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >    - org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> >    - org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> >    - org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >    - org.apache.catalina.core.StandardEngineValve.invoke(Sta

check which file/document cause solr to work hard

2013-09-17 Thread Yossi Nachum
Hi,

I am trying to index my Windows PC files with ManifoldCF version 1.3 and
Solr version 4.4.

A few minutes after I start the crawler job I see that the Tomcat process
constantly consumes 100% of one CPU (I have two CPUs).

I checked the thread dump in the Solr admin UI and saw that the following
threads take the most CPU/user time:
"
http-8080-3 (32)

   - java.io.FileInputStream.readBytes(Native Method)
   - java.io.FileInputStream.read(FileInputStream.java:236)
   - java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
   - java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
   - java.io.BufferedInputStream.read(BufferedInputStream.java:334)
   - org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
   - java.io.FilterInputStream.read(FilterInputStream.java:133)
   - org.apache.tika.io.TailStream.read(TailStream.java:117)
   - org.apache.tika.io.TailStream.skip(TailStream.java:140)
   - org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
   - org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
   - org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
   - org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
   - org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   - org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   - org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   - org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
   - org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   - org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
   - org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
   - org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
   - org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
   - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
   - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
   - org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
   - org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
   - org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
   - org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
   - org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
   - org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
   - org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
   - org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
   - org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
   - org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
   - org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
   - java.lang.Thread.run(Thread.java:679)

"

How can I check which file causes Tika to work so hard?
I don't see anything in the log files and I am stuck.
Thanks,
Yossi
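
A note on reading the dump above: the frames between the java.io reads and AutoDetectParser sit in org.apache.tika.parser.mp3, which suggests the busy request is a file that Tika is treating as MP3 audio. The sketch below pulls that frame out of a pasted stack automatically. It is a minimal illustration, not part of Solr or Tika: the class name BusyParserFinder and its method are hypothetical, and it assumes frames are listed top of stack first, as in the Solr admin thread dump.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: finds the innermost format-specific Tika parser frame. */
final class BusyParserFinder {

    /**
     * Scans frames from the top of the stack and returns the first one under
     * org.apache.tika.parser that is not a generic dispatching parser.
     */
    static Optional<String> innermostParserFrame(List<String> frames) {
        for (String raw : frames) {
            String frame = raw.trim();
            // Frames copied from the admin UI may carry a leading "- " bullet.
            if (frame.startsWith("- ")) {
                frame = frame.substring(2).trim();
            }
            if (!frame.startsWith("org.apache.tika.parser.")) {
                continue;
            }
            // CompositeParser and AutoDetectParser only delegate to other parsers.
            if (frame.contains("CompositeParser") || frame.contains("AutoDetectParser")) {
                continue;
            }
            return Optional.of(frame);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A shortened copy of the frames posted in this thread.
        List<String> frames = List.of(
            "- java.io.FileInputStream.readBytes(Native Method)",
            "- org.apache.tika.io.TailStream.skip(TailStream.java:140)",
            "- org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)",
            "- org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)",
            "- org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)",
            "- org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)");
        System.out.println(
            innermostParserFrame(frames).orElse("no Tika parser frame found"));
    }
}
```

With the frames from this thread it reports the MpegStream.skipStream frame, so the MP3 files on the crawled drive are the first candidates to check. The dump itself does not name the file.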


Problem indexing windows files

2013-09-17 Thread Yossi Nachum
Hi,

I am trying to index my Windows PC files with ManifoldCF version 1.3 and
Solr version 4.4.

I created an output connection and a repository connection and started a new
job that scans my E drive.

Everything seems to work OK, but after a few minutes Solr stops getting
new files to index. I can see that in the Tomcat log file.

In the ManifoldCF crawler UI I see that the job is still running, but after a
few minutes I get the following error:
"Error: Repeated service interruptions - failure processing document: Read
timed out"

I see that the Tomcat process constantly consumes 100% of one CPU (I
have two CPUs), even after I get the error message from the ManifoldCF
crawler UI.

I checked the thread dump in the Solr admin UI and saw that the following
threads take the most CPU/user time:
"
http-8080-3 (32)

   - java.io.FileInputStream.readBytes(Native Method)
   - java.io.FileInputStream.read(FileInputStream.java:236)
   - java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
   - java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
   - java.io.BufferedInputStream.read(BufferedInputStream.java:334)
   - org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
   - java.io.FilterInputStream.read(FilterInputStream.java:133)
   - org.apache.tika.io.TailStream.read(TailStream.java:117)
   - org.apache.tika.io.TailStream.skip(TailStream.java:140)
   - org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
   - org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
   - org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
   - org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
   - org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   - org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   - org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   - org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
   - org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   - org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
   - org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
   - org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
   - org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
   - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
   - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
   - org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
   - org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
   - org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
   - org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
   - org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
   - org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
   - org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
   - org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
   - org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
   - org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
   - org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
   - java.lang.Thread.run(Thread.java:679)

"

Does anyone know what I can do? How can I debug this issue? How can I check
which file causes Tika to work so hard?
I don't see anything in the log files and I am stuck.
Thanks,
Yossi
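
One way to approach the "which file" question, sketched under assumptions: run the extraction step on each candidate file outside Solr, with a time limit, and list the files that exceed it. The class SlowFileFinder and the parse callback are hypothetical names introduced here. In a real check the callback would wrap whatever extraction call you use (for example Tika's AutoDetectParser); the demo below uses a stand-in that simply sleeps on .mp3 names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

/** Hypothetical helper: reports files whose parse step exceeds a time limit. */
final class SlowFileFinder {

    static List<Path> findSlowFiles(List<Path> files, Consumer<Path> parse, long timeoutMillis) {
        List<Path> slow = new ArrayList<>();
        for (Path file : files) {
            // One daemon worker per file, so a parse that never returns
            // cannot block the next file or keep the JVM alive.
            ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "parse-probe");
                t.setDaemon(true);
                return t;
            });
            Future<?> task = executor.submit(() -> parse.accept(file));
            try {
                task.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                slow.add(file);          // still running after the limit
                task.cancel(true);       // best-effort interrupt
            } catch (ExecutionException e) {
                // The parse failed quickly; report it but do not count it as slow.
                System.err.println("parse failed for " + file + ": " + e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } finally {
                executor.shutdownNow();
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        List<Path> files = List.of(Paths.get("notes.txt"), Paths.get("song.mp3"));
        // Stand-in for a real extraction call: pretends .mp3 files take very long.
        Consumer<Path> fakeParse = file -> {
            if (file.toString().endsWith(".mp3")) {
                try {
                    Thread.sleep(10_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        System.out.println("Slow files: " + findSlowFiles(files, fakeParse, 500));
    }
}
```

The interrupt is only best effort: a parser spinning in a tight loop may ignore it, which is why the workers are daemon threads. Since the dump points at the MP3 parser, starting with the .mp3 files on the crawled drive would keep the candidate list short.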