Thanks for your help!

I made the following changes, as Erlend suggested:

In the Solr output connection, under Documents -> Exclude mime types, I added:
audio/mpeg
audio/mpeg3

In the job, under Paths, I added excludes for \.mp3$ and \.avi$

I started the job and it still got stuck after a few minutes. I checked my Solr log file and saw a file with a 3gp extension, so I excluded that as well. Now the job runs about 10 minutes longer, but it still gets stuck with the same thread dump.

I will check my log to see which other file extensions I need to exclude from the crawl.

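To avoid finding the remaining extensions one stuck job at a time, a rough sketch like the one below could list which extensions on the drive are not yet covered by the excludes. It is only an approximation: the function names and the EXCLUDES list are made up for illustration, Python's regex matching may differ from how ManifoldCF evaluates the patterns, and the case-insensitive flag is an assumption.

```python
import os
import re
from collections import Counter

# Hypothetical exclusion rules mirroring the patterns mentioned in this thread.
# IGNORECASE is an assumption; the crawler may match case-sensitively.
EXCLUDES = [re.compile(p, re.IGNORECASE) for p in (r"\.mp3$", r"\.avi$", r"\.3gp$")]


def is_excluded(path):
    """Return True if any exclusion pattern matches the path."""
    return any(rx.search(path) for rx in EXCLUDES)


def remaining_extensions(root):
    """Count extensions of files under root that no exclusion rule matches."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not is_excluded(full):
                # Files without an extension are grouped under "(none)".
                ext = os.path.splitext(name)[1].lower() or "(none)"
                counts[ext] += 1
    return counts
```

Calling remaining_extensions on the crawled root (for example "E:/") and printing the result of .most_common() gives the extensions still reaching Solr, most frequent first, which should make other media types easy to spot.
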
Yossi

On Mon, Sep 16, 2013 at 2:52 PM, Erlend Garåsen <[email protected]> wrote:

>
> You can also configure Solr to ignore TikaExceptions by adding the
> following to <requestHandler name="/update/extract" ...> in solrconfig.xml:
> <bool name="ignoreTikaException">true</bool>
>
> This will prevent the MCF job from stopping.
>
> For efficiency reasons, I will strongly recommend to filter out all kinds
> of documents you're not interested in, such as media files.
>
> Filtering out media files by filename extension may not be enough, so I
> suggest filtering out mime types as well by adding a few lines in the
> "Excluded mime types" field in Solr Output Connection.
>
> Filtering out mp3s for instance, might be done by adding this to the
> "Exclude from crawl" field:
> \.mp3$
>
> - and the following to the "Excluded mime types" field:
> audio/mpeg
> audio/mpeg3
>
> Erlend
>
>
> On 9/16/13 11:40 AM, Karl Wright wrote:
>
>> I would second what Erlend said.
>>
>> If you nevertheless want to index mp3's, I'd bring this up on the Solr
>> or Tika boards.
>>
>> Karl
>>
>>
>> On Mon, Sep 16, 2013 at 5:15 AM, Erlend Garåsen <[email protected]> wrote:
>>
>>
>>     It seems that Tika is involved and tries to parse large files, i.e.
>>     MP3s.
>>
>>     Do you really need to index such files? If not, try to filter them
>>     out by adding a rule in the "exclude from crawl" field for the
>>     configured job.
>>
>>     Erlend
>>
>>
>>     On 9/16/13 7:13 AM, Yossi Nachum wrote:
>>
>>         Hi,
>>
>>         I am trying to index my windows pc files with manifoldcf version
>>         1.3 and solr version 4.4.
>>
>>         I create output connection and repository connection and started
>>         a new job that scan my E drive.
>>
>>         Everything seems like it work ok but after a few minutes solr stop
>>         getting new, I am seeing that through tomcat log file.
>>
>>         On manifold crawler ui I see that the job is still running but
>>         after few minutes I am getting the following error:
>>         "Error: Repeated service interruptions - failure processing
>>         document: Server at http://localhost:8080/solr/collection1
>>         returned non ok status:500, message:Internal Server Error"
>>
>>         I am seeing that tomcat process is constantly consume 100% of
>>         one cpu (I have two cpu's) even after I get the error message
>>         from manifolfcf crawler ui.
>>
>>         I check the thread dump in solr admin and saw that the following
>>         threads take the most cpu/user time
>>         "
>>         http-8080-3 (32)
>>
>>           * java.io.FileInputStream.readBytes(Native Method)
>>           * java.io.FileInputStream.read(FileInputStream.java:236)
>>           * java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>           * java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>>           * java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>           * org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>>           * java.io.FilterInputStream.read(FilterInputStream.java:133)
>>           * org.apache.tika.io.TailStream.read(TailStream.java:117)
>>           * org.apache.tika.io.TailStream.skip(TailStream.java:140)
>>           * org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
>>           * org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
>>           * org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
>>           * org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>>           * org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>           * org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>           * org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>           * org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>>           * org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>           * org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>           * org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
>>           * org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
>>           * org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
>>           * org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
>>           * org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
>>           * org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>>           * org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>           * org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>>           * org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>>           * org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>>           * org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>>           * org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>>           * org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>>           * org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
>>           * org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
>>           * org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>>           * java.lang.Thread.run(Thread.java:679)
>>
>>         "
>>
>>         does anyone know what can I do? how to debug this issue? I don't
>>         see anything in the log files and I am stuck
>>         Thanks,
>>         Yossi
>>
>
