Re: Problem indexing windows files

Yossi Nachum Mon, 16 Sep 2013 13:15:54 -0700

Thanks for your help!
I made the following changes, as Erlend suggested:
In the Solr output connection, I added to Documents -> Exclude mime types:
"audio/mpeg
audio/mpeg3"
In the job edit -> Paths, I added excludes for \.mp3$ and \.avi$
I started the job and it still got stuck after a few minutes.
I checked my Solr log file and saw a 3gp extension, so I excluded that too.
Now the job runs 10 minutes longer but still gets stuck with the same thread dump.
I will check my log to see which other file extensions I need to exclude
from the search
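
Rather than discovering extensions one at a time from the Solr log, one option is to survey the drive up front. The following is a minimal sketch, not part of ManifoldCF or Solr: the function names and the MEDIA_EXTENSIONS list are hypothetical, and it assumes the job accepts patterns in the same \.ext$ style quoted in this thread.

```python
import os
import re
import sys
from collections import Counter

# Hypothetical starting list of extensions to treat as media; extend as needed.
MEDIA_EXTENSIONS = {".mp3", ".avi", ".3gp", ".mp4", ".wav", ".wmv", ".mkv", ".flac"}


def count_extensions(root):
    """Walk the tree under root and count files per lower-cased extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            counts[ext] += 1
    return counts


def exclude_patterns(counts, media=MEDIA_EXTENSIONS):
    """Build sorted regex strings for the media extensions actually found."""
    # re.escape turns ".mp3" into an escaped form so the dot is literal.
    return sorted(re.escape(ext) + "$" for ext in counts if ext in media)


if __name__ == "__main__":
    # Usage: python survey_extensions.py E:\
    found = count_extensions(sys.argv[1])
    for ext, n in found.most_common():
        print(ext or "(no extension)", n)
    print("Candidate exclude patterns:")
    for pattern in exclude_patterns(found):
        print(pattern)
```

The full extension count is printed as well, so that unexpected large or binary formats outside the hypothetical media list can be spotted and added by hand.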


Yossi



On Mon, Sep 16, 2013 at 2:52 PM, Erlend Garåsen <[email protected]> wrote:

>
> You can also configure Solr to ignore TikaExceptions by adding the
> following to <requestHandler name="/update/extract" ...> in solrconfig.xml:
> <bool name="ignoreTikaException">true</bool>
>
> This will prevent the MCF job from stopping.
>
> For efficiency reasons, I strongly recommend filtering out all kinds
> of documents you're not interested in, such as media files.
>
> Filtering out media files by filename extension may not be enough, so I
> suggest filtering out mime types as well by adding a few lines in the
> "Excluded mime types" field in Solr Output Connection.
>
> Filtering out mp3s, for instance, might be done by adding this to the
> "Exclude from crawl" field:
> \.mp3$
>
> - and the following to the "Excluded mime types" field:
> audio/mpeg
> audio/mpeg3
>
> Erlend
>
>
> On 9/16/13 11:40 AM, Karl Wright wrote:
>
>> I would second what Erlend said.
>>
>> If you nevertheless want to index mp3's, I'd bring this up on the Solr
>> or Tika boards.
>>
>> Karl
>>
>>
>>
>> On Mon, Sep 16, 2013 at 5:15 AM, Erlend Garåsen <[email protected]> wrote:
>>
>>
>>     It seems that Tika is involved and tries to parse large files, e.g.
>>     MP3s.
>>
>>     Do you really need to index such files? If not, try to filter them
>>     out by adding a rule in the "exclude from crawl" field for the
>>     configured job.
>>
>>     Erlend
>>
>>
>>     On 9/16/13 7:13 AM, Yossi Nachum wrote:
>>
>>         Hi,
>>
>>         I am trying to index the files on my Windows PC with ManifoldCF
>>         version 1.3 and Solr version 4.4.
>>
>>         I created an output connection and a repository connection and
>>         started a new job that scans my E drive.
>>
>>         Everything seems to work OK, but after a few minutes Solr stops
>>         receiving new documents; I can see that in the Tomcat log file.
>>
>>         In the ManifoldCF crawler UI I see that the job is still running,
>>         but after a few minutes I get the following error:
>>         "Error: Repeated service interruptions - failure processing
>>         document: Server at http://localhost:8080/solr/collection1
>>         returned non ok status:500, message:Internal Server Error"
>>
>>         I see that the Tomcat process constantly consumes 100% of one CPU
>>         (I have two CPUs), even after I get the error message from the
>>         ManifoldCF crawler UI.
>>
>>         I checked the thread dump in the Solr admin UI and saw that the
>>         following threads take the most CPU/user time:
>>         "
>>         http-8080-3 (32)
>>
>>            * java.io.FileInputStream.readBytes(Native Method)
>>            * java.io.FileInputStream.read(FileInputStream.java:236)
>>            * java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>            * java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>>            * java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>            * org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>>            * java.io.FilterInputStream.read(FilterInputStream.java:133)
>>            * org.apache.tika.io.TailStream.read(TailStream.java:117)
>>            * org.apache.tika.io.TailStream.skip(TailStream.java:140)
>>            * org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
>>            * org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
>>            * org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
>>            * org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>>            * org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>            * org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>            * org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>            * org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>>            * org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>            * org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>            * org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
>>            * org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
>>            * org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
>>            * org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
>>            * org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
>>            * org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>>            * org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>            * org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>>            * org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>>            * org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>>            * org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>>            * org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>>            * org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>>            * org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
>>            * org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
>>            * org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>>            * java.lang.Thread.run(Thread.java:679)
>>
>>
>>
>>         "
>>
>>         Does anyone know what I can do, or how to debug this issue? I don't
>>         see anything in the log files and I am stuck.
>>         Thanks,
>>         Yossi
>>
>>
>>
>>
>>
>>
>
