On 17/12/2013 15:29, Augusto Camarotti wrote:
Hi guys,
    I'm having a problem with Solr when trying to index some broken .doc
files.
    I have set up a test case using Solr to index all the files that
users save in the shared directories of the company I work for, and
Solr hangs when trying to index one file in particular (the one I'm
attaching to this e-mail). There are other broken .doc files that
Solr indexes by name without a problem, even though it logs some Tika
errors during the process, but when it reaches this particular file it
hangs and I have to cancel the upload.
    I cannot guarantee the directories will never hold a broken .doc
file, or a broken file with some other extension, so I guess Solr could
just return a failure message, or something like that.
    These are the log messages Solr is recording:
03:38:23        ERROR   SolrCore        org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.microsoft.OfficeParser@386f9474
03:38:25        ERROR   SolrDispatchFilter
null:org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.microsoft.OfficeParser@386f9474

So, how do I prevent Solr from hanging when trying to index broken files?
Regards,
Augusto Camarotti

We don't like to run Tika from within Solr ourselves, as it has been known to barf, especially on large PDF files (yes, there are such horrors as 3000-page PDFs!). We usually run it in an external process so it can be watched and killed if necessary.
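
As a rough illustration of the "watch and kill" idea, here is a minimal Python sketch that runs an extraction command as a child process with a timeout. It is only a sketch under assumptions: the function names `extract_text` and `tika_command`, the 60-second default, and the `tika-app.jar` location are all made up for the example, and it assumes the standalone Tika app jar is invoked with its `--text` option. The extracted text would then be sent to Solr as an ordinary field by whatever indexing client you already use.

```python
import subprocess
import sys


def extract_text(command, timeout_seconds=60):
    """Run an extraction command in a child process.

    Returns the command's standard output as text, or None if the
    command times out or exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this,
        # so a hung parser cannot block the indexing job.
        return None
    if result.returncode != 0:
        # The parser failed on this file; report it and move on.
        return None
    return result.stdout.decode("utf-8", errors="replace")


def tika_command(path, tika_jar="tika-app.jar"):
    """Build a command line for the standalone Tika app.

    The jar location is an assumption; adjust it to your install.
    """
    return ["java", "-jar", tika_jar, "--text", path]


if __name__ == "__main__":
    # Usage: python extract.py some-file.doc
    text = extract_text(tika_command(sys.argv[1]))
    if text is None:
        print("skipped (parser failed or timed out):", sys.argv[1])
    else:
        print(text)
```

A file that makes the parser hang then costs at most the timeout, and the caller can log the path and carry on with the next document.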

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
