On 18/12/2013 09:03, Alexandre Rafalovitch wrote:
Charlie,

Does that mean you are talking to it from a client program? Or are you
running Tika in a listen/server mode and building some adapters for standard
Solr processes?

If we're writing indexers in Python we usually run Tika as a server - which means we can try to restart it if it fails to respond, usually because it's eaten something that disagreed with it! We'd then submit the extracted text to Solr.
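
As a rough illustration of that pattern (a sketch, not our production code): the snippet below assumes a Tika server listening on its default port 9998 and accepting a PUT to /tika, plus a Solr core reachable at a made-up URL. The `restart` callable, the core name "docs" and the field names "id" and "text" are all hypothetical and would depend on your own setup and schema.

```python
import http.client
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"              # tika-server default port (assumed)
SOLR_URL = "http://localhost:8983/solr/docs/update"  # hypothetical core name "docs"


def extract_text(data, url=TIKA_URL, timeout=60):
    """PUT raw document bytes to the Tika server and return the plain text."""
    req = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_with_restart(data, extract=extract_text, restart=None, retries=1):
    """Return extracted text, or None if the document keeps failing.

    After every failed attempt the caller-supplied restart callable is
    invoked, so the server is fresh for the retry and for the next document.
    """
    for _ in range(retries + 1):
        try:
            return extract(data)
        except (OSError, http.client.HTTPException):
            # Covers timeouts, refused connections and HTTP error statuses.
            if restart is not None:
                restart()
    return None


def post_to_solr(doc_id, text, url=SOLR_URL, timeout=30):
    """Send one document to Solr's JSON update handler and commit."""
    # Field names "id" and "text" are assumptions about the schema.
    body = json.dumps([{"id": doc_id, "text": text}]).encode("utf-8")
    req = urllib.request.Request(
        url + "?commit=true", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

A document that still fails after the retry comes back as None, so the indexer can log it and move on to the next file instead of hanging.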

Regards

Charlie

Regards,
    Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Dec 18, 2013 at 3:47 PM, Charlie Hull <char...@flax.co.uk> wrote:

On 17/12/2013 15:29, Augusto Camarotti wrote:

Hi guys,
     I'm having a problem with Solr when trying to index some broken .doc
files.
     I have set up a test case using Solr to index all the files that
users save in the shared directories of the company I work for, and
Solr hangs when trying to index one file in particular (the one I'm
attaching to this e-mail). There are other broken .doc files that
Solr indexes by name without a problem, even logging some Tika errors
during the process, but when it reaches this particular file, it
hangs and I have to cancel the upload.
     I cannot guarantee the directories will never hold a broken .doc
file, or a broken file with some other extension, so I guess Solr could
just return a failure message, or something like that.
     These are the log messages Solr is recording:
03:38:23        ERROR   SolrCore        org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@386f9474
03:38:25        ERROR   SolrDispatchFilter      null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@386f9474

So, how do I prevent Solr from hanging when trying to index broken files?
Regards,
Augusto Camarotti


We don't like to run Tika from within Solr ourselves, as it has been known
to barf (especially on large PDF files, yes there are such horrors as 3000
page PDFs!). We usually run it in an external process so it can be watched
and killed if necessary.
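
A minimal sketch of the "watched and killed" idea, assuming the Tika command-line app is used with its --text option. The jar location and the two-minute limit are placeholders, not recommendations.

```python
import subprocess

# Hypothetical jar location; --text asks tika-app for plain-text output.
TIKA_APP = ["java", "-jar", "tika-app.jar", "--text"]


def run_extractor(path, cmd=TIKA_APP, timeout=120):
    """Run the extractor on one file in a child process.

    Returns the extracted text, or None if the child timed out or exited
    with a non-zero status.
    """
    try:
        result = subprocess.run(cmd + [path], capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Because each file gets its own process, a parser that hangs or runs out of memory on a broken .doc only costs that one document; the caller sees None and carries on.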

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk




