Re: Help with a DIH config file

Tim Allison Fri, 15 Mar 2019 05:37:14 -0700

Haha, looks like Jörn just answered this: set onError="skip" or
onError="continue" on the entity.


>greatly preferable if the indexing process could ignore exceptions
Please, no.  I'm 100% behind the sentiment that DIH should gracefully
handle Tika exceptions, but the better option is to log the
exceptions, store the stacktraces and report your high priority
problems to Apache Tika and/or its dependencies so that we can fix
them.  Try running tika-eval[0] against a subset of your docs,
perhaps.
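
To make the "log the exceptions, store the stacktraces" suggestion concrete, here is a minimal sketch in Python of a client-side loop that keeps going past bad documents and records what failed. It is an illustration only: `extract_all` and the `extract` callable are hypothetical names, and `extract` stands in for whatever Tika invocation you actually use. Nothing here is DIH or Tika API.

```python
import traceback
from typing import Callable, Dict, Iterable, List, Tuple


def extract_all(
    paths: Iterable[str],
    extract: Callable[[str], str],  # hypothetical: wraps your real extraction call
) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Run `extract` on every path, recording failures instead of stopping."""
    texts: Dict[str, str] = {}
    failures: List[Dict[str, str]] = []
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:
            # Deliberately broad: one bad file must not stop the whole run.
            failures.append({
                "path": path,
                "error": f"{type(exc).__name__}: {exc}",
                "stacktrace": traceback.format_exc(),
            })
    return texts, failures
```

The `failures` list can be written out (for example as JSON) and grouped by error type, which makes it easier to see which problems are worth reporting upstream.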

That said, DIH's integration with Tika is not intended for robust
production use.  It is intended to get people up to speed quickly and,
effectively, for demo purposes.  I recognize that it is being used in
production around the world, but it really shouldn't be.

See Erick Erickson's[1]:
>But, I wouldn’t really recommend that you just ship the docs to Solr, I’d 
>recommend that you build a little program to do the extraction on one or more 
>clients, the details of why are here:

>https://lucidworks.com/2012/02/14/indexing-with-solrj/

[0] https://wiki.apache.org/tika/TikaEval
[1] https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201903.mbox/ajax/%3CF2034803-D4A8-48E1-889A-DA9E44961EE6%40gmail.com%3E

On Fri, Mar 15, 2019 at 7:44 AM Demian Katz <demian.k...@villanova.edu> wrote:
>
> Jörn (and anyone else with more experience with this than I have),
>
> I've been working with Whitney on this issue. It is a PDF file, and it can be 
> opened successfully in a PDF reader. Interestingly, if I try to extract data 
> from it on the command line, Tika version 1.3 throws a lot of warnings but 
> does successfully extract data, but several newer versions, including 1.17 
> and 1.20 (haven't tested other intermediate versions) encounter a fatal error 
> and extract nothing. So this seems like something that used to work but has 
> stopped. Unfortunately, we haven't been able to find a way to downgrade to an 
> old enough Tika in her Solr installation to work around the problem that way.
>
> The bigger question, though, is whether there's a way to allow the DIH to 
> simply ignore errors and keep going. Whitney needs to index several terabytes 
> of arbitrary documents for her project, and at this scale, she can't afford 
> the time to stop and manually intervene for every strange document that 
> happens to be in the collection. It would be greatly preferable if the 
> indexing process could ignore exceptions and proceed, rather than stopping 
> dead at the first problem. (I'm also pretty sure that Whitney is already 
> using the ignoreTikaException attribute in her configuration, but it doesn't 
> seem to help in this instance).
>
> Any suggestions would be greatly appreciated!
>
> thanks,
> Demian
>
> -----Original Message-----
> From: Jörn Franke <jornfra...@gmail.com>
> Sent: Friday, March 15, 2019 4:18 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Help with a DIH config file
>
> Do you have an exception?
> It could be that the PDF is broken - can you open it on your computer with a 
> PDF reader?
>
> If the exception is related to Tika and PDF then file an issue with the 
> PDFBox project. If there is an issue with Tika and MS Office documents then 
> Apache POI is the right project to ask.
>
> > Am 15.03.2019 um 03:41 schrieb wclarke <wcla...@widernet.org>:
> >
> > Thank you so much.  You helped a great deal.  I am running into one
> > last issue where the Tika DIH is stopping at a specific language and
> > fails there (Malayalam).  Do you know of a workaround?
> >
> >
> >
> > --
> > Sent from:
> > https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Flucen
> > e.472066.n3.nabble.com%2FSolr-User-f472068.html&amp;data=02%7C01%7Cdem
> > ian.katz%40villanova.edu%7Ca54d5daee7b14648442908d6a91f9bf6%7C765a8de5
> > cf9444f09cafae5bf8cfa366%7C0%7C0%7C636882350564627071&amp;sdata=NpddZY
> > 2sHKJHAR8V%2BIlMt4j1i3oy94KP9%2Btp1EQ2xM4%3D&amp;reserved=0
