Jörn (and anyone else with more experience with this than I have),

I've been working with Whitney on this issue. The problem file is a PDF, and it
opens successfully in a PDF reader. Interestingly, when I try to extract data
from it on the command line, Tika 1.3 throws a lot of warnings but does extract
the data, whereas newer versions, including 1.17 and 1.20 (I haven't tested the
versions in between), hit a fatal error and extract nothing. So this looks like
something that used to work and has since stopped. Unfortunately, we haven't
found a way to downgrade Tika in her Solr installation to a version old enough
to work around the problem that way.

The bigger question, though, is whether there's a way to allow the DIH to 
simply ignore errors and keep going. Whitney needs to index several terabytes 
of arbitrary documents for her project, and at this scale, she can't afford the 
time to stop and manually intervene for every strange document that happens to 
be in the collection. It would be far better for the indexing process to
ignore exceptions and carry on than to stop dead at the first problem. (I'm
also pretty sure that Whitney is already using the ignoreTikaException
attribute in her configuration, but it doesn't seem to help in this instance.)
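
For reference, the kind of "skip and keep going" behavior described above is
usually requested in DIH through the entity-level onError attribute, which
accepts abort (the default), skip, or continue. The fragment below is only a
minimal sketch and not Whitney's actual configuration: the baseDir path, the
entity names, and the field names are hypothetical placeholders, and it has not
been tested against the problem PDF. It also assumes the failure surfaces as an
ordinary exception inside the Tika entity; an error thrown at a lower level
might still abort the import.

```xml
<dataConfig>
  <!-- Binary data source that hands file contents to the Tika entity. -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity: walks the directory tree and emits one row per file.
         baseDir is a hypothetical placeholder path. -->
    <entity name="file" processor="FileListEntityProcessor"
            dataSource="null" baseDir="/path/to/documents"
            fileName=".*" recursive="true" rootEntity="false">
      <field column="fileAbsolutePath" name="id"/>
      <!-- Inner entity: extracts text with Tika. onError="skip" asks DIH to
           drop the current document on an error instead of aborting. -->
      <entity name="doc" processor="TikaEntityProcessor"
              dataSource="bin" url="${file.fileAbsolutePath}"
              format="text" onError="skip">
        <!-- "content" is a hypothetical field name in the Solr schema. -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With onError="skip" the failing document is left out of the index, while
onError="continue" would carry on as if the error had not happened; which of
the two is appropriate depends on whether partially extracted documents are
acceptable.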

Any suggestions would be greatly appreciated!

thanks,
Demian

-----Original Message-----
From: Jörn Franke <jornfra...@gmail.com> 
Sent: Friday, March 15, 2019 4:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Help with a DIH config file

Do you have an exception?
It could be that the PDF is broken - can you open it on your computer with a
PDF reader?

If the exception is related to Tika and PDF, then file an issue with the PDFBox
project. If the issue is with Tika and MS Office documents, then Apache POI
is the right project to ask.

> On 15.03.2019 at 03:41, wclarke <wcla...@widernet.org> wrote:
> 
> Thank you so much.  You helped a great deal.  I am running into one
> last issue where the Tika DIH stops at a specific language
> (Malayalam) and fails there.  Do you know of a workaround?
> 
> 
> 
