>    1) the toughest pdfs to identify are those that are partly
    searchable (text) and partly not (image-based text).  However, I've
    found that such documents tend to exist in clusters.
Agreed.  We should do something better in Tika to identify image-only pages on 
a page-by-page basis, and then ship those with very little text to tesseract.  
We don't currently do this.
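
For what it's worth, a rough sketch of the page-level heuristic I have in mind, 
using PDFBox directly (which is what Tika's PDF parser wraps); the 10-character 
threshold is arbitrary and would need tuning per collection:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ImageOnlyPageFinder {

    // Pages with fewer extractable characters than this are treated as
    // image-only.  The threshold is a guess; tune it for your collection.
    private static final int MIN_CHARS = 10;

    public static List<Integer> findImageOnlyPages(File pdf) throws IOException {
        List<Integer> imageOnlyPages = new ArrayList<>();
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            for (int page = 1; page <= doc.getNumberOfPages(); page++) {
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                String text = stripper.getText(doc).trim();
                if (text.length() < MIN_CHARS) {
                    // Candidate for OCR -- render this page and hand it to tesseract.
                    imageOnlyPages.add(page);
                }
            }
        }
        return imageOnlyPages;
    }
}
```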

>    3) I have indexed other repositories and noticed some silent
    failures (mostly for large .doc documents).  Wish there was some way
    to log these errors so it would be obvious what documents have been
    excluded.
Agreed on the Solr side.  You can run `java -jar tika-app.jar -J -t -i 
<input_dir> -o <output_dir>` and then run tika-eval on the <output_dir> to 
count exceptions, including exceptions in embedded documents, which are 
currently silently ignored. ☹
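
If you want failures logged on your side in the meantime, a minimal sketch 
with Tika's AutoDetectParser looks roughly like this (System.err is just a 
placeholder; swap in whatever logging you use):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class LoggingExtractor {

    private final AutoDetectParser parser = new AutoDetectParser();

    /** Returns extracted text, or null if the parse failed (and logs why). */
    public String extractOrLog(Path file) {
        // -1 turns off the default write limit on extracted characters
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(file)) {
            parser.parse(stream, handler, metadata, new ParseContext());
            return handler.toString();
        } catch (Exception e) {
            // This is where the "silent" failures become visible.
            System.err.println("FAILED: " + file + " : " + e);
            return null;
        }
    }
}
```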

>   4) I still don't understand the use of tika.eval - is that an
    application that you run against a collection or what?
Currently, it is set up to run against a directory of extracts (text+metadata 
extracted from pdfs/word/etc).  It will give you info about the number of 
exceptions, language id, and some other statistics that can help you get a 
sense of how well content extraction worked.  It wouldn't take much to add an 
adapter that runs the same content statistics against a Solr index.
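
To give a sense of what such an adapter might look like, here's a bare-bones 
sketch with SolrJ; the core URL and the "content" field name are assumptions 
about your setup, and tika-eval computes far more than the two numbers shown:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrContentProfiler {

    public static void main(String[] args) throws Exception {
        // Core URL and field name "content" are assumptions; adjust for your schema.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.setFields("id", "content");
            query.setRows(1000);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                String text = (String) doc.getFirstValue("content");
                int length = text == null ? 0 : text.length();
                int tokens = text == null ? 0 : text.trim().split("\\s+").length;
                // tika-eval computes much more (lang id, common-word counts, etc.);
                // this just shows where those calculations would plug in.
                System.out.println(doc.getFirstValue("id") + "\t" + length + "\t" + tokens);
            }
        }
    }
}
```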

>    5) I've seen reference to tika-server - but I have no idea on how
    that tool might be usefully applied.
We have to harden it, but the benefit is that you isolate the Tika process in 
its own JVM so that it can't harm Solr.  By harden, I mean we need to spawn a 
child process and set up a parent process that will kill and restart it on OOM 
or a permanent hang.  We don't have that yet.  Tika very rarely runs into 
serious, show-stopping problems (kill -9 just might solve your problem).  If 
you only have a few tens of thousands of docs, you aren't likely to run into 
these problems.  If you're processing a few million, especially noisy things 
that come off the internet, you're more likely to run into these kinds of 
problems.
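
Very roughly, the watchdog piece would look something like this; the 
tika-server command line is only illustrative, and a real version would also 
need a liveness ping to catch permanent hangs rather than just process death:

```java
import java.util.concurrent.TimeUnit;

public class TikaServerWatchdog {

    public static void main(String[] args) throws Exception {
        while (true) {
            // Fork tika-server into its own JVM so a bad file can't take down Solr.
            // The jar path, heap size and port here are placeholders.
            Process child = new ProcessBuilder(
                    "java", "-Xmx1g", "-jar", "tika-server.jar", "-p", "9998")
                    .inheritIO()
                    .start();

            // Wait for the child to die (OOM, crash, kill -9...).  A real
            // watchdog would also ping the server and restart on a permanent hang.
            while (child.isAlive()) {
                child.waitFor(10, TimeUnit.SECONDS);
            }

            System.err.println("tika-server exited with " + child.exitValue()
                    + "; restarting");
            Thread.sleep(1000);
        }
    }
}
```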

>    6) Adobe Acrobat Pro apparently has a batch mode suitable for
    flagging unsearchable (that is, image-based) pdf files and fixing them.
 Great.  If you have commercial tools available, use them.  IMHO, we have a 
ways to go on our OCR integration with PDFs.

>    7) Another problem I've encountered is documents that are themselves
    a composite of other documents (like an email thread).  The problem
    is that a hit on such a document doesn't tell you much about the
    true relevance of each contained document.  You have to do a
    laborious manual search to figure it out.
Agreed.  Concordance search can be useful for making sense of large documents: 
<self_promotion> https://github.com/mitre/rhapsode </self_promotion>.  The 
other thing that can be useful for handling genuine attachments (PDFs inside 
of email) is to treat the embedded docs as their own standalone/child docs 
(see the github link above and SOLR-7229).
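
On the Tika side, the RecursiveParserWrapper hands back one text+metadata 
object per embedded document, which you could then index as standalone/child 
docs; a rough sketch (handler type and write limits are up to you):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

public class EmbeddedDocExtractor {

    /** One Metadata per document: index 0 is the container, the rest are attachments. */
    public static List<Metadata> parseAll(Path file) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        try (InputStream stream = Files.newInputStream(file)) {
            wrapper.parse(stream, handler, new Metadata(), new ParseContext());
        }
        // Each Metadata carries its extracted text under the "X-TIKA:content" key;
        // index each one as its own standalone/child doc in Solr.
        return handler.getMetadataList();
    }
}
```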


>    8) Is there a way to return the size of a matching document (which,
    I think, would help identify non-searchable/image documents)?
Not that I'm aware of, but that's one of the stats calculated by tika-eval: 
length of the extracted string, number of tokens, number of alphabetic tokens, 
number of "common words" (I took the top 20k most common words from Wikipedia 
dumps per language)...and others.
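
If you want a quick do-it-yourself version of those statistics outside of 
tika-eval (whose tokenization and word lists are more careful than this), 
something like the following would do; the common-word set is whatever list 
you load:

```java
import java.util.Set;

public class ExtractStats {

    public final int length;
    public final int numTokens;
    public final int numAlphaTokens;
    public final int numCommonWords;

    /** commonWords = e.g. the top 20k Wikipedia words for the doc's language. */
    public ExtractStats(String text, Set<String> commonWords) {
        this.length = text.length();
        String[] tokens = text.toLowerCase().split("\\s+");
        this.numTokens = tokens.length;
        int alpha = 0;
        int common = 0;
        for (String t : tokens) {
            if (!t.isEmpty() && t.chars().allMatch(Character::isLetter)) {
                alpha++;
            }
            if (commonWords.contains(t)) {
                common++;
            }
        }
        this.numAlphaTokens = alpha;
        this.numCommonWords = common;
    }
}
```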

Cheers,

            Tim
