> I have a PDF handler modeled on the CSVHandler that allows
> you to stream a PDF document to Solr and extract the text and store
> it.
Cool!
Any thoughts on a general framework for going from an unstructured
document -> Lucene document with fields? It feels like utilizing
Apache Tika here would be the way to go (although it's at a really
early stage).
-Yonik
Hmm... So I have PDF, Word, Excel, and PowerPoint handlers, all
separate, and there is a lot of duplication between them. I may
try to pull out the common stuff into some sort of
AbstractRichDocumentHandler, and then just add the special sauce for
each one. I am close to having the basic unit tests, modeled on
CSVHandler, and will post a JIRA issue with it.
I looked for Tika but couldn't find it. What is the URL?
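
To make the AbstractRichDocumentHandler idea a bit more concrete, here
is a rough sketch of the shape it might take. Everything in it is an
assumption for illustration: the class and method names (handle,
contentType, extractText), the field names (id, content_type, text), and
the PlainTextHandler stand-in are hypothetical, and none of it uses the
real Solr request handler API or a real PDF/Word parser. The base class
owns the shared stream-to-fields logic, and each format only supplies
its own text extraction.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class RichDocumentSketch {

    // Hypothetical base class: holds the logic every rich-document
    // handler would otherwise duplicate.
    abstract static class AbstractRichDocumentHandler {

        // Shared step: turn a raw stream into a flat map of field
        // name -> value. A real handler would build a Solr/Lucene
        // document here instead of a Map.
        public final Map<String, String> handle(String id, InputStream in) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", id);
            doc.put("content_type", contentType());
            try {
                doc.put("text", extractText(in).trim());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return doc;
        }

        // The per-format "special sauce".
        protected abstract String contentType();

        protected abstract String extractText(InputStream in) throws IOException;
    }

    // Stand-in subclass that treats the stream as UTF-8 text. A PDF or
    // Word handler would call its parsing library here instead.
    static class PlainTextHandler extends AbstractRichDocumentHandler {
        @Override
        protected String contentType() {
            return "text/plain";
        }

        @Override
        protected String extractText(InputStream in) throws IOException {
            // InputStream.readAllBytes requires Java 9 or later.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "  some streamed document body  ".getBytes(StandardCharsets.UTF_8));
        Map<String, String> doc = new PlainTextHandler().handle("doc1", in);
        System.out.println(doc);
    }
}
```

With this layout, adding a new format would mean writing one small
subclass, and the unit tests for the shared part would only need to be
written once against the base class.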