>  I have a PDF handler modeled on the CSVHandler that allows
> you to stream a PDF document to Solr and extract the text and store
> it.

Cool!

Any thoughts on a general framework for going from unstructured
document -> Lucene document with fields?  It feels like utilizing
Apache Tika here would be the way to go (although it's in the really
early stages).

-Yonik

Humm...  So I have PDF, Word, Excel, and PowerPoint, all as separate
handlers.  And there is a lot of duplication between them...  I may
try and pull out the common stuff into some sort of
AbstractRichDocumentHandler, and then just add the special sauce for
each one.   I am close to having the basic unit tests, modeled on
CSVHandler, and will post a JIRA issue with it.
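
A rough sketch of how that split might look, just to make the idea
concrete. Everything except the AbstractRichDocumentHandler name is
hypothetical: it uses no actual Solr or PDF classes, the field names
are made up, and PlainTextHandler is only a stand-in for a real
PDF/Word/Excel/PowerPoint extractor.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical base class: the steps shared by every rich-document handler.
abstract class AbstractRichDocumentHandler {

    // Common flow: read the stream, extract text, build the field map.
    // The field names ("id", "format", "text") are assumptions for illustration.
    public final Map<String, String> handle(String id, InputStream in) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("format", formatName());
        try {
            fields.put("text", extractText(in).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }

    // The "special sauce": each format supplies its own name and extraction.
    protected abstract String formatName();

    protected abstract String extractText(InputStream in) throws IOException;
}

// Stand-in subclass; a real PDF or Word handler would call its parser here.
class PlainTextHandler extends AbstractRichDocumentHandler {
    @Override
    protected String formatName() {
        return "text";
    }

    @Override
    protected String extractText(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}
```

With that shape, each new format would only need to implement
formatName() and extractText(); the stream handling and field
building stay in one place.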

I looked for Tika but couldn't find it. What is the URL?
