On 6/28/07, Eric Pugh <[EMAIL PROTECTED]> wrote:
> > I have a PDF handler modeled on the CSVHandler that allows
> > you to stream a PDF document to Solr and extract the text and store
> > it.
>
> Cool!
>
> Any thoughts of a general framework for going from unstructured
> document -> lucene document with fields? It feels like utilizing
> Apache Tika here would be the way to go (although it's in the really
> early stages).
>
> -Yonik
>

Humm... So I have PDF, Word, Excel, and PowerPoint, all as separate handlers, and there is a lot of duplication between them. I may try to pull the common stuff out into some sort of AbstractRichDocumentHandler, and then just add the special sauce for each one. I am close to having the basic unit tests, modeled on CSVHandler, and will post a JIRA issue with it.
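To make the AbstractRichDocumentHandler idea concrete, here is a rough sketch of how the split might look. Nothing here is the actual handler code or Solr API: the class shape, the `toFields` method, the field names `id` and `text`, and the `PlainTextHandler` stand-in are all hypothetical, chosen only to show the shared plumbing in the base class and the format-specific extraction in the subclass.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical base class: the duplicated plumbing lives here once.
abstract class AbstractRichDocumentHandler {

    // The "special sauce": each format turns its raw stream into plain text.
    protected abstract String extractText(InputStream in) throws IOException;

    // Shared path for every format: extract, trim, map to named fields.
    // The field names "id" and "text" are assumptions for illustration.
    public final Map<String, String> toFields(String id, InputStream in) throws IOException {
        String text = extractText(in).trim();
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("text", text);
        return fields;
    }
}

// Stand-in subclass so the sketch runs without a PDF or Office library.
// A real PDF or Word handler would call its parsing library here instead.
class PlainTextHandler extends AbstractRichDocumentHandler {
    @Override
    protected String extractText(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}
```

With this shape, adding a new format would mean writing one `extractText` override; the field mapping and any future common steps stay in one place.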
Another thing to consider is document type/charset/language detection. People may not want to have to hit a different URL for each different type of document.
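As a rough illustration of the type-detection part only, a single endpoint could sniff the leading bytes of the upload and dispatch on the result. This is a minimal sketch, not how Tika or Solr does it: the `TypeSniffer` class and its return labels are hypothetical. The two signatures themselves are standard: PDF files begin with `%PDF-`, and legacy Word, Excel, and PowerPoint files share the OLE2 compound-file header, so telling those three apart would need a look inside the container. Charset and language detection are not covered here.

```java
import java.util.Arrays;

// Hypothetical helper: guesses a document family from its first bytes.
final class TypeSniffer {
    // PDF files start with the ASCII bytes "%PDF-".
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};

    // OLE2 compound file header, shared by legacy .doc, .xls and .ppt.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    private TypeSniffer() {}

    // Returns "pdf", "ole2", or "unknown" for the given leading bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, OLE2)) return "ole2";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```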
I looked for Tika but didn't see it. What is the URL?
It's *really* early (entered the incubator in March):
http://incubator.apache.org/tika/
http://www.nabble.com/Apache-Tika---Development-f20913.html

-Yonik