On 6/28/07, Eric Pugh <[EMAIL PROTECTED]> wrote:
> >  I have a PDF handler modeled on the CSVHandler that allows
> > you to stream a PDF document to Solr and extract the text and store
> > it.
>
> Cool!
>
> Any thoughts of a general framework for going from unstructured
> document -> lucene document with fields?  It feels like utilizing
> Apache Tika here would be the way to go (although it's in the really
> early stages).
>
> -Yonik
>
Hmm... So I have PDF, Word, Excel, and PowerPoint handlers, all
separate, and there is a lot of duplication between them. I may try
to pull out the common stuff into some sort of
AbstractRichDocumentHandler, and then just add the special sauce for
each one. I am close to having the basic unit tests, modeled on
CSVHandler, and will post a JIRA issue with them.
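
A rough sketch of what pulling out the common stuff could look like,
using a template method. Only the name AbstractRichDocumentHandler
comes from this thread; the methods, the field names ("id", "type",
"text"), and the PlainTextHandler stand-in are assumptions made for
illustration, and nothing here touches the real Solr handler API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical base class: the shared plumbing lives here once.
abstract class AbstractRichDocumentHandler {

    // Template method: every format goes through the same steps,
    // and only extractText() differs per document type.
    final Map<String, String> handle(String id, InputStream in) {
        String text;
        try {
            text = extractText(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", id);           // assumed field name
        doc.put("type", typeName()); // assumed field name
        doc.put("text", text.trim());
        return doc;
    }

    // The "special sauce" each concrete handler supplies.
    abstract String typeName();

    abstract String extractText(InputStream in) throws IOException;
}

// Stand-in subclass. A PDF or Word handler would call its parsing
// library inside extractText() instead of decoding raw bytes.
class PlainTextHandler extends AbstractRichDocumentHandler {
    @Override
    String typeName() {
        return "text";
    }

    @Override
    String extractText(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}
```

With this shape, each new format costs one small subclass, and the
unit tests for the shared steps only need to be written once.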

Another thing to consider is document type/charset/language detection.
People may not want to have to hit a different URL for each different
type of document.
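
One way a single URL could work is to sniff the first bytes of the
stream and dispatch on the result. The sketch below is only an
illustration: the class and method names are made up. The two
signatures are the standard ones: PDF files begin with "%PDF-", and
the binary Word, Excel, and PowerPoint formats all share the OLE2
compound-file header. That shared header means magic bytes alone
cannot tell those three apart.

```java
// Hypothetical helper, not part of Solr or Tika.
class DocumentTypeSniffer {

    // "%PDF-" at the start of the file.
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};

    // OLE2 compound-file signature shared by .doc, .xls, and .ppt.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Classify a document from its leading bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) {
            return "pdf";
        }
        if (startsWith(head, OLE2)) {
            return "ole2"; // Word, Excel, or PowerPoint; needs a deeper look
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Charset and language detection are harder problems and are left out
of this sketch.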

I looked for Tika but didn't see it. What is the URL?

It's *really* early (entered the incubator in March)
http://incubator.apache.org/tika/
http://www.nabble.com/Apache-Tika---Development-f20913.html


-Yonik
