We have not taken up anything yet. The idea is to create another contrib that will contain the extensions to DIH which have external dependencies, such as SOLR-934. TikaEntityProcessor is something we wish to do, but our limited bandwidth has been the problem.
On Thu, Feb 5, 2009 at 5:15 AM, Chris Harris <rygu...@gmail.com> wrote:
> Back in November, Shalin and Grant were discussing integrating
> DataImportHandler and Tika. Shalin's estimation about the best way to
> do this was as follows:
>
> **
>
> I think the best way would be a TikaEntityProcessor which knows how to
> handle documents. I guess a typical use-case would be
> FileListEntityProcessor->TikaEntityProcessor as parent-child entities.
>
> Also see SOLR-833 which adds a FieldReaderDataSource using which you can
> pass any field's content to an entity for processing. So you can have a
> [SqlEntityProcessor, JdbcDataSource] producing a blob and a
> [FieldReaderDataSource, TikaEntityProcessor] consuming it.
>
> (http://www.nabble.com/DataImportHandler-and-Blobs-td20464891.html)
>
> **
>
> Has there been any work on something like this? Alternatively, has
> anyone else put together an alternative way to get DataImportHandler
> to extract body text from PDFs, Word files, etc.?
>
> Thanks,
> Chris
>

--
--Noble Paul