It appears that this is simpler than I thought: in Solr 4.4, at least, there is a dataSource class named "FieldStreamDataSource" that I can use directly with the TikaEntityProcessor. Given a BLOB column named DOCIMAGE, I can use the following Tika entity:

<dataSource type="FieldStreamDataSource" name="fieldstream"/>
...
<entity name="tika" processor="TikaEntityProcessor" dataField="outer.DOCIMAGE" dataSource="fieldstream" format="xml">
  <!-- Do appropriate mapping here; meta="true" means it is a metadata field -->
  <field column="Author" meta="true" name="xmauthor"/>
  <field column="title" meta="true" name="title"/>
  <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately -->
  <field column="text" name="content"/>
  <field column="content_type" name="content_type" meta="true"/>
  <field column="last_modified" name="last_modified" meta="true"/>
</entity>

This gives me the document text, plus the extracted title and author, as expected. What I haven't been able to do is extract content_type and last_modified. last_modified may not be possible unless there is an in-document property, but content_type should be detected by the parser. My best guess is that it is simply called something else, although content_type (and last_modified) are the names used by ExtractingRequestHandler.

On Tue, Jul 30, 2013 at 9:49 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> There's no BlobTransformer in DataImportHandler. You'll have to write one.
> Also, you'd probably need to write a FieldInputStreamDataSource instead of
> FieldReaderDataSource.
>
> On Tue, Jul 30, 2013 at 12:30 PM, Raymond Wiker <rwi...@gmail.com> wrote:
>
> > I have a case where I want to index documents and metadata content from a
> > database. The metadata is not a problem, but it does not appear that I
> > can handle the document content (held as BLOBs in the database) with
> > out-of-the-box Solr 4.4 functionality.
> >
> > I was hoping to be able to solve this by doing something like the
> > following:
> >
> > *DataImportHandler* extracts all the columns (fields), including the
> > document (BLOB)
> >
> > *BlobTransformer* to extract the BLOB content
> >
> > *FieldReaderDataSource* as a bridge between the extracted BLOB and Tika
> >
> > *TikaEntityProcessor* to extract the text and embedded metadata from the
> > BLOB.
> >
> > The first problem is that "BlobTransformer" does not appear to exist. It
> > could be that I need to load some additional jar files, or it could be that
> > the "BlobTransformer" functionality is simply not part of the Solr
> > distribution.
> >
> > Is there a way of handling this type of content using DataImportHandler, or
> > do I need to write an external connector for it?
>
> --
> Regards,
> Shalin Shekhar Mangar.
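
For reference, here is roughly how the Tika entity at the top of this message could sit inside a complete data-config.xml. This is only a sketch under stated assumptions: the JDBC driver class, URL, credentials, table name, ID column and query are hypothetical placeholders, and the outer entity is named "outer" only because dataField="outer.DOCIMAGE" refers to it. Only the "fieldstream" data source and the "tika" entity are taken from the configuration above. The commented-out Content-Type mapping is an untested guess, based on "Content-Type" being the key Tika itself uses for the detected MIME type.

```xml
<dataConfig>
  <!-- Hypothetical JDBC source: driver, url, user and password are placeholders -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.example.jdbc.Driver"
              url="jdbc:example://localhost/docs"
              user="solr" password="changeme"/>

  <!-- Exposes a BLOB column of the parent row as a stream for the nested entity -->
  <dataSource name="fieldstream" type="FieldStreamDataSource"/>

  <document>
    <!-- Hypothetical table and columns; the entity name must match the
         prefix used in dataField on the nested entity -->
    <entity name="outer" dataSource="db"
            query="SELECT ID, DOCIMAGE FROM DOCUMENTS">
      <field column="ID" name="id"/>

      <!-- Nested entity: Tika parses the BLOB from outer.DOCIMAGE -->
      <entity name="tika" processor="TikaEntityProcessor"
              dataField="outer.DOCIMAGE" dataSource="fieldstream"
              format="xml">
        <field column="Author" meta="true" name="xmauthor"/>
        <field column="title" meta="true" name="title"/>
        <field column="text" name="content"/>
        <!-- Untested assumption: mapping Tika's own key may work where
             content_type did not:
             <field column="Content-Type" meta="true" name="content_type"/> -->
      </entity>
    </entity>
  </document>
</dataConfig>
```

The nesting is the important part of the design: the outer entity produces one row per database record, and the inner entity runs once per row against the BLOB in that row, so the extracted text and metadata end up in the same Solr document as the database columns.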