Re: ExtractingRequestHandler and XmlUpdateHandler

Grant Ingersoll Mon, 15 Dec 2008 05:48:20 -0800


On Dec 15, 2008, at 8:20 AM, Jacob Singh wrote:

Hi Erik,

Sorry I wasn't totally clear.  Some responses inline:
If the file is visible from the Solr server, there is no need to actually send the bits through HTTP. Solr's content stream capabilities allow a file
to be retrieved from Solr itself.
Yeah, I know.  But in my case not possible.   Perhaps a simple file
receiving HTTP POST handler which simply stored the file on disk and
returned a path to it is the way to go here.
So I could send the file, and receive back a token which I would then throw into one of my fields as a reference. Then I could use it to map Tika
fields as well, like:

<str name="file_mod_date">${FILETOKEN}.last_modified</str>

<str name="file_body">${FILETOKEN}.content</str>
Huh? I don't follow the file token thing. Perhaps you're thinking you'll post the file, then later update other fields on that same document.
An important point here is that Solr currently does not have document
update capabilities. A document can be fully replaced, but cannot have fields added to it once indexed. It needs to be handled all in one shot to
accomplish the blending of file/field indexing.  Note the
ExtractingRequestHandler already has the field mapping capability.
Sorta... I was more thinking of a new feature wherein a Solr Request
handler doesn't actually put the file in the index, merely runs it
through tika and stores a datastore which links a "token" with the
tika extraction.  Then the client could make another request w/ the
XMLUpdateHandler which referenced parts of the stored tika extraction.
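A rough sketch of how such token references might be resolved, purely to illustrate the idea. Nothing like this exists in Solr as far as this thread says: the in-memory store, the token value "abc123", and the resolve() helper are all hypothetical, and only the ${TOKEN}.field notation comes from the example above.

```python
import re

# Hypothetical in-memory "datastore": token -> fields from a Tika extraction.
EXTRACTIONS = {
    "abc123": {
        "last_modified": "2008-12-15T05:48:20Z",
        "content": "Text that Tika pulled out of the file.",
    }
}

# Matches references such as ${abc123}.content
PLACEHOLDER = re.compile(r"\$\{(\w+)\}\.(\w+)")


def resolve(value, store=EXTRACTIONS):
    """Replace each ${token}.field reference with the stored extraction value."""
    def lookup(match):
        token, field = match.group(1), match.group(2)
        # Unknown tokens or fields raise KeyError rather than passing
        # the raw placeholder through to the index.
        return store[token][field]
    return PLACEHOLDER.sub(lookup, value)


print(resolve("${abc123}.last_modified"))
```

Failing loudly on an unknown token is one possible choice; a real handler might prefer to reject the whole update with an error response.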


Hmmm, thinking out loud....

Override SolrContentHandler. It is responsible for mapping the Tika output to a Solr Document.

Capture all the content into a single buffer.
Add said buffer to a field that is stored only

Add a second field that is indexed. This is your "token". You could, just as well, have that token be the only thing that gets returned by extract only.

Alternately, you could implement an UpdateProcessor thingamajob that takes the output and stores it to the filesystem and just adds the token to a document.
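The filesystem variant could look roughly like the sketch below. It is plain Python rather than an actual Solr UpdateProcessor, and every name in it (store_extraction, load_extraction, the file_token field, the SHA-1 based token) is an assumption made for illustration, not something Solr provides.

```python
import hashlib
import os
import tempfile


def store_extraction(text, directory):
    """Write extracted text to `directory` and return a token that names it."""
    data = text.encode("utf-8")
    # A content hash is one possible token; a UUID or counter would also work.
    token = hashlib.sha1(data).hexdigest()
    with open(os.path.join(directory, token + ".txt"), "wb") as handle:
        handle.write(data)
    return token


def load_extraction(token, directory):
    """Read back the text previously stored under `token`."""
    with open(os.path.join(directory, token + ".txt"), "rb") as handle:
        return handle.read().decode("utf-8")


store_dir = tempfile.mkdtemp()
token = store_extraction("Text that Tika pulled out of the file.", store_dir)

# The document carries only the small token, not the extracted body.
document = {"id": "doc-1", "file_token": token}
```

With a content hash as the token, posting the same extraction twice maps to the same stored file, which may or may not be what you want.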

But, here's a solution that will work for you right now... let Tika extract
the content and return it to you, then turn around and post it and
whatever other fields you like:

<http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput>
In that example, the contents aren't being indexed, just returned to the client. And you can leverage the content stream capability with this as
well, avoiding posting the actual binary file by pointing the extracting
request to a file path visible to Solr.
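A minimal sketch of that two-step flow from the client side, building the requests without sending them. The parameter names extractOnly and stream.file are the ones used by released Solr versions and may differ in the build discussed here; the host, the file path, and the "title" field are made up for the example.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def extract_only_url(base_url, file_path):
    """Build an extract-only request pointing Solr at a file it can read itself."""
    # Assumed parameter names; check them against your Solr version.
    params = {"extractOnly": "true", "stream.file": file_path}
    return base_url + "/update/extract?" + urlencode(params)


def add_document_xml(fields):
    """Build an <add><doc>...</doc></add> update message from (name, value) pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Step 1: ask Solr to extract only (hypothetical host and path).
url = extract_only_url("http://localhost:8983/solr", "/data/report.pdf")

# Step 2: post the returned text together with whatever other fields you like.
xml = add_document_xml([
    ("id", "doc-1"),
    ("file_body", "text returned by the extract-only call"),
    ("title", "Quarterly report"),
])
```

Building the XML with ElementTree rather than string concatenation means characters such as < and & in the extracted text are escaped for you.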


Yeah, I saw that.  This is pretty much what I was talking about above,
the only disadvantage (which is a deal breaker in our case) is the
extra bandwidth to move the file back and forth.

Thanks for your help and quick response.

I think we'll integrate the POST fields as Grant has kindly provided
multi-value input now, and see what happens in the future.  I realize
what I'm talking about (XML and binary together) is probably not a
high priority feature.


Is the use case this:

1. You want to assign metadata and also store the original and have it stored in binary format, too? Thus, Solr becomes a backing, searchable store?

I think we could possibly add an option to serialize the ContentStream onto a Field on the Document. In other words, store the original with the Document. Of course, buyer beware on the cost of doing so.
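To put a rough number on that cost: assuming the original bytes were carried in a text field as Base64 (an assumption for illustration; the thread does not say how such an option would serialize the stream), the stored value grows by about a third. The field name original_b64 is hypothetical.

```python
import base64
import os

original = os.urandom(3000)  # stand-in for the bytes of a binary file
encoded = base64.b64encode(original).decode("ascii")

# Hypothetical document; the field names are illustrative only.
document = {"id": "doc-1", "original_b64": encoded}

# Base64 emits 4 characters for every 3 input bytes, so 3000 bytes
# become 4000 characters before any index or storage overhead.
print(len(original), len(encoded))
```

That overhead is paid per document, on top of the extracted text that is stored and indexed anyway.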
