On Dec 15, 2008, at 8:20 AM, Jacob Singh wrote:

Hi Erik,

Sorry I wasn't totally clear.  Some responses inline:
If the file is visible from the Solr server, there is no need to actually send the bits through HTTP. Solr's content stream capabilities allow a file
to be retrieved by Solr itself.


Yeah, I know, but in my case that's not possible.  Perhaps a simple
file-receiving HTTP POST handler which simply stored the file on disk and
returned a path to it is the way to go here.
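(For reference, the content-stream route Erik mentions amounts to a request like the one built below. This is a minimal sketch in Python, assuming a Solr instance at localhost:8983 with the extract handler at /update/extract and remote streaming enabled in solrconfig.xml; the helper name and file paths are invented.)

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed Solr location

def extract_url(file_path, doc_id):
    """Build an ExtractingRequestHandler URL that asks Solr to read the
    file from its own filesystem (stream.file) instead of an HTTP POST."""
    params = {
        "stream.file": file_path,   # path as seen by the Solr server
        "literal.id": doc_id,       # literal.* sets a field verbatim
        "commit": "true",
    }
    return "%s/update/extract?%s" % (SOLR_BASE, urlencode(params))

print(extract_url("/data/docs/report.pdf", "doc-1"))
```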

So I could send the file and receive back a token, which I would then throw
into one of my fields as a reference, then use it to map the Tika fields as
well, like:

<str name="file_mod_date">${FILETOKEN}.last_modified</str>

<str name="file_body">${FILETOKEN}.content</str>

Huh? I don't follow the file token thing. Perhaps you're thinking you'll post the file, then later update other fields on that same document.
An important point here is that Solr currently does not have document
update capabilities: a document can be fully replaced, but cannot have fields added to it once indexed. It all needs to be handled in one shot to
accomplish the blending of file/field indexing.  Note that the
ExtractingRequestHandler already has field mapping capability.
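(Roughly, that one-shot mapping uses the fmap.* and literal.* request parameters of the extract handler. A sketch of building such a request; the helper and the field names are invented for illustration:)

```python
from urllib.parse import urlencode

def extract_params(doc_id, mappings, literals=None):
    """Build the query string for one extract request that indexes the
    file and maps Tika metadata onto Solr fields in the same shot."""
    params = {"literal.id": doc_id}
    for tika_field, solr_field in mappings.items():
        # fmap.<tika-field>=<solr-field> renames a Tika field on the way in
        params["fmap.%s" % tika_field] = solr_field
    for field, value in (literals or {}).items():
        # literal.* supplies your own field values alongside the file
        params["literal.%s" % field] = value
    return urlencode(params)

qs = extract_params(
    "doc-1",
    {"last_modified": "file_mod_date", "content": "file_body"},
    literals={"title": "Quarterly report"},
)
```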


Sorta... I was more thinking of a new feature wherein a Solr request
handler doesn't actually put the file in the index, but merely runs it
through Tika and writes the extraction to a datastore keyed by a "token".
The client could then make another request with the XMLUpdateHandler
which referenced parts of the stored Tika extraction.


Hmmm, thinking out loud....

1. Override SolrContentHandler; it is responsible for mapping the Tika output to a Solr document.
2. Capture all the content into a single buffer.
3. Add said buffer to a field that is stored only.
4. Add a second field that is indexed. This is your "token". You could, just as well, have that token be the only thing that gets returned by extract-only.

Alternately, you could implement an UpdateProcessor thingamajob that takes the output, stores it to the filesystem, and just adds the token to the document.





But here's a solution that will work for you right now... let Tika extract
the content and return it back to you, then turn around and post it along
with whatever other fields you like:

<http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput>

In that example the contents aren't being indexed, just returned to the client. And you can leverage the content stream capability with this as
well, avoiding posting the actual binary file, by pointing the extracting
request at a file path visible to Solr.
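(The two-step flow is: call /update/extract with extractOnly=true, take the returned content, then post a regular XML add document blending it with your own fields. A sketch of step two on the client side; build_add_doc is a made-up helper and the field names are illustrative:)

```python
from xml.sax.saxutils import escape

def build_add_doc(fields):
    """Build the <add> XML for a regular /update post, combining extracted
    content with caller-supplied fields (step two of the extract-only flow)."""
    parts = ["<add><doc>"]
    for name, value in fields.items():
        # escape() keeps extracted text from breaking the XML
        parts.append('<field name="%s">%s</field>' % (name, escape(value)))
    parts.append("</doc></add>")
    return "".join(parts)

extracted_body = "...text returned by extractOnly=true..."  # from step one
doc = build_add_doc({"id": "doc-1", "file_body": extracted_body})
```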


Yeah, I saw that.  This is pretty much what I was talking about above,
the only disadvantage (which is a deal breaker in our case) is the
extra bandwidth to move the file back and forth.

Thanks for your help and quick response.

I think we'll integrate the POST fields now that Grant has kindly provided
multi-value input, and see what happens in the future.  I realize
what I'm talking about (XML and binary together) is probably not a
high-priority feature.


Is the use case this:

1. You want to assign metadata and also store the original, in binary form, so that Solr becomes a backing, searchable store?

I think we could possibly add an option to serialize the ContentStream onto a Field on the Document. In other words, store the original with the Document. Of course, buyer beware on the cost of doing so.
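(Until such an option exists, a client could approximate it by base64-encoding the original itself and carrying it in a stored-only field. A rough sketch; the helper names are invented, and note the roughly 4/3 size overhead of base64:)

```python
import base64

def encode_original(raw_bytes):
    """base64 so arbitrary binary survives inside an XML field value."""
    return base64.b64encode(raw_bytes).decode("ascii")

def decode_original(field_value):
    """Recover the original bytes from the stored field on retrieval."""
    return base64.b64decode(field_value.encode("ascii"))
```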
