Jacob,

Hmmm... seems the wires are still crossed and confusing.


On Dec 15, 2008, at 6:34 AM, Jacob Singh wrote:
This is indeed what I was talking about... It could even be handled
via some type of transient file storage system.  This might even be
better to avoid the risks associated with uploading a huge file across
a network and might (have no idea) be easier to implement.

If the file is visible from the Solr server, there is no need to actually send the bits through HTTP. Solr's content stream capabilities allow a file to be retrieved by Solr itself.
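For example, the request can pass a stream.file parameter pointing at a path the Solr server can read, instead of POSTing the file bytes. A minimal sketch, assuming a typical setup (localhost:8983, the /update/extract handler, and remote streaming enabled in solrconfig.xml) -- the file path and id are placeholders:

```python
from urllib.parse import urlencode

# Solr reads the file from its own filesystem via remote streaming
# (requires enableRemoteStreaming="true" in solrconfig.xml).
params = {
    "stream.file": "/data/docs/report.pdf",  # path visible to the Solr server
    "literal.id": "doc1",                    # unique key supplied alongside the file
    "commit": "true",
}
url = "http://localhost:8983/solr/update/extract?" + urlencode(params)
print(url)
```

No binary content crosses the wire; the HTTP request is just this small URL.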

So I could send the file, and receive back a token which I would then
throw into one of my fields as a reference.  Then using it to map tika
fields as well. like:

<str name="file_mod_date">${FILETOKEN}.last_modified</str>

<str name="file_body">${FILETOKEN}.content</str>

Huh? I don't follow the file token thing. Perhaps you're thinking you'll post the file, then later update other fields on that same document. An important point here is that Solr currently does not have document update capabilities. A document can be fully replaced, but cannot have fields added to it once indexed. It needs to be handled all in one shot to accomplish the blending of file/field indexing. Note that the ExtractingRequestHandler already has the field mapping capability.
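The mapping sketched above can be expressed directly on the one-shot extract request. A sketch using the fmap.* parameter names from the current wiki documentation (older releases may use different parameter names); the field names are the ones from the example above, and the path and id are placeholders:

```python
from urllib.parse import urlencode

# One-shot request: extraction, field mapping, and extra literal fields,
# all handled at index time -- no second "update" pass needed.
# fmap.<tikaField>=<solrField> renames Tika's metadata fields on the way in.
params = {
    "stream.file": "/data/docs/report.pdf",
    "fmap.last_modified": "file_mod_date",  # Tika's last_modified -> file_mod_date
    "fmap.content": "file_body",            # extracted body text -> file_body
    "literal.id": "doc1",                   # extra field supplied by the client
    "commit": "true",
}
url = "http://localhost:8983/solr/update/extract?" + urlencode(params)
```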

But here's a solution that will work for you right now... let Tika extract the content and return it back to you, then turn around and post it along with whatever other fields you like:

  <http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput>

In that example, the contents aren't being indexed, just returned back to the client. And you can leverage the content stream capability here as well, avoiding posting the actual binary file by pointing the extracting request at a file path visible to Solr.

        Erik
