Jacob,
Hmmm... it seems the wires are still crossed.
On Dec 15, 2008, at 6:34 AM, Jacob Singh wrote:
> This is indeed what I was talking about... It could even be handled
> via some type of transient file storage system. This might even be
> better, to avoid the risks associated with uploading a huge file
> across a network, and might (I have no idea) be easier to implement.
If the file is visible from the Solr server, there is no need to
actually send the bits over HTTP. Solr's content stream
capabilities allow the file to be retrieved by Solr itself.
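A sketch of what that can look like, assuming remote streaming has been enabled in solrconfig.xml and the extracting handler is mapped at /update/extract (handler path and parameter names vary by Solr version, so adjust for yours):

```
# Remote streaming must be switched on in solrconfig.xml first, e.g.:
#   <requestParsers enableRemoteStreaming="true" ... />
# Solr then reads the file from its own filesystem; no bits travel
# in the request body. The path /data/docs/report.pdf is a placeholder.
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&stream.file=/data/docs/report.pdf"
```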
> So I could send the file, and receive back a token which I would
> then throw into one of my fields as a reference, then use it to map
> Tika fields as well, like:
>
> <str name="file_mod_date">${FILETOKEN}.last_modified</str>
> <str name="file_body">${FILETOKEN}.content</str>
Huh? I don't follow the file token thing. Perhaps you're thinking
you'll post the file, then later update other fields on that same
document. An important point here: Solr currently does not have
document update capabilities. A document can be fully replaced, but
once indexed it cannot have fields added to it. The blending of file
and field indexing has to be handled all in one shot. Note that the
ExtractingRequestHandler already has the field mapping capability.
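A sketch of that one-shot approach, with literal field values and Tika field mapping in a single request (parameter names here follow the later literal.*/fmap.* convention; earlier versions of the handler used different prefixes, so check your release):

```
# One-shot indexing: Tika-extracted content plus your own fields.
# literal.* supplies ordinary field values; fmap.* renames Tika's
# extracted fields into your schema. Field and file names are
# placeholders for illustration.
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=file_body&fmap.last_modified=file_mod_date&commit=true" \
     -F "myfile=@report.pdf"
```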
But here's a solution that will work for you right now: have Tika
extract the content and return it back to you, then turn around and
post it along with whatever other fields you like:
<http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput>
In that example, the contents aren't indexed, just returned to the
client. And you can leverage the content stream capability here as
well, avoiding posting the actual binary file by pointing the
extracting request at a file path visible to Solr.
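The two-step flow could look something like this (the extractOnly parameter name is from later Solr releases and the paths and field names are placeholders, so treat this as a sketch):

```
# Step 1: have Solr run Tika and hand back the extracted text
# without indexing anything.
curl "http://localhost:8983/solr/update/extract?extractOnly=true&stream.file=/data/docs/report.pdf"

# Step 2: post the extracted text yourself, alongside any other
# fields you want on the document.
curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" \
     --data-binary '<add><doc>
       <field name="id">doc1</field>
       <field name="file_body">...extracted text from step 1...</field>
     </doc></add>'
```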
Erik