On Dec 15, 2008, at 8:20 AM, Jacob Singh wrote:
Hi Erik,
Sorry I wasn't totally clear. Some responses inline:
If the file is visible from the Solr server, there is no need to
actually send the bits through HTTP. Solr's content stream
capabilities allow a file to be retrieved by Solr itself.
Yeah, I know. But in my case that's not possible. Perhaps a simple
file-receiving HTTP POST handler, which simply stores the file on disk
and returns a path to it, is the way to go here.
So I could send the file and receive back a token, which I would then
throw into one of my fields as a reference. I could then use it to map
Tika fields as well, like:
<str name="file_mod_date">${FILETOKEN}.last_modified</str>
<str name="file_body">${FILETOKEN}.content</str>
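For illustration, the file-receiving handler described above could look roughly like the sketch below. It is not part of Solr; the names `store_upload`, `path_for`, `UploadHandler` and the temporary upload directory are all made up for this example, and the token is just a random hex string.

```python
import os
import tempfile
import uuid
from http.server import BaseHTTPRequestHandler

# Hypothetical storage location; a real deployment would pick a fixed path.
UPLOAD_DIR = tempfile.mkdtemp(prefix="uploads-")


def path_for(token, upload_dir=UPLOAD_DIR):
    """Return the on-disk path that belongs to a token."""
    return os.path.join(upload_dir, token)


def store_upload(data, upload_dir=UPLOAD_DIR):
    """Write the uploaded bytes to disk and return an opaque token."""
    token = uuid.uuid4().hex
    with open(path_for(token, upload_dir), "wb") as fh:
        fh.write(data)
    return token


class UploadHandler(BaseHTTPRequestHandler):
    """Accepts a raw POST body, stores it, and replies with the token."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        token = store_upload(self.rfile.read(length))
        body = token.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To try it out (assumed host and port):
#   from http.server import HTTPServer
#   HTTPServer(("127.0.0.1", 8000), UploadHandler).serve_forever()
```

The client would keep the returned token and put it into a field of the document it posts afterwards.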
Huh? I don't follow the file token thing. Perhaps you're thinking
you'll post the file, then later update other fields on that same
document.
An important point here is that Solr currently does not have document
update capabilities. A document can be fully replaced, but cannot have
fields added to it once indexed. It needs to be handled all in one
shot to accomplish the blending of file/field indexing. Note the
ExtractingRequestHandler already has the field mapping capability.
Sorta... I was more thinking of a new feature wherein a Solr request
handler doesn't actually put the file in the index, but merely runs it
through Tika and stores the result in a datastore which links a
"token" with the Tika extraction. Then the client could make another
request w/ the XMLUpdateHandler which referenced parts of the stored
Tika extraction.
Hmmm, thinking out loud....
Override SolrContentHandler. It is responsible for mapping the Tika
output to a Solr Document.
Capture all the content into a single buffer.
Add said buffer to a field that is stored only
Add a second field that is indexed. This is your "token". You could,
just as well, have that token be the only thing that gets returned by
extract only.
Alternately, you could implement an UpdateProcessor thingamajob that
takes the output and stores it to the filesystem and just adds the
token to a document.
But here's a solution that will work for you right now... let Tika
extract the content and return it to you, then turn around and post
it along with whatever other fields you like:
<http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput>
In that example, the contents aren't being indexed, just returned back
to the client. And you can leverage the content stream capability with
this as well, avoiding posting the actual binary file by pointing the
extracting request to a file path visible by Solr.
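The "turn around and post it" step amounts to building a normal Solr XML update message that contains the extracted text next to your own fields. A minimal sketch, assuming the extracted content is already in hand as a string; `build_add_xml` is a made-up helper, and the field names are only examples that would have to exist in your schema:

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Build an <add><doc>...</doc></add> message from a dict of fields.

    A list or tuple value becomes several <field> elements with the same
    name, which is how multi-valued fields are sent.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(v)  # ElementTree escapes <, > and & on output
    return ET.tostring(add, encoding="unicode")
```

The resulting string is what would be posted to the update handler, with the whole document sent in one shot since fields cannot be added later.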
Yeah, I saw that. This is pretty much what I was talking about above,
the only disadvantage (which is a deal breaker in our case) is the
extra bandwidth to move the file back and forth.
Thanks for your help and quick response.
I think we'll integrate the POST fields as Grant has kindly provided
multi-value input now, and see what happens in the future. I realize
what I'm talking about (XML and binary together) is probably not a
high priority feature.
Is the use case this:
1. You want to assign metadata and also store the original and have it
stored in binary format, too? Thus, Solr becomes a backing,
searchable store?
I think we could possibly add an option to serialize the ContentStream
onto a Field on the Document. In other words, store the original with
the Document. Of course, buyer beware on the cost of doing so.