Re: ExtractingRequestHandler and XmlUpdateHandler

Jacob Singh Mon, 15 Dec 2008 05:20:30 -0800

Hi Erik,

Sorry I wasn't totally clear.  Some responses inline:
> If the file is visible from the Solr server, there is no need to actually
> send the bits through HTTP.  Solr's content stream capabilities allow a file
> to be retrieved from Solr itself.
>


Yeah, I know.  But in my case that's not possible.  Perhaps a simple
file-receiving HTTP POST handler which simply stores the file on disk and
returns a path to it is the way to go here.
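
To make that concrete, here is a rough sketch of such a receiver in Python
(standard library only).  Everything in it is an assumption for illustration:
the upload directory, the names store_upload/UploadHandler/serve, and the
choice to name files by content hash.  It is not an existing Solr handler.

```python
import hashlib
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical directory; it would need to be readable by the Solr server.
UPLOAD_DIR = "/tmp/solr-uploads"


def store_upload(data: bytes, directory: str = UPLOAD_DIR) -> str:
    """Write the uploaded bytes to disk and return the resulting path."""
    os.makedirs(directory, exist_ok=True)
    # Name the file after a hash of its content so re-uploads map to one file.
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(directory, name)
    with open(path, "wb") as fh:
        fh.write(data)
    return path


class UploadHandler(BaseHTTPRequestHandler):
    """Accepts a raw POST body and answers with the stored file's path."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = store_upload(self.rfile.read(length)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Run the receiver until interrupted (call this explicitly)."""
    HTTPServer((host, port), UploadHandler).serve_forever()
```

The returned path could then be handed to Solr as a content stream, provided
the file lands somewhere the Solr server can read.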

>> So I could send the file, and receive back a token which I would then
>> throw into one of my fields as a reference, and then use it to map Tika
>> fields as well, like:
>>
>> <str name="file_mod_date">${FILETOKEN}.last_modified</str>
>>
>> <str name="file_body">${FILETOKEN}.content</str>
>
> Huh?   I don't follow the file token thing.  Perhaps you're thinking
> you'll post the file, then later update other fields on that same document.
>  An important point here is that Solr currently does not have document
> update capabilities.  A document can be fully replaced, but cannot have
> fields added to it, once indexed.  It needs to be handled all in one shot to
> accomplish the blending of file/field indexing.  Note the
> ExtractingRequestHandler already has the field mapping capability.
>

Sorta... I was more thinking of a new feature wherein a Solr request
handler doesn't actually put the file in the index, but merely runs it
through Tika and keeps a datastore which links a "token" with the
Tika extraction.  Then the client could make another request w/ the
XmlUpdateHandler which referenced parts of the stored Tika extraction.
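
A minimal sketch of the token idea, in Python, assuming an in-memory dict as
the datastore and the ${TOKEN}.field reference syntax from my earlier example.
The names store_extraction and resolve are hypothetical; nothing like this
exists in Solr today.

```python
import re
import uuid

# Hypothetical in-memory datastore: token -> fields extracted by Tika.
_extractions = {}

# Matches a whole-value reference such as "${abc123}.content".
_REF = re.compile(r"^\$\{(\w+)\}\.(\w+)$")


def store_extraction(fields: dict) -> str:
    """Keep one extraction result and hand back a token that refers to it."""
    token = uuid.uuid4().hex
    _extractions[token] = dict(fields)
    return token


def resolve(value: str) -> str:
    """Replace a ${TOKEN}.field reference with the stored extracted value.

    Values that are not references are returned unchanged.
    """
    match = _REF.match(value)
    if match is None:
        return value
    token, field = match.groups()
    return _extractions[token][field]
```

The first request would call store_extraction with whatever Tika returned and
send the token to the client; the update request would then run each field
value through resolve before indexing.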

> But, here's a solution that will work for you right now... let Tika extract
> the content and return back to you, then turn around and post it and
> whatever other fields you like:
>
>  <http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput>
>
> In that example, the contents aren't being indexed, just returned back to
> the client.  And you can leverage the content stream capability with this as
> well avoiding posting the actual binary file, pointing the extracting
> request to a file path visible by Solr.
>

Yeah, I saw that.  This is pretty much what I was talking about above,
the only disadvantage (which is a deal breaker in our case) is the
extra bandwidth to move the file back and forth.
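
For reference, the two-step flow boils down to something like the Python
sketch below.  The parameter names extractOnly and stream.file, and the
/update/extract path, are assumptions on my part and may differ depending on
the Solr build and solrconfig.xml; the function names are made up.  It only
builds the request URL and the update XML, it does not talk to Solr.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def extract_only_url(solr_base: str, file_path: str) -> str:
    """Build the URL for step 1: ask Solr to extract, not index, a file.

    Assumes the handler is mapped at /update/extract and that Solr is
    allowed to read file_path itself (parameter names are assumptions).
    """
    params = {"extractOnly": "true", "stream.file": file_path}
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def build_add_xml(fields: dict) -> str:
    """Build the XML for step 2: post extracted content plus other fields."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes <, > and & on output
    return ET.tostring(add, encoding="unicode")
```

With stream.file the upload itself is avoided, but the extracted text still
travels to the client and back again in the XML, which is the round trip I'd
like to avoid.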

Thanks for your help and quick response.

I think we'll integrate the POST fields as Grant has kindly provided
multi-value input now, and see what happens in the future.  I realize
what I'm talking about (XML and binary together) is probably not a
high priority feature.

Best
Jacob
>        Erik
>
>



-- 

+1 510 277-0891 (o)
+91 9999 33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com
