Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-20 Thread Jacob Singh
On Wed, Dec 17, 2008 at 11:06 AM, Chris Hostetter
 wrote:
>
> : > : If I can find the bandwidth, I'd like to make something which allows
> : > : file uploads via the XMLUpdateHandler as well... Do you have any ideas
> : >
> : > the XmlUpdateRequestHandler already supports file uploads ... all request
>
> : But it doesn't do what Jacob is asking for... he wants (if I'm not mistaken)
>
> Hmm ... I thought this was an offshoot question ... the main point of this
> thread seems to have already been solved by the new
> ext.literal.${fieldname}=${fieldvalue} param support Grant just added
> to ExtractingRequestHandler, right?
>
> what am I misunderstanding about the use case that isn't solved by that?
> the "tika doc" from the ContentStream is the primary "guts" of the doc,
> with additional literal "metadata" fields being added, correct?

Yes, absolutely.  That would be a bonus :)  Because the PHP client is
already tuned to send the request in XML format, we now have to design
an interface which will supply the literal fields via POST fields instead.
Not a huge deal at all, just seeing what was possible out there.

I think we'll be endeavoring to deploy the ExtractingRequestHandler
next iteration and will absolutely write up anything we find with it
and contribute the PHP client code back.

Best,
Jacob

-- 

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-16 Thread Chris Hostetter

: > : If I can find the bandwidth, I'd like to make something which allows
: > : file uploads via the XMLUpdateHandler as well... Do you have any ideas
: > 
: > the XmlUpdateRequestHandler already supports file uploads ... all request

: But it doesn't do what Jacob is asking for... he wants (if I'm not mistaken)

Hmm ... I thought this was an offshoot question ... the main point of this
thread seems to have already been solved by the new
ext.literal.${fieldname}=${fieldvalue} param support Grant just added
to ExtractingRequestHandler, right?

what am I misunderstanding about the use case that isn't solved by that?
the "tika doc" from the ContentStream is the primary "guts" of the doc,
with additional literal "metadata" fields being added, correct?

(I can imagine a more complicated use case where someone might want a
single document built from multiple ContentStreams parsed by Tika, with
different pieces of each TikaDoc contributing in different ways ... i.e.: my
name is Hoss, my address is X, my phone number is Y, this 
first ContentStream should be indexed as my bio field (doesn't matter if 
it's PDF, HTML, MS-Word, etc.), index and store the ID3 Title & length 
from any MP3 ContentStreams in the multivalued "lecture_title" and 
"lecture_length" fields, and any ContentStreams left over should be 
indexed in the misc "other_text" field.   but that's not what we're 
talking about here, correct?)
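
for the simple case, it's only a few lines from SolrJ.  this is just a
sketch: it assumes a SolrJ build that includes ContentStreamUpdateRequest,
the method signatures and the "ext.literal." prefix have moved around
between releases, and the field names here are made up.  but it shows the
shape -- one ContentStream for the "guts", plus literal params for the
metadata:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithLiterals {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // the Tika-parsed file is the primary "guts" of the document
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("tutorial.pdf"));

    // literal "metadata" fields ride along as plain request parameters
    req.setParam("ext.literal.id", "doc-1");
    req.setParam("ext.literal.category", "tutorials");

    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    req.process(server);
  }
}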


-Hoss



Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-16 Thread Jacob Singh
No, I didn't mean storing the binary along with, just that I could
send a binary file (or a text file) which tika could process and store
along with the XML which describes its literal meta-data.

Best,
Jacob

On Mon, Dec 15, 2008 at 7:17 PM, Grant Ingersoll  wrote:
>
> On Dec 15, 2008, at 8:20 AM, Jacob Singh wrote:
>
>> Hi Erik,
>>
>> Sorry I wasn't totally clear.  Some responses inline:
>>>
>>> If the file is visible from the Solr server, there is no need to actually
>>> send the bits through HTTP.  Solr's content stream capabilities allow a
>>> file
>>> to be retrieved from Solr itself.
>>>
>>
>> Yeah, I know.  But in my case not possible.   Perhaps a simple file
>> receiving HTTP POST handler which simply stored the file on disk and
>> returned a path to it is the way to go here.
>>
 So I could send the file, and receive back a token which I would then
 throw into one of my fields as a reference.  Then using it to map tika
 fields as well. like:

 ${FILETOKEN}.last_modified

 ${FILETOKEN}.content
>>>
>>> Huh?   I don't follow the file token thing.  Perhaps you're thinking
>>> you'll post the file, then later update other fields on that same
>>> document.
>>> An important point here is that Solr currently does not have document
>>> update capabilities.  A document can be fully replaced, but cannot have
>>> fields added to it, once indexed.  It needs to be handled all in one shot
>>> to
>>> accomplish the blending of file/field indexing.  Note the
>>> ExtractingRequestHandler already has the field mapping capability.
>>>
>>
>> Sorta... I was more thinking of a new feature wherein a Solr Request
>> handler doesn't actually put the file in the index, merely runs it
>> through tika and stores a datastore which links a "token" with the
>> tika extraction.  Then the client could make another request w/ the
>> XMLUpdateHandler which referenced parts of the stored tika extraction.
>>
>
> Hmmm, thinking out loud
>
> Override SolrContentHandler.  It is responsible for mapping the Tika output
> to a Solr Document.
> Capture all the content into a single buffer.
> Add said buffer to a field that is stored only
> Add a second field that is indexed.  This is your "token".  You could, just
> as well, have that token be the only thing that gets returned by extract
> only.
>
> Alternately, you could implement an UpdateProcessor thingamajob that takes
> the output and stores it to the filesystem and just adds the token to a
> document.
>
>
>
>
>
>>> But, here's a solution that will work for you right now... let Tika
>>> extract
>>> the content and return back to you, then turn around and post it and
>>> whatever other fields you like:
>>>
>>> 
>>>
>>> In that example, the contents aren't being indexed, just returned back to
>>> the client.  And you can leverage the content stream capability with this
>>> as
>>> well avoiding posting the actual binary file, pointing the extracting
>>> request to a file path visible by Solr.
>>>
>>
>> Yeah, I saw that.  This is pretty much what I was talking about above,
>> the only disadvantage (which is a deal breaker in our case) is the
>> extra bandwidth to move the file back and forth.
>>
>> Thanks for your help and quick response.
>>
>> I think we'll integrate the POST fields as Grant has kindly provided
>> multi-value input now, and see what happens in the future.  I realize
>> what I'm talking about (XML and binary together) is probably not a
>> high priority feature.
>>
>
> Is the use case this:
>
> 1. You want to assign metadata and also store the original and have it
> stored in binary format, too?  Thus, Solr becomes a backing, searchable
> store?
>
> I think we could possibly add an option to serialize the ContentStream onto
> a Field on the Document.  In other words, store the original with the
> Document.  Of course, buyer beware on the cost of doing so.
>
>



-- 

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Grant Ingersoll


On Dec 15, 2008, at 8:20 AM, Jacob Singh wrote:


Hi Erik,

Sorry I wasn't totally clear.  Some responses inline:
If the file is visible from the Solr server, there is no need to  
actually
send the bits through HTTP.  Solr's content stream capabilities
allow a file

to be retrieved from Solr itself.



Yeah, I know.  But in my case not possible.   Perhaps a simple file
receiving HTTP POST handler which simply stored the file on disk and
returned a path to it is the way to go here.

So I could send the file, and receive back a token which I would  
then
throw into one of my fields as a reference.  Then using it to map  
tika

fields as well. like:

${FILETOKEN}.last_modified

${FILETOKEN}.content


Huh?   I don't follow the file token thing.  Perhaps you're
thinking
you'll post the file, then later update other fields on that same  
document.

An important point here is that Solr currently does not have document
update capabilities.  A document can be fully replaced, but cannot  
have
fields added to it, once indexed.  It needs to be handled all in  
one shot to

accomplish the blending of file/field indexing.  Note the
ExtractingRequestHandler already has the field mapping capability.



Sorta... I was more thinking of a new feature wherein a Solr request
handler doesn't actually put the file in the index, but merely runs it
through Tika and stores the result in a datastore which links a "token"
with the Tika extraction.  Then the client could make another request w/
the XMLUpdateHandler which referenced parts of the stored Tika extraction.



Hmmm, thinking out loud

Override SolrContentHandler.  It is responsible for mapping the Tika  
output to a Solr Document.

Capture all the content into a single buffer.
Add said buffer to a field that is stored only
Add a second field that is indexed.  This is your "token".  You could,  
just as well, have that token be the only thing that gets returned by  
extract only.


Alternately, you could implement an UpdateProcessor thingamajob that  
takes the output and stores it to the filesystem and just adds the  
token to a document.
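
Very rough sketch of that second approach (completely untested; the "text"
field name, the token field, and the storage path are all just placeholders,
and package names for SolrQueryRequest/Response drift a bit between Solr
versions):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.UUID;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ExternalizeContentProcessorFactory extends UpdateRequestProcessorFactory {
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object content = doc.getFieldValue("text");   // whatever field the Tika output landed in
        if (content != null) {
          String token = UUID.randomUUID().toString();
          FileWriter out = new FileWriter(new File("/path/to/extractions/" + token));
          out.write(content.toString());
          out.close();
          doc.removeField("text");                    // keep the big blob out of the index
          doc.addField("content_token", token);       // index just the token
        }
        super.processAdd(cmd);
      }
    };
  }
}

Whether the token field is indexed, stored, or both is then just schema config.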






But, here's a solution that will work for you right now... let Tika  
extract

the content and return back to you, then turn around and post it and
whatever other fields you like:



In that example, the contents aren't being indexed, just returned  
back to
the client.  And you can leverage the content stream capability  
with this as

well avoiding posting the actual binary file, pointing the extracting
request to a file path visible by Solr.



Yeah, I saw that.  This is pretty much what I was talking about above,
the only disadvantage (which is a deal breaker in our case) is the
extra bandwidth to move the file back and forth.

Thanks for your help and quick response.

I think we'll integrate the POST fields as Grant has kindly provided
multi-value input now, and see what happens in the future.  I realize
what I'm talking about (XML and binary together) is probably not a
high priority feature.



Is the use case this:

1. You want to assign metadata and also store the original and have it  
stored in binary format, too?  Thus, Solr becomes a backing,  
searchable store?


I think we could possibly add an option to serialize the ContentStream  
onto a Field on the Document.  In other words, store the original with  
the Document.  Of course, buyer beware on the cost of doing so.




Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Jacob Singh
Hi Erik,

Sorry I wasn't totally clear.  Some responses inline:
> If the file is visible from the Solr server, there is no need to actually
> send the bits through HTTP.  Solr's content stream capabilities allow a file
> to be retrieved from Solr itself.
>

Yeah, I know.  But in my case that's not possible.  Perhaps a simple
file-receiving HTTP POST handler which simply stored the file on disk and
returned a path to it is the way to go here.
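
Something along these lines maybe -- a custom handler that just drains
whatever ContentStreams arrive to disk and hands back the path.  Only a
sketch; the class name, the target directory, and the response key are all
made up, and I haven't tried it:

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.UUID;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

public class FileDropHandler extends RequestHandlerBase {
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
    if (req.getContentStreams() == null) {
      return;
    }
    for (ContentStream stream : req.getContentStreams()) {
      File dest = new File("/var/solr/incoming", UUID.randomUUID().toString());
      InputStream in = stream.getStream();
      FileOutputStream out = new FileOutputStream(dest);
      byte[] buf = new byte[8192];
      for (int n; (n = in.read(buf)) != -1; ) {
        out.write(buf, 0, n);
      }
      out.close();
      in.close();
      rsp.add("path", dest.getAbsolutePath());  // the "token" the client holds on to
    }
  }

  // SolrInfoMBean boilerplate -- the exact set of abstract methods varies by Solr version
  public String getDescription() { return "Stores uploaded files on disk and returns a path"; }
  public String getSource() { return null; }
  public String getSourceId() { return null; }
  public String getVersion() { return null; }
}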

>> So I could send the file, and receive back a token which I would then
>> throw into one of my fields as a reference.  Then using it to map tika
>> fields as well. like:
>>
>> ${FILETOKEN}.last_modified
>>
>> ${FILETOKEN}.content
>
> Huh?   I don't follow the file token thing.  Perhaps you're thinking
> you'll post the file, then later update other fields on that same document.
>  An important point here is that Solr currently does not have document
> update capabilities.  A document can be fully replaced, but cannot have
> fields added to it, once indexed.  It needs to be handled all in one shot to
> accomplish the blending of file/field indexing.  Note the
> ExtractingRequestHandler already has the field mapping capability.
>

Sorta... I was more thinking of a new feature wherein a Solr request
handler doesn't actually put the file in the index, but merely runs it
through Tika and stores the result in a datastore which links a "token"
with the Tika extraction.  Then the client could make another request w/
the XMLUpdateHandler which referenced parts of the stored Tika extraction.

> But, here's a solution that will work for you right now... let Tika extract
> the content and return back to you, then turn around and post it and
> whatever other fields you like:
>
>  
>
> In that example, the contents aren't being indexed, just returned back to
> the client.  And you can leverage the content stream capability with this as
> well avoiding posting the actual binary file, pointing the extracting
> request to a file path visible by Solr.
>

Yeah, I saw that.  This is pretty much what I was talking about above,
the only disadvantage (which is a deal breaker in our case) is the
extra bandwidth to move the file back and forth.

Thanks for your help and quick response.

I think we'll integrate the POST fields as Grant has kindly provided
multi-value input now, and see what happens in the future.  I realize
what I'm talking about (XML and binary together) is probably not a
high priority feature.

Best
Jacob
>Erik
>
>



-- 

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Erik Hatcher

Jacob,

Hmmm... seems the wires are still crossed and confusing.


On Dec 15, 2008, at 6:34 AM, Jacob Singh wrote:

This is indeed what I was talking about... It could even be handled
via some type of transient file storage system.  This might even be
better to avoid the risks associated with uploading a huge file across
a network and might (have no idea) be easier to implement.


If the file is visible from the Solr server, there is no need to  
actually send the bits through HTTP.  Solr's content stream
capabilities allow a file to be retrieved from Solr itself.



So I could send the file, and receive back a token which I would then
throw into one of my fields as a reference.  Then using it to map tika
fields as well. like:

${FILETOKEN}.last_modified

${FILETOKEN}.content


Huh?   I don't follow the file token thing.  Perhaps you're thinking
you'll post the file, then later update other fields on that same  
document.  An important point here is that Solr currently does not  
have document update capabilities.  A document can be fully replaced,  
but cannot have fields added to it, once indexed.  It needs to be  
handled all in one shot to accomplish the blending of file/field  
indexing.  Note the ExtractingRequestHandler already has the field  
mapping capability.


But, here's a solution that will work for you right now... let Tika  
extract the content and return back to you, then turn around and post  
it and whatever other fields you like:
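
Roughly, from SolrJ (a sketch only -- the ext.extract.only parameter name
and the response layout may differ depending on the version you're on):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

public class ExtractOnly {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("tutorial.pdf"));
    req.setParam("ext.extract.only", "true");     // parse with Tika and return the content, index nothing

    NamedList<Object> rsp = server.request(req);  // extracted text + metadata come back in the response
    System.out.println(rsp);
  }
}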


  

In that example, the contents aren't being indexed, just returned back  
to the client.  And you can leverage the content stream capability  
with this as well avoiding posting the actual binary file, pointing  
the extracting request to a file path visible by Solr.


Erik



Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Jacob Singh
Hi Erik,

This is indeed what I was talking about... It could even be handled
via some type of transient file storage system.  This might even be
better to avoid the risks associated with uploading a huge file across
a network and might (have no idea) be easier to implement.

So I could send the file, and receive back a token which I would then
throw into one of my fields as a reference.  Then using it to map tika
fields as well. like:

${FILETOKEN}.last_modified

${FILETOKEN}.content

Best,
Jacob


On Mon, Dec 15, 2008 at 2:29 PM, Erik Hatcher
 wrote:
>
> On Dec 15, 2008, at 3:13 AM, Chris Hostetter wrote:
>
>>
>> : If I can find the bandwidth, I'd like to make something which allows
>> : file uploads via the XMLUpdateHandler as well... Do you have any ideas
>>
>> the XmlUpdateRequestHandler already supports file uploads ... all request
>> handlers do using the ContentStream abstraction...
>>
>>http://wiki.apache.org/solr/ContentStream
>
> But it doesn't do what Jacob is asking for... he wants (if I'm not mistaken)
> the ability to send a binary file along with Solr XML, and merge the
> extraction from the file (via Tika) with the fields specified in the XML.
>
> Currently this is not possible, as far as I know.  Maybe this sort of thing
> could be coded as part of an update processor chain?  Somehow DIH and
> Tika need to tie together eventually too, eh?
>
>Erik
>
>



-- 

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Erik Hatcher


On Dec 15, 2008, at 3:13 AM, Chris Hostetter wrote:



: If I can find the bandwidth, I'd like to make something which allows
: file uploads via the XMLUpdateHandler as well... Do you have any  
ideas


the XmlUpdateRequestHandler already supports file uploads ... all  
request

handlers do using the ContentStream abstraction...

http://wiki.apache.org/solr/ContentStream


But it doesn't do what Jacob is asking for... he wants (if I'm not  
mistaken) the ability to send a binary file along with Solr XML, and  
merge the extraction from the file (via Tika) with the fields  
specified in the XML.


Currently this is not possible, as far as I know.  Maybe this sort of
thing could be coded as part of an update processor chain?  Somehow
DIH and Tika need to tie together eventually too, eh?


Erik



Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Chris Hostetter

: If I can find the bandwidth, I'd like to make something which allows
: file uploads via the XMLUpdateHandler as well... Do you have any ideas

the XmlUpdateRequestHandler already supports file uploads ... all request 
handlers do using the ContentStream abstraction...

http://wiki.apache.org/solr/ContentStream


-Hoss



Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-14 Thread Jacob Singh
Hey,

thanks!  This is good stuff.  I didn't expect you to just make the fix!

If I can find the bandwidth, I'd like to make something which allows
file uploads via the XMLUpdateHandler as well... Do you have any ideas
here?  I was thinking we could just send the XML payload as another
POST field.

Would this work?

Thanks again,

Jacob

On Sun, Dec 14, 2008 at 9:18 AM, Grant Ingersoll  wrote:
> Hi Jacob,
>
> I just updated the code such that it should now be possible to send in
> multiple values as literals, as in an HTML form that looks like:
>
> [HTML form markup stripped by the archive: a multipart/form-data POST form
>  with several ext.literal inputs and a file input ("Choose a file to upload:")]
>
> Cheers,
> Grant
>
> On Dec 12, 2008, at 11:53 PM, Jacob Singh wrote:
>
>> Hi Grant,
>>
>> Thanks for the quick response.  My Colleague looked into the code a
>> bit, and I did as well, here is what I see (my Java sucks):
>>
>>
>> http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/SolrContentHandler.java
>> //handle the literals from the params
>>   Iterator paramNames = params.getParameterNamesIterator();
>>   while (paramNames.hasNext()) {
>> String name = paramNames.next();
>> if (name.startsWith(LITERALS_PREFIX)) {
>>   String fieldName = name.substring(LITERALS_PREFIX.length());
>>   //no need to map names here, since they are literals from the user
>>   SchemaField schFld = schema.getFieldOrNull(fieldName);
>>   if (schFld != null) {
>> String value = params.get(name);
>> boost = getBoost(fieldName);
>> //no need to transform here, b/c we can assume the user sent
>> it in correctly
>> document.addField(fieldName, value, boost);
>>   } else {
>> handleUndeclaredField(fieldName);
>>   }
>> }
>>   }
>>
>>
>> I don't know the solr source quite well enough to know if
>> document.addField() can take a struct in the form of some serialized
>> string, but how can I pass a multi-valued field via a
>> file-upload/multi-part POST?
>>
>> One idea is that as one of the POST fields, I could add an XML payload
>> as could be parsed by the XML handler, and then we could instantiate
>> it, pass in the doc by reference, and get its multivalue fields all
>> populated nicely.  But this perhaps isn't a fantastic solution, I'm
>> really not much of a Java programmer at all, would love to hear your
>> expert opinion on how to solve this.
>>
>> Best,
>> J
>>
>> On Fri, Dec 12, 2008 at 6:40 PM, Grant Ingersoll 
>> wrote:
>>>
>>> Hmmm, I think I see the disconnect, but I'm not sure.  Sending to the ERH
>>> (ExtractingReqHandler) is not an XML command at all, it's a file-upload/
>>> multi-part encoding.  I think you will need an API that does something
>>> like:
>>>
>>> (Just making this up, this is not real code)
>>> File file = new File(fileToIndex)
>>> resp = solr.addFile(file, params);
>>> 
>>>
>>> Where params contains the literals, captures, etc.  Then, in your API you
>>> need to do whatever PHP does to send that file as a multipart file (I
>>> think
>>> you can also POST it, too, but that has some downsides as described on
>>> the
>>> wiki)
>>>
>>> I'll try to whip up some SolrJ sample code, as I know others have asked
>>> for
>>> that.
>>>
>>> -Grant
>>>
>>> On Dec 12, 2008, at 5:34 AM, Jacob Singh wrote:
>>>
 Hi Grant,

 Happy to.

 Currently we are sending over documents by building a big XML file of
 all of the fields of that document. Something like this:

 $document = new Apache_Solr_Document();
  $document->id = apachesolr_document_id($node->nid);
  $document->title = $node->title;
  $document->body = strip_tags($text);
  $document->type  = $node->type;
  foreach ($categories as $cat) {
$document->setMultiValue('category', $cat);
  }

 The PHP Client library then takes all of this, and builds it into an
 XML payload which we POST over to Solr.

 When we implement rich file handling, I see these instructions:

 -
 Literals

 To add in your own metadata, pass in the literal parameter along with
 the
 file:

 curl

 http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1
 -F "tutori...@tutorial.pdf"

 -

 So it seems we can:

 a). Refactor the class to not generate XML, but rather to build post
 headers for each field.  We would like to avoid this.
 b)  Instead, I was hoping we could send the XML payload with all the
 literal fields defined (like id, type, etc), and the post fields
 required for the file content and the field it belongs to in one
 request

 Since my understanding is that docs in Solr are immutable, there is no:
 c). Send the file contents over, give it an ID, and then send over the
 rest of the fields and m

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-13 Thread Grant Ingersoll

Hi Jacob,

I just updated the code such that it should now be possible to send in  
multiple values as literals, as in an HTML form that looks like:


[HTML form markup stripped by the archive: a multipart/form-data POST form with
several ext.literal inputs and a file input ("Choose a file to upload:")]
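
The same thing from SolrJ, just to show the repeated-parameter idea (the
field names are only illustrative, and ContentStreamUpdateRequest may not be
in the build you're on):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class MultiValuedLiterals {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("ext.literal.id", "doc-1");
    params.add("ext.literal.category", "one");   // repeating the same literal param name...
    params.add("ext.literal.category", "two");   // ...is what makes the field multi-valued

    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("tutorial.pdf"));
    req.setParams(params);
    req.process(server);
  }
}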



Cheers,
Grant

On Dec 12, 2008, at 11:53 PM, Jacob Singh wrote:


Hi Grant,

Thanks for the quick response.  My Colleague looked into the code a
bit, and I did as well, here is what I see (my Java sucks):

http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/SolrContentHandler.java
//handle the literals from the params
   Iterator paramNames = params.getParameterNamesIterator();
   while (paramNames.hasNext()) {
 String name = paramNames.next();
 if (name.startsWith(LITERALS_PREFIX)) {
   String fieldName = name.substring(LITERALS_PREFIX.length());
   //no need to map names here, since they are literals from the  
user

   SchemaField schFld = schema.getFieldOrNull(fieldName);
   if (schFld != null) {
 String value = params.get(name);
 boost = getBoost(fieldName);
 //no need to transform here, b/c we can assume the user sent
it in correctly
 document.addField(fieldName, value, boost);
   } else {
 handleUndeclaredField(fieldName);
   }
 }
   }


I don't know the solr source quite well enough to know if
document.addField() can take a struct in the form of some serialized
string, but how can I pass a multi-valued field via a
file-upload/multi-part POST?

One idea is that as one of the POST fields, I could add an XML payload
as could be parsed by the XML handler, and then we could instantiate
it, pass in the doc by reference, and get its multivalue fields all
populated nicely.  But this perhaps isn't a fantastic solution, I'm
really not much of a Java programmer at all, would love to hear your
expert opinion on how to solve this.

Best,
J

On Fri, Dec 12, 2008 at 6:40 PM, Grant Ingersoll  
 wrote:
Hmmm, I think I see the disconnect, but I'm not sure.  Sending to  
the ERH
(ExtractingReqHandler) is not an XML command at all, it's a file- 
upload/
multi-part encoding.  I think you will need an API that does  
something like:


(Just making this up, this is not real code)
File file = new File(fileToIndex)
resp = solr.addFile(file, params);


Where params contains the literals, captures, etc.  Then, in your  
API you
need to do whatever PHP does to send that file as a multipart file  
(I think
you can also POST it, too, but that has some downsides as described  
on the

wiki)

I'll try to whip up some SolrJ sample code, as I know others have  
asked for

that.

-Grant

On Dec 12, 2008, at 5:34 AM, Jacob Singh wrote:


Hi Grant,

Happy to.

Currently we are sending over documents by building a big XML file  
of

all of the fields of that document. Something like this:

$document = new Apache_Solr_Document();
 $document->id = apachesolr_document_id($node->nid);
 $document->title = $node->title;
 $document->body = strip_tags($text);
 $document->type  = $node->type;
 foreach ($categories as $cat) {
$document->setMultiValue('category', $cat);
 }

The PHP Client library then takes all of this, and builds it into an
XML payload which we POST over to Solr.

When we implement rich file handling, I see these instructions:

-
Literals

To add in your own metadata, pass in the literal parameter along  
with the

file:

curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1 -F "tutori...@tutorial.pdf"

-

So it seems we can:

a). Refactor the class to not generate XML, but rather to build post
headers for each field.  We would like to avoid this.
b)  Instead, I was hoping we could send the XML payload with all the
literal fields defined (like id, type, etc), and the post fields
required for the file content and the field it belongs to in one
request

Since my understanding is that docs in Solr are immutable, there  
is no:
c). Send the file contents over, give it an ID, and then send over  
the

rest of the fields and merge into that ID.

If the unfortunate answer is a, then how do we deal with multi-value
fields?  I don't know how to format them given the ext.literal  
format

above.

Thanks for your help and awesome contributions!

-Jacob




On Fri, Dec 12, 2008 at 4:52 AM, Grant Ingersoll >

wrote:


On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote:


Hey folks,

I'm looking at implementing ExtractingRequestHandler in the
Apache_Solr_PHP
library, and I'm wondering what we can do about adding meta-data.

I saw the docs, which suggests you use different post headers to  
pass

field
values along with ext.literal.  Is there any way to use the
XmlUpdateHandler
instead along with a document?  I'm not sure how this would work,
perhaps it
would require 2 trips, perhaps the XML would be in the post  
"content"

and
the fi

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-13 Thread Grant Ingersoll


On Dec 12, 2008, at 11:53 PM, Jacob Singh wrote:


Hi Grant,

Thanks for the quick response.  My Colleague looked into the code a
bit, and I did as well, here is what I see (my Java sucks):

http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/SolrContentHandler.java
//handle the literals from the params
   Iterator paramNames = params.getParameterNamesIterator();
   while (paramNames.hasNext()) {
 String name = paramNames.next();
 if (name.startsWith(LITERALS_PREFIX)) {
   String fieldName = name.substring(LITERALS_PREFIX.length());
   //no need to map names here, since they are literals from the  
user

   SchemaField schFld = schema.getFieldOrNull(fieldName);
   if (schFld != null) {
 String value = params.get(name);
 boost = getBoost(fieldName);
 //no need to transform here, b/c we can assume the user sent
it in correctly
 document.addField(fieldName, value, boost);
   } else {
 handleUndeclaredField(fieldName);
   }
 }
   }


I don't know the solr source quite well enough to know if
document.addField() can take a struct in the form of some serialized
string, but how can I pass a multi-valued field via a
file-upload/multi-part POST?


Ah, I think I see the problem: you want to be able to pass in multiple
values for the same literal field.


In other words:
[HTML form markup stripped by the archive: the same multipart POST form, with
the same ext.literal input repeated and a file input ("Choose a file to upload:")]




Am I understanding correctly?

-Grant


Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-12 Thread Jacob Singh
Hi Grant,

Thanks for the quick response.  My colleague looked into the code a
bit, and I did as well; here is what I see (my Java sucks):

http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/SolrContentHandler.java
//handle the literals from the params
Iterator paramNames = params.getParameterNamesIterator();
while (paramNames.hasNext()) {
  String name = paramNames.next();
  if (name.startsWith(LITERALS_PREFIX)) {
String fieldName = name.substring(LITERALS_PREFIX.length());
//no need to map names here, since they are literals from the user
SchemaField schFld = schema.getFieldOrNull(fieldName);
if (schFld != null) {
  String value = params.get(name);
  boost = getBoost(fieldName);
  //no need to transform here, b/c we can assume the user sent
it in correctly
  document.addField(fieldName, value, boost);
} else {
  handleUndeclaredField(fieldName);
}
  }
}


I don't know the solr source quite well enough to know if
document.addField() can take a struct in the form of some serialized
string, but how can I pass a multi-valued field via a
file-upload/multi-part POST?

One idea is that, as one of the POST fields, I could add an XML payload
that could be parsed by the XML handler, and then we could instantiate
it, pass in the doc by reference, and get its multivalue fields all
populated nicely.  But this perhaps isn't a fantastic solution, I'm
really not much of a Java programmer at all, would love to hear your
expert opinion on how to solve this.

Best,
J

On Fri, Dec 12, 2008 at 6:40 PM, Grant Ingersoll  wrote:
> Hmmm, I think I see the disconnect, but I'm not sure.  Sending to the ERH
> (ExtractingReqHandler) is not an XML command at all, it's a file-upload/
> multi-part encoding.  I think you will need an API that does something like:
>
> (Just making this up, this is not real code)
> File file = new File(fileToIndex)
> resp = solr.addFile(file, params);
> 
>
> Where params contains the literals, captures, etc.  Then, in your API you
> need to do whatever PHP does to send that file as a multipart file (I think
> you can also POST it, too, but that has some downsides as described on the
> wiki)
>
> I'll try to whip up some SolrJ sample code, as I know others have asked for
> that.
>
> -Grant
>
> On Dec 12, 2008, at 5:34 AM, Jacob Singh wrote:
>
>> Hi Grant,
>>
>> Happy to.
>>
>> Currently we are sending over documents by building a big XML file of
>> all of the fields of that document. Something like this:
>>
>> $document = new Apache_Solr_Document();
>>   $document->id = apachesolr_document_id($node->nid);
>>   $document->title = $node->title;
>>   $document->body = strip_tags($text);
>>   $document->type  = $node->type;
>>   foreach ($categories as $cat) {
>>  $document->setMultiValue('category', $cat);
>>   }
>>
>> The PHP Client library then takes all of this, and builds it into an
>> XML payload which we POST over to Solr.
>>
>> When we implement rich file handling, I see these instructions:
>>
>> -
>> Literals
>>
>> To add in your own metadata, pass in the literal parameter along with the
>> file:
>>
>> curl
>> http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1
>> -F "tutori...@tutorial.pdf"
>>
>> -
>>
>> So it seems we can:
>>
>> a). Refactor the class to not generate XML, but rather to build post
>> headers for each field.  We would like to avoid this.
>> b)  Instead, I was hoping we could send the XML payload with all the
>> literal fields defined (like id, type, etc), and the post fields
>> required for the file content and the field it belongs to in one
>> request
>>
>> Since my understanding is that docs in Solr are immutable, there is no:
>> c). Send the file contents over, give it an ID, and then send over the
>> rest of the fields and merge into that ID.
>>
>> If the unfortunate answer is a, then how do we deal with multi-value
>> fields?  I don't know how to format them given the ext.literal format
>> above.
>>
>> Thanks for your help and awesome contributions!
>>
>> -Jacob
>>
>>
>>
>>
>> On Fri, Dec 12, 2008 at 4:52 AM, Grant Ingersoll 
>> wrote:
>>>
>>> On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote:
>>>
 Hey folks,

 I'm looking at implementing ExtractingRequestHandler in the
 Apache_Solr_PHP
 library, and I'm wondering what we can do about adding meta-data.

 I saw the docs, which suggests you use different post headers to pass
 field
 values along with ext.literal.  Is there any way to use the
 XmlUpdateHandler
 instead along with a document?  I'm not sure how this would work,
 perhaps it
 would require 2 trips, perhaps the XML would be in the post "content"
 and
 the file in something else?  

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-12 Thread Grant Ingersoll
Hmmm, I think I see the disconnect, but I'm not sure.  Sending to the
ERH (ExtractingReqHandler) is not an XML command at all, it's a
file-upload/multi-part encoding.  I think you will need an API that does
something like:


(Just making this up, this is not real code)
File file = new File(fileToIndex)
resp = solr.addFile(file, params);


Where params contains the literals, captures, etc.  Then, in your API  
you need to do whatever PHP does to send that file as a multipart file  
(I think you can also POST it, too, but that has some downsides as  
described on the wiki)


I'll try to whip up some SolrJ sample code, as I know others have  
asked for that.


-Grant

On Dec 12, 2008, at 5:34 AM, Jacob Singh wrote:


Hi Grant,

Happy to.

Currently we are sending over documents by building a big XML file of
all of the fields of that document. Something like this:

$document = new Apache_Solr_Document();
   $document->id = apachesolr_document_id($node->nid);
   $document->title = $node->title;
   $document->body = strip_tags($text);
   $document->type  = $node->type;
   foreach ($categories as $cat) {
  $document->setMultiValue('category', $cat);
   }

The PHP Client library then takes all of this, and builds it into an
XML payload which we POST over to Solr.

When we implement rich file handling, I see these instructions:

-
Literals

To add in your own metadata, pass in the literal parameter along  
with the file:


curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1 -F "tutori...@tutorial.pdf"

-

So it seems we can:

a). Refactor the class to not generate XML, but rather to build post
headers for each field.  We would like to avoid this.
b)  Instead, I was hoping we could send the XML payload with all the
literal fields defined (like id, type, etc), and the post fields
required for the file content and the field it belongs to in one
request

Since my understanding is that docs in Solr are immutable, there is  
no:

c). Send the file contents over, give it an ID, and then send over the
rest of the fields and merge into that ID.

If the unfortunate answer is a, then how do we deal with multi-value
fields?  I don't know how to format them given the ext.literal format
above.

Thanks for your help and awesome contributions!

-Jacob




On Fri, Dec 12, 2008 at 4:52 AM, Grant Ingersoll  
 wrote:


On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote:


Hey folks,

I'm looking at implementing ExtractingRequestHandler in the  
Apache_Solr_PHP

library, and I'm wondering what we can do about adding meta-data.

I saw the docs, which suggests you use different post headers to  
pass field
values along with ext.literal.  Is there any way to use the
XmlUpdateHandler
instead along with a document?  I'm not sure how this would work,  
perhaps it
would require 2 trips, perhaps the XML would be in the post  
"content" and
the file in something else?  The thing is we would need to  
refactor the
class pretty heavily in this case when indexing RichDocs and we  
were hoping

to avoid it.



I'm not sure I follow how the XmlUpdateHandler plays in, can you  
explain a little more?  My PHP is weak, but maybe some code will  
help...




Thanks,
Jacob
--

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com







--

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ












Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-12 Thread Jacob Singh
Hi Grant,

Happy to.

Currently we are sending over documents by building a big XML file of
all of the fields of that document. Something like this:

$document = new Apache_Solr_Document();
$document->id = apachesolr_document_id($node->nid);
$document->title = $node->title;
$document->body = strip_tags($text);
$document->type  = $node->type;
foreach ($categories as $cat) {
   $document->setMultiValue('category', $cat);
}

The PHP Client library then takes all of this, and builds it into an
XML payload which we POST over to Solr.

When we implement rich file handling, I see these instructions:

-
Literals

To add in your own metadata, pass in the literal parameter along with the file:

 curl 
http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1
 -F "tutori...@tutorial.pdf"

-

So it seems we can:

a) Refactor the class to not generate XML, but rather to build post
headers for each field.  We would like to avoid this.
b) Instead, I was hoping we could send the XML payload with all the
literal fields defined (like id, type, etc.), and the post fields
required for the file content and the field it belongs to, in one
request.

Since my understanding is that docs in Solr are immutable, there is no:
c) Send the file contents over, give it an ID, and then send over the
rest of the fields and merge into that ID.

If the unfortunate answer is a, then how do we deal with multi-value
fields?  I don't know how to format them given the ext.literal format
above.

Thanks for your help and awesome contributions!

-Jacob




On Fri, Dec 12, 2008 at 4:52 AM, Grant Ingersoll  wrote:
>
> On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote:
>
>> Hey folks,
>>
>> I'm looking at implementing ExtractingRequestHandler in the Apache_Solr_PHP
>> library, and I'm wondering what we can do about adding meta-data.
>>
>> I saw the docs, which suggests you use different post headers to pass field
>> values along with ext.literal.  Is there any way to use the XmlUpdateHandler
>> instead along with a document?  I'm not sure how this would work, perhaps it
>> would require 2 trips, perhaps the XML would be in the post "content" and
>> the file in something else?  The thing is we would need to refactor the
>> class pretty heavily in this case when indexing RichDocs and we were hoping
>> to avoid it.
>>
>
> I'm not sure I follow how the XmlUpdateHandler plays in, can you explain a 
> little more?  My PHP is weak, but maybe some code will help...
>
>
>> Thanks,
>> Jacob
>> --
>>
>> +1 510 277-0891 (o)
>> +91  33 7458 (m)
>>
>> web: http://pajamadesign.com
>>
>> Skype: pajamadesign
>> Yahoo: jacobsingh
>> AIM: jacobsingh
>> gTalk: jacobsi...@gmail.com
>
>



--

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-11 Thread Grant Ingersoll


On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote:


Hey folks,

I'm looking at implementing ExtractingRequestHandler in the  
Apache_Solr_PHP

library, and I'm wondering what we can do about adding meta-data.

I saw the docs, which suggests you use different post headers to  
pass field
values along with ext.literal.  Is there any way to use the
XmlUpdateHandler
instead along with a document?  I'm not sure how this would work,  
perhaps it
would require 2 trips, perhaps the XML would be in the post  
"content" and
the file in something else?  The thing is we would need to refactor  
the
class pretty heavily in this case when indexing RichDocs and we were  
hoping

to avoid it.



I'm not sure I follow how the XmlUpdateHandler plays in; can you
explain a little more?  My PHP is weak, but maybe some code will help...




Thanks,
Jacob
--

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com