Hey, thanks! This is good stuff. I didn't expect you to just make the fix!
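If I read the form in your mail (quoted below) right, the client side only has to repeat the field name once per value. Here is a rough sketch of just the text parts of such a request, mostly to check my understanding. MultipartLiterals and encodeFields are names I made up, not anything in Solr or SolrJ, and the file part plus the closing boundary line are left out:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Solr or SolrJ) for the text parts of a multipart/form-data body. */
final class MultipartLiterals {

    /** Emits one form-data part per value, so a multi-valued field simply repeats its name. */
    static String encodeFields(Map<String, List<String>> fields, String boundary) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            for (String value : field.getValue()) {
                body.append("--").append(boundary).append("\r\n")
                    .append("Content-Disposition: form-data; name=\"")
                    .append(field.getKey()).append("\"\r\n\r\n")
                    .append(value).append("\r\n");
            }
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Same fields as the HTML form: "features" is sent twice, once per value.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("ext.literal.features", Arrays.asList("solr", "cool"));
        fields.put("ext.def.fl", Arrays.asList("text"));
        // A real request would append the file part and a final "--boundary--" line.
        System.out.print(encodeFields(fields, "exampleBoundary"));
    }
}
```

The PHP client would presumably do the equivalent with whatever multipart support it already has; the only point here is that each value gets its own part under the same name.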
If I can find the bandwidth, I'd like to make something which allows file
uploads via the XMLUpdateHandler as well... Do you have any ideas here?

I was thinking we could just send the XML payload as another POST field.
Would this work?

Thanks again,
Jacob

On Sun, Dec 14, 2008 at 9:18 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> Hi Jacob,
>
> I just updated the code such that it should now be possible to send in
> multiple values as literals, as in an HTML form that looks like:
>
> <form enctype="multipart/form-data" action="/solr/update/extract" method="POST">
>   <input name="ext.literal.features" value="solr"/>
>   <input name="ext.literal.features" value="cool"/>
>   <input name="ext.def.fl" value="text"/>
>   Choose a file to upload: <input name="file" type="file" /><br />
>   <input type="submit" value="Upload File" />
> </form>
>
> Cheers,
> Grant
>
> On Dec 12, 2008, at 11:53 PM, Jacob Singh wrote:
>
>> Hi Grant,
>>
>> Thanks for the quick response. My colleague looked into the code a
>> bit, and I did as well. Here is what I see (my Java sucks):
>>
>> http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/SolrContentHandler.java
>>
>> //handle the literals from the params
>> Iterator<String> paramNames = params.getParameterNamesIterator();
>> while (paramNames.hasNext()) {
>>   String name = paramNames.next();
>>   if (name.startsWith(LITERALS_PREFIX)) {
>>     String fieldName = name.substring(LITERALS_PREFIX.length());
>>     //no need to map names here, since they are literals from the user
>>     SchemaField schFld = schema.getFieldOrNull(fieldName);
>>     if (schFld != null) {
>>       String value = params.get(name);
>>       boost = getBoost(fieldName);
>>       //no need to transform here, b/c we can assume the user sent it in correctly
>>       document.addField(fieldName, value, boost);
>>     } else {
>>       handleUndeclaredField(fieldName);
>>     }
>>   }
>> }
>>
>> I don't know the Solr source quite well enough to know if
>> document.addField() can take a struct in the form of some serialized
>> string, but how can I pass a multi-valued field via a
>> file-upload/multi-part POST?
>>
>> One idea is that as one of the POST fields, I could add an XML payload
>> that could be parsed by the XML handler, and then we could instantiate
>> it, pass in the doc by reference, and get its multi-value fields all
>> populated nicely. But this perhaps isn't a fantastic solution. I'm
>> really not much of a Java programmer at all, and would love to hear
>> your expert opinion on how to solve this.
>>
>> Best,
>> J
>>
>> On Fri, Dec 12, 2008 at 6:40 PM, Grant Ingersoll <gsing...@apache.org>
>> wrote:
>>>
>>> Hmmm, I think I see the disconnect, but I'm not sure. Sending to the ERH
>>> (ExtractingRequestHandler) is not an XML command at all, it's a
>>> file-upload/multi-part encoding. I think you will need an API that does
>>> something like:
>>>
>>> (Just making this up, this is not real code)
>>> File file = new File(fileToIndex);
>>> resp = solr.addFile(file, params);
>>> ----
>>>
>>> Where params contains the literals, captures, etc. Then, in your API you
>>> need to do whatever PHP does to send that file as a multipart file (I
>>> think you can also POST it, too, but that has some downsides as
>>> described on the wiki).
>>>
>>> I'll try to whip up some SolrJ sample code, as I know others have asked
>>> for that.
>>>
>>> -Grant
>>>
>>> On Dec 12, 2008, at 5:34 AM, Jacob Singh wrote:
>>>
>>>> Hi Grant,
>>>>
>>>> Happy to.
>>>>
>>>> Currently we are sending over documents by building a big XML file of
>>>> all of the fields of that document. Something like this:
>>>>
>>>> $document = new Apache_Solr_Document();
>>>> $document->id = apachesolr_document_id($node->nid);
>>>> $document->title = $node->title;
>>>> $document->body = strip_tags($text);
>>>> $document->type = $node->type;
>>>> foreach ($categories as $cat) {
>>>>   $document->setMultiValue('category', $cat);
>>>> }
>>>>
>>>> The PHP client library then takes all of this and builds it into an
>>>> XML payload which we POST over to Solr.
>>>>
>>>> When we implement rich file handling, I see these instructions:
>>>>
>>>> -----------------------------
>>>> Literals
>>>>
>>>> To add in your own metadata, pass in the literal parameter along with
>>>> the file:
>>>>
>>>> curl
>>>> http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1
>>>> -F "tutori...@tutorial.pdf"
>>>>
>>>> -----------------------------
>>>>
>>>> So it seems we can:
>>>>
>>>> a) Refactor the class to not generate XML, but rather to build post
>>>> headers for each field. We would like to avoid this.
>>>> b) Instead, I was hoping we could send the XML payload with all the
>>>> literal fields defined (like id, type, etc.), and the post fields
>>>> required for the file content and the field it belongs to, in one
>>>> request.
>>>>
>>>> Since my understanding is that docs in Solr are immutable, there is no:
>>>> c) Send the file contents over, give it an ID, and then send over the
>>>> rest of the fields and merge into that ID.
>>>>
>>>> If the unfortunate answer is a), then how do we deal with multi-value
>>>> fields? I don't know how to format them given the ext.literal format
>>>> above.
>>>>
>>>> Thanks for your help and awesome contributions!
>>>>
>>>> -Jacob
>>>>
>>>> On Fri, Dec 12, 2008 at 4:52 AM, Grant Ingersoll <gsing...@apache.org>
>>>> wrote:
>>>>>
>>>>> On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> I'm looking at implementing ExtractingRequestHandler in the
>>>>>> Apache_Solr_PHP library, and I'm wondering what we can do about
>>>>>> adding meta-data.
>>>>>>
>>>>>> I saw the docs, which suggest you use different post headers to pass
>>>>>> field values along with ext.literal. Is there any way to use the
>>>>>> XmlUpdateHandler instead, along with a document? I'm not sure how
>>>>>> this would work; perhaps it would require 2 trips, perhaps the XML
>>>>>> would be in the post "content" and the file in something else? The
>>>>>> thing is we would need to refactor the class pretty heavily in this
>>>>>> case when indexing RichDocs, and we were hoping to avoid it.
>>>>>>
>>>>>
>>>>> I'm not sure I follow how the XmlUpdateHandler plays in, can you
>>>>> explain a little more? My PHP is weak, but maybe some code will help...
>>>>>
>>>>>> Thanks,
>>>>>> Jacob
>>>>>> --
>>>>>>
>>>>>> +1 510 277-0891 (o)
>>>>>> +91 9999 33 7458 (m)
>>>>>>
>>>>>> web: http://pajamadesign.com
>>>>>>
>>>>>> Skype: pajamadesign
>>>>>> Yahoo: jacobsingh
>>>>>> AIM: jacobsingh
>>>>>> gTalk: jacobsi...@gmail.com
>>>
>>> --------------------------
>>> Grant Ingersoll
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ

--

+1 510 277-0891 (o)
+91 9999 33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com