Re: Indexing on plain text and binary data in a single HTTP POST request

Alexandre Rafalovitch Mon, 09 Dec 2013 03:49:49 -0800

Not a solution, but a couple of thoughts:
1) For your email address fields, you are escaping the angle brackets, right?
Not just "solr solr <s...@abc.com>" as you show, but with the < and >
escaped (as &lt; and &gt;)? Otherwise, those email addresses become part of
the XML markup and mess it all up.
2) Your binary content is encoded in some way inside the XML, right? Not just
raw binary, which would make the XML invalid? Something like base64?
3) I suspect you will need to use an UpdateRequestProcessor one way or
another: to decode the base64 as a first step, and to feed the result through
whatever you want to use to process the actual binary as a second step. So it
might be a custom URP with functionality similar to ExtractingRequestHandler,
the difference being that you already have a document object and you are
mapping one (binary) field in it into a bunch of other fields, with some
conventions on names, overrides, etc.
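
To make points 1 and 2 concrete, here is a minimal sketch in Python (chosen only for brevity; the same steps apply from C++ with libcurl). It escapes the text fields, base64-encodes the binary field, and then reads the payload back to show the decode step a custom URP would have to start with. The helper names build_add_xml and read_back and the address user@example.com are made up for illustration; none of this is Solr API.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(text_fields, binary_fields):
    """Build an <add><doc> payload (hypothetical helper, not part of Solr).

    text_fields:   dict of field name -> str, XML-escaped before embedding
    binary_fields: dict of field name -> bytes, base64-encoded before embedding
    """
    parts = ["<add><doc>"]
    for name, value in text_fields.items():
        # escape() turns &, < and > into entities so they cannot become markup
        parts.append("<field name=%s>%s</field>" % (quoteattr(name), escape(value)))
    for name, raw in binary_fields.items():
        # base64 output is plain ASCII, so it is always safe inside XML
        encoded = base64.b64encode(raw).decode("ascii")
        parts.append("<field name=%s>%s</field>" % (quoteattr(name), encoded))
    parts.append("</doc></add>")
    return "".join(parts)


def read_back(payload, binary_names):
    """Parse the payload and decode the base64 fields (the 'first step' above)."""
    root = ET.fromstring(payload)
    doc = {}
    for field in root.iter("field"):
        name = field.get("name")
        text = field.text or ""
        doc[name] = base64.b64decode(text) if name in binary_names else text
    return doc


payload = build_add_xml(
    {
        "mailbox-id": "1111",
        "folder": "INBOX",
        "from": "solr solr <user@example.com>",  # placeholder address
    },
    {"email-attachment": b"\x00\x01binary\xff"},  # stand-in for real attachment bytes
)
```

Note that the attribute values are quoted here, which well-formed XML requires. Extracting text from the decoded bytes (the second step) is not shown.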


Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, Dec 9, 2013 at 5:55 PM, neerajp <neeraj_star2...@yahoo.com> wrote:

> Hi,
> I am using Solr for searching my email data. My application is in C++, so I
> am using the CURL library to POST the data to Solr for indexing. I am posting
> data in XML format; some of the XML fields are plain text and some are
> binary. I want to know what I should do so that Solr can index both types of
> data (plain text as well as binary) coming in a single XML file.
>
> For reference, my XML file looks like:
> "<add><doc><field name=mailbox-id>1111</field><field
> name=folder>INBOX</field><field name=from>solr solr
> <s...@abc.com></field><field name=to>solr <s...@abc.com></field><field
> name=email-body>HI I AM EMAIL BODY\r\n\r\nTHANKS</field><field
> name=email-attachment>Some binary data</doc></add>"
>
> I tried to use ExtractingUpdateProcessorFactory, but it seems to me that
> ExtractingUpdateProcessorFactory is not supported in Solr 4.5 (which I am
> using), nor in any released Solr version.
>
> Also, I think I cannot use ExtractingRequestHandler for my problem, as the
> document is in XML format and contains mixed types of data (text and
> binary). Am I right? If yes, please suggest how to proceed; if no, how
> can I extract text from some of the binary fields using
> ExtractingRequestHandler?
>
> Any help is highly appreciated.....
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-on-plain-text-and-binary-data-in-a-single-HTTP-POST-request-tp4105661.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
