Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

2011-03-09 Thread Karthik Shiraly
Hi,

I'm using Solr 1.4.1.
The scenario involves a user uploading multiple files. Their content is
extracted using SolrCell, then indexed by Solr along with other information
about the user.

ContentStreamUpdateRequest seemed like the right choice for this - use
addFile() to send file data, and use setParam() to add normal data fields.

However, when I call addFile() multiple times on a ContentStreamUpdateRequest,
I observed that on the server side even the file parts of the multipart POST
are interpreted as regular form fields by the FileUpload component.
I found that FileUpload does so because the filename value in the
Content-Disposition header of each part is not being set.
Digging a bit further, the actual root cause seems to be in the client-side
SolrJ API: the CommonsHttpSolrServer class does not set the filename
value in the Content-Disposition header when creating the multipart Part
instances (from the HttpClient framework).
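
To illustrate the header difference described above, here is a minimal
sketch using only the JDK. It does not call SolrJ, HttpClient or FileUpload;
the class and method names (DispositionSketch, dispositionHeader,
looksLikeFormField) are hypothetical, and the check only approximates the
behaviour described in this message (a part without a filename is treated as
a form field).

```java
import java.util.regex.Pattern;

public class DispositionSketch {
    // Matches a filename="..." parameter in a Content-Disposition header line.
    private static final Pattern FILENAME = Pattern.compile("filename=\"([^\"]*)\"");

    // Builds the Content-Disposition line for one multipart part.
    // Passing null for fileName mimics the header without a filename value.
    public static String dispositionHeader(String fieldName, String fileName) {
        StringBuilder sb = new StringBuilder("Content-Disposition: form-data; name=\"")
                .append(fieldName).append('"');
        if (fileName != null) {
            sb.append("; filename=\"").append(fileName).append('"');
        }
        return sb.toString();
    }

    // Assumption: a part with no filename parameter is handled as a plain
    // form field, as described for FileUpload in this message.
    public static boolean looksLikeFormField(String header) {
        return !FILENAME.matcher(header).find();
    }

    public static void main(String[] args) {
        String without = dispositionHeader("myfile", null);
        String with = dispositionHeader("myfile", "report.pdf");
        System.out.println(without + " -> form field? " + looksLikeFormField(without));
        System.out.println(with + " -> form field? " + looksLikeFormField(with));
    }
}
```

With the filename parameter present the part is no longer classified as a
form field, which is the effect the workaround below aims for.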

I solved this problem with a hack: in the CommonsHttpSolrServer.request()
method, where the PartBase instances are created, I overrode
sendDispositionHeader() and added the filename value. That solved the
problem.

However, my questions are:
1. Am I using ContentStreamUpdateRequest wrong, or is this actually a bug?
Should I be using something else?

2. My end goal is to map the contents of each file to *separate* fields, not a
common field. Since the regular ExtractingRequestHandler maps all content to
just one field, I believe I have to create a custom RequestHandler (possibly
reusing existing SolrCell classes).
Is this approach right?

Thanks
Karthik


Re: Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

2011-03-09 Thread Karthik Shiraly
In case the exact problem was not clear:
When FileUpload interprets file data as regular form fields, Solr thinks
there are no content streams in the request and throws a
missing_content_stream exception.