Hi all,

I'm researching the ticket CONNECTORS-513.  In this ticket we seem to have 
different behavior between Solr 3.x and Solr 4.x as far as Tika content 
extraction is concerned.  The differences seem to be related to the content 
type that is posted to Solr, and can be demonstrated with cURL.

Specifically, when I post a particular s-jis file to Solr 3.x with a 
content-type of "application/octet-stream", Tika correctly identifies the file 
as text and extracts the s-jis content.  When I post the same file with the 
same content-type to Solr 4.x, the content is NOT extracted.  However, if I 
provide *no* content type at all in cURL against Solr 4.x, Tika does the 
correct thing and extracts the s-jis just fine.  Of course, I can't actually 
see what content type cURL is sending in that case, because it is inside the 
multipart/form-data body that cURL uses.
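
For anyone following along, here is a rough sketch (plain JDK, no Solr or 
SolrJ classes) of where a per-part content type sits in a multipart/form-data 
body, as opposed to the content type of the request itself.  The class name, 
field name "myfile", file name, and boundary are all made up for illustration, 
and I have not confirmed that this matches what cURL actually puts on the wire.

```java
public class MultipartSketch {

    // Builds the text of a single-file multipart/form-data body.
    // If partContentType is null, the part carries no Content-Type header,
    // so the receiver has to guess the type from the content itself.
    // The payload is a String here only to keep the sketch short; a real
    // s-jis file would be written as raw bytes.
    public static String buildBody(String boundary, String fieldName,
                                   String fileName, String partContentType,
                                   String payload) {
        StringBuilder sb = new StringBuilder();
        sb.append("--").append(boundary).append("\r\n");
        sb.append("Content-Disposition: form-data; name=\"").append(fieldName)
          .append("\"; filename=\"").append(fileName).append("\"\r\n");
        if (partContentType != null) {
            // This header belongs to the part, not to the HTTP request.
            sb.append("Content-Type: ").append(partContentType).append("\r\n");
        }
        sb.append("\r\n").append(payload).append("\r\n");
        sb.append("--").append(boundary).append("--\r\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String boundary = "----sketchBoundary"; // hypothetical boundary
        // Request-level header: only says "multipart", nothing about the file.
        System.out.println("Content-Type: multipart/form-data; boundary=" + boundary);
        System.out.println();
        // Part-level header: this is where a per-file type would appear.
        System.out.print(buildBody(boundary, "myfile", "sample.txt",
                                   "application/octet-stream", "..."));
    }
}
```

Passing null for the part content type produces a part with only a 
Content-Disposition header, which is the shape I would expect for the "no 
content type at all" case, though that is an assumption on my part.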

Unfortunately, with SolrJ against Solr 4.x, the s-jis is currently not 
extracted.  This sort of makes sense, since the getContentType() method for the 
stream is returning "application/octet-stream".  On the other hand, I don't 
know the following:

(1) Does SolrJ actually use the getContentType() method at all?  When I 
looked at wire logging, it seemed that SolrJ just posts a generic 
"application/xml; charset=UTF-8" content type and does not transmit anything 
else.  It also uses a standard POST, not a multipart/form-data POST.

(2) Is there any way I can override whatever content type SolrJ is transmitting 
to Tika, since it now seems to matter so much?

Thanks,
Karl
