Hi all, I'm researching the ticket CONNECTORS-513. In this ticket we seem to have different behavior between Solr 3.x and Solr 4.x as far as Tika content extraction is concerned. The differences seem to be related to the content type that is posted to Solr, and can be demonstrated with cURL.
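Since the content type is at the center of this, here is a minimal Python sketch (standard library only) of how a one-file multipart/form-data upload carries content types at two levels: the request-level Content-Type header, and an optional Content-Type header on the file part itself. It is only an illustration of where those values sit. The field name "myfile", the file name, the sample text, and the fixed boundary are all hypothetical, and this is not a capture of what cURL or SolrJ actually send.

```python
import uuid


def build_multipart(field, filename, payload, content_type=None, boundary=None):
    """Return (request-level Content-Type, body bytes) for a one-file upload.

    If content_type is None, the file part carries no Content-Type header at
    all, which leaves the receiver to detect the type on its own.
    """
    boundary = boundary or uuid.uuid4().hex
    part_headers = [
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"'
    ]
    if content_type is not None:
        part_headers.append(f"Content-Type: {content_type}")
    head = f"--{boundary}\r\n" + "\r\n".join(part_headers) + "\r\n\r\n"
    tail = f"\r\n--{boundary}--\r\n"
    body = head.encode("ascii") + payload + tail.encode("ascii")
    return f"multipart/form-data; boundary={boundary}", body


# Hypothetical Shift_JIS payload standing in for the real file.
sample = "日本語のテキスト".encode("shift_jis")

header_with, body_with = build_multipart(
    "myfile", "sample.txt", sample, "application/octet-stream", boundary="demo"
)
header_without, body_without = build_multipart(
    "myfile", "sample.txt", sample, boundary="demo"
)

print(header_with)                                        # request-level header
print(body_with.split(b"\r\n\r\n")[0].decode("ascii"))     # part headers, explicit type
print(body_without.split(b"\r\n\r\n")[0].decode("ascii"))  # part headers, no type
```

The point of the sketch is that the request-level header only says "multipart/form-data", so the type that matters for the file has to be read from inside the body.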
Specifically, with Solr 3.x, when a particular s-jis file is posted with a content type of "application/octet-stream", Tika correctly identifies the file as text and extracts the s-jis content. With Solr 4.x, posting the same file with the same content type, the content is NOT extracted. However, if you provide *no* content type at all in cURL against Solr 4.x, Tika does the correct thing and extracts the s-jis just fine. Of course, I can't actually see what content type cURL is sending in that case, because it is inside the multipart/form content that cURL uses.

Unfortunately, with SolrJ against Solr 4.x, the s-jis is currently not extracted. This sort of makes sense, since the getContentType() method for the stream is returning "application/octet-stream". On the other hand, I don't know the following:

(1) Does the getContentType() method actually even get used in SolrJ? When I looked at wire logging, it seemed that SolrJ just posts a generic "application/xml; charset=UTF-8" content type and does not transmit anything else. It also uses a standard POST, not a multipart/form POST.

(2) Is there any way I can override whatever content type SolrJ is transmitting to Tika, since it now seems to matter so much?

Thanks,
Karl
