A quick update - it appears that cURL is providing a Content-Type header in the content part of its multipart post, and is using the file extension to come up with "text/plain". Changing the file name causes cURL to change this content-type to "application/octet-stream". But the questions still apply: since Tika apparently cares deeply about content-type now, what content-type can I supply through SolrJ to tell it 'please discover the document type on your own'? And how do I do that through SolrJ?
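To make the point about the per-part header concrete, here is a minimal Java sketch of what the header section of one multipart/form-data part looks like, and how an extension-based guess changes it. It uses only the JDK and does not call cURL, SolrJ, or Tika. The class name, the method names, the boundary string, the field name "myfile", and the small extension table are all assumptions made for illustration; in particular, the table is not cURL's actual lookup table.

```java
import java.util.Locale;

final class MultipartPartSketch {

    // Hypothetical extension-based guess (NOT cURL's real table): a known
    // extension maps to a specific type, anything else falls back to
    // application/octet-stream.
    static String guessContentType(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".txt")) {
            return "text/plain";
        }
        if (lower.endsWith(".html")) {
            return "text/html";
        }
        return "application/octet-stream";
    }

    // Builds the header section of a single multipart/form-data part.
    // Passing contentType == null omits the per-part Content-Type header.
    static String partHeaders(String boundary, String fieldName,
                              String fileName, String contentType) {
        StringBuilder sb = new StringBuilder();
        sb.append("--").append(boundary).append("\r\n");
        sb.append("Content-Disposition: form-data; name=\"").append(fieldName)
          .append("\"; filename=\"").append(fileName).append("\"\r\n");
        if (contentType != null) {
            sb.append("Content-Type: ").append(contentType).append("\r\n");
        }
        sb.append("\r\n"); // blank line separates part headers from part body
        return sb.toString();
    }

    public static void main(String[] args) {
        String boundary = "----sketchBoundary"; // assumed boundary value
        for (String name : new String[] {"sample.txt", "sample.bin"}) {
            System.out.print(partHeaders(boundary, "myfile", name,
                                         guessContentType(name)));
        }
    }
}
```

The per-part Content-Type sits inside the request body, which is why it does not show up alongside the top-level request headers.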
Karl

________________________________________
From: Wright Karl (Nokia-LC/Cambridge)
Sent: Thursday, January 17, 2013 5:15 AM
To: [email protected]
Subject: Solrj/Tika question about content types

Hi all,

I'm researching the ticket CONNECTORS-513. In this ticket we seem to have different behavior between Solr 3.x and Solr 4.x as far as Tika content extraction is concerned. The differences seem to be related to the content type that is posted to Solr, and can be demonstrated with cURL.

Specifically, with Solr 3.x, posting a particular s-jis file with a content-type of "application/octet-stream", Tika correctly identifies the file as text and extracts the s-jis content. With Solr 4.x, posting the same file with the same content-type, the content is NOT extracted. However, if you provide *no* content type at all in cURL against Solr 4.x, Tika then does the correct thing and extracts the s-jis just fine. Of course, I can't actually see what content type cURL is sending, because it's inside the multipart/form-data content that cURL uses for this case.

Unfortunately, with SolrJ against Solr 4.x, the s-jis is currently not extracted. This sort of makes sense, since the getContentType() method for the stream is returning "application/octet-stream". On the other hand, I don't know the following:

(1) Does the getContentType() method actually even get used in SolrJ? When I looked at wire logging, it seemed that SolrJ just posts a generic "application/xml; charset=UTF-8" content type and does not transmit anything else. It also uses a standard POST, not a multipart/form-data POST.

(2) Is there any way I can override whatever content type SolrJ is transmitting to Tika, since it now seems to matter so much?

Thanks,
Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
