A quick update: it appears that cURL is supplying a Content-Type header for 
the file part of its multipart post, and is deriving "text/plain" from the file 
extension.  Changing the file name causes cURL to change this content-type to 
"application/octet-stream".  But the questions still apply: since Tika 
apparently cares deeply about content-type now, what content-type can I supply 
to tell it 'please discover the document type on your own', and how do I 
supply it through SolrJ?

Karl

________________________________________
From: Wright Karl (Nokia-LC/Cambridge)
Sent: Thursday, January 17, 2013 5:15 AM
To: [email protected]
Subject: Solrj/Tika question about content types

Hi all,

I'm researching the ticket CONNECTORS-513.  In this ticket we seem to have 
different behavior between Solr 3.x and Solr 4.x as far as Tika content 
extraction is concerned.  The differences seem to be related to the content 
type that is posted to Solr, and can be demonstrated with cURL.

Specifically, with Solr 3.x, posting a particular s-jis file with a 
content-type of "application/octet-stream", Tika correctly identifies the file 
as text and extracts the s-jis content.  With Solr 4.x, posting the same file 
with the same content-type, the content is NOT extracted.  However, if you 
provide *no* content type at all in cURL against Solr 4.x, Tika then does the 
correct thing and extracts the s-jis just fine.  Of course, I can't actually 
see what content type cURL is sending for the file itself, because it's inside 
the multipart/form-data body that cURL uses for this case.
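
For reference, here is a minimal sketch (Java, standard library only) of what 
one file part of a multipart/form-data body looks like, to show where a 
per-part Content-Type header would sit and why it does not show up among the 
request headers.  The class name, method name, boundary "XyZ", field name 
"myfile" and file names are all made up for illustration; this is not what 
cURL or Solr literally produce, only the general layout of the format.

```java
/** Illustrative only: renders one multipart/form-data file part as text. */
final class MultipartPartSketch {

    /**
     * Builds a body containing a single file part followed by the closing boundary.
     *
     * @param boundary  boundary string announced in the outer Content-Type header
     * @param fieldName form field name (hypothetical, e.g. "myfile")
     * @param fileName  file name reported by the client
     * @param partType  per-part content type, or null to omit that header entirely
     * @param body      file content, kept as text here for readability
     */
    static String buildPart(String boundary, String fieldName, String fileName,
                            String partType, String body) {
        StringBuilder sb = new StringBuilder();
        sb.append("--").append(boundary).append("\r\n");
        sb.append("Content-Disposition: form-data; name=\"").append(fieldName)
          .append("\"; filename=\"").append(fileName).append("\"\r\n");
        if (partType != null) {
            // This header lives inside the body, not among the request headers.
            sb.append("Content-Type: ").append(partType).append("\r\n");
        }
        sb.append("\r\n");                 // blank line ends the part headers
        sb.append(body).append("\r\n");
        sb.append("--").append(boundary).append("--\r\n"); // closing boundary
        return sb.toString();
    }

    public static void main(String[] args) {
        // One part with an explicit type, one with the header left out.
        System.out.print(buildPart("XyZ", "myfile", "sample.txt", "text/plain", "..."));
        System.out.println();
        System.out.print(buildPart("XyZ", "myfile", "sample.bin", null, "..."));
    }
}
```

Printing both variants side by side makes it easier to compare against a raw 
capture of the request body, which is where the per-part header would be 
visible.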

Unfortunately, with SolrJ against Solr 4.x, the s-jis is currently not 
extracted.  This sort of makes sense, since the getContentType() method for the 
stream is returning "application/octet-stream".  On the other hand, I don't 
know the following:

(1) Does the getContentType() method actually even get used by SolrJ?  When I 
looked at wire logging, it seemed that SolrJ just posts a generic 
"application/xml; charset=UTF-8" content type, and does not transmit anything 
else.  It also uses a standard POST, not a multipart/form-data POST.

(2) Is there any way I can override whatever content type SolrJ is transmitting 
to Tika, since it now seems to matter so much?

Thanks,
Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

