Dominik Geelen created SOLR-6475:
------------------------------------

             Summary: SOLR-5517 broke the ExtractingRequestHandler / Tika 
content-type detection.
                 Key: SOLR-6475
                 URL: https://issues.apache.org/jira/browse/SOLR-6475
             Project: Solr
          Issue Type: Bug
          Components: contrib - Solr Cell (Tika extraction)
    Affects Versions: 4.7
            Reporter: Dominik Geelen


Hi,

as discussed with "hoss" on IRC, I'm creating this issue about a problem we 
recently ran into:

Our company uses Solr to index user-generated files (PDFs, etc.) for full-text 
search via the ExtractingRequestHandler / Tika.
Since we upgraded to Solr 4.9, the indexing process throws the following 
exception: "Must specify a Content-Type header with POST requests" 
(in solr/servlet/SolrRequestParsers.java, line 684 in the 4.9 source).

This behavior was introduced with SOLR-5517. However, as the Solr wiki states, 
Tika needs the content-type to be empty or absent in order to trigger 
auto-detection of the content / MIME type.

The two features block each other, although each is basically correct behavior 
on its own. "hoss" therefore suggested that Tika should be fixed to also 
trigger auto-detection on content-type "application/octet-stream", and I 
strongly agree with this proposal.

*Test case:*
Just use the example from the ExtractingRequestHandler wiki page:
{noformat}
curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" \
  --data-binary @tutorial.html  [-H 'Content-type:text/html']
{noformat}
but omit the Content-Type header. Alternatively, you could use the 
"SimplePostTool (post.jar)" mentioned in the wiki, though I guess it is 
broken now, too.

*Proposed solution:*
Fix the Tika content-type guessing so that it also triggers auto-detection 
on content-type "application/octet-stream".
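
To illustrate the intended rule, here is a minimal, language-neutral sketch 
(written in Python for brevity; it is not Solr or Tika code). The function 
name and the set of "generic" types are hypothetical and only serve to show 
the decision: a missing, empty, or "application/octet-stream" content-type 
yields no type hint, so auto-detection would run, while any specific type is 
passed through unchanged.

```python
from typing import Optional

# Assumption for illustration: content types that carry no real information
# about the payload and should therefore not suppress auto-detection.
GENERIC_TYPES = {"application/octet-stream"}


def stream_type_hint(content_type: Optional[str]) -> Optional[str]:
    """Return the type hint to pass on, or None to request auto-detection.

    Hypothetical helper; it only models the rule proposed in this issue.
    """
    if content_type is None:
        return None
    # Compare on the bare media type: ignore parameters such as
    # "; charset=utf-8", surrounding whitespace, and letter case.
    base = content_type.split(";", 1)[0].strip().lower()
    if not base or base in GENERIC_TYPES:
        return None
    # A specific type was supplied, so keep it exactly as sent.
    return content_type
```

With such a rule, a client that has to send some Content-Type header (as 
required since SOLR-5517) could send "application/octet-stream" and still get 
auto-detection.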

Thanks,
Dominik



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
