[ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554061 ]
Yonik Seeley commented on SOLR-443:
-----------------------------------

The problem is, the body isn't really in UTF-8. Here's a request from SolrJ with the patch:

{code}
POST /solr/select HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0
Host: localhost:8983
Content-Length: 42

q=features%3Ah%C3%A9llo&wt=xml&version=2.2
{code}

The SolrJ code is:

{code}
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
ModifiableSolrParams params = new ModifiableSolrParams();
QueryRequest req = new QueryRequest(params);
// \u00E9 is "e" with an acute accent, i.e. a non-ASCII character
params.set("q","features:h\u00E9llo");
req.setMethod(SolrRequest.METHOD.POST);
// process the request object itself so that the POST setting above is used
QueryResponse rsp = req.process(server);
{code}

What HttpClient is outputting is percent-encoded UTF-8 bytes, and that is not the same thing as a UTF-8 body. So the charset here really isn't the problem, because the body is nothing but ASCII.

The body coding matches the type of coding specified in the URI RFC:
http://www.ietf.org/rfc/rfc3986.txt
But that only specifies the coding for parameters that go in the URI. I haven't been able to find an updated standard that specifies percent-encoded UTF-8 bytes for application/x-www-form-urlencoded. Does anyone know if there is one?

Anyway, long story short is that this may still fail on Tomcat.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: solr-443.patch, solr-443.patch
>
>
> When sending a query via POST, the content-type is not set. The content
> charset for the POST parameters is set, but this only appears to be used for
> creating the Content-Length header in the commons library.
> Since a query is encoded in UTF-8, the HTTP headers should also specify the
> content type charset.
> On Tomcat, this causes problems when the query string contains non-ASCII
> characters (characters with accents and such), as it tries to parse the POST
> body in its default ISO-8859-1. There appears to be no way to set/change the
> default encoding for a message body on Tomcat.
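
To make the point in the comment concrete, here is a minimal sketch using only the JDK's URLEncoder/URLDecoder (not SolrJ, HttpClient, or Tomcat code). The class and method names are made up for illustration. It shows that the percent-encoded value is pure ASCII, and that a receiver which unescapes it and then interprets the resulting bytes as ISO-8859-1 would not get the original character back. Whether Tomcat actually behaves this way for a given request is an assumption here, not something this sketch verifies.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Solr or the attached patch.
public class FormEncodingDemo {

    // Percent-encode a form value using its UTF-8 bytes.
    static String encodeUtf8(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }

    // Undo the percent-encoding, interpreting the unescaped bytes in the given charset.
    static String decode(String encoded, String charset) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, charset);
    }

    // True if every character is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "features:h\u00E9llo";
        String encoded = encodeUtf8(original);

        System.out.println(encoded);          // the value as it appears in the POST body
        System.out.println(isAscii(encoded)); // the encoded body contains only ASCII

        // A receiver that assumes UTF-8 recovers the original string.
        System.out.println(decode(encoded, "UTF-8").equals(original));

        // A receiver that assumes ISO-8859-1 sees two characters (0xC3, 0xA9)
        // where the sender meant one (U+00E9).
        System.out.println(decode(encoded, "ISO-8859-1").equals(original));
    }
}
```

The charset declared on the Content-Type header describes how the body's own bytes are to be read, and those bytes are all ASCII. The charset that matters is the one the receiver applies after unescaping the %XX sequences, which is why declaring UTF-8 in the header may not be enough on its own.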