[ 
https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554061
 ] 

Yonik Seeley commented on SOLR-443:
-----------------------------------

The problem is that the body isn't really in UTF-8.  Here's a request from 
SolrJ with the patch applied:

{code}
POST /solr/select HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0
Host: localhost:8983
Content-Length: 42

q=features%3Ah%C3%A9llo&wt=xml&version=2.2
{code}

The SolrJ code is
{code}
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ModifiableSolrParams params = new ModifiableSolrParams();
    QueryRequest req = new QueryRequest(params);
    params.set("q","features:h\u00E9llo");
    req.setMethod(SolrRequest.METHOD.POST);
    QueryResponse rsp = req.process(server);
{code}

What HttpClient is outputting is percent-encoded UTF-8 bytes, which is not the 
same thing as a body in UTF-8. So the charset here really isn't the problem, 
because the body is nothing but ASCII.  The body encoding matches the encoding 
specified in the URI RFC: http://www.ietf.org/rfc/rfc3986.txt
But that only specifies the encoding for parameters that go in the URI.
I haven't been able to find an updated standard that specifies percent-encoded 
UTF-8 bytes for application/x-www-form-urlencoded.  Does anyone know if there 
is one?
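
To illustrate the point about the body being plain ASCII, here is a minimal 
sketch using only the JDK (it does not go through SolrJ or HttpClient, and the 
class and method names are made up for this example). It percent-encodes the 
query value from its UTF-8 bytes, checks that the resulting body is pure ASCII, 
and shows what a server would get if it decoded the escapes as ISO-8859-1:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not part of SolrJ or HttpClient.
public class FormBodyCharsetDemo {

    // Percent-encode a parameter value from its UTF-8 bytes.
    public static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decode the percent escapes, interpreting the bytes in the given charset.
    public static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // True if every character in the string is 7-bit ASCII.
    public static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String q = encodeUtf8("features:h\u00E9llo");
        String body = "q=" + q + "&wt=xml&version=2.2";
        System.out.println(body);                     // the POST body
        System.out.println(body.length());            // what Content-Length would be
        System.out.println(isAscii(body));            // no raw non-ASCII bytes
        System.out.println(decode(q, "UTF-8"));       // escapes read as UTF-8
        System.out.println(decode(q, "ISO-8859-1"));  // escapes read as Latin-1
    }
}
```

The body this builds is the same 42-character string as in the request above. 
Decoding the escapes as ISO-8859-1 turns the single "\u00E9" into the two 
characters "\u00C3\u00A9", which is the kind of mismatch a server could 
produce if it ignores or never sees the intended charset.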

Anyway, long story short: this may still fail on Tomcat.




> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: solr-443.patch, solr-443.patch
>
>
> When sending a query via POST, the content-type is not set. The content 
> charset for the POST parameters is set, but this only appears to be used for 
> creating the Content-Length header in the commons library. Since a query is 
> encoded in UTF-8, the HTTP headers should also specify the content type charset.
> On Tomcat, this causes problems when the query string contains non-ASCII 
> characters (characters with accents and such) as it tries to parse the POST 
> body in its default ISO-8859-1. There appears to be no way to set/change the 
> default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
