[
https://issues.apache.org/jira/browse/SOLR-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roger Håkansson updated SOLR-3375:
----------------------------------
Attachment: httpsolrserver-dump.txt
commonshttpsolrserver-dump.txt
Uploaded network dumps that show the difference between CommonsHttpSolrServer and
HttpSolrServer.
> Charset problem using HttpSolrServer instead of CommonsHttpSolrServer
> ---------------------------------------------------------------------
>
> Key: SOLR-3375
> URL: https://issues.apache.org/jira/browse/SOLR-3375
> Project: Solr
> Issue Type: Bug
> Components: clients - java
> Affects Versions: 3.6, 4.0, 3.6.1
> Reporter: Roger Håkansson
> Attachments: SolrTest.java, commonshttpsolrserver-dump.txt,
> httpsolrserver-dump.txt
>
>
> I've written an application which sends PDF files to Solr for indexing, but I
> also need to index some metadata which isn't contained inside the PDF.
> I recently upgraded to 3.6.0, and when recompiling my app I got some
> deprecation warnings, mainly telling me to switch from CommonsHttpSolrServer to
> HttpSolrServer.
> The problem I've noticed since doing this is that all extra fields which I
> add are sent to the Solr server as ASCII only, i.e. UTF-8 or ISO-8859-1 doesn't
> matter; anything above char 127 is sent as '?'. This was not the behaviour of
> CommonsHttpSolrServer.
> I've tracked it down to a line (271 in 3.6.0) in HttpSolrServer.java which is:
> entity.addPart(name, new StringBody(value));
> The problem is that StringBody(String text) maps to
> StringBody(text, "text/plain", null);
> and in
> StringBody(String text, String mimeType, Charset charset)
> we have this piece of code:
> if (charset == null) {
>     charset = Charset.forName("US-ASCII");
> }
> this.content = text.getBytes(charset.name());
> this.charset = charset;
> So unless charset is set everything is converted to US-ASCII.
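The effect described above can be reproduced with the JDK alone. The following is a minimal sketch, not the HttpMime code itself: the class name `AsciiLossDemo` and the helper `roundTrip` are hypothetical, and it only assumes that encoding to US-ASCII substitutes the charset's replacement byte for unmappable characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AsciiLossDemo {
    // Encode the text with the given charset, then decode the bytes with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String value = "\u00e5\u00e4\u00f6"; // the sample field value from the report
        // With US-ASCII every character above 127 is lost.
        System.out.println(roundTrip(value, StandardCharsets.US_ASCII));
        // With UTF-8 the value survives unchanged.
        System.out.println(roundTrip(value, StandardCharsets.UTF_8).equals(value));
    }
}
```

With US-ASCII the three characters come back as `???`, which matches the symptom in the report.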
> On the other hand, in CommonsHttpSolrServer.java (line 310 in 3.6.0) there is
> this line
> parts.add(new StringPart(p, v, "UTF-8"));
> which adds everything as UTF-8.
> The simple solution would be to change the faulty line in HttpSolrServer.java
> to
> entity.addPart(name, new StringBody(value,Charset.forName("UTF-8")));
> However, this doesn't work either, since my tests have shown that neither
> Jetty nor Tomcat recognizes the strings as UTF-8; both interpret them as 8-bit
> (ISO-8859-1, I guess).
> So changing HttpSolrServer.java to
> entity.addPart(name, new StringBody(value,Charset.forName("ISO-8859-1")));
> actually gives me the same result as using CommonsHttpSolrServer.
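The mismatch guessed at above (client sends UTF-8, server reads the bytes as ISO-8859-1) can be modelled with the JDK. This is only a sketch of that assumption, not of what Jetty or Tomcat actually do internally; `Latin1MisreadDemo` and `decodeAs` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1MisreadDemo {
    // Encode with the charset the client used, decode with the charset the server assumes.
    static String decodeAs(String text, Charset sentAs, Charset readAs) {
        return new String(text.getBytes(sentAs), readAs);
    }

    public static void main(String[] args) {
        String value = "\u00e5\u00e4\u00f6";
        // UTF-8 bytes read as ISO-8859-1: each character turns into two wrong ones.
        System.out.println(decodeAs(value, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // ISO-8859-1 on both sides: the value arrives intact.
        System.out.println(decodeAs(value, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));
    }
}
```

If the server side really does default to an 8-bit charset when a part carries no charset information, this would explain why sending ISO-8859-1 appears to work while sending UTF-8 does not.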
> But my investigations have shown that there is a difference in how
> Commons-HttpClient and HttpClient-4.x work.
> Commons-HttpClient sends all parameters as regular POST parameters but
> URL-encoded (/update/extract?param1=value&param2=value2), while
> HttpClient-4.x sends them as multipart/form-data messages, and I think that
> the problem is that each multipart message should have its own charset
> parameter.
> I.e. HttpClient-4.x sends
> -----------------------------------------------------------------------------------
> --jNljZ3jE1sHG529HrzSjZWYEad-6Wu
> Content-Disposition: form-data; name="literal.string_txt"
> åäö
> -----------------------------------------------------------------------------------
> But it should probably send something like this
> -----------------------------------------------------------------------------------
> --jNljZ3jE1sHG529HrzSjZWYEad-6Wu
> Content-Disposition: form-data; name="literal.string_txt"
> Content-Type: text/plain; charset=utf-8
> åäö
> -----------------------------------------------------------------------------------
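To make the proposed wire format concrete, here is a sketch that assembles one such part by hand using only the JDK. It is not how HttpClient-4.x builds its parts; `MultipartPartSketch` and `textPart` are hypothetical names, and the layout (CRLF line endings, a blank line between headers and body) follows the usual multipart conventions rather than the dump above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MultipartPartSketch {
    // Build a single text part whose headers declare the charset used for the body.
    static byte[] textPart(String boundary, String name, String value, Charset charset) {
        String headers = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + name + "\"\r\n"
                + "Content-Type: text/plain; charset=" + charset.name() + "\r\n"
                + "\r\n"; // blank line separates headers from the body
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] headerBytes = headers.getBytes(StandardCharsets.US_ASCII);
        out.write(headerBytes, 0, headerBytes.length);
        // The body is encoded with the same charset the header announces.
        byte[] body = value.getBytes(charset);
        out.write(body, 0, body.length);
        out.write('\r');
        out.write('\n');
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] part = textPart("jNljZ3jE1sHG529HrzSjZWYEad-6Wu", "literal.string_txt",
                "\u00e5\u00e4\u00f6", StandardCharsets.UTF_8);
        System.out.print(new String(part, StandardCharsets.UTF_8));
    }
}
```

The point of the sketch is that the declared charset and the bytes of the body agree, so a receiver that honours the part's Content-Type header has what it needs to decode the value.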