[ https://issues.apache.org/jira/browse/SOLR-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257400#comment-13257400 ]
Roger Håkansson commented on SOLR-3375:
---------------------------------------

I've downloaded HttpSolrServer.java from trunk, recompiled the 3.6 tree, and verified that the fix solves the problem.

> Charset problem using HttpSolrServer instead of CommonsHttpSolrServer
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3375
>                 URL: https://issues.apache.org/jira/browse/SOLR-3375
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 3.6
>            Reporter: Roger Håkansson
>            Assignee: Sami Siren
>             Fix For: 3.6.1
>
>         Attachments: SolrTest.java, commonshttpsolrserver-dump.txt, httpsolrserver-dump.txt
>
>
> I've written an application which sends PDF files to Solr for indexing, but I
> also need to index some meta-data which isn't contained inside the PDF.
> I recently upgraded to 3.6.0 and when recompiling my app, I got some
> deprecation warnings, mainly telling me to switch from CommonsHttpSolrServer
> to HttpSolrServer.
> The problem I've noticed since doing this is that all extra fields which I
> add are sent to the Solr server as ASCII only, i.e. it doesn't matter whether
> the input is UTF-8 or ISO-8859-1, anything above char 127 is sent as '?'.
> This was not the behaviour of CommonsHttpSolrServer.
> I've tracked it down to a line (271 in 3.6.0) in HttpSolrServer.java which is:
>     entity.addPart(name, new StringBody(value));
> The problem is that StringBody(String text) maps to
>     StringBody(text, "text/plain", null);
> and in
>     StringBody(String text, String mimeType, Charset charset)
> we have this piece of code:
>     if (charset == null) {
>         charset = Charset.forName("US-ASCII");
>     }
>     this.content = text.getBytes(charset.name());
>     this.charset = charset;
> So unless charset is set, everything is converted to US-ASCII.
> On the other hand, in CommonsHttpSolrServer.java (line 310 in 3.6.0) there is
> this line
>     parts.add(new StringPart(p, v, "UTF-8"));
> which adds everything as UTF-8.
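As a minimal sketch of the encoding behaviour described above (this is not code from Solr or HttpClient), the following snippet encodes the sample string with US-ASCII and with UTF-8 using only the JDK. The class name CharsetDemo and the helper roundTrip are hypothetical names chosen for illustration, and the sketch assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: encode the text with the given charset and decode
    // it again with the same charset, to show what survives the encoding step.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // "åäö" written with Unicode escapes so the source file encoding
        // does not affect the demonstration.
        String value = "\u00e5\u00e4\u00f6";

        // US-ASCII cannot represent characters above 127; String.getBytes(Charset)
        // substitutes the charset's replacement byte, which is '?'.
        System.out.println(roundTrip(value, StandardCharsets.US_ASCII)); // prints "???"

        // UTF-8 preserves the characters, using two bytes for each of them.
        System.out.println(value.getBytes(StandardCharsets.UTF_8).length); // prints 6
        System.out.println(roundTrip(value, StandardCharsets.UTF_8).equals(value)); // prints true
    }
}
```

This mirrors the effect of the null-charset default in the quoted StringBody constructor: once the bytes have been produced with US-ASCII, the original characters cannot be recovered on the server side.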
> The simple solution would be to change the faulty line in HttpSolrServer.java
> to
>     entity.addPart(name, new StringBody(value, Charset.forName("UTF-8")));
> However, this doesn't work either, since my tests have shown that neither
> Jetty nor Tomcat recognizes the strings as UTF-8; both interpret them as 8-bit
> (8859-1 I guess).
> So changing HttpSolrServer.java to
>     entity.addPart(name, new StringBody(value, Charset.forName("ISO-8859-1")));
> actually gives me the same result as using CommonsHttpSolrServer.
> But my investigations have shown that there is a difference in how
> Commons-HttpClient and HttpClient-4.x work.
> Commons-HttpClient sends all parameters as regular POST parameters but
> URL-encoded (/update/extract?param1=value&param2=value2), while
> HttpClient-4.x sends them as multipart/form-data messages, and I think that
> the problem is that each part of the multipart message should have its own
> charset parameter.
> I.e. HttpClient-4.x sends
> -----------------------------------------------------------------------------------
> --jNljZ3jE1sHG529HrzSjZWYEad-6Wu
> Content-Disposition: form-data; name="literal.string_txt"
> åäö
> -----------------------------------------------------------------------------------
> But it should probably send something like this
> -----------------------------------------------------------------------------------
> --jNljZ3jE1sHG529HrzSjZWYEad-6Wu
> Content-Disposition: form-data; name="literal.string_txt"
> Content-Type: text/plain; charset=utf-8
> åäö
> -----------------------------------------------------------------------------------

--
This message is automatically generated by JIRA.