[ 
http://issues.apache.org/jira/browse/SOLR-32?page=comments#action_12421674 ] 
            
Mike Klaas commented on SOLR-32:
--------------------------------

Just wanted to confirm this problem in my application.  It is an odd 
interaction with unicode and Jetty's io stack--it seems to only occur when an 
offseted write() into a  String with unicode characters that was preceded by a 
non-unicode write (writing the whole string is fine, as is writing char arrays).

The attached patch fixed the problem, though it was only necessary to convert 
the first write() to substring.

> Result of select request is not well-formed XML when text field contains 
> non-ASCII chars and ampersand
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-32
>                 URL: http://issues.apache.org/jira/browse/SOLR-32
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>         Environment: Seen when running with the supplied Jetty container, 
> macosx, JDK  1.5.0_06
>            Reporter: Bertrand Delacretaz
>         Assigned To: Yonik Seeley
>
> Starting with the supplied start.jar, the ampersand from this field is not 
> correctly escaped in the XML search results provided by the select page:
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
>   <doc>
>     <field name="id">amp-test-one</field>
>     <field name="content">Les événements chez Bonnie &amp; Clyde.</field>
>   </doc>
> </add>
> </stuff>
> The "content" field is defined as a "text" field in the schema.
> Adding this document to the index and querying on "id:amp-test-one" returns
> ...
>  <doc>
>   <str name="content">Les événements chez Bonnie & Clyde.&amp; Clyde.</str>
>   <str name="id">amp-test-one</str>
>  </doc>
> With first "Bonnie & Clyde" unescaped and then the correct escaped &amp;
> Browsing the index with Luke shows that the field is correctly stored.
> I think this might be a Jetty bug: patching the util/XML class of SOLR to 
> avoid the use of Writer.write(String,start,len) fixes the problem. Maybe the 
> Jetty ServletWriter gets confused by the presence of non-ascii chars?
> Here are my changes in util/XML.java. It looks like the class did use 
> String.substring(...) before, Writer.write might be faster but it seems like 
> it's broken in that environment.
> Here are my patches to util/XML.java:
> Index: src/java/org/apache/solr/util/XML.java
> ===================================================================
> --- src/java/org/apache/solr/util/XML.java      (revision 422655)
> +++ src/java/org/apache/solr/util/XML.java      (working copy)
> @@ -159,8 +159,8 @@
>        }
>        if (subst != null) {
>          if (start<i) {
> -          // out.write(str.substring(start,i));
> -          out.write(str, start, i-start);
> +          out.write(str.substring(start,i));
> +          // out.write(str, start, i-start);
>            // n+=i-start;
>          }
>          out.write(subst);
> @@ -172,8 +172,8 @@
>        out.write(str);
>        // n += str.length();
>      } else if (start<str.length()) {
> -      // out.write(str.substring(start));
> -      out.write(str, start, str.length()-start);
> +      out.write(str.substring(start));
> +      // out.write(str, start, str.length()-start);
>        // n += str.length()-start;
>      }
>      // return n;

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to