[ 
https://issues.apache.org/jira/browse/SOLR-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003980#comment-13003980
 ] 

Bernd Fehling commented on SOLR-2381:
-------------------------------------

Robert, unfortunately I wasn't able to build a reproducible test, so I decided 
to debug it on my server.
The bug is in Jetty and has been fixed in jetty-7.3.1.v20110307.
Because I started debugging over the weekend, I used the older jetty-7.3.0, 
which still contains the bug. I located the bug 
and realized today that it had just been fixed in the new version released 
yesterday.

Nevertheless, here is the description, because I went through all the bits and 
bytes.
In jetty-7, jetty-server contains org.eclipse.jetty.server.HttpWriter.java.
That is the output writer, which extends Writer and does the UTF-8 encoding.
The incoming buffer is 8192 bytes in size; HttpWriter chunks and encodes it 
in pieces of 512 bytes.
As for the encoding: in Java the text is UTF-16 and each unit is read as an 
integer. If the character is above the BMP, it has a surrogate, 
which is read first, followed by the next integer.
Example: 55349 (dec) and 56320 (dec) are converted to 119808 (dec), which is U+1D400.
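
The arithmetic behind that example can be sketched in a few lines of Java. 
This is only an illustration of how a UTF-16 surrogate pair maps to a code 
point; the class and method names are made up for this sketch and are not 
taken from Jetty.

```java
// Illustrative only: combines a UTF-16 surrogate pair into one code point.
// SurrogatePairDemo and combine() are hypothetical names, not Jetty code.
class SurrogatePairDemo {
    // Manual arithmetic; the JDK offers the same via Character.toCodePoint.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = (char) 55349; // 0xD835, high (leading) surrogate
        char low = (char) 56320;  // 0xDC00, low (trailing) surrogate
        int cp = combine(high, low);
        // Prints: 119808 = U+1D400
        System.out.println(cp + " = U+" + Integer.toHexString(cp).toUpperCase());
        // Prints: true
        System.out.println(cp == Character.toCodePoint(high, low));
    }
}
```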

Remember that the buffer is 512 bytes in size. But what if the counter is at 
510 and a character above the BMP, which needs 4 bytes in UTF-8, comes up? 
The solution is to write the current buffer to the output, reset it, 
and start over with an empty 
buffer. And this is where the bug is/was.
The "surrogate reminder" was cleared too early, in the wrong place, and got lost.
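
To make the failure mode concrete, here is a minimal sketch of an encoder 
with a 512-byte buffer that remembers a high surrogate until the low one 
arrives. It is not Jetty's HttpWriter code; the class name, the method names 
and the clearOnFlush switch are assumptions made for illustration. The switch 
simulates one way the remembered surrogate could get lost when the buffer is 
flushed at the boundary.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, NOT Jetty's HttpWriter: UTF-8 encoding through a
// fixed 512-byte buffer with a remembered high surrogate.
class ChunkedUtf8Sketch {
    static final int BUFFER_SIZE = 512;

    // clearOnFlush == true simulates losing the remembered surrogate on flush.
    static byte[] encode(String text, boolean clearOnFlush) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[BUFFER_SIZE];
        int count = 0;
        char pendingHigh = 0; // 0 means no high surrogate is waiting

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                pendingHigh = c; // remember it; the low half comes next
                continue;
            }
            boolean pair = pendingHigh != 0 && Character.isLowSurrogate(c);
            int needed = pair ? 4 : (c < 0x80 ? 1 : (c < 0x800 ? 2 : 3));
            if (count + needed > BUFFER_SIZE) {
                out.write(buf, 0, count); // flush and start with an empty buffer
                count = 0;
                if (clearOnFlush) {
                    pendingHigh = 0; // simulated mistake: state lost too early
                }
            }
            int cp = (pendingHigh != 0 && Character.isLowSurrogate(c))
                    ? Character.toCodePoint(pendingHigh, c) : c;
            pendingHigh = 0; // correct place: after the pair has been combined
            count = put(buf, count, cp);
        }
        out.write(buf, 0, count);
        return out.toByteArray();
    }

    // Writes one code point as UTF-8 and returns the new buffer position.
    static int put(byte[] buf, int pos, int cp) {
        if (cp < 0x80) {
            buf[pos++] = (byte) cp;
        } else if (cp < 0x800) {
            buf[pos++] = (byte) (0xC0 | (cp >> 6));
            buf[pos++] = (byte) (0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            buf[pos++] = (byte) (0xE0 | (cp >> 12));
            buf[pos++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            buf[pos++] = (byte) (0x80 | (cp & 0x3F));
        } else {
            buf[pos++] = (byte) (0xF0 | (cp >> 18));
            buf[pos++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
            buf[pos++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            buf[pos++] = (byte) (0x80 | (cp & 0x3F));
        }
        return pos;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 510; i++) sb.append('a'); // counter ends at 510
        sb.appendCodePoint(0x1D400);                  // needs 4 bytes
        String text = sb.toString();
        byte[] expected = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(expected, encode(text, false))); // true
        System.out.println(Arrays.equals(expected, encode(text, true)));  // false
    }
}
```

With 510 ASCII characters followed by U+1D400, the flush happens exactly 
between the two surrogates, so the simulated mistake produces output that no 
longer matches the JDK's own UTF-8 encoding.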

If I find an svn repository with the jetty-6.1.26 sources, I will look into 
that version as well.
Otherwise, use jetty-7.3.1.v20110307, which is fixed.

Maybe we should set up an XML page for testing that contains more than 512 
consecutive UTF-8 encoded characters 
above the BMP?
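
One possible shape for such test data, as a rough sketch: the class name, the 
<doc> element and the count of 600 are assumptions for illustration. The 
round trip here goes through the JDK's OutputStreamWriter as a stand-in; a 
real test would send the same string through the servlet container's 
response writer and compare what the client receives.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical fixture: XML with a long run of characters above the BMP.
class SupplementaryXmlFixture {
    // Builds an XML string with `count` consecutive copies of U+1D400.
    static String build(int count) {
        StringBuilder sb = new StringBuilder(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<doc>");
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(0x1D400);
        }
        return sb.append("</doc>\n").toString();
    }

    // Encodes through a Writer (stand-in for the container) and decodes again.
    static String roundTrip(String xml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String xml = build(600); // more than 512 in a row
        System.out.println(xml.equals(roundTrip(xml)));
    }
}
```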


> The included jetty server does not support UTF-8
> ------------------------------------------------
>
>                 Key: SOLR-2381
>                 URL: https://issues.apache.org/jira/browse/SOLR-2381
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2381.patch, SOLR-ServletOutputWriter.patch, 
> jetty-6.1.26-patched-JETTY-1340.jar, jetty-util-6.1.26-patched-JETTY-1340.jar
>
>
> Some background here: 
> http://www.lucidimagination.com/search/document/6babe83bd4a98b64/which_unicode_version_is_supported_with_lucene
> Some possible solutions:
> * wait and see if we get resolution on 
> http://jira.codehaus.org/browse/JETTY-1340. To be honest, I am not even sure 
> where jetty is being maintained (there is a separate jetty project at 
> eclipse.org with another bugtracker, but the older releases are at codehaus).
> * include a patched version of jetty with correct utf-8, using that patch.
> * remove jetty and include a different container instead.

