: A code point (Unicode character) outside of the BMP (basic
: multilingual plane, fits in 16 bits) is represented as two Java chars
: - a surrogate pair.  It's a single logical character - see
: String.codePointAt().  In correct UTF-8 it should be encoded as a
: single code point... but Jetty is ignoring the fact that it's a
: surrogate pair and encoding each Java char as its own code point...
: this is often called modified-UTF8 or CESU-8.
: 
: So... say you have this incorrect CESU-8 that is masquerading as
: UTF-8: all is not lost.
        ...

I must be misunderstanding something still ... based on your description, 
it sounds like it shouldn't matter whether the encoder knows that it's one 
logical character or not, either way it should wind up outputting the same 
number of bytes....

Except that if that were the case, why would we have had this bug in the 
first place?  clearly i'm still misunderstanding something.
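
One way to make the byte-count question concrete is a small sketch that 
encodes a single supplementary code point both ways. This is only an 
illustration: the class and method names are made up, and cesu8Length() is 
an assumption about what "encoding each Java char as its own code point" 
amounts to, not a copy of Jetty's code.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateBytes {
    // Hypothetical helper: byte count when every Java char (including each
    // half of a surrogate pair) is encoded separately, CESU-8 style.
    public static int cesu8Length(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                n += 1;          // ASCII: one byte
            } else if (c < 0x800) {
                n += 2;          // two-byte sequence
            } else {
                n += 3;          // everything else, surrogates included
            }
        }
        return n;
    }

    // Byte count using the JDK's standard UTF-8 encoder, which combines a
    // valid surrogate pair into one four-byte sequence.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+10000 is the first code point outside the BMP.
        String s = new String(Character.toChars(0x10000));
        System.out.println(s.length());                      // 2 Java chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(utf8Length(s));                   // 4 bytes
        System.out.println(cesu8Length(s));                  // 6 bytes
    }
}
```

If that sketch matches what the container does, the two encodings do not 
produce the same number of bytes for such a character (4 versus 6), which 
would be enough to throw off any length computed with the other encoding.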

: Bottom line - if we correctly encapsulate whatever the servlet
: container is writing, it's certainly possible for clients to use the
: output correctly.

I still come back to not liking that this is a hardcoded hack just for 
jetty ... it seems like it would be easy for a future version of jetty to 
change this behavior in some way that makes solr start breaking -- jetty 
could fix this bug and break solr's byte counting ... that doesn't seem 
cool

why don't we just output the raw bytes ourselves?  the code to generate 
the byte[] was/is already there, we're just ignoring it and only using 
the length.
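
A rough sketch of what i mean, assuming we can get at the response's 
OutputStream rather than going through the container's Writer. 
RawByteWriter and writeUtf8() are hypothetical names for illustration, not 
existing Solr code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class RawByteWriter {
    // Hypothetical helper: encode the value ourselves, write exactly those
    // bytes, and return the count, so the reported length and the bytes on
    // the wire cannot disagree.
    public static int writeUtf8(OutputStream out, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        return bytes.length;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the servlet response stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String value = "a" + new String(Character.toChars(0x10000));
        int written = writeUtf8(out, value);
        System.out.println(written);    // 5 (1 byte + 4 bytes)
        System.out.println(out.size()); // 5, same as the reported length
    }
}
```

That way the length we report doesn't depend on how any particular 
container's Writer encodes surrogate pairs.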



-Hoss
