: A code point (Unicode character) outside of the BMP (basic
: multilingual plane, fits in 16 bits) is represented as two Java chars
: - a surrogate pair. It's a single logical character - see
: String.codePointAt(). In correct UTF-8 it should be encoded as a
: single code point... but Jetty is ignoring the fact that it's a
: surrogate pair and encoding each Java char as its own code point...
: this is often called modified-UTF8 or CESU-8.
:
: So... say you have this incorrect CESU-8 that is masquerading as
: UTF-8: all is not lost.
...
I must be misunderstanding something still ... based on your description,
it sounds like it shouldn't matter if the encoder knows that it's one
logical character or not, either way it should wind up outputting the same
number of bytes....
Except that if that were the case, why would we have had this bug in the
first place? clearly i'm still misunderstanding something.
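
For reference, a minimal sketch of where the byte counts could diverge,
assuming the description above is accurate: real UTF-8 encodes a
supplementary code point as one 4-byte sequence, while encoding each
half of the surrogate pair separately (CESU-8 style) gives 3 + 3 = 6
bytes. The class and method names here are made up for illustration,
and cesu8Length() only imitates the behavior described above; it is not
Jetty's actual code.

```java
import java.nio.charset.StandardCharsets;

class SurrogateBytes {
    // Length in bytes when every Java char is encoded as if it were its
    // own code point (CESU-8 style): each surrogate half takes 3 bytes.
    static int cesu8Length(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                n += 1;
            } else if (c < 0x800) {
                n += 2;
            } else {
                n += 3; // rest of the BMP, including surrogate halves
            }
        }
        return n;
    }

    // Length in bytes under standard UTF-8, as produced by the JDK encoder.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the BMP.
        String s = new String(Character.toChars(0x1D11E));
        System.out.println("java chars:   " + s.length());
        System.out.println("code points:  " + s.codePointCount(0, s.length()));
        System.out.println("UTF-8 bytes:  " + utf8Length(s));
        System.out.println("CESU-8 bytes: " + cesu8Length(s));
    }
}
```

So a length computed from a real UTF-8 byte[] would be 2 bytes short,
per supplementary character, of what a CESU-8 style writer emits.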
: Bottom line - if we correctly encapsulate whatever the servlet
: container is writing, it's certainly possible for clients to use the
: output correctly.
I still come back to not liking that this is a hardcoded hack just for
jetty ... it seems like it would be easy for a future version of jetty to
change this behavior in some way that makes solr start breaking -- jetty
could fix this bug and break solr's byte counting ... that doesn't seem
cool
why don't we just output the raw bytes ourselves? the code to generate
the byte[] was/is already there, we're just ignoring it and only using
the length.
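
Roughly what i have in mind, as a sketch only: encode the string once,
write that byte[] to the underlying OutputStream, and report its length,
so the count and the bytes on the wire can't disagree. writeUtf8() and
the class name are hypothetical, and this assumes we can get at a raw
OutputStream instead of the container's Writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class RawByteWriter {
    // Hypothetical helper: encode once, write exactly those bytes, and
    // return how many were written.
    static int writeUtf8(OutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        return bytes.length;
    }

    public static void main(String[] args) throws IOException {
        // ByteArrayOutputStream stands in for the response stream here.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String s = "a" + new String(Character.toChars(0x1D11E));
        int written = writeUtf8(out, s);
        System.out.println("reported: " + written + ", actual: " + out.size());
    }
}
```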
-Hoss