: A code point (Unicode character) outside of the BMP (basic
: multilingual plane, fits in 16 bits) is represented as two Java chars
: - a surrogate pair. It's a single logical character - see
: String.codePointAt(). In correct UTF-8 it should be encoded as a
: single code point... but Jetty is ignoring the fact that it's a
: surrogate pair and encoding each Java char as its own code point...
: this is often called modified-UTF8 or CESU-8.
:
: So... say you have this incorrect CESU-8 that is masquerading as
: UTF-8: all is not lost. ...
I must still be misunderstanding something ... based on your description, it sounds like it shouldn't matter whether the encoder knows that it's one logical character or not; either way it should wind up outputting the same number of bytes.

Except that if that were the case, why would we have had this bug in the first place? Clearly I'm still misunderstanding something.

: Bottom line - if we correctly encapsulate whatever the servlet
: container is writing, it's certainly possible for clients to use the
: output correctly.

I still come back to not liking that this is a hardcoded hack just for Jetty ... it seems like it would be easy for a future version of Jetty to change this behavior in some way that makes Solr start breaking -- Jetty could fix this bug and break Solr's byte counting ... that doesn't seem cool.

Why don't we just output the raw bytes ourselves? The code to generate the byte[] was/is already there; we're just ignoring it and only using the length.

-Hoss
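
For reference, the byte-count question can be checked with a small sketch. It is an illustration of the per-char behaviour described in the quoted text, not Jetty's actual code; the class name SurrogateBytes and both method names are hypothetical. It encodes one supplementary character two ways: as real UTF-8 via the JDK, and char by char, with each half of the surrogate pair treated as its own code point.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class SurrogateBytes {

    // Hypothetical helper: encode every Java char on its own, ignoring
    // surrogate pairs (the CESU-8 style behaviour described in the thread).
    public static byte[] encodePerChar(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);                          // 1 byte
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));            // 2 bytes
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));           // 3 bytes, surrogates included
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Standard UTF-8 from the JDK: a valid surrogate pair becomes one 4-byte sequence.
    public static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1D11E lies outside the BMP, so it needs two Java chars.
        String s = new String(Character.toChars(0x1D11E));
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("UTF-8 bytes: " + encodeUtf8(s).length);
        System.out.println("per-char:    " + encodePerChar(s).length);
    }
}
```

Under these assumptions the two encodings differ in length for any character outside the BMP (4 bytes versus 6 for the one above), so a length computed from one encoding will not match output written in the other. For strings made up only of BMP characters the two byte counts agree.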