On 11/10/10 4:39 PM, Bjoern Hoehrmann wrote:
In most cases you do not need to store the bytes in order to get them
back, you can just apply the character encoding scheme used to decode
the bytes to the string and you'll have the original byte string, so
long as the character encoding scheme is bijective, which is true for
most of the relevant schemes like UTF-8 and UTF-16.

Neither of those is bijective.

In particular, neither encoding scheme is surjective as a function from Unicode strings to byte streams (that is, there are such things as invalid byte sequences for both of them). Therefore they can't possibly be bijective. Specifically, invalid byte sequences typically lead to U+FFFD ending up in the Unicode string no matter what the particular values of the invalid bytes were, so distinct byte streams can decode to the same string and re-encoding that string can't recover the original bytes.
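
To make that concrete, here is a minimal sketch in Python. It assumes a lenient decoder that substitutes U+FFFD for invalid input, which is what Python's errors="replace" handler does; the variable names are purely illustrative and other decoders may differ in detail.

```python
# Two different byte strings, each ending in a byte that can never
# appear in well-formed UTF-8 (0xFF and 0xFE).
bytes_a = b"ab\xff"
bytes_b = b"ab\xfe"

# Lenient decoding replaces each invalid byte with U+FFFD.
str_a = bytes_a.decode("utf-8", errors="replace")
str_b = bytes_b.decode("utf-8", errors="replace")

# Both inputs collapse to the same Unicode string ...
same_string = (str_a == str_b)

# ... so re-encoding cannot give back the original bytes.
round_trip_a = str_a.encode("utf-8")
lost_original = (round_trip_a != bytes_a)

# Well-formed input, by contrast, does survive the round trip.
valid = "h\u00e9llo".encode("utf-8")
valid_round_trip = valid.decode("utf-8", errors="replace").encode("utf-8")
```

Here same_string and lost_original both come out true, while valid_round_trip equals valid. The round trip only works when the bytes were well-formed to begin with.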

like with UTF-8 encoded strings that are not-wellformed

Right. See above. Note that in most cases where the data is really wanted as a byte array, it will in fact not be valid UTF-8.

-Boris
