On Wed, 4 Jun 2014 00:23:53 +0000 "Whistler, Ken" <ken.whist...@sap.com> wrote:
> You cannot even be "very confident" of not finding actual ill-formed
> UTF-16, like unpaired surrogates, in an external file, let alone
> noncharacters.

I thought unpaired surrogates were normally mojibake, broken characters,
or sabotage attempts.

> Any one of those test strings could be
> trivially turned into a text file by piping out that one UTF-16
> string to a file.

At that point, you should be in detailed control of the Unicode encoding
scheme. Also, would the system not be using one of UTF-16 with a byte
order mark, UTF-16BE, or UTF-16LE?

> And I could then write conformant test software
> that would read UTF-16 string input data from that file and run it
> through the UCA algorithm to construct sortkeys for it.

Given the number of control characters in that file, I wouldn't be
confident of getting the output back the same as it went out unless the
input were controlled at a binary level.

> As Peter said, the main thing that prevents running into these is
> that it isn't very *useful* to start off files (or strings) with
> U+FFFE.

Actually, for sorting records using the CLDR collation algorithm, it may
be very useful to use U+FFFE as a field separator. If the most
significant field for sorting is sometimes empty (e.g. surname in a list
of contacts), then the field separator could very easily be the first
non-BOM character after sorting. I suppose one had better use something
like <COMMA, U+FFFE> as a field separator instead.

> (And, additionally, in the case of UTF-16 text data files, it
> would be confusing and possibly lead to misinterpretation of byte
> order, if you were somehow depending solely on initial BOMs -- which
> I wouldn't advise, anyway.)

Interesting. Goodbye UTF-16 encoding scheme and hello automatic
encoding detection. I'm not sure how automatic detection is supposed to
work with a file consisting of just a test string from the collation
test.
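
To make the byte-order concern concrete, here is a minimal Python sketch of
what happens when a big-endian UTF-16 file with no BOM begins with U+FFFE.
The helper names utf16_file_bytes and guess_byte_order are hypothetical,
invented for this illustration, and the BOM sniffing is deliberately the
naive "look only at the first two bytes" approach warned against above.

```python
import codecs


def utf16_file_bytes(text, byteorder):
    """Encode text as raw UTF-16 in the given byte order, adding no BOM."""
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    return text.encode(codec)


def guess_byte_order(data):
    """Naive detection (an assumption for illustration): trust the first two bytes."""
    if data[:2] == codecs.BOM_UTF16_LE:  # bytes FF FE
        return "little"
    if data[:2] == codecs.BOM_UTF16_BE:  # bytes FE FF
        return "big"
    return None  # no BOM seen; the caller would have to guess some other way


# A record whose first character is the noncharacter U+FFFE, written big-endian.
data = utf16_file_bytes("\ufffea", "big")

# The leading bytes FF FE are indistinguishable from a little-endian BOM.
detected = guess_byte_order(data)

# Python's "utf-16" codec makes the same assumption when it sees FF FE.
misread = data.decode("utf-16")
correct = data.decode("utf-16-be")
```

Here detected comes out as "little" although the data was written
big-endian, and misread no longer contains U+FFFE at all, because the codec
consumed it as a BOM and byte-swapped the rest. Decoding with the explicit
"utf-16-be" codec round-trips the original string.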

> Basically, the rules of standards (e.g., you shouldn't try to
> publicly interchange noncharacters) are not like laws of
> physics. Just because the standard says you shouldn't do
> it doesn't mean it doesn't happen.

Just as theft happens.

Richard.