On 9 November 2011 13:11, Ian Lynagh <ig...@earth.li> wrote:
> If we aren't going to guarantee that the encoded string is unicode, then
> is there any benefit to encoding it in the first place?
(I think you mean decoded here - my understanding is that decode :: ByteString -> String, encode :: String -> ByteString)

> Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is
> 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?
>
> (Max gave some reasons earlier in this thread, but I'd need examples of
> what goes wrong to understand them).

We can do this, but it doesn't solve all the problems. Here are two such problems:

PROBLEM 1 (bleeding from non-escaping to escaping TextEncodings)
===

So let's say we are reading a filename from stdin. Currently stdin uses the utf8 TextEncoding -- this TextEncoding knows nothing about private-char roundtripping, and will throw an exception when decoding bad bytes or encoding our private chars.

Now the user types a UTF-8 encoded U+EF80 character - i.e. we get the bytes 0xEE 0xBE 0x80 on stdin. The utf8 TextEncoding naively decodes this byte sequence to the single character U+EF80. We have lost at this point: if the user supplies the resulting String to a function that encodes the String with the fileSystemEncoding, the escape mechanism will treat U+EF80 as the escape for the raw byte 0x80, so the String will be encoded into the single byte 0x80. This is probably not what we want to happen!

It means that a program like this:

"""
main = do
  fp <- getLine
  readFile fp >>= putStrLn
"""

Will fail ("file not found: \x80") when given the name of an existent file whose name is the byte sequence 0xEE 0xBE 0x80.

PROBLEM 2 (bleeding between two different escaping TextEncodings)
===

So let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80.

What happens when we then *encode* that Char sequence using a UTF-16 TextEncoding (that knows about the 0xEFxx escape mechanism)? The resulting byte sequence is 0xEE 0xBC 0x80, NOT the UTF-16 encoded version of U+EF00! This is certainly contrary to what the user would expect.
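To make the bleeding concrete, here is a toy model of the proposed escaping scheme as pure functions. The names escapeByte/unescapeChar are hypothetical illustrations, not the real GHC.IO.Encoding API, and the "encoder" is simulated rather than wired into a Handle:

"""
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Ian's proposal: when decoding, escape each raw byte b of the
-- filename as the private char U+EF00 + b.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xEF00 + fromIntegral b)

-- An escaping-aware encoder maps U+EF00..U+EFFF back to the raw byte.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | 0xEF00 <= n && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
  | otherwise                  = Nothing
  where n = ord c

-- Problem 2 in miniature: start from the UTF-8 bytes of U+EF00.
main :: IO ()
main = do
  let utf8OfEF00 = [0xEE, 0xBC, 0x80] :: [Word8]
      -- Decoding via the escaping fileSystemEncoding gives the
      -- Char sequence U+EFEE U+EFBC U+EF80 ...
      decoded    = map escapeByte utf8OfEF00
      -- ... and any escaping-aware encoder (UTF-16 or otherwise)
      -- emits the raw bytes back, NOT an encoding of U+EF00.
      reencoded  = map unescapeChar decoded
  print decoded
  print reencoded
"""

Note that the escaping step is oblivious to which encoding originally produced the bytes, which is exactly why the re-encoded output cannot come out as UTF-16.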
PROBLEM 3 (bleeding from escaping to non-escaping TextEncodings)
===

Just as above, let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80.

If you try to write this String to stdout (which uses the utf8 TextEncoding, which knows nothing about 0xEFxx escapes) you just get an exception, NOT the UTF-8 encoded version of U+EF00. Game over man, game over!

CONCLUSION
===

As far as I can see, the proposed escaping scheme recovers the roundtrip property but fails to regain a lot of other reasonable-looking behaviours.

(Note that the problems outlined above are problems in the current implementation too -- but the current implementation doesn't even pretend to support U+EFxx characters. Its correctness is entirely dependent on them never showing up, which is why we chose a part of the private codepoint region that is reserved specifically for the purpose of encoding hacks.)

Max

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users