Re: [Haskell-cafe] Re: Strings and utf-8
> Am I wrong to think that UTF8 should be THE
> standard? I believe it can encode anything
> encoded by other encodings.

All the UTF-* encodings can encode the same code points. There are different trade-offs, though.

> Can't we consider non-utf8 text as "legacy"?
> I don't like that word, but I do think it is
> the right way to go for text. If you know
> your text has a different encoding, just use
> 'iconv' to convert it, or a special Haskell
> library for conversion.

The important thing (I think) is to have an abstract concept that encompasses all the necessary characters (i.e. Unicode) and then a few well-specified encodings with different trade-offs. A Unicode Haskell library should handle at least a few of them (and, more importantly, keep track of the encoding).

-- Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
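To make the trade-offs mentioned above a little more concrete, here is a minimal sketch that encodes code points as UTF-8 by hand and compares how many bytes the same text would need in UTF-8, UTF-16 and UTF-32. It uses only the base library; the names utf8Char, utf8Length, utf16Length and utf32Length are made up for this illustration and do not come from any library discussed in the thread. The encoder does not reject surrogate code points, so treat it as a sketch rather than a conforming implementation.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one Unicode code point as UTF-8 (1 to 4 bytes).
-- Assumption: surrogates (U+D800..U+DFFF) are not rejected in this sketch.
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- leading bits of the code point, placed in the first byte
    hi s   = fromIntegral (n `shiftR` s)
    -- a continuation byte: 10xxxxxx carrying six bits of the code point
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Size in bytes of a string in each encoding (without a byte order mark).
utf8Length, utf16Length, utf32Length :: String -> Int
utf8Length  = length . concatMap utf8Char
utf16Length = sum . map (\c -> if ord c < 0x10000 then 2 else 4)
utf32Length = (* 4) . length
```

For mostly-ASCII text such as "na\xEFve" the three lengths are 6, 10 and 20 bytes, while for the three-character CJK string "\x65E5\x672C\x8A9E" they are 9, 6 and 12. The code points are the same in every case; only the size and the ease of indexing differ.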
[Haskell-cafe] Re: Strings and utf-8
>> Language of messages is quite different
>> from language of a file you read. (...)

> Yes, it's a fundamental limitation of the
> unix locale system and multi-user
> systems. However it's no less wrong than
> just picking UTF8 all the time. (...)

Am I wrong to think that UTF8 should be THE standard? I believe it can encode anything encoded by other encodings.

Can't we consider non-utf8 text as "legacy"? I don't like that word, but I do think it is the right way to go for text. If you know your text has a different encoding, just use 'iconv' to convert it, or a special Haskell library for conversion.

That will make life difficult for a few, but make life a lot easier for programmers and users.

Maurício
Re: [Haskell-cafe] Re: Strings and utf-8
Thomas Hartman wrote:
> A translation of
> http://www.ahinea.com/en/tech/perl-unicode-struggle.html
> from perl to haskell would be a very useful piece of documentation, I think.

Perl encodes both Unicode and binary data as the same (dynamic) data type. Haskell, at least in theory, has two different types for them, namely [Char] for characters and [Word8] or ByteString for sequences of bytes.

I think the Haskell approach is better, because the programmer in most cases knows whether he wants to treat his data as characters or as bytes. Perl does it the Perlish "We guess at what the coder means" way, which leads to a lot of frustration when Perl guesses wrong.

The problems of the Haskeller trying to use Unicode, I think, will be different from those of the Perl hacker trying to use Unicode: the Haskeller will have to search for third-party modules to do what he wants, and finding those modules is the problem. The Perl hacker has all the Unicode support built in, but has to fight Perl occasionally to keep it from doing byte operations on his Unicode data.

I had a colleague here go all but insane last week trying to use 'split' on a Unicode string in Perl on Windows. split would break the string in the middle of a multi-byte UTF-8 character, crashing UTF-8 processing later on.

Reinier
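The split problem described above can be sketched in Haskell using only base. The example below treats UTF-8 data as [Word8] and shows how a byte-level split can land inside a multi-byte character, plus one way to back up to a character boundary. The function names are hypothetical, and the sketch assumes its input is already valid UTF-8.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- | UTF-8 continuation bytes have the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- | True when splitting at byte offset n would cut a character in half,
-- i.e. when the byte right after the split point is a continuation byte.
splitsInsideChar :: Int -> [Word8] -> Bool
splitsInsideChar n bytes = case drop n bytes of
  (b:_) -> n > 0 && isContinuation b
  []    -> False

-- | Like splitAt, but moves the split point back to the nearest
-- character boundary (assumes the input is valid UTF-8).
splitAtBoundary :: Int -> [Word8] -> ([Word8], [Word8])
splitAtBoundary n bytes
  | splitsInsideChar n bytes = splitAtBoundary (n - 1) bytes
  | otherwise                = splitAt n bytes

-- | The text "a", EURO SIGN (U+20AC), "b" as UTF-8 bytes.
sample :: [Word8]
sample = [0x61, 0xE2, 0x82, 0xAC, 0x62]
```

Here splitAt 2 sample cuts the euro sign after its first byte, whereas splitAtBoundary 2 sample backs up to offset 1. On the character side, splitAt 2 "a\x20AC\&b" cannot go wrong in this way, because a [Char] has no byte boundaries to miss.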
Re: [Haskell-cafe] Re: Strings and utf-8
A translation of

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

from perl to haskell would be a very useful piece of documentation, I think.

That explanation really helped me get to grips with the encoding stuff, in a perl context.

thomas.

(In reply to Duncan Coutts, 11/29/2007 07:44 AM, Subject: Re: [Haskell-cafe] Re: Strings and utf-8)
Re: [Haskell-cafe] Re: Strings and utf-8
On Thu, 2007-11-29 at 13:05 +, Jules Bean wrote:

> Language of messages is quite different from language of a file you read.
>
> Suppose I am English, and I have a Russian friend, Vlad.
>
> My default locale is, say, latin-1, and his is something cyrillic.
>
> I might well open files including my own files, and his files. The
> locale of the current user is simply no guide to the correct encoding to
> read a file in, and not a particularly reliable guide to writing a file out.
>
> Locale makes perfect sense for messages (you are communicating with the
> user, his locale tells you what language he speaks). It makes much less
> sense for file IO.

Yes, it's a fundamental limitation of the unix locale system and multi-user systems. However it's no less wrong than just picking UTF8 all the time.

Obviously one needs a text file api that allows one to specify the encoding for the cases where you happen to know it, but for the H98 file api, where there is no way of specifying an encoding, what's better than using the unix default method? (at least on unix)

Duncan
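As an illustration of the kind of text file API asked for above, here is a sketch built on functions that GHC's System.IO gained in releases after this thread (hSetEncoding, utf8, latin1 and localeEncoding). The wrappers writeFileWith, readFileWith and readFileLocale are hypothetical names for this example, not an existing library interface.

```haskell
import System.IO

-- | Write text with an explicitly chosen encoding, ignoring the locale.
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path txt =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h txt

-- | Read text with an explicitly chosen encoding.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  s <- hGetContents h
  length s `seq` hClose h   -- force the lazy read before closing the handle
  return s

-- | Fall back to the locale's encoding when the caller knows nothing
-- better, which is the Unix default behaviour discussed above.
readFileLocale :: FilePath -> IO String
readFileLocale = readFileWith localeEncoding
```

With wrappers like these, a caller who knows that a file is UTF-8 can say so with readFileWith utf8, and a caller who does not can still get the locale behaviour. Reading the same bytes with readFileWith latin1 would return a different (and longer) String for any non-ASCII text, which is Jules's point about the locale being an unreliable guide.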
Re: [Haskell-cafe] Re: Strings and utf-8
Duncan Coutts wrote:
> On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:
>> >> (...) When it's phrased as "truncates to 8
>> >> bits" it sounds so simple, surely all we need
>> >> to do is not truncate to 8 bits right?
>> >>
>> >> The problem is, what encoding should it pick?
>> >> UTF8, 16, 32, EBCDIC? (...)
>> >>
>> >> One sensible suggestion many people have made
>> >> is that H98 file IO should use the locale
>> >> encoding and do Unicode/String <-> locale
>> >> conversion. (...)
>>
>> I'm really afraid of solutions where the behavior
>> of your program changes with an environment
>> variable that not everybody has configured
>> properly, or even knows to exist.
>
> Be afraid of all your standard Unix utils in that case. They are all
> locale dependent, not just for encoding but also for sorting order and
> the language of messages.

Language of messages is quite different from language of a file you read.

Suppose I am English, and I have a Russian friend, Vlad.

My default locale is, say, latin-1, and his is something cyrillic.

I might well open files including my own files, and his files. The locale of the current user is simply no guide to the correct encoding to read a file in, and not a particularly reliable guide to writing a file out.

Locale makes perfect sense for messages (you are communicating with the user, his locale tells you what language he speaks). It makes much less sense for file IO.

Jules
Re: [Haskell-cafe] Re: Strings and utf-8
On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:
> >> (...) When it's phrased as "truncates to 8
> >> bits" it sounds so simple, surely all we need
> >> to do is not truncate to 8 bits right?
> >>
> >> The problem is, what encoding should it pick?
> >> UTF8, 16, 32, EBCDIC? (...)
> >>
> >> One sensible suggestion many people have made
> >> is that H98 file IO should use the locale
> >> encoding and do Unicode/String <-> locale
> >> conversion. (...)
>
> I'm really afraid of solutions where the behavior
> of your program changes with an environment
> variable that not everybody has configured
> properly, or even knows to exist.

Be afraid of all your standard Unix utils in that case. They are all locale dependent, not just for encoding but also for sorting order and the language of messages.

Using the locale is standard Unix behaviour (and these days the locale usually specifies UTF8 encoding). On OSX the default should be UTF8. On Windows it's a bit less clear: supposedly text files should use UTF16, but nobody actually does that as far as I can see.

Duncan
[Haskell-cafe] Re: Strings and utf-8
>> (...) When it's phrased as "truncates to 8
>> bits" it sounds so simple, surely all we need
>> to do is not truncate to 8 bits right?
>>
>> The problem is, what encoding should it pick?
>> UTF8, 16, 32, EBCDIC? (...)
>>
>> One sensible suggestion many people have made
>> is that H98 file IO should use the locale
>> encoding and do Unicode/String <-> locale
>> conversion. (...)

I'm really afraid of solutions where the behavior of your program changes with an environment variable that not everybody has configured properly, or even knows to exist.

> Wouldn't it be sensible not to use the H98 file
> I/O operations at all anymore with binary files?
> A Char represents a Unicode code point value and
> is not the right data type to use to represent a
> byte from a binary stream.

That seems nice, we would not have to create a "wide char" type just for Unicode.

This topic made me search the net for that nice quote:

"Explanations exist: they have existed for all times, for there is always an easy solution to every problem — neat, plausible and wrong."

(See: en.wikiquote.org/wiki/H._L._Mencken That guy has many quotes worth reading.)

Strings as char lists is a very good example of that. It's simple and clean, but strings are not char lists in any reasonable sense.

Best,
Maurício
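For readers wondering what "truncates to 8 bits" amounts to in practice, the following sketch models it with base only: each Char is reduced to the low byte of its code point, and the bytes are then read back as if each one were a character. The names truncate8, roundTrip and survives are invented for this illustration, and the sketch is a model of the behaviour described in the thread, not the actual H98 I/O code.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Keep only the low 8 bits of the code point; fromIntegral wraps
-- modulo 256 when converting Int to Word8.
truncate8 :: Char -> Word8
truncate8 = fromIntegral . ord

-- | Write with truncation, then read every byte back as one character.
roundTrip :: String -> String
roundTrip = map (chr . fromIntegral . truncate8)

-- | True exactly when every character fits in one byte (U+0000..U+00FF).
survives :: String -> Bool
survives s = roundTrip s == s
```

Under this model, GREEK SMALL LETTER LAMDA (U+03BB) comes back as U+00BB, so roundTrip "\x3BB" is "\xBB", while plain ASCII and Latin-1 text survives unchanged. That is why a Char, which holds a full code point, is a poor stand-in for a byte of a binary stream.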