Re: Unicode windows console output.
On Thu, Nov 4, 2010 at 6:09 AM, Simon Marlow <marlo...@gmail.com> wrote:
> On 04/11/2010 02:35, David Sankel wrote:
>> On Wed, Nov 3, 2010 at 9:00 AM, Simon Marlow <marlo...@gmail.com> wrote:
>>> On 03/11/2010 10:36, Bulat Ziganshin wrote:
>>>> Hello Max,
>>>>
>>>> Wednesday, November 3, 2010, 1:26:50 PM, you wrote:
>>>>> 1. You need to use chcp 65001 to set the console code page to UTF-8.
>>>>> 2. It is very likely that your Windows console won't have the fonts
>>>>> required to actually make sense of the output. Pipe the output to
>>>>> foo.txt. If you open this file in Notepad you will see the correct
>>>>> characters show up.
>>>>
>>>> It will work even without chcp. AFAIK neither GHC nor Windows adjusts
>>>> text being output to the current console code page.
>>>
>>> GHC certainly does. We use GetConsoleCP() when deciding which code
>>> page to use by default - see libraries/base/GHC/IO/Encoding/CodePage.hs.
>>
>> This can actually be quite helpful. I've discovered that if you have a
>> console set to code page 65001 (UTF-8) and use WriteConsoleA (the
>> non-wide version) with UTF-8 encoded strings, the console displays the
>> text properly! So the solution seems to be: when outputting to a UTF-8
>> console, use WriteConsoleA.
>
> We need someone to rewrite the IO library backend for Win32. Currently
> it goes via the msvcrt POSIX emulation layer, i.e. using write() and
> pseudo-file-descriptors. More than a few problems have been caused by
> this, and it's totally unnecessary except that we get to share some
> code between the POSIX and Windows backends. We ought to be using the
> native Win32 APIs and HANDLEs directly; then we could use WriteConsoleA.

It looks like replacing the POSIX layer isn't necessary to fix the Unicode
console output bug. I've made a ticket, and in a comment I illustrate the
_setmode call that magically makes everything work:

http://hackage.haskell.org/trac/ghc/ticket/4471

I could attempt a GHC patch for this, but I don't have any experience with
the GHC code. Perhaps someone could add this _setmode call with relative
ease?
David

--
David Sankel
Sankel Software
www.sankelsoftware.com
585 617 4748 (Office)

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: Unicode windows console output.
It is possible to output some non-Latin-1 symbols if you use the wide
string API, but not all of them. Basically the console supports the
European scripts - Latin, Cyrillic and Greek - but nothing else.

2010/11/2 David Sankel <cam...@gmail.com>:
> Is there a GHC wontfix bug ticket for this? Perhaps we can make a small
> C test case and send it to the Microsoft people. Some[1] are reporting
> success with Unicode console output.
>
> David
>
> [1] http://www.codeproject.com/KB/cpp/unicode_console_output.aspx
>
> On Tue, Nov 2, 2010 at 3:49 AM, Krasimir Angelov <kr.ange...@gmail.com> wrote:
>> This is evidence for the broken Unicode support in the Windows
>> terminal and not a problem with GHC. I have experienced the same many
>> times.
>>
>> [...]
Re: Unicode windows console output.
On 2 November 2010 21:05, David Sankel <cam...@gmail.com> wrote:
> Is there a GHC wontfix bug ticket for this? Perhaps we can make a small
> C test case and send it to the Microsoft people. Some[1] are reporting
> success with Unicode console output.

I confirmed that I can output Chinese unicode from Haskell. You can test
this by using a program like:

  main = putStrLn "我学习电脑科学"

When you run it:

1. You need to use chcp 65001 to set the console code page to UTF-8.
2. It is very likely that your Windows console won't have the fonts
   required to actually make sense of the output. Pipe the output to
   foo.txt instead. If you open this file in Notepad you will see the
   correct characters show up.

If you want to see the actual correct output in the console, there are
some more issues:

1. You need to do some registry hacking to use e.g. SimSun Regular as
   the console font.
2. Even if you do this, my understanding is that it probably won't work
   (you will still get junk output in the form of the actual UTF-8
   bytes). I think you would instead need to use chcp 936 (the
   Simplified Chinese GBK code page), which tells the Windows API to
   output GBK code points instead of the UTF-8 encoding. These should
   then render correctly. However, to install the code page so chcp
   works you need to have East Asian language support installed (so
   Windows 7 Professional users like me are out of luck, because it
   appears to have been dropped in favour of language packs, which are
   only available for 7 Ultimate/Enterprise...).

I don't know how all this would adapt to the lambda character. Maybe you
need to use a Greek code page? And I have no idea where that "permission
denied" error is coming from.

In summary, this will probably never work properly. This sort of rubbish
is why I switched to OS X :-)

Cheers, Max
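Independent of the font and code-page issues Max describes, the encoding
half of the problem can be sketched portably from Haskell: force a
handle's encoding to UTF-8 so GHC does not consult the locale or console
code page at all. This is only a sketch of the workaround, not GHC's
eventual fix, and the helper name `hPutStrLnUtf8` is mine:

```haskell
import System.IO

-- Force a handle to UTF-8 before writing, instead of relying on the
-- locale / console code page that GHC picks by default.
hPutStrLnUtf8 :: Handle -> String -> IO ()
hPutStrLnUtf8 h s = do
  hSetEncoding h utf8
  hPutStrLn h s
```

Used as e.g. `main = hPutStrLnUtf8 stdout "我学习电脑科学"`, this produces
the same UTF-8 bytes you would see when piping the output to foo.txt.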
Re: Unicode windows console output.
On Wed, Nov 3, 2010 at 9:00 AM, Simon Marlow <marlo...@gmail.com> wrote:
> On 03/11/2010 10:36, Bulat Ziganshin wrote:
>> Hello Max,
>>
>> [...]
>>
>> It will work even without chcp. AFAIK neither GHC nor Windows adjusts
>> text being output to the current console code page.
>
> GHC certainly does. We use GetConsoleCP() when deciding which code page
> to use by default - see libraries/base/GHC/IO/Encoding/CodePage.hs.

This can actually be quite helpful. I've discovered that if you have a
console set to code page 65001 (UTF-8) and use WriteConsoleA (the
non-wide version) with UTF-8 encoded strings, the console displays the
text properly! So the solution seems to be: when outputting to a UTF-8
console, use WriteConsoleA.

David
Re: Unicode windows console output.
This is evidence for the broken Unicode support in the Windows terminal
and not a problem with GHC. I have experienced the same many times.

2010/11/2 David Sankel <cam...@gmail.com>:
> On Mon, Nov 1, 2010 at 10:20 PM, David Sankel <cam...@gmail.com> wrote:
>> Hello all,
>>
>> I'm attempting to output some Unicode on the Windows console. I set my
>> Windows console code page to UTF-8 using chcp 65001. The program:
>>
>> -- Test.hs
>> main = putStr "λ.x→x"
>>
>> The output of `runghc Test.hs`:
>>
>> λ.x→
>>
>> From within ghci, typing `main`:
>>
>> λ*** Exception: <stdout>: hPutChar: permission denied (Permission denied)
>>
>> I suspect both of these outputs are evidence of bugs. Might I be doing
>> something wrong? (aside from using Windows ;))
>
> I forgot to mention that I'm using Windows XP with ghc 6.12.3.
Re: Unicode windows console output.
Is there a GHC wontfix bug ticket for this? Perhaps we can make a small C
test case and send it to the Microsoft people. Some[1] are reporting
success with Unicode console output.

David

[1] http://www.codeproject.com/KB/cpp/unicode_console_output.aspx

On Tue, Nov 2, 2010 at 3:49 AM, Krasimir Angelov <kr.ange...@gmail.com> wrote:
> This is evidence for the broken Unicode support in the Windows terminal
> and not a problem with GHC. I have experienced the same many times.
>
> [...]
Re: Unicode windows console output.
On Mon, Nov 1, 2010 at 10:20 PM, David Sankel <cam...@gmail.com> wrote:
> Hello all,
>
> I'm attempting to output some Unicode on the Windows console. I set my
> Windows console code page to UTF-8 using chcp 65001. The program:
>
> -- Test.hs
> main = putStr "λ.x→x"
>
> The output of `runghc Test.hs`:
>
> λ.x→
>
> From within ghci, typing `main`:
>
> λ*** Exception: <stdout>: hPutChar: permission denied (Permission denied)
>
> I suspect both of these outputs are evidence of bugs. Might I be doing
> something wrong? (aside from using Windows ;))

I forgot to mention that I'm using Windows XP with ghc 6.12.3.
Re: unicode characters in operator name
On Saturday 11 September 2010 03:12:11, Greg wrote:
> If I read the Haskell Report correctly, operators are named by (symbol
> {symbol | :}), where symbol is either an ASCII symbol (including *) or
> a Unicode symbol (defined as any Unicode symbol or punctuation). I'm
> pretty sure º is a Unicode symbol or punctuation.

No,

Prelude Data.Char> generalCategory 'º'
LowercaseLetter

Weird, but that's how it is. If it were a symbol or punctuation, you
couldn't use it in function names like fº.
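Daniel's point can be checked mechanically. The Report's uniSymbol class
is, roughly, any Unicode symbol or punctuation character, which Data.Char
exposes as isSymbol and isPunctuation. A small sketch - the helper name
is mine, and this approximates the Report's lexical rule rather than
reproducing GHC's actual lexer:

```haskell
import Data.Char

-- Rough approximation of the Report's uniSymbol class: any Unicode
-- symbol or punctuation character.
uniSymbol :: Char -> Bool
uniSymbol c = isSymbol c || isPunctuation c
```

Here `uniSymbol 'º'` is False (º is a LowercaseLetter, so it may appear
in identifiers like fº), while ordinary operator characters such as '+'
satisfy it.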
Re: unicode characters in operator name
On 9/10/10 21:39 , Daniel Fischer wrote:
> On Saturday 11 September 2010 03:12:11, Greg wrote:
>> a Unicode symbol (defined as any Unicode symbol or punctuation). I'm
>> pretty sure º is a Unicode symbol or punctuation.
>
> Prelude Data.Char> generalCategory 'º'
> LowercaseLetter
>
> Weird, but that's how it is. If it were a symbol or punctuation, you
> couldn't use it in function names like fº.

Weird, but that's how Spanish at least treats it; it's a visually
distinct lowercase o (along with the visually distinct lowercase a, ª)
which indicates gender on an abbreviated ordinal (primero = 1º, primera
= 1ª; by convention they are raised, but 1o/1a are equally valid).

--
brandon s. allbery      [linux,solaris,freebsd,perl]     allb...@kf8nh.com
system administrator   [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university       KF8NH
Re: unicode characters in operator name
On 9/10/10 21:12 , Greg wrote:
> a Unicode symbol (defined as any Unicode symbol or punctuation). I'm
> pretty sure º is a Unicode symbol or punctuation.

No, it's a raised lowercase "o" used by convention to indicate the gender
of abbreviated ordinals. You probably want U+00B0 DEGREE SIGN instead of
U+00BA MASCULINE ORDINAL INDICATOR.

--
brandon s. allbery      [linux,solaris,freebsd,perl]     allb...@kf8nh.com
system administrator   [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university       KF8NH
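When two glyphs are this easy to confuse, it helps to look at the code
points directly. A throwaway helper (the name is mine) that prints a
character in the conventional U+XXXX form:

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- Show a character's Unicode code point in the conventional U+XXXX form.
codePoint :: Char -> String
codePoint c = "U+" ++ pad (map toUpper (showHex (ord c) ""))
  where pad s = replicate (max 0 (4 - length s)) '0' ++ s
```

For example, `codePoint 'º'` gives "U+00BA" and `codePoint '°'` gives
"U+00B0", making the two characters trivially distinguishable in source.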
Re: unicode characters in operator name
Oh cripe... Yet another reason not to use funny symbols - even the
developer can't tell them apart!

Yeah, I wanted a degree sign, but if it's all that subtle then I should
probably reconsider the whole idea.

On the positive side, I know what ª is for now, so today wasn't a
complete waste. =)

Thanks,
Greg

On Sep 10, 2010, at 06:49 PM, Brandon S Allbery KF8NH <allb...@ece.cmu.edu> wrote:
> On 9/10/10 21:12 , Greg wrote:
>> a Unicode symbol (defined as any Unicode symbol or punctuation). I'm
>> pretty sure º is a Unicode symbol or punctuation.
>
> No, it's a raised lowercase "o" used by convention to indicate the
> gender of abbreviated ordinals. You probably want U+00B0 DEGREE SIGN
> instead of U+00BA MASCULINE ORDINAL INDICATOR.
Re: Unicode alternative for '..' (ticket #3894)
On Wed, Apr 21, 2010 at 12:51 AM, Yitzchak Gale <g...@sefer.org> wrote:
> Yes, sorry. Either use TWO DOT LEADER, or remove this Unicode
> alternative altogether (i.e. leave it the way it is *without* the
> UnicodeSyntax extension). I'm happy with either of those. I just don't
> like moving the dots up to the middle, or changing the number of dots.

I would be happy with either changing the character to the baseline
ellipsis or removing it altogether.

It would be nice if we could grep (or emacs grep-find) all sources on
Hackage to check which packages use the ⋯ character. I suspect the number
is very close to 0.
Re: Unicode alternative for '..' (ticket #3894)
I wrote:
> My opinion is that we should either use TWO DOT LEADER, or just leave
> it as it is now, two FULL STOP characters.

Simon Marlow wrote:
> Just to be clear, you're suggesting *removing* the Unicode alternative
> for '..' from GHC's UnicodeSyntax extension?

Yes, sorry. Either use TWO DOT LEADER, or remove this Unicode alternative
altogether (i.e. leave it the way it is *without* the UnicodeSyntax
extension). I'm happy with either of those. I just don't like moving the
dots up to the middle, or changing the number of dots.

Thanks,
Yitz
Re: Unicode alternative for '..' (ticket #3894)
On 15/04/2010 18:12, Yitzchak Gale wrote:
> My opinion is that we should either use TWO DOT LEADER, or just leave
> it as it is now, two FULL STOP characters.
>
> [...]
>
> If we can't find a Unicode character that everyone agrees upon, I also
> don't see any problem with leaving it as two FULL STOP characters.

Just to be clear, you're suggesting *removing* the Unicode alternative
for '..' from GHC's UnicodeSyntax extension?

I have no strong opinions about this, and I'm happy to defer to those who
know more about such things than me. The current choice of MIDLINE
HORIZONTAL ELLIPSIS is probably accidental.

Cheers,
Simon
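For reference, the characters under discussion can be compared by the
general categories Data.Char reports for them. The table below is my own
sketch, but the categories come straight from the Unicode data GHC ships:

```haskell
import Data.Char (generalCategory, GeneralCategory(..))

-- The candidate characters from this thread, by code point.
candidates :: [(String, Char)]
candidates =
  [ ("FULL STOP",                   '\x002E')
  , ("ONE DOT LEADER",              '\x2024')
  , ("TWO DOT LEADER",              '\x2025')
  , ("HORIZONTAL ELLIPSIS",         '\x2026')
  , ("MIDLINE HORIZONTAL ELLIPSIS", '\x22EF')
  ]

-- Pair each candidate with its Unicode general category.
categories :: [(String, GeneralCategory)]
categories = [ (n, generalCategory c) | (n, c) <- candidates ]
```

The dot leaders and the ordinary ellipsis are all OtherPunctuation, while
the midline ellipsis ⋯ is a MathSymbol from the Mathematical Operators
block - one more way in which it is an odd choice for '..'.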
Re: Unicode alternative for '..' (ticket #3894)
I think the baseline ellipsis makes much more sense; it's hard to see how
the midline ellipsis was chosen.

--
Jason Dusek
Re: Unicode alternative for '..' (ticket #3894)
My opinion is that we should either use TWO DOT LEADER, or just leave it
as it is now, two FULL STOP characters.

Two dots indicating a range is not the same symbol as a three-dot
ellipsis. Traditional non-Unicode Haskell will continue to be around for
a long time to come. It would be very confusing to have two different
visual glyphs for this symbol.

I don't think there is any semantic problem with using TWO DOT LEADER
here. All three of the characters ONE DOT LEADER, TWO DOT LEADER, and
HORIZONTAL ELLIPSIS are legacy characters from Xerox's XCCS. There, the
characters they come from were used for forming dot leaders, e.g. in a
table of contents. Using them that way in Unicode is considered incorrect
unless they represent text that was originally encoded in XCCS; in
Unicode, one does not form dot leaders using those characters. However,
other new uses are considered legitimate. For example, HORIZONTAL
ELLIPSIS can be used for fonts that have a special ellipsis glyph, and
ONE DOT LEADER represents mijaket in Armenian encodings. So I don't see
any reason why we can't use TWO DOT LEADER to represent the two-dot
range symbol.

The above analysis is based in part on a discussion of these characters
on the Unicode list in 2003:

http://www.mail-archive.com/unic...@unicode.org/msg16285.html

The author of that particular message, Kenneth Whistler, is of the
opinion that two dots expressing a range, as in [0..1], should be
represented in Unicode as two FULL STOP characters, as we do now in
Haskell. Others in that thread - whom Mr. Whistler seems to feel are less
expert than himself regarding Unicode - think that TWO DOT LEADER is
appropriate. No one considers replacing two-dot ranges with HORIZONTAL
ELLIPSIS.

If we can't find a Unicode character that everyone agrees upon, I also
don't see any problem with leaving it as two FULL STOP characters.

Thanks,
Yitz
Re: Unicode alternative for '..' (ticket #3894)
That is very interesting - I didn't know the history of those characters.

> If we can't find a Unicode character that everyone agrees upon, I also
> don't see any problem with leaving it as two FULL STOP characters.

I agree. I don't like the current Unicode variant for '..', which is why
I suggested an alternative, but I hadn't considered removing it
altogether. It is an interesting idea.
RE: Unicode in GHC: need more advice
On 14 January 2005 12:58, Dimitry Golubovsky wrote:
> Now I need more advice on which flavor of Unicode support to implement.
> In Haskell-cafe, there were 3 flavors summarized. I am reposting the
> table here (its latest version).
>
> [the Sebastien's / Marcin's / Hugs table - trimmed]
>
> For Latin-1 characters (0 to 255), there is a lookup table defined as
>
>     unsigned char charTable[NUM_LAT1_CHARS];
>
> I did not post the contents of the table Hugs uses for the Latin-1
> part. However, with that table completely removed, Hugs did not work
> properly, so its contents somehow differ from what Unicode defines for
> that character range. If needed, I may decode that table and post its
> mapping of character categories (keeping in mind that those are
> Haskell-recognized character categories, not Unicode).

I don't know enough to comment on which of the above flavours is best.
However, I'd prefer not to use a separate table for Latin-1 characters if
possible. We should probably stick to the Report definitions for isDigit
and isSpace, but we could add a separate isUniDigit/isUniSpace for the
full Unicode classes.

> One more question that I had when experimenting with Hugs: if a
> character (like those extra blank chars) is forced into some category
> for the purposes of Haskell language compilation (per the Report), does
> this mean that any other Haskell application should recognize the
> Haskell-defined category of that character rather than the
> Unicode-defined one? For Hugs, there was no choice but to say Yes,
> because both the compiler and the interpreter used the same code to
> decide on character category. In GHC this may be different.

To be specific: the Report requires that the Haskell lexical class of
space characters includes Unicode spaces, but that the implementation of
isSpace only recognises Latin-1 spaces. That means we need two separate
classes of space characters (or just use the Report definition of
isSpace). GHC's parser doesn't currently use the Data.Char character
class predicates, but at some point we will want to parse Unicode, so
we'll need appropriate class predicates then.

> Since Hugs got there first, does it make sense to just follow what was
> done there, or will a different decision be adopted for GHC: say, for
> the parser, extra characters are forced to be blank, but for the rest
> of the programs compiled by GHC, Unicode definitions are adhered to?

Does what I said above help answer this question?

Cheers,
Simon
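The split Simon suggests can be sketched directly: keep the Report's
Latin-1 isSpace, and add a Unicode-aware variant built on the general
categories. The name isUniSpace follows his suggestion, but the exact
character set chosen here is an assumption, roughly the separator
categories plus the ASCII control whitespace:

```haskell
import Data.Char (generalCategory, GeneralCategory(..))

-- A Unicode-aware space predicate in the style Simon suggests: all
-- Unicode separator characters plus the ASCII control whitespace.
isUniSpace :: Char -> Bool
isUniSpace c =
  c `elem` " \t\n\v\f\r" ||
  generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]
```

Unlike the Report's isSpace, this accepts e.g. U+00A0 NO-BREAK SPACE (Zs)
and U+2028 LINE SEPARATOR (Zl).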
Re: Unicode in GHC: need more advice
Hi,

Simon Marlow wrote:
> You're doing fine - but a better place for the tables is as part of the
> base package, rather than the RTS. We already have some C files in the
> base package: see libraries/base/cbits, for example. I suggest just
> putting your code in there.

I have done that - now GHCi recognizes those symbols and loads fine. The
test program also works when compiled.

I still get some messages about missing prototypes and implicitly
declared functions that I defined instead of the libc functions,
especially during Stage 1. I need to look into that, but since all those
functions are basically int -> int, it does not affect the result.

The code I use is some draft code, based on what I submitted for Hugs
(pure Unicode basically, even without the extra space characters).

Now I need more advice on which flavor of Unicode support to implement.
In Haskell-cafe, there were 3 flavors summarized. I am reposting the
table here (its latest version):

      | Sebastien's      | Marcin's            | Hugs
------+------------------+---------------------+-------------------
alnum | L* N*            | L* N*               | L*, M*, N*     [1]
alpha | L*               | L*                  | L*             [1]
cntrl | Cc               | Cc Zl Zp            | Cc
digit | N*               | Nd                  | '0'..'9'
lower | Ll               | Ll                  | Ll             [1]
punct | P*               | P*                  | P*
upper | Lu               | Lt Lu               | Lu Lt          [1]
blank | Z* \t\n\r U+00A0 | Z* (except U+00A0   | ' ' \t\n\r\f\v
      |                  | U+2007 U+202F)      |
      |                  | \t\n\v\f\r U+0085   |

[1]: for characters outside the Latin-1 range. For Latin-1 characters (0
to 255), there is a lookup table defined as

    unsigned char charTable[NUM_LAT1_CHARS];

I did not post the contents of the table Hugs uses for the Latin-1 part.
However, with that table completely removed, Hugs did not work properly,
so its contents somehow differ from what Unicode defines for that
character range. If needed, I may decode that table and post its mapping
of character categories (keeping in mind that those are
Haskell-recognized character categories, not Unicode).

I am not asking for discussion in this list again. I rather expect some
suggestion from the GHC team leads on which flavor (of those shown above,
or some combination of the above) to implement.

One more question that I had when experimenting with Hugs: if a character
(like those extra blank chars) is forced into some category for the
purposes of Haskell language compilation (per the Report), does this mean
that any other Haskell application should recognize the Haskell-defined
category of that character rather than the Unicode-defined one? For Hugs,
there was no choice but to say Yes, because both the compiler and the
interpreter used the same code to decide on character category. In GHC
this may be different.

Since Hugs got there first, does it make sense to just follow what was
done there, or will a different decision be adopted for GHC: say, for the
parser, extra characters are forced to be blank, but for the rest of the
programs compiled by GHC, Unicode definitions are adhered to?

PS The latest rebuild I did used ghc with the new code compiled in as the
Stage 1 compiler.

Dimitry Golubovsky
Middletown, CT
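The difference between the table's flavours can be made concrete with
generalCategory. For example, the 'digit' row: Sebastien's flavour takes
all the Unicode number categories, Marcin's only decimal digits. The
predicate names here are mine:

```haskell
import Data.Char (generalCategory, GeneralCategory(..))

-- Sebastien's flavour of digit: any Unicode number category (Nd, Nl, No).
digitN :: Char -> Bool
digitN c = generalCategory c `elem` [DecimalNumber, LetterNumber, OtherNumber]

-- Marcin's flavour of digit: decimal digits only (Nd).
digitNd :: Char -> Bool
digitNd c = generalCategory c == DecimalNumber
```

The two disagree on e.g. the Roman numeral Ⅻ (U+216B, LetterNumber) and
the superscript ² (U+00B2, OtherNumber): numbers, but not decimal digits.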
RE: Unicode in GHC: need some advice on building
On 11 January 2005 02:29, Dimitry Golubovsky wrote:
> Bad thing is, LD_PRELOAD does not work on all systems. So I tried to
> put the code directly into the runtime (where I believe it should be;
> the Unicode properties table is packed, and won't eat much space).
>
> I renamed the foreign function names in GHC.Unicode (to avoid conflict
> with the libc functions), adding u_ to them (so now they are
> u_iswupper, etc). I placed the new file into ghc/rts, and the include
> file into ghc/includes. I could not avoid messages about missing
> prototypes for the u_... functions, but finally I was able to build
> ghc.
>
> Now when I compiled my test program with the rebuilt ghc, it worked
> without the LD_PRELOADed library. However, GHCi could not start,
> complaining that it could not see these u_... symbols. I noticed some
> other entry points into the runtime, like revertCAFs or
> getAllocations, declared in the Haskell part of GHCi just as other
> foreign calls, so I just followed the same style - partly
> unsuccessfully.
>
> Where am I wrong?

You're doing fine - but a better place for the tables is as part of the
base package, rather than the RTS. We already have some C files in the
base package: see libraries/base/cbits, for example. I suggest just
putting your code in there.

Cheers,
Simon
Re: UniCode
Dylan Thurston <[EMAIL PROTECTED]> writes:

> Right. In Unicode, the concept of a character is not really so useful;

After reading a bit about it, I'm certainly confused. Unicode/ISO 10646
contains a lot of things that aren't really one character, e.g.
ligatures.

> most functions that traditionally operate on characters (e.g.,
> uppercase or display-width) fundamentally need to operate on strings.
> (This is due to properties of particular languages, not any design
> flaw of Unicode.)

I think an argument could be put forward that Unicode is trying to be
more than just a character set. At least at first glance, it seems to try
to be both a character set and a glyph map, and to incorporate things
like transliteration between character sets (or subsets, now that Unicode
contains them all), directionality of script, and so on.

> toUpper, toLower - Not OK. There are cases where upper casing a
> character yields two characters.

I thought title case was supposed to handle this. I'm probably confused,
though.

> etc. Any program using this library is bound to get confused on
> Unicode strings.

Even before Unicode, there is much functionality missing; for instance,
I don't see any way to compare strings using a localized order. And you
can't really use list functions like length on strings, since one item
can be two characters (Lj, ij, fi) and several items can compose one
character (combining characters). And map (==) can't compare two
Strings, e.g. in the presence of combining characters. How are other
systems handling this?

It may be that Unicode isn't flawed, but it's certainly extremely
complex. I guess I'll have to delve a bit deeper into it.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
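Ketil's point about (==) and combining characters is easy to demonstrate:
the precomposed é (U+00E9) and the sequence e + COMBINING ACUTE ACCENT
(U+0301) render identically but are different Strings. This is only a
sketch of the problem; a real comparison would normalise both sides
first, which base does not provide:

```haskell
-- Two renderings of "é": precomposed vs. base letter + combining accent.
precomposed, combining :: String
precomposed = "\x00E9"   -- LATIN SMALL LETTER E WITH ACUTE
combining   = "e\x0301"  -- 'e' followed by COMBINING ACUTE ACCENT

-- Visually identical, but they compare unequal and have different
-- lengths, so naive list functions see two different strings.
sameString :: Bool
sameString = precomposed == combining   -- False
```

Here `length precomposed` is 1 while `length combining` is 2, which is
exactly why counting Chars is not counting what a reader would call
characters.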
Re: Unicode
- Original Message - From: Ketil Malde [EMAIL PROTECTED] To: Dylan Thurston [EMAIL PROTECTED] Cc: Andrew J Bromage [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Monday, October 08, 2001 9:02 AM Subject: Re: UniCode (The spelling is 'Unicode' (and none other).) Dylan Thurston [EMAIL PROTECTED] writes: Right. In Unicode, the concept of a character is not really so useful; After reading a bit about it, I'm certainly confused. Unicode/ISO-10646 contains a lot of things that aren'r really one character, e.g. ligatures. The ligatures that are included are there for compatiblity with older character encodings. Normally, for modern technology..., ligatures are (to be) formed automatically through the font. OpenType (OT, MS and Adobe) and AAT (Apple) have support for this. There are often requests to add more ligatures to 10646/Unicode, but they are rejected since 10646/Unicode encode characters, not glyphs. (With two well-known exceptions: for compatibility, and certain dingbats.) most functions that traditionally operate on characters (e.g., uppercase or display-width) fundamentally need to operate on strings. (This is due to properties of particular languages, not any design flaw of Unicode.) I think an argument could be put forward that Unicode is trying to be more than just a character set. At least at first glance, it seems to Yes, but: try to be both a character set and a glyph map, and incorporate things not that. See above. like transliteration between character sets (or subsets, now that Unicode contains them all), directionality of script, and so on. Unicode (but not 10646) does handle bidirectionality (seeUAX 9: http://www.unicode.org/unicode/reports/tr9/), but not transliteration. (Tranliteration is handled in IBMs ICU, though: http://www-124.ibm.com/developerworks/oss/icu4j/index.html) toUpper, toLower - Not OK. There are cases where upper casing a character yields two characters. I though title case was supposed to handle this. 
> I'm probably confused, though.

The titlecase characters in Unicode are (essentially) only there for
compatibility reasons (originally for transliterating between certain
subsets of Cyrillic and Latin scripts in a 1-1 way). You're not supposed
to really use them... The cases where toUpper of a single character gives
two characters are for some (classical) Greek, where a built-in subscript
iota turns into a capital iota, and other cases where there is no
corresponding uppercase letter. It is also the case that case mapping is
context sensitive: e.g. mapping capital sigma to small sigma (mostly) or ς
(small final sigma, at end of word), or capital I to ı (small dotless i)
in Turkish, or inserting/deleting combining dot above for i and j in
Lithuanian. See UTR 21 and
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt.

>> etc. Any program using this library is bound to get confused on
>> Unicode strings.
> Even before Unicode, there is much functionality missing; for instance,
> I don't see any way to compare strings using a localized order. And you
> can't really use list functions like length on strings, since one item
> can be two characters (Lj, ij, fi) and several items can compose one
> character (combining characters).

Depends on what you mean by length and character... You seem to be after
what is sometimes referred to as a grapheme, and counting those. There is
a proposal for a definition of language-independent graphemes (with
lexical syntax), but I don't think it is stable yet.

> And map (==) can't compare two Strings, e.g. in the presence of
> combining characters. How are other systems handling this?

I guess it is not very systematic. Java and XML make the comparisons
directly by equality of the 'raw' characters *when* comparing
identifiers/similar, though for XML there is a proposal for early
normalisation, essentially to NFC (normal form C). I would have preferred
comparing the normal forms of the identifiers instead.
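[Editorial note: the one-to-many case mappings described above are exactly
what a Char -> Char toUpper cannot express. A minimal Haskell sketch,
handling only the German sharp s; the name upperStr is invented, and real
code would use the full SpecialCasing.txt data and be locale-aware:]

```haskell
import Data.Char (toUpper)

-- Data.Char.toUpper is Char -> Char, so it cannot produce two characters
-- from one. A String -> String uppercase can.
upperStr :: String -> String
upperStr = concatMap up
  where
    up 'ß' = "SS"       -- ß has no single-character uppercase
    up c   = [toUpper c]

main :: IO ()
main = do
  putStrLn (upperStr "straße")  -- STRASSE
  print (toUpper 'ß')           -- unchanged: the Char-level API is stuck
```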
For searches, the recommendation (though I doubt it is in practice yet) is
to use a collation-key-based comparison. (Note that collation keys are
usually language dependent. More about collation in UTS 10,
http://www.unicode.org/unicode/reports/tr10/, and ISO/IEC 14651.) What
does NOT make sense is to expose (to a user) the raw ordering (<) of
Unicode strings, though it may be useful internally. Orders exposed to
people (or other systems, for that matter) that aren't concerned with the
inner workings of a program should always be collation based. (But that
holds for any character encoding, it's just more apparent for Unicode.)

> It may be that Unicode isn't flawed, but it's certainly extremely
> complex. I guess I'll have to delve a bit deeper into it.

It's complex, but that is because the scripts of the world are complex
(and add to that politics, as well as compatibility and implementation
issues).

Kind regards
/kent k
Re: UniCode
G'day all.

On Fri, Oct 05, 2001 at 06:17:26PM +, Marcin 'Qrczak' Kowalczyk wrote:
> This information is out of date. AFAIR about 4 of them are assigned.
> Most for Chinese (current, not historic).

I wasn't aware of this. Last time I looked was Unicode 3.0. Thanks for
the update.

> In Haskell String = [Char].

I'll concede that String and [Char] are identical as far as the
programmer is concerned. :-)

There was some research 10+ years ago about alternative representations
for lists which were semantically identical but a little more efficient
in memory use. Even if you don't go that far (it is fiddly), constant
strings, for example, could be represented as UTF-16/UTF-8/whatever along
with some machinery to generate the list on demand. Char objects could be
implemented as flyweights. Lots of possibilities.

Cheers,
Andrew Bromage
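[Editorial note: the "packed constant string with machinery to generate
the list on demand" idea can be sketched directly; unpackUtf8 is an
invented name, and Haskell's laziness supplies the on-demand behaviour
for free:]

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode a packed UTF-8 byte sequence into [Char] lazily: consumers that
-- only look at the first few Chars never force the rest of the decoding.
-- (No validation; a sketch, not a production decoder.)
unpackUtf8 :: [Word8] -> String
unpackUtf8 [] = []
unpackUtf8 (b:bs)
  | b < 0x80  = chr (fromIntegral b) : unpackUtf8 bs
  | b < 0xE0  = multi 1 (b .&. 0x1F) bs   -- 2-byte sequence
  | b < 0xF0  = multi 2 (b .&. 0x0F) bs   -- 3-byte sequence
  | otherwise = multi 3 (b .&. 0x07) bs   -- 4-byte sequence
  where
    multi n lead rest =
      let (conts, rest') = splitAt n rest
          val = foldl (\a c -> (a `shiftL` 6) .|. fromIntegral (c .&. 0x3F))
                      (fromIntegral lead) conts
      in chr val : unpackUtf8 rest'

main :: IO ()
main = putStrLn (unpackUtf8 [0x68, 0x69, 0x20, 0xE2, 0x82, 0xAC])  -- hi €
```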
Re: UniCode
Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:

> Fri, 5 Oct 2001 02:29:51 -0700 (PDT), Krasimir Angelov
> [EMAIL PROTECTED] pisze:
>> Why Char is 32 bit. UniCode characters is 16 bit.
>
> No, Unicode characters have 21 bits (range U+0000..10FFFF).

We've been through all this, of course, but here's a quote:

  Unicode originally implied that the encoding was UCS-2 and it initially
  didn't make any provisions for characters outside the BMP (U+0000 to
  U+FFFF). When it became clear that more than 64k characters would be
  needed for certain special applications (historic alphabets and
  ideographs, mathematical and musical typesetting, etc.), Unicode was
  turned into a sort of 21-bit character set with possible code points in
  the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters
  (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024
  non-BMP characters to be represented as a sequence of two 16-bit
  surrogate characters. This way UTF-16 was born, which represents the
  extended 21-bit Unicode in a way backwards compatible with UCS-2. The
  term UTF-32 was introduced in Unicode to mean a 4-byte encoding of the
  extended 21-bit Unicode. UTF-32 is the exact same thing as UCS-4,
  except that by definition UTF-32 is never used to represent characters
  above U-0010FFFF, while UCS-4 can cover all 2^31 code positions up to
  U-7FFFFFFF.

from a/the Unicode FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html

Does Haskell's support of Unicode mean UTF-32, or full UCS-4? Recent
messages seem to indicate the former, but I don't see any reason against
the latter.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
Re: UniCode
G'day all.

On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:
> Why Char is 32 bit. UniCode characters is 16 bit.

It's not quite as simple as that. There is a set of one million (more
correctly, 1M) Unicode characters which are only accessible using
surrogate pairs (i.e. two UTF-16 codes). There are currently none of
these codes assigned, and when they are, they'll be extremely rare. So
rare, in fact, that having strings take up twice the space they currently
do simply isn't worth the cost.

However, you still need to be able to handle them. I don't know what the
official Haskell reasoning is (it may have more to do with word size than
Unicode semantics), but it makes sense to me to store single characters
in UTF-32 but strings in a more compressed format (UTF-8 or UTF-16). See
also:

http://www.unicode.org/unicode/faq/utf_bom.html

It just goes to show that strings are not merely arrays of characters
like some languages would have you believe.

Cheers,
Andrew Bromage
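[Editorial note: "single characters in UTF-32, strings in a more
compressed format" is cheap because the packing step is pure arithmetic.
A sketch of the Char-to-UTF-8 direction; the name utf8 is invented:]

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Char (up to 21 bits) as its UTF-8 byte sequence, the kind
-- of packed representation a string type could use internally.
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hd 6,  tl 0]
  | n < 0x10000 = [0xE0 .|. hd 12, tl 6,  tl 0]
  | otherwise   = [0xF0 .|. hd 18, tl 12, tl 6, tl 0]
  where
    n    = ord c
    hd s = fromIntegral (n `shiftR` s)                  -- leading byte bits
    tl s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)  -- continuation

main :: IO ()
main = do
  print (utf8 'a')         -- [97]
  print (utf8 '\x20AC')    -- [226,130,172], i.e. E2 82 AC for €
```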
Re: UniCode
Fri, 5 Oct 2001 23:23:50 +1000, Andrew J Bromage [EMAIL PROTECTED] pisze:

> There is a set of one million (more correctly, 1M) Unicode characters
> which are only accessible using surrogate pairs (i.e. two UTF-16
> codes). There are currently none of these codes assigned,

This information is out of date. AFAIR about 4 of them are assigned.
Most for Chinese (current, not historic).

> So rare, in fact, that the cost of strings taking up twice the space
> they currently do simply isn't worth it.

In Haskell strings already have high overhead. In GHC a Char# value
(inside a Char object) always takes the same size as a pointer (32 or 64
bits), no matter how much of it is used.

> It just goes to show that strings are not merely arrays of characters
> like some languages would have you believe.

In Haskell String = [Char]. It's true that Char values don't necessarily
correspond to glyphs, but Strings are composed of Chars.

--
__( Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
\__/ ^^ SYGNATURA ZASTĘPCZA QRCZAK
Re: UniCode
05 Oct 2001 14:35:17 +0200, Ketil Malde [EMAIL PROTECTED] pisze:

> Does Haskell's support of Unicode mean UTF-32, or full UCS-4?

It's not decided officially. GHC uses UTF-32. It's expected that UCS-4
will vanish and ISO-10646 will be reduced to the same range
U+0000..10FFFF as Unicode.
Re: Unicode
Manuel M. T. Chakravarty writes:

> The problem with restricting yourself to the Jouyou-Kanji is that you
> have a hard time with names (of persons and places). Many exotic and
> otherwise unused Kanji are used in names (for historical reasons) and
> as the Kanji representation of a name is the official identifier, it is
> rather bad form to write a person's name in Kana (the phonetic
> alphabets).

You're absolutely right. This fact slipped my mind. Still, probably 85%
(just a guess) of Japanese names can be written with Jyouyou kanji, and
the CJK set in Unicode is a strict superset of the Jyouyou, so there are
actually more kanji available, and the problem is not quite so severe.
However, for Chinese names I can imagine it being quite restrictive.

--
Frank Atanassow, Dept. of Computer Science, Utrecht University
Padualaan 14, PO Box 80.089, 3508 TB Utrecht, Netherlands
Tel +31 (030) 253-1012, Fax +31 (030) 251-3791
Re: Unicode
Marcin 'Qrczak' Kowalczyk wrote:

> As for the language standard: I hope that Char will be allowed or
> required to have >=30 bits instead of the current 16; but never more
> than Int, to be able to use ord and chr safely.

Er, does it have to? The Java Virtual Machine implements Unicode with 16
bits. (OK, so I suppose that means it can't cope with Korean or Chinese.)
So requiring Char to be >=30 bits would stop anyone implementing a
conformant Haskell on the JVM. (I feel strongly about this, having been
involved with MLj, which compiles ML to the JVM; Standard ML requires
8-bit chars, a requirement we decided to ignore.)
RE: Unicode
>> OTOH, it wouldn't be hard to change GHC's Char datatype to be a full
>> 32-bit integral data type.
>
> Could we do it please? It will not break anything if done slowly. I
> imagine that {read,write}CharOffAddr and _ccall_ will still use only 8
> bits of Char. But after Char is wide, libraries dealing with text
> conversion will be possible to design, to prepare for future
> international I/O, together with Foreign libraries.

I agree it should be done. But not for 4.07; we can start breaking the
tree as soon as I've forked the 4.07 branch though (hopefully today...).
We have some other small wibbles to deal with; currently a Char never
resides in the heap, because there are only 256 possible Chars, so we
declare them all statically in the RTS. Now we have to check whether the
Char falls in the allowed range before using this table (that's fairly
easy; we already do this for Int).

Cheers,
Simon
Re: Unicode
George Russell writes:

> Marcin 'Qrczak' Kowalczyk wrote:
>> As for the language standard: I hope that Char will be allowed or
>> required to have >=30 bits instead of the current 16; but never more
>> than Int, to be able to use ord and chr safely.
> Er, does it have to? The Java Virtual Machine implements Unicode with
> 16 bits. (OK, so I suppose that means it can't cope with Korean or
> Chinese.)

Just to set the record straight: Many CJK (Chinese-Japanese-Korean)
characters are encodable in 16 bits. I am not so familiar with the
Chinese or Korean situations, but in Japan there is a nationally
standardized subset of about 2000 characters called the Jyouyou
("often-used") kanji, which newspapers and most printed books are mostly
supposed to respect. These are all strictly contained in the 16-bit
space. One only needs the additional 16 bits for foreign characters (say,
Chinese), older literary works and such-like. Even then, since Japanese
has two phonetic alphabets as well, you can usually substitute phonetic
characters in the place of non-Jyouyou kanji---in fact, since these kanji
are considered difficult, one often _does_ supplement the ideographic
representation with a phonetic one. Of course, using only phonetic
characters in such cases would look unprofessional in some contexts, and
it forces the reader to guess at which word was meant...

For Korean and especially Chinese, the situation is not so pat. Korean's
phonetic alphabet is of course wholly contained within the 16-bit space,
but the Chinese, as a rule, don't use phonetic characters. Koreans rely
on their phonetic alphabet more than the Japanese do, but they still tend
to use (I believe) more esoteric Chinese ideographic characters than the
Japanese do. And the Chinese have a much larger set of ideographic
characters in common use than either of the other two.
I'm not sure what percentage is contained in the 16-bit space; it's
probably enough that you can communicate most non-specialized subjects
fairly comfortably, but it is safe to say that the Chinese would be the
first to demand more encoding space.

In summary, 16 bits is enough to encode most modern texts if you don't
mind fudging a bit, but for high-quality productions, historical and/or
specialized texts, CJK users will want 32 bits. Of course, you can always
come up with specialized schemes involving stateful encodings and/or
"block-swapping" (using the Unicode private-use areas, for example), but
then, that subverts the purpose of Unicode.

--
Frank Atanassow, Dept. of Computer Science, Utrecht University
Padualaan 14, PO Box 80.089, 3508 TB Utrecht, Netherlands
Tel +31 (030) 253-1012, Fax +31 (030) 251-3791
Re: Unicode
Tue, 16 May 2000 10:44:28 +0200, George Russell [EMAIL PROTECTED] pisze:

>> As for the language standard: I hope that Char will be allowed or
>> required to have >=30 bits instead of the current 16; but never more
>> than Int, to be able to use ord and chr safely.
> Er, does it have to? The Java Virtual Machine implements Unicode with
> 16 bits. (OK, so I suppose that means it can't cope with Korean or
> Chinese.) So requiring Char to be >=30 bits would stop anyone
> implementing a conformant Haskell on the JVM.

OK, "allowed", not "required"; currently it is not even allowed. The
minimum should probably be 16, maximum - the size of Int. Oops, ord will
have to be allowed to return negative numbers when the size of Char is
equal to the size of Int. Another solution is to make Char at least one
bit less than Int, or also at the same time no larger than 31 bits.
ISO-10646 describes a space of 31 bits, and UTF-8 is able to encode up to
31 bits, so then a UTF-8 encoder would be portable without worrying about
Char values that don't fit, and a decoder could easily check if a
character is representable in Char: ord maxBound would be guaranteed to
be positive. Choices I see:
- 30 <= Int, 16 <= Char <= 31, Char < Int
- 30 <= Int, 16 <= Char, Char < Int
- 30 <= Int, 16 <= Char, Char <= Int

--
__(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
\__/ GCS/M d- s+:-- a23 C+++$ UL++$ P+++ L++$ E- ^^ W++ N+++ o? K?
w(---) O? M- V? PS-- PE++ Y? PGP+ t QRCZAK 5? X- R tv-- b+++ DI D- G+ e
h! r--%++ y-
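[Editorial note: in today's GHC, Char is fixed to exactly the Unicode
range, so the "ord maxBound must be positive" concern resolves cleanly.
A small check, assuming a modern GHC:]

```haskell
import Data.Char (ord)

main :: IO ()
main = do
  print (ord (maxBound :: Char))  -- 1114111, i.e. 0x10FFFF: positive
  print (ord (minBound :: Char))  -- 0
  -- A decoder can test representability before converting with chr:
  print (0x110000 > ord (maxBound :: Char))  -- True: not a valid Char
```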
Re: Unicode
Tue, 16 May 2000 12:26:12 +0200 (MET DST), Frank Atanassow
[EMAIL PROTECTED] pisze:

> Of course, you can always come up with specialized schemes involving
> stateful encodings and/or "block-swapping" (using the Unicode
> private-use areas, for example), but then, that subverts the purpose of
> Unicode.

There is already a standard UTF-16 encoding that fits 2^20 more
characters into the 16-bit space, by encoding characters >= 2^16 as pairs
of "characters" from the range D800..DFFF, which are otherwise unused in
Unicode. Programmers should not be expected to care about this; most will
not anyway. Libraries will handle this format in external UTF-16-encoded
strings. UTF-8 is usually a better choice for external encoding; UTF-16
should be rarely used.
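[Editorial note: the surrogate arithmetic described above is short enough
to sketch. Function names are invented; code points >= 0x10000 are offset
by 0x10000 and split into two 10-bit halves placed in the D800..DBFF and
DC00..DFFF ranges:]

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- Split a supplementary-plane Char into its UTF-16 surrogate pair.
toSurrogates :: Char -> (Int, Int)
toSurrogates c = (0xD800 .|. (u `shiftR` 10), 0xDC00 .|. (u .&. 0x3FF))
  where u = ord c - 0x10000

-- Reassemble a surrogate pair into the original code point.
fromSurrogates :: Int -> Int -> Char
fromSurrogates hi lo =
  chr (0x10000 + ((hi - 0xD800) `shiftL` 10) + (lo - 0xDC00))

main :: IO ()
main = do
  print (toSurrogates '\x10400')                      -- (55297,56320)
  print (fromSurrogates 0xD801 0xDC00 == '\x10400')   -- True
```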
Re: Unicode
Frank Atanassow [EMAIL PROTECTED] wrote,

> George Russell writes:
>> Marcin 'Qrczak' Kowalczyk wrote:
>>> As for the language standard: I hope that Char will be allowed or
>>> required to have >=30 bits instead of the current 16; but never more
>>> than Int, to be able to use ord and chr safely.
>> Er, does it have to? The Java Virtual Machine implements Unicode with
>> 16 bits. (OK, so I suppose that means it can't cope with Korean or
>> Chinese.)
>
> Just to set the record straight: Many CJK (Chinese-Japanese-Korean)
> characters are encodable in 16 bits. I am not so familiar with the
> Chinese or Korean situations, but in Japan there is a nationally
> standardized subset of about 2000 characters called the Jyouyou
> ("often-used") kanji, which newspapers and most printed books are
> mostly supposed to respect. These are all strictly contained in the
> 16-bit space. One only needs the additional 16 bits for foreign
> characters (say, Chinese), older literary works and such-like. Even
> then, since Japanese has two phonetic alphabets as well, you can
> usually substitute phonetic characters in the place of non-Jyouyou
> kanji---in fact, since these kanji are considered difficult, one often
> _does_ supplement the ideographic representation with a phonetic one.
> Of course, using only phonetic characters in such cases would look
> unprofessional in some contexts, and it forces the reader to guess at
> which word was meant...

The problem with restricting yourself to the Jouyou-Kanji is that you
have a hard time with names (of persons and places). Many exotic and
otherwise unused Kanji are used in names (for historical reasons) and as
the Kanji representation of a name is the official identifier, it is
rather bad form to write a person's name in Kana (the phonetic
alphabets).

Cheers,
Manuel