Re: Unicode windows console output.

2010-11-04 Thread David Sankel
On Thu, Nov 4, 2010 at 6:09 AM, Simon Marlow marlo...@gmail.com wrote:

 On 04/11/2010 02:35, David Sankel wrote:

 On Wed, Nov 3, 2010 at 9:00 AM, Simon Marlow marlo...@gmail.com
 mailto:marlo...@gmail.com wrote:

On 03/11/2010 10:36, Bulat Ziganshin wrote:

Hello Max,

Wednesday, November 3, 2010, 1:26:50 PM, you wrote:

1. You need to use chcp 65001 to set the console code page
to UTF8
2. It is very likely that your Windows console won't have
the fonts
required to actually make sense of the output. Pipe the
output to
foo.txt. If you open this file in notepad you will see the
correct
characters show up.


it will work even without chcp. afaik neither ghc nor windows
adjusts text being output to the current console codepage


GHC certainly does.  We use GetConsoleCP() when deciding what code
page to use by default - see
 libraries/base/GHC/IO/Encoding/CodePage.hs.



 This can actually be quite helpful. I've discovered that if you have a
 console set to code page 65001 (UTF-8) and use WriteConsoleA (the
 non-wide version) with UTF-8 encoded strings, the console displays the
 text properly!

 So the solution seems to be, when outputting to a utf8 console use
 WriteConsoleA.


 We need someone to rewrite the IO library backend for Win32.  Currently it
 is going via the msvcrt POSIX emulation layer, i.e. using write() and
 pseudo-file-descriptors.  More than a few problems have been caused by this,
 and it's totally unnecessary except that we get to share some code between
 the POSIX and Windows backends.  We ought to be using the native Win32 APIs
 and HANDLE directly, then we could use WriteConsoleA.


It looks like replacing the POSIX layer isn't necessary to fix the Unicode
console output bug. I've made a ticket and in a comment I illustrate the
_setmode call that magically makes everything work:

http://hackage.haskell.org/trac/ghc/ticket/4471

I could attempt a ghc patch for this, but I don't have any experience with
the ghc code. Perhaps someone could add this _setmode call with relative
ease?

David

-- 
David Sankel
Sankel Software
www.sankelsoftware.com
585 617 4748 (Office)


Re: Unicode windows console output.

2010-11-03 Thread Krasimir Angelov
It is possible to output some non-Latin1 symbols if you use the wide
string API, but not all of them. Basically the console supports all
European languages but nothing else - Latin, Cyrillic and Greek.


2010/11/2 David Sankel cam...@gmail.com:
 Is there a ghc wontfix bug ticket for this? Perhaps we can make a small C
 test case and send it to the Microsoft people. Some[1] are reporting success
 with Unicode console output.
 David

 [1] http://www.codeproject.com/KB/cpp/unicode_console_output.aspx

 On Tue, Nov 2, 2010 at 3:49 AM, Krasimir Angelov kr.ange...@gmail.com
 wrote:

 This is evidence for the broken Unicode support in the Windows
 terminal and not a problem with GHC. I experienced the same many
 times.

 2010/11/2 David Sankel cam...@gmail.com:
 
  On Mon, Nov 1, 2010 at 10:20 PM, David Sankel cam...@gmail.com wrote:
 
  Hello all,
  I'm attempting to output some Unicode on the windows console. I set my
  windows console code page to utf-8 using chcp 65001.
  The program:
 
  -- Test.hs
  main = putStr "λ.x→x"
 
  The output of `runghc Test.hs`:
 
  λ.x→
 
  From within ghci, typing `main`:
 
  λ*** Exception: stdout: hPutChar: permission denied (Permission
  denied)
 
  I suspect both of these outputs are evidence of bugs. Might I be doing
  something wrong? (aside from using windows ;))
 
  I forgot to mention that I'm using Windows XP with ghc 6.12.3.
 
 
  --
  David Sankel
  Sankel Software
  www.sankelsoftware.com
  585 617 4748 (Office)
 
 
 



 --
 David Sankel
 Sankel Software
 www.sankelsoftware.com
 585 617 4748 (Office)



Re: Unicode windows console output.

2010-11-03 Thread Max Bolingbroke
On 2 November 2010 21:05, David Sankel cam...@gmail.com wrote:
 Is there a ghc wontfix bug ticket for this? Perhaps we can make a small C
 test case and send it to the Microsoft people. Some[1] are reporting success
 with Unicode console output.

I confirmed that I can output Chinese unicode from Haskell. You can
test this by using a program like:

main = putStrLn "我学习电脑科学"

When you run it:
1. You need to use chcp 65001 to set the console code page to UTF8
2. It is very likely that your Windows console won't have the fonts
required to actually make sense of the output. Pipe the output to
foo.txt. If you open this file in notepad you will see the correct
characters show up.

If you want to see the actual correct output in the console, there are
some more issues:
1. You need to do some registry hacking to use e.g. SimSun Regular
as the console font.
2. Even if you do this, my understanding is that it probably won't
work (you will still get junk output in the form of the actual UTF-8
bytes). I think you would instead need to use chcp 936 (the
Simplified Chinese GBK code page) which tells the Windows API to
output GBK code points instead of the UTF-8 encoding. These should
then render correctly. However, to install the code page so chcp works
you need to have East Asian language support installed (so Windows 7
Professional users like me are out of luck, because it appears to have
been dropped in favour of Language packs, which are only available
for 7 Ultimate/Enterprise...)

I don't know how all this would adapt to the lambda character. Maybe
you need to use a Greek code page?? And I have no idea where that
permission denied error is coming from.

In summary, this will probably never work properly. This sort of
rubbish is why I switched to OS X :-)

Cheers,
Max


Re: Unicode windows console output.

2010-11-03 Thread David Sankel
On Wed, Nov 3, 2010 at 9:00 AM, Simon Marlow marlo...@gmail.com wrote:

 On 03/11/2010 10:36, Bulat Ziganshin wrote:

 Hello Max,

 Wednesday, November 3, 2010, 1:26:50 PM, you wrote:

  1. You need to use chcp 65001 to set the console code page to UTF8
 2. It is very likely that your Windows console won't have the fonts
 required to actually make sense of the output. Pipe the output to
 foo.txt. If you open this file in notepad you will see the correct
 characters show up.


 it will work even without chcp. afaik neither ghc nor windows adjusts text
 being output to the current console codepage


 GHC certainly does.  We use GetConsoleCP() when deciding what code page to
 use by default - see libraries/base/GHC/IO/Encoding/CodePage.hs.
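
In other words, the default choice presumably has roughly this shape (only a
guess at the idea, not the actual contents of CodePage.hs, and whether
mkTextEncoding accepts "CP936"-style names everywhere is an assumption):

  {-# LANGUAGE ForeignFunctionInterface #-}
  import Data.Word (Word32)
  import System.IO (TextEncoding, mkTextEncoding)

  foreign import stdcall unsafe "windows.h GetConsoleCP"
    c_GetConsoleCP :: IO Word32

  -- Ask the console for its code page and build a matching encoding.
  consoleEncoding :: IO TextEncoding
  consoleEncoding = do
    cp <- c_GetConsoleCP
    mkTextEncoding ("CP" ++ show cp)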



This can actually be quite helpful. I've discovered that if you have a
console set to code page 65001 (UTF-8) and use WriteConsoleA (the non-wide
version) with UTF-8 encoded strings, the console displays the text properly!

So the solution seems to be, when outputting to a utf8 console use
WriteConsoleA.
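
Roughly like this, I imagine (an untested sketch: the foreign imports follow
the Win32 documentation rather than the Win32 package's own bindings, and the
UTF-8 step leans on the text and bytestring packages):

  {-# LANGUAGE ForeignFunctionInterface #-}
  import Foreign
  import Foreign.C.Types (CInt)
  import qualified Data.Text as T
  import qualified Data.Text.Encoding as T
  import qualified Data.ByteString.Unsafe as B

  type HANDLE = Ptr ()

  foreign import stdcall unsafe "windows.h GetStdHandle"
    c_GetStdHandle :: Word32 -> IO HANDLE

  foreign import stdcall unsafe "windows.h WriteConsoleA"
    c_WriteConsoleA :: HANDLE -> Ptr Word8 -> Word32
                    -> Ptr Word32 -> Ptr () -> IO CInt

  stdOutputHandle :: Word32
  stdOutputHandle = 0xFFFFFFF5   -- (DWORD)(-11), i.e. STD_OUTPUT_HANDLE

  -- Encode to UTF-8 and hand the bytes straight to the console.
  putStrConsoleUtf8 :: String -> IO ()
  putStrConsoleUtf8 s = do
    h <- c_GetStdHandle stdOutputHandle
    B.unsafeUseAsCStringLen (T.encodeUtf8 (T.pack s)) $ \(buf, len) ->
      alloca $ \written -> do
        _ <- c_WriteConsoleA h (castPtr buf) (fromIntegral len) written nullPtr
        return ()

  main :: IO ()
  main = putStrConsoleUtf8 "λ.x→x\n"

Note that WriteConsoleA only works when stdout really is a console; with
output redirected to a file one would still have to fall back to WriteFile or
the ordinary Handle machinery.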

David

-- 
David Sankel
Sankel Software
www.sankelsoftware.com
585 617 4748 (Office)


Re: Unicode windows console output.

2010-11-02 Thread Krasimir Angelov
This is evidence for the broken Unicode support in the Windows
terminal and not a problem with GHC. I experienced the same many
times.

2010/11/2 David Sankel cam...@gmail.com:

 On Mon, Nov 1, 2010 at 10:20 PM, David Sankel cam...@gmail.com wrote:

 Hello all,
 I'm attempting to output some Unicode on the windows console. I set my
 windows console code page to utf-8 using chcp 65001.
 The program:

 -- Test.hs
 main = putStr "λ.x→x"

 The output of `runghc Test.hs`:

 λ.x→

 From within ghci, typing `main`:

 λ*** Exception: stdout: hPutChar: permission denied (Permission denied)

 I suspect both of these outputs are evidence of bugs. Might I be doing
 something wrong? (aside from using windows ;))

 I forgot to mention that I'm using Windows XP with ghc 6.12.3.


 --
 David Sankel
 Sankel Software
 www.sankelsoftware.com
 585 617 4748 (Office)



Re: Unicode windows console output.

2010-11-02 Thread David Sankel
Is there a ghc wontfix bug ticket for this? Perhaps we can make a small C
test case and send it to the Microsoft people. Some[1] are reporting success
with Unicode console output.

David

[1] http://www.codeproject.com/KB/cpp/unicode_console_output.aspx

On Tue, Nov 2, 2010 at 3:49 AM, Krasimir Angelov kr.ange...@gmail.com wrote:

 This is evidence for the broken Unicode support in the Windows
 terminal and not a problem with GHC. I experienced the same many
 times.

 2010/11/2 David Sankel cam...@gmail.com:
 
  On Mon, Nov 1, 2010 at 10:20 PM, David Sankel cam...@gmail.com wrote:
 
  Hello all,
  I'm attempting to output some Unicode on the windows console. I set my
  windows console code page to utf-8 using chcp 65001.
  The program:
 
  -- Test.hs
  main = putStr "λ.x→x"
 
  The output of `runghc Test.hs`:
 
  λ.x→
 
  From within ghci, typing `main`:
 
  λ*** Exception: stdout: hPutChar: permission denied (Permission
 denied)
 
  I suspect both of these outputs are evidence of bugs. Might I be doing
  something wrong? (aside from using windows ;))
 
  I forgot to mention that I'm using Windows XP with ghc 6.12.3.
 
 
  --
  David Sankel
  Sankel Software
  www.sankelsoftware.com
  585 617 4748 (Office)
 
 
 




-- 
David Sankel
Sankel Software
www.sankelsoftware.com
585 617 4748 (Office)


Re: Unicode windows console output.

2010-11-01 Thread David Sankel
On Mon, Nov 1, 2010 at 10:20 PM, David Sankel cam...@gmail.com wrote:

 Hello all,

 I'm attempting to output some Unicode on the windows console. I set my
 windows console code page to utf-8 using chcp 65001.

 The program:

 -- Test.hs
 main = putStr "λ.x→x"


 The output of `runghc Test.hs`:

 λ.x→


 From within ghci, typing `main`:

 λ*** Exception: stdout: hPutChar: permission denied (Permission denied)


 I suspect both of these outputs are evidence of bugs. Might I be doing
 something wrong? (aside from using windows ;))


I forgot to mention that I'm using Windows XP with ghc 6.12.3.


-- 
David Sankel
Sankel Software
www.sankelsoftware.com
585 617 4748 (Office)


Re: unicode characters in operator name

2010-09-10 Thread Daniel Fischer
On Saturday 11 September 2010 03:12:11, Greg wrote:

 If I read the Haskell Report correctly, operators are named by (symbol
 {symbol | : }), where symbol is either an ascii symbol (including *) or
 a unicode symbol (defined as any Unicode symbol or punctuation).  I'm
 pretty sure º is a unicode symbol or punctuation.

No,

Prelude Data.Char> generalCategory 'º'
LowercaseLetter

weird, but that's how it is. If it were a symbol or punctuation, you 
couldn't use it in function names like fº.
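
For anyone wanting to check a candidate character before putting it in an
operator name, something along these lines does the symbol-or-punctuation
test (only a sketch of the Report's uniSymbol idea; it ignores the reserved
and special characters):

  import Data.Char (GeneralCategory(..), generalCategory)

  usableInOperator :: Char -> Bool
  usableInOperator c = generalCategory c `elem`
    [ MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol
    , ConnectorPunctuation, DashPunctuation, OpenPunctuation
    , ClosePunctuation, InitialQuote, FinalQuote, OtherPunctuation ]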



Re: unicode characters in operator name

2010-09-10 Thread Brandon S Allbery KF8NH

On 9/10/10 21:39 , Daniel Fischer wrote:
 On Saturday 11 September 2010 03:12:11, Greg wrote:
 a unicode symbol (defined as any Unicode symbol or punctuation).  I'm
 pretty sure º is a unicode symbol or punctuation.
 
 Prelude Data.Char> generalCategory 'º'
 LowercaseLetter
 
 weird, but that's how it is. If it were a symbol or punctuation, you 
 couldn't use it in function names like fº.

Weird, but that's how Spanish at least treats it; it's a visually distinct
lowercase o (along with the visually distinct lowercase a, ª) which
indicates gender on an abbreviated ordinal (primero = 1º, primera =
1ª; by convention they are raised, but 1o/1a are equally valid).

- -- 
brandon s. allbery [linux,solaris,freebsd,perl]  allb...@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university  KF8NH


Re: unicode characters in operator name

2010-09-10 Thread Brandon S Allbery KF8NH

On 9/10/10 21:12 , Greg wrote:
 unicode symbol (defined as any Unicode symbol or punctuation).  I'm pretty
 sure º is a unicode symbol or punctuation.

No, it's a raised lowercase o used by convention to indicate gender of
abbreviated ordinals.  You probably want U+00B0 DEGREE SIGN instead of
U+00BA MASCULINE ORDINAL INDICATOR.

- -- 
brandon s. allbery [linux,solaris,freebsd,perl]  allb...@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university  KF8NH


Re: unicode characters in operator name

2010-09-10 Thread Greg
Oh cripe... Yet another reason not to use funny symbols -- even the
developer can't tell them apart!

Yeah, I wanted a degree sign, but if it's all that subtle then I should
probably reconsider the whole idea.

On the positive side, I know what ª is for now, so today wasn't a complete
waste. =)

Thanks--
Greg

On Sep 10, 2010, at 06:49 PM, Brandon S Allbery KF8NH allb...@ece.cmu.edu wrote:

On 9/10/10 21:12 , Greg wrote:
 unicode symbol (defined as any Unicode symbol or punctuation).  I'm pretty
 sure º is a unicode symbol or punctuation.

No, it's a raised lowercase "o" used by convention to indicate gender of
abbreviated ordinals.  You probably want U+00B0 DEGREE SIGN instead of
U+00BA MASCULINE ORDINAL INDICATOR.

- -- 
brandon s. allbery [linux,solaris,freebsd,perl]  allb...@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university  KF8NH


Re: Unicode alternative for '..' (ticket #3894)

2010-04-21 Thread Roel van Dijk
On Wed, Apr 21, 2010 at 12:51 AM, Yitzchak Gale g...@sefer.org wrote:
 Yes, sorry. Either use TWO DOT LEADER, or remove
 this Unicode alternative altogether
 (i.e. leave it the way it is *without* the UnicodeSyntax extension).

 I'm happy with either of those. I just don't like moving the dots
 up to the middle, or changing the number of dots.

I would be happy with either changing the character to the baseline
ellipsis or removing it altogether.

It would be nice if we could grep (or emacs grep-find) all sources on
Hackage to check which packages use the ⋯ character. I suspect it is
very close to 0.


Re: Unicode alternative for '..' (ticket #3894)

2010-04-20 Thread Yitzchak Gale
I wrote:
 My opinion is that we should either use TWO DOT LEADER,
 or just leave it as it is now, two FULL STOP characters.

Simon Marlow wrote:
 Just to be clear, you're suggesting *removing* the Unicode alternative for
 '..' from GHC's UnicodeSyntax extension?

Yes, sorry. Either use TWO DOT LEADER, or remove
this Unicode alternative altogether
(i.e. leave it the way it is *without* the UnicodeSyntax extension).

I'm happy with either of those. I just don't like moving the dots
up to the middle, or changing the number of dots.

Thanks,
Yitz


Re: Unicode alternative for '..' (ticket #3894)

2010-04-19 Thread Simon Marlow

On 15/04/2010 18:12, Yitzchak Gale wrote:

My opinion is that we should either use TWO DOT LEADER,
or just leave it as it is now, two FULL STOP characters.


Just to be clear, you're suggesting *removing* the Unicode alternative 
for '..' from GHC's UnicodeSyntax extension?


I have no strong opinions about this and I'm happy to defer to those who 
know more about such things than me.  The current choice of MIDLINE is 
probably accidental.
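
For reference, the construct under discussion would let you write something
like the following (assuming U+22EF MIDLINE HORIZONTAL ELLIPSIS really is the
character currently mapped to '..' -- a sketch, not checked against the
implementation):

  {-# LANGUAGE UnicodeSyntax #-}

  digits :: [Int]
  digits = [0 ⋯ 9]   -- the same enumeration as [0 .. 9]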


Cheers,
Simon






Re: Unicode alternative for '..' (ticket #3894)

2010-04-15 Thread Jason Dusek
  I think the baseline ellipsis makes much more sense; it's
  hard to see how the midline ellipsis was chosen.

--
Jason Dusek


Re: Unicode alternative for '..' (ticket #3894)

2010-04-15 Thread Yitzchak Gale
My opinion is that we should either use TWO DOT LEADER,
or just leave it as it is now, two FULL STOP characters.

Two dots indicating a range is not the same symbol
as a three dot ellipsis.

Traditional non-Unicode Haskell will continue to be
around for a long time to come. It would be very
confusing to have two different visual glyphs for
this symbol.

I don't think there is any semantic problem with using
TWO DOT LEADER here. All three of the characters
ONE DOT LEADER, TWO DOT LEADER, and HORIZONTAL
ELLIPSIS are legacy characters from Xerox's XCCS.
There, the characters they come from were used for forming
dot leaders, e.g., in a table of contents. Using them that way
in Unicode is considered incorrect unless they represent text
that was originally encoded in XCCS; in Unicode, one does
not form dot leaders using those characters. However, other
new uses are considered legitimate. For example, HORIZONTAL
ELLIPSIS can be used for fonts that have a special ellipsis glyph,
and ONE DOT LEADER represents mijaket in Armenian encodings.
So I don't see any reason why we can't use TWO DOT LEADER to
represent the two-dot range symbol.

The above analysis is based in part upon a discussion of these
characters on the Unicode list in 2003:

http://www.mail-archive.com/unic...@unicode.org/msg16285.html

The author of that particular message, Kenneth Whistler, is
of the opinion that two dots expressing a range as in [0..1]
should be represented in Unicode as two FULL STOP characters,
as we do now in Haskell. Others in that thread - whom
Mr. Whistler seems to feel are less expert than himself
regarding Unicode - think that TWO DOT LEADER is appropriate.
No one considers replacing two-dot ranges with HORIZONTAL
ELLIPSIS.

If we can't find a Unicode character that everyone agrees upon,
I also don't see any problem with leaving it as two FULL STOP
characters.

Thanks,
Yitz


Re: Unicode alternative for '..' (ticket #3894)

2010-04-15 Thread Roel van Dijk
That is very interesting. I didn't know the history of those characters.

 If we can't find a Unicode character that everyone agrees upon,
 I also don't see any problem with leaving it as two FULL STOP
 characters.

I agree. I don't like the current Unicode variant for .., therefore
I suggested an alternative. But I didn't consider removing it
altogether. It is an interesting idea.


RE: Unicode in GHC: need more advice

2005-01-17 Thread Simon Marlow
On 14 January 2005 12:58, Dimitry Golubovsky wrote:

 Now I need more advice on which flavor of Unicode support to
 implement. In Haskell-cafe, there were 3 flavors summarized: I am
 reposting the table here (its latest version).
 
          | Sebastien's | Marcin's               | Hugs
   -------+-------------+------------------------+------------------------
    alnum | L* N*       | L* N*                  | L*, M*, N* [1]
    alpha | L*          | L*                     | L* [1]
    cntrl | Cc          | Cc Zl Zp               | Cc
    digit | N*          | Nd                     | '0'..'9'
    lower | Ll          | Ll                     | Ll [1]
    punct | P*          | P*                     | P*
    upper | Lu          | Lt Lu                  | Lu Lt [1]
    blank | Z* \t\n\r   | Z* (except U+00A0,     | ' ' \t\n\r\f\v U+00A0
          |             | U+2007, U+202F)        |
          |             | \t\n\v\f\r U+0085      |
 
  [1]: for characters outside Latin1 range. For Latin1 characters
  (0 to 255), there is a lookup table defined as
  unsigned char   charTable[NUM_LAT1_CHARS];
 
 I did not post the contents of the table Hugs uses for the Latin1
 part. However, with that table completely removed, Hugs did not work
 properly. So its contents somehow differs from what Unicode defines
 for that character range. If needed, I may decode that table and post
 its mapping of character categories (keeping in mind that those are
 Haskell-recognized character categories, not Unicode)

I don't know enough to comment on which of the above flavours is best.
However, I'd prefer not to use a separate table for Latin-1 characters
if possible.

We should probably stick to the Report definitions for isDigit and
isSpace, but we could add a separate isUniDigit/isUniSpace for the full
Unicode classes.
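
Something along these lines, say (only a sketch of that suggestion, phrased
with Data.Char.generalCategory; it is not an existing API):

  import Data.Char (GeneralCategory(..), generalCategory)

  -- Full Unicode decimal digits (general category Nd).
  isUniDigit :: Char -> Bool
  isUniDigit c = generalCategory c == DecimalNumber

  -- Full Unicode white space: the ASCII controls plus the Z* categories.
  isUniSpace :: Char -> Bool
  isUniSpace c =
    c `elem` "\t\n\v\f\r" ||
    generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]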

 One more question that I had when experimenting with Hugs: if a
 character (like those extra blank chars) is forced into some category
 for the purposes of Haskell language compilation (per the Report),
 does this mean that any other Haskell application should recognize
 Haskell-defined category of that character rather than
 Unicode-defined? 

 For Hugs, there were no choice but say Yes, because both compiler and
 interpreter used the same code to decide on character category. In GHC
 this may be different.

To be specific: the Report requires that the Haskell lexical class of
space characters includes Unicode spaces, but that the implementation of
isSpace only recognises Latin-1 spaces.  That means we need two separate
classes of space characters (or just use the report definition of
isSpace).

GHC's parser doesn't currently use the Data.Char character class
predicates, but at some point we will want to parse Unicode so we'll
need appropriate class predicates then.

 Since Hugs got there first, does it make sense just follow what was
 done here, or will a different decision be adopted for GHC: say, for
 the Parser, extra characters are forced to be blank, but for the rest
 of the programs compiled by GHC, Unicode definitions are adhered to.

Does what I said above help answer this question?

Cheers,
Simon


Re: Unicode in GHC: need more advice

2005-01-14 Thread Dimitry Golubovsky
Hi,
Simon Marlow wrote:

You're doing fine - but a better place for the tables is as part of the
base package, rather than the RTS.  We already have some C files in the
base package: see libraries/base/cbits, for example.  I suggest just
putting your code in there.
I have done that - now GHCi recognizes those symbols and loads fine. The 
test program also works when compiled. I still got some messages about 
missing prototypes and implicitly declared functions that I defined 
instead of libc functions, especially during Stage 1. I need to check 
into that, but since all those functions are basically int -> int, it 
does not affect the result.

The code I use is some draft code, based on what I submitted for Hugs 
(pure Unicode basically, even without extra space characters).

Now I need more advice on which flavor of Unicode support to 
implement. In Haskell-cafe, there were 3 flavors summarized: I am 
reposting the table here (its latest version).

         | Sebastien's | Marcin's               | Hugs
  -------+-------------+------------------------+------------------------
   alnum | L* N*       | L* N*                  | L*, M*, N* [1]
   alpha | L*          | L*                     | L* [1]
   cntrl | Cc          | Cc Zl Zp               | Cc
   digit | N*          | Nd                     | '0'..'9'
   lower | Ll          | Ll                     | Ll [1]
   punct | P*          | P*                     | P*
   upper | Lu          | Lt Lu                  | Lu Lt [1]
   blank | Z* \t\n\r   | Z* (except U+00A0,     | ' ' \t\n\r\f\v U+00A0
         |             | U+2007, U+202F)        |
         |             | \t\n\v\f\r U+0085      |

[1]: for characters outside Latin1 range. For Latin1 characters
(0 to 255), there is a lookup table defined as
unsigned char   charTable[NUM_LAT1_CHARS];
I did not post the contents of the table Hugs uses for the Latin1 part. 
However, with that table completely removed, Hugs did not work properly. 
So its contents somehow differs from what Unicode defines for that 
character range. If needed, I may decode that table and post its mapping 
of character categories (keeping in mind that those are 
Haskell-recognized character categories, not Unicode)

I am not asking for discussion on this list again. Rather, I expect a
suggestion from the GHC team leads as to which flavor (of those shown
above, or some combination of the above) to implement.

One more question that I had when experimenting with Hugs: if a 
character (like those extra blank chars) is forced into some category 
for the purposes of Haskell language compilation (per the Report), does 
this mean that any other Haskell application should recognize 
Haskell-defined category of that character rather than Unicode-defined?

For Hugs, there were no choice but say Yes, because both compiler and 
interpreter used the same code to decide on character category. In GHC 
this may be different.

Since Hugs got there first, does it make sense to just follow what was done 
here, or will a different decision be adopted for GHC: say, for the 
Parser, extra characters are forced to be blank, but for the rest of the 
programs compiled by GHC, Unicode definitions are adhered to?

PS The latest rebuild I did used ghc with the new code compiled in as the 
Stage 1 compiler.

Dimitry Golubovsky
Middletown, CT


RE: Unicode in GHC: need some advice on building

2005-01-11 Thread Simon Marlow
On 11 January 2005 02:29, Dimitry Golubovsky wrote:

 Bad thing is, LD_PRELOAD does not work on all systems. So I tried to
 put the code directly into the runtime (where I believe it should be;
 the Unicode properties table is packed, and won't eat much space). I
 renamed foreign function names in GHC.Unicode (to avoid conflict with
 libc functions) adding u_ to them (so now they are u_iswupper, etc).
 I placed the new file into ghc/rts, and the include file into
 ghc/includes. I could not avoid messages about missing prototypes for
 u_... functions , but finally I was able to build ghc. Now when I
 compiled my test program with the rebuilt ghc, it worked without the
 LD_PRELOADed library. However, GHCi could not start complaining that
 it could not see these u_... symbols. I noticed some other entry
 points into the runtime like revertCAFs, or getAllocations, declared
 in the Haskell part of GHCi just as other foreign calls, so I just
 followed the same style - partly unsuccessfully.
 
 Where am I wrong?

You're doing fine - but a better place for the tables is as part of the
base package, rather than the RTS.  We already have some C files in the
base package: see libraries/base/cbits, for example.  I suggest just
putting your code in there.

Cheers,
Simon


Re: UniCode

2001-10-08 Thread Ketil Malde

Dylan Thurston [EMAIL PROTECTED] writes:

 Right.  In Unicode, the concept of a character is not really so
 useful;

After reading a bit about it, I'm certainly confused.
Unicode/ISO-10646 contains a lot of things that aren't really one
character, e.g. ligatures.

 most functions that traditionally operate on characters (e.g.,
 uppercase or display-width) fundamentally need to operate on strings.
 (This is due to properties of particular languages, not any design
 flaw of Unicode.)

I think an argument could be put forward that Unicode is trying to be
more than just a character set.  At least at first glance, it seems to
try to be both a character set and a glyph map, and incorporate things
like transliteration between character sets (or subsets, now that
Unicode contains them all), directionality of script, and so on.

   toUpper, toLower - Not OK.  There are cases where upper casing a
  character yields two characters.

I thought title case was supposed to handle this.  I'm probably
confused, though.

 etc.  Any program using this library is bound to get confused on
 Unicode strings.  Even before Unicode, there is much functionality
 missing; for instance, I don't see any way to compare strings using
 a localized order.

And you can't really use list functions like length on strings,
since one item can be two characters (Lj, ij, fi) and several items
can compose one character (combining characters).

And map (==) can't compare two Strings, e.g. in the presence
of combining characters.  How are other systems handling this?  
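
For instance, the two spellings below denote the same text but are different
[Char] values, so (==) on Strings reports them as unequal:

  precomposed, decomposed :: String
  precomposed = "\x00E9"    -- U+00E9 LATIN SMALL LETTER E WITH ACUTE
  decomposed  = "e\x0301"   -- 'e' followed by U+0301 COMBINING ACUTE ACCENT

  same :: Bool
  same = precomposed == decomposed   -- False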

It may be that Unicode isn't flawed, but it's certainly extremely
complex.  I guess I'll have to delve a bit deeper into it.

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants




Re: Unicode

2001-10-08 Thread Kent Karlsson


- Original Message -
From: Ketil Malde [EMAIL PROTECTED]
To: Dylan Thurston [EMAIL PROTECTED]
Cc: Andrew J Bromage [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
[EMAIL PROTECTED]
Sent: Monday, October 08, 2001 9:02 AM
Subject: Re: UniCode

(The spelling is 'Unicode' (and none other).)

 Dylan Thurston [EMAIL PROTECTED] writes:

  Right.  In Unicode, the concept of a character is not really so
  useful;

 After reading a bit about it, I'm certainly confused.
  Unicode/ISO-10646 contains a lot of things that aren't really one
 character, e.g. ligatures.

The ligatures that are included are there for compatiblity with older
character encodings.  Normally, for modern technology..., ligatures
are (to be) formed automatically through the font.  OpenType (OT,
MS and Adobe) and AAT (Apple) have support for this. There are
often requests to add more ligatures to 10646/Unicode, but they are
rejected since 10646/Unicode encode characters, not glyphs. (With
two well-known exceptions: for compatibility, and certain dingbats.)

  most functions that traditionally operate on characters (e.g.,
  uppercase or display-width) fundamentally need to operate on strings.
  (This is due to properties of particular languages, not any design
  flaw of Unicode.)

 I think an argument could be put forward that Unicode is trying to be
 more than just a character set.  At least at first glance, it seems to

Yes, but:

 try to be both a character set and a glyph map, and incorporate things

not that. See above.

 like transliteration between character sets (or subsets, now that
 Unicode contains them all), directionality of script, and so on.

Unicode (but not 10646) does handle bidirectionality
(see UAX 9: http://www.unicode.org/unicode/reports/tr9/), but not transliteration.
(Transliteration is handled in IBM's ICU, though: 
http://www-124.ibm.com/developerworks/oss/icu4j/index.html)


toUpper, toLower - Not OK.  There are cases where upper casing a
   character yields two characters.

 I thought title case was supposed to handle this.  I'm probably
 confused, though.

The titlecase characters in Unicode are (essentially) only there
for compatibility reasons (originally for transliterating between
certain subsets of Cyrillic and Latin scripts in a 1-1 way).  You're
not supposed to really use them...

The cases where toUpper of a single character gives two characters
are for some (classical) Greek, where a built-in subscript iota turns into
a capital iota, and other cases where there is no corresponding
uppercase letter.

It is also the case that case mapping is context sensitive.  E.g.
mapping capital sigma to small sigma (mostly) or ς (small final sigma)
(at end of word), or the capital i to ı (small dotless i), if Turkish, or insert/
delete combining dot above for i and j in Lithuanian. See UTR 21
and http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt.
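
To make the one-to-many point concrete in Haskell terms (an illustration
only: it special-cases a single character and handles none of the
context-sensitive rules above):

  import Data.Char (toUpper)

  upperString :: String -> String
  upperString = concatMap upperChar
    where
      upperChar 'ß' = "SS"          -- U+00DF has no single-character uppercase
      upperChar c   = [toUpper c]   -- everything else: the Char -> Char mapping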


  etc.  Any program using this library is bound to get confused on
  Unicode strings.  Even before Unicode, there is much functionality
  missing; for instance, I don't see any way to compare strings using
  a localized order.

 And you can't really use list functions like length on strings,
 since one item can be two characters (Lj, ij, fi) and several items
 can compose one character (combining characters).

Depends on what you mean by "length" and "character"...
You seem to be after what is sometimes referred to as grapheme,
and counting those.  There is a proposal for a definition of
language independent grapheme (with lexical syntax), but I don't
think it is stable yet.

 And map (==) can't compare two Strings, e.g. in the presence
 of combining characters.  How are other systems handling this?

I guess it is not very systematic.  Java and XML make the comparisons
directly by equality of the 'raw' characters *when* comparing identifiers/similar,
though for XML there is a proposal for early normalisation essentially to
NFC (normal form C).  I would have preferred comparing the normal forms
of the identifiers instead.  For searches, the recommendation (though I doubt
it is followed in practice yet) is to use a collation key based comparison. (Note that collation
keys are usually language dependent. More about collation in UTS 10,
http://www.unicode.org/unicode/reports/tr10/, and ISO/IEC 14651.)

What does NOT make sense is to expose (to a user) the raw ordering (<)
of Unicode strings, though it may be useful internally.  Orders exposed to
people (or other systems, for that matter) that aren't concerned with the
inner workings of a program should always be collation based.  (But that
holds for any character encoding, it's just more apparent for Unicode.)

 It may be that Unicode isn't flawed, but it's certainly extremely
 complex.  I guess I'll have to delve a bit deeper into it.

It's complex, but that is because the scripts of the world are complex (and add
to that politics, as well as compatibility and implementation issues).

Kind regards
/kent k

Re: UniCode

2001-10-06 Thread Andrew J Bromage

G'day all.

On Fri, Oct 05, 2001 at 06:17:26PM +, Marcin 'Qrczak' Kowalczyk wrote:

 This information is out of date. AFAIR several tens of thousands of them
 are assigned. Most for Chinese (current, not historic).

I wasn't aware of this.  Last time I looked was Unicode 3.0.  Thanks
for the update.

 In Haskell String = [Char].

I'll concede that String and [Char] are identical as far as the
programmer is concerned. :-)

There was some research 10+ years ago about alternative representations
for lists which were semantically identical but a little more efficient
in memory use.  Even if you don't go that far (it is fiddly), constant
strings, for example, could be representable as UTF-16/UTF-8/whatever
along with some machinery to generate the list on demand.  Char objects
could be implemented as flyweights.  Lots of possibilities.

Cheers,
Andrew Bromage




Re: UniCode

2001-10-05 Thread Ketil Malde

Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:

 Fri, 5 Oct 2001 02:29:51 -0700 (PDT), Krasimir Angelov [EMAIL PROTECTED] pisze:
 
 Why is Char 32 bit? Unicode characters are 16 bit.

 No, Unicode characters have 21 bits (range U+0000..U+10FFFF).

We've been through all this, of course, but here's a quote:

 Unicode originally implied that the encoding was UCS-2 and it
 initially didn't make any provisions for characters outside the BMP
 (U+0000 to U+FFFF). When it became clear that more than 64k
 characters would be needed for certain special applications
 (historic alphabets and ideographs, mathematical and musical
 typesetting, etc.), Unicode was turned into a sort of 21-bit
 character set with possible code points in the range U-00000000 to
 U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were
 introduced into the BMP to allow 1024×1024 non-BMP characters to be
 represented as a sequence of two 16-bit surrogate characters. This
 way UTF-16 was born, which represents the extended 21-bit Unicode
 in a way backwards compatible with UCS-2. The term UTF-32 was
 introduced in Unicode to mean a 4-byte encoding of the extended
 21-bit Unicode. UTF-32 is the exact same thing as UCS-4, except
 that by definition UTF-32 is never used to represent characters
 above U-0010FFFF, while UCS-4 can cover all 2^31 code positions up to
 U-7FFFFFFF.

from a/the Unicode FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html

Does Haskell's support of Unicode mean UTF-32, or full UCS-4?
Recent messages seem to indicate the former, but I don't see any
reason against the latter.

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants




Re: UniCode

2001-10-05 Thread Andrew J Bromage

G'day all.

On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:

 Why is Char 32 bit? Unicode characters are 16 bit.

It's not quite as simple as that.  There is a set of one million
(more correctly, 1M) Unicode characters which are only accessible
using surrogate pairs (i.e. two UTF-16 codes).  There are currently 
none of these codes assigned, and when they are, they'll be extremely
rare.  So rare, in fact, that the cost of strings taking up twice the
space that they currently do simply isn't worth it.

However, you still need to be able to handle them.  I don't know what
the official Haskell reasoning is (it may have more to do with word
size than Unicode semantics), but it makes sense to me to store single
characters in UTF-32 but strings in a more compressed format (UTF-8 or
UTF-16).

See also: http://www.unicode.org/unicode/faq/utf_bom.html

It just goes to show that strings are not merely arrays of characters
like some languages would have you believe.

Cheers,
Andrew Bromage




Re: UniCode

2001-10-05 Thread Marcin 'Qrczak' Kowalczyk

Fri, 5 Oct 2001 23:23:50 +1000, Andrew J Bromage [EMAIL PROTECTED] pisze:

 There is a set of one million (more correctly, 1M) Unicode characters
 which are only accessible using surrogate pairs (i.e. two UTF-16
 codes).  There are currently none of these codes assigned,

This information is out of date. AFAIR several tens of thousands of them
are assigned. Most for Chinese (current, not historic).

 So rare, in fact, that the cost of strings taking up twice the
 space that they currently do simply isn't worth it.

In Haskell strings already have high overhead. In GHC a Char# value
(inside Char object) always takes the same size as the pointer
(32 or 64 bits), no matter how much of it is used.

 It just goes to show that strings are not merely arrays of characters
 like some languages would have you believe.

In Haskell String = [Char]. It's true that Char values don't
necessarily correspond to glyphs, but Strings are composed of Chars.

-- 
 __(  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: UniCode

2001-10-05 Thread Marcin 'Qrczak' Kowalczyk

05 Oct 2001 14:35:17 +0200, Ketil Malde [EMAIL PROTECTED] pisze:

 Does Haskell's support of Unicode mean UTF-32, or full UCS-4?

It's not decided officially. GHC uses UTF-32. It's expected that
UCS-4 will vanish and ISO-10646 will be reduced to the same range
U+0000..U+10FFFF as Unicode.

-- 
 __(  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: Unicode

2000-05-17 Thread Frank Atanassow

Manuel M. T. Chakravarty writes:
  The problem with restricting yourself to the Jouyou-Kanji is
  that you have a hard time with names (of persons and
  places).  Many exotic and otherwise unused Kanji are used in
  names (for historical reasons) and as the Kanji
  representation of a name is the official identifier, it is
  rather bad form to write a person's name in Kana (the
  phonetic alphabets).

You're absolutely right. This fact slipped my mind.

Still, probably 85% (just a guess) of Japanese names can be written with
Jyouyou kanji, and the CJK set in Unicode is a strict superset of the Jyouyou,
so there are actually more kanji available, and the problem is not quite so
severe. However, for Chinese names I can imagine it being quite restrictive.

-- 
Frank Atanassow, Dept. of Computer Science, Utrecht University
Padualaan 14, PO Box 80.089, 3508 TB Utrecht, Netherlands
Tel +31 (030) 253-1012, Fax +31 (030) 251-3791





Re: Unicode

2000-05-16 Thread George Russell

Marcin 'Qrczak' Kowalczyk wrote:
 As for the language standard: I hope that Char will be allowed or
 required to have >= 30 bits instead of current 16; but never more than
 Int, to be able to use ord and chr safely.
Er does it have to?  The Java Virtual Machine implements Unicode with
16 bits.  (OK, so I suppose that means it can't cope with Korean or Chinese.)
So requiring Char to be >= 30 bits would stop anyone implementing a
conformant Haskell on the JVM.  (I feel strongly about this having been
involved with MLj, which compiles ML to JVM; Standard ML requires 8-bit
chars, a requirement we decided to ignore.)




RE: Unicode

2000-05-16 Thread Simon Marlow

  OTOH, it wouldn't be hard to change GHC's Char datatype to be a
  full 32-bit integral data type.
 
 Could we do it please?
 
 It will not break anything if done slowly. I imagine that
 {read,write}CharOffAddr and _ccall_ will still use only 8 bits of
 Char. But after Char is wide, libraries dealing with text conversion
 will be possible to be designed, to prepare for future international
 I/O, together with Foreign libraries.

I agree it should be done.  But not for 4.07; we can start breaking the tree
as soon as I've forked the 4.07 branch though (hopefully today...).

We have some other small wibbles to deal with; currently a Char never
resides in the heap: because there are only 256 possible Chars, we declare
them all statically in the RTS.  Now we have to check whether the Char falls
in the allowed range before using this table (that's fairly easy, we already
do this for Int).

Cheers,
Simon




Re: Unicode

2000-05-16 Thread Frank Atanassow

George Russell writes:
  Marcin 'Qrczak' Kowalczyk wrote:
   As for the language standard: I hope that Char will be allowed or
   required to have >= 30 bits instead of current 16; but never more than
   Int, to be able to use ord and chr safely.
  Er does it have to?  The Java Virtual Machine implements Unicode with
  16 bits.  (OK, so I suppose that means it can't cope with Korean or Chinese.)

Just to set the record straight:

Many CJK (Chinese-Japanese-Korean) characters are encodable in 16 bits. I am
not so familiar with the Chinese or Korean situations, but in Japan there is a
nationally standardized subset of about 2000 characters called the Jyouyou
("often-used") kanji, which newspapers and most printed books are mostly
supposed to respect. These are all strictly contained in the 16-bit space. One
only needs the additional 16 bits for foreign characters (say, Chinese), older
literary works and such-like. Even then, since Japanese has two phonetic
alphabets as well, you can usually substitute phonetic characters in the
place of non-Jyouyou kanji---in fact, since these kanji are considered
difficult, one often _does_ supplement the ideographic representation with a
phonetic one. Of course, using only phonetic characters in such cases would
look unprofessional in some contexts, and it forces the reader to guess at
which word was meant...

For Korean and especially Chinese, the situation is not so pat. Korean's
phonetic alphabet is of course wholly contained within the 16-bit space, but
the Chinese, as a rule, don't use phonetic characters. Koreans rely on their
phonetic alphabet more than the Japanese, but they still tend to use (I
believe) more esoteric Chinese ideographic characters than the Japanese
do. And the Chinese have a much larger set of ideographic characters in common
use than either of the other two. I'm not sure what percentage is contained in
the 16-bit space; it's probably enough that you can communicate most
non-specialized subjects fairly comfortably, but it is safe to say that the
Chinese would be the first to demand more encoding space.

In summary, 16 bits is enough to encode most modern texts if you don't mind
fudging a bit, but for high-quality productions, historical and/or specialized
texts, CJK users will want 32 bits.

Of course, you can always come up with specialized schemes involving stateful
encodings and/or "block-swapping" (using the Unicode private-use areas, for
example), but then, that subverts the purpose of Unicode.

-- 
Frank Atanassow, Dept. of Computer Science, Utrecht University
Padualaan 14, PO Box 80.089, 3508 TB Utrecht, Netherlands
Tel +31 (030) 253-1012, Fax +31 (030) 251-3791





Re: Unicode

2000-05-16 Thread Marcin 'Qrczak' Kowalczyk

Tue, 16 May 2000 10:44:28 +0200, George Russell [EMAIL PROTECTED] pisze:

  As for the language standard: I hope that Char will be allowed or
  required to have >= 30 bits instead of current 16; but never more than
  Int, to be able to use ord and chr safely.
 
 Er does it have to?  The Java Virtual Machine implements Unicode with
 16 bits.  (OK, so I suppose that means it can't cope with Korean or Chinese.)
 So requiring Char to be >= 30 bits would stop anyone implementing a
 conformant Haskell on the JVM.

OK, "allowed", not "required"; currently it is not even allowed.
The minimum should probably be 16, maximum - the size of Int.

Oops, ord will have to be allowed to return negative numbers when
the size of Char is equal to the size of Int. Another solution is to
make Char at least one bit less than Int, or also at the same time
no larger than 31 bits. ISO-10646 describes the space of 31 bits,
UTF-8 is able to encode up to 31 bits, so then a UTF-8 encoder would
be portable without worrying about Char values that don't fit, and
a decoder could easily check if a character is representable in Char:
ord maxBound would be guaranteed to be positive.
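
A sketch of such an encoder, working from plain Int code points so that the
width of Char never enters into it (the old-style UTF-8 of up to 6 bytes,
covering the full 31-bit space; illustration only):

  import Data.Bits (shiftR, (.&.), (.|.))
  import Data.Word (Word8)

  encodeUtf8 :: Int -> [Word8]
  encodeUtf8 n
    | n < 0x80      = [fromIntegral n]
    | n < 0x800     = marked 0xC0 1
    | n < 0x10000   = marked 0xE0 2
    | n < 0x200000  = marked 0xF0 3
    | n < 0x4000000 = marked 0xF8 4
    | otherwise     = marked 0xFC 5
    where
      -- Lead byte with the given marker bits, followed by k continuation bytes.
      marked mark k =
        fromIntegral (mark .|. (n `shiftR` (6 * k)))
          : [ fromIntegral (0x80 .|. ((n `shiftR` (6 * i)) .&. 0x3F))
            | i <- [k - 1, k - 2 .. 0] ]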

Choices I see:
- 30 <= Int, 16 <= Char <= 31, Char <  Int
- 30 <= Int, 16 <= Char,       Char <  Int
- 30 <= Int, 16 <= Char,       Char <= Int

-- 
 __("Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/  GCS/M d- s+:-- a23 C+++$ UL++$ P+++ L++$ E-
  ^^  W++ N+++ o? K? w(---) O? M- V? PS-- PE++ Y? PGP+ t
QRCZAK  5? X- R tv-- b+++ DI D- G+ e h! r--%++ y-





Re: Unicode

2000-05-16 Thread Marcin 'Qrczak' Kowalczyk

Tue, 16 May 2000 12:26:12 +0200 (MET DST), Frank Atanassow [EMAIL PROTECTED] pisze:

 Of course, you can always come up with specialized schemes involving stateful
 encodings and/or "block-swapping" (using the Unicode private-use areas, for
 example), but then, that subverts the purpose of Unicode.

There is already a standard UTF-16 encoding that fits 2^20 characters
into 16-bit space, by encoding characters >= 2^16 as pairs of "characters"
from the range D800..DFFF, which are otherwise unused in Unicode.
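
The arithmetic for that is simple enough (a sketch over plain Int code
points, so the width of Char does not matter):

  import Data.Bits (shiftR, (.&.))
  import Data.Word (Word16)

  encodeUtf16 :: Int -> [Word16]
  encodeUtf16 n
    | n < 0x10000 = [fromIntegral n]
    | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))   -- high surrogate
                    , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]   -- low surrogate
    where
      m = n - 0x10000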

Programmers should not be expected to care about this; most will not
anyway. Libraries will handle this format in external UTF-16-encoded
strings.

UTF-8 is usually a better choice for external encoding; UTF-16 should
be rarely used.

-- 
 __("Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/  GCS/M d- s+:-- a23 C+++$ UL++$ P+++ L++$ E-
  ^^  W++ N+++ o? K? w(---) O? M- V? PS-- PE++ Y? PGP+ t
QRCZAK  5? X- R tv-- b+++ DI D- G+ e h! r--%++ y-





Re: Unicode

2000-05-16 Thread Manuel M. T. Chakravarty

Frank Atanassow [EMAIL PROTECTED] wrote,

 George Russell writes:
   Marcin 'Qrczak' Kowalczyk wrote:
As for the language standard: I hope that Char will be allowed or
required to have >= 30 bits instead of current 16; but never more than
Int, to be able to use ord and chr safely.
   Er does it have to?  The Java Virtual Machine implements Unicode with
   16 bits.  (OK, so I suppose that means it can't cope
   with Korean or Chinese.) 
 
 Just to set the record straight:
 
 Many CJK (Chinese-Japanese-Korean) characters are
 encodable in 16 bits. I am not so familiar with the
 Chinese or Korean situations, but in Japan there is a
 nationally standardized subset of about 2000 characters
 called the Jyouyou ("often-used") kanji, which newspapers
 and most printed books are mostly supposed to
 respect. These are all strictly contained in the 16-bit
 space. One only needs the additional 16-bits for foreign
 characters (say, Chinese), older literary works and
 such-like. Even then, since Japanese has two phoenetic
 alphabets as well, and you can usually substitute
 phoenetic characters in the place of non-Jyouyou
 kanji---in fact, since these kanji are considered
 difficult, one often _does_ supplement the ideographic
 representation with a phoenetic one. Of course, using only
 phoenetic characters in such cases would look
 unprofessional in some contexts, and it forces the reader
 to guess at which word was meant...

The problem with restricting yourself to the Jouyou-Kanji is
that you have a hard time with names (of persons and
places).  Many exotic and otherwise unused Kanji are used in
names (for historical reasons) and as the Kanji
representation of a name is the official identifier, it is
rather bad form to write a person's name in Kana (the
phonetic alphabets).

Cheers,
Manuel