Re: Haskell 1.4 and Unicode

1997-11-10 Thread Hans Aberg

  I think that we are perhaps getting a little off-topic now, but Unicode
will clearly help forward computing, so perhaps it can continue a few more
postings. :-)

At 17:45 +0100 97/11/10, Kent Karlsson [EMAIL PROTECTED] wrote:
>   Let me reiterate:
>   Unicode is ***NOT*** a glyph encoding!
...
>and never will be.  The same character can be displayed as
>a variety of glyphs, depending not only on the font/style,
>but also, and this is the important point, on the characters
>surrounding a particular instance of the character.  Also,
>a sequence of characters can be displayed as a single glyph,
>and a character can be displayed as a sequence of glyphs.
>Which will be the case is often font-dependent.

  According to my Merriam Webster's dictionary a glyph is "a symbol that
conveys nonverbal information". So I am not sure what is wrong with
viewing Unicode as a collection of glyphs? :-)

>I would be interested in knowing why you think
>"the idea of it as a character encoding thoroughly
>breaks down in a mathematical context".  Deciding
>what gets encoded as a character is more an
>international social process than a mathematical
>process...

>PPS I don't know what you mean by "semantics of glyphs"

  A symbol (or character) is roughly a graphical entity used to convey
semantic information; this is different from an illustration, which is a
graphical entity which uses certain semantic information as input, but from
which that semantic information may not be (fully) extractable.

  For example, in Unicode, the Latin letters with diacritical marks are
classified as separate characters; this is reasonable, because European
languages have fixed sets of letter symbols. But in mathematics, this is
usually semantically wrong; a diacritical mark is usually an alteration of
the symbol to which it applies. Another example: Unicode has exponential digits
"^1",...,"^0", but using that is mathematically wrong because 10^1^2 (as
you have to write in Unicode) is not the same thing as 10^{12}. In
languages, changing the style of a glyph usually does not alter its
semantic information, but in math, it usually does. Sometimes the Unicode
mathematical glyphs are classified as graphical entities, sometimes as
mathematical quantities.

  So Unicode is not sufficiently coherent and comprehensive to suffice as a
math symbol encoding. (I do not claim the problem is easy to solve.)

  In fact, there are some other protocols underway that study this problem
and may interact with Unicode in the future: one is MathML, another
is the LaTeX3 math-encoding project. It is quite difficult to find good
classifications of math glyphs; I have great respect for those working on
that.

  Hans Aberg







Re: Haskell 1.4 and Unicode

1997-11-10 Thread Ron Wichers Schreur

Carl R. Witty wrote (to the Haskell mailing list):
 
> [..]
> The Report could give up and say that column numbers in the
> presence of \u escapes are explicitly implementation-defined.
> [..]
> [This] sounds pretty bad (effectively prohibiting layout in portable
> programs using Unicode characters);

I do agree that an implementation-defined rule is undesirable, but
it is possible to write portable programs with the layout rule. Just
make sure you only use tabs at the beginning of a line.

This has a few advantages. The tab size of the editor and the
implementation don't have to correspond and you can use proportional
fonts (or a Unicode-aware editor). The disadvantage is that the program
will get longer. I don't mind this; I like white space.


Cheers,

Ronny Wichers Schreur







Re: Haskell 1.4 and Unicode

1997-11-07 Thread John C. Peterson

I had option 1 in mind when that part of the report was written.  We
should clarify this in the next revision.

And thanks for your analysis of the problem!

   John








Re: Haskell 1.4 and Unicode

1997-11-07 Thread Hans Aberg

>Carl R. Witty wrote:
>
>> 1) I assume that layout processing occurs after Unicode preprocessing;
>> otherwise, you can't even find the lexemes.  If so, are all Unicode
>> characters assumed to be the same width?

  The easiest way of thinking of Unicode is perhaps as a font encoding; a
font using this encoding would add such things as typeface family, style,
size, kerning (but Unicode probably does not have ligatures), etc., which
then can be used by a graphical rendering system.

  The idea is that other protocols should stand for the selection of those
other factors (like a typesetting program or a hypertext protocol, or
something).

  Hans Aberg







Re: Haskell 1.4 and Unicode

1997-11-07 Thread Kent Karlsson

Carl R. Witty wrote:

> 1) I assume that layout processing occurs after Unicode preprocessing;
> otherwise, you can't even find the lexemes.  If so, are all Unicode
> characters assumed to be the same width?

Unicode characters ***cannot in any way*** be considered as being of
the same display width.  Many characters have intrinsic width properties,
like "halfwidth Katakana", "fullwidth ASCII", "ideographic space",
"thin space", "zero width space", and so on (most of which are
compatibility characters, i.e. present only for conversion reasons).
But more importantly there are combining characters which "modify"
a "base character". For instance Å (A with ring above) can be given
as an A followed by a combining ring above, i.e. two Unicode characters.
(For this and many others there is also a 'precomposed' character.) 
For many scripts vowels are combining characters.  And there may be an
indefinitely long (in principle, but three is a lot) sequence of
combining characters after each non-combining character.
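The base-plus-combining-mark point can be illustrated with a small sketch. It assumes a present-day Haskell implementation whose `Data.Char` provides `isMark` and whose `Char` is a full Unicode code point; the names `decomposed`, `precomposed` and `baseCount` are made up for this illustration.

```haskell
import Data.Char (isMark)

-- "A" followed by U+030A COMBINING RING ABOVE: two Chars.
decomposed :: String
decomposed = "A\x030A"

-- U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE: one Char.
precomposed :: String
precomposed = "\x00C5"

-- Count only the Chars that are not combining marks. This is a rough
-- stand-in for "what the reader sees"; it ignores every other width
-- issue mentioned above (fullwidth forms, zero width space, and so on).
baseCount :: String -> Int
baseCount = length . filter (not . isMark)
```

Here `length decomposed` is 2 and `length precomposed` is 1, while `baseCount` gives 1 for both, so counting characters and counting displayed positions already disagree for this simple case.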

What about bidirectional scripts?  Especially for the Arabic
script which is a cursive (joined) script, where in addition
vowels are combining characters.

Furthermore, Unicode characters in the "extended range" (no characters
allocated yet) are encoded using two *non-character* 16-bit codes
(when using UTF-16, which is the preferred encoding for Unicode).

What would "Unicode preprocessing" be?  UTF-16 decoding?
Java-ish escape sequence decoding?

...
> 3) What does it mean that Char can include any Unicode character?

I think it *does not* mean that a Char can hold any Unicode 
character.  I think it *does* mean that it can hold any single
(UTF-16) 16-bit value.  Which is something quite different.  To store
an arbitrary Unicode character 'straight off', one would need up
to at least 21 bits to cover the UTF-16 range.  ISO/IEC 10646-1 allows
for up to 31 bits, but nobody(?) is planning to need all that.
Some use 32-bit values to store Unicode characters.  Perfectly
allowed by 10646, though not by Unicode proper.  Following Unicode
proper one would always use sequences of UTF-16 codes, in order to
be able to treat a "user perceived character" as a single entity
both for UTF-16 reasons, and also for combining sequences reasons,
independently of how the "user perceived character" was given as
Unicode characters.
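The difference between "one 16-bit value" and "one Unicode character" can be sketched as follows. This is only an illustration: it assumes a present-day implementation in which `Char` ranges over all code points up to U+10FFFF (which is not the 16-bit `Char` discussed above), and `toUtf16` is a hypothetical helper name.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one code point as UTF-16 code units.
-- Code points up to U+FFFF fit in a single 16-bit unit; anything above
-- is split into a high and a low surrogate (two non-character codes).
-- Assumes the input is not itself a surrogate code point.
toUtf16 :: Char -> [Word16]
toUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ 0xD800 + fromIntegral (m `shiftR` 10)
                  , 0xDC00 + fromIntegral (m .&. 0x3FF) ]
  where
    n = ord c
    m = n - 0x10000
```

For example, `toUtf16 '\x2473'` is `[0x2473]`, whereas `toUtf16 '\x1D11E'` is `[0xD834, 0xDD1E]`. The largest value reachable this way is U+10FFFF, which is where the 21-bit figure comes from.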

/kent k

PS
Java gets some Unicode things wrong too.  Including that Java's
UTF-8 encoding is non-conforming (to both Unicode 2.0 and ISO/IEC
10646-1 Amd. 2).






Re: Haskell 1.4 and Unicode

1997-11-07 Thread Carl R. Witty

Kent Karlsson <[EMAIL PROTECTED]> writes:

> Carl R. Witty wrote:
> 
> > 1) I assume that layout processing occurs after Unicode preprocessing;
> > otherwise, you can't even find the lexemes.  If so, are all Unicode
> > characters assumed to be the same width?

I guess I wasn't as clear as I should have been.

> What would "Unicode preprocessing" be?  UTF-16 decoding?
> Java-ish escape sequence decoding?

By "Unicode preprocessing", I was referring to the following paragraph
in the Haskell 1.4 Report.

| Haskell uses a pre-processor to convert non-Unicode character sets
| into Unicode. This pre-processor converts all characters to Unicode
| and uses the escape sequence \uhhhh, where the "h" are hex digits, to
| denote escaped unicode characters. Since this translation occurs
| before the program is compiled, escaped Unicode characters may appear
| in identifiers and any other place in the program.

I'm quite aware that anything which actually hopes to render Unicode
characters will not use a fixed-width font (although I hadn't realized
the situation was as complicated as you describe.)  I was concerned
about the following.

| Definitions: The indentation of a lexeme is the column number
| indicating the start of that lexeme; the indentation of a line is the
| indentation of its leftmost lexeme. To determine the column number,
| assume a fixed-width font with this tab convention: tab stops are 8
| characters apart, and a tab character causes the insertion of enough
| spaces to align the current position with the next tab stop.

The above paragraph defines the "indentation" of a lexeme; the
indentation so defined is part of the defined syntax of Haskell.  In
order to properly parse Haskell programs, the compiler must assign 
column numbers to lexemes.  For Haskell programs to be portable (which
is, after all, the whole reason to have a standard) all compilers must
assign the same column number.
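The tab convention in the quoted definition can be sketched in a few lines. This is a minimal illustration, not the Report's algorithm: it assumes columns are numbered from 1, treats every non-tab character as one column wide, and does not skip comments; `advance` and `indentation` are hypothetical names.

```haskell
-- Column (1-based) reached after reading one character at column `col`:
-- a tab moves to the next tab stop (stops are 8 characters apart),
-- every other character occupies exactly one column.
advance :: Int -> Char -> Int
advance col '\t' = ((col - 1) `div` 8 + 1) * 8 + 1
advance col _    = col + 1

-- Column of the first non-blank character of a line, i.e. the
-- indentation of its leftmost lexeme (Nothing for a blank line).
indentation :: String -> Maybe Int
indentation = go 1
  where
    go _ [] = Nothing
    go col (c : cs)
      | c == ' ' || c == '\t' = go (advance col c) cs
      | otherwise             = Just col
```

With this numbering, `indentation "\tx"` and `indentation "  \tx"` both give `Just 9`, which is the sense in which all compilers have to agree.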

The question is, how should compilers assign column numbers in the
presence of \u escapes?

In particular, how should the following program be parsed?

(This is a definition of the "APPROACHES THE LIMIT" operator symbol.)
  x \u2250 y = do foo
             bar

If we treat the \u2250 as being the same width as all the other
characters for the purpose of assigning column numbers in the parser,
then the above parses as

  x \u2250 y = do {foo; bar}

If we assign some other width to the \u2250 for the purpose of
assigning column numbers in the parser, then the expression will be
parsed differently.  (If \u2250 is less than a character wide, the
parse will be

  x \u2250 y = do {foo bar}

If \u2250 is more than a character wide, the expression would be a
syntax error.)
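How much the assumed width matters can be checked mechanically. The sketch below is an illustration only: `columnOf` is a hypothetical helper that counts every \uhhhh escape as `w` columns and every other character as one column (1-based, no tab handling), and `example` is the first line of the program above written as a Haskell string.

```haskell
import Data.Char (isHexDigit)
import Data.List (isPrefixOf)

-- Column at which `lexeme` first starts in a line, when every \uhhhh
-- escape is counted as `w` columns and everything else as one column.
columnOf :: Int -> String -> String -> Maybe Int
columnOf w lexeme = go 1
  where
    go _ [] = Nothing
    go col s@(_ : cs)
      | lexeme `isPrefixOf` s = Just col
      | Just rest <- escape s = go (col + w) rest
      | otherwise             = go (col + 1) cs

    -- Recognise a backslash, 'u' and exactly four hex digits.
    escape ('\\' : 'u' : xs)
      | length ds == 4 && all isHexDigit ds = Just rest
      where
        (ds, rest) = splitAt 4 xs
    escape _ = Nothing

-- First line of the example program.
example :: String
example = "  x \\u2250 y = do foo"
```

`columnOf 1 "foo" example` gives `Just 14`, `columnOf 0 "foo" example` gives `Just 13`, and `columnOf 6 "foo" example` gives `Just 19`. The layout rule compares the indentation of the following line against this column, so the three widths can lead to three different parses.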

Given that the Report must specify a method for assigning column
numbers in the presence of \u escapes if we are to have portable
programs, I can see four options.

1) The Report could say that all \u escapes are treated as
having the same width as "standard" characters.

2) The Report could say that all \u escapes are treated as having
the same width as 6 "standard" characters (so that people who edit the
source file in an ASCII editor using the \u escapes will see the
correct indentation).

3) The Report could attempt to specify a more accurate column width
for \u escapes, taking into account halfwidth characters,
fullwidth characters, combining characters, bidirectional scripts
(ouch!), pairs of non-character UTF-16 codes, etc; realizing that even
so, people editing files with different Unicode-capable editors or
fonts will not see the indentation that the Report specifies.

4) The Report could give up and say that column numbers in the
presence of \u escapes are explicitly implementation-defined.

Given the nightmarish complexity of option 3, and the fact that option
3 still doesn't allow people to look at the file with a Unicode editor
and see how it will parse, I think we should rule it out.

Option 2 is extremely ugly.  For instance, it means that the
Unicode preprocessor must somehow present to the lexer the difference
between a width-1 space character and a width-6 space character in the
following definition.

  ' ' `charEquals` '\u0020' = a where a = b
                                      b = True

Option 4 sounds pretty bad (effectively prohibiting layout in portable
programs using Unicode characters); however, there is a way to
mitigate the effects.  If someone wrote a Unicode-aware Haskell
structure editor, the editor could parse a braces-and-semicolons
Haskell file as it was read, and display it using indentation; the
editor would write the file back out in braces-and-semicolons style.

Option 1 is simple and easy to implement; however, it doesn't actually
let people view their source files in a Unicode editor and predict how
they will parse.  For that, I don't see any alternative to a
Unicode-aware Haskell structure editor (which could easily display the
program with indentation matching its internal parse tree

Re: Haskell 1.4 and Unicode

1997-11-07 Thread Lennart Augustsson


Unicode was added at the last moment, so there are likely to
be some discrepancies.

> 1) I assume that layout processing occurs after Unicode preprocessing;
> otherwise, you can't even find the lexemes.  If so, are all Unicode
> characters assumed to be the same width?
I think that's what is intended.

> However, it would also seem quite reasonable to include class Lo
> (which includes things like "Hebrew letter Alef") in UNIsmall or
> UNIlarge; and to include some of the Punctuation classes in UNIsymbol.
It's hard to put Lo in a sensible place since Haskell relies on
the upper/lower distinction.  Therefore Lo is not included
in upper or lower.
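The difficulty with class Lo can be seen with the character-property functions of a present-day `Data.Char` (an assumption; these are not the Haskell 1.4 library). The type `LetterCase` and the function `letterCase` are hypothetical, written only to show that a caseless letter fits neither bucket.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAlpha, isLower, isUpper)

-- U+05D0 HEBREW LETTER ALEF, the example mentioned above.
alef :: Char
alef = '\x05D0'

-- Its Unicode general category (OtherLetter corresponds to class Lo).
alefCategory :: GeneralCategory
alefCategory = generalCategory alef

-- A three-way classification in the spirit of UNIsmall / UNIlarge:
-- letters without case land in the third bucket.
data LetterCase = Small | Large | Caseless
  deriving (Eq, Show)

letterCase :: Char -> Maybe LetterCase
letterCase c
  | isLower c = Just Small
  | isUpper c = Just Large
  | isAlpha c = Just Caseless
  | otherwise = Nothing
```

Here `alefCategory` is `OtherLetter` and `letterCase alef` is `Just Caseless`, so a lexer that decides between variable and constructor names by case has no obvious answer for such a letter.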

> 3) What does it mean that Char can include any Unicode character?
It means that within a Haskell program Char can hold a Unicode character.

> If I compile and run the following program on my vanilla American UNIX
> box:
> 
>   main = putChar '\x2473' {- print a "circled number twenty" -}
> 
> to get a program "ctwenty", and I run
> 
>   ./ctwenty | od -c
> 
> (od prints out each byte of output), what will I see?
> 
> Will the following program
> 
>   main = getChar >>= (print . fromEnum)
> 
> ever print out a number greater than 256?
The I/O library has not been converted to Unicode.  So I would
expect implementations to silently truncate Unicode characters
to 8 bits.
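What such a truncation would do to the example character can be sketched as below. This only models "keep the low 8 bits", which is one plausible reading of the expectation above and not a description of any particular implementation; `truncate8` is a hypothetical name.

```haskell
import Data.Char (chr, ord)

-- Keep only the low 8 bits of a character's code, as an 8-bit-only
-- I/O layer might do when handed a wider Char.
truncate8 :: Char -> Char
truncate8 c = chr (ord c `mod` 256)
```

Under this model `truncate8 '\x2473'` is `'s'` (0x73), so the circled number twenty would come out as a plain letter.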

To do sensible output (or input) of Unicode characters you need to
encode them somehow.  Hbc comes with encode/decode functions (in the
Char library) for three encodings: two bytes per Char, UTF-8, and the
Java encoding (\u).

-- Lennart

dogbert% cat ctwenty.hs
import Char
main = putStr (encodeUnicode "\x2473")
dogbert% hbc ctwenty.hs -o ctwenty
dogbert% ./ctwenty | od -c
0000000   $   s
0000002

dogbert% cat ctwenty.hs
import Char
main = putStr (encodeUTF8 "\x2473")
dogbert% hbc ctwenty.hs -o ctwenty
dogbert% ./ctwenty | od -c
0000000 342 221 263
0000003

dogbert% cat ctwenty.hs
import Char
main = putStr (encodeEscape "\x2473")
dogbert% hbc ctwenty.hs -o ctwenty
dogbert% ./ctwenty | od -c
0000000   \   u   2   4   7   3
0000006
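
The bytes in the three transcripts above are consistent with the following sketch of the three encodings. These are not the hbc library functions; `twoBytes`, `utf8` and `javaEscape` are hypothetical stand-ins written against a present-day `base`, and `twoBytes` assumes big-endian order and a code point no larger than U+FFFF.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Numeric (showHex)

-- Two bytes per character, high byte first (code points up to U+FFFF).
twoBytes :: Char -> [Word8]
twoBytes c = [fromIntegral (n `shiftR` 8), fromIntegral (n .&. 0xFF)]
  where
    n = ord c

-- UTF-8: one to four bytes depending on the size of the code point.
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Java-style escape: backslash, 'u' and at least four hex digits.
javaEscape :: Char -> String
javaEscape c = "\\u" ++ replicate (4 - length hex) '0' ++ hex
  where
    hex = showHex (ord c) ""
```

For `'\x2473'` these give the bytes 0x24 0x73 (shown by od as `$ s`), 0xE2 0x91 0xB3 (octal 342 221 263), and the six characters `\u2473`.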