Kent Karlsson <[EMAIL PROTECTED]> writes:

> Carl R. Witty wrote:
> 
> > 1) I assume that layout processing occurs after Unicode preprocessing;
> > otherwise, you can't even find the lexemes.  If so, are all Unicode
> > characters assumed to be the same width?

I guess I wasn't as clear as I should have been.

> What would "Unicode preprocessing" be?  UTF-16 decoding?
> Java-ish escape sequence decoding?

By "Unicode preprocessing", I was referring to the following paragraph
in the Haskell 1.4 Report.

| Haskell uses a pre-processor to convert non-Unicode character sets
| into Unicode. This pre-processor converts all characters to Unicode
| and uses the escape sequence \uhhhh, where the "h" are hex digits, to
| denote escaped unicode characters. Since this translation occurs
| before the program is compiled, escaped Unicode characters may appear
| in identifiers and any other place in the program.

I'm quite aware that anything which actually hopes to render Unicode
characters will not use a fixed-width font (although I hadn't realized
the situation was as complicated as you describe).  I was concerned
about the following.

| Definitions: The indentation of a lexeme is the column number
| indicating the start of that lexeme; the indentation of a line is the
| indentation of its leftmost lexeme. To determine the column number,
| assume a fixed-width font with this tab convention: tab stops are 8
| characters apart, and a tab character causes the insertion of enough
| spaces to align the current position with the next tab stop.
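The tab convention in that paragraph can be sketched as a small
column-counting function.  This is only an illustration, assuming
columns are numbered from 1 so that tab stops fall at columns 1, 9,
17, and so on; the name indentColumn is my own, not anything from the
Report.

```haskell
-- Sketch of the Report's tab convention (hypothetical helper):
-- columns numbered from 1, tab stops every 8 columns.
indentColumn :: String -> Int
indentColumn = go 1
  where
    go col ('\t' : rest) = go (col + (8 - (col - 1) `mod` 8)) rest
    go col (' '  : rest) = go (col + 1) rest
    go col _             = col
```

For example, a line beginning with three spaces and a tab has its
first lexeme at column 9, the same as a line beginning with a single
tab.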

The above paragraph defines the "indentation" of a lexeme; the
indentation so defined is part of the defined syntax of Haskell.  In
order to properly parse Haskell programs, the compiler must assign 
column numbers to lexemes.  For Haskell programs to be portable (which
is, after all, the whole reason to have a standard), all compilers
must assign the same column numbers.

The question is, how should compilers assign column numbers in the
presence of \uhhhh escapes?

In particular, how should the following program be parsed?

(This is a definition of the "APPROACHES THE LIMIT" operator symbol.)
  x \u2250 y = do foo
             bar

If we treat the \u2250 escape as occupying exactly one column (the
same width as any other character) for the purpose of assigning
column numbers in the parser, then the above parses as

  x \u2250 y = do {foo; bar}

If we assign some other width to the \u2250 for the purpose of
assigning column numbers in the parser, then the expression will be
parsed differently.  (If \u2250 is less than a character wide, the
parse will be

  x \u2250 y = do {foo bar}

If \u2250 is more than a character wide, the expression would be a
syntax error.)

Given that the Report must specify a method for assigning column
numbers in the presence of \uhhhh escapes if we are to have portable
programs, I can see four options.

1) The Report could say that all \uhhhh escapes are treated as
having the same width as "standard" characters.

2) The Report could say that all \uhhhh escapes are treated as having
the same width as 6 "standard" characters (so that people who edit the
source file in an ASCII editor using the \uhhhh escapes will see the
correct indentation).

3) The Report could attempt to specify a more accurate column width
for \uhhhh escapes, taking into account halfwidth characters,
fullwidth characters, combining characters, bidirectional scripts
(ouch!), surrogate pairs of UTF-16 code units, etc.; realizing that even
so, people editing files with different Unicode-capable editors or
fonts will not see the indentation that the Report specifies.

4) The Report could give up and say that column numbers in the
presence of \uhhhh escapes are explicitly implementation-defined.
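Options 1 and 2 differ only in the width charged to an escape, so
either could be implemented by the same column-assignment routine.
The following is a sketch under my own assumptions (escapeColumn and
the width parameter w are hypothetical names; a real lexer would also
have to handle the tab convention):

```haskell
import Data.Char (isHexDigit)

-- Sketch: the column just past a scanned line prefix, charging each
-- \uhhhh escape a fixed width w.  w = 1 gives option 1; w = 6 gives
-- option 2.  (Hypothetical helper, not from the Report.)
escapeColumn :: Int -> String -> Int
escapeColumn w = go 1
  where
    go col ('\\' : 'u' : a : b : c : d : rest)
      | all isHexDigit [a, b, c, d] = go (col + w) rest
    go col (_ : rest)               = go (col + 1) rest
    go col []                       = col
```

Scanning the prefix "x \u2250 y", option 1 leaves us at column 6
while option 2 leaves us at column 11, which is exactly the
divergence that makes the do-block example above parse differently.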

Given the nightmarish complexity of option 3, and the fact that option
3 still doesn't allow people to look at the file with a Unicode editor
and see how it will parse, I think we should rule it out.

Option 2 is extremely ugly.  For instance, it means that the
Unicode preprocessor must somehow present to the lexer the difference
between a width-1 space character and a width-6 space character in the
following definition.

  ' ' `charEquals` '\u0020' = a where a = b
                                      b = True

Option 4 sounds pretty bad (effectively prohibiting layout in portable
programs using Unicode characters); however, there is a way to
mitigate the effects.  If someone wrote a Unicode-aware Haskell
structure editor, the editor could parse a braces-and-semicolons
Haskell file as it was read, and display it using indentation; the
editor would write the file back out in braces-and-semicolons style.

Option 1 is simple and easy to implement; however, it doesn't actually
let people view their source files in a Unicode editor and predict how
they will parse.  For that, I don't see any alternative to a
Unicode-aware Haskell structure editor (which could easily display the
program with indentation matching its internal parse tree).

Given the above choices, I would vote for option 1 or option 4 (with a
slight preference for option 1).

Carl Witty
[EMAIL PROTECTED]

