Doug Ransom <[EMAIL PROTECTED]> wrote,
> > There is no need for "." or [^abc] as Haskell list operators
> > can be used to "simulate" them. The following is from the C
> > lexer and matches all visible characters and all characters
> > except newline, respectively:
> >
> > visible = alt [' '..'\127']
> > anyButNL = alt (['\0'..'\255'] \\ ['\n'])
>
>
> That is true, but how about dealing with unicode characters?
>
> anyButNl = anyButNL = alt (['\0'..'\65536'] \\ ['\n'])
>
> The space required becomes excessive.
True, but the current implementation would be hopeless for
unicode anyway, as it builds a table representing a
deterministic finite state automaton (DFA), where the worst
case size of the table is
<character range> * <number of states>
In all practical cases, the required space is much smaller
as states with less than 20 characters having a non-error
transition store the state transitions in a list.
Furthermore, even in states with more than 20 characters
with a non-error transition, the size of the table is only
that of
ord <largest character> - ord <smallest character> + 1
(these are characters with non-error transitions).
For 16bit character ranges, it would be necessary to
directly store negated character sets (such as [^abc]).
>From what he told me, Doitse Swierstra is working on a lexer
that is using explicit ranges, but I am not sure whether he
also has negated ranges.
Currently, most Haskell systems don't support unicode anyway
(I think, hbc is the only exception), so I guess this is not
a pressing issue. As soon as, we have unicode support and
there is a need for lexers handling unicode input, I am
willing to extend the lexer library to gracefully handle the
cases that you outlined.
Cheers,
Manuel