Doug Ransom <[EMAIL PROTECTED]> wrote,

> > There is no need for "." or [^abc] as Haskell list operators
> > can be used to "simulate" them.  The following is from the C
> > lexer and matches all visible characters and all characters
> > except newline, respectively:
> > 
> >   visible  = alt [' '..'\127']
> >   anyButNL = alt (['\0'..'\255'] \\ ['\n'])
> 
> 
> That is true, but how about dealing with unicode characters?
> 
> anyButNl = anyButNL = alt (['\0'..'\65536'] \\ ['\n'])
> 
> The space required becomes excessive.

True, but the current implementation would be hopeless for
unicode anyway, as it builds a table representing a
deterministic finite state automaton (DFA), where the worst
case size of the table is

  <character range> * <number of states>

In all practical cases, the required space is much smaller
as states with less than 20 characters having a non-error
transition store the state transitions in a list.
Furthermore, even in states with more than 20 characters
with a non-error transition, the size of the table is only
that of

  ord <largest character> - ord <smallest character> + 1

(these are characters with non-error transitions).

For 16bit character ranges, it would be necessary to
directly store negated character sets (such as [^abc]).
>From what he told me, Doitse Swierstra is working on a lexer
that is using explicit ranges, but I am not sure whether he
also has negated ranges.

Currently, most Haskell systems don't support unicode anyway
(I think, hbc is the only exception), so I guess this is not
a pressing issue.  As soon as, we have unicode support and
there is a need for lexers handling unicode input, I am
willing to extend the lexer library to gracefully handle the
cases that you outlined.

Cheers,
Manuel

Reply via email to