I think unicode is very important for xml document processing, which is my
interest.
> -----Original Message-----
> From: Manuel M. T. Chakravarty [mailto:[EMAIL PROTECTED]]
> Sent: Monday, September 25, 2000 9:11 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: combinator parsers and XSLT
>
>
> Doug Ransom <[EMAIL PROTECTED]> wrote,
>
> > > There is no need for "." or [^abc] as Haskell list operators
> > > can be used to "simulate" them. The following is from the C
> > > lexer and matches all visible characters and all characters
> > > except newline, respectively:
> > >
> > > visible = alt [' '..'\127']
> > > anyButNL = alt (['\0'..'\255'] \\ ['\n'])
> >
> >
> > That is true, but how about dealing with unicode characters?
> >
> > anyButNl = anyButNL = alt (['\0'..'\65536'] \\ ['\n'])
> >
> > The space required becomes excessive.
>
> True, but the current implementation would be hopeless for
> unicode anyway, as it builds a table representing a
> deterministic finite state automaton (DFA), where the worst
> case size of the table is
>
> <character range> * <number of states>
>
> In all practical cases, the required space is much smaller
> as states with less than 20 characters having a non-error
> transition store the state transitions in a list.
> Furthermore, even in states with more than 20 characters
> with a non-error transition, the size of the table is only
> that of
>
> ord <largest character> - ord <smallest character> + 1
>
> (these are characters with non-error transitions).
>
> For 16bit character ranges, it would be necessary to
> directly store negated character sets (such as [^abc]).
> From what he told me, Doitse Swierstra is working on a lexer
> that is using explicit ranges, but I am not sure whether he
> also has negated ranges.
>
> Currently, most Haskell systems don't support unicode anyway
> (I think, hbc is the only exception), so I guess this is not
> a pressing issue. As soon as, we have unicode support and
> there is a need for lexers handling unicode input, I am
> willing to extend the lexer library to gracefully handle the
> cases that you outlined.
>
> Cheers,
> Manuel
>