I think Unicode is very important for XML document processing, which is my
interest.


> -----Original Message-----
> From: Manuel M. T. Chakravarty [mailto:[EMAIL PROTECTED]]
> Sent: Monday, September 25, 2000 9:11 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: combinator parsers and XSLT 
> 
> 
> Doug Ransom <[EMAIL PROTECTED]> wrote,
> 
> > > There is no need for "." or [^abc] as Haskell list operators
> > > can be used to "simulate" them.  The following is from the C
> > > lexer and matches all visible characters and all characters
> > > except newline, respectively:
> > > 
> > >   visible  = alt [' '..'\127']
> > >   anyButNL = alt (['\0'..'\255'] \\ ['\n'])
> > 
> > 
> > That is true, but how about dealing with Unicode characters?
> > 
> > anyButNL = alt (['\0'..'\65535'] \\ ['\n'])
> > 
> > The space required becomes excessive.
> 
> True, but the current implementation would be hopeless for
> Unicode anyway, as it builds a table representing a
> deterministic finite state automaton (DFA), where the worst
> case size of the table is
> 
>   <character range> * <number of states>
> 
> In all practical cases, the required space is much smaller,
> as states with fewer than 20 characters having a non-error
> transition store the state transitions in a list.
> Furthermore, even in states with more than 20 characters
> with a non-error transition, the size of the table is only
> that of
> 
>   ord <largest character> - ord <smallest character> + 1
> 
> (these are characters with non-error transitions).
> 
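
To make the space estimate above concrete, here is a minimal sketch in
Haskell. It is not code from the lexer library: the names worstCaseEntries
and stateEntries are hypothetical, and the threshold of 20 is simply taken
from the description above. The description does not say what happens at
exactly 20 characters, so the sketch assumes the dense table is used there.

```haskell
import Data.Char (ord)
import Data.List ((\\))

-- Worst-case size of a fully dense table:
-- <character range> * <number of states>.
worstCaseEntries :: Int -> Int -> Int
worstCaseEntries charRange numStates = charRange * numStates

-- Entries needed by one state, given the characters that have a
-- non-error transition out of it (hypothetical helper).
stateEntries :: [Char] -> Int
stateEntries cs
  | null cs        = 0
  -- Sparse case: the transitions are kept in a list.
  | length cs < 20 = length cs
  -- Dense case (assumed to include exactly 20 characters):
  -- ord <largest character> - ord <smallest character> + 1.
  | otherwise      = ord (maximum cs) - ord (minimum cs) + 1

-- The character sets under discussion, written as plain lists.
anyButNL8, anyButNL16 :: [Char]
anyButNL8  = ['\0'..'\255']   \\ ['\n']
anyButNL16 = ['\0'..'\65535'] \\ ['\n']
```

In GHCi, stateEntries anyButNL8 gives 256 and stateEntries anyButNL16
gives 65536, so a single state accepting "anything but newline" already
needs a full-width table once the alphabet is 16 bits wide.
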
> For 16-bit character ranges, it would be necessary to
> directly store negated character sets (such as [^abc]).
> From what he told me, Doitse Swierstra is working on a lexer
> that uses explicit ranges, but I am not sure whether he
> also has negated ranges.
> 
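
As an illustration of storing ranges and negated sets directly, here is a
small sketch in Haskell. The CharSet type and the functions member and
oneOf are assumptions made up for this example. They are not the API of
the lexer library or of Doitse Swierstra's lexer, and the sketch covers
only set membership, not DFA construction.

```haskell
-- Hypothetical representation of a character set.
data CharSet
  = Ranges [(Char, Char)]  -- union of inclusive ranges
  | Not CharSet            -- complement of a set

-- Membership test; the cost depends on the number of ranges,
-- not on the size of the alphabet.
member :: Char -> CharSet -> Bool
member c (Ranges rs) = any (\(lo, hi) -> lo <= c && c <= hi) rs
member c (Not s)     = not (member c s)

-- A set listing individual characters, each as a one-element range.
oneOf :: [Char] -> CharSet
oneOf cs = Ranges [(c, c) | c <- cs]

-- Counterparts of the definitions from the C lexer.
visible :: CharSet
visible = Ranges [(' ', '\127')]

anyButNL :: CharSet
anyButNL = Not (oneOf "\n")

-- Counterpart of the regular expression [^abc].
notABC :: CharSet
notABC = Not (oneOf "abc")
```

With this representation, anyButNL stores a single one-character range
whatever the size of the alphabet, so member '\x4e2d' anyButNL holds
without enumerating 65535 characters.
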
> Currently, most Haskell systems don't support Unicode anyway
> (I think hbc is the only exception), so I guess this is not
> a pressing issue.  As soon as we have Unicode support and
> there is a need for lexers handling Unicode input, I am
> willing to extend the lexer library to gracefully handle the
> cases that you outlined.
> 
> Cheers,
> Manuel
> 
