[EMAIL PROTECTED] wrote,

> >>>>> "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]> (MMTC) writes:
> 
> MMTC> [EMAIL PROTECTED] (Marcin 'Qrczak' Kowalczyk) wrote,
> >> Wed, 27 Sep 2000 00:22:05 +1100, Manuel M. T. Chakravarty <[EMAIL PROTECTED]> writes:
> >> 
> >> > Hmm, this seems like a shortcoming in the Haskell spec.  We have all
> >> > these isAlpha, isDigit, etc functions, but I can't get at a list of,
> >> > say, all characters for which isAlpha is true.
> >> 
> >> You can: filter isAlpha ['\0'..'\xFFFF']
> >> (don't use maxBound here because it's too large and we know that
> >> currently there are no isAlpha characters outside this range).
> >> 
> >> Working on large explicit lists is inefficient. 45443 characters
> >> are isAlpha. A lexer should be designed to avoid using a full list.
> 
> MMTC> You are right, just having a list of the characters is too
> MMTC> naive an approach.  But this reinforces my point: we need
> MMTC> an _efficient_ way of getting at the Unicode ranges for
> MMTC> certain character classes.  H98 seems to be lacking some
> MMTC> features for practical use of Unicode; the header to the
> MMTC> standard library `Char' actually admits that.
> 
> Doaitse Swierstra's [This is the correct spelling!] parser combinators in
> their newest incarnation have symbol ranges as their basis. Internally they
> are also used to allow binary search which is the primary reason for their
> speed. There are now also facilities for writing scanners using these
> combinators. With the ranges parsing Unicode shouldn't be less efficient
> than parsing ASCII.

Yes, Doaitse told me about the ranges, but that wasn't the
point here.  The question is how do you *know* which ranges,
e.g., the alphanumeric characters occupy in a given Unicode
encoding?  This is certainly different in Dutch and Japanese.
So, you can't hardcode it in your scanner spec; instead
you have to get it from the OS via some Haskell library.
The question is how to represent this information in this
Haskell library.
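
To make the representation question concrete, here is a minimal sketch
of one possibility: a character class stored as sorted, disjoint,
inclusive ranges, with a balanced tree on top for logarithmic lookup.
All names (CharRange, classRanges, RangeTree, fromRanges, member) are
hypothetical and not part of any existing library.  The sketch assumes
the class is derived from a predicate such as isAlpha over
'\0'..'\xFFFF'; a real library would presumably obtain the ranges from
the OS or locale data instead.

```haskell
import Data.Char (isAlpha)

-- An inclusive range of characters (hypothetical representation).
type CharRange = (Char, Char)

-- Collapse the characters satisfying a predicate into sorted, disjoint,
-- inclusive ranges.  Assumption: only '\0'..'\xFFFF' is scanned, as in
-- the quoted `filter isAlpha` example.
classRanges :: (Char -> Bool) -> [CharRange]
classRanges p = go (filter p ['\0' .. '\xFFFF'])
  where
    go []       = []
    go (c : cs) = extend c c cs
    -- Grow the current range while the next character is adjacent.
    extend lo hi (c : cs)
      | c == succ hi = extend lo c cs
    extend lo hi cs  = (lo, hi) : go cs

-- A balanced search tree over the ranges, built from the sorted list.
data RangeTree = Leaf | Node RangeTree CharRange RangeTree

fromRanges :: [CharRange] -> RangeTree
fromRanges [] = Leaf
fromRanges rs = Node (fromRanges ls) r (fromRanges rest)
  where
    (ls, r : rest) = splitAt (length rs `div` 2) rs

-- Membership test: a binary search over the ranges.
member :: Char -> RangeTree -> Bool
member _ Leaf = False
member c (Node l (lo, hi) r)
  | c < lo    = member c l
  | c > hi    = member c r
  | otherwise = True
```

With this, `member c (fromRanges (classRanges isAlpha))` agrees with
`isAlpha c` on the scanned interval, and a scanner generator could
consume the range list directly rather than a full character list.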

Manuel
