Re: Alex unicode trick

2014-01-13 Thread Simon Marlow



On 07/01/2014 18:18, Mateusz Kowalczyk wrote:


Ah, I think I understand now. If this is the case, at least the
‘alexGetChar’ could be removed, right? Is Alex 2.x compatibility
necessary for any reason whatsoever?


Yes, the backwards compatibility could be removed now that we require a 
very recent version of Alex.


Cheers,
Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://www.haskell.org/mailman/listinfo/ghc-devs

Re: Alex unicode trick

2014-01-07 Thread Krasimir Angelov

Hi,

I was recently looking at this code to see how the lexer decides that a
character is a letter, space, etc. The problem is that with Unicode
there are hundreds of thousands of characters that are declared to be
alphanumeric. Even if they are compressed into a regular expression
with a list of ranges, there will still be ~390 ranges. The GHC lexer
avoids hard-coding these ranges by calling isSpace, isAlpha, etc. and
then converting the result to a code. Ideally, Alex would have
predefined macros corresponding to the Unicode categories, but for
now you have to either hard-code the ranges with huge regular
expressions or use the workaround used in GHC. Is there any other
solution?

Regards,
  Krasimir
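
To make the workaround concrete, here is a minimal Haskell sketch of the
"classify, then convert to a code" idea. It is not GHC's actual
implementation. The name squashChar and the standIn* constants are
hypothetical; only the use of \x05 for whitespace is confirmed by the
$unispace macro quoted in this thread, and the other code points are
illustrative assumptions.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isSpace)

-- Hypothetical stand-in code points. Only '\x05' for whitespace is
-- confirmed by the $unispace macro quoted in this thread; the others
-- are illustrative assumptions.
standInOther, standInUpper, standInLower, standInDigit, standInSpace :: Char
standInOther = '\x00'
standInUpper = '\x01'
standInLower = '\x02'
standInDigit = '\x03'
standInSpace = '\x05'

-- | Squash a character into the small alphabet that the lexer's
-- regular expressions are written against.
squashChar :: Char -> Char
squashChar c
  -- Raw low control characters must not be mistaken for stand-ins.
  | c <= '\x06' = standInOther
  -- The rest of ASCII passes through unchanged.
  | c <= '\x7f' = c
  -- Non-ASCII whitespace becomes the single code used by $unispace.
  | isSpace c   = standInSpace
  | otherwise   = case generalCategory c of
      UppercaseLetter -> standInUpper
      LowercaseLetter -> standInLower
      DecimalNumber   -> standInDigit
      _               -> standInOther
```

With something like this in place, a character-set macro only has to
mention one stand-in code point per class instead of hundreds of ranges.
A lexer for string literals would presumably bypass the squashing so that
the original characters are preserved.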


2014/1/7 Carter Schonwald carter.schonw...@gmail.com:
 You're probably right; this could be regarded as dead code for GHC 7.8
 (especially since Alex and Happy must be recent versions to even build GHC HEAD!)


 On Tue, Jan 7, 2014 at 2:25 AM, Mateusz Kowalczyk fuuze...@fuuzetsu.co.uk
 wrote:

 Greetings,

 When looking at the GHC lexer (Lexer.x), there's:

  $unispace    = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
  $whitechar   = [\ \n\r\f\v $unispace]
  $white_no_nl = $whitechar # \n
  $tab         = \t

 Scrolling down to alexGetChar and alexGetChar', we see the comments:


  -- backwards compatibility for Alex 2.x
  alexGetChar :: AlexInput -> Maybe (Char,AlexInput)

  -- This version does not squash unicode characters, it is used when
  -- lexing strings.
  alexGetChar' :: AlexInput -> Maybe (Char,AlexInput)

 What's the reason for these? I was under the impression that since
 3.0, Alex has natively supported Unicode. Is it just dead code? Could
 all the hex $uni* macros be removed? If not, why not?

 --
 Mateusz K.

Re: Alex unicode trick

2014-01-07 Thread Simon Marlow

Krasimir is right, it would be hard to use Alex's built-in Unicode 
support because we have to automatically generate the character classes 
from the Unicode spec somehow.  Probably Alex ought to include these as 
built-in macros, but right now it doesn't.


Even if we did have access to the right regular expressions, I'm 
slightly concerned that the generated state machine might be enormous.


Cheers,
Simon
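
As a rough way to see how many ranges such a character class would need,
the following Haskell sketch collapses every code point that satisfies a
predicate into maximal contiguous ranges. The names charRanges and
alphaNumRangeCount are hypothetical, and the resulting count depends on
the Unicode version shipped with the base library in use, so no specific
number is claimed here.

```haskell
import Data.Char (isAlphaNum)

-- | Collapse the characters satisfying a predicate into maximal
-- contiguous (low, high) ranges, scanning every code point in order.
charRanges :: (Char -> Bool) -> [(Char, Char)]
charRanges p = go [c | c <- [minBound .. maxBound], p c]
  where
    go []       = []
    go (c : cs) = extend c c cs

    -- Grow the current range while the next character is adjacent;
    -- otherwise emit the range and start over with the remainder.
    extend lo hi (c : cs)
      | fromEnum c == fromEnum hi + 1 = extend lo c cs
    extend lo hi cs = (lo, hi) : go cs

-- | Number of ranges a hand-written alphanumeric character set would need.
alphaNumRangeCount :: Int
alphaNumRangeCount = length (charRanges isAlphaNum)
```

Evaluating alphaNumRangeCount in GHCi gives a feel for the size of the
regular expression that would have to be generated, which is the cost
the stand-in code trick avoids.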


Alex unicode trick

2014-01-06 Thread Mateusz Kowalczyk

Greetings,

When looking at the GHC lexer (Lexer.x), there's:

 $unispace    = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
 $whitechar   = [\ \n\r\f\v $unispace]
 $white_no_nl = $whitechar # \n
 $tab         = \t

Scrolling down to alexGetChar and alexGetChar', we see the comments:


 -- backwards compatibility for Alex 2.x
 alexGetChar :: AlexInput -> Maybe (Char,AlexInput)

 -- This version does not squash unicode characters, it is used when
 -- lexing strings.
 alexGetChar' :: AlexInput -> Maybe (Char,AlexInput)

What's the reason for these? I was under the impression that since
3.0, Alex has natively supported Unicode. Is it just dead code? Could
all the hex $uni* macros be removed? If not, why not?

--
Mateusz K.