Re: combinator parsers and XSLT

2000-10-02 Thread Manuel M. T. Chakravarty

Tom Pledger <[EMAIL PROTECTED]> wrote,

> Manuel M. T. Chakravarty writes:
>  > Lars Henrik Mathiesen <[EMAIL PROTECTED]> wrote,
>  > 
>  > Ok, I should have been more precise.  The problem is to make
>  > it efficient.  Usually, this is achieved by having a table
>  > into which you index with the input character to compute
>  > what state to enter next.  If you have many predicates and
>  > potentially have to test a large number of them for each
>  > input character before being able to determine the next
>  > state, this might adversely influence the performance of the
>  > scanner.
>
> Would it help to use lazily populated tables, to cache the results of
> evaluating the corresponding predicates?  It could be done in an outer
> layer, so that it doesn't mar the purity of the predicate composition
> approach.  It may even be a happy medium, in cases where the input
> document only uses a tiny fraction of the character set.

Caching the results of predicate evaluation might be a good
idea.  I will keep that in mind when attempting an
implementation.

Thanks,
Manuel





Re: combinator parsers and XSLT

2000-10-01 Thread Tom Pledger

Manuel M. T. Chakravarty writes:
 > Lars Henrik Mathiesen <[EMAIL PROTECTED]> wrote,
 > 
 > > > From: "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]>
 > > > Date: Fri, 29 Sep 2000 10:17:56 +1100
 > > 
 > > > I agree that usually the predicates as proposed by you would
 > > > be better.  The problem is that a scanner that wants to use
 > > > the usual finite deterministic automaton techniques for
 > > > scanning needs to be able to compute the overlap between
 > > > different predicates.
 > > 
 > > If the predicates are functions, computing the overlap as another
 > > function is easy.
 > 
 > Ok, I should have been more precise.  The problem is to make
 > it efficient.  Usually, this is achieved by having a table
 > into which you index with the input character to compute
 > what state to enter next.  If you have many predicates and
 > potentially have to test a large number of them for each
 > input character before being able to determine the next
 > state, this might adversely influence the performance of the
 > scanner.
 > 
 > Manuel

Would it help to use lazily populated tables, to cache the results of
evaluating the corresponding predicates?  It could be done in an outer
layer, so that it doesn't mar the purity of the predicate composition
approach.  It may even be a happy medium, in cases where the input
document only uses a tiny fraction of the character set.
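
Just to make the idea concrete, here is a minimal sketch of such an
outer layer (cachedPred and the 0..0xFF cache range are arbitrary
choices for the illustration, not anything from an existing library):

  import Data.Array (listArray, (!))

  -- Wrap a character predicate in a lazily populated table.  The
  -- array elements are boxed thunks, so the underlying predicate is
  -- evaluated at most once per character code, and only for codes
  -- that the input actually contains.  Characters outside the cached
  -- range fall back to the raw predicate.
  cachedPred :: (Char -> Bool) -> (Char -> Bool)
  cachedPred p = \c -> let i = fromEnum c
                       in if i <= hi then table ! i else p c
    where
      hi    = 0xFF              -- arbitrary cache range for the sketch
      table = listArray (0, hi) [ p (toEnum i) | i <- [0 .. hi] ]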

Regards,
Tom




Re: combinator parsers and XSLT

2000-10-01 Thread Manuel M. T. Chakravarty

Lars Henrik Mathiesen <[EMAIL PROTECTED]> wrote,

> > From: "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]>
> > Date: Fri, 29 Sep 2000 10:17:56 +1100
> 
> > I agree that usually the predicates as proposed by you would
> > be better.  The problem is that a scanner that wants to use
 > > the usual finite deterministic automaton techniques for
 > > scanning needs to be able to compute the overlap between
> > different predicates.
> 
> If the predicates are functions, computing the overlap as another
> function is easy.

Ok, I should have been more precise.  The problem is to make
it efficient.  Usually, this is achieved by having a table
into which you index with the input character to compute
what state to enter next.  If you have many predicates and
potentially have to test a large number of them for each
input character before being able to determine the next
state, this might adversely influence the performance of the
scanner.
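
Schematically, the difference is this (just an illustration of the
general technique, not the actual code of the lexer combinators):

  import Data.Array (Array, (!))

  type State = Int

  -- Table-driven step: one array lookup per input character.
  nextTable :: Array (State, Char) State -> State -> Char -> State
  nextTable table s c = table ! (s, c)

  -- Predicate-driven step: in the worst case every outgoing edge of
  -- the current state must be tested before the next state is known.
  nextPred :: [(Char -> Bool, State)] -> State -> Char -> State
  nextPred edges errorState c =
    case [ s | (p, s) <- edges, p c ] of
      (s : _) -> s
      []      -> errorState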

Manuel




Re: combinator parsers and XSLT

2000-09-29 Thread Marcin 'Qrczak' Kowalczyk

Fri, 29 Sep 2000 10:14:05 +1100, Manuel M. T. Chakravarty <[EMAIL PROTECTED]> writes:

> The question is how do you *know* which ranges, e.g., the alphanumeric
> characters in a given unicode encoding occupy?  This is certainly
> different in Dutch and Japanese.

This is not different in practice.

Even if it were different, the types of Haskell's character predicates
require that they be constant.

In C the situation is different because the meaning of both char and
wchar_t may depend on the current locale, whereas in Haskell Char is
always Unicode.

My implementation uses a static table of official Unicode character
categories, and predicates test the category, sometimes with
exceptions. Details are being discussed with people on other mailing
lists - the mapping between categories and predicates is not obvious.
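
For illustration, a predicate with an exception on top of the
category table could look roughly like this (written against the
Category type from my earlier message in this thread; the particular
predicate and its exception are invented for the example):

  -- A predicate defined via the category table, with an explicit
  -- exception ('_' here) layered on top of it.
  isIdentStart :: Char -> Bool
  isIdentStart c = c == '_' || case category c of
    Letter _ -> True
    _        -> False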

The situation is worse with toUpper/toLower, which not only may
depend on the locale (the best-known case being Turkish), but also
need not map one character to one. That's why in Unicode the
toUpper/toLower mapping is informative, even though isUpper/isLower
is normative. Haskell's toUpper/toLower must be stateless and
Char->Char. In the future there will probably be a stateful locale
framework in Haskell, with e.g. locale-dependent string comparison
and correct String->String case mapping.
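
A String -> String mapping can at least express the one-to-many
cases; a toy sketch (handling only the German sharp s, purely as an
illustration):

  import Data.Char (toUpper)

  -- Toy String -> String upcasing that can map one character to
  -- several, which a Char -> Char toUpper cannot.
  upcase :: String -> String
  upcase = concatMap up
    where
      up '\xDF' = "SS"          -- ß has no single-character uppercase form
      up c      = [toUpper c]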

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: combinator parsers and XSLT

2000-09-28 Thread Lars Henrik Mathiesen

> From: "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]>
> Date: Fri, 29 Sep 2000 10:17:56 +1100

> I agree that usually the predicates as proposed by you would
> be better.  The problem is that a scanner that wants to use
> the usual finite deterministic automaton techniques for
> scanning needs to be able to compute the overlap between
> different predicates.

If the predicates are functions, computing the overlap as another
function is easy.

Isn't the point really that you need to know when the overlap is
empty, so you don't get a combinatorial explosion of cases?
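
Concretely (with plain Char -> Bool predicates):

  -- The overlap of two predicates is itself just another predicate ...
  overlap :: (Char -> Bool) -> (Char -> Bool) -> (Char -> Bool)
  overlap p q = \c -> p c && q c

  -- ... but deciding whether that overlap is empty means testing
  -- every character, which is what a range or table representation
  -- lets a scanner generator avoid.
  isEmpty :: (Char -> Bool) -> Bool
  isEmpty p = not (any p ['\0' .. maxBound])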

Lars Mathiesen (U of Copenhagen CS Dep) <[EMAIL PROTECTED]> (Humour NOT marked)





Re: combinator parsers and XSLT

2000-09-28 Thread Manuel M. T. Chakravarty

[EMAIL PROTECTED] (Marcin 'Qrczak' Kowalczyk) wrote,

> 28 Sep 2000 11:09:28 +0200, [EMAIL PROTECTED] <[EMAIL PROTECTED]> writes:
> 
> > Doaitse Swierstra's [This is the correct spelling!] parser combinators
> > in their newest incarnation have symbol ranges as their basis.
> > Internally they are also used to allow binary search
> 
> Certainly better than character lists.
> 
> I'm not sure if basing on character predicates would not be better.
> E.g. these 45443 isAlpha characters are split into 255 ranges.
> 
> My character category table is internally implemented as 256 vectors
> of 256 vectors of categories (all characters larger than '\x'
> currently have the same category).
> 
> Even though eight comparisons is not much compared to two table
> lookups, neither Haskell98 nor my library provides easy access to
> character ranges for predicates like isAlpha.
> 
> It would not be a problem to provide it, but it makes the interface
> almost twice as big.

I agree that usually the predicates as proposed by you would
be better.  The problem is that a scanner that wants to use
the usual finite deterministic automaton techniques for
scanning needs to be able to compute the overlap between
different predicates.

Manuel




Re: combinator parsers and XSLT

2000-09-28 Thread Manuel M. T. Chakravarty

[EMAIL PROTECTED] wrote,

> > "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]> (MMTC) writes:
> 
> MMTC> [EMAIL PROTECTED] (Marcin 'Qrczak' Kowalczyk) wrote,
> >> Wed, 27 Sep 2000 00:22:05 +1100, Manuel M. T. Chakravarty <[EMAIL PROTECTED]> writes:
> >> 
> >> > Hmm, this seems like a shortcoming in the Haskell spec.  We have all
> >> > these isAlpha, isDigit, etc functions, but I can't get at a list of,
> >> > say, all characters for which isAlpha is true.
> >> 
> >> You can: filter isAlpha ['\0'..'\x']
> >> (don't use maxBound here because it's too large and we know that
> >> currently there are no isAlpha characters outside this range).
> >> 
> >> Working on large explicit lists is inefficient. 45443 characters
> >> are isAlpha. A lexer should be designed to avoid using a full list.
> 
> MMTC> You are right, just having a list of the characters is too
> MMTC> naive an approach.  But this reinforces my point: we need
> MMTC> an _efficient_ way of getting at the unicode ranges for
> MMTC> certain character classes.  H98 seems to be lacking some
> MMTC> features for practical use of unicode - the header to the
> MMTC> standard library `Char' actually admits that
> 
> Doaitse Swierstra's [This is the correct spelling!] parser combinators in
> their newest incarnation have symbol ranges as their basis. Internally they
> are also used to allow binary search which is the primary reason for their
> speed. There are now also facilities for writing scanners  using these
> combinators. With the ranges parsing Unicode shouldn't be less efficient
> than parsing ASCII.

Yes, Doaitse told me about the ranges, but that wasn't the
point here.  The question is how do you *know* which ranges,
e.g., the alphanumeric characters in a given unicode encoding
occupy?  This is certainly different in Dutch and Japanese.
So, you can't hardcode it in your scanner spec, but instead
you have to get it from the OS via some Haskell library.
The question is how to represent this information in this
Haskell library.

Manuel




Re: combinator parsers and XSLT

2000-09-28 Thread Marcin 'Qrczak' Kowalczyk

28 Sep 2000 11:09:28 +0200, [EMAIL PROTECTED] <[EMAIL PROTECTED]> writes:

> Doaitse Swierstra's [This is the correct spelling!] parser combinators
> in their newest incarnation have symbol ranges as their basis.
> Internally they are also used to allow binary search

Certainly better than character lists.

I'm not sure if basing on character predicates would not be better.
E.g. these 45443 isAlpha characters are split into 255 ranges.

My character category table is internally implemented as 256 vectors
of 256 vectors of categories (all characters larger than '\x'
currently have the same category).

Even though eight comparisons is not much compared to two table
lookups, neither Haskell98 nor my library provides easy access to
character ranges for predicates like isAlpha.

It would not be a problem to provide it, but it makes the interface
almost twice as big.
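
Roughly, the lookup has this shape (Data.Array stands in for the
actual vector type, and Category is abbreviated to an Int here):

  import Data.Array (Array, (!))
  import Data.Bits  ((.&.), shiftR)

  type Category = Int                   -- stand-in for the real type
  type CatTable = Array Int (Array Int Category)

  -- Two-level lookup: the outer vector is indexed by the high byte
  -- of the character code, each inner vector by the low byte.
  -- Characters beyond the table share a single default category.
  categoryOf :: CatTable -> Category -> Char -> Category
  categoryOf table deflt c
    | code > 0xFFFF = deflt
    | otherwise     = (table ! hi) ! lo
    where
      code = fromEnum c
      hi   = code `shiftR` 8
      lo   = code .&. 0xFF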

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: combinator parsers and XSLT

2000-09-28 Thread piet

> "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]> (MMTC) writes:

MMTC> [EMAIL PROTECTED] (Marcin 'Qrczak' Kowalczyk) wrote,
>> Wed, 27 Sep 2000 00:22:05 +1100, Manuel M. T. Chakravarty <[EMAIL PROTECTED]> writes:
>> 
>> > Hmm, this seems like a shortcoming in the Haskell spec.  We have all
>> > these isAlpha, isDigit, etc functions, but I can't get at a list of,
>> > say, all characters for which isAlpha is true.
>> 
>> You can: filter isAlpha ['\0'..'\x']
>> (don't use maxBound here because it's too large and we know that
>> currently there are no isAlpha characters outside this range).
>> 
>> Working on large explicit lists is inefficient. 45443 characters
>> are isAlpha. A lexer should be designed to avoid using a full list.

MMTC> You are right, just having a list of the characters is too
MMTC> naive an approach.  But this reinforces my point: we need
MMTC> an _efficient_ way of getting at the unicode ranges for
MMTC> certain character classes.  H98 seems to be lacking some
MMTC> features for practical use of unicode - the header to the
MMTC> standard library `Char' actually admits that

Doaitse Swierstra's [This is the correct spelling!] parser combinators in
their newest incarnation have symbol ranges as their basis. Internally they
are also used to allow binary search which is the primary reason for their
speed. There are now also facilities for writing scanners  using these
combinators. With the ranges parsing Unicode shouldn't be less efficient
than parsing ASCII.
-- 
Piet van Oostrum <[EMAIL PROTECTED]>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: [EMAIL PROTECTED]





Re: combinator parsers and XSLT

2000-09-27 Thread Marcin 'Qrczak' Kowalczyk

Wed, 27 Sep 2000 20:00:09 +1100, Manuel M. T. Chakravarty <[EMAIL PROTECTED]> writes:

> But this reinforces my point: we need an _efficient_ way of
> getting at the unicode ranges for certain character classes.

IMHO usually the best representation of such a subset is a function
Char->Bool, probably composed of predicates like isAlpha.

The structure of such an expression should be simpler and more compact
than an explicit dispatching table on individual characters, and it
is more general - it allows using arbitrary predicates available in
that form.

It should be a natural choice in a functional language :-)
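
For instance, character classes for an XML-ish lexer compose
directly out of the standard predicates (the two classes below are
only illustrative, not the exact XML productions):

  import Data.Char (isAlpha, isDigit)

  -- Character classes as plain Char -> Bool predicates, built with
  -- ordinary boolean operators instead of explicit tables.
  isNameStart, isNameChar :: Char -> Bool
  isNameStart c = isAlpha c || c == '_' || c == ':'
  isNameChar  c = isNameStart c || isDigit c || c == '-' || c == '.'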

> H98 seems to be lacking some features for practical use of unicode
> - the header to the standard library `Char' actually admits that
> 
>   This module offers only a limited view of the full Unicode
>   character set; the full set of Unicode character
>   attributes is not accessible in this library.

I am working on a fuller module Char replacement, consulting details
with people from unicode and linux-utf8 mailing lists. It's in
 (already a bit out
of date).

A problem is that it is not Haskell98. Not only because of a limited
set of predicates, but Haskell98 specifies behavior of some predicates
in a way considered heretical and unfair by Unicode people (isSpace works
only for ISO-8859-1, isSpace '\xA0', isDigit works only for ASCII,
letters from alphabets without cases are all considered uppercase).
So I am temporarily forgetting about Haskell98, sorry.

Predicates are of course based on character categories from the
Unicode character database. Categories are exposed directly too.

One question about the interface. There are 30 categories, denoted by
two-letter abbreviations. Of course we could have a flat enumeration
of all 30. But perhaps it would be better to divide them according
to their structure (which corresponds to the first and second letter
of their abbreviations):

data Category
  = Letter      !Letter
  | Mark        !Mark
  | Number      !Number
  | Separator   !Separator
  | Other       !Other
  | Punctuation !Punctuation
  | Symbol      !Symbol

data Letter = Uppercase | Lowercase | Titlecase | ModifierLetter | OtherLetter
data Mark = NonSpacing | Spacing | Enclosing
data Number = Decimal | LetterNumber | OtherNumber
data Separator = Space | Line | Paragraph
data Other = Control | Format | Surrogate | PrivateUse | NotAssigned
data Punctuation = Connector | Dash | Open | Close | Initial | Final | OtherPunctuation
data Symbol = Math | Currency | ModifierSymbol | OtherSymbol

This leads to simpler predicates, e.g. isAlphaNum only checks whether
the outer constructor is Letter or Number, instead of enumerating
eight categories, so I guess that it will be similarly simpler for
somebody wanting to check categories directly. It would also be more
robust if a subcategory is added in a future Unicode standard.
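
Concretely, with the Category type above and a lookup
category :: Char -> Category (as used further down), such a
predicate could be written as:

  -- Only the outer constructor matters; a new Letter or Number
  -- subcategory in a later Unicode version changes nothing here.
  isAlphaNum :: Char -> Bool
  isAlphaNum c = case category c of
    Letter _ -> True
    Number _ -> True
    _        -> False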

But it makes the structure of the Category type more complex.
I'm not sure if this is a good idea.

It happens that GHC nicely optimizes some compound predicates.
For example isPunct ch || isSymbol ch compiles into the code like

  case category ch of
    Punctuation _ -> True
    Symbol      _ -> True
    _             -> False

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: combinator parsers and XSLT

2000-09-27 Thread Manuel M. T. Chakravarty

[EMAIL PROTECTED] (Marcin 'Qrczak' Kowalczyk) wrote,

> Wed, 27 Sep 2000 00:22:05 +1100, Manuel M. T. Chakravarty <[EMAIL PROTECTED]> writes:
> 
> > Hmm, this seems like a shortcoming in the Haskell spec.  We have all
> > these isAlpha, isDigit, etc functions, but I can't get at a list of,
> > say, all characters for which isAlpha is true.
> 
> You can: filter isAlpha ['\0'..'\x']
> (don't use maxBound here because it's too large and we know that
> currently there are no isAlpha characters outside this range).
> 
> Working on large explicit lists is inefficient. 45443 characters
> are isAlpha. A lexer should be designed to avoid using a full list.

You are right, just having a list of the characters is too
naive an approach.  But this reinforces my point: we need
an _efficient_ way of getting at the unicode ranges for
certain character classes.  H98 seems to be lacking some
features for practical use of unicode - the header to the
standard library `Char' actually admits that

  This module offers only a limited view of the full Unicode
  character set; the full set of Unicode character
  attributes is not accessible in this library.

Manuel




RE: combinator parsers and XSLT

2000-09-27 Thread Manuel M. T. Chakravarty

Doug Ransom <[EMAIL PROTECTED]> wrote,

> I think unicode is very important for xml document processing, which  is my
> interest.

Fair enough, but in an earlier message you wrote,

> I am currently doing all my work in hugs.

Hugs doesn't support unicode `Char's anyway.  However, I
have added support for ranges to my todo list for the lexer
combinators.  So, thanks for pointing this issue out.

Manuel




Re: combinator parsers and XSLT

2000-09-26 Thread Marcin 'Qrczak' Kowalczyk

Wed, 27 Sep 2000 00:22:05 +1100, Manuel M. T. Chakravarty <[EMAIL PROTECTED]> writes:

> Hmm, this seems like a shortcoming in the Haskell spec.  We have all
> these isAlpha, isDigit, etc functions, but I can't get at a list of,
> say, all characters for which isAlpha is true.

You can: filter isAlpha ['\0'..'\x']
(don't use maxBound here because it's too large and we know that
currently there are no isAlpha characters outside this range).

Working on large explicit lists is inefficient. 45443 characters
are isAlpha. A lexer should be designed to avoid using a full list.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





RE: combinator parsers and XSLT

2000-09-26 Thread Doug Ransom

I think unicode is very important for xml document processing, which  is my
interest.


> -----Original Message-----
> From: Manuel M. T. Chakravarty [mailto:[EMAIL PROTECTED]]
> Sent: Monday, September 25, 2000 9:11 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: combinator parsers and XSLT 
> 
> 
> Doug Ransom <[EMAIL PROTECTED]> wrote,
> 
> > > There is no need for "." or [^abc] as Haskell list operators
> > > can be used to "simulate" them.  The following is from the C
> > > lexer and matches all visible characters and all characters
> > > except newline, respectively:
> > > 
> > >   visible  = alt [' '..'\127']
> > >   anyButNL = alt (['\0'..'\255'] \\ ['\n'])
> > 
> > 
> > That is true, but how about dealing with unicode characters?
> > 
> > anyButNL = alt (['\0'..'\65536'] \\ ['\n'])
> > 
> > The space required becomes excessive.
> 
> True, but the current implementation would be hopeless for
> unicode anyway, as it builds a table representing a
> deterministic finite state automaton (DFA), where the worst
> case size of the table is
> 
>* 
> 
> In all practical cases, the required space is much smaller
> as states with less than 20 characters having a non-error
> transition store the state transitions in a list.
> Furthermore, even in states with more than 20 characters
> with a non-error transition, the size of the table is only
> that of
> 
>   ord  - ord  + 1
> 
> (these are characters with non-error transitions).
> 
> For 16bit character ranges, it would be necessary to
> directly store negated character sets (such as [^abc]).
> From what he told me, Doaitse Swierstra is working on a lexer
> that is using explicit ranges, but I am not sure whether he
> also has negated ranges.
> 
> Currently, most Haskell systems don't support unicode anyway
> (I think hbc is the only exception), so I guess this is not
> a pressing issue.  As soon as we have unicode support and
> there is a need for lexers handling unicode input, I am
> willing to extend the lexer library to gracefully handle the
> cases that you outlined.
> 
> Cheers,
> Manuel
> 




Re: combinator parsers and XSLT

2000-09-26 Thread Manuel M. T. Chakravarty

Lars Henrik Mathiesen <[EMAIL PROTECTED]> wrote,

> > From: "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]>
> > Date: Tue, 26 Sep 2000 15:11:23 +1100
> 
> > For 16bit character ranges, it would be necessary to
> > directly store negated character sets (such as [^abc]).
> > From what he told me, Doaitse Swierstra is working on a lexer
> > that is using explicit ranges, but I am not sure whether he
> > also has negated ranges.
> 
> People with experience from other Unicode-enabled environments will
> expect support for character classes like letter or digit --- which in
> Unicode are not simple single ranges, but widely scattered over the
> map. (Just look at Latin-1, where you have to use [A-Za-zÀ-ÖØ-öø-ÿ]
> because two arithmetic operators snuck into the accented character
> range. (Blame the French)).
> 
> Such support will also allow your parser to work with the next, bigger
> version of Unicode, since the parser library should just inherit the
> character class support from the Haskell runtime, which should in turn
> get it from the OS. The OS people are already doing the work to get
> the necessary tables and routines compressed into a few kilobytes.

Hmm, this seems like a shortcoming in the Haskell spec.  We
have all these isAlpha, isDigit, etc functions, but I can't
get at a list of, say, all characters for which isAlpha is
true. 

> Also, Unicode isn't 16-bit any more, it's more like 20.1 bits --- the
> range is hex 0 to 1f. Although the official character assignments
> will stay below hex 2 or so, your code may have to work on systems
> with private character assignments in the hex 10+ range.

Ok, I didn't really mean that the mentioned extension will
rely on Unicode being 16 bits.  This is only a size where
you don't really want to build an exhaustive transition
table anymore.

Manuel




Re: combinator parsers and XSLT

2000-09-26 Thread Manuel M. T. Chakravarty

Lennart Augustsson <[EMAIL PROTECTED]> wrote,

> "Manuel M. T. Chakravarty" wrote:
> 
> > Currently, most Haskell systems don't support unicode anyway
> > (I think hbc is the only exception), so I guess this is not
> > a pressing issue.  As soon as we have unicode support and
> > there is a need for lexers handling unicode input, I am
> > willing to extend the lexer library to gracefully handle the
> > cases that you outlined.
> I'm sorry, but I must object (strongly) to this attitude.
> It's this kind of reasoning that stops Unicode from becoming
> widespread.

I am tempted to agree with you.  I am just a lazy bastard,
that's the problem.

> Soon the GHC people (or whoever :) will say "Well, why should we
> support Unicode, there's all this software out there that breaks down
> with it." and we're in a vicious circle.

Hmmm, in this particular case nothing breaks down.  The
lexer combinators themselves never internally use the
assumption that a char is 8bit (I may be lazy, but I still
prefer clean code).  Only when you explicitly use them to
build a scanner that does scan unicode files (and is aware
of it) might you run into space efficiency problems.

> Strong hint to various people:
> Haskell has had Unicode for a long time now.  I think that before
> you start implementing various extensions to Haskell, perhaps you
> should implement what the standard says should be there.
> Implementing Unicode isn't that hard, just a few days work.

You might be pleased to hear that - if I am not mistaken -
Qrczak is working on Unicode support for ghc.

> Strongly opposing Anglo-Saxon language imperialism

:-)

Manuel




Re: combinator parsers and XSLT

2000-09-26 Thread Marcin 'Qrczak' Kowalczyk

Tue, 26 Sep 2000 00:42:15 -0400, Lennart Augustsson <[EMAIL PROTECTED]> writes:

> Implementing Unicode isn't that hard, just a few days work.

GHC in CVS already has Unicode Chars. Unfortunately designing
libraries such that file contents are correctly translated between
Unicode and the local charset by default, but with the ability to
use other encodings when needed, is not that trivial...

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: combinator parsers and XSLT

2000-09-26 Thread Lars Henrik Mathiesen

> From: "Manuel M. T. Chakravarty" <[EMAIL PROTECTED]>
> Date: Tue, 26 Sep 2000 15:11:23 +1100

> For 16bit character ranges, it would be necessary to
> directly store negated character sets (such as [^abc]).
> From what he told me, Doaitse Swierstra is working on a lexer
> that is using explicit ranges, but I am not sure whether he
> also has negated ranges.

People with experience from other Unicode-enabled environments will
expect support for character classes like letter or digit --- which in
Unicode are not simple single ranges, but widely scattered over the
map. (Just look at Latin-1, where you have to use [A-Za-zÀ-ÖØ-öø-ÿ]
because two arithmetic operators snuck into the accented character
range. (Blame the French)).

Such support will also allow your parser to work with the next, bigger
version of Unicode, since the parser library should just inherit the
character class support from the Haskell runtime, which should in turn
get it from the OS. The OS people are already doing the work to get
the necessary tables and routines compressed into a few kilobytes.

Also, Unicode isn't 16-bit any more, it's more like 20.1 bits --- the
range is hex 0 to 1f. Although the official character assignments
will stay below hex 2 or so, your code may have to work on systems
with private character assignments in the hex 10+ range.

Lars Mathiesen (U of Copenhagen CS Dep) <[EMAIL PROTECTED]> (Humour NOT marked)




Re: combinator parsers and XSLT

2000-09-25 Thread Lennart Augustsson

"Manuel M. T. Chakravarty" wrote:

> Currently, most Haskell systems don't support unicode anyway
> (I think hbc is the only exception), so I guess this is not
> a pressing issue.  As soon as we have unicode support and
> there is a need for lexers handling unicode input, I am
> willing to extend the lexer library to gracefully handle the
> cases that you outlined.
I'm sorry, but I must object (strongly) to this attitude.
It's this kind of reasoning that stops Unicode from becoming
widespread.
Soon the GHC people (or whoever :) will say "Well, why should we
support Unicode, there's all this software out there that breaks down
with it." and we're in a vicious circle.

Strong hint to various people:
Haskell has had Unicode for a long time now.  I think that before
you start implementing various extensions to Haskell, perhaps you
should implement what the standard says should be there.
Implementing Unicode isn't that hard, just a few days work.

Strongly opposing Anglo-Saxon language imperialism
-- Lennart






RE: combinator parsers and XSLT

2000-09-25 Thread Manuel M. T. Chakravarty

Doug Ransom <[EMAIL PROTECTED]> wrote,

> > There is no need for "." or [^abc] as Haskell list operators
> > can be used to "simulate" them.  The following is from the C
> > lexer and matches all visible characters and all characters
> > except newline, respectively:
> > 
> >   visible  = alt [' '..'\127']
> >   anyButNL = alt (['\0'..'\255'] \\ ['\n'])
> 
> 
> That is true, but how about dealing with unicode characters?
> 
> anyButNL = alt (['\0'..'\65536'] \\ ['\n'])
> 
> The space required becomes excessive.

True, but the current implementation would be hopeless for
unicode anyway, as it builds a table representing a
deterministic finite state automaton (DFA), where the worst
case size of the table is

   * 

In all practical cases, the required space is much smaller
as states with less than 20 characters having a non-error
transition store the state transitions in a list.
Furthermore, even in states with more than 20 characters
with a non-error transition, the size of the table is only
that of

  ord  - ord  + 1

(these are characters with non-error transitions).

For 16bit character ranges, it would be necessary to
directly store negated character sets (such as [^abc]).
From what he told me, Doaitse Swierstra is working on a lexer
that is using explicit ranges, but I am not sure whether he
also has negated ranges.

Currently, most Haskell systems don't support unicode anyway
(I think hbc is the only exception), so I guess this is not
a pressing issue.  As soon as we have unicode support and
there is a need for lexers handling unicode input, I am
willing to extend the lexer library to gracefully handle the
cases that you outlined.

Cheers,
Manuel




RE: combinator parsers and XSLT

2000-09-25 Thread Doug Ransom



 
> 
> There is no need for "." or [^abc] as Haskell list operators
> can be used to "simulate" them.  The following is from the C
> lexer and matches all visible characters and all characters
> except newline, respectively:
> 
>   visible  = alt [' '..'\127']
>   anyButNL = alt (['\0'..'\255'] \\ ['\n'])


That is true, but how about dealing with unicode characters?

anyButNL = alt (['\0'..'\65536'] \\ ['\n'])

The space required becomes excessive.





Re: combinator parsers and XSLT

2000-09-25 Thread Manuel M. T. Chakravarty

"Doug Ransom" <[EMAIL PROTECTED]> wrote,

> Has anyone had experience with Parsec vs. --  the Parser combinators
> supplied with Manuel M. T. Chakravarty's Compiler Toolkit: Self-optimizing
> LL(1) parser combinators?
> 
> As someone trying to learn FP, I have been playing with the Compiler Toolkit
> lexical analyzer combinators, but I have had some roadblocks:
> - there are no examples and the tests do not cover the typical use cases.

The main example at the moment is the C lexer and parser
that is part of C->Haskell:

  http://www.cse.unsw.edu.au/~chak/haskell/c2hs/

(You will find all source and instructions for CVS access at
that page.)

Moreover, I have recently implemented another (smaller, and
thus, hopefully easier to understand) lexer and parser with
the combinators.  I attach the source code of the
lexer/parser and the AST to this message.  The grammar of
the accepted language is part of the initial comment of the
parser module.

> - regular expression support is limited -- there is no "." operator to match
> any char, nor any exclusion operator, i.e. [^abc] for anything not in abc.
> I anticipate success, but I think the Compiler Toolkit needs some
> improvements.

There is no need for "." or [^abc] as Haskell list operators
can be used to "simulate" them.  The following is from the C
lexer and matches all visible characters and all characters
except newline, respectively:

  visible  = alt [' '..'\127']
  anyButNL = alt (['\0'..'\255'] \\ ['\n'])

I agree that the documentation of the toolkit should be
improved :-/

Anyway, let me know if you have any specific suggestions
for improvement.

Cheers,
Manuel

PS: Sorry for the late reply, but I just returned from ICFP
- which was great btw.


--  Flattening Prototype: parser for the lambda calculus with parallel arrays
--
--  Authors: Nepal Team
--  Created: 13 September 2000
--
--  Version $Revision: 1.3 $ from $Date: 2000/09/15 12:16:24 $
--
--  Copyright (c) 2000 Chakravarty, Keller, Lechtchinsky & Pfannenstiel
--
--  This file is free software; you can redistribute it and/or modify
--  it under the terms of the GNU General Public License as published by
--  the Free Software Foundation; either version 2 of the License, or
--  (at your option) any later version.
--
--  This file is distributed in the hope that it will be useful,
--  but WITHOUT ANY WARRANTY; without even the implied warranty of
--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
--  GNU General Public License for more details.
--
--- DESCRIPTION ---
--
--  Parses a sugared version of the lambda calculus with parallel arrays.
--
--- DOCU --
--
--  language: Haskell 98
--
--  Grammar of the accepted language:
--
--  ident  -> alphanumeric identifiers with lower and upper letters plus the
--characters `_' and `''
--  ifxide -> string of symbols from "!#$%&*+-/:<=>?^|~"
--  int    -> one or more digits
--
--  type  -> `/' ide `::' expr
-- | ftype
--  ftype -> [stype `->'] ftype
--  stype -> [stype `+' ] ptype
--  ptype -> [btype `*' ] type
--  btype -> `()' | `Bool' | `Int'
-- | `[:' type `]'
-- | `(' type `)'
--
--  expr  -> `\' ident `::' type `.' expr
-- | `/' ident `::' type `.' expr
-- | lexpr
-- | iexpr
--  iexpr -> iexpr ifxide aexpr
--  aexpr -> aexpr bexpr
--  bexpr -> `()' | `(,)'| `false' | true' | int | ident
-- | `(' ifxide `)'
-- | `(' expr `)'
-- | `(' expr `,' expr `)'
--
--  lexpr -> `let' bind_1 ... bind_n `in' expr  (n > 1)
--  bind  -> ident args `::' type `=' expr ';'
--  args  -> arg_1 ... arg_n    (n > 0)
--  arg   -> `(' ident `::' type `)'
--
--  script -> bind_1 ... bind_n main    (n > 0)
--  main   -> `main' `=' expr
--
--  Note: This module uses the self-optimising parsers from the CTKlight,
--which rely on existentially quantified type variables - a feature
--not included in Haskell 98.
--
--- TODO --
--

module Parser (
  parseScript, parseExpr, parseType
) where


import Common  (Position, Pos(..), nopos)
import Utils   (Tag(tag))
import Errors  (showError)
import Lexers  (Lexer, Regexp, (>||<), (>|<), (+>), star, plus, epsilon,
                char, string, alt, lexaction, ctrlLexer, execLexer)
import Parsers (Token, Parser, (<|>), (*>), (-*>), (*->), list, list1, many,
                many1, sep1, seplist1, skip, token, action, execParser)

import CoreAST
import Trafo   (mkPair, mkPairPos)


-- lexical analysis
-- 

-- token definition
--
data AToken = TokLParen  Position   -- `('
            | TokRParen  Position   -- `)'
            | TokUnit    Position   -- `()'
            | TokLArray  Position   -- `[:'
 

Re: combinator parsers and XSLT

2000-09-19 Thread Joe English


Doug Ransom wrote:

> Eventually I hope to parse xpath, and then xslt using nodes in an xml tree
> as tokens. If I get that far, I would like to find a way to implement an
> xslt processor that makes one pass over an xml document using an algebra for
> low-memory processing.

If you figure out a way to do this, the XSLT community
will be very very happy :-)  An XSLT implementation that
doesn't require the entire input document to be loaded
into memory is an open research question.


--Joe English

  [EMAIL PROTECTED]