Re: Bison lexer

2018-09-15 Thread Hans Åberg


> On 15 Sep 2018, at 18:34, Akim Demaille  wrote:
> 
>> Le 15 sept. 2018 à 14:51, Hans Åberg  a écrit :
>> 
>> 
>>> On 15 Sep 2018, at 07:07, Akim Demaille  wrote:
>>> 
>>> So, while I totally understand Frank’s point, I’m less worried than
>>> he is, and use Flex’s C++ backend.
>> 
>> Which Flex version? It only works before 2.6.0. See:
>> 
>> https://stackoverflow.com/questions/34438023/openfoam-flex-yyin-rdbufstdcin-rdbuf-error
> 
> I’ve been using all the versions of Flex for a while now.  See:
> 
> https://gitlab.lrde.epita.fr/vcsn/vcsn/commit/dcdf2b3cae7353d99520448a7b22a4334a72bc5d

You patch up the Flex header and then use it locally? — I used that approach, 
but then I got it working without patching in 2.5.37 by adding a class in .yy, 
but then it broke in 2.6. Another method I used in the past was my own skeleton 
file, but then must be maintained.





Re: Bison lexer

2018-09-15 Thread Akim Demaille



> Le 15 sept. 2018 à 14:51, Hans Åberg  a écrit :
> 
> 
>> On 15 Sep 2018, at 07:07, Akim Demaille  wrote:
>> 
>> So, while I totally understand Frank’s point, I’m less worried than
>> he is, and use Flex’s C++ backend.
> 
> Which Flex version? It only works before 2.6.0. See:
> 
> https://stackoverflow.com/questions/34438023/openfoam-flex-yyin-rdbufstdcin-rdbuf-error

I’ve been using all the versions of Flex for a while now.  See:

https://gitlab.lrde.epita.fr/vcsn/vcsn/commit/dcdf2b3cae7353d99520448a7b22a4334a72bc5d


Re: Bison lexer

2018-09-15 Thread Hans Åberg


> On 15 Sep 2018, at 07:07, Akim Demaille  wrote:
> 
>> Le 31 août 2018 à 23:39, Hans Åberg  a écrit :
>> 
> But the final straw was when, after changing to C++ Bison, I wanted
> to switch to C++ Flex too and found this beautiful comment:
> 
> /* The c++ scanner is a mess. The FlexLexer.h header file relies on the
>  * following macro. This is required in order to pass the 
> c++-multiple-scanners
>  * test in the regression suite. We get reports that it breaks 
> inheritance.
>  * We will address this in a future release of flex, or omit the C++ 
> scanner
>  * altogether. */
 
 It has been like that since the 1990s, I believe.
>>> 
>>> Even better! :(
>>> 
>>> Especially since C++ in the 1990s was totally different from modern
>>> C++, so I have no idea if anything of this comment is still
>>> relevant, or maybe even more relevant, today compared to then.
>> 
>> Indeed, very old.
> 
> So, while I totally understand Frank’s point, I’m less worried than
> he is, and use Flex’s C++ backend.

Which Flex version? It only works before 2.6.0. See:

https://stackoverflow.com/questions/34438023/openfoam-flex-yyin-rdbufstdcin-rdbuf-error

> It seems that the resources developments of Flex are scarce.  They
> easily agree on issues, but even for the most trivial ones (e.g.,
> delete three lines, https://github.com/westes/flex/issues/379),
> they ask for a patch.
> 
> But, then, who am I to discuss about the maintenance resources :-(

The issue above was discussed on their new mailing list in 2016 or so, but no 
fix yet.





Re: Bison lexer

2018-09-15 Thread Hans Åberg


> On 15 Sep 2018, at 07:02, Akim Demaille  wrote:
> 
>> Le 29 août 2018 à 15:56, Hans Åberg  a écrit :
>> 
>> I did that, too: I wrote some DFA/NFA code, and incidentally found the most 
>> efficient method make action matches via a reverse NFA lookup, cf. [1-3]. 
>> Also, I have made UTF-8/32 to octet character class translations. 
>> 
>> 1. https://gcc.gnu.org/ml/libstdc++/2018-04/msg00032.html
>> 2. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472
>> 3. https://gcc.gnu.org/ml/libstdc++/2018-05/msg00015.html
> 
> That was interesting.  

Thanks. I wanted a dynamic lexer and and at least a partially dynamic parser so 
users define their own operators. One thing that remains with the lexer is the 
backreferenses (see below).

> I found that Tim Shen exposed his work on
>  https://www.youtube.com/watch?v=N_rkHzhXueo.

I haven't seen all.

> When it comes to conversion from expressions to automaton, I’m
> a big fan of Brzozozski’s derivatives, that, in addition, easily
> supported complement and intersection.  

I just have a C++ automaton class that builds the NFA directly through 
operators corresponding to those of a regular expression. The NFA then has no 
empty transitions, and a set of start states, which correspond to the singel 
DFA start state, in the subset construction. For example, for alteration just 
take the union of both the NFAs and their start state sets. A working regex 
implementation then has some additions, such as loops for count matches.

> No idea about group
> captures, and certainly not backward references.

Then when building the NFA, its start and end states form a group, which can be 
identified with unique number, if you so will. Backreferences I think of 
working so that when it appearing, one makes a lookup of its value by the 
reverse NFA method I give, and then inserting it as a dynamic NFA. Strictly, 
the value of the backreference may then change as one comes to a new one, but I 
suspect those that invented the concept have not considered that.

> redgrep implements this approach.  This talks touches the case
> of capturing groups.
> 
> https://www.youtube.com/watch?v=CMhqlRBfVX4&t=8s&frags=pl%2Cwn

The method I give in effect the sub-NFA that the input string uses, so the 
group capture is automatic. Then working together towards DFA minimalization, 
it turns out that one cannot even use the DFA, because different  group markers 
may be merged in to the same DFA state. Any DF minimalization must then keep 
track of that. So it may be similar to LALR state merging, where one must to 
keep track of the whole rules.





Re: Bison lexer

2018-09-14 Thread Akim Demaille



> Le 31 août 2018 à 23:39, Hans Åberg  a écrit :
> 
 But the final straw was when, after changing to C++ Bison, I wanted
 to switch to C++ Flex too and found this beautiful comment:
 
  /* The c++ scanner is a mess. The FlexLexer.h header file relies on the
   * following macro. This is required in order to pass the 
 c++-multiple-scanners
   * test in the regression suite. We get reports that it breaks 
 inheritance.
   * We will address this in a future release of flex, or omit the C++ 
 scanner
   * altogether. */
>>> 
>>> It has been like that since the 1990s, I believe.
>> 
>> Even better! :(
>> 
>> Especially since C++ in the 1990s was totally different from modern
>> C++, so I have no idea if anything of this comment is still
>> relevant, or maybe even more relevant, today compared to then.
> 
> Indeed, very old.

So, while I totally understand Frank’s point, I’m less worried than
he is, and use Flex’s C++ backend.

It seems that the resources developments of Flex are scarce.  They
easily agree on issues, but even for the most trivial ones (e.g.,
delete three lines, https://github.com/westes/flex/issues/379),
they ask for a patch.

But, then, who am I to discuss about the maintenance resources :-(


Re: Bison lexer

2018-09-14 Thread Akim Demaille



> Le 29 août 2018 à 15:56, Hans Åberg  a écrit :
> 
> I did that, too: I wrote some DFA/NFA code, and incidentally found the most 
> efficient method make action matches via a reverse NFA lookup, cf. [1-3]. 
> Also, I have made UTF-8/32 to octet character class translations. 
> 
> 1. https://gcc.gnu.org/ml/libstdc++/2018-04/msg00032.html
> 2. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472
> 3. https://gcc.gnu.org/ml/libstdc++/2018-05/msg00015.html

That was interesting.  I found that Tim Shen exposed his work on
 https://www.youtube.com/watch?v=N_rkHzhXueo.

When it comes to conversion from expressions to automaton, I’m
a big fan of Brzozozski’s derivatives, that, in addition, easily
supported complement and intersection.  No idea about group
captures, and certainly not backward references.

redgrep implements this approach.  This talks touches the case
of capturing groups.

https://www.youtube.com/watch?v=CMhqlRBfVX4&t=8s&frags=pl%2Cwn




Re: Bison lexer

2018-08-31 Thread Hans Åberg


> On 1 Sep 2018, at 00:12, Frank Heckenbach  wrote:
> 
> Hans Åberg wrote:
> 
>>> I haven't used gcc-8 yet, but how is this relevant? If anything, I
>>> expect newer gcc versions to produce more warnings (usually useful)
>>> which flex might also suffer from.
>> 
>> Maybe the Flex lexers errors is due to using C89 to compile it or something.
> 
> No, the warnings seemed legit.

It uses "register" which has been deprecated in C++17.

>>> Interesting, thanks. Fortunately, my REs are not so complex, so the
>>> bug you reported won't affect me and lexing speed is not so
>>> important for me, so (at least for now) I can just use the library
>>> as is. But if I ever need something more sophisticated, I'll keep
>>> this in mind.
>> 
>> If that is what you are using, note that it is recursive, so the function 
>> stack might overflow. But perhaps the rewrite it someday.
> 
> I don't think my lexing REs should cause much recursion. No nested
> repetitions or such.

It is in the backtracking, which it does instead of a DFA iteration, in the GCC 
regex library, that is. Some example in the links I gave illustrate that.





Re: Bison lexer

2018-08-31 Thread Frank Heckenbach
Hans Åberg wrote:

> > I haven't used gcc-8 yet, but how is this relevant? If anything, I
> > expect newer gcc versions to produce more warnings (usually useful)
> > which flex might also suffer from.
> 
> Maybe the Flex lexers errors is due to using C89 to compile it or something.

No, the warnings seemed legit.

> > Interesting, thanks. Fortunately, my REs are not so complex, so the
> > bug you reported won't affect me and lexing speed is not so
> > important for me, so (at least for now) I can just use the library
> > as is. But if I ever need something more sophisticated, I'll keep
> > this in mind.
> 
> If that is what you are using, note that it is recursive, so the function 
> stack might overflow. But perhaps the rewrite it someday.

I don't think my lexing REs should cause much recursion. No nested
repetitions or such.

Regards,
Frank



Re: Bison lexer

2018-08-31 Thread Hans Åberg


> On 31 Aug 2018, at 22:26, Frank Heckenbach  wrote:
> 
> Hans Åberg wrote:
> 
>>> For a start, I didn't have very good experience communicating with
>>> Flex maintainer(s?) who seemed rather nonchalant WRT gcc warnings
>>> etc. in the generated code, so over the years I'd been adjusting
>>> various warning-suppression gcc options or doing dirty #define
>>> tricks to avoid warnings, or sometimes even post-processing the
>>> generated lexer with sed.
>> 
>> GCC 8.2 uses C17 as default.
> 
> I haven't used gcc-8 yet, but how is this relevant? If anything, I
> expect newer gcc versions to produce more warnings (usually useful)
> which flex might also suffer from.

Maybe the Flex lexers errors is due to using C89 to compile it or something.

>>> But the final straw was when, after changing to C++ Bison, I wanted
>>> to switch to C++ Flex too and found this beautiful comment:
>>> 
>>>   /* The c++ scanner is a mess. The FlexLexer.h header file relies on the
>>>* following macro. This is required in order to pass the 
>>> c++-multiple-scanners
>>>* test in the regression suite. We get reports that it breaks 
>>> inheritance.
>>>* We will address this in a future release of flex, or omit the C++ 
>>> scanner
>>>* altogether. */
>> 
>> It has been like that since the 1990s, I believe.
> 
> Even better! :(
> 
> Especially since C++ in the 1990s was totally different from modern
> C++, so I have no idea if anything of this comment is still
> relevant, or maybe even more relevant, today compared to then.

Indeed, very old.

> Lesson (as if anyone was listening): Always put a date on such
> messages.

Probably just a hack, never actually developed.

>>> So I wrote a small library that builds that massive RE out of single
>>> rules and maps subexpressions back to rules (even in the case that
>>> rules contain subexpressions of their own), and that works for me.
>> 
>> I did that, too: I wrote some DFA/NFA code, and incidentally found
>> the most efficient method make action matches via a reverse NFA
>> lookup, cf. [1-3]. Also, I have made UTF-8/32 to octet character
>> class translations.
>> 
>> 1. https://gcc.gnu.org/ml/libstdc++/2018-04/msg00032.html
>> 2. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472
>> 3. https://gcc.gnu.org/ml/libstdc++/2018-05/msg00015.html
> 
> Interesting, thanks. Fortunately, my REs are not so complex, so the
> bug you reported won't affect me and lexing speed is not so
> important for me, so (at least for now) I can just use the library
> as is. But if I ever need something more sophisticated, I'll keep
> this in mind.

If that is what you are using, note that it is recursive, so the function stack 
might overflow. But perhaps the rewrite it someday.





Re: Bison lexer

2018-08-31 Thread Frank Heckenbach
Hans Åberg wrote:

> > For a start, I didn't have very good experience communicating with
> > Flex maintainer(s?) who seemed rather nonchalant WRT gcc warnings
> > etc. in the generated code, so over the years I'd been adjusting
> > various warning-suppression gcc options or doing dirty #define
> > tricks to avoid warnings, or sometimes even post-processing the
> > generated lexer with sed.
> 
> GCC 8.2 uses C17 as default.

I haven't used gcc-8 yet, but how is this relevant? If anything, I
expect newer gcc versions to produce more warnings (usually useful)
which flex might also suffer from.

> > But the final straw was when, after changing to C++ Bison, I wanted
> > to switch to C++ Flex too and found this beautiful comment:
> > 
> >/* The c++ scanner is a mess. The FlexLexer.h header file relies on the
> > * following macro. This is required in order to pass the 
> > c++-multiple-scanners
> > * test in the regression suite. We get reports that it breaks 
> > inheritance.
> > * We will address this in a future release of flex, or omit the C++ 
> > scanner
> > * altogether. */
> 
> It has been like that since the 1990s, I believe.

Even better! :(

Especially since C++ in the 1990s was totally different from modern
C++, so I have no idea if anything of this comment is still
relevant, or maybe even more relevant, today compared to then.

Lesson (as if anyone was listening): Always put a date on such
messages.

> > So I wrote a small library that builds that massive RE out of single
> > rules and maps subexpressions back to rules (even in the case that
> > rules contain subexpressions of their own), and that works for me.
> 
> I did that, too: I wrote some DFA/NFA code, and incidentally found
> the most efficient method make action matches via a reverse NFA
> lookup, cf. [1-3]. Also, I have made UTF-8/32 to octet character
> class translations.
> 
> 1. https://gcc.gnu.org/ml/libstdc++/2018-04/msg00032.html
> 2. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472
> 3. https://gcc.gnu.org/ml/libstdc++/2018-05/msg00015.html

Interesting, thanks. Fortunately, my REs are not so complex, so the
bug you reported won't affect me and lexing speed is not so
important for me, so (at least for now) I can just use the library
as is. But if I ever need something more sophisticated, I'll keep
this in mind.

Regards,
Frank



Bison lexer

2018-08-29 Thread Hans Åberg


> On 29 Aug 2018, at 00:31, Frank Heckenbach  wrote:
> 
> Hans Åberg wrote:
> 
>>> On 27 Aug 2018, at 22:10, Akim Demaille  wrote:
>>> 
 Most of my porting work, apart from writing the new skeletons, was
 general grammar cleanup and conversion of semantic types from raw
 pointers and containers to smart pointers and other RAII classes
 (which was my main goal of the port, of course), and changes in the
 lexer (dropping flex, but that's another story).
>>> 
>>> I fought a lot with Flex, but it works ok in C++ too with lalr1.cc.
>>> I have one parser here, 
>>> https://gitlab.lrde.epita.fr/vcsn/vcsn/tree/master/lib/vcsn/dot,
>>> and another there 
>>> https://gitlab.lrde.epita.fr/vcsn/vcsn/tree/master/lib/vcsn/rat
>>> for instance, using Flex.
>> 
>> That is probably versions before 2.6; the yyin and yyout have been
>> changed in the C++ header so that they are no longer pointers, so
>> it is not only incompatible with the header of older versions, but
>> also with the code it writes, resulting in the issue [1].
>> 
>> 1. 
>> https://stackoverflow.com/questions/34438023/openfoam-flex-yyin-rdbufstdcin-rdbuf-error
> 
> Though this wasn't actually my problem, I'll reply to this mail
> rather than the main thraed to keep it separate from the actual
> Bison discussion.

One can change the subject. :-)

> For a start, I didn't have very good experience communicating with
> Flex maintainer(s?) who seemed rather nonchalant WRT gcc warnings
> etc. in the generated code, so over the years I'd been adjusting
> various warning-suppression gcc options or doing dirty #define
> tricks to avoid warnings, or sometimes even post-processing the
> generated lexer with sed.

GCC 8.2 uses C17 as default.

> But the final straw was when, after changing to C++ Bison, I wanted
> to switch to C++ Flex too and found this beautiful comment:
> 
>/* The c++ scanner is a mess. The FlexLexer.h header file relies on the
> * following macro. This is required in order to pass the 
> c++-multiple-scanners
> * test in the regression suite. We get reports that it breaks inheritance.
> * We will address this in a future release of flex, or omit the C++ 
> scanner
> * altogether. */

It has been like that since the 1990s, I believe.

> I know there are no guarantees in the future of free software
> (neither of non-free software, of course), but such an
> announcement/threat seemed too risky to me.

Indeed, it seems broken now.

> Meanwhile I'd often thought that all Flex actually does is matching
> alternative regular expressions. Plain RE can do that as well, and
> by capturing subexpressions I can find out which alternative was
> matched.
> 
> Of course, it would (indeed turn out to be) somewhat slower (RE
> built at runtime vs. compile time), but like parsing, lexing speed
> is not a big issue to me. So I was ready to trade that in for
> convenience of programming and one less dependence on a problematic
> tool.
> 
> (Side node: Many years ago, on a different project, I dropped gperf
> to recognize predefined identifiers for similar reasons, and put
> them in a look-up table instead. Except for a tiny slowdown, that
> had worked out well, so I was confident I could drop Flex, too. --
> Now apparently the next one in line after dropping gperf and Flex
> should be Bison, but don't worry, I don't see an easy way to replace
> it, since Bison actually does some nontrivial stuff. :)
> 
> So I wrote a small library that builds that massive RE out of single
> rules and maps subexpressions back to rules (even in the case that
> rules contain subexpressions of their own), and that works for me.

I did that, too: I wrote some DFA/NFA code, and incidentally found the most 
efficient method make action matches via a reverse NFA lookup, cf. [1-3]. Also, 
I have made UTF-8/32 to octet character class translations. 

1. https://gcc.gnu.org/ml/libstdc++/2018-04/msg00032.html
2. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472
3. https://gcc.gnu.org/ml/libstdc++/2018-05/msg00015.html