Re: [Pharo-users] Binary Decision Diagram Package in Smalltalk

Steffen Märcker Wed, 08 Nov 2017 05:23:45 -0800

I see. How about the following (sketched) solution to avoid looping overall characters? It might be very well the case that you already considered(and dismissed) that path.


A) Assumption

In order to allow any meaningful matching, the input to the scanner isnormalized according to the unicode spec.


B) Abstraction

Treat each character and character group of an regex as a set of intervalsin the unicode code points. Lets call them "character tests" and lift thecommon set operations union, intersection and difference to them.


C) Construct NFA

NFA has potentially overlapping character tests at the transitions of eachstate.


D) Construct DFA

Given a product state s in the DFA and two transitions t1, t2 from theoriginal NFA, add three new transitions to the DFA:- a transition labeled with the character test of t1 minus the charactertest of t2- a transition labeled with the intersection of the character tests of t1and t2- a transition labeled with the character test of t2 minus the charactertest of t1


E) Extension

Instead of sets of unicode intervals we could also use test-functions,e.g. blocks. Then, in step D), the set operations are translated toboolean operations:

- difference t1 - t2 becomes: t1 && not t2
- intersection of t1 and t2 becomes: t1 && t2

This would allow to use optimized test functions, e.g., bdds, instead ofrelying on charcter tests only.



Cheers,
Steffen

Am .11.2017, 23:16 Uhr, schrieb Thierry Goubier<thierry.goub...@gmail.com>:

Le 07/11/2017 à 23:00, Steffen Märcker a écrit :

I am not familiar with the literature on scanners. May I ask you aboutsome details on the "for all characters" algorithms you are referringto?


The two main ones available, from the typical Aho/Ullman textbook, are:

- NFA to DFA conversion (i.e. build a NFA with your regular expressions,then convert it to a DFA)


- Direct regular expression to DFA construction

Both of them have loop of the type:

for (each input symbol a) {
...

Building a (or connecting to) a BDD library would be fun, indeed. Butwithin that time frame it seems not realistic. Anyway, after finishingmy thesis, I'd like to come back to that idea.

It would certainly be interesting. Please contact us again when you'llhave time :)


Regards,

Thierry

Ciao, Steffen

Am 7. November 2017 16:33:03 MEZ schrieb Andrew Glynn<aglyn...@gmail.com>:A possible way to accomplish it would be to use an object graphwith

    an incremental query engine, such as EMF/CDO with Viatra or
    something similar.  You could then put different character sets in
    connected objects and query only as far as you need to.
     Andrew Glynn
     Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
    Windows 10
     *From: *Thierry Goubier <mailto:thierry.goub...@gmail.com>
    *Sent: *Tuesday, November 7, 2017 7:17 AM
    *To: *Any question about pharo is welcome
    <mailto:pharo-users@lists.pharo.org>
    *Subject: *Re: [Pharo-users] Binary Decision Diagram Package in
    Smalltalk
     Hi Andrew, Steffen,
     2017-11-07 13:10 GMT+01:00 Prof. Andrew P. Black <bl...@cs.pdx.edu
    <mailto:bl...@cs.pdx.edu>>:
           > On 28 Oct 2017, at 17:37 , Steffen Märcker <merk...@web.de
        <mailto:merk...@web.de>> wrote:
         >
         > Does that mean the sets/bdd would be constructed mainly at
        comile time? Anyway, Andrew, feel free to contact me, I might
        help you with this.
         >
         Thanks for the offer, Steffen!  The problem is that I need to
        use SmaCC for my current project, and really do not have a month

to take off and re-design the way that it builds its scanner.I’ve talked to Thierry Goubier about, and he doesn’t have time

        either!  It would be a fun project, though, and it ought to be
        fairly separate from other parts of SmaCC.  I’ve spent a fair
        bit of time thinking about how to do it, but don’t think that I
        will be able to actually focus on it.
     Yes, this is the essence of the issue. There are a few alternatives
    about it, but none we have the time to pursue.

An alternative approach, which Thierry has suggested, is tomake

        SmaCC work on the UTF-8 representation of the Unicode.  Then we
        could represent character sets as prefix trees.  But the core
        problem would still exist: you can’t run an algorithm that
        repeatedly executes
                          for all characters in the alphabet do:
         when there are 2^21 characters in the alphabet!

The main issue is that `for all characters`... All the literatureon

    scanner building uses 'for all characters do'.
     Thierry
                   Andrew

Re: [Pharo-users] Binary Decision Diagram Package in Smalltalk

Reply via email to