I see. How about the following (sketched) solution to avoid looping over all characters? It might very well be the case that you have already considered (and dismissed) this path.

A) Assumption
In order to allow any meaningful matching, the input to the scanner is normalized according to the Unicode spec.

B) Abstraction
Treat each character and character group of a regex as a set of intervals over the Unicode code points. Let's call them "character tests" and lift the common set operations union, intersection, and difference to them.
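To make B) concrete, here is a minimal sketch (not from any existing SmaCC code; all names are illustrative) of character tests as sorted lists of disjoint, inclusive code-point intervals, with the three set operations lifted to them:

```python
def normalize(intervals):
    """Sort and merge overlapping or adjacent inclusive intervals."""
    out = []
    for lo, hi in sorted(intervals):
        if out and lo <= out[-1][1] + 1:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out

def union(a, b):
    return normalize(a + b)

def intersection(a, b):
    """Assumes both inputs are normalized."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def difference(a, b):
    """a minus b, computed as a ∩ complement(b) over the Unicode range."""
    MAX_CP = 0x10FFFF
    comp, prev = [], 0
    for lo, hi in normalize(b):
        if prev <= lo - 1:
            comp.append((prev, lo - 1))
        prev = hi + 1
    if prev <= MAX_CP:
        comp.append((prev, MAX_CP))
    return intersection(normalize(a), comp)
```

For example, difference([(65, 90)], [(80, 100)]) yields [(65, 79)] — all operations touch a handful of interval boundaries rather than 2^21 characters.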

C) Construct NFA
The NFA potentially has overlapping character tests on the transitions of each state.

D) Construct DFA
Given a product state s in the DFA and two transitions t1, t2 from the original NFA, add three new transitions to the DFA:
- a transition labeled with the character test of t1 minus the character test of t2
- a transition labeled with the intersection of the character tests of t1 and t2
- a transition labeled with the character test of t2 minus the character test of t1
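The splitting in step D) can be sketched as follows (illustrative only; plain Python sets of code points stand in for the interval-based character tests, just to keep the example self-contained):

```python
def split_transitions(t1, t2):
    """Given the character tests of two overlapping NFA transitions,
    return the up to three pairwise-disjoint character tests that
    label the corresponding DFA transitions. Empty tests are dropped."""
    parts = [t1 - t2, t1 & t2, t2 - t1]
    return [p for p in parts if p]
```

E.g. for t1 = {A, B, C} and t2 = {B, C, D} this yields the disjoint tests {A}, {B, C}, {D}; with interval-based tests the same three results fall out of the difference/intersection operations from step B).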

E) Extension
Instead of sets of Unicode intervals, we could also use test functions, e.g. for Unicode blocks. Then, in step D), the set operations translate to boolean operations:
- difference t1 - t2 becomes: t1 && not t2
- intersection of t1 and t2 becomes: t1 && t2
This would allow us to use optimized test functions, e.g. BDDs, instead of relying on character tests only.
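A tiny sketch of step E) (again illustrative, not any particular library's API): character tests become predicates over code points, and the set operations of step D) become boolean combinators over those predicates:

```python
import unicodedata

def is_letter(cp):
    # Unicode general category L* = letters
    return unicodedata.category(chr(cp)).startswith("L")

def is_ascii(cp):
    return cp < 128

def intersect(t1, t2):
    return lambda cp: t1(cp) and t2(cp)       # t1 ∩ t2

def minus(t1, t2):
    return lambda cp: t1(cp) and not t2(cp)   # t1 - t2

# e.g. the test for "letter outside ASCII":
non_ascii_letter = minus(is_letter, is_ascii)
```

A BDD-backed implementation would replace the lambdas with BDD nodes, so equal tests get shared and emptiness checks become cheap.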


Cheers,
Steffen




On .11.2017 at 23:16, Thierry Goubier <thierry.goub...@gmail.com> wrote:

On 07/11/2017 at 23:00, Steffen Märcker wrote:
I am not familiar with the literature on scanners. May I ask you about some details on the "for all characters" algorithms you are referring to?

The two main ones available, from the typical Aho/Ullman textbook, are:

- NFA to DFA conversion (i.e. build an NFA with your regular expressions, then convert it to a DFA)

- Direct regular expression to DFA construction

Both of them have a loop of the type:

for (each input symbol a) {
...

Building (or connecting to) a BDD library would be fun, indeed. But within that time frame it does not seem realistic. Anyway, after finishing my thesis, I'd like to come back to that idea.

It would certainly be interesting. Please contact us again when you have time :)

Regards,

Thierry

Ciao, Steffen
On 7 November 2017 at 16:33:03 CET, Andrew Glynn <aglyn...@gmail.com> wrote: A possible way to accomplish it would be to use an object graph with
    an incremental query engine, such as EMF/CDO with Viatra or
    something similar.  You could then put different character sets in
    connected objects and query only as far as you need to.
     Andrew Glynn
     *From: *Thierry Goubier <mailto:thierry.goub...@gmail.com>
    *Sent: *Tuesday, November 7, 2017 7:17 AM
    *To: *Any question about pharo is welcome
    <mailto:pharo-users@lists.pharo.org>
    *Subject: *Re: [Pharo-users] Binary Decision Diagram Package in
    Smalltalk
     Hi Andrew, Steffen,
     2017-11-07 13:10 GMT+01:00 Prof. Andrew P. Black <bl...@cs.pdx.edu
    <mailto:bl...@cs.pdx.edu>>:
           > On 28 Oct 2017, at 17:37 , Steffen Märcker <merk...@web.de
        <mailto:merk...@web.de>> wrote:
         >
         > Does that mean the sets/bdd would be constructed mainly at
        comile time? Anyway, Andrew, feel free to contact me, I might
        help you with this.
         >
         Thanks for the offer, Steffen!  The problem is that I need to
        use SmaCC for my current project, and really do not have a month
to take off and re-design the way that it builds its scanner. I’ve talked to Thierry Goubier about it, and he doesn’t have time
        either!  It would be a fun project, though, and it ought to be
        fairly separate from other parts of SmaCC.  I’ve spent a fair
        bit of time thinking about how to do it, but don’t think that I
        will be able to actually focus on it.
     Yes, this is the essence of the issue. There are a few alternatives
    about it, but none we have the time to pursue.
An alternative approach, which Thierry has suggested, is to make
        SmaCC work on the UTF-8 representation of the Unicode.  Then we
        could represent character sets as prefix trees.  But the core
        problem would still exist: you can’t run an algorithm that
        repeatedly executes
                          for all characters in the alphabet do:
         when there are 2^21 characters in the alphabet!
The main issue is that `for all characters`... All the literature on
    scanner building uses 'for all characters do'.
     Thierry
                   Andrew


