I see. How about the following (sketched) solution to avoid looping over
all characters? It might be very well the case that you already considered
(and dismissed) that path.
A) Assumption
In order to allow any meaningful matching, the input to the scanner is
normalized according to the unicode spec.
B) Abstraction
Treat each character and character group of an regex as a set of intervals
in the unicode code points. Lets call them "character tests" and lift the
common set operations union, intersection and difference to them.
C) Construct NFA
NFA has potentially overlapping character tests at the transitions of each
state.
D) Construct DFA
Given a product state s in the DFA and two transitions t1, t2 from the
original NFA, add three new transitions to the DFA:
- a transition labeled with the character test of t1 minus the character
test of t2
- a transition labeled with the intersection of the character tests of t1
and t2
- a transition labeled with the character test of t2 minus the character
test of t1
E) Extension
Instead of sets of unicode intervals we could also use test-functions,
e.g. blocks. Then, in step D), the set operations are translated to
boolean operations:
- difference t1 - t2 becomes: t1 && not t2
- intersection of t1 and t2 becomes: t1 && t2
This would allow to use optimized test functions, e.g., bdds, instead of
relying on charcter tests only.
Cheers,
Steffen
Am .11.2017, 23:16 Uhr, schrieb Thierry Goubier
<thierry.goub...@gmail.com>:
Le 07/11/2017 à 23:00, Steffen Märcker a écrit :
I am not familiar with the literature on scanners. May I ask you about
some details on the "for all characters" algorithms you are referring
to?
The two main ones available, from the typical Aho/Ullman textbook, are:
- NFA to DFA conversion (i.e. build a NFA with your regular expressions,
then convert it to a DFA)
- Direct regular expression to DFA construction
Both of them have loop of the type:
for (each input symbol a) {
...
Building a (or connecting to) a BDD library would be fun, indeed. But
within that time frame it seems not realistic. Anyway, after finishing
my thesis, I'd like to come back to that idea.
It would certainly be interesting. Please contact us again when you'll
have time :)
Regards,
Thierry
Ciao, Steffen
Am 7. November 2017 16:33:03 MEZ schrieb Andrew Glynn
<aglyn...@gmail.com>:
A possible way to accomplish it would be to use an object graph
with
an incremental query engine, such as EMF/CDO with Viatra or
something similar. You could then put different character sets in
connected objects and query only as far as you need to.
Andrew Glynn
Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
Windows 10
*From: *Thierry Goubier <mailto:thierry.goub...@gmail.com>
*Sent: *Tuesday, November 7, 2017 7:17 AM
*To: *Any question about pharo is welcome
<mailto:pharo-users@lists.pharo.org>
*Subject: *Re: [Pharo-users] Binary Decision Diagram Package in
Smalltalk
Hi Andrew, Steffen,
2017-11-07 13:10 GMT+01:00 Prof. Andrew P. Black <bl...@cs.pdx.edu
<mailto:bl...@cs.pdx.edu>>:
> On 28 Oct 2017, at 17:37 , Steffen Märcker <merk...@web.de
<mailto:merk...@web.de>> wrote:
>
> Does that mean the sets/bdd would be constructed mainly at
comile time? Anyway, Andrew, feel free to contact me, I might
help you with this.
>
Thanks for the offer, Steffen! The problem is that I need to
use SmaCC for my current project, and really do not have a month
to take off and re-design the way that it builds its scanner.
I’ve talked to Thierry Goubier about, and he doesn’t have time
either! It would be a fun project, though, and it ought to be
fairly separate from other parts of SmaCC. I’ve spent a fair
bit of time thinking about how to do it, but don’t think that I
will be able to actually focus on it.
Yes, this is the essence of the issue. There are a few alternatives
about it, but none we have the time to pursue.
An alternative approach, which Thierry has suggested, is to
make
SmaCC work on the UTF-8 representation of the Unicode. Then we
could represent character sets as prefix trees. But the core
problem would still exist: you can’t run an algorithm that
repeatedly executes
for all characters in the alphabet do:
when there are 2^21 characters in the alphabet!
The main issue is that `for all characters`... All the literature
on
scanner building uses 'for all characters do'.
Thierry
Andrew