To try this we need stringf.

Linda

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Raul Miller
Sent: Saturday, January 19, 2013 3:12 PM
To: Programming forum
Subject: [Jprogramming] implementing regular expressions in J

I was revisiting regular expressions, and I decided to do an implementation
in J.

To keep it simple, I left out:
y
1) irregular expressions (parenthesized references)
2) parsing
3) optimizations

While all of these are possible, they complicate things.

To represent a regular expression, I decided to use a bit array with three
dimensions, representing three different kinds of sets in a regular
expression recognizer:

The first dimension represents "current state"
The second dimension represents "ascii character"
The third dimension represents "next state"

This is not quite complete -- we also need to know which state(s) to start
in, so a regular expression will actually be a boxed pair, with the first
pair being a bit list with 1 for current state, and the second box will
contain the transition array.

Since a three dimensional bit array is uncomfortable to read, one of the
first things I built was a routine to display this representation in a
readable format.

disp=: <@(#&a.)"1@(1&|:)`I.@.(1=#@$)L:0

And, one of the last things I wrote was a routine to translate a string into
a regular fragment which would match that string, and this looks like this
(note that this needs to be viewed in a fixed width
font):

   disp stringf 'this'
+-+----------+
|0|++-+-+-+-+|
| |||t| | | ||
| |++-+-+-+-+|
| ||| |h| | ||
| |++-+-+-+-+|
| ||| | |i| ||
| |++-+-+-+-+|
| ||| | | |s||
| |++-+-+-+-+|
+-+----------+

Or, if I express this in the pcre syntax for regular expressions, it would
be simply: 'this'

A "regular fragment" is just like the "regular expression" data structure I
described, except that the details of the final state are not specified.

In this display the 0 in the leftmost box means that this fragment has only
state 0 for its initial state.  And in the second box we see the state
definitions (from top to bottom) and the states we transition to (from left
to right) and the character which performs that transition (in the inner
boxes).

So... how do we build this kind of thing?

First, I needed something that would build a transition for a set of
characters:

chf=: 1 0; ,:@,.~&0@e.~&a.

Here's how this works:

   disp chf 'this'
+-+-------+
|0|++----+|
| |||hist||
| |++----+|
+-+-------+

In other words, any of the characters in the argument to chf will give us a
transition from the first state to the final state of this fragment.  In the
pcre syntax for regular expressions, this would be '[this]'

Next, we need a routine to let us combine fragments sequentially.
This is a bit more complicated:

checksize=: -: _1 256 0 + (,.1 0 1) +/ .* ]
and=: *
or=: and&.-.

seqf=:4 :0
  'IX TX'=. x assert. TX checksize&$ IX
  'IY TY'=. y assert. TY checksize&$ IY
  I=. (}:IX),({:IX)and IY
  T=. ((}:"1 TX),"1({:"1 TX)and/IY),(-($TY)+0 0,<:{:$TX){. TY
  I;T
)

Note that you could use boolean definitions for 'and' and 'or' instead of
bayesian (*. and +. instead of * and *&.-.) - they are equivalent for bit
data.

Note that the assert. statements do nothing if we set a global parameter to
disable them.  But I like them there because they catch some simple mistakes

Anyways, what's happening here is that when I combine my initial states, the
final state of the first fragment becomes any of the initial states in the
second fragment.  Also, any transition to the final state of the first
fragment becomes a transition to any of the initial states of the second
fragment.

Another thing we might want to do involves alternation -- we have two
regular expression fragments and we want to have a regular expression that
matches if either of our alternatives match.  That's something like this:

orf=:4 :0
  'IX TX'=. x assert. TX checksize&$ IX
  'IY TY'=. y assert. TY checksize&$ IY
  I=. (}:IX),(}:IY),IX or&{: IY
  T=. TX, (-($TY)+0 0,<:{:$TX){. TY
  I;T
)

Here, our new start state is almost the start states of our two options
strung together.  The only thing we have to do is recognize that the final
state of the combination is present in the start state if it was in either
of our original start states.

And the states transitions themselves are used "as is" (except that we
string them together.

So now we can do something like this:

   disp (chf 'b') orf (chf 'c') seqf chf 'd'
+---+--------+
|0 1|++-+-+-+|
|   |||b| | ||
|   |++-+-+-+|
|   ||| |c| ||
|   |++-+-+-+|
|   ||| | |d||
|   |++-+-+-+|
+---+--------+

Note that we now start with two states.  We can't put our 'b' and 'c'
transitions in the same state because 'd' is valid after 'c' but not after
'b'.  This corresponds to the pcre expression 'b|cd'.

Now watch what happens if I add an 'a' in front of this last expression
(something like this pcre expression:  'a(b|cd)'):

   disp (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd'
+-+----------+
|0|++-+-+-+-+|
| |||a|a| | ||
| |++-+-+-+-+|
| ||| |b| | ||
| |++-+-+-+-+|
| ||| | |c| ||
| |++-+-+-+-+|
| ||| | | |d||
| |++-+-+-+-+|
+-+----------+

Now we only have one starting state.  But an 'a' in that starting state
gives us a transition to two different states.

Another thing we might want to do with a regular expression fragment is to
say that it can be repeated an arbitrary number of times.  This is
equivalent to saying that any transition to the end state must also be a
transition to the start states.

repf=:3 :0
  'I T'=. y assert. T checksize&$ I
  I;T+.({:"1 T) and/ I
)

   NB. pcre:  '(a(b|cd))+'
   disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd'
+-+-----------+
|0|+-+-+-+-+-+|
| || |a|a| | ||
| |+-+-+-+-+-+|
| || | |b| | ||
| |+-+-+-+-+-+|
| || | | |c| ||
| |+-+-+-+-+-+|
| ||d| | | |d||
| |+-+-+-+-+-+|
+-+-----------+

Another thing we might want to say is that a regular expression is optional.
This is equivalent to saying that the end state is an initial state:

optf=:3 :0
  'I T'=. y assert. T checksize&$ I
  (I+.1 {.~-#I);T
)

   NB. pcre:  'a?'
   disp optf chf 'a'
+---+----+
|0 1|++-+|
|   |||a||
|   |++-+|
+---+----+

Or, for example

   NB. pcre: '(a|(b|cd?))+'
   disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf optf chf 'd'
+-+-----------+
|0|+-+-+-+-+-+|
| || |a|a| | ||
| |+-+-+-+-+-+|
| || | |b| | ||
| |+-+-+-+-+-+|
| ||c| | |c|c||
| |+-+-+-+-+-+|
| ||d| | | |d||
| |+-+-+-+-+-+|
+-+-----------+

Did I leave out anything important? Quite possibly, this is getting long.

But here are a few other routines, and some tests:

NB. string to fragment
stringf=: [: > [: seqf&.>/ chf&.>

NB. turn a regular fragment into a regular expression
complete=: ({.~ (>.|.)@$)L:0

match=:4 :0
  'STATE NEXT'=. complete y
  for_CHAR. a.i.x do.
      STATE=. STATE or/ .and CHAR {"2 NEXT
  end.
  {:STATE
)

assert (disp chf'a')-:,L:0]0;<,:'';'a'
assert (disp 'a' seqf&chf 'b')-:,L:0]0;<('';'a'),:'';'';'b'
assert (disp 'a' orf&chf 'bc')-:,L:0]0 1;<('';'a'),:'';'';'bc'
assert (disp optf chf 'a')-:,L:0]0 1;<,:'';'a'
assert (disp repf chf 'a')-:,L:0]0;<,:'a';'a'
assert 'abc' match stringf 'abc'
assert 'aaab' match (repf chf 'a') seqf chf 'b'
assert 'aaab' match (repf optf chf 'a') seqf chf 'b'
assert 'b' match (repf optf chf 'a') seqf chf 'b'
assert -.'aaabb' match (repf optf chf 'a') seqf chf 'b'

I should also note that pcre has an irregular "reference" feature associated
with its parenthesis.  It also offers a more complex syntax that provides
groups that cannot be referenced, but I thought it would be simpler to just
ignore the irregularities.

Also, note that there are a variety of optimizations that could be
implemented, which I did not implement.  For example

   NB. pcre 'a|b'
   disp 'a' orf&chf 'b'
   NB. pcre '[ab]'
   disp chf 'ab'

These two fragments match exactly the same way, but the data structures are
different.  We could build an code which translates the longer form to the
shorter form.

Another kind of optimization involves a scanner that instead of scanning
sequentially scans every nth character to see if the match would be even
possible.  This could be difficult for the general case, but for a simple
string match it's very easy to implement.

It's also perhaps worth implementing reference-able sub-matches.  But that
makes the matching algorithm considerably more complicated.  The pcre code
handles this by using an algorithm with exponential (slow) performance in
some cases.

Note that I did not show any example like 'a*' in pcre.  Here are two ways
of doing that:

   disp optf repf chf 'a'
   disp repf optf chf 'a'

We could even define:

kleenestare=: optf@repf

and this should always produce the same result as repf@optf

Since this implementation is very simple, it should be not-too-difficult to
implement a parser or to implement a more concise set of operations for
building regular expressions from characters.
But that's enough, for me, for now.

(And, hopefully I've not implemented any really badly erroneous verbs here.
But when building this code I encountered cases where routines that worked
on simple data structures failed on longer structures.
Usually, simplifying the code fixed these problems.)

FYI,

--
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to