To try this we need stringf. Linda
-----Original Message----- From: [email protected] [mailto:[email protected]] On Behalf Of Raul Miller Sent: Saturday, January 19, 2013 3:12 PM To: Programming forum Subject: [Jprogramming] implementing regular expressions in J I was revisiting regular expressions, and I decided to do an implementation in J. To keep it simple, I left out: y 1) irregular expressions (parenthesized references) 2) parsing 3) optimizations While all of these are possible, they complicate things. To represent a regular expression, I decided to use a bit array with three dimensions, representing three different kinds of sets in a regular expression recognizer: The first dimension represents "current state" The second dimension represents "ascii character" The third dimension represents "next state" This is not quite complete -- we also need to know which state(s) to start in, so a regular expression will actually be a boxed pair, with the first pair being a bit list with 1 for current state, and the second box will contain the transition array. Since a three dimensional bit array is uncomfortable to read, one of the first things I built was a routine to display this representation in a readable format. disp=: <@(#&a.)"1@(1&|:)`I.@.(1=#@$)L:0 And, one of the last things I wrote was a routine to translate a string into a regular fragment which would match that string, and this looks like this (note that this needs to be viewed in a fixed width font): disp stringf 'this' +-+----------+ |0|++-+-+-+-+| | |||t| | | || | |++-+-+-+-+| | ||| |h| | || | |++-+-+-+-+| | ||| | |i| || | |++-+-+-+-+| | ||| | | |s|| | |++-+-+-+-+| +-+----------+ Or, if I express this in the pcre syntax for regular expressions, it would be simply: 'this' A "regular fragment" is just like the "regular expression" data structure I described, except that the details of the final state are not specified. In this display the 0 in the leftmost box means that this fragment has only state 0 for its initial state. And in the second box we see the state definitions (from top to bottom) and the states we transition to (from left to right) and the character which performs that transition (in the inner boxes). So... how do we build this kind of thing? First, I needed something that would build a transition for a set of characters: chf=: 1 0; ,:@,.~&0@e.~&a. Here's how this works: disp chf 'this' +-+-------+ |0|++----+| | |||hist|| | |++----+| +-+-------+ In other words, any of the characters in the argument to chf will give us a transition from the first state to the final state of this fragment. In the pcre syntax for regular expressions, this would be '[this]' Next, we need a routine to let us combine fragments sequentially. This is a bit more complicated: checksize=: -: _1 256 0 + (,.1 0 1) +/ .* ] and=: * or=: and&.-. seqf=:4 :0 'IX TX'=. x assert. TX checksize&$ IX 'IY TY'=. y assert. TY checksize&$ IY I=. (}:IX),({:IX)and IY T=. ((}:"1 TX),"1({:"1 TX)and/IY),(-($TY)+0 0,<:{:$TX){. TY I;T ) Note that you could use boolean definitions for 'and' and 'or' instead of bayesian (*. and +. instead of * and *&.-.) - they are equivalent for bit data. Note that the assert. statements do nothing if we set a global parameter to disable them. But I like them there because they catch some simple mistakes Anyways, what's happening here is that when I combine my initial states, the final state of the first fragment becomes any of the initial states in the second fragment. Also, any transition to the final state of the first fragment becomes a transition to any of the initial states of the second fragment. Another thing we might want to do involves alternation -- we have two regular expression fragments and we want to have a regular expression that matches if either of our alternatives match. That's something like this: orf=:4 :0 'IX TX'=. x assert. TX checksize&$ IX 'IY TY'=. y assert. TY checksize&$ IY I=. (}:IX),(}:IY),IX or&{: IY T=. TX, (-($TY)+0 0,<:{:$TX){. TY I;T ) Here, our new start state is almost the start states of our two options strung together. The only thing we have to do is recognize that the final state of the combination is present in the start state if it was in either of our original start states. And the states transitions themselves are used "as is" (except that we string them together. So now we can do something like this: disp (chf 'b') orf (chf 'c') seqf chf 'd' +---+--------+ |0 1|++-+-+-+| | |||b| | || | |++-+-+-+| | ||| |c| || | |++-+-+-+| | ||| | |d|| | |++-+-+-+| +---+--------+ Note that we now start with two states. We can't put our 'b' and 'c' transitions in the same state because 'd' is valid after 'c' but not after 'b'. This corresponds to the pcre expression 'b|cd'. Now watch what happens if I add an 'a' in front of this last expression (something like this pcre expression: 'a(b|cd)'): disp (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd' +-+----------+ |0|++-+-+-+-+| | |||a|a| | || | |++-+-+-+-+| | ||| |b| | || | |++-+-+-+-+| | ||| | |c| || | |++-+-+-+-+| | ||| | | |d|| | |++-+-+-+-+| +-+----------+ Now we only have one starting state. But an 'a' in that starting state gives us a transition to two different states. Another thing we might want to do with a regular expression fragment is to say that it can be repeated an arbitrary number of times. This is equivalent to saying that any transition to the end state must also be a transition to the start states. repf=:3 :0 'I T'=. y assert. T checksize&$ I I;T+.({:"1 T) and/ I ) NB. pcre: '(a(b|cd))+' disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd' +-+-----------+ |0|+-+-+-+-+-+| | || |a|a| | || | |+-+-+-+-+-+| | || | |b| | || | |+-+-+-+-+-+| | || | | |c| || | |+-+-+-+-+-+| | ||d| | | |d|| | |+-+-+-+-+-+| +-+-----------+ Another thing we might want to say is that a regular expression is optional. This is equivalent to saying that the end state is an initial state: optf=:3 :0 'I T'=. y assert. T checksize&$ I (I+.1 {.~-#I);T ) NB. pcre: 'a?' disp optf chf 'a' +---+----+ |0 1|++-+| | |||a|| | |++-+| +---+----+ Or, for example NB. pcre: '(a|(b|cd?))+' disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf optf chf 'd' +-+-----------+ |0|+-+-+-+-+-+| | || |a|a| | || | |+-+-+-+-+-+| | || | |b| | || | |+-+-+-+-+-+| | ||c| | |c|c|| | |+-+-+-+-+-+| | ||d| | | |d|| | |+-+-+-+-+-+| +-+-----------+ Did I leave out anything important? Quite possibly, this is getting long. But here are a few other routines, and some tests: NB. string to fragment stringf=: [: > [: seqf&.>/ chf&.> NB. turn a regular fragment into a regular expression complete=: ({.~ (>.|.)@$)L:0 match=:4 :0 'STATE NEXT'=. complete y for_CHAR. a.i.x do. STATE=. STATE or/ .and CHAR {"2 NEXT end. {:STATE ) assert (disp chf'a')-:,L:0]0;<,:'';'a' assert (disp 'a' seqf&chf 'b')-:,L:0]0;<('';'a'),:'';'';'b' assert (disp 'a' orf&chf 'bc')-:,L:0]0 1;<('';'a'),:'';'';'bc' assert (disp optf chf 'a')-:,L:0]0 1;<,:'';'a' assert (disp repf chf 'a')-:,L:0]0;<,:'a';'a' assert 'abc' match stringf 'abc' assert 'aaab' match (repf chf 'a') seqf chf 'b' assert 'aaab' match (repf optf chf 'a') seqf chf 'b' assert 'b' match (repf optf chf 'a') seqf chf 'b' assert -.'aaabb' match (repf optf chf 'a') seqf chf 'b' I should also note that pcre has an irregular "reference" feature associated with its parenthesis. It also offers a more complex syntax that provides groups that cannot be referenced, but I thought it would be simpler to just ignore the irregularities. Also, note that there are a variety of optimizations that could be implemented, which I did not implement. For example NB. pcre 'a|b' disp 'a' orf&chf 'b' NB. pcre '[ab]' disp chf 'ab' These two fragments match exactly the same way, but the data structures are different. We could build an code which translates the longer form to the shorter form. Another kind of optimization involves a scanner that instead of scanning sequentially scans every nth character to see if the match would be even possible. This could be difficult for the general case, but for a simple string match it's very easy to implement. It's also perhaps worth implementing reference-able sub-matches. But that makes the matching algorithm considerably more complicated. The pcre code handles this by using an algorithm with exponential (slow) performance in some cases. Note that I did not show any example like 'a*' in pcre. Here are two ways of doing that: disp optf repf chf 'a' disp repf optf chf 'a' We could even define: kleenestare=: optf@repf and this should always produce the same result as repf@optf Since this implementation is very simple, it should be not-too-difficult to implement a parser or to implement a more concise set of operations for building regular expressions from characters. But that's enough, for me, for now. (And, hopefully I've not implemented any really badly erroneous verbs here. But when building this code I encountered cases where routines that worked on simple data structures failed on longer structures. Usually, simplifying the code fixed these problems.) FYI, -- Raul ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
