Regular Patterns (RE: [Haskell] regular expression syntax)

Niklas Broberg Fri, 27 Feb 2004 06:04:52 -0800

Subject: [Haskell] regular expression syntax - perl ain't got nothin on haskell

Agreed, Perl certainly ain't got nothing on Haskell, but we could go even further than just imitating (although better than the original) Perl functionality. =) Hence:

As a spin-off of another project [1], we found ourselves in need of a nice way of expressing regular expression matching in Haskell, and not only for strings. We came up with the idea of HaRP: Haskell Regular Patterns. We have implemented this as a pre-processor to ordinary Haskell and it works quite well. However it is something we are still working on, and if it hadn't been for this discussion on the list we would probably have waited a few weeks more before presenting it, but here goes:

Simple pattern matching on concrete, fully specified lists can be done in Haskell as so: foo [Foo, Bar, Baz] = ...

We propose an extension of this to regular pattern matching, example:
foo [/ Foo Bar* Baz /] = ...

The intuition of the above is that we get a match for any list that starts with a Foo, ends with a Baz, and has zero or more Bar in between.

Regular patterns that can be used are: * - match zero or more + - match one or more ? - match zero or one a | b - match a or b (/ a b /) - match the sequence of a, then b (this is also implicit in the top level [/ ... /]).

Applying the regular expressions on patterns gives some extra nice features. One is that the regular patterns are "type safe", i.e. they are not encoded in strings. Another is that identifiers can be named and bound inside regular patterns, examples: foo [/ _* a /] = ... => a is bound to the last element of the list foo [/ a@(/ _ _ /) _* /] = ... => a is bound to the list containing the first two elements foo [/ (/ a _ /)* /] = ... => a is bound to the list of the first, third, fifth etc elements of a list of even length

Note that since we introduce bindings of identifiers inside what we call numerating patterns, i.e. patterns that imply an uncertain number of matches, identifiers are automatically bound to lists inside such patterns. We also introduce an explicit list-binding operator, @:, for accumulating items in lists as so: foo [/ (_ a@:(/ _ _ /))* /] = ... => a is bound to a list of lists (exactly what the elements will be is left as an exercise to the reader ;)

The types of sub-patterns are as follows (a :: a, b :: b): a* => [a] a+ => [a] a? => Maybe a (/ ... /) => [e], where e is the type of the elements in the list matched, regardless of sub-patterns ( a | b ) => Either a b

As more complete example using all the presented features:

foo [/ _ [EMAIL PROTECTED] b [EMAIL PROTECTED] 4+ [EMAIL PROTECTED] e@(/ f@:6 g /)* h@( 8 | (/ 9 i /) ) /] = (a,b,c,d,e,f,g,h,i)

Assuming all the numerical literals are of type Int, foo will have the following type:

foo :: [Int] -> (Int, Int, [Int], Maybe Int, [[Int]], [Int], [Int], Either Int [Int], [Int])

Examples of applying foo to some lists:

?> foo [0,1,2,3,4,5,6,7,8]
(1, 2, [3], Just 5, [[6,7]], [6], [7], Left 8, [])

?> foo [0,1,2,3,3,3,4,6,0,6,1,6,2,9,10] (1,2,[3,3,3], Nothing, [[6,0],[6,1],[6,2]], [6,6,6], [0,1,2], Right [9,10], [10])

Discussion of each variable in detail: a :: Int - a binds to a single element at top level (top level meaning it is bound outside any numerating pattern). b :: Int - b binds to a single element at top level. c :: [Int] - c is bound to a zero-or-many pattern, and it will contain all the matches of the sub-pattern, in this case all matches of 3. d :: Maybe Int - d is bound to a zero-or-one pattern, and it will be Nothing in case of zero matches, and Just the match to the sub-pattern in case of a match, in this case 5. e :: [[Int]] - e is bound to a zero-or-more pattern, and will thus contain a list of all the matches of the sub-pattern. In this case the sub-pattern is a sequence, which has a list type, so the type of e is a list of lists. f :: [Int] - f is bound using the list-binding operator @:, so its type will always be a list of the type of the sub-pattern, regardless of the context it appears in. It will contain all matches of the sub-pattern (Note that a normal bind using @ would have been illegal here). At top level (and in ordinary pattern matching), the pattern foo is equivalent to [EMAIL PROTECTED], but inside numerating patterns the pattern foo is equivalent to foo@:_. (see discussion below) g :: [Int] - g is equivalent to g@:_ as mentioned above, so the same will hold for g as for f. h :: Either Int [Int] - h is bound to a choice pattern (or union pattern if you prefer), so it will be bound to the match of one of the two sub-patterns, annotated with Left or Right. In this case the left sub-pattern matches a single element of type Int, whereas the right sub-pattern matches a sequence of type [Int]. i :: [Int] - Since the choice pattern is numerating (each of the sub-patterns are matched zero or one times), i is equivalent to i@:_.

For completeness, another example to show how sequences work:
bar :: [Int] -> [Int]
bar [/ 0 a@(/ 1* 2 (3|4) (/ 5 6 /) 7? /) /] = a

In this case a will have the type [Int], since a sequence will always have the type [e] where e is the type of the elements of the list to match. So in this example,

?> bar [0,2,3,5,6]
[2,3,5,6]
?> bar [0,1,1,1,2,4,5,6,7]
[1,1,1,2,4,5,6,7]

Hopefully that's enough examples, it should be fairly clear how it all works. =)

Regarding @ vs @:, it would be fully possible to implement this just using @ and change its behavior depending on the context it appears in, much like we do with identifiers bound without using the explicit @ operator. We feel that doing so could lead to (even more) confusion regarding how variables are bound, and have therefore chosen to introduce the extra @: operator to make this differing behavior explicit. That identifiers bound without a @ or @: have differing semantics depending on context is unfortunate but unavoidable, and we feel that the added confusion is minor in this case.

As we said at the beginning we have implemented this as a preprocessor to ordinary Haskell, but it still needs some more alpha-testing before we release it... we need to catch the big fish first.

There are a number of open issues still, one is the choice of delimeter token. We have chosen to delimit patterns with just a whitespace, but using a comma might be better. The motivation for the current choice is that [/ a, b*, c /] lends the feeling that each pattern is an element of sorts, when in fact each pattern may represent any number of elements. Any input on this?

Another open issue is greedy vs. non-greedy matching of * and +. The current implementation is greedy, but some voices were raised earlier on this list that non-greedy matching would be better as default. We envision a system where you can choose behavior by adding "flags" to the pattern, i.e. [g/ .... /] might indicate a greedy match etc.

Strings are a special syntactic case of a list, and we are planning a special case of regular patterns for it, for instance [s/ "Hello " a* /] would be equal to [/ 'H' 'e' 'l' 'l' 'o' ' ' a* /], but this is not yet implemented.

Of course, unfortunately, there is also a downside. Our preprocessor is implemented by translating regular patterns into code in a parser monad. Due to the way that matching is done, matching will always be "strict" in the sense that if you apply the function

foo [/ a* /] = a

to an infinite list, the call will loop forever. At this point we see no reasonable way around this, there may be some special cases we can catch but we see no overall solution to the problem.

Another drawback, at least in our implementation, is efficiency. Using an external C engine for matching regular expression would far outperform our system, no matter how efficient we make our backend. Thus this is not intended to replace any existing functionality, rather we envision it as a very high-level interface to matching regular expressions in the spirit of String being [Char], where Text.Regex can be compared to using PackedString for critical programs.

Perhaps by integrating regular patterns directly into a compiler like ghc we could use ugly low-level behind-the-scenes tricks for doing the actual matching in an external engine, thus greatly increasing efficiency. Any thoughts? =)

We will supply the preprocessor for public scrutiny some time next week, we just need to do some more alpha-testing first... Comment away!

regards,

Niklas Broberg, d00nibro[at]dtek.chalmers.se
Andreas Farre, d00farre[at]dtek.chalmers.se
Chalmers University of Technology

[1] Feel free to also visit the project that lead to this spin-off, Haskell Server Pages http://www.dtek.chalmers.se/~d00nibro/hsp/

_________________________________________________________________ Take off on a romantic weekend or a family adventure to these great U.S. locations. http://special.msn.com/local/hotdestinations.armx

_______________________________________________
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell

Regular Patterns (RE: [Haskell] regular expression syntax)

Reply via email to