Re: [Haskell-cafe] Efficient parallel regular expressions
Mitchell, Neil wrote: Hi Martijn, > It's not that tricky if you do a regular expression state machine > yourself, but that's probably a bit too much work. One way to get > a speed up might be to take the regular expressions a,b,c,d and > generate a regex a+b+c+d, and one a+b. You can then check any string > s against a+b+c+d, if that matches check a+b, if that matches check > a. At each stage you eliminate half the regular expressions, which > means a match will take log n, where n is the number of regular > expressions. > > This assumes the underlying regular expression engine constructs a > finite state machine, making it O(m) where m is the length of the > string to match. If you're implementing the machine yourself, you can implement it as trie automaton which has some "value" associated with each final state. That is, rather than just accepting or rejecting, when the automaton accepts it returns the value associated with the particular final state that it accepted by. This is a trivial extension on DFA/NFA implementations and is perfectly suited to the problem of combining multiple linear tries (sic: regexes) into a single machine. The minimization algorithms are a bit more complex than for DFA/NFAs, but still fairly straightforward if you're only doing prefix merging. And this gets rid of the O(log n) factor. -- Live well, ~wren ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Efficient parallel regular expressions
On Wed, Nov 5, 2008 at 1:56 PM, Martijn van Steenbergen <[EMAIL PROTECTED]> wrote: > I think I'll try roger's (private) eek, bitten by "reply to sender only" again! i had intended to send to the list too. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Efficient parallel regular expressions
On Tue, Nov 4, 2008 at 6:44 PM, i wrote: > i'm sorry if this is obviously wrong (i haven't used Text.Regex), but > can't you do this with submatches? rights or wrongs of regexps aside, i just checked that the above approach *is* feasible with Text.Regex here's some code: >module Multimatch(multimatch) where > import Text.Regex > import qualified Data.List as DL > import qualified Data.Maybe as DM > > brcount :: String -> Int > brcount ('\\' : _ : s) = > brcount s > brcount ('(' : s) = > 1 + brcount s > brcount (_ : s) = > brcount s > brcount [] = > 0 > > -- given a list of strings representing regular expressions, > -- each associated with a tag, match against a string > -- and return the match, along with the associated tag, > -- or Nothing if there's no match. > multimatch :: [(tag, String)] -> (String -> Maybe (tag, String)) > multimatch rs = > let re = mkRegex $ DL.intercalate "|" $ map ((\s -> "(x(" ++ s > ++ "))") . snd) rs in > let tags = submatches rs in > (\ s -> > do > ms <- matchRegex re ("x" ++ s) > (tag, m) <- DL.find (\(_, m) -> not (null m)) > (zip tags ms) > return (DM.fromJust tag, tail m)) > submatches [] = > [] > submatches ((tag, r) : rs) = > (Just tag : take (brcount r + 1) (repeat Nothing)) ++ > submatches rs i'm sure there's a more compact implementation in there somewhere: i'm just a haskell newbie. cheers, rog. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Efficient parallel regular expressions
Hello everyone, Thank you all for your comments! Those are some very useful ideas. I think I'll try roger's (private) and ChrisK's suggestion first: using the match groups. I'm not sure if the match groups inside the individual regexes will cause much trouble, but we'll see. I imagine I'll have to count parentheses, except when it's followed by a \, except when that \ follows another \, etc. There's probably other situations where a () doesn't count as a group, perhaps when it's followed by a * or +. I'll look into that. If that doesn't work out I'll go for Neil's (from an algorithmic POV beautiful) suggestion. While I understand that some of you suggest I use parsec (or some other mature parser library) I'm pretty sure that's not what I want here. The patterns will almost always be very simple and regular expressions offer an extremely concise way of expressing when a hook should fire. Forcing the user to use full parsers would cause the programs to become much more verbose. Still, Yogurt is flexible enough to allow the user to use parsec if he or she so chooses. Thanks again, Martijn. Mitchell, Neil wrote: Hi Martijn, It's not that tricky if you do a regular expression state machine yourself, but that's probably a bit too much work. One way to get a speed up might be to take the regular expressions a,b,c,d and generate a regex a+b+c+d, and one a+b. You can then check any string s against a+b+c+d, if that matches check a+b, if that matches check a. At each stage you eliminate half the regular expressions, which means a match will take log n, where n is the number of regular expressions. This assumes the underlying regular expression engine constructs a finite state machine, making it O(m) where m is the length of the string to match. Thanks Neil ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Efficient parallel regular expressions
Hi Martijn, If you are brave to start implementing DFA with all required optimisations then you might want to look at: http://www.ontotext.com/gate/japec.html This is a compiler for language called JAPE. In the language you define a set of rules where the right hand side is a regular expression and the left hand side is a Java code. The compiler itself is implemented in Haskell. It includes code to build DFA from the set of regexps and then it does determinization and minimization. I wrote the compiler few years ago. You can decide to take and change the code or to reimplement it yourself. Definitely DFA guarantees that the performance is always linear while with Parsec you have to be careful. Regards, Krasimir On Tue, Nov 4, 2008 at 6:05 PM, Martijn van Steenbergen <[EMAIL PROTECTED]> wrote: > Hello all, > > For my mud client Yogurt (see hackage) I'm currently working on > improving the efficiency of the hooks. Right now several hooks, each > consisting of a regex and an action can be active at the same time. > Every time a line of input is available (usually several times a second) > I run the line through all the available regexes and execute the first > matching action. > > I figured this is not the cleverest approach and it'd be better if I > |'ed all regexes into one big DFA. However, how do I then find out which > of the original hooks matched and so which action to execute? > > As far as I know there's no way to do that with Text.Regex. Alex looks > promising but is really only an executable and doesn't offer an API. > I've also found mr. João Saraiva's HaLex but I don't know if that was > meant to be used seriously. > > Does anyone have any experience with this? What's the best way to > achieve this? > > Thanks much, > > Martijn. > > ___ > Haskell-Cafe mailing list > Haskell-Cafe@haskell.org > http://www.haskell.org/mailman/listinfo/haskell-cafe > ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Efficient parallel regular expressions
On Tue, Nov 4, 2008 at 9:05 AM, Martijn van Steenbergen <[EMAIL PROTECTED]> wrote: > > For my mud client Yogurt (see hackage) I'm currently working on > improving the efficiency of the hooks. Right now several hooks, each > consisting of a regex and an action can be active at the same time. > Every time a line of input is available (usually several times a second) > I run the line through all the available regexes and execute the first > matching action. > > I figured this is not the cleverest approach and it'd be better if I > |'ed all regexes into one big DFA. However, how do I then find out which > of the original hooks matched and so which action to execute? Is this really a problem in practice? I've done similar things in Ruby, which is a much slower language, and not had any issues - particularly in something IO bound like a MUD client it doesn't seem that running down a few tens of regexps would be a bottleneck of any sort. martin ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
RE: [Haskell-cafe] Efficient parallel regular expressions
Hi Martijn, It's not that tricky if you do a regular expression state machine yourself, but that's probably a bit too much work. One way to get a speed up might be to take the regular expressions a,b,c,d and generate a regex a+b+c+d, and one a+b. You can then check any string s against a+b+c+d, if that matches check a+b, if that matches check a. At each stage you eliminate half the regular expressions, which means a match will take log n, where n is the number of regular expressions. This assumes the underlying regular expression engine constructs a finite state machine, making it O(m) where m is the length of the string to match. Thanks Neil > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of > Martijn van Steenbergen > Sent: 04 November 2008 5:05 pm > To: Haskell Cafe > Subject: [Haskell-cafe] Efficient parallel regular expressions > > Hello all, > > For my mud client Yogurt (see hackage) I'm currently working > on improving the efficiency of the hooks. Right now several > hooks, each consisting of a regex and an action can be active > at the same time. > Every time a line of input is available (usually several > times a second) I run the line through all the available > regexes and execute the first matching action. > > I figured this is not the cleverest approach and it'd be better if I > |'ed all regexes into one big DFA. However, how do I then > find out which > of the original hooks matched and so which action to execute? > > As far as I know there's no way to do that with Text.Regex. > Alex looks promising but is really only an executable and > doesn't offer an API. > I've also found mr. João Saraiva's HaLex but I don't know if > that was meant to be used seriously. > > Does anyone have any experience with this? What's the best > way to achieve this? > > Thanks much, > > Martijn. > > ___ > Haskell-Cafe mailing list > Haskell-Cafe@haskell.org > http://www.haskell.org/mailman/listinfo/haskell-cafe > > == Please access the attached hyperlink for an important electronic communications disclaimer: http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html == ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Efficient parallel regular expressions
Hello all, For my mud client Yogurt (see hackage) I'm currently working on improving the efficiency of the hooks. Right now several hooks, each consisting of a regex and an action can be active at the same time. Every time a line of input is available (usually several times a second) I run the line through all the available regexes and execute the first matching action. I figured this is not the cleverest approach and it'd be better if I |'ed all regexes into one big DFA. However, how do I then find out which of the original hooks matched and so which action to execute? As far as I know there's no way to do that with Text.Regex. Alex looks promising but is really only an executable and doesn't offer an API. I've also found mr. João Saraiva's HaLex but I don't know if that was meant to be used seriously. Does anyone have any experience with this? What's the best way to achieve this? Thanks much, Martijn. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe