Re: [Haskell-cafe] Efficient parallel regular expressions

2008-11-06 Thread wren ng thornton

Mitchell, Neil wrote:

Hi Martijn,


> It's not that tricky if you do a regular expression state machine
> yourself, but that's probably a bit too much work. One way to get
> a speed up might be to take the regular expressions a,b,c,d and
> generate a regex a+b+c+d, and one a+b. You can then check any string
> s against a+b+c+d, if that matches check a+b, if that matches check
> a. At each stage you eliminate half the regular expressions, which
> means a match will take log n, where n is the number of regular
> expressions.
>
> This assumes the underlying regular expression engine constructs a
> finite state machine, making it O(m) where m is the length of the
> string to match.


If you're implementing the machine yourself, you can implement it as 
trie automaton which has some "value" associated with each final state. 
That is, rather than just accepting or rejecting, when the automaton 
accepts it returns the value associated with the particular final state 
that it accepted by. This is a trivial extension on DFA/NFA 
implementations and is perfectly suited to the problem of combining 
multiple linear tries (sic: regexes) into a single machine. The 
minimization algorithms are a bit more complex than for DFA/NFAs, but 
still fairly straightforward if you're only doing prefix merging.


And this gets rid of the O(log n) factor.

--
Live well,
~wren
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Efficient parallel regular expressions

2008-11-05 Thread roger peppe
On Wed, Nov 5, 2008 at 1:56 PM, Martijn van Steenbergen
<[EMAIL PROTECTED]> wrote:
> I think I'll try roger's (private)

eek, bitten by "reply to sender only" again!

i had intended to send to the list too.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Efficient parallel regular expressions

2008-11-05 Thread roger peppe
On Tue, Nov 4, 2008 at 6:44 PM, i wrote:
> i'm sorry if this is obviously wrong (i haven't used Text.Regex), but
> can't you do this with submatches?

rights or wrongs of regexps aside, i just checked that the above
approach *is* feasible with Text.Regex

here's some code:

>module Multimatch(multimatch) where
>   import Text.Regex
>   import qualified Data.List as DL
>   import qualified Data.Maybe as DM
>
>   brcount :: String -> Int
>   brcount ('\\' : _ : s) =
>   brcount s
>   brcount ('(' : s) =
>   1 + brcount s
>   brcount (_ : s) =
>   brcount s
>   brcount [] =
>   0
>
>   -- given a list of strings representing regular expressions,
>   -- each associated with a tag, match against a string
>   -- and return the match, along with the associated tag,
>   -- or Nothing if there's no match.
>   multimatch :: [(tag, String)] -> (String -> Maybe (tag, String))
>   multimatch rs =
>   let re = mkRegex $ DL.intercalate "|" $ map ((\s -> "(x(" ++ s 
> ++ "))") . snd) rs in
>   let tags = submatches rs in
>   (\ s ->
>   do
>   ms <- matchRegex re ("x" ++ s)
>   (tag, m) <- DL.find (\(_, m) -> not (null m)) 
> (zip tags ms)
>   return (DM.fromJust tag, tail m))
>   submatches [] =
>   []
>   submatches ((tag, r) : rs) =
>   (Just tag : take (brcount r + 1) (repeat Nothing)) ++ 
> submatches rs

i'm sure there's a more compact implementation in there somewhere:
i'm just a haskell newbie.

 cheers,
   rog.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Efficient parallel regular expressions

2008-11-05 Thread Martijn van Steenbergen

Hello everyone,

Thank you all for your comments! Those are some very useful ideas.

I think I'll try roger's (private) and ChrisK's suggestion first: using 
the match groups. I'm not sure if the match groups inside the individual 
 regexes will cause much trouble, but we'll see. I imagine I'll have to 
count parentheses, except when it's followed by a \, except when that \ 
follows another \, etc. There's probably other situations where a () 
doesn't count as a group, perhaps when it's followed by a * or +. I'll 
look into that.


If that doesn't work out I'll go for Neil's (from an algorithmic POV 
beautiful) suggestion.


While I understand that some of you suggest I use parsec (or some other 
mature parser library) I'm pretty sure that's not what I want here. The 
patterns will almost always be very simple and regular expressions offer 
an extremely concise way of expressing when a hook should fire. Forcing 
the user to use full parsers would cause the programs to become much 
more verbose. Still, Yogurt is flexible enough to allow the user to use 
parsec if he or she so chooses.


Thanks again,

Martijn.



Mitchell, Neil wrote:

Hi Martijn,

It's not that tricky if you do a regular expression state machine
yourself, but that's probably a bit too much work. One way to get a
speed up might be to take the regular expressions a,b,c,d and
generate a regex a+b+c+d, and one a+b. You can then check any string
s against a+b+c+d, if that matches check a+b, if that matches check
a. At each stage you eliminate half the regular expressions, which
means a match will take log n, where n is the number of regular
expressions.

This assumes the underlying regular expression engine constructs a
finite state machine, making it O(m) where m is the length of the
string to match.

Thanks

Neil

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Efficient parallel regular expressions

2008-11-05 Thread Krasimir Angelov
Hi Martijn,

If you are brave to start implementing DFA with all required
optimisations then you might want to look at:

http://www.ontotext.com/gate/japec.html

This is a compiler for language called JAPE. In the language you
define a set of rules where the right hand side
is a regular expression and the left hand side is a Java code. The
compiler itself is implemented in Haskell.
It includes code to build DFA from the set of regexps and then it does
determinization and minimization.

I wrote the compiler few years ago. You can decide to take and change
the code or to reimplement it yourself. Definitely DFA guarantees that
the performance is always linear while with Parsec you have to be
careful.

Regards,
   Krasimir



On Tue, Nov 4, 2008 at 6:05 PM, Martijn van Steenbergen
<[EMAIL PROTECTED]> wrote:
> Hello all,
>
> For my mud client Yogurt (see hackage) I'm currently working on
> improving the efficiency of the hooks. Right now several hooks, each
> consisting of a regex and an action can be active at the same time.
> Every time a line of input is available (usually several times a second)
> I run the line through all the available regexes and execute the first
> matching action.
>
> I figured this is not the cleverest approach and it'd be better if I
> |'ed all regexes into one big DFA. However, how do I then find out which
> of the original hooks matched and so which action to execute?
>
> As far as I know there's no way to do that with Text.Regex. Alex looks
> promising but is really only an executable and doesn't offer an API.
> I've also found mr. João Saraiva's HaLex but I don't know if that was
> meant to be used seriously.
>
> Does anyone have any experience with this? What's the best way to
> achieve this?
>
> Thanks much,
>
> Martijn.
>
> ___
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Efficient parallel regular expressions

2008-11-04 Thread Martin DeMello
On Tue, Nov 4, 2008 at 9:05 AM, Martijn van Steenbergen
<[EMAIL PROTECTED]> wrote:
>
> For my mud client Yogurt (see hackage) I'm currently working on
> improving the efficiency of the hooks. Right now several hooks, each
> consisting of a regex and an action can be active at the same time.
> Every time a line of input is available (usually several times a second)
> I run the line through all the available regexes and execute the first
> matching action.
>
> I figured this is not the cleverest approach and it'd be better if I
> |'ed all regexes into one big DFA. However, how do I then find out which
> of the original hooks matched and so which action to execute?

Is this really a problem in practice? I've done similar things in
Ruby, which is a much slower language, and not had any issues -
particularly in something IO bound like a MUD client it doesn't seem
that running down a few tens of regexps would be a bottleneck of any
sort.

martin
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] Efficient parallel regular expressions

2008-11-04 Thread Mitchell, Neil
Hi Martijn,

It's not that tricky if you do a regular expression state machine yourself, but 
that's probably a bit too much work. One way to get a speed up might be to take 
the regular expressions a,b,c,d and generate a regex a+b+c+d, and one a+b. You 
can then check any string s against a+b+c+d, if that matches check a+b, if that 
matches check a. At each stage you eliminate half the regular expressions, 
which means a match will take log n, where n is the number of regular 
expressions.

This assumes the underlying regular expression engine constructs a finite state 
machine, making it O(m) where m is the length of the string to match.

Thanks

Neil

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Martijn van Steenbergen
> Sent: 04 November 2008 5:05 pm
> To: Haskell Cafe
> Subject: [Haskell-cafe] Efficient parallel regular expressions
> 
> Hello all,
> 
> For my mud client Yogurt (see hackage) I'm currently working 
> on improving the efficiency of the hooks. Right now several 
> hooks, each consisting of a regex and an action can be active 
> at the same time.
> Every time a line of input is available (usually several 
> times a second) I run the line through all the available 
> regexes and execute the first matching action.
> 
> I figured this is not the cleverest approach and it'd be better if I
> |'ed all regexes into one big DFA. However, how do I then 
> find out which
> of the original hooks matched and so which action to execute?
> 
> As far as I know there's no way to do that with Text.Regex. 
> Alex looks promising but is really only an executable and 
> doesn't offer an API.
> I've also found mr. João Saraiva's HaLex but I don't know if 
> that was meant to be used seriously.
> 
> Does anyone have any experience with this? What's the best 
> way to achieve this?
> 
> Thanks much,
> 
> Martijn.
> 
> ___
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
> 
> 

==
Please access the attached hyperlink for an important electronic communications 
disclaimer: 

http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
==

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Efficient parallel regular expressions

2008-11-04 Thread Martijn van Steenbergen

Hello all,

For my mud client Yogurt (see hackage) I'm currently working on
improving the efficiency of the hooks. Right now several hooks, each
consisting of a regex and an action can be active at the same time.
Every time a line of input is available (usually several times a second)
I run the line through all the available regexes and execute the first
matching action.

I figured this is not the cleverest approach and it'd be better if I
|'ed all regexes into one big DFA. However, how do I then find out which
of the original hooks matched and so which action to execute?

As far as I know there's no way to do that with Text.Regex. Alex looks
promising but is really only an executable and doesn't offer an API.
I've also found mr. João Saraiva's HaLex but I don't know if that was
meant to be used seriously.

Does anyone have any experience with this? What's the best way to
achieve this?

Thanks much,

Martijn.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe