On Feb 21, 12:14 pm, Pop User <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > While creating a log parser for fairly large logs, we have run
> > into an issue where the time to process was relatively
> > unacceptable (upwards of 5 minutes for 1-2 million lines of
> > logs). In contrast, using the Linux tool grep would complete the
> > same search in a matter of seconds.
>
> It's very hard to beat grep, depending on the nature of the regex
> you are searching with. The regex engines in python/perl/php/ruby
> have traded the speed of grep/awk for the ability to do more
> complex searches.
>
> http://swtch.com/~rsc/regexp/regexp1.html
>
> This might not be your problem, but if it is you can always popen
> grep.
>
> It would be nice if there were a Thompson NFA re module.
Or a Glushkov NFA simulated by bit parallelism re module ... see
http://citeseer.ist.psu.edu/551772.html (which Russ Cox, author of
the paper you cited, seems not to have read).

Cox uses a "pathological regex" (regex = "a?" * 29 + "a" * 29, in
Python code) to make his point: grep uses a Thompson gadget and takes
linear time, while Python, Perl and friends use backtracking and go
off the planet.

The interesting thing is that in Navarro's NR-grep that case isn't
pathological at all; it's a simple instance of an "extended pattern"
(the ?, + and * operators applied to a single character or character
class) -- it takes linear time with a much easier setup than an
NFA/DFA and not much code executed per byte scanned.

Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :-)

Cheers,
John
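P.S. To make the "popen grep" suggestion in the quoted message
concrete, here's a minimal sketch using the subprocess module,
assuming a Unix box with grep on the PATH (the pattern and log path
are invented for illustration):

import subprocess

# Pattern and log path are made up for the example.
pattern = r"ERROR 50[0-9]"
logfile = "/var/log/app.log"

# grep -E does the heavy lifting with its automaton-based matcher;
# Python just streams the matching lines back for further processing.
proc = subprocess.Popen(["grep", "-E", pattern, logfile],
                        stdout=subprocess.PIPE)
matches = 0
for line in proc.stdout:
    matches += 1  # parse or collect the line here instead
proc.wait()
print("%d matching lines" % matches)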
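If anyone wants to watch the backtracking blow-up for themselves,
this little loop reproduces Cox's case; times are machine-dependent,
but somewhere in the twenties you'll want Ctrl-C handy:

import re
import time

# Cox's pathological case: "a?"*n + "a"*n matched against "a"*n.
# A backtracking engine has to try exponentially many ways of
# deciding which of the optional "a?" atoms consume a character, so
# the time roughly doubles with every increment of n.
for n in range(1, 25):
    t = time.time()
    re.match("a?" * n + "a" * n, "a" * n)
    print("n=%2d  %7.3fs" % (n, time.time() - t))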
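And since "bit parallelism" sounds more exotic than it is, here's the
textbook Shift-And algorithm (Baeza-Yates & Gonnet) for a plain
string pattern -- Navarro's Glushkov construction generalises exactly
this integer-as-automaton-state trick to regexes. The function name
and example are mine, not from the paper:

def shift_and_search(pattern, text):
    """Bit-parallel exact string search (Shift-And).

    Bit i of the state D is set iff pattern[:i+1] matches a suffix
    of the text read so far.  One shift, one OR and one AND per text
    character -- linear time for patterns that fit in a machine word
    (Python ints simply grow, so longer patterns work, just slower).
    """
    m = len(pattern)
    if m == 0:
        return 0
    # B[c]: bit i set iff pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    D = 0
    for pos, c in enumerate(text):
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & accept:
            return pos - m + 1  # start index of the first match
    return -1

print(shift_and_search("needle", "a haystack with a needle in it"))
# -> 18

Handling ?, + and * on single characters adds a mask or two to the
same loop, which is why NR-grep doesn't break a sweat on Cox's
example.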