[dev] off topic - awk versions performance comparison

Joseph Xu Wed, 11 Aug 2010 03:14:31 -0700

Hi everyone,

I was playing around with awk and ran into some surprising data for thefunction match(s, r), which returns the position of the first match ofregular expression r in string s. Here's the test I ran:


$ yes | head -10000 | tr -d '\n' >/tmp/big
$ yes | head -1000000 | tr -d '\n' >/tmp/bigger
$ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big

real    0m0.056s
user    0m0.053s
sys     0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger

real    0m5.695s
user    0m5.140s
sys     0m0.553s

The difference is almost exactly 100x, which is the size differencebetween the input files. It seems ridiculous that the amount of timetaken to match the first character of a string grows linearly with thesize of the string. The time it takes to load the contents of the filedoes not contribute significantly to this increase.


That was GNU awk. Trying p9p awk:

$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big

real    0m0.024s
user    0m0.023s
sys     0m0.000s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger

real    0m1.856s
user    0m1.857s
sys     0m0.000s

This is a lot faster compared to GNU awk but there's still a 77xdifference between the two input files.

Finally, trying Kernighan's One True Awk (fromhttp://www.cs.princeton.edu/~bwk/btl.mirror/awk.tar.gz)


$ time nawk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big

real    0m0.001s
user    0m0.000s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger

real    0m0.015s
user    0m0.010s
sys     0m0.003s

The time difference here is actually only due to loading a larger file:

$ time nawk '{ }' /tmp/big

real    0m0.001s
user    0m0.000s
sys     0m0.000s
$ time nawk '{ }' /tmp/bigger

real    0m0.016s
user    0m0.013s
sys     0m0.003s

So at least nawk's performance makes sense. To make things a little moreconfusing, I tried matching on a non-existent pattern:


$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.005s
user    0m0.003s
sys     0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m0.085s
user    0m0.080s
sys     0m0.003s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.006s
user    0m0.003s
sys     0m0.000s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m0.077s
user    0m0.077s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.007s
user    0m0.007s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m0.612s
user    0m0.607s
sys     0m0.003s

Now GNU awk and p9p awk are showing on the order of 10x differencesbetween the two files, while nawk shows 100x difference. Obviously theregex algorithm that nawk uses is different from the other two, andthere's a trade off here, though nawk still looks superior.


Last test:

$ echo -n 'z' >>/tmp/big
$ echo -n 'z' >>/tmp/bigger

Now the pattern does match, but at the very end of the string.

$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.056s
user    0m0.057s
sys     0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m5.861s
user    0m5.270s
sys     0m0.590s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.057s
user    0m0.057s
sys     0m0.000s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m4.760s
user    0m4.756s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.007s
user    0m0.007s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m0.612s
user    0m0.610s
sys     0m0.003s

In this case all the numbers make sense, assuming a linear scan is going on.

Can someone shed some light on these vastly different numbers? Why doGNU awk and p9p awk perform so poorly on the easy case where they justhave to match the first character?

[dev] off topic - awk versions performance comparison

Reply via email to