Hi everyone,

I was playing around with awk and ran into some surprising data for the function match(s, r), which returns the position of the first match of regular expression r in string s. Here's the test I ran:

$ yes | head -10000 | tr -d '\n' >/tmp/big
$ yes | head -1000000 | tr -d '\n' >/tmp/bigger
$ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big

real    0m0.056s
user    0m0.053s
sys     0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger

real    0m5.695s
user    0m5.140s
sys     0m0.553s

The difference is almost exactly 100x, which is the size difference between the input files. It seems ridiculous that the amount of time taken to match the first character of a string grows linearly with the size of the string. The time it takes to load the contents of the file does not contribute significantly to this increase.

That was GNU awk. Trying p9p awk:

$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big

real    0m0.024s
user    0m0.023s
sys     0m0.000s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger

real    0m1.856s
user    0m1.857s
sys     0m0.000s

This is a lot faster compared to GNU awk but there's still a 77x difference between the two input files.

Finally, trying Kernighan's One True Awk (from http://www.cs.princeton.edu/~bwk/btl.mirror/awk.tar.gz)

$ time nawk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/big

real    0m0.001s
user    0m0.000s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "y")} }' /tmp/bigger

real    0m0.015s
user    0m0.010s
sys     0m0.003s

The time difference here is actually only due to loading a larger file:

$ time nawk '{ }' /tmp/big

real    0m0.001s
user    0m0.000s
sys     0m0.000s
$ time nawk '{ }' /tmp/bigger

real    0m0.016s
user    0m0.013s
sys     0m0.003s

So at least nawk's performance makes sense. To make things a little more confusing, I tried matching on a non-existent pattern:

$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.005s
user    0m0.003s
sys     0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m0.085s
user    0m0.080s
sys     0m0.003s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.006s
user    0m0.003s
sys     0m0.000s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m0.077s
user    0m0.077s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.007s
user    0m0.007s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m0.612s
user    0m0.607s
sys     0m0.003s

Now GNU awk and p9p awk are showing on the order of 10x differences between the two files, while nawk shows 100x difference. Obviously the regex algorithm that nawk uses is different from the other two, and there's a trade off here, though nawk still looks superior.

Last test:

$ echo -n 'z' >>/tmp/big
$ echo -n 'z' >>/tmp/bigger

Now the pattern does match, but at the very end of the string.

$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.056s
user    0m0.057s
sys     0m0.000s
$ time awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m5.861s
user    0m5.270s
sys     0m0.590s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.057s
user    0m0.057s
sys     0m0.000s
$ time 9 awk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m4.760s
user    0m4.756s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/big

real    0m0.007s
user    0m0.007s
sys     0m0.000s
$ time nawk '{for (i=1; i < 100; i++) { match($0, "z")} }' /tmp/bigger

real    0m0.612s
user    0m0.610s
sys     0m0.003s

In this case all the numbers make sense, assuming a linear scan is going on.

Can someone shed some light on these vastly different numbers? Why do GNU awk and p9p awk perform so poorly on the easy case where they just have to match the first character?

Reply via email to