Hello,

I've found some very interesting behaviour when subjecting various awk 
implementations to some very specific circumstances.

I'm basically looking for a sanity check here to confirm if I'm just wildly 
flailing, or if I am indeed onto something here.

Here's my situation:

When parsing some RIR data in parallel using awk with xargs, I seem to have
found a way to reliably lose and/or mangle output. My google-fu seems to be
failing me. I understand that xargs does not buffer output and that lines may
arrive out of order, but in this case I am reproducibly losing data and
receiving mangled output. But wait, it gets stranger.

I don't want to lose you guys here with a long-winded explanation, so I'm going
to show you a diff of the reproducibly mangled output I get when using xargs in
parallel mode:

--- /tmp/bad.txt  Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt  Wed Apr 14 21:06:41 2021
@@ -1,5 +1,3 @@
-267386
-A264890
 AS262399
 AS262400
 AS262401
@@ -1774,6 +1772,7 @@
 AS264887
 AS264888
 AS264889
+AS264890
 AS264891
 AS264892
 AS264893
@@ -3552,6 +3551,7 @@
 AS267383
 AS267384
 AS267385
+AS267386
 AS267387
 AS267388
 AS267389
@@ -4220,6 +4220,7 @@
 AS268318
 AS268319
 AS268320
+AS268320
 AS268321
 AS268321
 AS268323
@@ -7785,6 +7786,7 @@
 AS270633
 AS270633
 AS270634
+AS270634
 AS270635
 AS270635
 AS270636
@@ -10277,5 +10279,3 @@
 AS46210
 AS46280
 AS46280
-ASAS268320
-ASS270634

The only thing that changed between these runs was me using either xargs -P 1 
or -P 2.
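
The damaged lines in that diff all break the "AS followed by digits" shape, so
a filter like the one below should pick them out without needing a known-good
file to diff against. This is only a sketch: "mangled" is a name I made up, and
it assumes every intact line is exactly AS plus digits, so a line that was cut
in a way that still looks valid would slip through.

```shell
#!/bin/sh
# Print every line that is NOT "AS" followed only by digits.
# Assumption: all well-formed output lines have exactly that shape.
mangled() {
    grep -Ev '^AS[0-9]+$'
}

# Lines copied from the diff above: one intact, four damaged.
# Only the four damaged ones should be printed.
printf '%s\n' AS262399 267386 A264890 ASAS268320 ASS270634 | mangled
```

Piping the output of a parallel run through the same filter should then list
only the suspect lines.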

To allow folks to follow along with me at home, I've included the two files 
(gzipped for politeness) I used to trigger this behaviour.

Once you've extracted the attached text files into your working directory, 
here's a snippet that should reproduce my issue:

$ printf 'BR\nCA\n' > cc.txt

$ find . -type f -name "[12].txt" -print0 | \
    xargs -0 -n 1 -P 2 -- awk -F '|' \
    'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
    cc.txt

What does this one-liner do? It's supposed to slurp the country codes
specified in cc.txt into an array, against which we then check the first field
of each row of the RIR data. If the first field matches a country code in the
array and the second field indicates that this row is an ASN record, then we
print the third field prefixed with 'AS'. If you grep the output of the above
command for the strings "ASAS", "ASS" or "A2", you should see some mangled
ASNs. If you change "-P 2" to "-P 1", this mangling does not occur.
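
For anyone not familiar with the NR==FNR idiom, here is a minimal sketch of the
same logic on a handful of made-up rows. The rows follow the cc|type|number
layout my one-liner expects, but the country assignments are invented for
illustration and "demo" is just a throwaway function name.

```shell
#!/bin/sh
# Minimal sketch of the two-file idiom on made-up rows.
# The sample rows are hypothetical; only the field layout matches my data.
demo() {
    dir=$(mktemp -d) || return 1
    printf 'BR\nCA\n' > "$dir/cc.txt"
    printf 'BR|asn|262399\nUS|asn|46210\nCA|ipv4|192.0.2.0\nCA|asn|46280\n' \
        > "$dir/sample.txt"
    # NR==FNR is only true while awk reads its first file (cc.txt), so the
    # first rule fills A; the second rule then filters the rows of sample.txt.
    awk -F '|' 'NR==FNR { A[$1]=1 ; next }
        $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
        "$dir/cc.txt" "$dir/sample.txt"
    rm -rf "$dir"
}

demo
```

With these rows it should print AS262399 and AS46280: the US row is dropped
because US is not in cc.txt, and the ipv4 row is dropped because its second
field is not "asn".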

Here's where things get very weird. While parsing this data (as part of a
larger dataset comprising an aggregation of all the registrar delegation
statistics) I've been using this snippet for a while to quickly fetch ASN
records. I can only trigger this bug when BOTH the BR and CA country codes are
in the array. Any number of other country codes can be in the array without
trouble, but as soon as Brazil AND Canada are both specified I get mangled
output, and ONLY when running xargs in parallel. This happens reproducibly
with awk, gawk and mawk. To further melt your brain, this behaviour has NOT
been observed with goawk, a POSIX-compliant awk implementation written in Go.

Just to prove my point, here's me comparing the output hashes across various
awk implementations with my one-liner above:

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- awk -F '|' 
'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' 
cc.txt | sort | md5
    2a20f44ce6a23d5c49b05b9f2689ef93

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- awk -F '|' 
'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' 
cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- mawk -F '|' 
'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' 
cc.txt | sort | md5
    2a20f44ce6a23d5c49b05b9f2689ef93

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- mawk -F '|' 
'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' 
cc.txt | sort | md5
    9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- 
~/go/bin/goawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { 
printf("AS%s\n", $3) }' cc.txt | sort | md>
    9ab3dbfbff5746f059cdb35221ff73b1

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- 
~/go/bin/goawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { 
printf("AS%s\n", $3) }' cc.txt | sort | md
    9ab3dbfbff5746f059cdb35221ff73b1
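
For anyone who wants to repeat this comparison in one go, something like the
loop below should do. It is a sketch rather than what I actually ran: "compare"
is a name I made up, cksum is swapped in for md5 so the loop is not tied to one
platform's hashing tool (so the sums will not match the md5 values above), and
any implementation that is not installed is skipped.

```shell
#!/bin/sh
# Sketch: checksum the sorted output for each awk implementation that is
# installed, once with -P 1 and once with -P 2.
# Assumption: cc.txt, 1.txt and 2.txt are in the current directory.
prog='NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }'

compare() {
    for impl in awk gawk mawk "$HOME/go/bin/goawk"; do
        # Skip implementations that are not available on this machine.
        command -v "$impl" >/dev/null 2>&1 || continue
        for p in 1 2; do
            sum=$(find . -type f -name "[12].txt" -print0 |
                xargs -0 -n 1 -P "$p" -- "$impl" -F '|' "$prog" cc.txt |
                sort | cksum)
            printf '%s -P %s: %s\n' "$impl" "$p" "$sum"
        done
    done
}

# Only run when the input files are actually present.
if [ -f cc.txt ]; then
    compare
fi
```

If the two sums for one implementation differ, the parallel run lost or
mangled something for that implementation.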

I've racked my brain and searched the internet for hours. I've tested and
toiled, and I'm left thoroughly perplexed. I now humbly ask the fine folks
here in OpenBSD Land for guidance, insight or suggestions.

As always, is this a bug, or am I holding it wrong?

Regards,

Jordan

Attachment: 1.txt.gz
Description: application/gzip

Attachment: 2.txt.gz
Description: application/gzip
