Hello,

I've found some very interesting behaviour when subjecting various awk
implementations to some very specific circumstances. I'm basically
looking for a sanity check here to confirm whether I'm just wildly
flailing or am indeed onto something.

Here's my situation: when parsing some RIR data in parallel using awk
with xargs, I seem to have found a way to reliably lose and/or mangle
output with parallel xargs. My google-fu seems to be failing me. I
understand that xargs does not buffer output and that lines may arrive
out of order, but in this case I am reliably and reproducibly losing
data and receiving mangled output.

But wait, it gets stranger. I don't want to lose you with a long-winded
explanation, so here is a diff that shows the reproducibly mangled
output when using xargs in parallel mode:

--- /tmp/bad.txt	Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt	Wed Apr 14 21:06:41 2021
@@ -1,5 +1,3 @@
-267386
-A264890
 AS262399
 AS262400
 AS262401
@@ -1774,6 +1772,7 @@
 AS264887
 AS264888
 AS264889
+AS264890
 AS264891
 AS264892
 AS264893
@@ -3552,6 +3551,7 @@
 AS267383
 AS267384
 AS267385
+AS267386
 AS267387
 AS267388
 AS267389
@@ -4220,6 +4220,7 @@
 AS268318
 AS268319
 AS268320
+AS268320
 AS268321
 AS268321
 AS268323
@@ -7785,6 +7786,7 @@
 AS270633
 AS270633
 AS270634
+AS270634
 AS270635
 AS270635
 AS270636
@@ -10277,5 +10279,3 @@
 AS46210
 AS46280
 AS46280
-ASAS268320
-ASS270634

The only thing that changed between these runs was whether I used
xargs -P 1 or -P 2.

To allow folks to follow along at home, I've included the two files
(gzipped for politeness) I used to trigger this behaviour. Once you've
extracted the attached text files into your working directory, here's a
snippet that should reproduce my issue:

$ printf 'BR\nCA\n' > cc.txt
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- awk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt

What does this one-liner do? It's supposed to slurp the country codes
specified in cc.txt into an array, against which we then check the
first field of each row of the RIR data.
If the first field matches a country code in the array and the second
field indicates that this row is an ASN record, then we print the third
field prefixed with 'AS'.

If you grep the output of the above command for the strings "ASAS",
"ASS" or "A2", you should see some mangled ASNs. If you change "-P 2"
to "-P 1", this mangling does not occur.

Here's where things get very weird. I've been using this snippet for a
while to quickly fetch ASN records while parsing this data (as part of
a larger dataset comprising an aggregation of all the registrar
delegation statistics). It is not until I have BOTH the BR and CA
country codes in the array that I can trigger this bug. I can have any
number of country codes in the array, but if Brazil AND Canada happen
to be specified, then I get mangled output, and ONLY when executed with
parallel xargs.

This reproducibly happens when using awk, gawk or mawk. To further melt
your brain, this behaviour has NOT been observed when using goawk, a
POSIX-compliant awk implementation written in Go.

Just to prove my point, here are the output hashes from various awk
implementations running the one-liner above:

$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- awk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
2a20f44ce6a23d5c49b05b9f2689ef93
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- awk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- mawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
2a20f44ce6a23d5c49b05b9f2689ef93
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- mawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md5
9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- ~/go/bin/goawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md>
9ab3dbfbff5746f059cdb35221ff73b1
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- ~/go/bin/goawk -F '|' 'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' cc.txt | sort | md
9ab3dbfbff5746f059cdb35221ff73b1

I've racked my brain and scoured the internet for hours, I've tested
and toiled, and I'm left thoroughly perplexed. I now humbly ask the
fine folks here in OpenBSD Land for guidance, insight or suggestions.

As always: is this a bug, or am I holding it wrong?

Regards,
Jordan
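
P.S. Grepping for "ASAS", "ASS" or "A2" only catches the mangled lines
I already know about. A slightly more general check, as a minimal
sketch: assuming every well-formed output line is exactly "AS" followed
by digits, print anything that is not. The find_mangled name is just a
hypothetical helper for illustration, not part of my original pipeline,
and the sample input is a handful of lines copied from the diff above.

```shell
#!/bin/sh
# Hypothetical helper: read lines on stdin and print every line that is
# NOT exactly "AS" followed by one or more digits, prefixed with its
# line number.
# Assumption: all well-formed output lines match ^AS[0-9]+$.
find_mangled() {
    awk '!/^AS[0-9]+$/ { printf("%d: %s\n", NR, $0) }'
}

# Self-contained demo: two good lines around four mangled ones from the diff.
printf 'AS262399\n267386\nA264890\nASAS268320\nASS270634\nAS46280\n' | find_mangled
```

On that sample it should flag lines 2 through 5 and stay silent about
the two well-formed ASNs. Piping the real xargs output through it in
place of the three greps would list whatever else got mangled.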
1.txt.gz
Description: application/gzip
2.txt.gz
Description: application/gzip