Over a year ago, I lamented that sort followed by uniq -u wasn't removing duplicates from a list:
https://trisquel.info/en/forum/sort-and-uniq-fail-remove-all-duplicates-list-hostnames-and-their-ipv4-addresses

Recently I've been faced with the results of grep searches in other files; the results overlap because they contain the same string for which grep was searching. After sorting the grep outputs, then cutting & pasting, I ended up with pairs of files that contain many duplicates because the strings were caught twice.

grep -h lns03.v6.018.net.il *Rev.oGnMap.txt >> PTR.IPv6-Data/IPv6-lns03.v6.018.net.il.txt ; grep -h cable-lns03.v6.018.net.il *Rev.oGnMap.txt >> PTR.IPv6-Data/IPv6-cable-lns03.v6.018.net.il.txt
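
One way such an overlap can arise, sketched on two made-up lines (the host1 names and the 2001:db8:: addresses are placeholders, not taken from my data): the shorter name is a substring of the longer one, so a search for it also catches the cable- lines. The grep -v step is only one possible way to keep the two apart, not something I actually ran.

```shell
# Hypothetical sample: one plain lns03 line and one cable-lns03 line
lines=$(printf '%s\t%s\n' \
  host1.lns03.v6.018.net.il 2001:db8::1 \
  host1.cable-lns03.v6.018.net.il 2001:db8::2)

# A fixed-string search (-F) for the shorter name counts both lines
both=$(printf '%s\n' "$lines" | grep -cF 'lns03.v6.018.net.il')

# Dropping the cable- lines first leaves only the plain lns03 line
plain_only=$(printf '%s\n' "$lines" \
  | grep -vF 'cable-lns03.v6.018.net.il' \
  | grep -cF 'lns03.v6.018.net.il')

echo "shorter pattern matched: $both; after excluding cable-: $plain_only"
```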

The grep outputs were expected to list the PTR record in the first column and the corresponding IPv6 address in the second column, because I reversed the order of those columns in the outputs of the original nMap -oG searches and removed the parentheses enclosing the IPv6 addresses. In the sorting scripts below, $1 is the PTR and $2 is the IPv6 address, except in the uniq -c script, where I printed $2 and $3 to skip the counts column produced by uniq -c.

Here are the three pairs of scripts intended to consolidate the files:

sort IPv6-lns03.v6.018.net.il.txt | uniq -u > IPv6-uniq.lns03.v6.018.net.il.txt ; sort IPv6-cable-lns03.v6.018.net.il.txt | uniq -u > IPv6-uniq.cable-lns03.v6.018.net.il.txt

sort -k 2 IPv6-lns03.v6.018.net.il.txt | uniq -c | awk '{print $2"\t"$3}' '-' > IPv6-uniq.lns03.v6.018.net.il.txt ; sort -k 2 IPv6-cable-lns03.v6.018.net.il.txt | uniq -c | awk '{print $2"\t"$3}' '-' > IPv6-uniq.cable-lns03.v6.018.net.il.txt

sort -u IPv6-lns03.v6.018.net.il.txt  > IPv6-uniqB.lns03.v6.018.net.il.txt
sort -u IPv6-cable-lns03.v6.018.net.il.txt > IPv6-uniqB.cable-lns03.v6.018.net.il.txt

The first pair produced zero-byte output files for both scripts, although the original files were not empty.

The second pair reduced both files by half as expected.

Then I remembered to check this forum, wherein Magic Banana had suggested using sort -u instead of the first pair's combination of sort and uniq -u. This third pair produced the exact same halving of the original file sizes as my less efficient use of uniq -c and awk to eliminate the counts column. Thank you again, Magic Banana!

I had tried to "fix" the uniq -u debacle of the first pair of sorting scripts by copying the affected file names directly from the file manager into the script text, as that has been a useful workaround in the past, but this time the first pair again produced zero bytes of output, the same as on my first attempt.

What is it about uniq -u of which I should be wary?
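
For anyone who wants to reproduce this, here is a minimal sketch on a made-up list (the hostnames and 2001:db8:: addresses are invented, not taken from my files). It assumes every line occurs exactly twice, which the halving above suggests. As I read the uniq documentation, -u prints only the lines that are not repeated, whereas plain uniq and sort -u keep one copy of each:

```shell
# Made-up sample in which every line appears exactly twice
sample=$(printf '%s\t%s\n' \
  host-a.example.net 2001:db8::1 \
  host-b.example.net 2001:db8::2 \
  host-a.example.net 2001:db8::1 \
  host-b.example.net 2001:db8::2)

# uniq -u prints only lines that are NOT repeated: nothing survives (0 lines)
only_unrepeated=$(printf '%s\n' "$sample" | sort | uniq -u | wc -l)

# Plain uniq collapses adjacent duplicates to one copy each (2 lines)
collapsed=$(printf '%s\n' "$sample" | sort | uniq | wc -l)

# sort -u does the same in a single step (2 lines)
sort_u=$(printf '%s\n' "$sample" | sort -u | wc -l)

echo "uniq -u: $only_unrepeated  uniq: $collapsed  sort -u: $sort_u"
```

If that reading is right, the empty output of the first pair would simply mean that no line in either file occurred only once.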

George Langford

