Over a year ago, I lamented that sort followed by uniq -u wasn't removing
duplicates from a list:
https://trisquel.info/en/forum/sort-and-uniq-fail-remove-all-duplicates-list-hostnames-and-their-ipv4-addresses
Recently I've been faced with the results of grep searches in other files
that overlap because they contain the same string on which grep was
searching. After sorting the grep outputs, then cutting & pasting, I ended
up with pairs of files that contain many duplicates because the strings
were caught twice.
grep -h lns03.v6.018.net.il *Rev.oGnMap.txt >> PTR.IPv6-Data/IPv6-lns03.v6.018.net.il.txt ;
grep -h cable-lns03.v6.018.net.il *Rev.oGnMap.txt >> PTR.IPv6-Data/IPv6-cable-lns03.v6.018.net.il.txt
The grep outputs were expected to list the PTR record in the first column
and the corresponding IPv6 address in the second column, because I reversed
the order of those columns in the outputs of the original nMap -oG searches
as well as removing the parentheses enclosing the IPv6 addresses.
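The overlap between the two grep searches can be sketched on two made-up lines (the addresses are placeholders, not the real data). The shorter hostname is a substring of the longer one, so the first search also catches the "cable-" lines. Anchoring the pattern is only one possible way around that, and it assumes the hostname starts the line, which may not hold for the real files:

```shell
# Made-up sample: one line per hostname, tab-separated, placeholder addresses.
tmp=$(mktemp)
printf '%s\t%s\n' \
  lns03.v6.018.net.il 2001:db8::1 \
  cable-lns03.v6.018.net.il 2001:db8::2 > "$tmp"

# Unanchored search: the pattern also occurs inside "cable-lns03...",
# so both lines are counted.
n_plain=$(grep -c 'lns03.v6.018.net.il' "$tmp")

# Anchored search (assumption: hostname is at the start of the line);
# the dots are escaped so they match only literal dots.
n_anchored=$(grep -c '^lns03\.v6\.018\.net\.il' "$tmp")

echo "plain: $n_plain  anchored: $n_anchored"
rm -f "$tmp"
```

With this sample the unanchored search counts 2 lines and the anchored one counts 1.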
In the sorting scripts below, $1 is the PTR and $2 is the IPv6 address,
except for the uniq -c script, where I printed $2 and $3 to skip the counts
column produced by uniq -c.
Here are the three pairs of scripts intended to consolidate the files:
sort IPv6-lns03.v6.018.net.il.txt | uniq -u > IPv6-uniq.lns03.v6.018.net.il.txt ;
sort IPv6-cable-lns03.v6.018.net.il.txt | uniq -u > IPv6-uniq.cable-lns03.v6.018.net.il.txt
sort -k 2 IPv6-lns03.v6.018.net.il.txt | uniq -c | awk '{print $2"\t"$3}' '-' > IPv6-uniq.lns03.v6.018.net.il.txt ;
sort -k 2 IPv6-cable-lns03.v6.018.net.il.txt | uniq -c | awk '{print $2"\t"$3}' '-' > IPv6-uniq.cable-lns03.v6.018.net.il.txt
sort -u IPv6-lns03.v6.018.net.il.txt > IPv6-uniqB.lns03.v6.018.net.il.txt
sort -u IPv6-cable-lns03.v6.018.net.il.txt > IPv6-uniqB.cable-lns03.v6.018.net.il.txt
The first pair produced zero bytes of output for both scripts, although the
original files were not empty.
The second pair reduced both files by half as expected.
Then I remembered to check this forum, wherein Magic Banana had suggested
using sort -u instead of the first pair's combination of sort and uniq -u.
This third pair produced the exact same halving of the original file sizes
as my less efficient use of uniq -c and awk to eliminate the counts column.
Thank you again, Magic Banana!
I had tried to "fix" the uniq -u debacle of the first pair of sorting
scripts by copying the affected file names directly from the File manager
into the script text, as that has been a useful workaround in the past, but
this time the first pair of sorting scripts produced zero bytes of output
again, the same as my first attempt.
What is it about uniq -u of which I should be wary?
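A minimal sketch of what may be going on, using made-up hostnames and addresses rather than the real data. According to the uniq documentation, -u prints only the lines that are not repeated, whereas plain uniq (or sort -u) keeps one copy of each distinct line. If every line in a file occurs exactly twice, which the halving above suggests, then uniq -u would leave nothing at all:

```shell
# Made-up sample: every line appears exactly twice (placeholder data).
tmp=$(mktemp)
printf '%s\t%s\n' \
  host-a.example.net 2001:db8::1 \
  host-b.example.net 2001:db8::2 \
  host-a.example.net 2001:db8::1 \
  host-b.example.net 2001:db8::2 > "$tmp"

# uniq -u keeps only lines that occur exactly once in the sorted input.
n_uniq_u=$(sort "$tmp" | uniq -u | wc -l)

# Plain uniq keeps one copy of each distinct line.
n_uniq=$(sort "$tmp" | uniq | wc -l)

# sort -u does the same job in a single step.
n_sort_u=$(sort -u "$tmp" | wc -l)

echo "uniq -u: $n_uniq_u  uniq: $n_uniq  sort -u: $n_sort_u"
rm -f "$tmp"
```

On this sample uniq -u leaves 0 lines, while uniq and sort -u each leave 2, which matches the zero-byte files from the first pair and the halving from the other two.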
George Langford