[Trisquel-users] Re : Script needed to compare one two-column file with another two-column file

2019-06-06 Thread lcerf
As far as I understand, amenex wants a "natural join".  Also, you could write
'sort -u SmallerFile.txt LargerFile.txt' instead of 'cat SmallerFile.txt
LargerFile.txt | sort | uniq', which is not only needlessly long but also much
slower when there are many duplicates.
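
To check that the two forms agree, here is a minimal sketch on two tiny made-up files (their contents are invented for the test, only the file names come from the thread):

```shell
# Two small hypothetical input files, created only for this demonstration.
dir=$(mktemp -d)
printf 'b 2\na 1\nb 2\n' > "$dir/SmallerFile.txt"
printf 'c 3\na 1\na 1\n' > "$dir/LargerFile.txt"

# Long form: concatenate, sort, then drop adjacent duplicate lines.
long=$(cat "$dir/SmallerFile.txt" "$dir/LargerFile.txt" | sort | uniq)

# Short form: sort reads both files itself and drops duplicates while sorting.
short=$(sort -u "$dir/SmallerFile.txt" "$dir/LargerFile.txt")

[ "$long" = "$short" ] && echo "same output"
printf '%s\n' "$short"
rm -r "$dir"
```

The two forms are interchangeable here because no sort key is given: with a -k option, 'sort -u' compares only the key, whereas 'uniq' compares whole lines.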


[Trisquel-users] Re : Script needed to compare one two-column file with another two-column file

2019-06-06 Thread lcerf

First of all:

SmallerFile_0.txt is not sorted (conceptcable.com would be first): below, I
sort the files.

I do not understand why OutputFile_0.txt does not associate pool.mirgiga.net
with Uhnagty, Yjnmase, and Bnhjyht: below, I assume it should.



The format you use is redundant.  Moreover, in the output, it becomes hard
(if not impossible) to tell what comes from the "larger file" and what comes
from the "smaller file".  I suggest transforming the two input files so that
each has no duplicates in its first column and a comma-separated list of
values in its second column (if commas can appear in the files, choose
another separator), using the same command line twice:
$ sort -k 1,1 LargerFile_0.txt | awk '{ if ($1 == key) printf ",%s", $2; else { printf "\n%s", $0; key = $1 } }' | tail -n +2 > LargerFile_0.csv
$ sort -k 1,1 SmallerFile_0.txt | awk '{ if ($1 == key) printf ",%s", $2; else { printf "\n%s", $0; key = $1 } }' | tail -n +2 > SmallerFile_0.csv
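
To see what this grouping step does, here is a small sketch on a made-up three-line input (the host names and addresses are invented). It passes the fields through "%s", so that a percent sign in the data cannot be taken for a format directive:

```shell
# Hypothetical input: a host name in column 1, one value in column 2.
grouped=$(printf 'b.example 10.0.0.2\na.example 10.0.0.1\nb.example 10.0.0.3\n' |
  sort -k 1,1 |
  awk '{ if ($1 == key) printf ",%s", $2; else { printf "\n%s", $0; key = $1 } }' |
  tail -n +2)

# One line per distinct key, with its values joined by commas.
printf '%s\n' "$grouped"
```

awk starts a new output line whenever the first column changes, which is why the input has to be sorted on that column first. The 'tail -n +2' only drops the empty line that the very first "\n" produces.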


You then only need to "join" the two files (see  
https://en.wikipedia.org/wiki/Relational_algebra#Natural_join_(%E2%8B%88) for  
the theory):

$ join LargerFile_0.csv SmallerFile_0.csv
pool.giga.net.ru 91.210.179.94,91.210.179.95,91.210.179.96,91.210.179.97,91.210.179.98,91.210.179.99 Evgbhan,Ghbfght,Kmnslet,Loasfrt,Wnhmahy
pool.mirgiga.net 78.158.193.1,78.158.193.10,78.158.193.104,78.158.193.105,78.158.193.106,78.158.193.107,78.158.193.11,78.158.193.110,78.158.193.111,78.158.193.112,78.158.193.113 Bnhjyht,Uhnagty,Yjnmase
pool.sevtele.com 46.172.203.8,46.172.203.80,46.172.203.83,46.172.203.85,46.172.203.87,46.172.203.88 Ghbfght
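
For a feel of what "join" keeps and drops, here is a minimal sketch on two made-up grouped files (file names and contents are hypothetical, not taken from the attachments):

```shell
dir=$(mktemp -d)
# Hypothetical grouped files, already sorted on the first column.
printf 'a.example 10.0.0.1\nb.example 10.0.0.2,10.0.0.3\n' > "$dir/larger.csv"
printf 'b.example Xname,Yname\nc.example Zname\n' > "$dir/smaller.csv"

# Only keys present in both files survive: a.example and c.example are dropped.
joined=$(join "$dir/larger.csv" "$dir/smaller.csv")
printf '%s\n' "$joined"
rm -r "$dir"
```

join expects both inputs sorted on the join field, in the same collation order it uses itself. If it complains about unsorted input, running both sort and join with LC_ALL=C is a common way to make them agree.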


As a script that takes the two files as arguments and, thanks to named pipes,
runs the two preprocessing pipelines and the join in parallel:

#!/bin/sh

if [ -z "$2" ]
then
    printf 'Usage: %s file1 file2\n' "$0"
    exit 1
fi

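# mktemp reserves a unique name; the two FIFOs are created next to it and
# everything is removed when the script exits.
TMP=$(mktemp)
trap 'rm -f "$TMP" "$TMP.1" "$TMP.2"' 0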

mkfifo $TMP.1 $TMP.2

sort -k 1,1 "$1" | awk '{ if ($1 == key) printf ",%s", $2; else { printf "\n%s", $0; key = $1 } }' | tail -n +2 > $TMP.1 &
sort -k 1,1 "$2" | awk '{ if ($1 == key) printf ",%s", $2; else { printf "\n%s", $0; key = $1 } }' | tail -n +2 > $TMP.2 &
join $TMP.1 $TMP.2 # | awk '{ for (i = 1; ++i