That is not a problem. That is an algorithm, and even its first step is not
clear: a Google search for {stats "view all sites"} returns a "normal
response", that is, a list of 224,000 pages from different websites.
Still assuming that the three files you posted in the original post are a
sample of your input, I wonder whether all you want is simply:
$ awk '{ print FILENAME, $0 }' *.txt | sort -k 3 |
  awk 'p != $3 { if (p != "") print c, p r; p = $3; c = 0; r = "" }
       { ++c; r = r " " $1 " " $2 }
       END { if (p != "") print c, p r }' |
  sort -nrk 1,1 > out
If *.txt catches the three files, "out" is (with the same semantics as
explained in my previous post, except that the file name is now before the
number):
3 xhsjs.preferdrive.net HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 2 HNs.www_.outwardbound.net_.txt 6
3 webislab40.medien.uni-weimar.de HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 1 HNs.www_.outwardbound.net_.txt 2
(...)
1 027a74fd.bb.sky.com HNs.www_.outwardbound.net_.txt 188
1 014199116180.ctinets.com HNs.www_.outwardbound.net_.txt 3
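If the format is easier to judge on a toy case, here is a minimal sketch you
can run in a scratch directory. The file names A.txt, B.txt, C.txt and the
example.net hostnames are made up; I am assuming, as the pipeline itself does,
that every input line is "<number> <hostname>". The END rule is there so that
the last group of hostnames is printed as well, and LC_ALL=C is only set to
make the sort order predictable.

```shell
# Scratch directory with three made-up input files ("<number> <hostname>" per line).
export LC_ALL=C                      # byte-wise collation, so the sort order is predictable
cd "$(mktemp -d)" || exit 1
printf '1 a.example.net\n4 b.example.net\n' > A.txt
printf '2 a.example.net\n'                  > B.txt
printf '6 a.example.net\n3 c.example.net\n' > C.txt

# 1. prefix every line with its file name, 2. group equal hostnames together,
# 3. emit one line per hostname: <number of files> <hostname> <file> <number> ...
# 4. put the hostnames found in the most files first.
awk '{ print FILENAME, $0 }' *.txt | sort -k 3 |
  awk 'p != $3 { if (p != "") print c, p r; p = $3; c = 0; r = "" }
       { ++c; r = r " " $1 " " $2 }
       END { if (p != "") print c, p r }' |
  sort -nrk 1,1 > out

cat out
# The first line is: 3 a.example.net A.txt 1 B.txt 2 C.txt 6
```

The two remaining lines each start with 1: b.example.net is only in A.txt and
c.example.net is only in C.txt.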
The input files can then be removed: all the information is in "out". You
can query it with 'grep' and 'awk'. For instance:
To only get the lines with hostnames in "HNs.bst_.lt_.txt" (hence a selection
with as many lines as "HNs.bst_.lt_.txt"):
$ grep -F ' HNs.bst_.lt_.txt ' out
To additionally require that the selected hostnames are in at least one other
file (as in the problem I stated):
$ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1'
To only keep the hostnames of the previous output:
$ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1 { print $2 }'
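Along the same lines, awk alone can compare whole fields rather than
substrings and also report the number recorded in the chosen file. The
following is only a sketch: the variable name "f" and the scratch copy of
"out" (three lines taken from the sample above) are my own choices, made so
that the example runs on its own.

```shell
# Scratch copy of a few lines of "out", taken from the sample shown earlier.
out=$(mktemp)
cat > "$out" <<'EOF'
3 xhsjs.preferdrive.net HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 2 HNs.www_.outwardbound.net_.txt 6
3 webislab40.medien.uni-weimar.de HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 1 HNs.www_.outwardbound.net_.txt 2
1 027a74fd.bb.sky.com HNs.www_.outwardbound.net_.txt 188
EOF

# For each hostname that is in file f and in at least one other file,
# print the hostname followed by the number recorded for it in f.
result=$(awk -v f='HNs.www_.barcodeus.com_.txt' '
  $1 > 1 {
    # Fields 3, 5, 7, ... hold file names; the field after each is its number.
    for (i = 3; i < NF; i += 2) {
      if ($i == f) print $2, $(i + 1)
    }
  }' "$out")
printf '%s\n' "$result"
# prints:
#   xhsjs.preferdrive.net 2
#   webislab40.medien.uni-weimar.de 1
```

On the real data, replace "$out" with out and set f to the file of interest.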