That is not a problem. That is an algorithm whose first step is not even clear: a Google search on {stats "view all sites"} returns a "normal response", a list of 224,000 pages from different websites.

Still assuming that the three files you posted in the original post are a sample of your input, I wonder whether all you want is simply:
$ awk '{ print FILENAME, $0 }' *.txt | sort -k 3 | awk 'p != $3 { if (p != "") print c, p r; p = $3; c = 0; r = "" } { ++c; r = r " " $1 " " $2 } END { if (p != "") print c, p r }' | sort -nrk 1,1 > out

If *.txt catches the three files, "out" is (with the same semantics as explained in my previous post, except that the file name is now before the number):
3 xhsjs.preferdrive.net HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 2 HNs.www_.outwardbound.net_.txt 6
3 webislab40.medien.uni-weimar.de HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 1 HNs.www_.outwardbound.net_.txt 2
(...)
1 027a74fd.bb.sky.com HNs.www_.outwardbound.net_.txt 188
1 014199116180.ctinets.com HNs.www_.outwardbound.net_.txt 3
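
To check the pipeline and the output format on a small case, here is a minimal sketch. The files a.txt, b.txt and c.txt and the hostnames in them are made up for illustration, and I am assuming each input line is "<number> <hostname>" as in your sample. The END rule is there so that the last hostname group is printed too, and LC_ALL=C is only set to make the sort order reproducible.

```shell
# Hypothetical sample data: each line is "<number> <hostname>".
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' '1 foo.example' '5 bar.example' > a.txt
printf '%s\n' '2 foo.example' '4 bar.example' > b.txt
printf '%s\n' '3 foo.example' '7 baz.example' > c.txt

# Assumption: a fixed locale, so that sort behaves the same everywhere.
LC_ALL=C; export LC_ALL

# 1. Prefix every line with the name of the file it comes from.
# 2. Sort on the hostname (third field) to bring equal hostnames together.
# 3. Collapse each group into "<nb of files> <hostname> <file> <number> ...".
# 4. Sort by decreasing number of files.
awk '{ print FILENAME, $0 }' *.txt |
  sort -k 3 |
  awk 'p != $3 { if (p != "") print c, p r; p = $3; c = 0; r = "" }
       { ++c; r = r " " $1 " " $2 }
       END { if (p != "") print c, p r }' |
  sort -nrk 1,1 > out

cat out
```

On this sample, "out" should contain:
3 foo.example a.txt 1 b.txt 2 c.txt 3
2 bar.example a.txt 5 b.txt 4
1 baz.example c.txt 7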

The input files can then be removed: all the information is in "out". You can query it with 'grep' and 'awk'. For instance:

To only get the lines with hostnames in "HNs.bst_.lt_.txt" (hence a selection with as many lines as "HNs.bst_.lt_.txt"):
$ grep -F ' HNs.bst_.lt_.txt ' out
To additionally require that the selected hostnames appear in at least one other file (as in the problem I stated):
$ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1'
To only keep the hostnames of the previous output:
$ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1 { print $2 }'
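
If you would rather not rely on a substring match, the same selection can be sketched in awk alone by comparing the file-name fields exactly. This is only an illustration: the sample data, the temporary file and the variable names "f" and "both" are made up, and it assumes the "out" format described above (file names in fields 3, 5, 7, ...).

```shell
# Hypothetical "out"-style data: <nb of files> <hostname>, then <file> <number> pairs.
sample=$(mktemp) || exit 1
cat > "$sample" <<'EOF'
3 foo.example a.txt 1 b.txt 2 c.txt 3
2 bar.example a.txt 5 b.txt 4
1 baz.example c.txt 7
EOF

# Hostnames listed in file "$f" and in at least one other file.
f=b.txt
both=$(awk -v f="$f" '
  $1 > 1 {
    # File names are in the odd fields starting at 3; test each one exactly.
    for (i = 3; i < NF; i += 2)
      if ($i == f) { print $2; break }
  }' "$sample")

printf '%s\n' "$both"
rm -f "$sample"
```

With this sample it prints foo.example and bar.example, one per line: baz.example is dropped because it appears in one file only.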
