[Trisquel-users] Re : Grep consumes all my RAM and swap ding a big job

lcerf Sun, 28 Jul 2019 17:34:37 -0700

That is not a problem. That is an algorithm whose first step is not even clear: a Google search on {stats "view all sites"} returns a "normal response", a list of 224,000 pages from different websites.

Still assuming that the three files you posted in the original post are a sample of your input, I actually wonder if all you want is not simply:

$ awk '{ print FILENAME, $0 }' *.txt | sort -k 3 | awk 'p != $3 { if (p != "") print c, p r; p = $3; c = 0; r = "" } { ++c; r = r " " $1 " " $2 } END { if (p != "") print c, p r }' | sort -nrk 1,1 > out

If *.txt catches the three files, "out" is (with the same semantics as explained in my previous post, except that the file name is now before the number):

3 xhsjs.preferdrive.net HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 2 HNs.www_.outwardbound.net_.txt 6
3 webislab40.medien.uni-weimar.de HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 1 HNs.www_.outwardbound.net_.txt 2

(...)
1 027a74fd.bb.sky.com HNs.www_.outwardbound.net_.txt 188
1 014199116180.ctinets.com HNs.www_.outwardbound.net_.txt 3
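
To try the pipeline on something small first, here is a minimal sketch that runs it on three tiny made-up files. The directory name "hn_demo", the file names a.txt, b.txt, c.txt and the hostnames are all hypothetical. Each input line is assumed to be "number hostname", as in the posted samples. The END block prints the last hostname group, which the main awk rule alone would never flush.

```shell
#!/bin/sh
# Hypothetical sample input: one "number hostname" pair per line.
mkdir -p hn_demo
printf '1 host1.example\n2 host2.example\n' > hn_demo/a.txt
printf '5 host1.example\n' > hn_demo/b.txt
printf '3 host1.example\n7 host3.example\n' > hn_demo/c.txt

(
  cd hn_demo || exit 1
  # 1. Prefix every line with its file name: "file number hostname".
  # 2. Sort on the hostname (third field) so equal hostnames are adjacent.
  # 3. Per hostname, count the files (c) and collect "file number" pairs (r).
  # 4. Sort by decreasing number of files.
  awk '{ print FILENAME, $0 }' *.txt |
    sort -k 3 |
    awk 'p != $3 { if (p != "") print c, p r; p = $3; c = 0; r = "" }
         { ++c; r = r " " $1 " " $2 }
         END { if (p != "") print c, p r }' |
    sort -nrk 1,1 > out
)

cat hn_demo/out
```

With that input, "out" has three lines and the first one is "3 host1.example a.txt 1 b.txt 5 c.txt 3", because host1.example is the only hostname present in all three files.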

The input files can then be removed: all the information is in "out". You can query it with 'grep' and 'awk'. For instance:

To only get the lines with hostnames in "HNs.bst_.lt_.txt" (hence a selection with as many lines as "HNs.bst_.lt_.txt"):

$ grep -F ' HNs.bst_.lt_.txt ' out

To additionally impose that the selected hostnames are in at least one other file (as in the problem I stated):

$ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1'

To only keep the hostnames of the previous output:

$ grep -F ' HNs.bst_.lt_.txt ' out | awk '$1 > 1 { print $2 }'
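
As a rough check of these queries, the sketch below runs them on a miniature, hand-written "out". The directory "hn_query_demo", the hostnames and the numbers are made up; only the two file names come from the example above. The spaces around the file name in the pattern matter: they make 'grep -F' match a whole field rather than a substring of some hostname or longer file name.

```shell
#!/bin/sh
# Hypothetical miniature "out": "files hostname file number [file number ...]".
mkdir -p hn_query_demo
cat > hn_query_demo/out <<'EOF'
2 host1.example HNs.bst_.lt_.txt 1 HNs.www_.barcodeus.com_.txt 4
1 host2.example HNs.bst_.lt_.txt 2
1 host3.example HNs.www_.barcodeus.com_.txt 9
EOF

# Fixed-string pattern; the surrounding spaces delimit the field.
file=' HNs.bst_.lt_.txt '

# Lines whose hostname occurs in that file.
grep -F "$file" hn_query_demo/out > hn_query_demo/in_file

# Among them, only the hostnames that occur in at least one other file.
grep -F "$file" hn_query_demo/out | awk '$1 > 1 { print $2 }' > hn_query_demo/shared

cat hn_query_demo/shared
```

Here the first query keeps the two lines mentioning "HNs.bst_.lt_.txt", and the last one prints only host1.example, the single hostname that is also in another file.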
