Hmmm. We seem both to be writing at once ...

Magic Banana is saying:

Quoting amenex:
> I want to guard against double-counting, as with 01j01.txt or 01j02.txt vs 02j01.txt, and that requires
> some heavy-duty concentration.

>> "My" solution (since my first post in this thread) joins one file with all the other files. Not pairwise.
>> There is nothing to concatenate at the end.

amenex again:
> I have a script that does a nice job of grouping the duplicated hostnames, but it won't separate them with
> blank lines ... (yet).

>> "My" solution (since my first post in this thread) outputs the hostnames in order. They are already grouped. >> To prepend them with blank lines, the output of every join can be piped to:
>>> awk '$1 != p { p = $1; print "" } { print }'

>> However, I believe I have finally understood the whole task and I do not see much point in having the
>> repetitions on several lines (uselessly repeating the hostname). AWK can count the number of other files
>> where the hostname is found, print that count, the hostname (once) and the rest (the number and the file
>> name). 'sort' can then sort in decreasing order of count. The whole solution is:
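(The quoted solution itself did not come through above; the following is only a rough sketch of the counting idea Magic Banana describes, assuming each two-digit NN.txt file holds a number in column 1 and a hostname in column 2:)

awk '!seen[FILENAME, $2]++ { nfiles[$2]++ }          # count each file at most once per hostname
     { detail[$2] = detail[$2] "  " FILENAME ":" $1 } # keep the file name and its number
     END { for (h in nfiles) print nfiles[h], h detail[h] }' ??.txt |
  sort -rn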

amenex:
I'll try that later ... right now I'm wondering whether the problem may be analyzed another way: simply concatenate all the Recent Visitors into one [huge] file while retaining each hostname's association with the domains' Webalizer data, then group the Recent Visitor hostnames according to the quantities of their occurrences, and thereafter discard the smallest numbers of duplicate hostnames. The data total 39 MB.

With the current directory set to the one in which the numerically coded two-column hostname lists reside:
> time awk '{print FILENAME"\t"$0}' 01.txt 02.txt 03.txt 04.txt 05.txt 06.txt 07.txt 08.txt 09.txt 10.txt 11.txt 12.txt 13.txt 14.txt 15.txt 16.txt 17.txt 18.txt 19.txt 20.txt 21.txt 22.txt 23.txt 24.txt 25.txt 26.txt 27.txt 28.txt 29.txt 30.txt 31.txt 32.txt 33.txt 34.txt 35.txt 36.txt 37.txt 38.txt 39.txt 40.txt 41.txt 42.txt 43.txt 44.txt 45.txt > Joins/ProcessedVisitorLists/FILENAME.txt ... 46.2 MB; 1,038,048 rows (0.067 sec.)
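(For what it's worth, the same tagging step can be written without spelling out all 45 file names, using a bash brace expansion; this sketch assumes bash 4 or later, which keeps the two-digit zero padding:)

time awk '{ print FILENAME "\t" $0 }' {01..45}.txt > Joins/ProcessedVisitorLists/FILENAME.txt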

> time sort -k3 FILENAME.txt > Sorted.txt (0.112 sec.)
> time awk 'NR >= 2 { print $1, $2, $3 }' 'Sorted.txt' | uniq --skip-fields=2 --all-repeated=none | awk '{ print $1 "\t" $2 "\t" $3}' > Duplicates.txt ... 7.0 MB; 168,976 rows (0.093 sec.)
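(If the intermediate Sorted.txt is not needed for anything else, the two steps above can in principle be chained into a single pipeline with the same logic:)

time sort -k3 FILENAME.txt |
  awk 'NR >= 2 { print $1, $2, $3 }' |
  uniq --skip-fields=2 --all-repeated=none |
  awk '{ print $1 "\t" $2 "\t" $3 }' > Duplicates.txt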

Forgive me for my use of the unsophisticated script ... the groups are in the appropriate bunches, but the bunches are in alphabetical order, and all the original domain data are still present. Sure beats grep, though ...

Print only the hostname column: > time awk '{ print $3 }' 'Duplicates02.txt' > CountsOrder.txt (now 5.5 MB; 0.016 sec.)

Now do the counting step: > time uniq -c CountsOrder.txt > OrderCounts.txt ... back up to 1.1 MB; 0.009 sec

Finally, sort them according to count frequency: > time sort -rg OrderCounts.txt > SortedByFrequency.txt ... still 1.1 MB; 0.003 sec.
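(Those three steps, isolating the hostname column, counting with uniq -c, and ranking with sort, can also be run as one pipeline; this sketch assumes Duplicates02.txt is the grouped, tab-separated duplicates file from above, so uniq -c sees identical hostnames on adjacent lines. Plain sort -rn is enough here since the counts are integers, though -rg works as well:)

time awk '{ print $3 }' Duplicates02.txt | uniq -c | sort -rn > SortedByFrequency.txt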

Truncate to include only counts greater than 2: > SortedByFrequencyGT2.txt 714 KB (attached)
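(The truncation command itself isn't shown; one way to keep only counts greater than 2 would be a simple awk filter on the count field produced by uniq -c:)

awk '$1 > 2' SortedByFrequency.txt > SortedByFrequencyGT2.txt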

There are a lot of high-count repetitions.

