Hmmm. We seem both to be writing at once ...

Magic Banana is saying:

Quoting amenex:
> I want to guard against double-counting, as with 01j01.txt or 01j02.txt vs 02j01.txt, and that requires
> some heavy-duty concentration.

>> "My" solution (since my first post in this thread) joins one file with all the other files. Not pairwise.
>> There is nothing to concatenate at the end.

amenex again:
> I have a script that does a nice job of grouping the duplicated hostnames, but it won't separate them with
> blank lines ... (yet).

>> "My" solution (since my first post in this thread) outputs the hostnames in order. They are already grouped. >> To prepend them with blank lines, the output of every join can be piped to:
>>> awk '$1 != p { p = $1; print "" } { print }'

>> However, I believe I have finally understood the whole task and I do not see much point in having the
>> repetitions on several lines (uselessly repeating the hostname). AWK can count the number of other files
>> where the hostname is found, print that count, the hostname (once) and the rest (the number and the file
>> name). 'sort' can then sort in decreasing order of count. The whole solution is:
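(The quoted solution itself did not come through above; the following is only a rough sketch of the counting idea Magic Banana describes, assuming each two-digit NN.txt file holds a number in column 1 and a hostname in column 2:)

awk '!seen[FILENAME, $2]++ { nfiles[$2]++ }          # count each file at most once per hostname
     { detail[$2] = detail[$2] "  " FILENAME ":" $1 } # keep the file name and its number
     END { for (h in nfiles) print nfiles[h], h detail[h] }' ??.txt |
  sort -rn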

amenex:
I'll try that later ... right now I'm wondering whether the problem may be analyzed another way: simply concatenate all the Recent Visitors into one [huge] file while retaining each hostname's association with the domains' Webalizer data, then group the Recent Visitor hostnames according to the quantities of their occurrences, and thereafter discard the smallest numbers of duplicate hostnames. The data total 39 MB.

With the current directory set to the one in which the numerically coded two-column hostname lists reside:
> time awk '{print FILENAME"\t"$0}' 01.txt 02.txt 03.txt 04.txt 05.txt 06.txt 07.txt 08.txt 09.txt 10.txt 11.txt 12.txt 13.txt 14.txt 15.txt 16.txt 17.txt 18.txt 19.txt 20.txt 21.txt 22.txt 23.txt 24.txt 25.txt 26.txt 27.txt 28.txt 29.txt 30.txt 31.txt 32.txt 33.txt 34.txt 35.txt 36.txt 37.txt 38.txt 39.txt 40.txt 41.txt 42.txt 43.txt 44.txt 45.txt > Joins/ProcessedVisitorLists/FILENAME.txt ... 46.2 MB; 1,038,048 rows (0.067 sec.)
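(For what it's worth, the same tagging step can be written without spelling out all 45 file names, using a bash brace expansion; this sketch assumes bash 4 or later, which keeps the two-digit zero padding:)

time awk '{ print FILENAME "\t" $0 }' {01..45}.txt > Joins/ProcessedVisitorLists/FILENAME.txt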

> time sort -k3 FILENAME.txt > Sorted.txt (0.112 sec.)
> time awk 'NR >= 2 { print $1, $2, $3 }' 'Sorted.txt' | uniq --skip-fields=2 --all-repeated=none | awk '{ print $1 "\t" $2 "\t" $3}' > Duplicates.txt ... 7.0 MB; 168,976 rows (0.093 sec.)
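(If the intermediate Sorted.txt is not needed for anything else, the two steps above can in principle be chained into a single pipeline with the same logic:)

time sort -k3 FILENAME.txt |
  awk 'NR >= 2 { print $1, $2, $3 }' |
  uniq --skip-fields=2 --all-repeated=none |
  awk '{ print $1 "\t" $2 "\t" $3 }' > Duplicates.txt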

Forgive me for my use of the unsophisticated script ... the groups are in the appropriate bunches, but the bunches are in alphabetical order, and all the original domain data are still present. Sure beats grep, though ...

Print only the hostname column: > time awk '{ print $3 }' 'Duplicates02.txt' > CountsOrder.txt (now 5.5 MB; 0.016 sec.)

Now do the counting step: > time uniq -c CountsOrder.txt > OrderCounts.txt ... back up to 1.1 MB; 0.009 sec

Finally, sort them according to count frequency: > time sort -rg OrderCounts.txt > SortedByFrequency.txt ... still 1.1 MB; 0.003 sec.
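(Those three steps, isolating the hostname column, counting with uniq -c, and ranking with sort, can also be run as one pipeline; this sketch assumes Duplicates02.txt is the grouped, tab-separated duplicates file from above, so uniq -c sees identical hostnames on adjacent lines. Plain sort -rn is enough here since the counts are integers, though -rg works as well:)

time awk '{ print $3 }' Duplicates02.txt | uniq -c | sort -rn > SortedByFrequency.txt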

Truncate to include only counts greater than 2: > SortedByFrequencyGT2.txt 714 KB (attached)
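(The truncation command itself isn't shown; one way to keep only counts greater than 2 would be a simple awk filter on the count field produced by uniq -c:)

awk '$1 > 2' SortedByFrequency.txt > SortedByFrequencyGT2.txt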

There are a lot of high-count repetitions.

