On Thu, Jan 14, 2010 at 10:47 AM, Magnus Manske
<magnusman...@googlemail.com> wrote:
> Suggestion :
> * log search and SHA1 IP hash (anonymous!)

*Any* mapping of the IP is not anonymous. Please see the AOL search
results, where unique IDs were connected between searches to disclose
information.  (Moreover, a straight simple hash of an IP can be
reversed simply by making a table of the hashes of all expected IPs.)
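
To illustrate the table attack, here is a minimal Python sketch. The
function names and the 192.0.2.0/24 documentation range are my own
assumptions for the example; a real attacker would simply enumerate a
much larger range the same way.

```python
import hashlib
import ipaddress


def sha1_ip(ip):
    """Unsalted SHA1 of the dotted-quad string (the scheme being criticized)."""
    return hashlib.sha1(ip.encode("ascii")).hexdigest()


def build_reverse_table(network):
    """Map hash -> IP for every address in the given network."""
    return {sha1_ip(str(addr)): str(addr)
            for addr in ipaddress.ip_network(network)}


# Hypothetical example: the "anonymous" hash found in a log...
logged_hash = sha1_ip("192.0.2.77")

# ...is recovered with a plain dictionary lookup.
table = build_reverse_table("192.0.2.0/24")
recovered = table.get(logged_hash)
```

The whole IPv4 space is only about 4.3 billion addresses, so the same
lookup table is feasible for every possible IP, not just one subnet.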

However: Since this is just for internal logging there is no need to
hash the IP.  Just log it directly, and thus avoid the risk that
someone later will think the hash is something which can be disclosed.


> * search queries are logged in a standardized fashion (for grouping),
> e.g. lowercase, single spaces, no leading/trailing spaces, special
> chars converted to spaces, etc.

Excellent.
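
A rough sketch of that normalization in Python, assuming "special
chars" means anything that is not a letter, digit, or whitespace (the
underscore is treated as special too, which is my own choice; the name
normalize_query is hypothetical):

```python
import re


def normalize_query(query):
    """Lowercase, turn special characters into spaces, collapse whitespace."""
    lowered = query.lower()
    # Anything that is not a word character or whitespace, plus "_",
    # becomes a space.
    no_special = re.sub(r"[^\w\s]|_", " ", lowered)
    # split() with no arguments drops leading/trailing whitespace and
    # collapses runs of whitespace, so joining yields single spaces.
    return " ".join(no_special.split())
```

With this, "  Hello,   World! " and "hello world" fall into the same
group.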

> * display searches per week (?) that have been searched for at least
> 10 times from at least 5 different IP hashes (to avoid people
> searching their own name 100 times...)

What I've suggested elsewhere was at least 4 different IPs; 5 sounds
fine to me too.  I don't know that the minimum of 10 queries matters
once the 5-IP check is in place.

Per week would be okay. No shorter though.
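
The thresholding could look something like the sketch below. It assumes
the log has already been parsed into (date, ip, normalized_query)
tuples, which is a made-up format since none has been specified yet,
and it groups by ISO week. The function and parameter names are
hypothetical.

```python
from collections import defaultdict
from datetime import date


def popular_queries(records, min_ips=5, min_hits=10):
    """Return sorted ((iso_year, iso_week), query) keys that pass both thresholds.

    records: iterable of (date, ip, normalized_query) tuples (assumed format).
    """
    hits = defaultdict(int)   # key -> number of searches
    ips = defaultdict(set)    # key -> distinct source IPs
    for day, ip, query in records:
        iso = day.isocalendar()
        key = ((iso[0], iso[1]), query)
        hits[key] += 1
        ips[key].add(ip)
    return sorted(key for key in hits
                  if hits[key] >= min_hits and len(ips[key]) >= min_ips)


# Hypothetical data: one query from 5 IPs, one repeated 100 times by one IP.
records = []
for n in range(5):
    ip = "192.0.2.%d" % n
    records.append((date(2010, 1, 11), ip, "wikipedia"))
    records.append((date(2010, 1, 14), ip, "wikipedia"))
records += [(date(2010, 1, 14), "198.51.100.9", "john doe")] * 100

result = popular_queries(records)
```

Setting min_hits=1 leaves only the distinct-IP check, which may well be
all that is needed. The raw IPs stay internal here; only the query
strings that pass the thresholds would be displayed.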


If someone gives me a log format, I'll gladly write a fast tool for
producing this output.
(I did something like that before where I gave Brion a tool to produce
stats from access logs)

I think I have C code for a parser for Wikimedia's squid logs... so
if it's just that, I already have a good chunk of it done.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l