On Thu, Jan 14, 2010 at 10:47 AM, Magnus Manske <magnusman...@googlemail.com> wrote:
> Suggestion :
> * log search and SHA1 IP hash (anonymous!)

*Any* mapping of the IP is not anonymous. Please see the AOL search results, where unique IDs linked across searches were enough to disclose information. (Moreover, a straight simple hash of an IP can be reversed simply by making a table of the hashes of all expected IPs.)

However: since this is just for internal logging, there is no need to hash the IP. Just log it directly, and thus avoid the risk that someone will later think the hash is something which can be disclosed.

> * search queries are logged in a standardized fashion (for grouping),
> e.g. lowercase, single spaces, no leading/trailing spaces, special
> chars converted to spaces, etc.

Excellent.

> * display searches per week (?) that have been searched for at least
> 10 times from at least 5 different IP hashes (to avoid people
> searching their own name 100 times...)

What I've suggested elsewhere was at least 4 different IPs; 5 sounds fine to me too. I don't know that the minimum of 10 queries matters once the 5 IP check is in place. Per week would be okay. No shorter, though.

If someone gives me a log format, I'll gladly write a fast tool for producing this output. (I did something like that before, when I gave Brion a tool to produce stats from access logs.) I think I have C code for a parser for Wikimedia's squid logs... so if it's just that, I already have a good chunk of it done.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
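
For illustration only, here is a minimal Python sketch of the query normalization and the distinct-IP threshold discussed above. No log format has been agreed on in this thread, so the input shape (an iterable of `(ip, raw_query)` pairs covering one week) is an assumption, and the names `normalize_query`, `popular_queries`, and `MIN_DISTINCT_IPS` are hypothetical. It applies only the distinct-IP check and leaves out the separate minimum query count.

```python
import re
from collections import defaultdict

# Threshold from the thread (5 distinct IPs; 4 was also floated).
MIN_DISTINCT_IPS = 5


def normalize_query(query: str) -> str:
    """Lowercase, convert special characters to spaces, collapse whitespace."""
    lowered = query.lower()
    # Anything that is not a letter, digit, or whitespace becomes a space.
    # "_" is listed separately because \w would otherwise keep it.
    cleaned = re.sub(r"[^\w\s]|_", " ", lowered)
    # split() with no argument drops leading/trailing and repeated whitespace.
    return " ".join(cleaned.split())


def popular_queries(records, min_ips=MIN_DISTINCT_IPS):
    """Return normalized queries seen from at least `min_ips` distinct IPs.

    `records` is an assumed input shape: (ip, raw_query) pairs for one week.
    """
    ips_by_query = defaultdict(set)
    for ip, raw_query in records:
        query = normalize_query(raw_query)
        if query:  # skip queries that normalize to nothing
            ips_by_query[query].add(ip)
    return sorted(q for q, ips in ips_by_query.items() if len(ips) >= min_ips)


if __name__ == "__main__":
    # Five different IPs search variants of one query; one IP repeats its own.
    week = [(f"192.0.2.{i}", "Foo   BAR!") for i in range(5)]
    week += [("192.0.2.99", "my own name")] * 100
    print(popular_queries(week))
```

The raw IP is used only as a set member while counting and never appears in the output, which matches the point that the IP is for internal logging only.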