On Thu, Jan 14, 2010 at 11:01 AM, David Gerard <dger...@gmail.com> wrote:
> 2010/1/14 Bryan Tong Minh <bryan.tongm...@gmail.com>:
>> On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske
>> <magnusman...@googlemail.com> wrote:
>
>>> * log search and SHA1 IP hash (anonymous!)
>
>> There are only 2 billion unique addresses and they can all be found in
>> half an hour probably.
>
>
> A count of search terms, with no IP info at all? Would be more useful
> than nothing.
>
> (modulo the issue Michael Snow raised re: searches on suppressable names)

Magnus was not suggesting disclosing the IP hash, as far as I can
tell. He was demonstrating an abundance of caution in suggesting only
logging that. (Er, well, yeah, if he was suggesting disclosing that...
we shouldn't do that.  Even if we add a secret to the hash, it's risky
and allows interesting correlation attacks.)
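
To make Bryan's point above concrete, here is a rough sketch of why an
unsalted hash of an IP is not anonymous: the input space is small
enough to simply enumerate. It assumes the hash is SHA1 over the
dotted-quad string, which is my guess at the format, and it only walks
one small network so that it runs instantly. sha1_ip and recover_ip
are names made up for this example.

```python
import hashlib
import ipaddress


def sha1_ip(ip):
    # Assumption: the log would store SHA1 of the dotted-quad text form.
    return hashlib.sha1(ip.encode("ascii")).hexdigest()


def recover_ip(target_hash, network):
    """Return the address in `network` whose SHA1 equals target_hash, or None."""
    # Iterating an ip_network yields every address in it, in order.
    for addr in ipaddress.ip_network(network):
        if sha1_ip(str(addr)) == target_hash:
            return str(addr)
    return None


if __name__ == "__main__":
    logged = sha1_ip("192.0.2.77")              # what a "hashed" log line would hold
    print(recover_ip(logged, "192.0.2.0/24"))   # walks 256 candidates
```

The same loop over the whole IPv4 space is just more iterations, and
the resulting table only has to be built once.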


Here is what I would suggest disclosing:

#start_datetime end_datetime hits search_string
2010-01-01-00:00:04 2010-01-13-23:59:50 39284 naked people
2010-01-01-00:00:04 2010-01-13-23:59:50 23950 hot grits
...
2010-01-01-00:00:04 2010-01-13-23:59:50 5 autoerotic quantum chromodynamics


Which has first been filtered by:
* Canonicalization of strings (at least ASCII case folding)
* Excluding strings over some length
* Excluding searches which did not come from at least 5 distinct IPs
during the reporting interval
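
A minimal sketch of that filtering, assuming the raw log for one
reporting interval is available as (ip, query) pairs. MAX_LEN is an
arbitrary placeholder since I haven't picked a length limit, the
whitespace collapsing is an extra I added on top of case folding, and
canonicalize and summarize are made-up names, not existing code.

```python
from collections import defaultdict

MAX_LEN = 64   # placeholder cutoff; the actual limit is undecided
MIN_IPS = 5    # a string must come from at least this many distinct IPs


def canonicalize(query):
    # Case folding (casefold also handles non-ASCII) plus whitespace
    # collapsing, which is an added assumption.
    return " ".join(query.casefold().split())


def summarize(log):
    """log: iterable of (ip, query) pairs from one reporting interval.

    Returns (hits, search_string) rows, most frequent first. IPs are used
    only for the distinct-IP threshold and never appear in the output.
    """
    hits = defaultdict(int)
    ips = defaultdict(set)
    for ip, query in log:
        q = canonicalize(query)
        if not q or len(q) > MAX_LEN:
            continue
        hits[q] += 1
        ips[q].add(ip)
    rows = [(hits[q], q) for q in hits if len(ips[q]) >= MIN_IPS]
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


if __name__ == "__main__":
    log = [("10.0.0.%d" % i, "Hot Grits") for i in range(5)]
    log += [("10.0.0.1", "hot  grits"), ("10.0.0.2", "rare typo")]
    for count, query in summarize(log):
        print(count, query)
```

Prefixing each row with the interval's start and end times gives the
disclosure format shown earlier.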



There will be useful information excluded by this process, e.g. gads
of misspellings which came from only two to four unique IPs... but the
output would still be *far* more useful than no information at all.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
