I'm not sure where the personal info is leaked. We aren't proposing to make available who made the query, just what the query is, and I suspect the IP info, etc. could be stripped fairly easily. So we wouldn't necessarily know who is searching for "Yonik Seeley" when we see that query term, just that it was searched for. Maybe we can ask infrastructure what is even available first. The other question is whether the ASF has a disclaimer about information being logged, etc. For instance, all emails to public mailing lists are considered public.

At any rate, I think the bigger issue is finding a good set of data and query logs that we can use. An alternate way is to just start creating a query set based on the Wikipedia data, but that isn't as "real world" as query logs are.

Here's another possible thought: what if we took our own java-user mailing list for a time period and used the subject line or some other piece of info in the text as queries? Maybe we can automatically identify questions (not hard to do for simple cases: just identify sentences ending in "?", which would give us enough, methinks) and treat them as queries. This may be a decent approximation of a user's information need, it probably wouldn't be all that hard to crank out, and it has the nice feature that the user has consented to make the information available.
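
To make the idea concrete, here is a rough Python sketch of the "simple case": take the subject line plus any sentence ending in "?" from the unquoted part of a message. The function name candidate_queries, the three-word minimum, and the assumption that each message is a single-part plain-text email are all mine, purely for illustration. The sentence splitting is deliberately naive (abbreviations like "e.g." will cut a sentence short).

```python
import re
from email import message_from_string

# Naive "sentence ending in ?" pattern: a run of characters without
# sentence punctuation, followed by a question mark.
QUESTION_RE = re.compile(r"[^.!?]*\?")

# Leading "Re:" / "Fwd:" prefixes that should not be part of a query.
REPLY_PREFIX_RE = re.compile(r"^(?:\s*(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def candidate_queries(raw_message, min_words=3):
    """Return the subject line and question sentences of one message.

    Assumes a single-part, plain-text message (hypothetical simplification).
    """
    msg = message_from_string(raw_message)
    queries = []

    subject = REPLY_PREFIX_RE.sub("", msg.get("Subject", "")).strip()
    if subject:
        queries.append(subject)

    body = msg.get_payload()
    if isinstance(body, str):
        # Drop quoted lines so we only see what this poster actually asked.
        own_lines = [
            line.strip()
            for line in body.splitlines()
            if not line.lstrip().startswith(">")
        ]
        text = " ".join(own_lines)
        for match in QUESTION_RE.finditer(text):
            question = match.group().strip()
            # Skip fragments like "Right?" that carry no information need.
            if len(question.split()) >= min_words:
                queries.append(question)
    return queries
```

Running this over a month or two of the list archive would give a first-cut query set; someone would still want to eyeball the output before using it for relevance judgments.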

Of course, we could see if there is a way to purchase the TREC data (donations, anyone?) and make it available to committers on zones. This is about the only legal way to do this, but to me is less than satisfactory as it doesn't allow much innovation from other contributors. See http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022 for that discussion.

-Grant


On Nov 19, 2007, at 1:46 PM, Chris Hostetter wrote:


: > report of (querystring,accesscount)->url mappings based on requests that
: > had a major search engine as the refer URL, that should be fine right?
:
: Query strings can leak personal info too (think of someone googling
: themselves or their SSN)

right ... i'm not suggesting we do this in an automatic un-human-involved
way; i'm suggesting that a "trusted" person generate this report,
ignore anything with a count less than some number (both to remove noise,
and eliminate most of the random "identifiable" queries), and then
manually remove anything that looks "personal"
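
A rough Python sketch of the automated half of that report, for illustration only: count (query string, url) pairs from requests referred by a search engine and drop anything below a threshold. The host list, the "q" parameter name, the (referrer, url) input shape, and the default threshold are all assumptions, not anything known about the actual server logs. The manual review step still has to happen on the output.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Assumed set of search-engine hosts; a real list would be longer.
SEARCH_HOSTS = {"www.google.com", "search.yahoo.com"}


def query_report(entries, min_count=5, query_param="q"):
    """Build (query, count, url) rows from (referrer, url) pairs.

    Rows seen fewer than min_count times are dropped, which removes
    noise and most one-off, potentially identifying queries.
    """
    counts = Counter()
    for referrer, url in entries:
        parsed = urlparse(referrer)
        if parsed.hostname not in SEARCH_HOSTS:
            continue
        terms = parse_qs(parsed.query).get(query_param)
        if not terms:
            continue
        # Normalize case and whitespace so trivially different
        # spellings of the same query are counted together.
        query = " ".join(terms[0].lower().split())
        counts[(query, url)] += 1

    rows = [
        (query, count, url)
        for (query, url), count in counts.items()
        if count >= min_count
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    rows.sort(key=lambda row: (-row[1], row[0], row[2]))
    return rows
```

Note that the threshold here applies per (query, url) pair, which is stricter than thresholding on the query string alone.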



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



