I'm not sure where the personal info would leak. We aren't proposing to
make available who made the query, just what the query is, and I
suspect the IP info, etc. could be stripped fairly easily. So we
wouldn't necessarily know who is searching for "Yonik Seeley" when we
see that query term, just that it was searched for. Maybe we can
first ask infrastructure what is even available. The other
question is whether the ASF has a disclaimer about information being
logged, etc. For instance, all emails to public mailing lists are
considered public.
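To make the "strip the IP info" idea concrete, here is a minimal sketch.
It assumes the log has already been parsed into (ip, query) pairs; that
layout, the function name, and the min_count parameter are all
hypothetical, since we don't yet know what infrastructure actually has.
The min_count cutoff reflects the suggestion in Hoss's message quoted
below to ignore low-count queries.

```python
from collections import Counter


def anonymized_query_counts(records, min_count=1):
    """Return {query: count}, discarding who issued each query.

    records is assumed to be an iterable of (ip, query) pairs.
    The IP is never stored or returned.
    """
    counts = Counter(
        # Lowercase and collapse whitespace so trivially different
        # spellings of the same query are counted together.
        " ".join(query.lower().split())
        for _ip, query in records
    )
    # Drop rare queries: this removes noise and most one-off,
    # potentially identifying searches.
    return {q: n for q, n in counts.items() if n >= min_count}
```

Even with a cutoff, someone would still need to eyeball the result for
anything that looks personal before it is shared.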
At any rate, I think the bigger issue is finding a good set of data
and query logs that we can use. An alternative is to just start
creating a query set based on the Wikipedia data, but that isn't as
"real world" as query logs are.
Here's another possible thought: What if we took our own java-user
mailing list for a time period and used the subject line, or some
other piece of info in the text, and treated them as queries? Maybe we
can automatically identify questions. That's not hard to do for simple
cases (just identify sentences ending in ?), which would give us
enough, methinks. This may be a decent approximation of a user's
information need, it probably wouldn't be all that hard to crank out,
and it has the nice feature that the user has consented to make the
information available.
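A rough sketch of the simple case, assuming we already have each
message body as a plain string. The function name and the rule for
skipping quoted lines (anything starting with ">" or ":") are my own
assumptions, not an existing tool:

```python
import re


def extract_questions(body):
    """Return the sentences in a message body that end with '?'."""
    # Skip quoted reply lines so we only keep the poster's own words.
    own_lines = [
        line for line in body.splitlines()
        if not line.lstrip().startswith((">", ":"))
    ]
    # Re-join wrapped lines and collapse runs of whitespace.
    text = " ".join(" ".join(own_lines).split())
    # Naive sentence split: break after '.', '!' or '?' followed by space.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if s.endswith("?")]
```

This will miss questions that don't end in "?" and will split badly on
abbreviations, but it should be enough to bootstrap a candidate query
set that a human can then prune.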
Of course, we could see if there is a way to purchase the TREC data
(donations, anyone?) and make it available to committers on zones.
That is about the only legal way to do it, but to me it is less than
satisfactory, as it doesn't allow much innovation from other
contributors. See http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022
for that discussion.
-Grant
On Nov 19, 2007, at 1:46 PM, Chris Hostetter wrote:
: > report of (querystring,accesscount)->url mappings based on requests
: > that had a major search engine as the refer URL, that should be fine
: > right?
:
: Query strings can leak personal info too (think of someone googling
: themselves or their SSN)
right ... i'm not suggesting we do this in an automatic
un-human-involved way; i'm suggesting that a "trusted" person generate
this report, ignore anything with a count less than some number (both
to remove noise, and eliminate most of the random "identifiable"
queries), and then manually remove anything that looks "personal"
-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]