Re: Apache logs and data

Grant Ingersoll Mon, 19 Nov 2007 12:15:38 -0800

I'm not sure where the personal info is leaked, we aren't proposing to make who made the query available, just what the query is and I suspect the IP info, etc. could be stripped fairly easily. So, we wouldn't necessarily know who is searching for "Yonik Seeley" when we see that query term, just that it was searched for. Maybe we can inquire to infrastructure what is even available first. The other question is if ASF has a disclaimer about information being logged, etc. For instance, all emails to public mailing lists are considered public.

At any rate, I think the bigger issue is finding a good set of data and query logs that we can use. An alternate way is to just start creating a query set based on the Wikipedia data, but that isn't as "real world" as query logs are.

Here's another possible thought: What if we took our own java-user mailing list for a time period and we used the subject line or some other piece of info in the text (maybe we can automatically identify questions (not hard to do for simple cases (just identify sentences ending in ?)), which would give us enough, methinks) and treat them as queries? This may be a decent approximation of a user's information need and probably wouldn't be all that hard to crank out and it has the nice feature that the user has consented to make the information available.
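
As a rough illustration of the "sentences ending in ?" idea, here is a minimal Python sketch. The function name extract_questions, the rule for skipping quoted reply lines, and the naive sentence splitter are all assumptions made for this example; nothing in the thread specifies them, and real mail would need more careful handling of signatures, code snippets, and URLs.

```python
import re
from typing import List

# Naive sentence boundary: whitespace that follows ".", "!" or "?".
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def extract_questions(body: str) -> List[str]:
    """Return the sentences in a message body that end with a question mark.

    Hypothetical helper: lines that start with ">" or ":" are treated as
    quoted reply text and skipped, so only the poster's own words count.
    """
    own_lines = [
        line for line in body.splitlines()
        if not line.lstrip().startswith((">", ":"))
    ]
    # Collapse line breaks and repeated spaces before splitting into sentences.
    text = " ".join(" ".join(own_lines).split())
    sentences = _SENTENCE_BOUNDARY.split(text)
    return [s.strip() for s in sentences if s.strip().endswith("?")]
```

Each returned sentence could then serve as a candidate query, possibly alongside the message's subject line.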

Of course, we could see if there is a way to purchase the TREC data (donations, anyone?) and make it available to committers on zones. This is about the only legal way to do this, but to me is less than satisfactory as it doesn't allow much innovation from other contributors. See http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022 for that discussion.


-Grant


On Nov 19, 2007, at 1:46 PM, Chris Hostetter wrote:

: > report of (querystring,accesscount)->url mappings based on requests that
: > had a major search engine as the refer URL, that should be fine right?
:
: Query strings can leak personal info too (think of someone googling
: themselves or their SSN)
right ... i'm not suggesting we do this in an automatic un-human-involved
way; i'm suggesting that a "trusted" person generate this report,
ignore anything with a count less than some number (both to remove noise,
and eliminate most of the random "identifiable" queries), and then
manually remove anything that looks "personal"
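
A minimal Python sketch of the counting and thresholding part of such a report follows. It assumes the logs are in Apache's "combined" format; the SEARCH_PARAMS table, the query_report name, and the default min_count are illustrative assumptions, not anything stated in this thread. The client IP is never captured, and the manual review for "personal" queries would still have to be done by a person afterwards.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Assumes Apache "combined" log format. Only the request path and the
# Referer header are captured; the client IP is never read.
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<url>\S+)[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)

# Illustrative assumption: referrer hosts treated as search engines,
# mapped to the URL parameter that carries the search terms.
SEARCH_PARAMS = {"www.google.com": "q", "search.yahoo.com": "p"}


def query_report(lines, min_count=5):
    """Count (query, url) pairs from search-engine referrers.

    Pairs seen fewer than min_count times are dropped, which removes
    noise and most one-off, potentially identifying queries.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        referer = urlparse(match.group("referer"))
        param = SEARCH_PARAMS.get(referer.netloc.lower())
        if param is None:
            continue  # not a known search-engine referrer
        values = parse_qs(referer.query).get(param)
        if not values:
            continue
        # Normalize case and whitespace so equivalent queries are merged.
        query = " ".join(values[0].lower().split())
        counts[(query, match.group("url"))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

The threshold only reduces the chance of rare personal queries showing up; it does not replace the human pass described above.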



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




