Re: [Wiki-research-l] [Wikimedia-l] wikipedia access traces ?

Oliver Keyes Sat, 20 Sep 2014 10:09:06 -0700

Given that the request logs aren't transparent about which cached version
of a page is being provided, I'm finding it pretty difficult to see how
they'd help you answer interesting questions here :/.


On 20 September 2014 04:02, Pine W <wiki.p...@gmail.com> wrote:

> A few more thoughts:
>
> * You probably don't need the full URLs of the content being accessed, so
> those could be anonymized and replaced with random identifiers to some
> degree, right?
>
> * Someone might be able to monitor the user's end of the transactions,
> such as by having university network logs that show destination domains and
> timestamps, in such a way that they could pair the university logs with
> Wikimedia access traces of one second granularity and thus defeat some
> measures of privacy for the university's Wikimedia users, correct?
>
> * I am not sure that the staff time required to analyze this request and
> produce the data is a good use of resources on Wikimedia's end. Toby would
> be a good person to ask about this.
>
> Pine
>  On Sep 20, 2014 12:45 AM, "Pine W" <wiki.p...@gmail.com> wrote:
>
>> Thanks for the explanation. On moderate to high traffic pages, let's say
>> with a minimum of 10 hits per minute across the entire time span studied,
>> perhaps the requested data could be provided while still providing strong
>> privacy protection. Toby might need to discuss this with WMF Legal.
>>
>> Pine
>> On Sep 19, 2014 4:57 AM, "Valerio Schiavoni" <valerio.schiav...@gmail.com>
>> wrote:
>>
>>> Hello everyone,
>>> it seems the discussion is sparking an interesting debate; thanks to
>>> everyone.
>>>
>>> To put things back in context, we use Wikipedia as one of the few
>>> websites where users can access different 'versions' of the same page.
>>> Users mostly read the most recent version of a given page, but from time
>>> to time, read accesses to the 'history' of a page happen.
>>> New versions of a page are created as well. Finally, users might
>>> potentially need to explore several old versions of a given web page, for
>>> example by accessing the details of its history[1].
>>> Access traces need to be accurate to model the workload on the servers
>>> that store the content being served by the web servers.
>>> A resolution coarser than 1 second would not reflect the access patterns
>>> on Wikipedia or similarly versioned web sites.
>>> We use these access patterns to test different version-aware storage
>>> techniques.
>>> For those interested, I could send the pre-print version of an article
>>> that I will present next month at the IEEE SRDS'14 conference.
>>>
>>> Regarding potential privacy concerns about disclosing such
>>> traces, I would like to stress that we are not looking into 'who' or from
>>> 'where' a given URL was requested. That information is completely absent
>>> from the Wikibench traces, and can/should remain so in new traces.
>>>
>>> Let's say Wikipedia somehow reveals the top-10 most-visited pages in the
>>> last minute: would that represent a privacy breach for some users? I highly
>>> doubt it, and I invite the audience to convince me otherwise.
>>>
>>> Best regards,
>>> Valerio
>>>
>>> 1- For example:
>>> http://it.wikipedia.org/w/index.php?title=George_W._Bush&action=history
>>>
>>> On Fri, Sep 19, 2014 at 8:36 AM, Pine W <wiki.p...@gmail.com> wrote:
>>>
>>>> Let's loop back to the request at hand. Valerio, can you describe your
>>>> use case for access traces at intervals shorter than one hour? The very
>>>> likely outcome of this discussion is that the access traces at shorter
>>>> intervals will not be made available, but I'm curious about what you would
>>>> do with the data if you had it.
>>>>
>>>> Pine
>>>> On Sep 18, 2014 4:55 PM, "Richard Jensen" <rjen...@uic.edu> wrote:
>>>>
>>>>> The basic issue in sampling is to decide what the target population T
>>>>> actually is. Then you weight the sample so that each person in the target
>>>>> population has an equal chance w and people not in it have weight zero.
>>>>>
>>>>> So what is the target population we want to study?
>>>>> --the world's population?
>>>>> --the world's educated population?
>>>>> --everyone with internet access?
>>>>> --everyone who ever uses Wikipedia?
>>>>> --everyone who uses it a lot?
>>>>> --everyone who has knowledge to contribute in positive fashion?
>>>>> --everyone who has the internet, skills and potential to contribute?
>>>>> --everyone who has the potential to contribute but does not do so?
>>>>>
>>>>> Richard Jensen
>>>>> rjen...@uic.edu
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Wiki-research-l mailing list
>>>>> Wiki-research-l@lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>>>


-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
