[Analytics] Farewell, Erik!

2019-02-06 Thread Dario Taraborelli
“[R]ecent revisions of an article can be peeled off to reveal older layers,
which are still meaningful for historians. Even graffiti applied by vandals
can by its sheer informality convey meaningful information, just like
historians learned a lot from graffiti on walls of classic Pompei. Likewise
view patterns can tell future historians a lot about what was hot and what
wasn’t in our times. Reason why these raw view data are meant to be
preserved for a long time.”

Erik Zachte wrote these lines in a blog post
<https://web.archive.org/web/20171018194720/http://infodisiac.com/blog/2009/07/michael-jackson/>
almost
ten years ago, and I cannot find better words to describe the gift he gave
us. Erik retired <http://infodisiac.com/back_to_volunteer_mode.htm> this
past Friday, leaving behind an immense legacy. I had the honor of working
with him for several years, and this morning I hosted an intimate, tearful
celebration of what Erik has represented for the Wikimedia movement.

His Wikistats project <https://stats.wikimedia.org/>—with its signature
pale yellow background we've known and loved since the mid-2000s
<https://web.archive.org/web/20060412043240/https://stats.wikimedia.org/>—has
been much more than an "analytics platform". It has been an individual
effort that Erik initiated and grew over time to comprehend and make
sense of the largest open collaboration project in human history, driven by
curiosity and by an insatiable desire to serve data to the communities that
needed it most.

Through this project, Erik has created a live record of data describing the
growth and reach of all Wikimedia communities, across languages and
projects, putting multi-lingualism and smaller communities at the very
center of his attention. He coined metrics such as "active editors" that
became the benchmark by which volunteers, the Wikimedia Foundation, and the
academic community understand some of the growing pains and editor
retention issues
<https://web.archive.org/web/20110608214507/http://infodisiac.com/blog/2009/12/new-editors-are-joining-english-wikipedia-in-droves/>
the movement has faced. He created countless reports—predating modern
visualizations of online attention by nearly a decade—to understand what
Wikipedia traffic means in the context of current events like elections
<https://web.archive.org/web/20160405055621/http://infodisiac.com/blog/2008/09/sarah-palin/>
or public health crises
<https://web.archive.org/web/20090708011216/http://infodisiac.com/blog/2009/05/h1n1-flu-or-new-flu-or/>.
He has created countless
<https://twitter.com/Infodisiac/status/1039244151953543169> visualizations
<https://blog.wikimedia.org/2017/10/27/new-interactive-visualization-wikipedia/>
that show the enormous gaps in local language content and representation
that, as a movement, we face in our efforts to build an encyclopedia for
and about everyone. He has also made extensive use of pie charts
<https://web.archive.org/web/20141222073751/http://infodisiac.com/blog/wp-content/uploads/2008/10/piechartscorrected.png>,
which—as friends—we are ready to turn a blind eye to.

Most importantly, the data Erik has brought to life has been cited over
1,000 times
<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=stats.wikimedia.org>
in the scholarly literature. If we gave credit to open data creators in the
same way we credit authors of scholarly papers, Erik would be one of the
most influential authors in the field. I don't think it is much of a
stretch to say that the massive trove of data and metrics Erik has made
available played a direct causal role in the birth and growth of the
academic field of Wikimedia research and, more broadly, the scholarship of
online collaboration.

As I said this morning, Erik -- you have been not only an invaluable
colleague and a steward for the movement, but also a very decent human
being, and I am grateful we shared some of this journey together.

Please join me in celebrating Erik on his well-deserved retirement: read
his statement <http://infodisiac.com/back_to_volunteer_mode.htm> to learn
what he's planning to do next, or check out this lovely portrait
<https://www.wired.com/2013/12/erik-zachte-wikistats/> that Wired published
a while back about "the Stats Master Making Sense of Wikipedia's Massive
Data Trove".

Dario


-- 
*Dario Taraborelli  *Director, Head of Research, Wikimedia Foundation
research.wikimedia.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Save the date: Wiki Workshop 2019 to be hosted at The Web Conference 2019 in San Francisco (May 13-14, 2019)

2018-12-10 Thread Dario Taraborelli
Hi everyone,

We are thrilled to announce that the *6th annual Wiki Workshop* [1] will be
hosted at *The Web Conference 2019* (formerly known as WWW) in San
Francisco, CA, on May 13 or 14, 2019 [2]. The workshop provides an annual
forum for researchers exploring all aspects of Wikipedia, Wikidata, and
other Wikimedia projects to present their work. We'd love to have your
contributions, so please take a look at the details in this call:
http://wikiworkshop.org/2019/#call

Please note that *January 31, 2019* is the submission deadline if you want
your paper to appear in the (archival) conference proceedings, and *March
14, 2019* is the deadline for all other, non-archival submissions. [3]

Following last year's format, the workshop will include invited talks and a
poster session, and will offer an opportunity for participants to meet
and discuss future research directions. We look forward to receiving your
submissions and seeing you in San Francisco in May!

Best,
Dario on behalf of the organizers [4]


[image: ww19_banner_www.png]

[1] http://wikiworkshop.org/
[2] https://www2019.thewebconf.org/
[3] http://wikiworkshop.org/2019/#dates
[4] http://wikiworkshop.org/2019/#organization
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Modeling interactions on talk pages and detecting early signs of conversational failure: Research Showcase - June 18, 2018 (11:30 AM PDT| 18:30 UTC)

2018-06-18 Thread Dario Taraborelli
Hey all,

a reminder that the livestream of our monthly research showcase will start
in about 2 hours (11:30 PT / 18:30 UTC) with our collaborators from Jigsaw
and Cornell as guest speakers. You can follow the stream on YouTube:
https://www.youtube.com/watch?v=m4vzI0k4OSg and join the live Q&A on IRC in
the #wikimedia-research channel.

Looking forward to seeing you there!

Dario


On Thu, May 31, 2018 at 5:07 PM Dario Taraborelli <
dtarabore...@wikimedia.org> wrote:

> Hey everyone,
>
> we're hosting a dedicated session in June on our joint work with Cornell
> and Jigsaw on predicting conversational failure
> <https://arxiv.org/abs/1805.05345> on Wikipedia talk pages. This is part
> of our contribution to WMF's Anti-Harassment program.
>
> The showcase
> <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2018> will be
> live-streamed <https://www.youtube.com/watch?v=m4vzI0k4OSg> on *Monday,
> June 18, 2018* at 11:30 AM (PDT), 18:30 (UTC).  (Please note this falls
> on a Monday this month).
>
> Conversations Gone Awry: Detecting Early Signs of Conversational Failure
> By *Justine Zhang and Jonathan Chang, Cornell University*
> One of the main
> challenges online social systems face is the prevalence of antisocial
> behavior, such as harassment and personal attacks. In this work, we
> introduce the task of predicting from the very start of a conversation
> whether it will get out of hand. As opposed to detecting undesirable
> behavior after the fact, this task aims to enable early, actionable
> prediction at a time when the conversation might still be salvaged. To this
> end, we develop a framework for capturing pragmatic devices—such as
> politeness strategies and rhetorical prompts—used to start a conversation,
> and analyze their relation to its future trajectory. Applying this
> framework in a controlled setting, we demonstrate the feasibility of
> detecting early warning signs of antisocial behavior in online discussions.
>
>
> Building a rich conversation corpus from Wikipedia Talk pages
> We present a
> corpus of conversations that encompasses the complete history of
> interactions between contributors to English Wikipedia's Talk Pages. This
> captures a new view of these interactions by containing not only the final
> form of each conversation but also detailed information on all the actions
> that led to it: new comments, as well as modifications, deletions and
> restorations. This level of detail supports new research questions
> pertaining to the process (and challenges) of large-scale online
> collaboration. As an example, we present a small study of removed comments
> highlighting that contributors successfully take action on more toxic
> behavior than was previously estimated.
>
> YouTube stream:  https://www.youtube.com/watch?v=m4vzI0k4OSg
>
> As usual, you can join the conversation on IRC at #wikimedia-research.
> And, you can watch our past research showcases here
> <https://www.youtube.com/playlist?list=PLhV3K_DS5YfLQLgwU3oDFiGaU3K7pUVoW>
> .
>
> Hope to see you there on June 18!
> Dario
>


-- 

*Dario Taraborelli  *Director, Head of Research, Wikimedia Foundation
research.wikimedia.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Fwd: Modeling interactions on talk pages and detecting early signs of conversational failure: Research Showcase - June 18, 2018 (11:30 AM PDT| 18:30 UTC)

2018-05-31 Thread Dario Taraborelli
Hey everyone,

we're hosting a dedicated session in June on our joint work with Cornell
and Jigsaw on predicting conversational failure
<https://arxiv.org/abs/1805.05345> on Wikipedia talk pages. This is part of
our contribution to WMF's Anti-Harassment program.

The showcase
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2018> will be
live-streamed <https://www.youtube.com/watch?v=m4vzI0k4OSg> on *Monday,
June 18, 2018* at 11:30 AM (PDT), 18:30 (UTC).  (Please note this falls on
a Monday this month).

Conversations Gone Awry: Detecting Early Signs of Conversational Failure
By *Justine Zhang and Jonathan Chang, Cornell University*
One of the main challenges
online social systems face is the prevalence of antisocial behavior, such
as harassment and personal attacks. In this work, we introduce the task of
predicting from the very start of a conversation whether it will get out of
hand. As opposed to detecting undesirable behavior after the fact, this
task aims to enable early, actionable prediction at a time when the
conversation might still be salvaged. To this end, we develop a framework
for capturing pragmatic devices—such as politeness strategies and
rhetorical prompts—used to start a conversation, and analyze their relation
to its future trajectory. Applying this framework in a controlled setting,
we demonstrate the feasibility of detecting early warning signs of
antisocial behavior in online discussions.
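
To make the setup concrete, here is a deliberately simplified sketch (not
the authors' framework): count a few hand-picked politeness and prompt
markers in a conversation's opening comment and feed them to an
off-the-shelf linear classifier. The openers and labels below are made up
for illustration; the paper's framework captures a much richer inventory of
pragmatic devices.

    import re
    from sklearn.linear_model import LogisticRegression

    # Toy feature extractor: counts of a few politeness/prompt markers in
    # the opening comment of a conversation. Illustration only.
    MARKERS = [r"\b(thanks|thank you|appreciate)\b", r"\bplease\b", r"\?",
               r"\byou\b"]

    def features(opening_comment):
        text = opening_comment.lower()
        return [len(re.findall(p, text)) for p in MARKERS]

    # Hypothetical openers, labeled 1 if the conversation later derailed
    # into a personal attack, 0 otherwise.
    openers = [
        "Thanks for the source! Could you please add a citation?",
        "You clearly have no idea what you are doing.",
        "I appreciate the cleanup. One question about the infobox?",
        "Why do you keep reverting? You are ruining the article.",
    ]
    labels = [0, 1, 0, 1]

    clf = LogisticRegression().fit([features(o) for o in openers], labels)
    print(clf.predict([features("Please, can you explain this revert?")]))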


Building a rich conversation corpus from Wikipedia Talk pages
We present a
corpus of conversations that encompasses the complete history of
interactions between contributors to English Wikipedia's Talk Pages. This
captures a new view of these interactions by containing not only the final
form of each conversation but also detailed information on all the actions
that led to it: new comments, as well as modifications, deletions and
restorations. This level of detail supports new research questions
pertaining to the process (and challenges) of large-scale online
collaboration. As an example, we present a small study of removed comments
highlighting that contributors successfully take action on more toxic
behavior than was previously estimated.

YouTube stream:  https://www.youtube.com/watch?v=m4vzI0k4OSg

As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.youtube.com/playlist?list=PLhV3K_DS5YfLQLgwU3oDFiGaU3K7pUVoW>.

Hope to see you there on June 18!
Dario
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wiki-research-l] A new landing page for the Wikimedia Research team

2018-02-11 Thread Dario Taraborelli
Hey all,

thanks for the great feedback. A couple of notes to expand on Jonathan's
response.

On Thu, Feb 8, 2018 at 8:46 AM, Jonathan Morgan 
wrote:

> Aaron: I'll ask Baha about the issue tracking... *issue* today. The code
> is hosted on Gerrit now, with a one-way mirror on this GitHub repo[1],
> which is not ideal from an openness/collaboration POV. For me, enabling
> easy issue tracking and pull requests is the most pressing issue. In the
> meantime, you can submit tasks through Phab. Add them to the Research
> board[2] and/or as subtasks of our Landing Page creation epic[3]. Not
> ideal, but at least you can capture things this way.
>

this is far from optimal. Due to production requirements, all code needs to
be on Gerrit, but asking people who want to suggest typo fixes to go
through the developer access instructions is a usability nightmare.
Jonathan's suggestion is a temporary solution; I'd like to work with Baha
to figure out if there's a possible workflow that allows us to receive PRs
and issues on GitHub and have them synced with Gerrit, where they are
reviewed and, if +2'ed, merged. This may take a while, so we
appreciate your patience.


> Federico: Translation via translatewiki would be very cool. We haven't
> prioritized this because, well, none of our on-wiki research team pages
> were ever translated, and this microsite is intended to supplement our
> on-wiki content, not replace it. But it sounds like a potential 'roadmap'
> kinda deal and I'll make sure to track it.
>



Our assumption was that the place for volunteer communities to find
translated content is (and should be) on wiki, and we can tap all the
existing workflows for translation there as needed. The main audiences for
this landing page are (primarily English speaking) funding and research
organizations who don't know how to navigate content across 4+ wikis and a
number of external data / publication repositories. I support the idea of
translations if we can make it work and if there's appetite for it, but the
minimum viable content was intentionally conceived to be in English.

> Iolanda: this is the landing page for the Wikimedia Foundation Research
> team[4], not for the international community of researchers who study
> Wiki[*]edia. It's also not the landing page for all researchers and
> research activities within the Wikimedia Foundation--just those of team
> members (and Aaron, whose Scoring Platform team is a kind of spin
> off/sibling of the research team).
>

As an additional clarification: the Research Index on Meta remains the
central hub of all research projects created by the volunteer community,
academic researchers, and Wikimedia Foundation staff. This landing page
acts as a filter, and a thin layer of discoverability, to the contributions
made by the Wikimedia Research team to the Research Index (as well as
additional documentation that may exist across other wikis). Hope that
makes sense.


> Thanks everyone for the feedback so far. Keep it coming,
>
> Jonathan
>
> 1. https://github.com/wikimedia/research-landing-page
> 2. https://phabricator.wikimedia.org/tag/research/
> 3. https://phabricator.wikimedia.org/T107389
> 4. https://www.mediawiki.org/wiki/Wikimedia_Research
>
> On Thu, Feb 8, 2018 at 8:09 AM, Aaron Halfaker 
> wrote:
>
>> Hey folks, I see you're using github[1], but you've disabled the issue
>> tracker there.  Where should I submit bug reports and feature requests?
>> Maybe you could add a link next to "source code" at the bottom of the
>> page.
>>
>> 1. https://github.com/wikimedia/research-landing-page
>>
>> On Thu, Feb 8, 2018 at 10:02 AM, Aaron Halfaker > >
>> wrote:
>>
>> > Depends on which standard.  This is not a wiki page so it won't be
>> > translatable using the on-wiki translate tools.  However, it's quite
>> > possible that we could use something like translatewiki.net.  I'm not
>> > sure if that is on the road map.  Dario, what do you think?
>> >
>> > On Thu, Feb 8, 2018 at 1:31 AM, Federico Leva (Nemo) <
>> nemow...@gmail.com>
>> > wrote:
>> >
>> >> Will it be translatable with standard tools?
>> >>
>> >> Federico
>> >>
>> >>
>> >> ___
>> >> Wiki-research-l mailing list
>> >> wiki-researc...@lists.wikimedia.org
>> >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>> >>
>> >
>> >
>> ___
>> Wiki-research-l mailing list
>> wiki-researc...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

[Analytics] A new landing page for the Wikimedia Research team

2018-02-06 Thread Dario Taraborelli
Hey all,

We're thrilled to announce the Wikimedia Research team now has a simple,
navigable, and accessible landing page, making our output, projects, and
resources easy to discover and learn about: https://research.wikimedia.org

The Research team decided to create a single go-to page (T107389
<https://phabricator.wikimedia.org/T107389>) to provide an additional way
to discover information we have on wiki, for the many audiences we would
like to engage with – particularly those who are not already familiar with
how to navigate our projects. On this page, potential academic
collaborators, journalists, funding organizations, and others will find
links to relevant resources, contact information, collaboration and
partnership opportunities, and ways to follow the team's work.

There are many more research resources produced by different teams and
departments at WMF – from Analytics, to Audiences, to Grantmaking, and
Programs. If you see anything that's missing within the scope of the
Research team, please let us know
<https://phabricator.wikimedia.org/T107389>!

Dario


-- 

*Dario Taraborelli  *Director, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wikimedia-l] Research Showcase Wednesday, January 17, 2018

2018-01-17 Thread Dario Taraborelli
Hey all,

a reminder that the livestream of our monthly research showcase starts in
45 minutes (11:30 PT)

   - Video: https://www.youtube.com/watch?v=L-1uzYYneUo
   - IRC: #wikimedia-research
   - Abstracts:
   https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#January_2018

Dario


On Tue, Jan 16, 2018 at 9:45 AM, Lani Goto  wrote:

> Hi Everyone,
>
> The next Research Showcase will be live-streamed this Wednesday, January
> 17, 2018 at 11:30 AM (PST) 19:30 UTC.
>
> YouTube stream: https://www.youtube.com/watch?v=L-1uzYYneUo
>
> As usual, you can join the conversation on IRC at #wikimedia-research. And,
> you can watch our past research showcases here.
>
> This month's presentation:
>
> *What motivates experts to contribute to public information goods? A field
> experiment at Wikipedia*
> By Yan Chen, University of Michigan
> Wikipedia is among the most important information sources for the general
> public. Motivating domain experts to contribute to Wikipedia can improve
> the accuracy and completeness of its content. In a field experiment, we
> examine the incentives which might motivate scholars to contribute their
> expertise to Wikipedia. We vary the mentioning of likely citation, public
> acknowledgement and the number of views an article receives. We find that
> experts are significantly more interested in contributing when citation
> benefit is mentioned. Furthermore, cosine similarity between a Wikipedia
> article and the expert's paper abstract is the most significant factor
> leading to more and higher-quality contributions, indicating that better
> matching is a crucial factor in motivating contributions to public
> information goods. Other factors correlated with contribution include
> social distance and researcher reputation.
>
> *Wikihounding on Wikipedia*
> By Caroline Sinders, WMF
> Wikihounding (a form of digital stalking on Wikipedia) is incredibly
> qualitative and quantitative. What makes wikihounding different from
> mentoring? It's the context of the action, or the intention. However, every
> interaction inside a digital space has a quantitative aspect to it: every
> comment, revert, etc. is a data point. By analyzing data points
> comparatively across wikihounding cases and reading some of the cases,
> we can create a baseline of the actual overlapping similarities
> across cases to study what makes up wikihounding. Wikihounding
> currently has a fairly loose definition. Wikihounding, as defined by the
> Harassment policy on en:wp, is: “the singling out of one or more editors,
> joining discussions on multiple pages or topics they may edit or multiple
> debates where they contribute, to repeatedly confront or inhibit their
> work. This is with an apparent aim of creating irritation, annoyance or
> distress to the other editor. Wikihounding usually involves following the
> target from place to place on Wikipedia.” This definition doesn't outline
> parameters around cases such as frequency of interaction, duration, or
> minimum reverts, nor is there a lot known about what a standard or
> canonical case of wikihounding looks like. What is the average wikihounding
> case? This talk will cover the approaches that I and other members of the
> research team (Diego Saez-Trumper, Aaron Halfaker, and Jonathan Morgan) are
> taking to start this research project.
>
> --
> Lani Goto
> Project Assistant, Engineering Admin
> ___
> Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/
> wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/
> wiki/Wikimedia-l
> New messages to: wikimedi...@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>
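
A side note for readers unfamiliar with the cosine-similarity measure in
Yan Chen's abstract: it scores the textual overlap between two documents.
Below is a minimal sketch with made-up texts and TF-IDF weighting (one
common way to compute it, not necessarily the study's exact pipeline).

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical article text and expert paper abstracts.
    article = "Edit histories reveal collaboration patterns among Wikipedia editors."
    abstracts = [
        "We study collaboration and editing patterns in online encyclopedias.",
        "A survey of deep reinforcement learning for robotic manipulation.",
    ]

    # Fit one vocabulary over all texts, then compare the article to each
    # abstract; scores near 1 mean a close topical match.
    tfidf = TfidfVectorizer().fit(abstracts + [article])
    print(cosine_similarity(tfidf.transform([article]),
                            tfidf.transform(abstracts)))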




-- 

*Dario Taraborelli  *Director, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Kaggle competition to forecast Wikipedia article traffic

2017-07-18 Thread Dario Taraborelli
Wanted to make sure everyone saw this challenge announced by Kaggle:

https://www.kaggle.com/c/web-traffic-time-series-forecasting
https://twitter.com/kaggle/status/887093338117201923

The timeline:


   - September 1st, 2017 - Deadline to accept competition rules.
   - September 1st, 2017 - Team Merger deadline. This is the last day
   participants may join or merge teams.
   - September 1st, 2017 - Final dataset is released.
   - September 10th, 2017 - Final submission deadline.

Competition winners will be revealed after November 10, 2017.
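
If you are tempted to enter but unsure where to start, here is a trivial
sketch of the kind of baseline this task invites (made-up numbers; a median
over a recent window is a sturdy starting point for spiky traffic series):

    import numpy as np

    # Hypothetical daily view counts for one article; the task is to
    # forecast the next days. A median over a recent window smooths out
    # one-off spikes better than a mean does.
    history = np.array([120, 95, 130, 500, 110, 105, 98, 102, 97, 115])
    forecast = np.median(history[-7:])
    print(forecast)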

Dario

-- 

*Dario Taraborelli  *Director, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Migrated Reportcard with Updated Data

2017-04-09 Thread Dario Taraborelli
Nice job, Analytics! I too am eagerly waiting for the 2.0 transition.

On Fri, Apr 7, 2017 at 3:48 PM, Toby Negrin  wrote:

> Congrats Nuria and team! This looks great and I'm super excited for
> Wikistats 2.0.
>
> -Toby
>
> On Fri, Apr 7, 2017 at 11:30 AM, Nuria Ruiz  wrote:
>
>> Hello!
>>
>> The Analytics team would like to announce that we have migrated the
>> reportcard to a new domain:
>>
>> https://analytics.wikimedia.org/dashboards/reportcard/#pageviews-july-2015-now
>>
>> The migrated reportcard includes both legacy and current pageview data,
>> daily unique devices and new editors data. Pageview and devices data is
>> updated daily but editor data is still updated ad-hoc.
>>
>> The team is working at this time on revamping the way we compute edit
>> data and we hope to be able to provide monthly updates for the main edit
>> metrics this quarter. Some of those will be visible in the reportcard but
>> the new wikistats will have more detailed reports.
>>
>> You can follow the new wikistats project here:
>> https://phabricator.wikimedia.org/T130256
>>
>> Thanks,
>>
>> Nuria
>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 

*Dario Taraborelli  *Director, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Research-wmf] Research Showcase, December 21, 2016

2016-12-21 Thread Dario Taraborelli
A reminder that the livestream will start in an hour (11:30am PT / 7:30pm
UTC): https://www.youtube.com/watch?v=nmrlu5qTgyA

If you want to learn more about perceptions of privacy and safety among Tor
users and Wikimedia contributors or are eager to know how much high-quality
content gender-focused initiatives have contributed to Wikipedia, come and
join us today (the discussion will be hosted on IRC).

Dario

On Mon, Dec 19, 2016 at 8:45 AM, Sarah R  wrote:

> Hi Everyone,
>
> The next Research Showcase will be live-streamed this Wednesday,
> December 21, 2016 at 11:30 AM (PST) 18:30 (UTC).
>
> YouTube stream: https://www.youtube.com/watch?v=nmrlu5qTgyA
>
> As usual, you can join the conversation on IRC at #wikimedia-research.
> And, you can watch our past research showcases here
> <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#December_2016>
> .
>
> The December 2016 Research Showcase includes:
>
> English Wikipedia Quality Dynamics and the Case of WikiProject Women Scientists
> By *Aaron Halfaker <https://meta.wikimedia.org/wiki/User:Halfak_(WMF)>*
> With every productive
> edit, Wikipedia is steadily progressing towards higher and higher quality.
> In order to track quality improvements, Wikipedians have developed an
> article quality assessment rating scale that ranges from "Stub" at the
> bottom to "Featured Articles" at the top. While this quality scale has the
> promise of giving us insights into the dynamics of quality improvements in
> Wikipedia, it is hard to use due to the sporadic nature of manual
> re-assessments. By developing a highly accurate prediction model (based on
> work by Warncke-Wang et al.), we've developed a method to assess an
> articles quality at any point in history. Using this model, we explore
> general trends in quality in Wikipedia and compare these trends to those of
> an interesting cross-section: Articles tagged by WikiProject Women
> Scientists. Results suggest that articles about women scientists were lower
> quality than the rest of the wiki until mid-2013, after which a dramatic
> shift occurred towards higher quality. This shift may correlate with (and
> even be caused by) this WikiProject's initiatives.
>
>
> Privacy, Anonymity, and Perceived Risk in Open Collaboration: A Study of
> Tor Users and Wikipedians
> By *Andrea Forte*
> In a recent qualitative study
> to be published at CSCW 2017, collaborators Rachel Greenstadt, Naz
> Andalibi, and I examined privacy practices and concerns among contributors
> to open collaboration projects. We collected interview data from people who
> use the anonymity network Tor who also contribute to online projects and
> from Wikipedia editors who are concerned about their privacy to better
> understand how privacy concerns impact participation in open collaboration
> projects. We found that risks perceived by contributors to open
> collaboration projects include threats of surveillance, violence,
> harassment, opportunity loss, reputation loss, and fear for loved ones. We
> explain participants’ operational and technical strategies for mitigating
> these risks and how these strategies affect their contributions. Finally,
> we discuss chilling effects associated with privacy loss, the need for open
> collaboration projects to go beyond attracting and educating participants
> to consider their privacy, and some of the social and technical approaches
> that could be explored to mitigate risk at a project or community level.
>
> --
> Sarah R. Rodlund
> Senior Project Coordinator-Engineering, Wikimedia Foundation
> srodl...@wikimedia.org
>
> ___
> Research-wmf mailing list
> research-...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/research-wmf
>
>


-- 

*Dario Taraborelli  *Director, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Upcoming Research Showcase, November 16, 2016

2016-11-16 Thread Dario Taraborelli
>> If you can't make it, please
>> feel free to watch the video later and get in touch with us with
>> questions/comments. :)
>>
>> Best,
>> Leila
>> --
>> Leila Zia
>> Senior Research Scientist
>> Wikimedia Foundation
>>
>> ​[1] WMF Research and researchers from three academic institutions: EPFL,
>> GESIS, and Stanford University, in collaboration with WMF Reading.
>> ​
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 

*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] SPARQL workshop and WDQS tutorials

2016-09-15 Thread Dario Taraborelli
The Wikimedia Foundation's Discovery and Research teams recently hosted an
introductory workshop on the SPARQL query language and the Wikidata Query
Service.

We made the video stream <https://www.youtube.com/watch?v=NaMdh4fXy18> and
materials
<https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/2016_SPARQL_Workshop>
(demo queries, slidedecks) from this workshop publicly available.

Guest speakers:

   - Ruben Verborgh, *Ghent University* and *Linked Data Fragments*
   - Benjamin Good, *Scripps Research Institute* and *Gene Wiki*
   - Tim Putman, *Scripps Research Institute* and *Gene Wiki*
   - Lucas, *@WikidataFacts*
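
If you'd like to try the query service from code rather than the web UI,
here is a minimal sketch in Python against the public endpoint (the query
is the classic house-cats example from the WDQS documentation):

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .  # instance of: house cat
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 5
    """

    r = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"},
                     headers={"User-Agent": "sparql-workshop-example/0.1"})
    for row in r.json()["results"]["bindings"]:
        print(row["item"]["value"], row["itemLabel"]["value"])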


Dario and Stas


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] browser dashboards again!

2016-08-30 Thread Dario Taraborelli
The animated sunburst chart is the bomb.

Great stuff, Analytics!

On Tue, Aug 30, 2016 at 2:04 PM, Dan Andreescu 
wrote:

> hm.  good point.  The hover legend is sorted by percentage so you can see
> iOS when you hover.  But the key legend on the left is trying to be sorted
> alphabetically, except... it's not... so iOS gets pushed down way below the
> fold.  I'll see what's up.
>
> On Tue, Aug 30, 2016 at 4:21 PM, Ryan Kaldari 
> wrote:
>
>> Very cool! At first I was confused by Ubuntu being the 3rd most popular
>> operating system.[1] But then I realized it was actually iOS, which for
>> some reason is missing from the key.
>>
>> 1. https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os
>>
>> On Tue, Aug 30, 2016 at 12:00 PM, Toby Negrin 
>> wrote:
>>
>>> The  browser dashboards made Ben Evans[1] _and_ Product Hunt[2] :)
>>> Congrats!
>>>
>>> -Toby
>>>
>>> [1] http://us6.campaign-archive2.com/?u=b98e2de85f03865f1d38de74f&id=73f838f55a
>>> [2] https://www.producthunt.com/tech/data-dashboard-by-wikimedia-foundation
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 

*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] pageview counts on page redirects

2016-08-27 Thread Dario Taraborelli
Pageview data (both in the dumps and pageviews API) is counted for the
nominal page title as requested, i.e. it is agnostic as to what that title
redirects to.

To obtain a complete dataset of pageviews across all redirects for a given
page you would need to reconstruct its redirect graph over the time range
you're interested in, which is a pretty laborious process.
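
For anyone attempting that reconstruction, here's a rough sketch of the
aggregation in Python. It uses the MediaWiki API's prop=redirects, which
only knows the current redirect graph (so historical redirect changes are
still missed), together with the public pageviews REST API; the title and
date range are just examples.

    import requests
    from urllib.parse import quote

    HEADERS = {"User-Agent": "redirect-pageviews-sketch/0.1 (example)"}
    PROJECT = "en.wikipedia.org"

    def current_redirects(title):
        """Titles that currently redirect to `title` (history not covered)."""
        r = requests.get(f"https://{PROJECT}/w/api.php", headers=HEADERS,
                         params={"action": "query", "titles": title,
                                 "prop": "redirects", "rdlimit": "max",
                                 "format": "json"}).json()
        page = next(iter(r["query"]["pages"].values()))
        return [rd["title"] for rd in page.get("redirects", [])]

    def views(title, start="2016010100", end="2016063000"):
        """Total monthly pageviews for one exact title via the REST API."""
        article = quote(title.replace(" ", "_"), safe="")
        url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
               f"per-article/{PROJECT}/all-access/user/{article}/monthly/"
               f"{start}/{end}")
        r = requests.get(url, headers=HEADERS)
        return sum(i["views"] for i in r.json().get("items", [])) if r.ok else 0

    title = "Panic! at the Disco"
    total = views(title) + sum(views(rd) for rd in current_redirects(title))
    print(total)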

If you're doing research on this topic, you may be interested in this
recent work by Mako Hill and Aaron Shaw looking at redirects and how they
affect the quality of data on Wikipedia articles.

*Consider the Redirect: A Missing Dimension of Wikipedia Research*
http://dx.doi.org/10.1145/2641580.2641616

HTH,
Dario



On Thu, Aug 25, 2016 at 5:25 AM, Aubrey Rembert 
wrote:

> hi,
>
> our team is trying to determine how pageviews are attributed to pages that
> redirect to other pages.
>
> for instance, the page Panic!_at_the_*d*isco redirects to the page
> Panic!_at_the_*D*isco, however, in the pageview dumps file
> there is an entry for both Panic!_at_the_disco and Panic!_at_the_Disco.
> does this mean that a single visit to the page Panic!_at_the_disco
> generates two entries
> in the pageview dumps file (one entry for the source page of the redirect
> and another for the target page of the redirect)?
>
> -best,
> -ace
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 

*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Analysing link

2016-08-27 Thread Dario Taraborelli
sorry, I hit submit too fast.

The *clickstream dataset* contains data from individual page requests
(extracted from the referrer, when available, of each page
requested). The *navigation vectors* data Leila referred to measures visits
to pages that co-occur within a browser session.

There is extensive documentation on each dataset on the corresponding Meta
pages, as well as notebooks
<http://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/> that the
author of the dataset (Ellery) produced, which should help you get started
analyzing this data.
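
If it helps anyone get started, a minimal pandas sketch for slicing the
clickstream TSV. This assumes the four-column layout (prev, curr, type, n);
the file name is illustrative, and older releases carry extra ID columns,
so check the header of the file you actually download.

    import pandas as pd

    # If your file has no header row, pass names=["prev","curr","type","n"].
    df = pd.read_csv("2016_03_clickstream.tsv", sep="\t")

    # Top referrers (external sources and internal links) for one article.
    print(df[df["curr"] == "London"]
          .sort_values("n", ascending=False)
          .head(10)[["prev", "type", "n"]])

    # Most-followed outgoing links from the same article.
    print(df[(df["prev"] == "London") & (df["type"] == "link")]
          .sort_values("n", ascending=False)
          .head(10)[["curr", "n"]])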

Hope this helps!
Dario

On Sat, Aug 27, 2016 at 1:16 PM, Dario Taraborelli <
dtarabore...@wikimedia.org> wrote:

> The closest open dataset to what you are referring to is the clickstream
> dataset:
>
> https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
> https://dx.doi.org/10.6084/m9.figshare.1305770
>
> On Fri, Aug 26, 2016 at 2:38 PM, Leila Zia  wrote:
>
>>
>> On Fri, Aug 26, 2016 at 1:38 AM, Federico Leva (Nemo) > > wrote:
>>
>>> Jan Dittrich, 26/08/2016 10:03:
>>>
>>>> or even click paths
>>>>
>>>
>>> Do you know about
>>> https://meta.wikimedia.org/wiki/Research:Improving_link_coverage/Release_page_traces ?
>>>
>>
>> and https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors ?
>>
>> Leila
>>
>>
>>
>>>
>>> Nemo
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> --
>
> *Dario Taraborelli  *Head of Research, Wikimedia Foundation
> wikimediafoundation.org • nitens.org • @readermeter
> <http://twitter.com/readermeter>
>



-- 

*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Analysing link

2016-08-27 Thread Dario Taraborelli
The closest open dataset to what you are referring to is the clickstream
dataset:

https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
https://dx.doi.org/10.6084/m9.figshare.1305770

On Fri, Aug 26, 2016 at 2:38 PM, Leila Zia  wrote:

>
> On Fri, Aug 26, 2016 at 1:38 AM, Federico Leva (Nemo) 
> wrote:
>
>> Jan Dittrich, 26/08/2016 10:03:
>>
>>> or even click paths
>>>
>>
>> Do you know about
>> https://meta.wikimedia.org/wiki/Research:Improving_link_coverage/Release_page_traces ?
>>
>
> and https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors ?
>
> Leila
>
>
>
>>
>> Nemo
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
> ___________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 

*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Q4-2016 (April-June) quarterly report for Wikimedia Research

2016-07-30 Thread Dario Taraborelli
This is what we've been up to at Wikimedia Research this past quarter
(April - June 2016):

   - Research and Data
   
<https://commons.wikimedia.org/w/index.php?title=File:Technology_Quarterly_Review_-_Q4_FY15-16-_Research_and_Data,_Design_Research,_Analytics,_Performance.pdf&page=3>
   - Design Research
   
<https://commons.wikimedia.org/w/index.php?title=File%3ATechnology_Quarterly_Review_-_Q4_FY15-16-_Research_and_Data%2C_Design_Research%2C_Analytics%2C_Performance.pdf&page=14>

You might also be interested in the Analytics Engineering
<https://commons.wikimedia.org/w/index.php?title=File:Technology_Quarterly_Review_-_Q4_FY15-16-_Research_and_Data,_Design_Research,_Analytics,_Performance.pdf&page=26>
team's quarterly report.

Best,
Dario




*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Research FAQ gets a facelift

2016-06-20 Thread Dario Taraborelli
We just released a new version of Research:FAQ on Meta [1], significantly
expanded and updated, to make our processes at WMF more transparent and to
meet an explicit FDC request to clarify the role and responsibilities of
individual teams involved in research across the organization.

The previous version – written from the perspective of the (now inactive)
Research:Committee, and mostly obsolete since the release of WMF's open
access policy [2] – can still be found here [3].

Comments and bold edits to the new version of the document are welcome. For
any question or concern, you can drop me a line or ping my username on-wiki.

Thanks,
Dario

[1] https://meta.wikimedia.org/wiki/Research:FAQ
[2] https://wikimediafoundation.org/wiki/Open_access_policy
[3] https://meta.wikimedia.org/w/index.php?title=Research:FAQ&oldid=15176953


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] 'Unique Devices' Data Visualizations Available

2016-05-30 Thread Dario Taraborelli
neat, well done people

> On May 24, 2016, at 9:51 PM, Nuria Ruiz  wrote:
> 
> 
> Hello!
> 
> 
> The analytics team would like to announce that we have a new visualization 
> for Unique Devices data. As you know Unique Devices [1] is our best proxy to 
> calculate Unique Users. We would like to reiterate that the data is available 
> in a public API that anyone can access [2]. We calculate Uniques daily and 
> monthly.
>   
> 
> See, for example, "Daily Unique Devices" for Spanish Wikipedia versus French 
> Wikipedia:
> https://vital-signs.wmflabs.org/#projects=frwiki,eswiki/metrics=UniqueDevices 
> 
> FYI, that dashboard does not work on IE, only on Edge. 
> 
> Thanks, 
> 
> Nuria
> 
> [1] https://meta.wikimedia.org/wiki/Research:Unique_Devices
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Unique_Devices
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
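
For those who would rather script this than use the dashboard, here is a
minimal sketch of the public API call referenced in [2] (the endpoint takes
project / access-site / granularity / start / end):

    import requests

    # Daily unique devices for Spanish Wikipedia, January 2016. access-site
    # is one of all-sites, desktop-site, mobile-site; granularity is daily
    # or monthly.
    url = ("https://wikimedia.org/api/rest_v1/metrics/unique-devices/"
           "es.wikipedia.org/all-sites/daily/20160101/20160131")
    r = requests.get(url, headers={"User-Agent": "unique-devices-example/0.1"})
    for item in r.json()["items"]:
        print(item["timestamp"], item["devices"])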



Dario Taraborelli  Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Wikipedia Clickstream dataset refreshed (March 2016)

2016-05-02 Thread Dario Taraborelli
Hey Thomas,

yes, I agree this dataset is really valuable (just looking at the sheer
number of downloads [1] and requests for similar data we've received). I
can see the value of making it more easily accessible via an API.

Ellery and I have been talking about the idea of – at the very least –
scheduling the generation of new dumps, if not exposing the data
programmatically. Right now, I am afraid this is not within my team's
capacity and Analytics has a number of other high-priority areas to focus
on. We were planning to talk to Joseph et al anyway and decide how to move
forward (hi Joseph!), we'll report back on the lists as soon as this
happens.

Dario

[1] https://figshare.altmetric.com/details/3707715



On Mon, May 2, 2016 at 3:12 AM, Thomas Steiner  wrote:

> Hi Dario,
>
> This data is super interesting! How realistic is it that your team
> make it available through the Wikimedia REST API [1]? I would then in
> turn love to add it to Wikipedia Tools [2], just imagine how amazing
> it would be to be able to ask a spreadsheet for…
>
>   =WIKI{OUT|IN}BOUNDTRAFFIC("en:London", TODAY()-2, TODAY()-1)
>
> …(or obviously the API respectively) and get the results back
> immediately without the need to download a dump first. What do you
> think?
>
> Cheers,
> Tom
>
> --
> [1] https://wikimedia.org/api/rest_v1/?doc
> [2] http://bit.ly/wikipedia-tools-add-on
>
> --
> Dr. Thomas Steiner, Employee (http://blog.tomayac.com,
> https://twitter.com/tomayac)
>
> Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
> Managing Directors: Matthew Scott Sucherman, Paul Terence Manicle
> Registration office and registration number: Hamburg, HRB 86891
>
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v2.0.29 (GNU/Linux)
>
>
> iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom
> hTtPs://xKcd.cOm/1181/
> -END PGP SIGNATURE-
>
> _______
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Wikipedia Clickstream dataset refreshed (March 2016)

2016-04-28 Thread Dario Taraborelli
Hey all,

heads up that a refreshed Wikipedia Clickstream dataset is now available
for March 2016, containing 25 million (referer, resource) pairs extracted
from about 7 billion webrequests.

https://dx.doi.org/10.6084/m9.figshare.1305770.v16

Ellery (the author of the dataset) is cc'ed if you have any questions, or
you can chime in on the talk page of the dataset entry on Meta
<https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream>.

Show us what you do with this data, if you use it in your research.

Dario

*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wikimedia-l] [Wiki-research-l] Research showcase: Evolution of privacy loss in Wikipedia

2016-03-19 Thread Dario Taraborelli
On Wed, Mar 16, 2016 at 7:53 PM, SarahSV  wrote:

> Dario and Aaron, thanks for letting us know about this. Is the research
> available in writing for people who don't want to sit through the video?
>
> Sarah
>

Sarah – yes, see http://cm.cecs.anu.edu.au/post/wikiprivacy/

On Wed, Mar 16, 2016 at 12:55 PM, Aaron Halfaker 
> wrote:
>
> > Reminder, this showcase is starting in 5 minutes.  See the stream here:
> > https://www.youtube.com/watch?v=Xle0oOFCNnk
> >
> > Join us on Freenode at #wikimedia-research
> > <http://webchat.freenode.net/?channels=wikimedia-research> to ask Andrei
> > questions.
> >
> > -Aaron
> >
> > On Tue, Mar 15, 2016 at 12:53 PM, Dario Taraborelli <
> > dtarabore...@wikimedia.org> wrote:
> >
> > > This month, our research showcase
> > > <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#March_2016
> >
> > hosts
> > > Andrei Rizoiu (Australian National University) to talk about his work
> > > <http://cm.cecs.anu.edu.au/post/wikiprivacy/> on *how private traits
> of
> > > Wikipedia editors can be exposed from public data* (such as edit
> > > histories) using off-the-shelf machine learning techniques. (abstract
> > below)
> > >
> > > If you're interested in learning what the combination of machine
> learning
> > > and public data mean for privacy and surveillance, come and join us
> this
> > *Wednesday
> > > March 16* at *1pm Pacific Time*.
> > >
> > > The event will be recorded and publicly streamed
> > > <https://www.youtube.com/watch?v=Xle0oOFCNnk>. As usual, we will be
> > > hosting the conversation with the speaker and Q&A on the
> > > #wikimedia-research channel on IRC.
> > >
> > > Looking forward to seeing you there,
> > >
> > > Dario
> > >
> > >
> > > Evolution of Privacy Loss in Wikipedia
> > > The cumulative effect of collective
> > > online participation has an important and adverse impact on individual
> > > privacy. As an online system evolves over time, new digital traces of
> > > individual behavior may uncover previously hidden statistical links
> > between
> > > an individual’s past actions and her private traits. To quantify this
> > > effect, we analyze the evolution of individual privacy loss by studying
> > > the edit history of Wikipedia over 13 years, including more than
> 117,523
> > > different users performing 188,805,088 edits. We trace each Wikipedia’s
> > > contributor using apparently harmless features, such as the number of
> > edits
> > > performed on predefined broad categories in a given time period (e.g.
> > > Mathematics, Culture or Nature). We show that even at this unspecific
> > level
> > > of behavior description, it is possible to use off-the-shelf machine
> > > learning algorithms to uncover usually undisclosed personal traits,
> such
> > as
> > > gender, religion or education. We provide empirical evidence that the
> > > prediction accuracy for almost all private traits consistently improves
> > > over time. Surprisingly, the prediction performance for users who
> stopped
> > > editing after a given time still improves. The activities performed by
> > new
> > > users seem to have contributed more to this effect than additional
> > > activities from existing (but still active) users. Insights from this
> > work
> > > should help users, system designers, and policy makers understand and
> > make
> > > long-term design choices in online content creation systems.
> > >
> > >
> > > *Dario Taraborelli  *Head of Research, Wikimedia Foundation
> > > wikimediafoundation.org • nitens.org • @readermeter
> > > <http://twitter.com/readermeter>
> > >
> > > ___
> > > Wiki-research-l mailing list
> > > wiki-researc...@lists.wikimedia.org
> > > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> > >
> > >
> > _______
> > Wikimedia-l mailing list, guidelines at:
> > https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> > New messages to: wikimedi...@lists.wikimedia.org
> > Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> > <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>
> >
> ___
> Wikimedia-l mailing list, guidelines at:
> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> New messages to: wikimedi...@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>




-- 


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Research showcase: Evolution of privacy loss in Wikipedia

2016-03-15 Thread Dario Taraborelli
This month, our research showcase
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#March_2016> hosts
Andrei Rizoiu (Australian National University) to talk about his work
<http://cm.cecs.anu.edu.au/post/wikiprivacy/> on *how private traits of
Wikipedia editors can be exposed from public data* (such as edit histories)
using off-the-shelf machine learning techniques. (abstract below)

If you're interested in learning what the combination of machine learning
and public data mean for privacy and surveillance, come and join us
this *Wednesday
March 16* at *1pm Pacific Time*.

The event will be recorded and publicly streamed
<https://www.youtube.com/watch?v=Xle0oOFCNnk>. As usual, we will be hosting
the conversation with the speaker and Q&A on the #wikimedia-research
channel on IRC.

Looking forward to seeing you there,

Dario


Evolution of Privacy Loss in Wikipedia
The cumulative effect of collective
online participation has an important and adverse impact on individual
privacy. As an online system evolves over time, new digital traces of
individual behavior may uncover previously hidden statistical links between
an individual’s past actions and her private traits. To quantify this
effect, we analyze the evolution of individual privacy loss by studying the
edit history of Wikipedia over 13 years, including more than 117,523
different users performing 188,805,088 edits. We trace each Wikipedia
contributor using apparently harmless features, such as the number of edits
performed on predefined broad categories in a given time period (e.g.
Mathematics, Culture or Nature). We show that even at this unspecific level
of behavior description, it is possible to use off-the-shelf machine
learning algorithms to uncover usually undisclosed personal traits, such as
gender, religion or education. We provide empirical evidence that the
prediction accuracy for almost all private traits consistently improves
over time. Surprisingly, the prediction performance for users who stopped
editing after a given time still improves. The activities performed by new
users seem to have contributed more to this effect than additional
activities from existing (but still active) users. Insights from this work
should help users, system designers, and policy makers understand and make
long-term design choices in online content creation systems.
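
To illustrate the kind of off-the-shelf pipeline the abstract describes,
here is a schematic sketch with fabricated data (not the study's data or
code): per-user edit counts over broad topic categories, fed to a standard
classifier that tries to predict an undisclosed trait.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Fabricated data: each row is one user's edit counts on five broad
    # categories (e.g. Mathematics, Culture, Nature...) in a time window.
    rng = np.random.default_rng(0)
    X = rng.poisson(lam=3.0, size=(200, 5))
    y = rng.integers(0, 2, size=200)  # a hidden binary trait

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # With random labels this hovers around 0.5; on real edit histories the
    # paper reports accuracy that improves over time.
    print(cross_val_score(clf, X, y, cv=5).mean())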


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Ops] Dark traffic

2016-03-01 Thread Dario Taraborelli
hey Andrew,

we're monitoring the impact of this change (which we rolled out on 2/22)
with a number of external partners (BBC, Le Monde, JSTOR, Elsevier) and
we're planning to write a full report in April. Elsevier reported that
visible inbound traffic from Wikipedia dropped by 99% in June 2015.
This change should fix that, while preserving the privacy of our readers
browsing content over HTTPS.

Background:
https://meta.wikimedia.org/wiki/Research:Wikimedia_referrer_policy

Dario


On Tue, Mar 1, 2016 at 7:20 AM, Andrew Lih  wrote:

> Thanks James, Dan, Chris and all for the quick answer.
>
> Nice to see this change. As Alex Stinson pointed out in the Phabricator
> discussion, it helps with our GLAM partners so they can keep tracking how
> much referral traffic comes from WM projects.
>
> -Andrew
>
>
> On Tue, Mar 1, 2016 at 10:02 AM, Chris Steipp 
> wrote:
>
>> Hi Dan,
>>
>> https://phabricator.wikimedia.org/T87276
>>
>> On Tue, Mar 1, 2016 at 6:58 AM, Dan Andreescu 
>> wrote:
>>
>>> I think this is more of an ops question, cc-ing them.
>>>
>>> On Tue, Mar 1, 2016 at 9:55 AM, Andrew Lih  wrote:
>>>
>>>> Hi folks, I got this note from an external organization that wanted to
>>>> know more about what Wikimedia changed so that they are now accurate
>>>> getting referral info. Any pointers?
>>>>
>>>> "Wikipedia was implementing a fix so it would not be “dark traffic" in
>>>> the analytics reports. This has been happening for the past 10 months. Just
>>>> noticed today that Wikipedia is showing again in the referrals report.”
>>>>
>>>>  Thanks.
>>>>
>>>>
>>>> ___
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>> ___
>>> Ops mailing list
>>> o...@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/ops
>>>
>>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Wiki Workshop 2016 @ ICWSM: deadline extended to March 3

2016-02-23 Thread Dario Taraborelli
Hi all – heads up that we extended the submission deadline for the Wiki
Workshop at ICWSM '16 to *Wednesday, March 3, 2016*. (The second deadline
remains unchanged: March 11, 2016).

You can check the workshop's website
 for submission instructions or
follow us at @wikiworkshop16  for live
updates.

Looking forward to your contributions.

Dario
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wiki-Medicine] Zika

2016-02-18 Thread Dario Taraborelli
>> what complementary knowledge we want to produce, working
>> with
>> >> WikiProject Medicine can be helpful, too.
>> >
>> >
>> > Cool, yeah, I'm nowhere close to knowledgeable on this, I can data-dog
>> > though :)
>> >
>> >
>> > [1] www.cbc.ca/news/health/microcephaly-brazil-zika-reality-1.3442580
>> >
>> > ___
>> > Wikimedia-Medicine mailing list
>> > wikimedia-medic...@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wikimedia-medicine
>> >
>>
>> ___
>> Wikimedia-Medicine mailing list
>> wikimedia-medic...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikimedia-medicine
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
>
> --
> Thank you.
>
> Alex Druk
> alex.d...@gmail.com
> (775) 237-8550 Google voice
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Pageviews] [Technical] Simplifying the available static dumps of pageview data

2016-01-06 Thread Dario Taraborelli
Erik's proposal sounds very reasonable.

There might be some confusion about what we mean by "keeping the old
datasets for longitudinal analysis". No one is planning to remove the old
static dumps, just stop generating them/maintaining them going forward.

I also want to echo Nuria regarding the human cost of maintaining multiple
definitions. I just finished preparing a response to a reporter who was
asking about project-level mobile PV data and I was not immediately able to
answer if a specific data source I wanted to cite was using the old or new
definition (until I talked to Dan and we looked up a gerrit patch
together).

How do people feel about turning off the generation of old dumps by *May
2016*, i.e. one year after having the two series of data available in
parallel?



On Wed, Jan 6, 2016 at 10:17 AM, Nuria Ruiz  wrote:

> >As I just mentioned to Dan in a private email conversation, keeping
> datasets even with imperfect measurements is important. Particularly for
> longitudinal analysis.
> Keep in mind that maintaining these old dumps is not "free": it causes a
> lot of confusion and maintenance costs to have several pageview definitions
> around. We get a lot of questions about the spikiness of the old definition, and
> we need to maintain software that generates the old files; thus, we think it is
> reasonable to ask our users to transition to the new definition and
> eventually (in a period of months) turn off the old dumps.
>
> On Thu, Dec 24, 2015 at 6:12 AM, Maurice Vergeer 
> wrote:
>
>> Dear all,
>>
>> As I just mentioned to Dan in a private email conversation, keeping
>> datasets even with imperfect measurements is important. Particularly for
>> longitudinal analysis.
>>
>> Also, from what I understand - me being a newbie here - the data
>> are stored in separate files. Dan suggested reordering the page into
>> categories. Maybe another option is to create more extensive datasets with
>> more different measurements in a single datafile. On the other hand, the
>> files would become even bigger in size. Not an issue for me, but for users
>> in the field, accessibility (download bandwidth) could become an issue.
>>
>> my two cents
>> Maurice
>>
>>
>> On Thu, Dec 24, 2015 at 2:58 PM, Alex Druk  wrote:
>>
>>> Nothing against this approach!
>>>
>>> On Thu, Dec 24, 2015 at 2:55 PM, Dan Andreescu >> > wrote:
>>>
>>>>
>>>>
>>>> On Thu, Dec 24, 2015 at 8:48 AM, Alex Druk  wrote:
>>>>
>>>>> Hi Dan,
>>>>> Happy holidays!
>>>>> Good idea to combine these datasets! However we have one more dataset
>>>>> by Erik Zachte : http://dumps.wikimedia.org/other/pagecounts-ez/
>>>>>
>>>>
>>>> And that's an important one!  But I was thinking we could re-organize
>>>> the page into categories.  Erik's dataset could go into a "processed data"
>>>> category or something like that.  The three I wanted to talk about on this
>>>> thread are just the raw data.
>>>>
>>>> ___
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>>
>>> --
>>> Thank you.
>>>
>>> Alex Druk
>>> alex.d...@gmail.com
>>> (775) 237-8550 Google voice
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>>
>> --
>> 
>> Maurice Vergeer
>> To contact me, see http://mauricevergeer.nl/node/5
>> To see my publications, see http://mauricevergeer.nl/node/1
>> 
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] What Wikimedia Research is up to in the next quarter

2015-12-18 Thread Dario Taraborelli
Hey all,

I’m glad to announce that the Wikimedia Research team’s goals
<https://www.mediawiki.org/wiki/Wikimedia_Research/Goals#January_-_March_2016_.28Q3.29>
for
the next quarter (January - March 2016) are up on wiki.

The Research and Data
<https://www.mediawiki.org/wiki/Wikimedia_Research#Research_and_Data> team
will continue to work with our volunteers and collaborators on revision
scoring as a service <https://meta.wikimedia.org/wiki/R:Revscoring> adding
support for 5 new languages and prototyping new models (including an edit
type classifier
<https://meta.wikimedia.org/wiki/Research:Automated_classification_of_edit_types>).
We will also continue to iterate on the design of article creation
recommendations
<https://meta.wikimedia.org/wiki/Research:Increasing_article_coverage>,
running a dedicated campaign in coordination with existing editathons to
improve the quality of these recommendations. Finally, we will extend a
research project we started in November aimed at understanding the behavior
of Wikipedia readers
<https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour>
, by combining qualitative survey data with behavioral analysis from our
HTTP request logs.

The Design Research
<https://www.mediawiki.org/wiki/Wikimedia_Research#Design_Research> team
will conduct an in-depth study of user needs (particularly readers) on the
ground in February. We will continue to work with other Wikimedia
Engineering teams throughout the quarter to ensure the adoption of
human-centered design principles and pragmatic personas
<https://www.mediawiki.org/wiki/Personas_for_product_development> in our
product development cycle. We’re also excited to start a collaboration
<https://meta.wikimedia.org/wiki/Research:Publicly_available_online_learning_resource_survey>
with
students at the University of Washington to understand what free online
information resources (including, but not limited to, Wikimedia projects)
students use.

I am also glad to report that two papers on link and article
recommendations (the result of a formal collaboration with a team at
Stanford) were accepted for presentation at WSDM '16 and WWW ’16 (preprints
will be made available shortly). An overview on revision scoring as a
service
<http://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/> was
published a few weeks ago on the Wikimedia blog, and got some good media
coverage
<https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Media>
.

We're constantly looking for contributors and as usual we welcome feedback
on these projects via the corresponding talk pages on Meta. You can contact
us for any question on IRC via the #wikimedia-research channel and follow
@WikiResearch <https://twitter.com/WikiResearch> on Twitter for the latest
Wikipedia and Wikimedia research updates hot off the press.

Wishing you all happy holidays,

Dario and Abbey on behalf of the team


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Page view API questions regarding user agent

2015-12-15 Thread Dario Taraborelli
On Tue, Dec 15, 2015 at 2:56 AM, Oliver Keyes  wrote:

> 2-3 weeks? What are you doing, taking /vacations at Christmas/?
> Unacceptable!
>
> More seriously: the work on the API thus far - the data that has been
> moved in, the responsiveness around bug reports, the intuitive nature
> of the interface from a client library POV - has been fantastic. I
> hope you all enjoy your break :). I am honoured to call you coworkers.
>

hear hear

+1 on better documenting what "bot" refers to, not 100% intuitive.
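
For anyone who wants to poke at the agent split themselves, here's a minimal
sketch against the per-article endpoint Felix links below (Python with the
requests library; illustrative only, so check the API docs for the
authoritative parameter values):

    import requests

    BASE = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
            "{project}/all-access/{agent}/{article}/daily/{start}/{end}")

    def daily_views(project, article, agent, start, end):
        """Daily view counts for one article and one agent type
        ('user', 'spider', 'bot', or 'all-agents')."""
        url = BASE.format(project=project, article=article, agent=agent,
                          start=start, end=end)
        resp = requests.get(url)
        resp.raise_for_status()  # note: the API 404s when a bucket has no data
        return {i["timestamp"]: i["views"] for i in resp.json()["items"]}

    # Compare bot vs. spider traffic on the English main page for August 2015.
    for agent in ("bot", "spider"):
        try:
            views = daily_views("en.wikipedia", "Main_Page", agent,
                                "20150801", "20150831")
            print(agent, sum(views.values()))
        except requests.HTTPError:
            print(agent, "no data")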



>
> On 14 December 2015 at 18:51, Madhumitha Viswanathan
>  wrote:
> > +1 Oliver - User agents tagged with WikimediaBot are tagged as bot - I do
> > agree that our documentation on this can be improved, I'll update the
> > Webrequest and Pageview tables docs to reflect this.
> >
> > The backfilling jobs for May-July have been paused at the moment, the
> plan
> > is to resume backfilling in 2-3 weeks.
> >
> > On Mon, Dec 14, 2015 at 3:45 PM, Oliver Keyes 
> wrote:
> >>
> >> Hey Felix,
> >>
> >> To answer some questions in order:
> >>
> >> 1. Bots are automated systems with a Wikimedia specific tag
> >> (WikimediaBot, iirc) in their user agent. We don't expect this to be
> >> widely adopted yet because it hasn't been widely advertised. The
> >> standard itself is very new, which is probably why you don't see any
> >> traffic referring to it in August.
> >> 2. The idea is to have traffic no earlier than May 2015 - because
> >> that's when the new pageview definition was instrumented (and so the
> >> earliest point we have data from) but that doesn't mean all the data
> >> has been loaded in yet.
> >>
> >> On 14 December 2015 at 18:39, Felix J. Scholz
> >>  wrote:
> >> > Dear All:
> >> >
> >> > Maybe this question is a little bit too simple, but I did not
> >> > immediately
> >> > find the answer in the docs.
> >> >
> >> > How does the API differentiate between the two user agents spider and
> >> > bot?
> >> >
> >> > I'm asking because for some articles, there seems to be no bot traffic
> >> > at
> >> > all, including the main page in August:
> >> >
> >> >
> https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/bot/Main_Page/daily/20150801/20150901
> >> >
> >> > ---
> >> > Another, unrelated question:
> >> > By my recollection, I read somewhere that the data available via the
> API
> >> > dates back to sometime in May of 2015. However, when doing queries
> >> > today,
> >> > the API only returned data starting on August 1, 2015. Is that
> correct?
> >> >
> >> > Best,
> >> > Felix
> >> >
> >> > ___
> >> > Analytics mailing list
> >> > Analytics@lists.wikimedia.org
> >> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >> >
> >>
> >>
> >>
> >> --
> >> Oliver Keyes
> >> Count Logula
> >> Wikimedia Foundation
> >>
> >> ___
> >> Analytics mailing list
> >> Analytics@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
> >
> >
> >
> >
> > --
> > --Madhu :)
> >
> > ___
> > Analytics mailing list
> > Analytics@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
>
>
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Wikistats upgraded to new page view definition

2015-12-03 Thread Dario Taraborelli
Thanks for sharing the writeup, Erik. Page views aside, the “active wikis” 
plots will be really helpful for having an honest discussion about how many 
language communities Wikimedia actually supports.

Dario

> On Dec 3, 2015, at 2:35 PM, Jonathan Morgan  wrote:
> 
> This is awesome. Thanks Erik!
> 
> Jonathan
> 
> On Thu, Dec 3, 2015 at 2:09 PM, Erik Zachte  <mailto:ezac...@wikimedia.org>> wrote:
> Hi all,
> 
>  
> 
> I just released a major upgrade for Wikistats traffic reports: see blog post
> 
> http://infodisiac.com/blog/2015/12/wikistats-upgraded-to-new-page-view-definition/
>  
> <http://infodisiac.com/blog/2015/12/wikistats-upgraded-to-new-page-view-definition/>
>  
> 
> Erik Zachte
> 
>  
> 
>  
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org>
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> <https://lists.wikimedia.org/mailman/listinfo/analytics>
> 
> 
> 
> 
> -- 
> Jonathan T. Morgan
> Senior Design Researcher
> Wikimedia Foundation
> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
> 
> _______
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics



Dario Taraborelli  Head of Research, Wikimedia Foundation
wikimediafoundation.org <http://wikimediafoundation.org/> • nitens.org 
<http://nitens.org/> • @readermeter <http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Backlinks TO Wikipedia

2015-12-02 Thread Dario Taraborelli
what Greg said, Common Crawl is an excellent data source to answer these 
questions, see:

http://blog.commoncrawl.org/2015/04/announcing-the-common-crawl-index/
http://blog.commoncrawl.org/2015/02/wikireverse-visualizing-reverse-links-with-open-data/

for aggregate stats about referrals to individual articles by traffic source,
aggregated at the domain level, you may also be interested in this dataset:

http://figshare.com/articles/Wikipedia_Clickstream/1305770
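
If you want to work with the clickstream dump directly, tallying external
referrers for one article takes only a few lines. A rough sketch in Python --
the column names (prev, curr, n) and the filename are assumptions, so check
the figshare page for the exact schema of the release you download:

    import csv
    import gzip
    from collections import Counter

    referrers = Counter()
    with gzip.open("clickstream.tsv.gz", "rt") as f:  # filename is illustrative
        for row in csv.DictReader(f, delimiter="\t"):
            if row["curr"] == "London":  # target article
                referrers[row["prev"]] += int(row["n"])

    for prev, n in referrers.most_common(10):
        print(prev, n)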

> On Dec 2, 2015, at 8:06 AM, Greg Lindahl  wrote:
> 
> On Tue, Dec 01, 2015 at 07:50:23PM +0100, Federico Leva (Nemo) wrote:
>> Edison Nica, 29/11/2015 16:56:
>>> how many non-wikipedia pages point to a certain wikipedia page
>> 
>> I guess the only way we have to know this (other than grepping
>> request logs for referrers, which would be quite a nightmare) is to
>> access the Google Webmaster account for wikipedia.org (to which a
>> couple employees had access, IIRC).
> 
> There are a couple of other ways to figure out inlinks:
> 
> * Common Crawl
> * Commercial SEO services like Moz or Ahrefs
> 
> In the medium term the Internet Archive is going to be generating this
> kind of link data as part of the Wayback Machine search engine effort.
> 
> And finally, Edison, counting the number of inlinks without
> considering their rank or popularity will probably leave you
> vulnerable to people orchestrating googlebombs. And you might want to
> also know the anchortext, that's extremely valuable for search
> indexing.
> 
> -- greg
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics



Dario Taraborelli  Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Revision scoring as a service launched

2015-11-30 Thread Dario Taraborelli
We just published an announcement on the Wikimedia blog marking the official 
launch of revision scoring as a service 
<https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service> and I 
wanted to say a few words about this project:

Blog post: 
https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/ 
<https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/>
Docs on Meta: https://meta.wikimedia.org/wiki/ORES 
<https://meta.wikimedia.org/wiki/ORES> 

First off: what’s revision scoring 
<https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service#Rationale>?
 On the surface, it’s a set of open APIs allowing you to automatically “score” 
any edit and measure their probability of being damaging or good-faith 
contributions. The real goal behind this project, though, is to fix the damage 
indirectly caused by vandal-fighting bots and tools on good-faith contributors 
and to bring back a collaborative dimension to how we do quality control on 
Wikipedia. I invite you to read the whole blog post 
<https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/> if 
you want to know more about the motivations and expected outcome of this 
project.

I am thrilled this project is coming to fruition and I’d like to congratulate 
Aaron Halfaker <https://wikimediafoundation.org/wiki/User:Ahalfaker> and all 
the project contributors 
<https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service#Team> 
on hitting this big milestone: revision scoring started as Aaron’s side project 
well over a year ago and it has been co-designed (as in – literally – 
conceived, implemented, tested, improved and finally adopted) by a distributed 
team of volunteer developers, editors, and researchers. We worked with 
volunteers in 14 different Wikipedia language editions and as of today revision 
scores are integrated 
<https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service#Tools_that_use_ORES>
 in the workflow of several quality control interfaces, WikiProjects and 3rd 
 party tools. The project would not have seen the light of day without the technical 
support provided by the TechOps team (Yuvi in particular) and seminal funding 
provided by the WMF IEG program and Wikimedia Germany.

So, here you go: the next time someone tells you that LLAMAS GROW ON TREES 
<https://en.wikipedia.org/w/index.php?diff=prev&oldid=642215410> you can 
confidently tell them they should stop damaging 
<http://ores.wmflabs.org/scores/enwiki/damaging/642215410/> Wikipedia.

Dario
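
P.S. for the programmatically inclined, fetching a score takes a few lines. A
minimal sketch (Python with requests; the response layout is my reading of the
current API, so double-check the docs on Meta):

    import requests

    def damaging_probability(wiki, rev_id):
        """Probability that a revision is damaging, per the ORES API."""
        url = "http://ores.wmflabs.org/scores/{}/damaging/{}/".format(wiki, rev_id)
        scores = requests.get(url).json()
        return scores[str(rev_id)]["probability"]["true"]

    # The llamas-grow-on-trees edit linked above:
    print(damaging_probability("enwiki", 642215410))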


Dario Taraborelli  Head of Research, Wikimedia Foundation
wikimediafoundation.org <http://wikimediafoundation.org/> • nitens.org 
<http://nitens.org/> • @readermeter <http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] A belated project completion shout-out

2015-10-29 Thread Dario Taraborelli
big data big cake. Congrats, folks! 

> On Oct 29, 2015, at 2:52 PM, Leila Zia  wrote:
> 
> Somehow I understand the firework language and this email better than all the 
> other presentations we had so far. :D
> 
> Congratulations, team! This is amazing! :-)
> 
> L
> p.s. and I bake you a big cake the next time you're in town if you can wait 
> until then. ;-)
> 
> On Thu, Oct 29, 2015 at 2:40 PM, Dan Andreescu  <mailto:dandree...@wikimedia.org>> wrote:
> We're gonna need a Really big cake.
> 
> On Thu, Oct 29, 2015 at 5:33 PM, Kevin Leduc  <mailto:ke...@wikimedia.org>> wrote:
> I was grooming our Kanban board and believe it's time to close four projects 
> (epic tasks) which we completed last quarter in support of our quarterly 
> goals.  It's time to mark these tasks as resolved and think back on them 
> fondly.
> 
> Project: Pageviews in Vital Signs
> Animal code name: musk
> Result: https://vital-signs.wmflabs.org/ <https://vital-signs.wmflabs.org/>
> Ticket: https://phabricator.wikimedia.org/T101120 
> <https://phabricator.wikimedia.org/T101120>
> 
> Project: Total Pageview count in Vital Signs
> Animal code name: wren
> Result: https://vital-signs.wmflabs.org/#projects=all/metrics=Pageviews 
> <https://vital-signs.wmflabs.org/#projects=all/metrics=Pageviews>
> Ticket: https://phabricator.wikimedia.org/T96314 
> <https://phabricator.wikimedia.org/T96314>
> 
> Project: Hadoop Cluster Expansion
> Animal code name: mule
> Result: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hardware 
> <https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hardware>
> Ticket: https://phabricator.wikimedia.org/T99952 
> <https://phabricator.wikimedia.org/T99952>
> 
> Project: EventLogging on Kafka
> Animal code name: stag
> Result: 
> https://www.mediawiki.org/wiki/File:EventLogging_on_Kafka_-_Lightning_Talk.pdf
>  
> <https://www.mediawiki.org/wiki/File:EventLogging_on_Kafka_-_Lightning_Talk.pdf>
> Ticket: https://phabricator.wikimedia.org/T102225 
> <https://phabricator.wikimedia.org/T102225>
> 
> 
> (Photo by Alex Sims 
> <https://commons.wikimedia.org/wiki/File:Skyshow_Adelaide_2006.JPG>)
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org>
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> <https://lists.wikimedia.org/mailman/listinfo/analytics>
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org <mailto:Analytics@lists.wikimedia.org>
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> <https://lists.wikimedia.org/mailman/listinfo/analytics>
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics



Dario Taraborelli  Head of Research, Wikimedia Foundation
wikimediafoundation.org <http://wikimediafoundation.org/> • nitens.org 
<http://nitens.org/> • @readermeter <http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] [job opening] Software Engineer - Research

2015-09-21 Thread Dario Taraborelli
https://github.com/wiki-ai/revscoring <https://github.com/wiki-ai/revscoring>
https://github.com/wiki-ai/ores <https://github.com/wiki-ai/ores>
https://github.com/halfak/MediaWiki-Utilities 
<https://github.com/halfak/MediaWiki-Utilities>
https://github.com/halfak/mwstreaming <https://github.com/halfak/mwstreaming>   






Dario Taraborelli  Head of Research, Wikimedia Foundation
wikimediafoundation.org <http://wikimediafoundation.org/> • nitens.org 
<http://nitens.org/> • @readermeter <http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Wikimedia traffic forecast application

2015-09-15 Thread Dario Taraborelli
An updated version of a pageview forecasting application written by Ellery 
(Research & Data team) has just been released: 

https://ewulczyn.shinyapps.io/pageview_forecasting 
<https://ewulczyn.shinyapps.io/pageview_forecasting>
https://twitter.com/WikiResearch/status/643942154549592064 
<https://twitter.com/WikiResearch/status/643942154549592064>

The data is refreshed monthly and it includes breakdowns by country and 
platform.

Dario



Dario Taraborelli  Head of Research, Wikimedia Foundation
wikimediafoundation.org <http://wikimediafoundation.org/> • nitens.org 
<http://nitens.org/> • @readermeter <http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Pageviews definition + measurement for apps adding link previews + using RESTBase

2015-08-19 Thread Dario Taraborelli

> On Aug 18, 2015, at 6:32 PM, Kevin Leduc  wrote:
> 
> We briefly considered counting views of Hover Cards as Pageviews, but it was 
> quickly dismissed.  First, the feature is not widely used enough to justify 
> changing the pageview definition.

I second that: these are “impressions” and they should be measured separately. 
I would be very worried about inflating our baseline PV numbers with all sorts 
of features revealing snippets of content. 
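
(To make the distinction concrete, the kind of rule I have in mind is a
classifier over the RESTBase paths discussed later in this thread. A toy
sketch -- the purpose=preview marker is invented for illustration, not an
actual parameter:)

    from urllib.parse import urlparse, parse_qs

    LEAD_PATH = "/api/rest_v1/page/mobile-html-sections-lead/"

    def classify(url):
        """Toy rule: 'pageview', 'preview', or None for one app request URL."""
        parsed = urlparse(url)
        if not parsed.path.startswith(LEAD_PATH):
            return None  # remainder requests, images, etc. don't count
        qs = parse_qs(parsed.query)
        if qs.get("purpose") == ["preview"]:  # invented marker, illustration only
            return "preview"
        return "pageview"

    print(classify("https://en.wikipedia.org/api/rest_v1/page/"
                   "mobile-html-sections-lead/Dilbert"))  # -> 'pageview'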

> I’m still open to counting previews as pageviews, but I think the Readership 
> team and their product managers need to weigh in heavily as Pageviews is a 
> key metric for them.
> 
> Finally, counting Pageviews served through RESTBase sounds like a new project 
> and I'd like to hear more about the effort needed from the analytics 
> engineers.
> 
> 
> On Tue, Aug 18, 2015 at 4:58 PM, Oliver Keyes  > wrote:
> On 18 August 2015 at 19:11, Bernd Sitzmann  > wrote:
> > This discussion is about needed updates of the definition and Analytics
> > implementation for mobile apps page view metrics. There is also an
> > associated Phab task[4]. Please add the proper Analytics project there.
> >
> > Background / Changes
> >
> > As you probably remember, the Android app splits a page view into two
> > requests: one for the lead section and metadata, plus another one for the
> > remainder.
> >
> > The mobile apps are going to change the way they load pages in two different
> > ways:
> >
> > We'll add a link preview when someone clicks on a link from a page.
> > We're planning on switching over the using RESTBase for loading pages and
> > also the link preview (initially just the Android beta, ater more)
> >
> 
> Woah woah woah woah woah. By RESTBase do you mean Gabriel's RESTful service 
> API?
> 
> Last time I checked that wasn't even consumed by HDFS. Is it now being
> consumed by HDFS?
> 
> More importantly the actual URLs are going to look /totally/
> different. If we do not include RESTBase requests, we will miss the
> apps. If we /do/ include RESTBase requests we will not only have to
> rewrite the pageview definition for the apps to recognise the new URL
> scheme, we will also potentially have to rewrite every /other/ bit of
> the definition to /not/ incorporate those requests.
> 
> (I use "we" in a collective sense. This isn't my baby any more,
> although if Joseph et al want help with the refactor here I'm happy to
> spend my volunteer time on it).
> 
> But basically every other bit of your email is important but now
> secondary: this is a potentially massive change, all on its own, even
> without the link preview, even if the substance of the requests going
> to RESTBase were identical.
> 
> > This will have implications for the pageviews definition and how we count
> > user engagement.
> >
> > The big question is
> >
> > Should we count link previews as a page view since it's an indication of
> > user engagement? Or should there be a separate metric for link previews?
> >
> > Counting page views
> >
> > IIRC we currently count action=mobileview&sections=0 query parameters of
> > api.php as a page view. When we publish link previews for all Android app
> > users, we would either want to also count the calls to
> > action=query&prop=extracts as a page view or add them to another metric.
> >
> > Once the apps use RESTBase the HTTPS requests will be very different:
> >
> > Page view: Instead of action=mobileview&sections=0 the app would call the
> > RESTBase endpoint for lead request[1] instead of the PHP API mentioned
> > above. Then it would call [2].
> > Link preview: Instead of action=query&prop=extracts it would call the lead
> > request[1], too, since there is a lot of overlap. At least that's our current
> > plan. The advantage of that is that the client doesn't need to execute the
> > lead request a second time if the user clicks on the link preview (-- either
> > through caching or app logic.)
> >
> > So, in the RESTBase case we either want to count the
> > mobile-html-sections-lead requests or the mobile-html-sections-remaining
> > requests depending on what our definition for page views actually is. We
> > could also add a query parameter or extra HTTP header to one of the
> > mobile-html-sections-lead requests if we need to distinguish between
> > previews and page views.
> >
> > Both the current PHP API and the RESTBase based metrics would need to be
> > compatible and be collected in parallel since we cannot control when users
> > update their apps.
> >
> > [1]
> > https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Dilbert 
> > 
> > [2]
> > https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-remaining/Dilbert
> >  
> > 
> > [3]
> > https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_apps
> >  
> 

Re: [Analytics] [Wikimedia-search] Scaleable Event Systems recap

2015-08-03 Thread Dario Taraborelli
nm, clarified with Kevin. 

> On Aug 3, 2015, at 18:38, Dario Taraborelli  
> wrote:
> 
> what are the implications (if any) on event validation?
> 
>> On Mon, Aug 3, 2015 at 3:19 PM, Tomasz Finc  wrote:
>> Very excited to see this moving forward
>> 
>> On Mon, Aug 3, 2015 at 3:12 PM, Oliver Keyes  wrote:
>> > Heyo, Discovery team!
>> >
>> > (Analytics CCd)
>> >
>> > This is just a quick writeup of the Scaleable Event Systems meeting
>> > that Erik, Dan, Stas and I went to (although just from my
>> > perspective).
>> >
>> > For people not in the initial thread, this is a proposal to replace
>> > the internal architecture of EventLogging and similar services with
>> > Apache Kafka brokers
>> > (http://www.confluent.io/blog/stream-data-platform-1/ ). What that
>> > means in practice is that the current 1-2k events/second limit on
>> > EventLogging will disappear and we can stop worrying about sampling
>> > and accidentally bringing down the system. We can be a lot less
>> > cautious about our schemas and a lot less cautious about our sampling
>> > rate!
>> >
>> > It also offers up a lot of opportunities around streaming data and
>> > making it available in a layered fashion - while we don't want to
>> > explore that right now, I don't think, it's nice to have as an option
>> > when we better understand our search data and how we can safely
>> > distribute it.
>> >
>> > I'd like to thank the Analytics team, particularly Andrew, for putting
>> > this together; it was a super-helpful discussion to be in and this
>> > sort of product is precisely what I, at least, have been hoping for
>> > out of the AnEng brain trust. Full speed ahead!
>> >
>> > --
>> > Oliver Keyes
>> > Count Logula
>> > Wikimedia Foundation
>> >
>> > _______
>> > Wikimedia-search mailing list
>> > wikimedia-sea...@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> 
> 
> -- 
> 
> 
> Dario Taraborelli  Head of Research, Wikimedia Foundation
> wikimediafoundation.org • nitens.org • @readermeter 
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wikimedia-search] Scaleable Event Systems recap

2015-08-03 Thread Dario Taraborelli
what are the implications (if any) on event validation?
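
(For context, the flow I'm asking about is, schematically, validate-then-produce.
A purely illustrative sketch assuming the jsonschema and kafka-python packages --
not actual EventLogging code; the schema, topic, and broker address are made up:)

    import json
    from jsonschema import ValidationError, validate  # assumed dependency
    from kafka import KafkaProducer                    # assumed dependency

    SCHEMA = {  # made-up stand-in for a real EventLogging schema
        "type": "object",
        "properties": {"schema": {"type": "string"}, "event": {"type": "object"}},
        "required": ["schema", "event"],
    }

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # broker address is illustrative
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    def publish(raw_event):
        """Validate an event against its schema, then hand it to the broker."""
        try:
            validate(instance=raw_event, schema=SCHEMA)
        except ValidationError:
            return False  # invalid events never reach the topic
        producer.send("eventlogging-valid-mixed", raw_event)  # topic name invented
        return True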

On Mon, Aug 3, 2015 at 3:19 PM, Tomasz Finc  wrote:

> Very excited to see this moving forward
>
> On Mon, Aug 3, 2015 at 3:12 PM, Oliver Keyes  wrote:
> > Heyo, Discovery team!
> >
> > (Analytics CCd)
> >
> > This is just a quick writeup of the Scaleable Event Systems meeting
> > that Erik, Dan, Stas and I went to (although just from my
> > perspective).
> >
> > For people not in the initial thread, this is a proposal to replace
> > the internal architecture of EventLogging and similar services with
> > Apache Kafka brokers
> > (http://www.confluent.io/blog/stream-data-platform-1/ ). What that
> > means in practice is that the current 1-2k events/second limit on
> > EventLogging will disappear and we can stop worrying about sampling
> > and accidentally bringing down the system. We can be a lot less
> > cautious about our schemas and a lot less cautious about our sampling
> > rate!
> >
> > It also offers up a lot of opportunities around streaming data and
> > making it available in a layered fashion - while we don't want to
> > explore that right now, I don't think, it's nice to have as an option
> > when we better understand our search data and how we can safely
> > distribute it.
> >
> > I'd like to thank the Analytics team, particularly Andrew, for putting
> > this together; it was a super-helpful discussion to be in and this
> > sort of product is precisely what I, at least, have been hoping for
> > out of the AnEng brain trust. Full speed ahead!
> >
> > --
> > Oliver Keyes
> > Count Logula
> > Wikimedia Foundation
> >
> > ___
> > Wikimedia-search mailing list
> > wikimedia-sea...@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 


*Dario Taraborelli  *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Fwd: Wikipedia Page views access

2015-06-19 Thread Dario Taraborelli
Forwarding a note from Ashok Rao (cc’ed), can anyone comment on the dumps 
server returning 503s?

Ashok – we don’t yet have an in-house API to retrieve pageview data, but the 
Analytics team is working on one: see this thread 
.
Depending on what you’re doing, http://stats.grok.se/  
may also come in handy.

Best,
Dario
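
(Ashok – in the meantime, a simple retry loop with backoff usually gets you
past sporadic 503s. A rough sketch in Python; the filename pattern is
illustrative, so check the directory listing for the exact names:)

    import time
    import requests

    def fetch_with_retry(url, attempts=5, backoff=10):
        """Download one dump file, retrying on transient 5xx errors."""
        for attempt in range(attempts):
            resp = requests.get(url, timeout=60)
            if resp.status_code < 500:
                resp.raise_for_status()
                return resp.content
            time.sleep(backoff * (attempt + 1))  # wait longer each retry
        raise RuntimeError("still failing after {} attempts: {}".format(attempts, url))

    data = fetch_with_retry(
        "https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-05/"
        "pagecounts-20150501-000000.gz")
    with open("pagecounts-20150501-000000.gz", "wb") as f:
        f.write(data)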

> Begin forwarded message:
> 
> From: Ashok Rao 
> Subject: Wikipedia Page views access
> Date: June 18, 2015 at 5:53:12 PM GMT+2
> To: da...@wikimedia.org
> 
> Hi Dario,
> 
> Good morning. I'm a student at the University of Pennsylvania and I've been 
> trying to perform a few analyses based on Wikipedia page views data. I've 
> written a script that grabs data from the main dump site – 
> https://dumps.wikimedia.org/other/pagecounts-raw/ 
>  – but run into many 
> sporadic 503 errors (sometimes with the download link, other times with the 
> main page itself). I noticed some of this data might be available directly on 
> Wikimedia servers that can be utilized for research purposes.
> 
> I was hoping I could get access to this and appreciate your help.
> 
> Best,
> Ashok
> 
> -- 
> Ashok M. Rao
> The Rajendra and Neera Singh Program in Market and Social Systems Engineering
> School of Engineering and Applied Sciences
> University of Pennsylvania | Class of '17

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Dario Taraborelli
Thanks, Gabriel – this is super-helpful. 

Dan/Kevin: slightly OT, are we aware of any use cases related to features that 
would expose PV data in production? I’ve seen mocks from the Discovery 
team with PV data embedded in articles or search interfaces and I’m not sure 
what their status is.
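
(For list members who haven't worked with Cassandra, the constraint Gabriel
describes below boils down to key layout. An illustrative sketch, with invented
table and column names -- not the real RESTBase schema -- assuming the Python
cassandra-driver and a local test keyspace:)

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("analytics")  # test keyspace assumed

    # Partition key fixes (project, article); 'day' is a clustering column, so a
    # contiguous date range within one partition is cheap to read.
    session.execute("""
        CREATE TABLE IF NOT EXISTS pageviews_by_article (
            project text,
            article text,
            day     text,   -- YYYYMMDD sorts correctly as a string
            views   bigint,
            PRIMARY KEY ((project, article), day)
        )""")

    # Efficient: one partition, one range scan.
    rows = session.execute(
        "SELECT day, views FROM pageviews_by_article "
        "WHERE project = %s AND article = %s AND day >= %s AND day < %s",
        ("en.wikipedia", "Main_Page", "20150801", "20150901"))

    # By contrast, "top articles across all projects for a day" would touch
    # every partition -- the kind of global query Cassandra can't index natively.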

> On Jun 9, 2015, at 6:59 PM, Gabriel Wicke  wrote:
> 
> I think Eric's original response got lost, so let me include it below:
> 
> >>> dim1,  dim2,  dim3,  ...,  dimN,  val
> >>>a,  null,  null,  ...,  null,   15// pv for dim1=a
> >>>a, x,  null,  ...,  null,   34// pv for dim1=a & dim2=x
> >>>a, x, 1,  ...,  null,   27// pv for dim1=a & dim2=x &
> >>> dim3=1
> >>>a, x, 1,  ...,  true,  undef  // pv for dim1=a & dim2=x & ...
> >>> & dimN=true
> >>>
> >>> So the size of this dataset would be something between 100M and 200M
> >>> records per year, I think.
> 
> > Could you expound on this a bit?  Is it just the 3 dimensions above
> > (day, project, type), or something more? Also, how will this be
> > queried?  Do we need to query by dimensions arbitrarily, or will the
> > "higher" dimensions always be qualified with matches on the lower
> > ones, as in the example above ( dim1=a, dim1=a & dim2=x, pv for dim1=a
> > & dim2=x & dimN=true)?
> 
> Out of the box, Cassandra is fairly limited in the kind of indexing it 
> provides. Its main data structure is a distributed hash table, with the 
> ability to select a single hierarchical range below a fixed key. This is why 
> Eric asked about whether your query patterns are hierarchical.
> 
> There is some very limited support for secondary indexes, but those work very 
> differently from relational databases (only equality & no efficient global 
> queries), so are only useful in special cases. In RESTBase we have built an 
> abstraction that lets us maintain more useful secondary indexes in regular 
> Cassandra tables. However, this implementation still lacks features like 
> dynamic index creation and authentication, so is not anywhere close to the 
> functionality provided by a relational database.
> 
> In any case, I think it makes sense to isolate analytics backends from 
> production content storage. Trouble in the page view backend should not 
> affect content storage end points. We can still expose and document both in a 
> single public API at /api/rest_v1/ using RESTBase, but are free to use any 
> backend service or storage. The backend service could be built using the 
> service template <https://github.com/wikimedia/service-template-node> and 
> some $DB, a RESTBase instance or any other technology if it makes sense, as 
> long as it exposes a REST-ish API that is reasonably easy to use and proxy.
> 
> Gabriel
> 
> On Tue, Jun 9, 2015 at 7:10 AM, Dario Taraborelli  <mailto:dtarabore...@wikimedia.org>> wrote:
> I too would love to understand if RestBASE can become our default solution 
> for this kind of data-intensive APIs. Can you guys briefly explain what kind 
> of queries and aggregations would be problematic if we were to go with 
> Cassandra?
> 
> > On Jun 9, 2015, at 8:39 AM, Oliver Keyes  > <mailto:oke...@wikimedia.org>> wrote:
> >
> > Remember that (as things currently stand) putting the thing on labs
> > means meta-analytics ("how are the cubes being used?") being a pain in
> > the backside to integrate with our existing storage solutions.
> >
> > On 8 June 2015 at 22:52, Dan Andreescu  > <mailto:dandree...@wikimedia.org>> wrote:
> >>> As always, I'd recommend that we go with tech we are familiar with --
> >>> mysql or cassandra. We have a cassandra committer on staff who would be 
> >>> able
> >>> to answer these questions in detail.
> >>>
> >>>
> >>> WMF uses PostGRES for some things, no?  Or is that is just in labs?
> >>
> >>
> >> Since this data is meant to be fully public and queryable in any way, we
> >> could put it in the PostgreSQL instance on labs.  We should check with labs
> >> folks, perhaps horse trade some hardware, but I think that would be a
> >> splendid solution.
> >>
> >> However, and I'm trying to understate this in case people are not familiar
> >> with my hyperbolic style, I'd rather drink Bud Lite Lime than use MySQL for
> >> this.  MySQL is suited for a lot of things, but analytics is not one of
> >> them.
> >>
> >> p.s. I will never drink Bud Lite Lime.  Like, never.

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-09 Thread Dario Taraborelli
I too would love to understand if RESTBase can become our default solution for 
this kind of data-intensive API. Can you guys briefly explain what kind of 
queries and aggregations would be problematic if we were to go with Cassandra?

> On Jun 9, 2015, at 8:39 AM, Oliver Keyes  wrote:
> 
> Remember that (as things currently stand) putting the thing on labs
> means meta-analytics ("how are the cubes being used?") being a pain in
> the backside to integrate with our existing storage solutions.
> 
> On 8 June 2015 at 22:52, Dan Andreescu  wrote:
>>> As always, I'd recommend that we go with tech we are familiar with --
>>> mysql or cassandra. We have a cassandra committer on staff who would be able
>>> to answer these questions in detail.
>>> 
>>> 
>>> WMF uses PostGRES for some things, no?  Or is that is just in labs?
>> 
>> 
>> Since this data is meant to be fully public and queryable in any way, we
>> could put it in the PostgreSQL instance on labs.  We should check with labs
>> folks, perhaps horse trade some hardware, but I think that would be a
>> splendid solution.
>> 
>> However, and I'm trying to understate this in case people are not familiar
>> with my hyperbolic style, I'd rather drink Bud Lite Lime than use MySQL for
>> this.  MySQL is suited for a lot of things, but analytics is not one of
>> them.
>> 
>> p.s. I will never drink Bud Lite Lime.  Like, never.
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>> 
> 
> 
> 
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Fwd: [Wikitech-l] API BREAKING CHANGE: Default continuation mode for action=query will change at the end of this month

2015-06-03 Thread Dario Taraborelli
Many people on these lists design and use tools that depend on action=query 
(beyond bots). If you do, please read the following:

> Begin forwarded message:
> 
> From: "Brad Jorsch (Anomie)" 
> Subject: [Wikitech-l] API BREAKING CHANGE: Default continuation mode for 
> action=query will change at the end of this month
> Date: June 2, 2015 at 10:42:47 PM GMT+2
> To: Wikimedia developers , 
> mediawiki-api-annou...@lists.wikimedia.org
> Reply-To: Wikimedia developers 
> 
> As has been announced several times (most recently at
> https://lists.wikimedia.org/pipermail/wikitech-l/2015-April/081559.html),
> the default continuation mode for action=query requests to api.php will be
> changing to be easier for new coders to use correctly.
> 
> *The date is now set:* we intend to merge the change to ride the deployment
> train at the end of June. That should be 1.26wmf12, to be deployed to test
> wikis on June 30, non-Wikipedias on July 1, and Wikipedias on July 2.
> 
> If your bot or script is receiving the warning about this upcoming change
> (as seen here
> , for
> example), it's time to fix your code!
> 
>   - The simple solution is to simply include the "rawcontinue" parameter
>   with your request to continue receiving the raw continuation data (
>   example
>   
> ).
>   No other code changes should be necessary.
>   - Or you could update your code to use the simplified continuation
>   documented at https://www.mediawiki.org/wiki/API:Query#Continuing_queries
>   (example
>   ),
>   which is much easier for clients to implement correctly.
> 
> Either of the above solutions may be tested immediately, you'll know it
> works because you stop seeing the warning.
> 
> I've compiled a list of bots that have hit the deprecation warning more
> than 1 times over the course of the week May 23–29. If you are
> responsible for any of these bots, please fix them. If you know who is,
> please make sure they've seen this notification. Thanks.
> 
> AAlertBot
> AboHeidiBot
> AbshirBot
> Acebot
> Ameenbot
> ArnauBot
> Beau.bot
> Begemot-Bot
> BeneBot*
> BeriBot
> BOT-Superzerocool
> CalakBot
> CamelBot
> CandalBot
> CategorizationBot
> CatWatchBot
> ClueBot_III
> ClueBot_NG
> CobainBot
> CorenSearchBot
> Cyberbot_I
> Cyberbot_II
> DanmicholoBot
> DeltaQuadBot
> Dexbot
> Dibot
> EdinBot
> ElphiBot
> ErfgoedBot
> Faebot
> Fatemibot
> FawikiPatroller
> HAL
> HasteurBot
> HerculeBot
> Hexabot
> HRoestBot
> IluvatarBot
> Invadibot
> Irclogbot
> Irfan-bot
> Jimmy-abot
> JYBot
> Krdbot
> Legobot
> Lowercase_sigmabot_III
> MahdiBot
> MalarzBOT
> MastiBot
> Merge_bot
> NaggoBot
> NasirkhanBot
> NirvanaBot
> Obaid-bot
> PatruBOT
> PBot
> Phe-bot
> Rezabot
> RMCD_bot
> Shuaib-bot
> SineBot
> SteinsplitterBot
> SvickBOT
> TaxonBot
> Theo's_Little_Bot
> W2Bot
> WLE-SpainBot
> Xqbot
> YaCBot
> ZedlikBot
> ZkBot
> 
> 
> -- 
> Brad Jorsch (Anomie)
> Software Engineer
> Wikimedia Foundation
> ___
> Wikitech-l mailing list
> wikitec...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
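
For tool authors on this list: the simplified continuation Brad points to
boils down to a short loop. A rough sketch (Python with requests; see the
API:Query page linked above for the authoritative version):

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def query_all(params):
        """Yield each batch of results for an action=query request,
        following new-style continuation."""
        params = dict(params, action="query", format="json",
                      **{"continue": ""})  # opt in until it becomes the default
        while True:
            data = requests.get(API, params=params).json()
            yield data
            if "continue" not in data:
                break
            params.update(data["continue"])  # merge tokens into the next request

    # Example: page through the members of a category.
    base = {"list": "categorymembers", "cmtitle": "Category:Physics",
            "cmlimit": "50"}
    for batch in query_all(base):
        for member in batch["query"]["categorymembers"]:
            print(member["title"])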


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] The awful truth about Wikimedia's article counts

2015-05-22 Thread Dario Taraborelli
On May 22, 2015, at 2:15 PM, Erik Zachte  wrote:
> 
> Historically consistent? Hmm, the article's main story is about how 
> historical in-wiki data are unreliable and a periodic recount is needed. Just 
> saying.

by “historically consistent” I mean not subject to arbitrary changes making 
measurement foo at time t1 incommensurable with foo at time t2. Aaron and I put 
a good deal of thinking into how to avoid recounts or issues due to arbitrary 
software configuration changes.

> And the main theme in the comments is “do we care about article count?”

agreed. I added a note in the comments on work related to quality assessment.


> -Original Message-
> From: analytics-boun...@lists.wikimedia.org 
> [mailto:analytics-boun...@lists.wikimedia.org] On Behalf Of Dario Taraborelli
> Sent: Friday, May 22, 2015 21:38
> To: A mailing list for the Analytics Team at WMF and everybody who has an 
> interest in Wikipedia and analytics.
> Subject: [Analytics] The awful truth about Wikimedia's article counts
> 
> From this week’s Signpost, worth reading: 
> 
>   
> https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-05-20/In_focus
> 
> this is a great illustration of why we need stateless, historically and 
> globally consistent measurements to report the growth of Wikimedia projects 
> (and particularly why the legacy definition of a “countable” article is 
> ridiculously problematic):
> 
>   
> https://meta.wikimedia.org/wiki/Research:Refining_the_definition_of_monthly_active_editors#Principles
>   https://meta.wikimedia.org/wiki/Research:Metrics_standardization
> 
> Dario
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] The awful truth about Wikimedia's article counts

2015-05-22 Thread Dario Taraborelli
From this week’s Signpost, worth reading: 


https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-05-20/In_focus

this is a great illustration of why we need stateless, historically and 
globally consistent measurements to report the growth of Wikimedia projects 
(and particularly why the legacy definition of a “countable” article is 
ridiculously problematic):


https://meta.wikimedia.org/wiki/Research:Refining_the_definition_of_monthly_active_editors#Principles
https://meta.wikimedia.org/wiki/Research:Metrics_standardization

Dario
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [WikimediaMobile] Share a Fact Initial Analysis

2015-05-22 Thread Dario Taraborelli
Thanks for sharing this, Adam. Aside from engagement/funnel data, the critical 
question for this feature is: does it bring back eyeballs to the site from 
social media? It looks like it doesn’t yet, at least not in a substantial way, 
even with the caveat that app traffic is a very small fraction of total mobile 
traffic. 

Having looked into referrals for this feature before, and having compared them 
to Twitter’s own engagement analytics (and found some big discrepancies), you 
should consider removing spiders/crawlers from the data (see [1]) to avoid 
inflating pageviews with non-human activity.

I’m a big fan of this feature and look forward to seeing how you guys intend to 
scale it.

Dario

[1] 
https://github.com/ewulczyn/wmf/blob/b9f726ee3468852c3fed2780af1d8ac0004eda73/mc/oozie/hive_query.sql#L60
 



> On May 21, 2015, at 12:37 PM, Toby Negrin  wrote:
> 
> Hi all - some interesting analysis on the share-a-fact feature from the 
> mobile team. 
> 
> -Toby
> 
> Begin forwarded message:
> 
>> From: Adam Baso mailto:ab...@wikimedia.org>>
>> Date: May 21, 2015 at 12:05:29 PDT
>> To: mobile-l > >
>> Subject: [WikimediaMobile] Share a Fact Initial Analysis
>> 
>> Hello all,
>> 
>> We’ve been looking at some initial results from the Share a Fact feature 
>> introduced on the Wikipedia apps for Android and iOS in its basic "minimal 
>> viable product" implementation. Here’s some analysis, using data from one 
>> day (20150512) with respect to the latest stable versions of the apps 
>> (2.0-r-2015-04-23 on Android and 4.1.2 on iOS) for that day.
>> 
>> * On iOS, when a user initiates the first step of the default sharing 
>> workflow - tapping the up-arrow box share button (6,194 non-highlighting 
>> instances for the day under question) - about 11.7% of the time it yielded 
>> successful sharing.
>> 
>> * On Android, it’s not possible to easily tell when the sharing workflow was 
>> carried through to successful share, but we anticipate the Android success 
>> rate is currently much higher, as general engagement percentage up to the 
>> point of picking an app for sharing is higher on Android than on iOS.
>> 
>> * On Android, when presented with the share card preview, 28.0% of the time 
>> the ‘Share as image’ button was tapped and 55.5% of the time the 'Share as 
>> text' button was tapped, whereas on iOS it was 8.4% ‘Share as image’ and 
>> 16.8% ‘Share as text’.
>> 
>> * The forthcoming 4.1.4 version of the iOS app will relax its default 
>> sharing snippet generation rules and be more like the Android version in 
>> that respect. We anticipate this will result in higher engagement with both 
>> the ‘Share as image’ and ‘Share as text’ buttons on iOS, and we should be 
>> able to verify this once the 4.1.4 iOS version is released and generally 
>> adopted (usually takes 4-5 days after release; the 4.1.4 release isn’t 
>> released yet).
>> 
>> * On the Android app the ‘Share’ option is located on the overflow menu, not 
>> as part of the main set of UI buttons. This potentially increases the 
>> likelihood of Android users being primed to step through the workflow. On 
>> the iOS app, the share button (up-arrow box) is plainly visible from the 
>> main UI and not an overflow menu, and this probably creates a different 
>> priming dynamic for the iOS demographic.
>> 
>> * When users on iOS tapped on the ‘Share as image’ or ‘Share as text’ 
>> buttons, there is a pretty sharp drop off at the next stage - the system 
>> sharesheet. Once the sharesheet was presented to iOS users, 41.6% of the 
>> time it resulted in active abandonment. We believe this probably has 
>> something to do with the relatively small set of default apps listed on the 
>> sharesheet and the extra work involved with exposing additional social apps 
>> for sharing in that context. As with the Android app, the labels of ‘Share 
>> as image’ and ’Share as text’ may also pose something of a hurdle at least 
>> for first time users of the feature. To this end, there is an onboarding 
>> tutorial planned at least on Android.
>> 
>> * For a one hour period (2015051201) there were about 100 pageviews in some 
>> sense attributable to Share a Fact using a provenance parameter available on 
>> the latest stable versions of the apps at that time; this may slightly 
>> overstate the number of pageviews attributable to the two specific apps 
>> reviewed in this analysis, but probably not too much (n.b., previously a 
>> different source parameter was used than the new wprov provenance 
>> parameter). Pageviews are not the sole motivation for the feature, but 
>> following the trendline over the long run should be interesting. Impact on 
>> social media and the destinations of shares is a little harder to capture 
>> directly, but 
>> https://twitter.com/search?f=realtime&q=%40wikipedia%20-%40itzwikipedi

Re: [Analytics] [Maniphest] [Commented On] T44259: Make domas' pageviews data available in semi-publicly queryable database format

2015-05-21 Thread Dario Taraborelli
the title of the Phab ticket is obsolete; the plan is not to work off the 
existing hourly PV dumps, per Dan’s note.

> On May 21, 2015, at 9:50 AM, Oliver Keyes  wrote:
> 
> Why is the work on Domas's data, which we know is incredibly
> unreliable? Because people still rely on it?
> 
> On 21 May 2015 at 12:39, Dan Andreescu  wrote:
>> Thanks Dario, I should've thought to do the same.  As I say in my comment,
>> I'd love to get a discussion going here.  This project has been in the dark
>> and postponed for too long, and now that we're focusing on it everyone
>> deserves our direct thoughts on it.  Everyone here also has the right to
>> directly influence our thoughts and plans.  So please, don't be shy :)
>> 
>> On Thu, May 21, 2015 at 12:36 PM, Dario Taraborelli
>>  wrote:
>>> 
>>> Dan –  thanks for the thorough update, hope you don’t mind if I repost
>>> this to the analytics list – I bet several people on this list are eager to
>>> know where this is going.
>>> 
>>> Dario
>>> 
>>> Begin forwarded message:
>>> 
>>> 
>>> From: Milimetric 
>>> Subject: [Maniphest] [Commented On] T44259: Make domas' pageviews data
>>> available in semi-publicly queryable database format
>>> Date: May 21, 2015 at 9:31:36 AM PDT
>>> To: da...@wikimedia.org
>>> Reply-To: t44259+public+a4a5010c21d15...@phabricator.wikimedia.org
>>> 
>>> Milimetric added a comment.
>>> 
>>> I'd love to start a more open discussion about our progress on this.
>>> Here's the recent history and where we are:
>>> 
>>> February 2015: with data flowing into the Hadoop cluster, we defined which
>>> raw webrequests were "page views". The research is here and the code is here
>>> March 2015: we used this page view definition to create a raw pageview
>>> table in Hadoop. This is queryable by Hive but it's about 3 TB per day of
>>> data. So we don't have the resources to expose it publicly
>>> April 2015: we used this data internally to query but it overloaded our
>>> cluster and queries were slow
>>> May 2015: we're working on an intermediate aggregation that would total up
>>> page counts by hour over the dimensions that we think most people care
>>> about. We estimate this will cut down size by a factor of 50
>>> 
>>> Progress has been slow mostly because Event Logging is our main priority
>>> and it's been having serious scaling issues. We think we have a good handle
>>> on the Event Logging issues after our latest patch, and in a week or so
>>> we're going to mostly focus on the Pageview API.
>>> 
>>> Once this new intermediate aggregation is done, we'll hopefully free up
>>> some cluster resources and be in a better position to load up a public API.
>>> Right now, we are evaluating two possible data pipelines:
>>> 
>>> Pipeline 1:
>>> 
>>> Put daily aggregates into PostgreSQL. We think per article hourly data
>>> would be too big for PostgreSQL.
>>> 
>>> Pipeline 2:
>>> 
>>> Query data from the Hive tables directly with Impala. Impala is good for
>>> medium to small data, but is much faster than Hive. We might be able to
>>> query the hourly data if we use this method.
>>> 
>>> Common Pipeline after we make the choice above:
>>> 
>>> Mondrian builds OLAP cubes and handles caching which is very useful with
>>> this much data
>>> point RESTBase to Mondrian and expose API publicly at
>>> restbase.wikimedia.org. This will be a reliable public API that people can
>>> build tools around
>>> point Saiku to Mondrian and make a new public website for exploratory
>>> analytics. Saiku is an open source OLAP cube visualization and analysis tool
>>> 
>>> Hope that helps. As we get closer to making this API real, we would love
>>> your input, participation, questions, etc.
>>> 
>>> 
>>> TASK DETAIL
>>> https://phabricator.wikimedia.org/T44259
>>> 
>>> EMAIL PREFERENCES
>>> https://phabricator.wikimedia.org/settings/panel/emailpreferences/
>>> 
>>> To: Milimetric
>>> Cc: Daniel_Mietchen, PKM, jeremyb, Arjunaraoc, Mr.Z-man, Tbayer, Elitre,
>>> scfc, Milimetric, Legoktm, drdee, Nemo_bis, Tnegrin, -jem-, DarTar, jayvdb,
>>> Aubrey, Ricordisamoa, MZMcBride, Magnus, MrBlueSky, Multichill
>>> 
>>> 
>>> 
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>> 
>> 
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>> 
> 
> 
> 
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Fwd: [Maniphest] [Commented On] T44259: Make domas' pageviews data available in semi-publicly queryable database format

2015-05-21 Thread Dario Taraborelli
Dan –  thanks for the thorough update, hope you don’t mind if I repost this to 
the analytics list – I bet several people on this list are eager to know where 
this is going.

Dario

Begin forwarded message:
> 
> From: Milimetric 
> Subject: [Maniphest] [Commented On] T44259: Make domas' pageviews data 
> available in semi-publicly queryable database format
> Date: May 21, 2015 at 9:31:36 AM PDT
> To: da...@wikimedia.org
> Reply-To: t44259+public+a4a5010c21d15...@phabricator.wikimedia.org
> 
> Milimetric added a comment.
> 
> I'd love to start a more open discussion about our progress on this. Here's 
> the recent history and where we are:
> 
> February 2015: with data flowing into the Hadoop cluster, we defined which 
> raw webrequests were "page views". The research is here 
>  and the code is here 
> 
> March 2015: we used this page view definition to create a raw pageview table 
> in Hadoop. This is queryable by Hive but it's about 3 TB per day of data. So 
> we don't have the resources to expose it publicly
> April 2015: we used this data internally to query but it overloaded our 
> cluster and queries were slow
> May 2015: we're working on an intermediate aggregation that would total up 
> page counts by hour over the dimensions that we think most people care about. 
> We estimate this will cut down size by a factor of 50
> Progress has been slow mostly because Event Logging is our main priority and 
> it's been having serious scaling issues. We think we have a good handle on 
> the Event Logging issues after our latest patch, and in a week or so we're 
> going to mostly focus on the Pageview API.
> 
> Once this new intermediate aggregation is done, we'll hopefully free up some 
> cluster resources and be in a better position to load up a public API. Right 
> now, we are evaluating two possible data pipelines:
> 
> Pipeline 1:
> 
> Put daily aggregates into PostgreSQL. We think per article hourly data would 
> be too big for PostgreSQL.
> Pipeline 2:
> 
> Query the Hive tables directly with Impala. Impala is suited to small and 
> medium-sized data and is much faster than Hive; with this method we might even 
> be able to query the hourly data.
> Common Pipeline after we make the choice above:
> 
> - Mondrian builds OLAP cubes and handles caching, which is very useful with 
> this much data
> - Point RESTBase to Mondrian and expose the API publicly at 
> restbase.wikimedia.org. This will be a reliable public API that people can 
> build tools around
> - Point Saiku to Mondrian and make a new public website for exploratory 
> analytics. Saiku is an open-source OLAP cube visualization and analysis tool
> Hope that helps. As we get closer to making this API real, we would love your 
> input, participation, questions, etc.
> 
> 
> TASK DETAIL
> https://phabricator.wikimedia.org/T44259 
> 
> EMAIL PREFERENCES
> https://phabricator.wikimedia.org/settings/panel/emailpreferences/ 
> 
> To: Milimetric
> Cc: Daniel_Mietchen, PKM, jeremyb, Arjunaraoc, Mr.Z-man, Tbayer, Elitre, 
> scfc, Milimetric, Legoktm, drdee, Nemo_bis, Tnegrin, -jem-, DarTar, jayvdb, 
> Aubrey, Ricordisamoa, MZMcBride, Magnus, MrBlueSky, Multichill
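
To put numbers on the aggregation step above: at roughly 3 TB of raw pageview
rows per day, the estimated 50x reduction would bring the intermediate table
down to about 60 GB per day. And to make the roll-up concrete, here is a
minimal sketch in Python of the kind of hourly aggregation being described;
every table and column name is an illustrative assumption, not the actual
Hadoop schema:

    # Sketch of the hourly intermediate aggregation discussed above.
    # Table and column names are illustrative assumptions, not the
    # actual Hadoop schema.
    import subprocess

    ROLLUP_SQL = """
    INSERT OVERWRITE TABLE pageview_hourly
    SELECT project, page_title, year, month, day, hour,
           COUNT(*) AS view_count
    FROM pageview_raw
    GROUP BY project, page_title, year, month, day, hour
    """

    # Run through the Hive CLI; any Hive client would work as well.
    subprocess.run(["hive", "-e", ROLLUP_SQL], check=True)

Either candidate pipeline then serves this much smaller table: pipeline 1
would copy the daily totals into PostgreSQL, pipeline 2 would point Impala at
the same Hive tables.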

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] May 2015 research showcase

2015-05-13 Thread Dario Taraborelli
a reminder that the showcase will start at 11.30 PT. Broadcast link: 
http://youtu.be/Hj7o5d-OEis   

> On May 11, 2015, at 4:27 PM, Leila Zia  wrote:
> 
> Hi everyone,
> 
> The next research showcase will be live-streamed this Wednesday, May 13 at 
> 11.30 PT. The streaming link will be posted on the lists a few minutes before 
> the showcase starts and as usual, you can join the conversation on IRC at 
> #wikimedia-research.
> 
> We look forward to seeing you!
> 
> Leila
> 
> This month
> 
> The people's classifier: Towards an open model for algorithmic infrastructure
> By Aaron Halfaker 
> 
> Recent research has suggested that Wikipedia's algorithmic infrastructure is 
> perpetuating social issues. However, these same algorithmic tools are 
> critical to maintaining the efficiency of open projects like Wikipedia at 
> scale. But rather than simply critiquing algorithmic wiki-tools and calling 
> for less algorithmic infrastructure, I'll propose a different strategy -- an 
> open approach to building this algorithmic infrastructure. In this 
> presentation, I'll demo a set of services that are designed to open up a 
> critical part of Wikipedia's quality control infrastructure -- machine 
> classifiers. I'll also discuss how this strategy unites critical/feminist HCI 
> with more dominant narratives about efficiency and productivity.
> 
> Social transparency online
> By Jennifer Marlow  and Laura Dabbish 
> 
> 
> An emerging Internet trend is greater social transparency, such as the use of 
> real names in social networking sites, feeds of friends' activities, traces 
> of others' re-use of content, and visualizations of team interactions. There 
> is a potential for this transparency to radically improve coordination, 
> particularly in open collaboration settings like Wikipedia. In this talk, we 
> will describe some of our research identifying how transparency influences 
> collaborative performance in online work environments. First, we have been 
> studying professional social networking communities. Social media allows 
> individuals in these communities to create an interest network of people and 
> digital artifacts, and get moment-by-moment updates about actions by those 
> people or changes to those artifacts. It affords an unprecedented level of 
> transparency about the actions of others over time. We will describe 
> qualitative work examining how members of these communities use transparency 
> to accomplish their goals. Second, we have been looking at the impact of 
> making workflows transparent. In a series of field experiments we are 
> investigating how socially transparent interfaces, and activity trace 
> information in particular, influence perceptions and behavior towards others 
> and evaluations of their work.
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Fwd: "pagecounts-all-sites" has halted

2015-05-01 Thread Dario Taraborelli
Relaying from Andrew West:

> Hi analytics folks,
> 
> Quickly browsed your mailing list and didn't see anything about this so I 
> thought I would write.
> 
> The "pagecounts-all-sites" reporting seems to have halted. The last hour 
> processed is 20150430-18. I was alerted when my ingestion over the 
> evening did not find the expected 24 files.
> 
> [http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-04/]
> 
> Thanks, -AW
> 
> -- 
> Andrew G. West
> Sr. Research Scientist
> Verisign Labs - Reston, VA
> http://www.andrew-g-west.com
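
A minimal sketch of the kind of ingestion check Andrew describes, probing one
day's directory for the 24 expected hourly files. The file-name pattern
(pagecounts-YYYYMMDD-HH0000.gz) is an assumption based on the pagecounts
naming convention; verify it against the directory listing before relying
on it:

    # Sketch: count how many of the 24 expected hourly files exist for
    # one day. The file-name pattern is an assumption; check it against
    # the actual directory listing.
    import requests

    BASE = "http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-04"
    DAY = "20150430"

    missing = []
    for hour in range(24):
        name = "pagecounts-%s-%02d0000.gz" % (DAY, hour)
        if requests.head("%s/%s" % (BASE, name)).status_code != 200:
            missing.append(name)
    print("missing %d of 24 files:" % len(missing), missing)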


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Technical] WMF-Last-Access

2015-04-27 Thread Dario Taraborelli
I also noticed the cookie stores a string with a 3-letter month (27-Apr-2015); 
any reason not to use a shorter ISO date instead (2015-04-27)?
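
For reference, the two formats side by side; note that strftime's %b month
abbreviation is locale-dependent, which is one more argument for the ISO form:

    # The cookie's current format vs the ISO 8601 alternative.
    from datetime import date

    d = date(2015, 4, 27)
    print(d.strftime("%d-%b-%Y"))  # 27-Apr-2015 (%b varies with locale)
    print(d.isoformat())           # 2015-04-27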

> On Apr 27, 2015, at 3:00 PM, Marcel Ruiz Forns  wrote:
> 
> +1 'last'
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Research & Data quarterly review

2015-04-17 Thread Dario Taraborelli
An overview of what the Wikimedia Foundation’s Research & Data team has been up 
to, this past quarter (fiscal Q3, 2014-15):
https://commons.wikimedia.org/wiki/File:Analytics_Quarterly_Review_Q3_2014-15_(Research_and_Data).pdf

Dario
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] April 2015 research showcase: remix and reuse in collaborative communities; the oral citations debate

2015-04-16 Thread Dario Taraborelli
I am thrilled to announce our speaker lineup for this month’s research showcase.

Jeff Nickerson (Stevens Institute of Technology) will talk about remix and 
reuse in collaborative communities; Heather Ford (Oxford Internet Institute) 
will present an overview of the oral citations debate in the English Wikipedia.

The showcase will be recorded and publicly streamed at 11.30 PT on Thursday, 
April 30 (livestream link will follow). We’ll hold a discussion and take 
questions from remote attendees via the Wikimedia Research IRC channel 
(#wikimedia-research  
on freenode) as usual.

Looking forward to seeing you there.

Dario


Creating, remixing, and planning in open online communities
Jeff Nickerson
Paradoxically, users in remixing communities don’t remix very much. But an 
analysis of one remix community, Thingiverse, shows that those who actively 
remix end up producing work that is in turn more likely to be remixed. What does 
this suggest about Wikipedia editing? Wikipedia allows more types of 
contribution, because creating and editing pages are done in a planning 
context: plans are discussed on particular loci, including project talk pages. 
Plans on project talk pages lead to both creation and editing; some editors 
specialize in making article changes and others, who tend to have more 
experience, focus on planning rather than acting. Contributions can happen at 
the level of the article and also at a series of meta levels. Some patterns of 
behavior – with respect to creating versus editing and acting versus planning – 
are likely to lead to more sustained engagement and to higher quality work. 
Experiments are proposed to test these conjectures.
Authority, power and culture on Wikipedia: The oral citations debate
Heather Ford
In 2011, Wikimedia Foundation Advisory Board member Achal Prabhala was funded 
by the WMF to run a project called 'People are knowledge', or the Oral citations 
project. The goal of 
the project was to respond to the dearth of published material about topics of 
relevance to communities in the developing world and, although the majority of 
articles in languages other than English remain intact, the English editions of 
these articles have had their oral citations removed. I ask why this happened, 
what the policy implications are for oral citations generally, and what steps 
can be taken in the future to respond to the problem that this project (and 
more recent versions of it) set out to 
solve. This talk comes out of an ethnographic project in which I have 
interviewed some of the actors involved in the original oral citations project, 
including the majority of editors of the surr article that I trace in a 
chapter of my PhD [1].

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Page views on a more frequent than hourly basis

2015-04-15 Thread Dario Taraborelli
thanks, both. Let's go ahead with English only and no spiders filtered or
mobile/desktop breakdown, per Oliver.

Michelle – given the aggregation level I am fine moving forward with this
release, but let me know off-thread if you have any questions.

Dario
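
For anyone trying to picture the release, a minimal sketch of the aggregation
level under discussion, assuming a plain text input with one request timestamp
per line (the file name and layout are hypothetical stand-ins):

    # Sketch: roll raw request timestamps up into per-second counts,
    # the aggregation level discussed in this thread. The input file
    # (one ISO 8601 timestamp per line) is a hypothetical stand-in.
    import pandas as pd

    ts = pd.read_csv("enwiki_requests.txt", names=["timestamp"],
                     parse_dates=["timestamp"])
    per_second = ts.set_index("timestamp").resample("1s").size()
    print(per_second.describe())  # the min is the smallest per-second bin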

On Wed, Apr 15, 2015 at 9:53 AM, Oliver Keyes  wrote:

> Dario,
>
> No spider filtering, and no split between mobile and desktop; mobile
> and desktop are grouped.
>
> On 15 April 2015 at 12:46, Hirav Gandhi  wrote:
> > e.g. German*
> >
> > I need more coffee.
> >
> >
> >
> > On Wed, Apr 15, 2015 at 9:35 AM, Hirav Gandhi 
> > wrote:
> >>
> >> Dario - we just want a representative sample of traffic for a popular
> >> site like Wikipedia. We thought limiting to the English Wikipedia would
> >> be easier.
> >>
> >> If we get aggregated data across all language Wikipedia sites, we would
> >> need some way to tease out which language is being queried when. Some
> >> languages (e.g. German) we would hypothesize would have more daily
> >> seasonality than languages like English.
> >>
> >>
> >>
> >> On Wed, Apr 15, 2015 at 9:32 AM, Dario Taraborelli
> >>  wrote:
> >>>
> >>> Hirav, Bharath – I also want to hear from you if there's a specific
> >>> reason to ask for English Wikipedia only or if a dataset encompassing
> >>> aggregate pageviews across all Wikimedia properties would do the job.
> >>>
> >>> Dario
> >>>
> >>> On Wed, Apr 15, 2015 at 9:09 AM, Dario Taraborelli
> >>>  wrote:
> >>>>
> >>>> Oliver -- thanks for running a preliminary check, I'm fine releasing
> >>>> this data in aggregate under CC0, I believe it would be valuable for
> this
> >>>> and other research projects (copying Michelle from Legal).
> >>>>
> >>>> Before we do so, though, I want to confirm the specs: aggregate
> >>>> pageviews per second to English Wikipedia, excluding bot traffic,
> broken
> >>>> down by access method (mobile web vs desktop site, not apps) for a
> 60-day
> >>>> period. Oliver – are these the filters you used to identify the data
> point
> >>>> with the smallest number of observations?
> >>>>
> >>>> Obviously, we will need to take into account this release when we
> start
> >>>> working on projects such as
> >>>>
> https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_edits
> >>>> and
> >>>>
> https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
> >>>>
> >>>> Dario
> >>>>
> >>>> On Mon, Apr 13, 2015 at 9:37 PM, Oliver Keyes 
> >>>> wrote:
> >>>>>
> >>>>> Bumping for Dario, per Pine's excellent example :)
> >>>>>
> >>>>> On 13 April 2015 at 22:18, Hirav Gandhi 
> wrote:
> >>>>> > Oliver: Two months is fine. Thank you so much for your help!
> >>>>> >
> >>>>> >> On Apr 13, 2015, at 4:40 PM,
> analytics-requ...@lists.wikimedia.org
> >>>>> >> wrote:
> >>>>> >>
> >>>>> >> Send Analytics mailing list submissions to
> >>>>> >>   analytics@lists.wikimedia.org
> >>>>> >>
> >>>>> >> To subscribe or unsubscribe via the World Wide Web, visit
> >>>>> >>   https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>>> >> or, via email, send a message with subject or body 'help' to
> >>>>> >>   analytics-requ...@lists.wikimedia.org
> >>>>> >>
> >>>>> >> You can reach the person managing the list at
> >>>>> >>   analytics-ow...@lists.wikimedia.org
> >>>>> >>
> >>>>> >> When replying, please edit your Subject line so it is more
> specific
> >>>>> >> than "Re: Contents of Analytics digest..."
> >>>>> >>
> >>>>> >>
> >>>>> >> Today's Topics:
> >>>>> >>
> >>>>> >>   1. Re: Page views on a more frequent than hourly basis (Pine W)
> >>>>> >>   2. Re: Page views on a more frequent than hourly basis (Hirav
&

Re: [Analytics] Page views on a more frequent than hourly basis

2015-04-15 Thread Dario Taraborelli
Hirav, Bharath – I also want to hear from you if there's a specific reason
to ask for English Wikipedia only or if a dataset encompassing aggregate
pageviews across all Wikimedia properties would do the job.

Dario

On Wed, Apr 15, 2015 at 9:09 AM, Dario Taraborelli <
dtarabore...@wikimedia.org> wrote:

> Oliver -- thanks for running a preliminary check, I'm fine releasing this
> data in aggregate under CC0, I believe it would be valuable for this and
> other research projects (copying Michelle from Legal).
>
> Before we do so, though, I want to confirm the specs: aggregate pageviews
> per second to English Wikipedia, excluding bot traffic, broken down by
> access method (mobile web vs desktop site, not apps) for a 60-day period.
> Oliver – are these the filters you used to identify the data point with the
> smallest number of observations?
>
> Obviously, we will need to take into account this release when we start
> working on projects such as
> https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_edits
> and
> https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
>
> Dario
>
> On Mon, Apr 13, 2015 at 9:37 PM, Oliver Keyes 
> wrote:
>
>> Bumping for Dario, per Pine's excellent example :)
>>
>> On 13 April 2015 at 22:18, Hirav Gandhi  wrote:
>> > Oliver: Two months is fine. Thank you so much for your help!
>> >
>> >> On Apr 13, 2015, at 4:40 PM, analytics-requ...@lists.wikimedia.org
>> wrote:
>> >>
>> >> Send Analytics mailing list submissions to
>> >>   analytics@lists.wikimedia.org
>> >>
>> >> To subscribe or unsubscribe via the World Wide Web, visit
>> >>   https://lists.wikimedia.org/mailman/listinfo/analytics
>> >> or, via email, send a message with subject or body 'help' to
>> >>   analytics-requ...@lists.wikimedia.org
>> >>
>> >> You can reach the person managing the list at
>> >>   analytics-ow...@lists.wikimedia.org
>> >>
>> >> When replying, please edit your Subject line so it is more specific
>> >> than "Re: Contents of Analytics digest..."
>> >>
>> >>
>> >> Today's Topics:
>> >>
>> >>   1. Re: Page views on a more frequent than hourly basis (Pine W)
>> >>   2. Re: Page views on a more frequent than hourly basis (Hirav Gandhi)
>> >>   3. Re: Page views on a more frequent than hourly basis (Oliver Keyes)
>> >>
>> >>
>> >> --
>> >>
>> >> Message: 1
>> >> Date: Mon, 13 Apr 2015 13:34:23 -0700
>> >> From: Pine W 
>> >> To: "A mailing list for the Analytics Team at WMF and everybody who
>> >>   has an  interest in Wikipedia and analytics."
>> >>   
>> >> Subject: Re: [Analytics] Page views on a more frequent than hourly
>> >>   basis
>> >> Message-ID:
>> >>   > dyjjzmdfthz+0+lwnhb9m8xuod4wetgcfuxyb9qyf7cy...@mail.gmail.com>
>> >> Content-Type: text/plain; charset="utf-8"
>> >>
>> >> Hi Oliver, re ccing people who are on list, this is the protocol we
>> >> followed in IEGCom to ping people who are subscribed and mentioned in
>> >> certain emails but, like many of us, may automatically move emails from
>> >> lists directly to folders where they may be unread for days. So there
>> is a
>> >> reason to do this.
>> >>
>> >> Thanks,
>> >>
>> >> Pine
>> >> -- next part --
>> >> An HTML attachment was scrubbed...
>> >> URL: <
>> https://lists.wikimedia.org/pipermail/analytics/attachments/20150413/aac0ef89/attachment-0001.html
>> >
>> >>
>> >> --
>> >>
>> >> Message: 2
>> >> Date: Mon, 13 Apr 2015 16:30:43 -0700
>> >> From: Hirav Gandhi 
>> >> To: analytics@lists.wikimedia.org
>> >> Subject: Re: [Analytics] Page views on a more frequent than hourly
>> >>   basis
>> >> Message-ID:
>> >>   > uojpt2vxbnfmhcipqn1pumace-...@mail.gmail.com>
>> >> Content-Type: text/plain; charset="utf-8"
>> >>
>> >> Thanks Oliver!
>> >>
>> >> We would like this data for as broad of a time period as you can
>> muster.
>> >> T

Re: [Analytics] Page views on a more frequent than hourly basis

2015-04-15 Thread Dario Taraborelli
> >>>>> Then that sounds much more viable. I'll run a quick test now to see
> >>>>> how much clustering we'd see at, say, the one-second resolution
> level,
> >>>>> and throw it out here so we can make more informed decisions about a
> >>>>> data release on this.
> >>>>>
> >>>>> On 13 April 2015 at 08:08, Hirav Gandhi 
> wrote:
> >>>>>> Hi Oliver,
> >>>>>>
> >>>>>> Re: Hirav: would you be looking for temporally /and/ contextually
> >>>>>> granular
> >>>>>> pageviews, i.e. "a view to X page at Y time", or just temporally
> >>>>>> granular,
> >>>>>> so "a view to a page on enwiki at X time"? If the latter you've got
> >>>>>> more of
> >>>>>> a shot, I suspect.
> >>>>>>
> >>>>>> I only want the latter - I am not concerned with the context so
> much as
> >>>>>> just
> >>>>>> “a view to a page on enwiki at X time.”
> >>>>>>
> >>>>>> Hirav
> >>>>>>
> >>>>>>
> >>>>>> On Apr 13, 2015, at 5:00 AM, analytics-requ...@lists.wikimedia.org
> >>>>>> wrote:
> >>>>>>
> >>>>>> Send Analytics mailing list submissions to
> >>>>>> analytics@lists.wikimedia.org
> >>>>>>
> >>>>>> To subscribe or unsubscribe via the World Wide Web, visit
> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>>>> or, via email, send a message with subject or body 'help' to
> >>>>>> analytics-requ...@lists.wikimedia.org
> >>>>>>
> >>>>>> You can reach the person managing the list at
> >>>>>> analytics-ow...@lists.wikimedia.org
> >>>>>>
> >>>>>> When replying, please edit your Subject line so it is more specific
> >>>>>> than "Re: Contents of Analytics digest..."
> >>>>>>
> >>>>>>
> >>>>>> Today's Topics:
> >>>>>>
> >>>>>>  1. Re: Page views on a more frequent than hourly basis (Pine W)
> >>>>>>  2. Re: Page views on a more frequent than hourly basis (Oliver
> Keyes)
> >>>>>>
> >>>>>>
> >>>>>>
> --
> >>>>>>
> >>>>>> Message: 1
> >>>>>> Date: Mon, 13 Apr 2015 00:47:31 -0700
> >>>>>> From: Pine W 
> >>>>>> To: "A mailing list for the Analytics Team at WMF and everybody who
> >>>>>> has an interest in Wikipedia and analytics."
> >>>>>> 
> >>>>>> Cc: Bharath Sitaraman 
> >>>>>> Subject: Re: [Analytics] Page views on a more frequent than hourly
> >>>>>> basis
> >>>>>> Message-ID:
> >>>>>>  >
> >>>>>> Content-Type: text/plain; charset="utf-8"
> >>>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> This issue of pageview data granularity has been discussed before,
> and
> >>>>>> the
> >>>>>> answer has been that hourly is the smallest increment allowed to be
> >>>>>> revealed publicly, for privacy reasons.
> >>>>>>
> >>>>>> I believe that the person you will want to discuss your request
> with is
> >>>>>> Toby, who I have cc'd here.
> >>>>>>
> >>>>>> Pine
> >>>>>> On Apr 13, 2015 12:11 AM, "Hirav Gandhi" 
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi Wikimedia Analytics Team,
> >>>>>>
> >>>>>> My colleague Bharath and I are doing research on dynamic server
> >>>>>> allocation
> >>>>>> algorithms and we were looking for a suitable datasets to test our
> >>>>>> predictive algorithm on. We noticed that Wikimedia has an amazing
> data
> >>>>>> set
> >>>>>> of hourly page views, but we were looking for something a bit more
> >>>>>> granular, such as aggregated page requests to English Wikipedia on a
> >>>>>> minute
> >>>>>> by minute basis or second by second basis if possible.
> >>>>>>
> >>>>>> We are more than happy to pour through any raw data you might have
> that
> >>>>>> would help us calculate page requests at this granular level. Please
> >>>>>> let us
> >>>>>> know if it would be possible to get such data and if so how. Thank
> you
> >>>>>> in
> >>>>>> advance for your help.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Hirav Gandhi
> >>>>>> ___
> >>>>>> Analytics mailing list
> >>>>>> Analytics@lists.wikimedia.org
> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>>>>
> >>>>>> -- next part --
> >>>>>> An HTML attachment was scrubbed...
> >>>>>> URL:
> >>>>>>
> >>>>>> <
> https://lists.wikimedia.org/pipermail/analytics/attachments/20150413/a88287b6/attachment-0001.html
> >
> >>>>>>
> >>>>>> --
> >>>>>>
> >>>>>> Message: 2
> >>>>>> Date: Mon, 13 Apr 2015 06:39:45 -0400
> >>>>>> From: Oliver Keyes 
> >>>>>> To: "A mailing list for the Analytics Team at WMF and everybody who
> >>>>>> has an interest in Wikipedia and analytics."
> >>>>>> 
> >>>>>> Cc: Bharath Sitaraman 
> >>>>>> Subject: Re: [Analytics] Page views on a more frequent than hourly
> >>>>>> basis
> >>>>>> Message-ID:
> >>>>>>  >
> >>>>>> Content-Type: text/plain; charset=UTF-8
> >>>>>>
> >>>>>>
> >>>>>> Preeetty sure that Toby is on the analytics list, Pine. He's the
> >>>>>> director of analytics.
> >>>>>>
> >>>>>> Hirav: would you be looking for temporally /and/ contextually
> granular
> >>>>>> pageviews, i.e. "a view to X page at Y time", or just temporally
> >>>>>> granular, so "a view to a page on enwiki at X time"? If the latter
> >>>>>> you've got more of a shot, I suspect.
> >>>>>>
> >>>>>> On 13 April 2015 at 03:47, Pine W  wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> This issue of pageview data granularity has been discussed before,
> and
> >>>>>> the
> >>>>>> answer has been that hourly is the smallest increment allowed to be
> >>>>>> revealed
> >>>>>> publicly, for privacy reasons.
> >>>>>>
> >>>>>> I believe that the person you will want to discuss your request
> with is
> >>>>>> Toby, who I have cc'd here.
> >>>>>>
> >>>>>> Pine
> >>>>>>
> >>>>>> On Apr 13, 2015 12:11 AM, "Hirav Gandhi" 
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Wikimedia Analytics Team,
> >>>>>>
> >>>>>> My colleague Bharath and I are doing research on dynamic server
> >>>>>> allocation
> >>>>>> algorithms and we were looking for a suitable datasets to test our
> >>>>>> predictive algorithm on. We noticed that Wikimedia has an amazing
> data
> >>>>>> set
> >>>>>> of hourly page views, but we were looking for something a bit more
> >>>>>> granular,
> >>>>>> such as aggregated page requests to English Wikipedia on a minute by
> >>>>>> minute
> >>>>>> basis or second by second basis if possible.
> >>>>>>
> >>>>>> We are more than happy to pour through any raw data you might have
> that
> >>>>>> would help us calculate page requests at this granular level. Please
> >>>>>> let us
> >>>>>> know if it would be possible to get such data and if so how. Thank
> you
> >>>>>> in
> >>>>>> advance for your help.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Hirav Gandhi
> >>>>>> ___
> >>>>>> Analytics mailing list
> >>>>>> Analytics@lists.wikimedia.org
> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ___
> >>>>>> Analytics mailing list
> >>>>>> Analytics@lists.wikimedia.org
> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Oliver Keyes
> >>>>>> Research Analyst
> >>>>>> Wikimedia Foundation
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>>
> >>>>>> ___
> >>>>>> Analytics mailing list
> >>>>>> Analytics@lists.wikimedia.org
> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>>>>
> >>>>>>
> >>>>>> End of Analytics Digest, Vol 38, Issue 21
> >>>>>> *
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ___
> >>>>>> Analytics mailing list
> >>>>>> Analytics@lists.wikimedia.org
> >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Oliver Keyes
> >>>>> Research Analyst
> >>>>> Wikimedia Foundation
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Oliver Keyes
> >>>> Research Analyst
> >>>> Wikimedia Foundation
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> ___
> >>>> Analytics mailing list
> >>>> Analytics@lists.wikimedia.org
> >>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>
> >>>
> >>> ___
> >>> Analytics mailing list
> >>> Analytics@lists.wikimedia.org
> >>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>>
> >>
> >>
> >>
> >> --
> >> Oliver Keyes
> >> Research Analyst
> >> Wikimedia Foundation
> >>
> >>
> >>
> >> --
> >>
> >> ___
> >> Analytics mailing list
> >> Analytics@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>
> >>
> >> End of Analytics Digest, Vol 38, Issue 24
> >> *
> >
> >
> > ___
> > Analytics mailing list
> > Analytics@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>



-- 
Dario Taraborelli
Senior Research Scientist, Research and Data Lead
Wikimedia Foundation
http://wikimediafoundation.org
http://nitens.org/taraborelli
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Eventlogging outage

2015-04-08 Thread Dario Taraborelli
to clarify: does this affect all logs or client-side logs only?

> On Apr 8, 2015, at 11:13 AM, Aaron Halfaker  wrote:
> 
> Thanks guys.  As a frequent user of event logging, the data loss and potential 
> backfilling are of great importance to me.  It would be helpful for me if, in 
> the future, these could be summarized in announcement emails.  
> 
> 
> On Wed, Apr 8, 2015 at 12:45 PM, Kevin Leduc  > wrote:
> the data loss and no-backfilling are documented in the incident report 
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150406-EventLogging#Actionables
>  
> 
> 
> On Wed, Apr 8, 2015 at 10:40 AM, Dan Andreescu  > wrote:
> It did cause data loss, and we can not backfill because the disk was full so 
> the logs were not written.
> 
> On Wed, Apr 8, 2015 at 1:37 PM, Aaron Halfaker  > wrote:
> Thanks Nuria.
> 
> Did this cause data loss and if so, is there a plan to backfill?
> 
> -Aaron
> 
> On Wed, Apr 8, 2015 at 12:28 PM, Nuria Ruiz  > wrote:
> Team:
> 
> As you might know, we have swapped EL's old vanadium box for a newer, more 
> resilient one. 
> 
> This new box had less disk space, and the move caused a small outage due to a 
> bug already present in EL code that was not apparent on vanadium. 
> 
> Details can be found here:
> 
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150406-EventLogging
>  
> 
> 
> Thanks, 
> 
> Nuria
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Fwd: [Engineering] Wikimedia REST content API is now available in beta

2015-03-10 Thread Dario Taraborelli
Cross-posting from wikitech-l, this will definitely be of interest to those of 
you on this list who work with our APIs. 

Begin forwarded message:

> From: Gabriel Wicke 
> Date: March 10, 2015 at 15:23:03 PDT
> To: Wikimedia developers , 
> wikitech-ambassd...@lists.wikimedia.org, Development and Operations Engineers 
> , mediawiki-...@lists.wikimedia.org
> Subject: [Engineering] Wikimedia REST content API is now available in beta
> 
> Hello all,
> I am happy to announce the beta release of the Wikimedia REST Content API at 
> https://rest.wikimedia.org/
> Each domain has its own API documentation, which is auto-generated from 
> Swagger API specs. For example, here is the link for the English Wikipedia:
> https://rest.wikimedia.org/en.wikipedia.org/v1/?doc
> At present, this API provides convenient and low-latency access to article 
> HTML, page metadata and content conversions between HTML and wikitext. After 
> extensive testing we are confident that these endpoints are ready for 
> production use, but have marked them as 'unstable' until we have also 
> validated this with production users. You can start writing applications that 
> depend on it now, if you aren't afraid of possible minor changes before 
> transitioning to 'stable' status. For the definition of the terms ‘stable’ 
> and ‘unstable’ see https://www.mediawiki.org/wiki/API_versioning .
> While general and not specific to VisualEditor, the selection of endpoints 
> reflects this release's focus on speeding up VisualEditor. By storing private 
> Parsoid round-trip information separately, we were able to reduce the HTML 
> size by about 40%. This in turn reduces network transfer and processing 
> times, which will make loading and saving with VisualEditor faster. We are 
> also switching from a cache to actual storage, which will eliminate slow 
> VisualEditor loads caused by cache misses. Other users of Parsoid HTML like 
> Flow, HTML dumps, the OCG PDF renderer or Content translation will benefit 
> similarly.
> But, we are not done yet. In the medium term, we plan to further reduce the 
> HTML size by separating out all read-write metadata. This should allow us to 
> use Parsoid HTML with its semantic markup directly for both views and editing 
> without increasing the HTML size over the current output. Combined with 
> performance work in VisualEditor, this has the potential to make switching to 
> visual editing instantaneous and free of any scrolling.
> We are also investigating a sub-page-level edit API for micro-contributions 
> and very fast VisualEditor saves. HTML saves don't necessarily have to wait 
> for the page to re-render from wikitext, which means that we can potentially 
> make them faster than wikitext saves. For this to work we'll need to minimize 
> network transfer and processing time on both client and server.
> More generally, this API is intended to be the beginning of a multi-purpose 
> content API. Its implementation (RESTBase) is driven by a declarative Swagger 
> API specification, which helps to make it straightforward to extend the API 
> with new entry points. The same API spec is also used to auto-generate the 
> aforementioned sandbox environment, complete with handy "try it" buttons. So, 
> please give it a try and let us know what you think!
> This API is currently unmetered; we recommend that users not perform more 
> than 200 requests per second and may implement limitations if necessary.
> I also want to use this opportunity to thank all contributors who made this 
> possible:
> - Marko Obrovac, Eric Evans, James Douglas and Hardik Juneja on the Services 
> team worked hard to build RESTBase, and to make it as extensible and clean as 
> it is now.
> - Filippo Giunchedi, Alex Kosiaris, Andrew Otto, Faidon Liambotis, Rob 
> Halsell and Mark Bergsma helped to procure and set up the Cassandra storage 
> cluster backing this API.
> - The Parsoid team with Subbu Sastry, Arlo Breault, C. Scott Ananian and Marc 
> Ordinas i Llopis is solving the extremely difficult task of converting 
> between wikitext and HTML, and built a new API that lets us retrieve and pass 
> in metadata separately.
> - On the MediaWiki core team, Brad Jorsch quickly created a minimal 
> authorization API that will let us support private wikis, and Aaron Schulz, 
> Alex Monk and Ori Livneh built and extended the VirtualRestService that lets 
> VisualEditor and MediaWiki in general easily access external services.
> 
> We welcome your feedback here: https://www.mediawiki.org/wiki/Talk:RESTBase - 
> and in Phabricator.
> 
> Sincerely --
> Gabriel Wicke
> Principal Software Engineer, Wikimedia Foundation
> ___
> Engineering mailing list
> engineer...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/engineering
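
For those who want to poke at the beta from a script, a minimal sketch using
Python's requests library. The page/html path follows the auto-generated
per-domain docs linked above, but treat the exact endpoint as an assumption of
this sketch (and stay well under the 200 requests/second recommendation):

    # Sketch: fetch Parsoid HTML for one article from the beta REST API.
    # The page/html endpoint path is taken from the per-domain docs
    # referenced above; treat it as an assumption of this sketch.
    import requests

    BASE = "https://rest.wikimedia.org/en.wikipedia.org/v1"
    resp = requests.get(BASE + "/page/html/Zurich", timeout=10)
    resp.raise_for_status()
    print(resp.headers.get("content-type"))
    print(resp.text[:300])  # first few hundred bytes of article HTML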
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Provenance Params

2015-03-10 Thread Dario Taraborelli
On Mar 10, 2015, at 11:26 AM, Adam Baso  wrote:
> 
> We're going to use the following format:
> 
> ?wprov=<3_char_feature>
> 
> For the first version on iOS, this will be
> 
> ?wprov=safi1
> 
> And Android:
> 
> ?wprov=safa1

Thanks for closing the loop on this. Dan, Adam – have you guys considered 
tagging the type of “share”? I expect “image shares” will have higher 
engagement/click-through than “text shares”; if that’s a data point you want to 
collect explicitly, you’ll want to pass a different value to 3_char_feature 
(assuming that’s possible).

Is the new parameter going to be in the next beta build?

Dario
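
For reference, a quick sketch of attaching and reading the parameter; the
decomposition of "safi1" into feature/platform/version below is my reading of
this thread, not a documented spec:

    # Sketch: attach and read a wprov provenance parameter. The
    # decomposition of "safi1" (share-a-fact, iOS, version 1) is an
    # assumption based on this thread, not a documented spec.
    from urllib.parse import urlencode, urlparse, parse_qs

    url = ("https://en.wikipedia.org/wiki/Barack_Obama?"
           + urlencode({"wprov": "safi1"}))
    wprov = parse_qs(urlparse(url).query).get("wprov", [""])[0]
    print(wprov)      # safi1
    print(wprov[:3])  # saf -> feature code
    print(wprov[3])   # i   -> platform
    print(wprov[4:])  # 1   -> version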

> 
> On Mon, Mar 9, 2015 at 1:39 PM, Adam Baso  > wrote:
> Okay, we'll plan on wprov.
> 
> On Wed, Mar 4, 2015 at 12:44 PM, Dan Garry  > wrote:
> Works for me.
> 
> Dan
> 
> On 4 March 2015 at 12:33, Adam Baso  > wrote:
> How about 'wprov'?
> 
> On Wed, Mar 4, 2015 at 12:29 PM, Dan Garry  > wrote:
> I'd really rather this be either something that's totally not understandable 
> by the user (e.g. ?saf=1), or something that is clearly understandable (e.g. 
> ?appshareafact=1).
> 
> Dan
> 
> On 4 March 2015 at 12:26, Adam Baso  > wrote:
> Ha! I'm cool with 'provenance' if no one objects.
> 
> On Wed, Mar 4, 2015 at 11:25 AM, Andrew Otto  > wrote:
> Oof, only that it is ugly! :)
> 
> Can you just call it ‘provenance', or are you trying to be more future proof?
> 
> 
> 
> 
> 
>> On Mar 4, 2015, at 12:11, Adam Baso > > wrote:
>> 
>> I pinged on Phabricator at https://phabricator.wikimedia.org/T90606 
>>  about modeling after that patch. 
>> That sort of approach should avoid cache fragmentation.
>> 
>> As for parameter name, 'wmfxan' is short and I think would avoid collisions. 
>> Any problems with this parameter name?
>> 
>> -Adam
>> 
>> 
>> On Thu, Feb 26, 2015 at 8:27 AM, Nuria Ruiz > > wrote:
>> Ping ... (regarding cache question)
>> 
>> On Tue, Feb 24, 2015 at 5:18 PM, Gergo Tisza > > wrote:
>> On Tue, Feb 24, 2015 at 3:48 PM, Nuria Ruiz > > wrote:
>> 2. What about caching? 
>> Is this page: http://wikipedia.org/BarackObama?some_param=some-value 
>>  being served from 
>> the cache as it should be?
>> 
>> The file download parameter was handled via this patch: 
>> https://gerrit.wikimedia.org/r/#/c/120617/ 
>>  
>> Seems like an analogous scenario.
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org 
>> https://lists.wikimedia.org/mailman/listinfo/analytics 
>> 
>> 
>> 
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org 
>> https://lists.wikimedia.org/mailman/listinfo/analytics 
>> 
>> 
>> 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> 
> -- 
> Dan Garry
> Associate Product Manager, Mobile Apps
> Wikimedia Foundation
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> 
> -- 
> Dan Garry
> Associate Product Manager, Mobile Apps
> Wikimedia Foundation
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Technical] missing dialect subdomains in the new pageviews definition

2015-03-09 Thread Dario Taraborelli
thanks, Oliver (and James for spotting this).

> On Mar 9, 2015, at 2:30 PM, Oliver Keyes  wrote:
> 
> Now logged in Phabricator at https://phabricator.wikimedia.org/T92020
> 
> On 9 March 2015 at 16:24, Oliver Keyes  wrote:
>> Bah; folder names, rather than subdomains.
>> 
>> On 9 March 2015 at 16:24, Oliver Keyes  wrote:
>>> Hey all,
>>> 
>>> One of the big improvements of the new definition over the old one is
>>> that the old one is not limited to /wiki/. It includes all of the
>>> chinese and serbian dialects that have their own folder names and were
>>> not appearing, as a result, in the old pageview counts.
>>> 
>>> James F (thanks James!) recently pointed out to me that there are
>>> other wikis that do this - see the list at
>>> https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems#With_Automatic_Conversion_System
>>> . These need to be factored into the new pageviews definition to avoid
>>> culturally and nationally biased undercounting.
>>> 
>>> Have fun,
>>> 
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>> 
>> 
>> 
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
> 
> 
> 
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Discussion] User agent data releases

2015-03-05 Thread Dario Taraborelli
heads up that after a review with Legal we decided that we should not release 
the sampled raw dataset. Oliver is now working on making parsed UA data 
available.

> On Mar 5, 2015, at 10:52 AM, Oliver Keyes  wrote:
> 
> Just a clarifying note: Dario still needs to review the actual
> methodology. While Legal have approved it from their end, they've also
> made clear that this is contingent on the anonymisation methodology
> passing muster from an R&D point of view.
> 
> On 5 March 2015 at 12:39, Oliver Keyes  wrote:
>> Just an FYI that Legal have approved this release under the
>> anonymisation procedures we've set out (thanks Michelle!) on the
>> condition that Dario, too, is comfortable with them. Dario?
>> 
>> On 4 March 2015 at 17:16, Oliver Keyes  wrote:
>>> So it's distinct people, globally - and I deliberately made it woolly
>>> by operating over username, which means the threshold is fuzzy
>>> (i.e., at a minimum it's 50. At a maximum it's 50x[number of wikis]).
>>> 
>>> It's very deliberately dimension-free: user_agent,
>>> edit_count_in_non_specified_90_day_period, and that's it.
>>> 
>>> On 4 March 2015 at 17:12, Aaron Halfaker  wrote:
 Assuming this was public, I could use this data on seldom edited Wikis to
 find out which editors likely have old browser/OS versions with
 vulnerabilities that I could attack[1].  This would be easier and easier 
 the
 more dimensions you add to the data.
 
 
 
 OK.  The anonymization strategy for dropping records that represent < 50
 distinct editors seems to address this concern.   50 edits is a lot.  So
 this data wouldn't be too terribly useful for under-active wikis.  Then
 again, if you just want to a sense for what the dominant browser/OS pairs
 are, then they will likely represent > 50 unique editors on most projects.
 
 1. Props to Matt Flaschen and Dan Andreescu for helping me work through the
 implications of that one.
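
A minimal sketch of that suppression rule, assuming a simple per-edit table of
(username, user_agent); the file and column names are illustrative:

    # Sketch of the threshold rule discussed above: only release
    # user-agent rows backed by at least 50 distinct editors.
    # Input columns (username, user_agent) are illustrative.
    import pandas as pd

    edits = pd.read_csv("editor_agents.tsv", sep="\t")
    by_agent = edits.groupby("user_agent")["username"].nunique()
    releasable = by_agent[by_agent >= 50]
    print(releasable.sort_values(ascending=False).head())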
 
 On Tue, Mar 3, 2015 at 9:59 PM, Oliver Keyes  wrote:
> 
> Yeah, makes sense.
> 
> On 3 March 2015 at 20:38, Nuria Ruiz  wrote:
>>> Agreed. Do we have a way of syncing files to Labs yet?
>> No need to sync if file is available in an endpoint like
>> htpp://some-data-here
>> 
>> On Tue, Mar 3, 2015 at 4:50 PM, Oliver Keyes 
>> wrote:
>>> 
>>> On 3 March 2015 at 19:35, Nuria Ruiz  wrote:
> Erik has asked me to write an exploratory app for user-agent data.
> The
> idea is to enable Product Managers and engineers to easily explore
> what users use so they know what to support. I've thrown up an
> example
> screenshot at http://ironholds.org/agents_example_screen.png
 
 I cannot speak as to the interest of the community in this data, but for
 developers and PMs we should make sure we have a solid way to update any data
 we put up. User Agent data is outdated as soon as a new version of Android or
 iOS is released, a new popular phone comes along, or a new autoupdate ships
 for popular browsers. Not only that: if we make changes to, say, redirect all
 iPad users to the desktop site, we want to assess the effect of those changes
 as soon as possible. A monthly update will be a must. Also, distinguishing
 between browser percentages on the desktop site versus the mobile site versus
 apps is a must for this data to be really useful for PMs and developers
 (especially for bug triage).
 
>>> 
>>> Yes! However, I am addressing a specific ad-hoc request. If there is a
>>> need for this (I agree there is) I hope Toby and Kevin can eke out the
>>> time on the Analytics Engineering schedule to work on it; y'all are a
>>> lot better at infrastructure work than me :).
>>> 
 
 We have a couple of backlog items to make monthly reports in this regard. A
 UI on top of them will be superb.
 
>>> 
>>> Agreed. Do we have a way of syncing files to Labs yet? That's the
>>> biggest blocker. The UI doesn't care what the file contains as long as
>>> it's a TSV with a header row - I've deliberately built it so that
>>> things like the download links are dynamic and can change.
>>> 
 
 
 
 
 On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes 
 wrote:
> 
> Hey all,
> 
> (Sending this to the public list because it's more transparent and
> I'd
> like people who think this data is useful to be able to shout out)
> 
> Erik has asked me to write an exploratory app for user-agent data.
> The
> idea is to enable Product Managers and engineers to easily explore
> what users use so they know what to support. I've thr

Re: [Analytics] [Announce] new Pageviews definition complete and implemented

2015-03-04 Thread Dario Taraborelli
very exciting, thanks Oliver and everyone else involved in this.

Just a note to clarify this point:

> when the data begins coming out through stats.wikimedia.org and elsewhere, 
> you can expect to see a substantial drop in traffic. 

there won’t be any sudden change in traffic data in the existing reports, and we 
need to figure out how to make the transition to the new definition as graceful 
as possible. We will publish a detailed FAQ on the change whenever it becomes 
operational.

Dario

> On Mar 4, 2015, at 10:20 AM, Oliver Keyes  wrote:
> 
> Hey all,
> 
> I'm very pleased to announce that the new pageviews definition is (1)
> complete and (2) implemented. Prominent features include:
> 
> 1. A removal of the per-project double-counting due to banners;
> 2. The removal of meta over-over-OVER-counting due to EventLogging;
> 3. The inclusion of Mobile App traffic;
> 4. The inclusion of projects with non-standard URL schemes.
> 
> What this means in practice is that when the data begins coming out
> through stats.wikimedia.org and elsewhere, you can expect to see a
> substantial drop in reported traffic. This is not an actual drop in
> traffic; it is a correction for the massive inaccuracies in the existing
> definition, which are causing an artificial /rise/.
> 
> So, what's next? Well, the Analytics Engineering team has to implement
> the functionality on a regularly running job to get the data released
> on a consistent basis. We also need to split out per-article pageviews
> and do some tagging to provide granular reports - see
> https://meta.wikimedia.org/wiki/Research:Page_view#Future_work . But
> the core definition is complete.
> 
> Huge thanks to Andrew Otto, Christian, Nuria, Aaron and Bob West for
> their contributions to this project.
> 
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics


___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] page views by location

2015-03-02 Thread Dario Taraborelli
unfortunately not. The proposal hasn’t been cleared yet and we don’t have an 
ETA for its launch.

> On Mar 2, 2015, at 9:53 AM, Seth Stephens-Davidowitz 
>  wrote:
> 
> Thanks. Do you know when that might be available?
> 
> Seth
> 
> On Mon, Mar 2, 2015 at 12:52 PM, Dario Taraborelli  wrote:
> Seth, check out this proposal submitted by a team at Los Alamos National 
> Laboratory: 
> https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
> 
>> On Mar 2, 2015, at 9:47 AM, Toby Negrin  wrote:
>> 
>> Hi Seth -- we're currently working to provide geo-located page views with a 
>> privacy acceptable level of aggregation. We don't currently have an ETA. I'm 
>> cc'ing the public analytics list for more information.
>> 
>> Best,
>> 
>> -Toby
>> 
>> On Mon, Mar 2, 2015 at 9:41 AM, Seth Stephens-Davidowitz  wrote:
>> Dear Toby,
>> Domas Mituzas suggested I contact you.  I am looking for data on page views 
>> by location.  I only am able to find total page views. But it is not broken 
>> down by location. Does this data exist anywhere?
>> 
>> Thanks so much,
>> Seth
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> 

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] page views by location

2015-03-02 Thread Dario Taraborelli
Seth, check out this proposal submitted by a team at Los Alamos National 
Laboratory: 
https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews 


> On Mar 2, 2015, at 9:47 AM, Toby Negrin  wrote:
> 
> Hi Seth -- we're currently working to provide geo-located page views with a 
> privacy acceptable level of aggregation. We don't currently have an ETA. I'm 
> cc'ing the public analytics list for more information.
> 
> Best,
> 
> -Toby
> 
> On Mon, Mar 2, 2015 at 9:41 AM, Seth Stephens-Davidowitz  wrote:
> Dear Toby,
> Domas Mituzas suggested I contact you.  I am looking for data on page views 
> by location.  I only am able to find total page views. But it is not broken 
> down by location. Does this data exist anywhere?
> 
> Thanks so much,
> Seth
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Provenance Params

2015-02-24 Thread Dario Taraborelli
it sounds like we have consensus for a short-term solution based on a vanilla 
parameter, as long as it doesn’t clash with other internal parameters. I agree 
with Gergo that a shortener is appealing as a long-term solution: it is what 
the vast majority of platforms use for analytics purposes, and it has the 
added benefit of addressing the impact of referrer information being 
stripped for HTTPS requests. If there’s no other objection, we can safely fold 
this under the discussion of long-term options and go ahead with the proposed 
implementation, per Dan.

Thanks, everybody.

> On Feb 24, 2015, at 11:56 AM, Gergo Tisza  wrote:
> 
> On Tue, Feb 24, 2015 at 9:48 AM, Adam Baso  > wrote:
> Hi Nemo - I think the concern was that it might be the case that the 'title' 
> parameter may be at the end of the URL, and the 'title' parameter could in 
> principle support a value with forward slashes potentially indistinguishable 
> from the string in option #2. Of course, regular expressions can make 
> anything possible in theory :) Anybody else able to explain further on the 
> title schema risk?
> 
> Well, it doesn't work. Not sure I'd call that a risk though :-)
> How did that even come up? Why not use an ampersand instead of a forward 
> slash? Ampersands have a well-defined meaning in the query part of the URL, 
> while slashes don't.
> 
> Personally, I would favor the URL shortener. It is a useful feature on its 
> own, good for branding (if you don't shorten, many sites will shorten for you 
> using their own schema, which results in nondescript URLs), you get nice URLs 
> (in the short URL you can just factor the parameters into the shortened part, 
> in the full URL you don't need them because the user has been counted 
> already), you get less cache fragmentation (even if you remove the parameter 
> in Varnish, you'll still fragment the client cache). On the negative side, 
> it's one more request so clicking through becomes somewhat slower.
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] s1-analytics-slave

2015-02-17 Thread Dario Taraborelli
Hi Sean, no objection on my end either. I’ll have to update a bunch of scripts 
that populate the EE dashboards [1] but it’s no big deal as long as we clearly 
communicate the ETA.

[1] http://ee-dashboard.wmflabs.org/dashboards/enwiki-metrics 


> On Feb 15, 2015, at 7:43 PM, Sean Pringle  wrote:
> 
> So, bump :-)
> 
> - A week's notice would be needed for Halfak to vacate s1-analytics-slave, 
> and presumably others could meet the same target. Or make it a month since 
> there is no desperate rush.
> 
> - Geowiki needs some coordination for the switchover to analytics-store 
> staging db.
> 
> Anything else?
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Wikipedia aggregate clickstream data released

2015-02-17 Thread Dario Taraborelli
We’re glad to announce the release of an aggregate clickstream dataset 
extracted from English Wikipedia

http://dx.doi.org/10.6084/m9.figshare.1305770 


This dataset contains counts of (referer, article) pairs aggregated from the 
HTTP request logs of English Wikipedia. This snapshot captures 22 million 
(referer, article) pairs from a total of 4 billion requests collected during 
the month of January 2015.

This data can be used for various purposes:
• determining the most frequent links people click on for a given 
article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a 
link in that article
• generating a Markov chain over English Wikipedia
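
For example, a minimal sketch of the second use case, assuming the TSV exposes
prev/curr/n columns (check the figshare record for the actual file name and
schema):

    # Sketch: the most common links people followed to one article.
    # Column names (prev, curr, n) are assumptions about the TSV layout;
    # check the figshare record for the actual schema.
    import pandas as pd

    clicks = pd.read_csv("2015_01_clickstream.tsv", sep="\t")
    into_obama = clicks[clicks["curr"] == "Barack_Obama"]
    print(into_obama.sort_values("n", ascending=False).head(10)[["prev", "n"]])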

We created a page on Meta for feedback and discussion about this release: 
https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream 


Ellery and Dario
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] February 2015 Research Showcase: Global South survey results; data imports in OpenStreetMap

2015-02-11 Thread Dario Taraborelli
I am thrilled to announce our speaker lineup for this month’s research showcase.

Our own Haitham Shammaa will present results from the Global South survey. We 
also invited Stamen’s Alan McConchie, an OpenStreetMap expert, to talk about 
the challenges the OSM community is facing with external data imports.

The showcase will be recorded and publicly streamed at 11.30 PT on Wednesday, 
February 18 (livestream link will follow). We’ll hold a discussion and take 
questions from remote participants via the Wikimedia Research IRC channel 
(#wikimedia-research  
on freenode).

Looking forward to seeing you there.

Dario


Global South User Survey 2014
By Haitham Shammaa 
Users' trends in the Global South have significantly changed over the past two 
years, and given the increase in interest in Global South communities and their 
activities, we wanted this survey to focus on understanding the statistics and 
needs of our users (both readers, and editors) in the regions listed in the 
WMF's New Global South Strategy 
. This 
survey aims to provide a better understanding of the specific needs of local 
user communities in the Global South, as well as provide data that supports 
product and program development decision making process.

Ingesting Open Geodata: Observations from OpenStreetMap
By Alan McConchie 
As Wikidata grapples with the challenges of ingesting external data sources 
such as Freebase, what lessons can we learn from other open knowledge projects 
that have had similar experiences? OpenStreetMap, often called "The Wikipedia 
of Maps", is a crowdsourced geospatial data project covering the entire world. 
Since the earliest years of the project, OSM has combined user contributions 
with existing data imported from external sources. Within the OSM community, 
these imports have been controversial; some core OSM contributors complain that 
imported data is lower quality than user-contributed data, or that it 
discourages the growth of local mapping communities. In this talk, I'll review 
the history of data imports in OSM, and describe how OSM's best-practices have 
evolved over time in response to these critiques.

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Client-side URL redirects

2015-02-03 Thread Dario Taraborelli
Reporting here the results of some quick investigation we did on MediaWiki’s 
handling of redirects. Since this change [1] got merged, page redirects (such 
as en:Obama => en:Barack_Obama) refresh the URL client-side via JavaScript. 
This doesn’t result in an extra HTTP request so the change should have no 
impact on pageview analysis based on the request logs. Thanks to Legoktm, 
Emufarmers and Halfak for tracking the source of this change.

Dario

[1] https://gerrit.wikimedia.org/r/#/c/143852/
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Scholarly citations by PMID/PMCID in Wikipedia

2015-02-02 Thread Dario Taraborelli
Hey all,

we just released a dataset of scholarly citations in the English Wikipedia by 
PubMed / PubMed Central ID. 

http://dx.doi.org/10.6084/m9.figshare.1299540

The dataset currently includes the first known occurrence of a PMID or PMCID 
citation in an English Wikipedia article and the associated revision metadata, 
based on the most recent complete content dump of English Wikipedia. We’re 
planning on expanding this dataset to include other types of scholarly 
identifier soon.
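
As a quick example of what the dataset supports, a sketch counting first
citations by the year of the revision that introduced them; the file and
column names are assumptions, so check the figshare record for the actual
schema:

    # Sketch: count first PMID/PMCID citations by year of the revision
    # that introduced them. The file and column names are assumptions;
    # check the figshare record for the actual schema.
    import pandas as pd

    cites = pd.read_csv("enwiki_pmid_citations.tsv", sep="\t",
                        parse_dates=["rev_timestamp"])
    print(cites["rev_timestamp"].dt.year.value_counts().sort_index())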

Feel free to share this with anyone interested or spread the word via: 
https://twitter.com/WikiResearch/status/562422538613956608

Dario and Aaron
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Early registration for CSCW 2015 ends January 30th

2015-01-27 Thread Dario Taraborelli
For those of you interested in attending, the early registration deadline is 
January 30.
See also https://meta.wikimedia.org/wiki/Research:CSCW_2015 


— — — 

CSCW 2015 | March 14-18 | Vancouver, BC, Canada
http://cscw.acm.org 

* Early registration ends January 30th.
* Advance program is available at http://cscw.acm.org/2015/program/ 

* Conference hotel rooms are already selling out at $135/night

The 18th ACM Conference on Computer Supported Cooperative Work and
Social Computing (CSCW 2015) will be held March 14-18 in Vancouver, BC,
Canada and is co-located with ACM Learning at Scale.

CSCW is the premier venue for presenting research in the design and use
of technologies that affect groups, organizations, communities, and
networks. Bringing together top researchers and practitioners from
academia and industry in the area of social computing, CSCW encompasses
both the technical and social challenges encountered when supporting
collaboration.

Jeff Hancock from Cornell University will give the opening keynote
address discussing "The Facebook Study: A Personal Account of Data
Science, Ethics and Change."

The closing keynote speaker will be Zeynep Tufekci from University of
North Carolina Chapel Hill, speaking on "Algorithms in our Midst:
Information, Power and Choice when Software is Everywhere."

We are also pleased to announce that Wanda Orlikowski will receive the
CSCW 2015 Lasting Impact Award and present a retrospective on her
groundbreaking 1992 paper, "Learning from Notes: organizational issues
in groupware implementation."

Registration is available at:
http://cscw.acm.org/2015/attend/registration.php 

Registration questions? Ask Yvonne Lopez
Hotel reservations can be made at:
http://cscw.acm.org/2015/attend/hotel.php 


Use #CSCW2015 and follow us at http://twitter.com/ACM_CSCW or
http://www.facebook.com/acmCSCW for updates.

Conference Co-chairs
Dan Cosley, Cornell University
Andrea Forte, Drexel University
chairs2...@cscw.acm.org
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Wikimedia referrer policy

2015-01-20 Thread Dario Taraborelli
I’ve been discussing with the folks at CrossRef (the largest registry of 
Digital Object Identifiers, think of it as the ICANN of science) how to 
accurately measure the impact of traffic driven from Wikipedia/Wikimedia to 
scholarly resources. 

While digging into their data, we realized that since Wikimedia started the 
HTTPS switchover, and an increasing portion of inbound traffic happens over 
SSL, Wikimedia sites may have stopped advertising themselves as sources of 
referred traffic to external sites. While this is expected behavior under 
HTTPS (browsers omit the referrer when navigating from a secure page to an 
insecure one), it means that Wikimedia's impact on traffic directed to other 
sites is becoming largely invisible and Wikimedia might be turning into a 
large source of dark traffic.
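
To make the measurement problem concrete, here is a rough sketch (assuming a 
hypothetical log format) of how a downstream site like CrossRef might classify 
inbound requests; without a referrer policy on our side, referrer-stripped 
Wikimedia visits all fall into the "dark" bucket:

from urllib.parse import urlparse

WIKIMEDIA_SUFFIXES = (".wikipedia.org", ".wikimedia.org", ".wikidata.org")

def classify(referrer):
    # referrer is the raw Referer header value, or None/"-" when absent
    if not referrer or referrer == "-":
        return "dark"  # no referrer advertised
    host = urlparse(referrer).netloc
    if host.endswith(WIKIMEDIA_SUFFIXES):
        return "wikimedia"
    return "other"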

I wrote a proposal reviewing the CrossRef use case and discussing how other top 
web properties deal with this issue by adopting a so-called "Referrer Policy”: 

https://meta.wikimedia.org/wiki/Research:Wikimedia_referrer_policy 


Feedback is welcome on the talk page: 

https://meta.wikimedia.org/wiki/Research_talk:Wikimedia_referrer_policy 


Dario___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] DNT, standards, and expectations

2015-01-16 Thread Dario Taraborelli
I’m searching for references looking at user perception of third-party 
behavioral tracking vs logging, any pointer would be appreciated. 

> On Jan 16, 2015, at 8:16 PM, Dario Taraborelli  
> wrote:
> 
> I didn’t reference the McDonald study in my reply, but I too am not 
> particularly persuaded by the conclusions. 
> 
> “Many think it means they will not be tracked at all, including collection” 
> 
> suggests to me a fundamental lack of literacy among the users surveyed about 
> what data that browsers pass with HTTP requests.
> 
>> On Jan 16, 2015, at 7:54 PM, Dario Taraborelli > <mailto:da...@wikimedia.org>> wrote:
>> 
>> Ori,
>> 
>>> we are making use of the header that we think is consistent with the 
>>> expectation of users
>> 
>> based on what evidence?
>> 
>> I’ve seen a single reference cited in this thread pointing to a study that 
>> candidly declares in its abstract:
>> 
>> “Because Do Not Track is so new, as far as we know this is the first 
>> scholarship on this topic. This paper has been neither presented nor 
>> published. “ [1]
>> 
>> The ample and representative sample considered by the EFF is well captured 
>> at the beginning of this statement:
>> 
>> “Intuitively, users who we’ve talked to want Do Not Track to provide 
>> meaningful limits on collection and retention of data.” 
>> 
>> Nobody is questioning the need to be transparent to our users about what 
>> data we’re collecting, how long this data is retained and what it’s being 
>> used for. But I see a thread full of handwaving statements about “what users 
>> really want”, in contrast to a pretty straightforward truth that nobody who 
>> participated in this thread would challenge: 
>> 
>>> which departs from the standard in a significant way.
>> 
>> 
>> I don’t see myself blessing a proposal that represents “a significant 
>> departure from the standard” and I’d love to see more substantial evidence 
>> on user expectations to justify this. 
>> 
>> Dario
>> 
>> [1] http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1993133 
>> <http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1993133>

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] DNT, standards, and expectations

2015-01-16 Thread Dario Taraborelli
I didn’t reference the McDonald study in my reply, but I too am not 
particularly persuaded by the conclusions. 

“Many think it means they will not be tracked at all, including collection” 

suggests to me a fundamental lack of literacy among the users surveyed about 
what data that browsers pass with HTTP requests.

> On Jan 16, 2015, at 7:54 PM, Dario Taraborelli  wrote:
> 
> Ori,
> 
>> we are making use of the header that we think is consistent with the 
>> expectation of users
> 
> based on what evidence?
> 
> I’ve seen a single reference cited in this thread pointing to a study that 
> candidly declares in its abstract:
> 
> “Because Do Not Track is so new, as far as we know this is the first 
> scholarship on this topic. This paper has been neither presented nor 
> published. “ [1]
> 
> The ample and representative sample considered by the EFF is well captured at 
> the beginning of this statement:
> 
> “Intuitively, users who we’ve talked to want Do Not Track to provide 
> meaningful limits on collection and retention of data.” 
> 
> Nobody is questioning the need to be transparent to our users about what data 
> we’re collecting, how long this data is retained and what it’s being used 
> for. But I see a thread full of handwaving statements about “what users 
> really want”, in contrast to a pretty straightforward truth that nobody who 
> participated in this thread would challenge: 
> 
>> which departs from the standard in a significant way.
> 
> 
> I don’t see myself blessing a proposal that represents “a significant 
> departure from the standard” and I’d love to see more substantial evidence on 
> user expectations to justify this. 
> 
> Dario
> 
> [1] http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1993133 
> <http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1993133>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] DNT, standards, and expectations

2015-01-16 Thread Dario Taraborelli
Ori,

> we are making use of the header that we think is consistent with the 
> expectation of users

based on what evidence?

I’ve seen a single reference cited in this thread pointing to a study that 
candidly declares in its abstract:

“Because Do Not Track is so new, as far as we know this is the first 
scholarship on this topic. This paper has been neither presented nor 
published.” [1]

The ample and representative sample considered by the EFF is well captured at 
the beginning of this statement:

“Intuitively, users who we’ve talked to want Do Not Track to provide meaningful 
limits on collection and retention of data.” 

Nobody is questioning the need to be transparent to our users about what data 
we’re collecting, how long this data is retained and what it’s being used for. 
But I see a thread full of handwaving statements about “what users really 
want”, in contrast to a pretty straightforward truth that nobody who 
participated in this thread would challenge: 

> which departs from the standard in a significant way.


I don’t see myself blessing a proposal that represents “a significant departure 
from the standard” and I’d love to see more substantial evidence on user 
expectations to justify this. 

Dario

[1] http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1993133 
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] DNT, standards, and expectations

2015-01-16 Thread Dario Taraborelli
I second Aaron’s concerns, which I previously expressed during the consultation 
about the new privacy policy. My main objection to the proposed solution is 
that by saying “Wikimedia honors DNT headers” we imply, under the most 
popular/de facto interpretation of DNT, that we do third-party tracking but 
allow users to opt out, which puts WMF on par with the aggressive tracking 
practices adopted by most sites. 

I’d rather focus on a clean and transparent implementation of an opt-out 
mechanism that doesn’t create confusion and gives users a clear understanding 
of what they are opting out of, instead of piggybacking on DNT.

I too am worried about the impact of excluding a segment of the user 
population from (aggregate) measurements that we obtain via instrumentation 
and use to assess the impact of Product changes, but I’m ready to push the 
discussion of what is an acceptable tradeoff to our customers (the community 
and decision-makers at WMF). It’s also worth remembering that all data 
collected via EventLogging that contains PII, such as IP addresses or raw 
UserAgents, is subject to our data retention guidelines. [1]
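
For what acting on a strict interpretation could look like in practice, here 
is a minimal sketch (hypothetical code, not the actual EventLogging 
implementation) of dropping events for users who send DNT: 1:

def should_log(request_headers):
    # Honor Do Not Track under the strict interpretation: any user sending
    # "DNT: 1" is excluded from event collection entirely.
    return request_headers.get("DNT") != "1"

def handle_event(request_headers, event, writer):
    # writer is any downstream sink (file, database, ...)
    if should_log(request_headers):
        writer(event)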

Dario

[1] https://meta.wikimedia.org/wiki/Data_retention_guidelines 


> On Jan 16, 2015, at 1:29 PM, Aaron Halfaker  wrote:
> 
> Ori,
> 
> I agree on all points.  My assertions are these:
> * DNT means 3rd party tracking.  It's in the definition.
> * However, we'd like to have a strict interpretation and act beyond the 
> definition.  This empowers our users and sets a good precedent. 
> * The categorical exclusion of a substantial set of our users from field 
> studies is concerning and can cause problems.
> * Though Nuria pointed out that DNT/IE10 is not the only potential categorical 
> exclusion, that does not reduce the problem.  If we can confirm that this 
> won't cause a substantial issue or implement a strategy to make sure it does 
> not, then this won't be a problem.
> 
> -Aaron
> 
> On Fri, Jan 16, 2015 at 1:42 PM, Ori Livneh  > wrote:
> 
> 
> On Thu, Jan 15, 2015 at 9:55 PM, Aaron Halfaker  > wrote:
> What I find concerning is the idea that a biased subset of our users would be 
> categorically ignored for this type of evaluation.  If you agree with me that 
> such evaluation is valuable to our users, I think you ought to also find such 
> categorical exclusions concerning
> 
> (In the e-mail below I sometimes use "we" to mean "Wikimedians" and sometimes 
> to mean "Wikimedia Foundation employees". I am aware that this is a public 
> discussion and that not all participants are employees of the Foundation. 
> Hopefully the context will make my meaning clear.)
> 
> Aaron's point is valid. If we collect any data at all, we are morally 
> obligated to do so in a way that can actually support rigorous research on 
> questions of broad value to the community and humanity as a whole. Collecting 
> data in a manner that we know cannot support serious research is morally 
> obnoxious and it invalidates the mandate we claim to collect any data at all.
> 
> That said, I am not convinced that adopting a strong interpretation of DNT 
> (and acting on it) would substantially compromise our ability to do research. 
> The bias that it potentially introduces is of comparable magnitude to the 
> risks of bias that scientists routinely accept in the interest of meeting 
> ethical standards and respecting the rights of individuals. The fact that 
> participation in drug trials is voluntary and that the compensation (when 
> there is any) is usually fixed at a set amount is a good example.
> 
> I also think that our ability to conduct research would be compromised far 
> more substantially were we to lose the confidence of our users. The only hope 
> we have of gaining an understanding of Wikimedia is (in my opinion) through 
> peer collaboration with our community. The question of whether we (Foundation 
> employees) will be able to support a broad community of inquiry has much 
> higher stakes than whether or not our data is fully representative of all 
> user-agents.
> 
> The fact that there is no strong legal requirement forcing our hand here and 
> that weaker interpretations of the header are defensible and plausible means 
> that there is an opportunity here to lead by example and to send a strong 
> message to our community and to the internet at large about our values and 
> our commitment to our users. It's an opportunity I think we should take.
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

[Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

2015-01-12 Thread Dario Taraborelli
I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los 
Alamos National Laboratory recently submitted to the Wikimedia Analytics Team 
aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data 
dumps and making them available to the public and the research community. [1] 

Reid and his team spearheaded the use of the public Wikipedia pageview dumps to 
monitor and forecast the spread of influenza and other diseases, using language 
as a proxy for location. [2] This proposal describes an aggregation strategy 
adding a geographical dimension to the existing dumps.
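
As a toy illustration of the general idea (the proposal itself spells out the 
actual strategy and thresholds), geo-aggregation with a minimum-count cutoff 
might look like this in Python:

from collections import Counter

K = 10  # illustrative suppression threshold, not the one proposed

def aggregate(requests):
    # requests yields hypothetical (country, article, hour) tuples derived
    # from the raw logs; cells with fewer than K views are suppressed so
    # that small, potentially identifying counts are never published.
    counts = Counter(requests)
    return {cell: n for cell, n in counts.items() if n >= K}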

Feedback on the proposal is welcome on the lists or on the project talk page 
on Meta. [3]

Dario

[1] 
https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] 
https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] January 2015 Wikimedia Research Showcase: Felipe Ortega and Benjamin Mako Hill

2015-01-12 Thread Dario Taraborelli
The upcoming Wikimedia Research showcase (Wednesday January 14, 11.30 PT) will 
host two guest speakers: Felipe Ortega (University of Madrid) and Benjamin 
Mako Hill (University of Washington). 
As usual, the showcase will be broadcast on YouTube (the livestream link will 
follow on the list) and we’ll host the Q&A on the #wikimedia-research IRC 
channel on freenode.

We look forward to seeing you there.

Dario


Functional roles and career paths in Wikipedia
By Felipe Ortega 
An understanding of participation dynamics within online production communities 
requires an examination of the roles assumed by participants. Recent studies 
have established that the organizational structure of such communities is not 
flat; rather, participants can take on a variety of well-defined functional 
roles. What is the nature of functional roles? How have they evolved? And how 
do participants assume these functions? Prior studies focused primarily on 
participants' activities, rather than functional roles. Further, extant 
conceptualizations of role transitions in production communities, such as the 
Reader to Leader framework, emphasize a single dimension: organizational power, 
overlooking distinctions between functions. In contrast, in this paper we 
empirically study the nature and structure of functional roles within 
Wikipedia, seeking to validate existing theoretical frameworks. The analysis 
sheds new light on the nature of functional roles, revealing the intricate 
“career paths” resulting from participants' role transitions.

Free Knowledge Beyond Wikipedia
A conversation facilitated by Benjamin Mako Hill 

In some of my research with Leah Buechley, I’ve explored the way that 
increasing engagement and diversity in technology communities often means not 
just attacking systematic barriers to participation but also designing for new 
genres and types of engagement. I hope to facilitate a conversation about how 
WMF might engage new readers by supporting more non-encyclopedic production. 
I'd like to call out some examples from the new Wikimedia project proposals 
list, encourage folks to share entirely new ideas, and ask for ideas about how 
we could dramatically better support Wikipedia's sister projects.

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] most clicked links in articles

2015-01-12 Thread Dario Taraborelli
Hey Andrew,

that’s a great question. I asked Legal to review the implications of publicly 
releasing a snapshot of this data and I’ll post the outcome of the audit on 
this list. FWIW, the data in question will be aggregated from the logs of raw 
HTTP requests that WMF passively receives. This is the same type of data we 
previously used for the presentation on readership trends the Analytics Team 
gave at Monthly Metrics in December. [1] The format of the logs and the data 
they contain is described here. [2]

Personally identifiable information (such as IP addresses or User Agents) will 
not be used other than for the purpose of filtering out bots and automated 
requests: clickthrough data will be obtained by parsing and counting specific 
string occurrences (such as an article title) in the referer string of an HTTP 
request. In other words, we will be counting and aggregating occurrences of 
requests for article B that have article A in the referer string. I’ll work 
with Ellery to release the code of the log-parsing script so it can be publicly 
reviewed before we move forward.
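
As a rough sketch of that aggregation (hypothetical input format; Ellery's 
script remains the authoritative version and will be published for review), 
assuming bots have already been filtered out:

from collections import Counter
from urllib.parse import urlparse, unquote

def article_title(url):
    # Extract an article title from a /wiki/ URL; a simplification that
    # ignores other URL layouts the real script would need to handle.
    if not url:
        return None
    path = urlparse(url).path
    if path.startswith("/wiki/"):
        return unquote(path[len("/wiki/"):])
    return None

def count_clickthroughs(log_pairs):
    # log_pairs yields (referer_url, requested_url) tuples
    pairs = Counter()
    for referer, request in log_pairs:
        a, b = article_title(referer), article_title(request)
        if a and b:
            pairs[(a, b)] += 1  # one view of article B coming from article A
    return pairs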

Hope this addresses your concerns,

Dario

[1] 
https://meta.wikimedia.org/w/index.php?title=File:2014_Readership_Update,_WMF_Metrics_Meeting,_December.pdf&page=10
 

[2] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive 


> On Jan 12, 2015, at 1:27 PM, Andrew Gray  wrote:
> 
> Hi all,
> 
> I'm curious about the privacy implications as well. I can't think of
> specific problems with this data, *but* it's information that I didn't
> think we'd ever been logging. We've historically been quite hands-off
> with any kind of reader information, other than raw hit counts, and
> there might well be some community discomfort at discovering it's been
> both tracked and released, even if completely anonymised.
> 
> Andrew.
> 
> On 12 January 2015 at 20:08, Toby Negrin  wrote:
>> Thanks Amir -- feel free to have your friend reach out to this list
>> directly.
>> 
>> As Ellery said, we're figuring our if there are any privacy implications in
>> releasing this dataset.
>> 
>> -Toby
>> 
>> On Mon, Jan 12, 2015 at 12:05 PM, Amir E. Aharoni
>>  wrote:
>>> 
>>> I am asking for a real-life friend who is doing some research. It's not
>>> for any particular project of mine, but I can easily imagine that it can be
>>> useful for a lot of editors and product managers as I wrote in the opening
>>> post.
>>> 
>>> (And I cannot think of any privacy problems if the data is not tied to any
>>> particular people, but maybe I'm naive.)
>>> 
>>> 
>>> --
>>> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
>>> http://aharoni.wordpress.com
>>> ‪“We're living in pieces,
>>> I want to live in peace.” – T. Moore‬
>>> 
>>> 2015-01-12 22:00 GMT+02:00 Toby Negrin :
 
 Hi Amir --
 
 Would you like to see these datasets released publicly or was there a
 specific project you were interested in using them for?
 
 thanks,
 
 -Toby
 
 On Mon, Jan 12, 2015 at 5:44 AM, Amir E. Aharoni
  wrote:
> 
> Hi,
> 
> Are there metrics about which links in each article are the most
> clicked?
> 
> I can think there's a lot to be learned from it:
> * Data-driven suggestions for manual of style about linking (too much
> and too few links are a perennial topic of argument)
> * How do people traverse between topics.
> * Which terms in the article may need a short explanation in parentheses
> rather than just a link.
> * How far down into the article do people bother to read.
> 
> Anyway, I can think that accessibility to such data can optimize both
> readership and editing.
> 
> And maybe this can be just taken right from the logs, without any
> additional EventLogging.
> 
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> ‪“We're living in pieces,
> I want to live in peace.” – T. Moore‬
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
> 
 
 
 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics
 
>>> 
>>> 
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>> 
>> 
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>> 
> 
> 
> 
> -- 
> - Andrew Gray
>  andrew.g...@dunelm.org.uk
> 
> 

Re: [Analytics] Making EventLogging output to a log file instead of the DB

2015-01-07 Thread Dario Taraborelli

> On Jan 7, 2015, at 6:42 AM, Gilles Dubuc  wrote:
> 
> Right -- couldn't we just tag the URL?
> 
> The event of the user actually viewing the image is completely disconnected 
> from the URL hit in Media Viewer, which is why we need EL and can't rely on 
> existing server logs.
>  
> Eventlogging data currently does go to files, as well as to the DB.
> 
> Great, then I guess it's a matter of only making the data go to files and not 
> to DB for the particular schema we'll create. Does that sound like something 
> feasible? How much work would be required to set it up?

this is a feature that other teams have requested in the past, and I agree it 
would be very helpful. In an ideal world, we would be able to specify the log 
configuration (where to write the data, pruning requirements, schema 
ownership) directly via a JSON object associated with the main schema.
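
Purely as an illustration of that idea (no such configuration exists today; 
every field name below is invented), sketched as a Python literal mirroring 
the JSON:

per_schema_logging = {
    "schema": "MediaViewerImageView",  # hypothetical schema name
    "outputs": ["file"],               # write to log files only, skip the DB
    "log_path": "/srv/eventlogging/archive/MediaViewerImageView.log",
    "retention_days": 90,              # pruning requirement
    "owner": "multimedia",             # schema ownership
}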

Dario

> On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto  > wrote:
> Eventlogging data currently does go to files, as well as to the DB.  Check it 
> out on stat1003 at /srv/eventlogging/archive.
> 
> If you need something with higher throughput then eventlogging itself 
> supports…then let’s talk :D
> 
> -Ao
> 
> 
> 
> 
>> On Jan 6, 2015, at 13:28, Erik Zachte > > wrote:
>> 
>> You mean attach an X-analytics parameter, for extra images beyond the one 
>> the user initially requested.
>>  
>> But then we would undercount, basically missing all image views from 
>> clicking right arrow in image viewer.
>> I'm not sure how much we would miss then.
>> iirc Gilles said this browsing feature was used quite a lot, but I'm not 
>> sure.
>>  
>> From: analytics-boun...@lists.wikimedia.org 
>>  
>> [mailto:analytics-boun...@lists.wikimedia.org 
>> ] On Behalf Of Toby Negrin
>> Sent: Tuesday, January 06, 2015 19:16
>> To: A mailing list for the Analytics Team at WMF and everybody who has an 
>> interest in Wikipedia and analytics.
>> Subject: Re: [Analytics] Making EventLogging output to a log file instead of 
>> the DB
>> 
>>  
>> 
>> Right -- couldn't we just tag the URL?
>> 
>>  
>> 
>> On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte > > wrote:
>> 
>> Just to clarify, this is about prefetched images which have not been shown 
>> to the public.
>> 
>> They were sent to the browser ahead of a possible request to speed things up 
>> but in many cases never actually requested.
>> 
>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
>>  
>> 
>> - Erik
>> 
>>  
>> 
>> From: analytics-boun...@lists.wikimedia.org 
>>  
>> [mailto:analytics-boun...@lists.wikimedia.org 
>> ] On Behalf Of Toby Negrin
>> Sent: Tuesday, January 06, 2015 18:49
>> To: A mailing list for the Analytics Team at WMF and everybody who has an 
>> interest in Wikipedia and analytics.
>> Subject: Re: [Analytics] Making EventLogging output to a log file instead of 
>> the DB
>> 
>>  
>> 
>> Hi Gilles -- why won't the page view logs work by themselves for this 
>> purpose? EL can be configured to write into Hadoop which is probably the 
>> best way to get the throughput you need but it seems overcomplicated.
>> 
>>  
>> 
>> -Toby
>> 
>>  
>> 
>> On Tue, Jan 6, 2015 at 9:41 AM, Gilles Dubuc > > wrote:
>> 
>> This depends on [1] so we're not going to need that immediately, but in 
>> order to help Erik Zachte with his RfC [2] to track unique media views in 
>> Media Viewer, I'm going to need to use something almost exactly like 
>> EventLogging. The main difference being that it should skip writing to the 
>> database and write to a log file instead.
>> 
>> That's because we'll be recording around 20-25M image views per day, which 
>> would needlessly overload EventLogging for little purpose since the data 
>> will be used for offline stats generation and doesn't need to be made 
>> available in a relational database. Of course if storage space and 
>> EventLogging capacity were no object, we could just use EL and keep the 
>> ever-growing table forever, but I have the impression that we want to be 
>> reasonable here and only write to a log, since that's what Erik needs.
>> 
>> So here's the question: for a specific schema, can EventLogging work the way 
>> it does but only record hits to a log file (maybe it already does that 
>> before hitting the DB?) and not write to the DB? If not, how difficult would 
>> it be to make EL capable of doing that?
>> 
>> 
>> [1] https://phabricator.wikimedia.org/T44815 
>> 
>> [2] 
>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
>>  
>> 

Re: [Analytics] WikiGrok and EventLogging

2015-01-07 Thread Dario Taraborelli
Agreed. Many of these articles will see spikes in traffic during the test (as 
the sample includes many celebrity articles), but the historical volume of 
traffic for the whole sample should give us a decent estimate of the 
throughput.

I also wouldn’t worry about any events other than 
MobileWebWikiGrok.page-impression and the events in the error log: all other 
events require user interaction.
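
A back-of-the-envelope version of that estimate, with hypothetical inputs 
(the 1.5 factor assumes one page-impression event per pageload plus widget 
impressions firing about half as often, per Kaldari's numbers below):

SECONDS_PER_DAY = 86400

def estimated_events_per_second(daily_pageviews, sample_articles):
    # daily_pageviews: hypothetical dict of article -> historical views/day
    daily_total = sum(daily_pageviews.get(a, 0) for a in sample_articles)
    return daily_total * 1.5 / SECONDS_PER_DAY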

Dario

> On Jan 7, 2015, at 7:08 AM, Aaron Halfaker  > wrote:
> 
> Leila,
> 
> It might be worthwhile to merge that article set with the webrequest data we 
> have in order to get a sense for how many pageloads/second to expect.  
> 
> -Aaron
> 
> On Tue, Jan 6, 2015 at 7:50 PM, Ryan Kaldari  > wrote:
> The highest volume events we are going to log will be:
> 1. For each of the 166,000 articles, one event when the page loads
> 2. For each of the 166,000 articles, one event when the WikiGrok widget 
> enters the viewport (about half as often as #1)
> 
> These will be active for all mobile users, logged in and logged out, 
> including many high pageview articles.
> 
> Given that information, do you have any idea if we are in danger of 
> overloading EventLogging? If so, do you have recommendations on sampling? So 
> far, everyone has said not to worry about it, but it would be good to get a 
> sanity check for this test specifically.
> 
> Kaldari
> 
> On Tue, Jan 6, 2015 at 4:57 PM, Nuria Ruiz  > wrote:
> (cc-ing mobile-tech)
> 
> Since we do not know the details of how wikigrok is used and its throughput 
> of requests, we cannot "estimate" sampling ourselves. I imagine wikigrok has 
> been deployed to a number of users, and it is with that usage that the mobile 
> team could estimate the total throughput expected; with this throughput we 
> can recommend sampling ratios. 
> 
> 
> Thanks for asking about this before deploying!
> 
> 
> On Tue, Jan 6, 2015 at 4:55 PM, Ryan Kaldari  > wrote:
> I can elaborate on this after I finished the SWAT deployment Gimme 30 
> minutes or so.
> 
> On Tue, Jan 6, 2015 at 4:51 PM, Leila Zia  > wrote:
> Hi,
> 
>   The mobile team is planning to switch WikiGrok on for non-logged-in users 
> next week (2015-01-12). The widget will be enabled on 166,029 article pages 
> in enwiki. There are two EventLogging schemas that may collect data heavily and 
> we want to make sure EL can handle the influx of data.
> 
> The two schema collecting data are: 
> https://meta.wikimedia.org/wiki/Schema:MobileWebWikiGrok 
> 
> https://meta.wikimedia.org/wiki/Schema:MobileWebWikiGrokError 
> 
> and the list of pages affected is in: 
> wgq_page in enwiki.wikigrok_questions.
> 
>It would be great if someone from the dev side let us know whether we will 
> need sampling.
> 
> Thanks,
> Leila
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Article feedback corpus released

2014-12-24 Thread Dario Taraborelli
I’m glad to announce the release of an open-licensed corpus with 1.5M records 
from the Article Feedback v5 pilot. 

http://dx.doi.org/10.6084/m9.figshare.1277784

Thanks to everyone who helped make this happen, Fabrice in particular for 
shepherding this through.

Dario

—
This dataset contains the entire corpus of feedback submitted on the English, 
French and German Wikipedia during the Article Feedback v.5 pilot (AFT). [1] 
The Wikimedia Foundation ran the Article Feedback pilot for a year between 
March 2013 and March 2014. During the pilot, 1,549,842 feedback messages were 
collected across the three languages.

All feedback messages and their metadata (as described in this schema [2]) are 
available in this dataset, with the exception of messages that have been 
oversighted and/or deleted by the end of the pilot.
The corpus is released [3] under the following licenses:

• CC BY-SA 3.0 for feedback messages
• CC0 for the associated metadata

Results from the pilot are discussed in: Halfaker, A., Keyes, O. and 
Taraborelli, D (2013). Making peripheral participation legitimate: Reader 
engagement experiments in Wikipedia. CSCW ’13 Proceedings of the 2013 
Conference on Computer Supported Cooperative Work [4][5]

[1] https://www.mediawiki.org/wiki/Article_feedback/Version_5
[2] 
https://www.mediawiki.org/wiki/Article_feedback/Version_5/Technical_Design_Schema#aft_feedback
[3] https://wikimediafoundation.org/wiki/Feedback_data#Article_Feedback
[4] http://dx.doi.org/10.1145/2441776.2441872
[5] http://nitens.org/docs/cscw13.pdf
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Freebase winding down, to be ingested into Wikidata

2014-12-17 Thread Dario Taraborelli
In case you missed the announcement: 
https://plus.google.com/app/basic/stream/z122wpyxhob0hjoik04cc3vatw2zfv4zszk0k

Dario
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Page view generalized filter draft (due Friday, Dec 12th)

2014-12-15 Thread Dario Taraborelli
Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on with 
the implementation.

> On Dec 15, 2014, at 11:32 AM, Oliver Keyes  wrote:
> 
> Totally!
> 
> On 15 December 2014 at 14:22, Andrew Otto  > wrote:
> Ah cool, didn’t realize there was a neutral definition.  We should call that 
> the ‘formal specification’ then.
> 
>> ...of course, now that I've said that, cosmic irony demands we end up 
>> implementing in C, or something.
> Hm, a UDF that does this rather than a Hive query would probably be better.  
> E.g.
> 
>   SELECT
> request_qualifier(uri_host),
> count(*)
>   FROM
> wmf_raw.webrequest
>   WHERE
> is_pageview(uri_host, uri_path, http_status, content_type)
>   GROUP BY
> request_qualifier(uri_host)
>   ;
> 
> 
> Or something like that.
> 
> -Ao
> 
> 
> 
> 
> 
> 
>> On Dec 15, 2014, at 14:07, Oliver Keyes > > wrote:
>> 
>> It's totally tech-agnostic; the neutral definition is on meta. The hive 
>> query is just because, since we suspect that's how we'll be generating the 
>> data, it makes sense to turn the draft def into HQL for exploratory queries 
>> and testing.
>> 
>> ...of course, now that I've said that, cosmic irony demands we end up 
>> implementing in C, or something.
>> 
>> On 15 December 2014 at 13:46, Toby Negrin > > wrote:
>> I think the hive code is "representative" in that it's an implementation. 
>> It's certainly not the only permitted one. 
>> 
>> On Dec 15, 2014, at 10:34 AM, Andrew Otto > > wrote:
>> 
  We're moving forward to generate Hive queries that will represent the 
 formal specification.
>>> Should a specific implementation (e.g. Hive) represent the formal 
>>> specification?  I tend to think it should be tech-agnostic, no?
>>> 
>>> 
>>> 
 On Dec 15, 2014, at 12:15, Aaron Halfaker >>> > wrote:
 
 Toby, that's right.  We're moving forward to generate Hive queries that 
 will represent the formal specification.  
 
 -Aaron
 
 On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes >>> > wrote:
 We've written the draft Hive queries and I'm reviewing them with Otto now. 
 Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it 
 through :).
 
 On 15 December 2014 at 12:10, Toby Negrin >>> > wrote:
 Hi Aaron, all --
 
 I haven't seen any discussion on this which is a sign that we can forward 
 with turning over the draft. Thoughts?
 
 thanks,
 
 -Toby
 
 On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker >>> > wrote:
 Hey folks,
 
 As discussions on the new page view definition have been calming down, 
 we're preparing to deliver a draft version to the Devs.  I want to make 
 sure that we all know the status and that any substantial concerns are 
 raised before we hand things off on Friday, Dec 12th.
 
 For this phase, we are delivering the general filter[1].  This is the 
 highest level filter, and exists primarily to distinguish requests worthy 
 of further evaluation. Our plan is to take the definition as it exists on 
 the 12th, and begin generating high-level aggregate numbers based on it. 
 In future iterations, we will be digging into different breakdowns of this 
 metric, and iterating on it to handle any inconsistencies or unexpected 
 results.  There are a few differences from Web Stat Collector's (WSC) 
 version of the general filter that we want to call to your attention.
 We include searches -- WSC explicitly excludes them.
 We include Apps traffic -- WSC does not detect Apps traffic
 We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC 
 hardcodes "/wiki/"
 We don't include Banner impressions -- WSC includes them.
 There are also some known issues with the new definition that are worth 
 your notice:
 
 Internal traffic is counted
 Note that WSC filters some internal traffic by hardcoding a set of IPs in 
 the definition.  We are working on parsing puppet templates in order to 
 automatically detect which IPs represent internal traffic.  This will be a 
 /better/ solution, but it's not quite ready yet because parsing puppet is 
 hard.  
 Spider traffic is counted
 We will be using the User-agent field to detect and flag spider-based 
 traffic.  This "tag definition" will be delivered in a subsequent 
 definition.  This actually matches WSC, which does not filter spider for 
 the high-level metrics.
 These are problems we're aware of, and will be factoring in as we go 
 forward with our next task: refining the definition using real, 
 hourly-level traffic data. Thanks to everyone who has given feedback and 
 partic

Re: [Analytics] EventLogging data QA

2014-12-11 Thread Dario Taraborelli
thanks for the quick turnaround.

On Dec 11, 2014, at 4:28 PM, Ori Livneh  wrote:

> There's this graph: 
> https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1418343627.977&from=-1weeks&target=movingMedian(diffSeries(eventlogging.overall.raw.rate%2Ceventlogging.overall.valid.rate)%2C20)
>  
> 
> 
> The key is 
> 'diffSeries(eventlogging.overall.raw.rate,eventlogging.overall.valid.rate)', 
> which gets you the rate of invalid events per second.
> 
> It is not broken down by schema, though.

this is great for monitoring, but for QA purposes we really need the raw data

> We can't write invalid events to a database -- at least not the same way we 
> write well-formed events. The table schema is derived from the event schema, 
> so an invalid event would violate the constraints of the table as well.

rrright

> It's possible (and easy) to set something up that watches invalid events in 
> real-time and does something with them. The question is: what? E-mail an 
> alert? Produce a daily report? Generate a graph?
> 
> If you describe how you’d like to consume the data, I can try to hash out an 
> an implementation with Nuria and Christian.

a JSON log like all-events.log, but sync’ed from vanadium more frequently, 
would do the job for me. It can also be truncated, as we probably only need a 
relatively short time window and the complete data is captured in all-events 
anyway.

D
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] EventLogging data QA

2014-12-11 Thread Dario Taraborelli
I am kicking off this thread after a good conversation with Nuria and Kaldari 
on pain points and opportunities we have around data QA for EventLogging.

Kaldari, Leila and I have gone through several rounds of data QA before and 
after the deployment of new features on Mobile and we haven’t found yet a good 
solution to catch data quality issues early enough in the deployment cycle. 
Data quality issues with EventLogging typically fall under one of these 5 
scenarios:

1) events are logged and schema-compliant but don’t capture data correctly (for 
example: a wrong value is logged; event counts that should match don’t)
2) events are logged but are not schema-compliant (e.g.: a required field is 
missing)
3) events are missing due to issues with the instrumentation (e.g.: a UI 
element is not instrumented)
4) events are missing due to client issues (a specific UI element is not 
correctly rendered on a given browser/platform and as a result the event is not 
fired)
5) events are missing due to EventLogging outages

In the early days, Ori and I floated the idea of unit tests for instrumentation 
to capture constraint violations that are not easily detected via manual 
testing or the existing client-side validation, but this never happened. When 
it comes to feature deployments, beta labs is a great starting point for 
running manual data QA in an environment that is as close as possible to prod. 
However, there are types of data quality issues that we only discover when 
collecting data at scale and in the wild (on browsers/platforms that we don’t 
necessarily test for internally).

Having a full-fledged set of unit tests for data would be terrific, but in the 
short term I’d like to find a better way to at least identify events that fail 
validation as early as possible.

- the SQL log database has real-time data, but only for events that pass 
client-side validation
- the JSON logfiles on stat1003 include invalid events, but the data is only 
rsync’ed from vanadium once a day

is there a way to inspect invalid events in near real time without having 
access to vanadium? For example, could we create either a dedicated database to 
write invalid events only or a logfile for validation errors rsync’ed to 
stat1003 more frequently than once a day?
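
To make this concrete, here is a minimal sketch of the kind of consumer I 
have in mind (hypothetical file layout; assumes the jsonschema library and 
one JSON event per line):

import json
import jsonschema

def report_invalid(log_path, schemas):
    # schemas: dict mapping schema name -> JSON Schema dict
    # Yields (line_number, error) for every event that fails validation.
    with open(log_path) as f:
        for n, line in enumerate(f, 1):
            try:
                event = json.loads(line)
                jsonschema.validate(event["event"], schemas[event["schema"]])
            except (ValueError, KeyError, jsonschema.ValidationError) as err:
                yield n, err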

Thoughts?

Dario
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] s1-analytics-slave impressively slow queries

2014-11-10 Thread Dario Taraborelli
Let's kill them (Leila is OoO today and tomorrow).

> On Nov 10, 2014, at 08:01, Nuria Ruiz  wrote:
> 
> cc-ing Leila, as we were experimenting with these some weeks back in SF; I 
> think they can be killed w/o problems. I did not know they were still 
> running; we ran a faster version of those queries and got the data we were 
> interested in a while back.
> 
>> On Mon, Nov 10, 2014 at 1:55 AM, Sean Pringle  wrote:
>> Three identical queries from the 'research_prod' user have just passed one 
>> month execution time on s1-anlytics-slave:
>> 
>> select count(*) 
>> from staging.ourvision r
>> where exists (
>>   select *
>>   from staging.ourvision r1
>>   inner join
>>staging.ourvision r2 
>>   on r2.sha1 = r1.sha1
>> where r1.page_id = r.page_id
>>   and r2.page_id = r.page_id
>>   and DATE_ADD(r.timestamp, INTERVAL 1 HOUR)
>>   and r2.timestamp between r.timestamp and DATE_SUB(r.timestamp , 
>> INTERVAL 1 HOUR) 
>>   and  r1.sha1!= r.sha1
>> );
>> 
>> I haven't checked to see if the queries are just that amazingly slow, or if 
>> they're part of a larger ongoing transaction. In any case, three month-long 
>> transactions are pushing the resource limits of the slave and will soon 
>> result in either mass replication lag or some other interesting lockup that 
>> may in turn take days to roll back :-)
>> 
>> Can we kill these? Can we optimize and/or redesign the jobs? Happy to help...
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] data in Vital Signs

2014-11-04 Thread Dario Taraborelli
to add some context to the present approach, you may remember that when we 
defined Editor Model metrics we started from the highest possible level of 
aggregation (i.e. all namespaces combined, archive table included). See 
rationale below from a previous email exchange:

we tried to stick to two general principles:

1) we want to count users making contributions to a project as a whole. 
Establishing that only “content activity” should be considered means that 
someone uploading a picture, editing a template, drafting an article outside of 
ns0 (we have a new Draft namespace), writing or contributing to a new policy, 
helping coordinate a wikiproject, i.e. all fundamental activities that 
contribute to the growth of the project, would be discounted as an editor. By 
this token, someone writing an entire article outside of the main namespace 
would not be included as an editor while a vandal fighter only reverting edits 
at the push of a button would be considered as a contributor. The point I’m 
trying to make is that establishing what “content” means is very arbitrary and 
we should have a measure of overall participation to a project, followed by 
more granular metrics by type of contribution (see next point).

2) instead of starting with a list of exclusions (i.e. we will only measure a 
subset of ns0 edits on articles meeting specific criteria such as countable 
pages), we will introduce breakdowns that inform us about specific types of 
activity. “Namespace” is a possible proxy for types of content, but not 
necessarily the best or the only one. One day, I’d like to be able to monitor 
active typo-fixers or template-editors, but I believe we should start from the 
highest possible level and count total activity or total unique editors before 
breaking them down.

Adding a NS dimension or other criteria to filter top-level metrics sounds like 
a totally legitimate request as a metric breakdown.
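
As a toy sketch of what "top-level metric first, breakdowns second" means 
computationally (the 5-edit threshold matches the standard active editor 
definition; the input format is hypothetical):

from collections import defaultdict

def active_editors(revisions, threshold=5):
    # revisions: iterable of (user, namespace) pairs for one month
    overall = defaultdict(int)
    by_ns = defaultdict(lambda: defaultdict(int))
    for user, ns in revisions:
        overall[user] += 1
        by_ns[ns][user] += 1
    total = sum(1 for c in overall.values() if c >= threshold)
    breakdown = {ns: sum(1 for c in users.values() if c >= threshold)
                 for ns, users in by_ns.items()}
    return total, breakdown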

Dario

> On Nov 4, 2014, at 12:55 PM, James Forrester  wrote:
> 
> Thanks Toby! :-)
> 
> On 4 November 2014 12:38, Toby Negrin  > wrote:
> Created tracking bug -- please add yourselves to the cc if desired.
> 
> https://bugzilla.wikimedia.org/show_bug.cgi?id=72973 
> 
> 
> -Toby
> 
> On Tue, Nov 4, 2014 at 12:07 PM, James Forrester  > wrote:
> On 4 November 2014 12:00, Aaron Halfaker  > wrote:
> Understood for page creations.  The metric is named "Page creations".  We 
> ought to have a metric called "Content page creations" or "Unique content 
> page creators".  
> 
> Yeah, having both would be great but I don't want to demand the world on a 
> stick. ;-)
> 
> One bit of complication: How do you feel about the draft namespace for 
> enwiki?  Should it be included in content page creations?  
> 
> That should be in $wgContentNamespaces but unfortunately isn't (see the 
> config file). I'll get that fixed.
>  
> As for edits, the correlation is so strong between edits to content and edits 
> to other namespaces that it doesn't matter which we use when looking for 
> trends.[1]
> 
> 1. 
> http://meta.wikimedia.org/wiki/Research:Refining_the_definition_of_monthly_active_editors
>  
> 
> 
> Fair point.
> 
> J.
> -- 
> James D. Forrester
> Product Manager, Editing
> Wikimedia Foundation, Inc.
> 
> jforres...@wikimedia.org  | @jdforrester
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> 
> 
> 
> 
> 
> -- 
> James D. Forrester
> Product Manager, Editing
> Wikimedia Foundation, Inc.
> 
> jforres...@wikimedia.org  | @jdforrester
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] stat1002 log cleanup

2014-11-04 Thread Dario Taraborelli
On Nov 4, 2014, at 2:08 PM, Nuria  wrote:
> 
> No, database records are not affected. It should not impact your work with EL 
> in any way as your findings come from the records in the database.
> The logs are used mainly for operational purposes by the dev team as 
> maintainers of the system.

this obviously means that any missing, corrupted or invalid data from the log 
DB can only be recovered within the data retention window, which seems 
reasonable to me.

> On Nov 4, 2014, at 11:55 AM, Aaron Halfaker  > wrote:
> 
>> Hey guys,
>> 
>> Sorry for the late response, but I'm still not sure what lives in 
>> /a/eventlogging/archive/*
>> 
>> Will deleting from there affect what logs we have stored in the DB?  Is this 
>> an intermediate log storage place, a canonical one, etc.?
>> 
>> What will we no longer be able to do after it is pruned?
>> 
>> -Aaron
>> 
>> On Thu, Oct 30, 2014 at 2:35 PM, Nuria Ruiz > > wrote:
>> >Also, I'm not clear on the significance of the EL archive directory.  Can 
>> >you remind me/direct me to documentation?
>> Well, the logs just record the incoming pipeline of events, we have used 
>> them to troubleshoot operational issues in the past but the bulk of data 
>> analysis in EL happens from data stored on database.
>> 
>> Some info here: 
>> https://wikitech.wikimedia.org/wiki/EventLogging#Data_storage 
>> 
>>  
>> 
>> On Thu, Oct 30, 2014 at 12:22 PM, Aaron Halfaker > > wrote:
>> Nuria, can you specify which logs will be trimmed. 
>> 
>> Also, I'm not clear on the significance of the EL archive directory.  Can 
>> you remind me/direct me to documentation?
>> 
>> On Thu, Oct 30, 2014 at 12:36 PM, Nuria Ruiz > > wrote:
>> 
>> Hello, 
>> 
>> To comply with our privacy policy we are going to purge logs in 1002 that 
>> are older than 90 days. Please let us know whether this is an issue. We hope 
>> to have these changes done by the end of next week.
>> 
>> A concrete example: 
>> 
>> Logs in, for example, the eventlogging archiving directory:
>> 
>> @stat1002:/a/eventlogging/archive$ 
>> 
>> 
>> will be restricted to the last 90 days. 
>> 
>> Thanks, 
>> 
>> Nuria
>> 
>> 
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org 
>> https://lists.wikimedia.org/mailman/listinfo/analytics 
>> 
>> 
>> 
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org 
>> https://lists.wikimedia.org/mailman/listinfo/analytics 
>> 
>> 
>> 
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org 
>> https://lists.wikimedia.org/mailman/listinfo/analytics 
>> 
>> 
>> 
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org 
>> https://lists.wikimedia.org/mailman/listinfo/analytics 
>> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Poke for the mailing list admins, whoever they are

2014-11-03 Thread Dario Taraborelli
{{done}}

> On Nov 2, 2014, at 12:30 PM, Oliver Keyes  wrote:
> 
> Whoops; that's research-l! And this is why I shouldn't send emails after 10pm.
> 
> On 1 November 2014 22:33, Oliver Keyes  > wrote:
> Could we temporarily moderate Aileen, please? This is getting somewhat 
> ridiculous and cluttering the archives (and my inbox) with automated dross.
> 
> (Mandatory pause while I wait for Aileen's autoresponder to prove my point)
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
> 
> 
> 
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Wikimedia Research showcase – October 15 2014, 11.30 PT

2014-10-14 Thread Dario Taraborelli
After a break in September, we’re resuming our monthly Research and Data 
showcase. The next showcase will be live-streamed tomorrow, Wednesday October 
15, at 11.30 PT. As usual, you can join the conversation via IRC on 
freenode.net in the #wikimedia-research channel.

We look forward to seeing you there,

Dario


This month:

Emotions under Discussion: Gender, Status and Communication in Wikipedia
By David Laniado: I will present a large-scale analysis of emotional expression 
and communication style of editors in Wikipedia discussions. The talk will 
focus especially on how emotion and dialogue differ depending on the status, 
gender, and the communication network of the roughly 12,000 editors who have 
written at least 100 comments on the English Wikipedia's article talk pages. 
The analysis is based on three different predefined lexicon-based methods for 
quantifying emotions: ANEW, LIWC and SentiStrength. The results unveil 
significant differences in the emotional expression and communication style of 
editors according to their status and gender, and can help to address issues 
such as gender gap and editor stagnation.

Wikipedia as a socio-technical system
By Aaron Halfaker: Wikipedia is a socio-technical system. In this presentation, 
I'll explain how the integration of human collective behavior ("social") and 
information technology ("technical") has led to a phenomenon that, while being 
massively productive, is poorly understood due to a lack of precedent. Based on 
my work in this area, I'll describe five critical functions that healthy, 
Wikipedia-like socio-technical systems must serve in order to continue to 
function: allocation, regulation, quality control, community management and 
reflection. Next I'll argue the Wikimedia Foundation's analytics strategy 
currently focuses on outcomes related to a relatively narrow aspect of system 
health and all but completely ignores productivity. Finally, I'll conclude with 
an overview of three classes of new projects that should provide critical 
opportunities to both practically and academically understand the maintenance 
of Wikipedia's socio-technical fitness.

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Welcome Marcel Ruiz Forns to the Analytics Development team

2014-10-07 Thread Dario Taraborelli
Benvingut — looking forward to working with you, Marcel.

> On Oct 7, 2014, at 17:52, Jonas Xavier  wrote:
> 
> Bem-vindo, Marcel!
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] eventlogging largest tables

2014-09-29 Thread Dario Taraborelli
On Sep 27, 2014, at 11:42 AM, Aaron Halfaker  wrote:

> I'm not surprised that PageContentSaveComplete is big.  That's a very useful 
> table and it sees a lot of rows for good reason (every revision saved on 
> every wiki).  
> 
> As for the Multimedia/Mediaviewer tables, we should probably ping someone on 
> that team to discuss them. 
> 
> Dario, can you speak for the MobileWebClickTracking and 
> MobileWikiAppToCInteraction schemas?

neither Oliver nor I am using this data, but it’s used for some Limn 
dashboards by the Mobile team. Copying Maryana and Kaldari so they can chime 
in.

D

> On Sat, Sep 27, 2014 at 2:02 PM, Sean Pringle  wrote:
> Hi :-)
> 
> These are the largest Eventlogging tables on m2-master:
> 
> 145GMobileWebClickTracking_5929948.ibd
> 94G PageContentSaveComplete_5588433.ibd
> 61G MediaViewer_8572637.ibd
> 57G MediaViewer_8245578.ibd
> 30G MultimediaViewerNetworkPerformance_7917896.ibd
> 29G MediaViewer_8935662.ibd
> 24G MobileWikiAppToCInteraction_8461467.ibd
> 
> Are these sizes roughly expected?
> 
> Anything we can discard or reduce?
> 
> Where did the discussion on purging data end up?
> 
> No immediate problems here, just rattling cages :-)
> 
> BR
> /s
> 
> -- 
> DBA @ WMF
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Ten Simple Rules for Better Figures

2014-09-12 Thread Dario Taraborelli
A no-nonsense guide to scientific data visualization published in PLOS 
Computational Biology

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003833

(the contents are CC-BY licensed and the source code is here: 
https://github.com/rougier/ten-rules )

Dario
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] pitching the Gender Edit Dashboard

2014-08-31 Thread Dario Taraborelli
…meanwhile:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0104880

(I reached out to Chato, Mayo and David to ask if they would like to present 
this work at the research showcase)


Emotions under Discussion: Gender, Status and Communication in Online 
Collaboration
Daniela Iosub, David Laniado, Carlos Castillo, Mayo Fuster Morell, Andreas 
Kaltenbrunner

Published: August 20, 2014. DOI: 10.1371/journal.pone.0104880


Background

Despite the undisputed role of emotions in teamwork, not much is known about 
the make-up of emotions in online collaboration. Publicly available 
repositories of collaboration data, such as Wikipedia editor discussions, now 
enable the large-scale study of affect and dialogue in peer production.

Methods

We investigate the established Wikipedia community and focus on how emotion and 
dialogue differ depending on the status, gender, and the communication network 
of the editors who have written at least 100 comments on the English 
Wikipedia's article talk pages. Emotions are quantified using a word-based 
approach comparing the results of two predefined lexicon-based methods: LIWC 
and SentiStrength.

Principal Findings

We find that administrators maintain a rather neutral, impersonal tone, while 
regular editors are more emotional and relationship-oriented, that is, they use 
language to form and maintain connections to other editors. A persistent gender 
difference is that female contributors communicate in a manner that promotes 
social affiliation and emotional connection more than male editors, 
irrespective of their status in the community. Female regular editors are the 
most relationship-oriented, whereas male administrators are the least 
relationship-focused. Finally, emotional and linguistic homophily is prevalent: 
editors tend to interact with other editors having similar emotional styles 
(e.g., editors expressing more anger connect more with one another).
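The word-based approach described above can be illustrated with a minimal 
sketch; the word lists below are toy stand-ins, not the actual LIWC or 
SentiStrength lexicons, and the scoring is deliberately simplistic.

    import re

    # Toy lexicons standing in for LIWC/SentiStrength-style categories.
    ANGER = {"angry", "furious", "annoyed", "outraged"}
    AFFILIATION = {"we", "together", "thanks", "friend", "agree"}

    def emotion_scores(comment):
        # Score each category as the fraction of tokens matching its lexicon.
        tokens = re.findall(r"[a-z']+", comment.lower())
        total = max(len(tokens), 1)
        return {
            "anger": sum(t in ANGER for t in tokens) / float(total),
            "affiliation": sum(t in AFFILIATION for t in tokens) / float(total),
        }

    print(emotion_scores("Thanks, I agree we can fix this together."))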



On Aug 29, 2014, at 11:59 AM, Dario Taraborelli  
wrote:

> I too recommend the use of micro-surveys. The full rationale is here [1] but 
> one of the immediate benefits I see is the ability to randomly sample from 
> the population of newly registered users. It shouldn’t be particularly hard 
> to set up an ongoing gender micro-survey to collect this data over time (it’s 
> more a question for UX/Product: would this interfere with the existing 
> acquisition workflow?). We can also trigger a micro-survey at the end of the 
> edit funnel and measure user drop-off rate by (self-reported) gender.
> 
> Product has concerns about adding extra fields to the signup screen: they may 
> not be optimal from a UX perspective, but micro-surveys are the most flexible 
> way of collecting this kind of demographic data without heavy MediaWiki 
> engineering effort.
> 
> Dario
> 
> [1] http://www.mediawiki.org/wiki/Extension:GuidedTour/Microsurveys
> 
> 
> On Aug 29, 2014, at 7:01 AM, Leila Zia  wrote:
> 
>> 
>> On Fri, Aug 29, 2014 at 4:58 AM, Dan Andreescu  
>> wrote:
>> I wonder if we might explore ways to improve such a survey.  For example, we 
>> might include the gender question in the signup form for a small percentage 
>> of newly registered users.
>> This experiment sounds more useful than the current gender data.  Over time, 
>> it would also allow us to track retention rate by gender for those who 
>> answer the question.
>> 
>> +1
>>  
>> 
> 



Re: [Analytics] pitching the Gender Edit Dashboard

2014-08-29 Thread Dario Taraborelli
I too recommend the use of micro-surveys. The full rationale is here [1] but 
one of the immediate benefits I see is the ability to randomly sample from the 
population of newly registered users. It shouldn’t be particularly hard to set 
up an ongoing gender micro-survey to collect this data over time (it’s more a 
question for UX/Product: would this interfere with the existing acquisition 
workflow?). We can also trigger a micro-survey at the end of the edit funnel and 
measure user drop-off rate by (self-reported) gender.

Product has concerns about adding extra fields to the signup screen: they may 
not be optimal from a UX perspective, but micro-surveys are the most flexible 
way of collecting this kind of demographic data without heavy MediaWiki 
engineering effort.

Dario

[1] http://www.mediawiki.org/wiki/Extension:GuidedTour/Microsurveys
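A minimal sketch of the random sampling this would require, assuming newly 
registered users are bucketed by a hash of their user ID; the sampling rate and 
function name are illustrative, not an existing EventLogging or GuidedTour API.

    import hashlib

    SAMPLE_RATE = 0.01  # illustrative: survey 1% of new registrations

    def in_survey_sample(user_id):
        # Hashing gives each user a stable, uniformly distributed bucket,
        # so the same user is consistently in or out of the sample.
        digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
        return int(digest[:8], 16) / float(0xffffffff) < SAMPLE_RATE

    surveyed = [u for u in range(10000) if in_survey_sample(u)]
    print(len(surveyed))  # roughly 100 out of 10000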


On Aug 29, 2014, at 7:01 AM, Leila Zia  wrote:

> 
> On Fri, Aug 29, 2014 at 4:58 AM, Dan Andreescu  
> wrote:
> I wonder if we might explore ways to improve such a survey.  For example, we 
> might include the gender question in the signup form for a small percentage 
> of newly registered users.
> This experiment sounds more useful than the current gender data.  Over time, 
> it would also allow us to track retention rate by gender for those who answer 
> the question.
> 
> +1
>  
> 



Re: [Analytics] Editor engagement analytics

2014-08-14 Thread Dario Taraborelli
Pine – in fact (as I'm sure you know, since you post frequently there) you can 
reach most of the Product people involved in the design of editor engagement 
features/experiments via e...@lists.wikimedia.org.

On Aug 14, 2014, at 7:10 AM, Toby Negrin  wrote:

> Thanks Aaron -- well said.
> 
> We are collaborating with the growth team on task suggestions, which is one of 
> the first areas where we see our data being used to drive feature 
> development. We have some ideas in this area but our activities have been 
> focused on measurement and comprehension.
> 
> -Toby
> 
> 
> On Thu, Aug 14, 2014 at 7:07 AM, Aaron Halfaker  
> wrote:
> Hey Pine,
> 
> We don't deploy software that affects the user experience on Wikimedia 
> projects, so it is hard to identify any direct effect on editor engagement 
> that we've had.  The Product teams[1] develop user-facing features.  It 
> doesn't look like they have a public facing mailing list, but the community 
> engagement team (for product)[2] does.  You can contact them at 
> c...@lists.wikimedia.org.  
> 
> In analytics, we develop new measures of editor engagement (among other 
> things)[3] and deploy those measures for public use.  For example, see 
> WikiMetrics[4].  We also support the product teams by helping them identify 
> which features are likely to have a positive impact with background analysis 
> (e.g. [5]) and by running experiments to help product teams iterate toward 
> feature designs that maximize positive impact (e.g. [6]).  Right now, we 
> provide direct support of the Growth[7] and Mobile[8] product teams, but we 
> also consult with other teams at the WMF and engage with "community outreach 
> efforts" (e.g. [9]) in our (not so copious) free time.
> 
> 1. https://www.mediawiki.org/wiki/Product
> 2. https://www.mediawiki.org/wiki/Community_Engagement_(Product)
> 3. 
> https://www.mediawiki.org/wiki/Analytics/Epics/Editor_Engagement_Vital_Signs
> 4. https://metrics.wmflabs.org/
> 5. https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation
> 6. 
> https://meta.wikimedia.org/wiki/Research:Asking_anonymous_editors_to_register
> 7. https://meta.wikimedia.org/wiki/Growth
> 8. https://www.mediawiki.org/wiki/Mobile_web_projects
> 9. 
> https://meta.wikimedia.org/wiki/Research:Labs2/Hackathons/August_6-7th,_2014
> 
> -Aaron
> 
> 
> On Thu, Aug 14, 2014 at 2:11 AM, Pine W  wrote:
> Hi Analytics team,
> 
> I'm curious, which tools developed by Analytics have contributed notably to 
> editor engagement successes?
> 
> Pine
> 
> 



Re: [Analytics] Public EventLogging --> LabsDB

2014-08-13 Thread Dario Taraborelli
(expanding on what I think Dan is referring to re: goals), addressing this 
issue would allow EEVS to access data needed to generate breakdowns for metrics 
by method/target site (mobile, desktop, apps).

On Aug 13, 2014, at 1:40 PM, Dan Andreescu  wrote:

> Kevin, for what it's worth I don't think that bug that Sean is asking for is 
> that challenging.  The relevant part we'd have to change is really just a few 
> lines [1].  I respect your decision of course, but I just wanted to point out 
> that this issue does drive towards some of our goals, as we talked a bit 
> about getting EventLogging data to be usable by Wikimetrics, and this is the 
> first step.
> 
> 
> [1] - 
> https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging/4d917e1594e6f09784ab0e0bffccc144f87a11b3/server%2Feventlogging%2Fjrm.py#L167
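For context, "batching" here means one multi-row INSERT round-trip instead of 
one INSERT per event. A minimal sketch with a DB-API style driver follows; the 
table and column names are illustrative, not the actual change to jrm.py.

    def insert_batch(conn, events):
        # One executemany() call replaces len(events) single-row inserts.
        rows = [(e["timestamp"], e["wiki"]) for e in events]
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO PageMove_1234567 (timestamp, wiki) VALUES (%s, %s)",
                rows,
            )
        conn.commit()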
> 
> 
> On Wed, Aug 13, 2014 at 4:19 PM, Aaron Halfaker  
> wrote:
> OK.  Sounds reasonable.  Sorry to seem as though I am pushing on you & the 
> devs.  In fact, specifying that you won't have the bandwidth to even consider 
> the bug until next quarter gives me the power to push on others.  >:)
> 
> Thanks!
> -Aaron
> 
> 
> On Wed, Aug 13, 2014 at 8:56 PM, Kevin Leduc  wrote:
> Hi Aaron,
> 
> I was not planning on prioritizing any EventLogging work for the rest of this 
> quarter.  The analytics dev team has a goal to get an EEVS dashboard running 
> and I want to keep them focused; otherwise we will not reach this goal.
> 
> I'm tempted to ask what springle and YuviPanda can accomplish without the 
> help of the analytics devs, but even that will imply discussions and 
> distractions from our goals.
> 
> In September I am planning on looking at what goals we can set for the next 
> quarter and look at what we want to accomplish with EventLogging.  I was 
> going to prioritize it at that point.
> 
> 
> 
> 
> On Wed, Aug 13, 2014 at 10:28 AM, Aaron Halfaker  
> wrote:
> Excellent.  Kevin, can you work to get that bug[1] prioritized and let us 
> know?   I can start working with R&D on a proposal to bring to legal.  
> 
> 1. https://bugzilla.wikimedia.org/show_bug.cgi?id=67450
> 
> It stands to reason that you would be interested in the capsule too, as it 
> holds the timestamp and wiki project the event applies to, but I imagine we 
> can make fields public selectively.
> 
> Fair enough.  I think we can drop that one column from the capsule and be 
> quite happy with the rest.  No need to purge EventLogging.   
> 
> -Aaron
> 
> 
> On Wed, Aug 13, 2014 at 6:08 PM, Nuria Ruiz  wrote:
> > Re. (2), I didn't say anything about that being related to public/private.  
> > This is a request from springle -- that if we are going to start pushing 
> > Events to LabsDB, he'd like us to do so more efficiently.  That bug is 
> > about efficiently batching inserts.
> ah, my mistake. Kevin can do prioritization as needed.
> 
> >If you are concerned about UserAgents as the sanitization page you linked to 
> >suggests, then we should talk about the >EventLogging capsule, not the 
> >event.  
> If you want to be so precise, sure, that is correct. Note that currently 
> there is no distinction in storage between the event and the capsule; they are 
> stored together in the same record. Capsule data is only identified by a 
> prefix on the column name. It stands to reason that you would be interested 
> in the capsule too, as it holds the timestamp and wiki project the event 
> applies to, but I imagine we can make fields public selectively.
> 
> 
> 
> 
> 
> On Wed, Aug 13, 2014 at 6:47 PM, Aaron Halfaker  
> wrote:
> Re. (2), I didn't say anything about that being related to public/private.  
> This is a request from springle -- that if we are going to start pushing 
> Events to LabsDB, he'd like us to do so more efficiently.  That bug is about 
> efficiently batching inserts. 
> 
> I don't know what you are talking about re. 90-day purges.  I'm talking about 
> 100% public EventLogging events -- e.g. 
> https://meta.wikimedia.org/wiki/Schema:PageMove   Also, we do *not* need to 
> purge EventLogging event data at 90 days.  We need to purge PII at 90 days.  
> We generally do not store PII in EventLogging events, but when we do, we 
> organize 90 days purges as we have recently for the anonymous editor 
> experiments.  If you are concerned about UserAgents as the sanitization page 
> you linked to suggests, then we should talk about the EventLogging capsule, 
> not the event. 
> 
> Re. (1), we are already performing this review internally in order to 
> determine what does and does not conform to the Data Retention Guidelines.  
> It seems clear that a robust process could also identify non-sensitive 
> Schemas that could be published in labs.
> 
> -Aaron
> 
> 
> On Wed, Aug 13, 2014 at 5:00 PM, Nuria Ruiz  wrote:
> Aaron, 
> 
> >(2) https://bugzilla.wikimedia.org/show_bug.cgi?id=67450
> The bug does not have to do with making data public. It has to do with how 
> data is inserted into EL from the 
> consumers,

Re: [Analytics] Data inconsistency with displayMobile in ServerSideAccountCreation

2014-07-25 Thread Dario Taraborelli
Dan,

we were just having a separate discussion about the fact that the various 
isMobile or displayMobile fields predate the launch of apps and are likely to 
create artifacts if used to filter app-specific events.
IMO there should be a default field in the event capsule for {desktop site, 
mobile site, app} applied to all events. Further breakdowns by client or device 
should be generated as usual from the UA field.
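A minimal sketch of the UA-based breakdown suggested here; the string patterns 
are illustrative, and a production version would use a proper user-agent parser.

    def platform_from_ua(user_agent):
        # "WikipediaApp" is the UA prefix the apps use (see the query below).
        if user_agent.startswith("WikipediaApp"):
            return "app"
        if "Mobile" in user_agent:
            return "mobile site"
        return "desktop site"

    print(platform_from_ua("WikipediaApp/2.0 (Android)"))  # app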

Dario

On Jul 25, 2014, at 8:00 AM, Dan Garry  wrote:

> I'm unsure, but I think the actual issue may be caused by the API fallback; 
> if you can't use the mobile API for some reason, the app uses the standard 
> API instead. This is an issue in China where the mobile API is blocked. This 
> would account for those that had registered on Android but didn't have 
> event_displayMobile = 1 due to using the standard API.
> 
> Dan
> 
> 
> On 24 July 2014 22:45, Pine W  wrote:
> I believe that Android will run on desktops. Would Android desktops account 
> for the nonzero number?
> 
> See 
> http://www.pcworld.com/article/2048220/hybrid-hijinks-how-to-install-android-on-your-pc.html
> 
> Pine
> 
> On Jul 24, 2014 5:37 PM, "Dan Garry"  wrote:
> Hi!
> 
> So I've been rooting around in ServerSideAccountCreation and I've noticed 
> some inconsistencies in the data. The final two clauses in the WHERE in the 
> following query should be mutually exclusive (registered on Android app, and 
> registered not on mobile), but the number returned is nonzero.
> 
> SELECT count(*)
> FROM ServerSideAccountCreation_5487345
> WHERE timestamp >= 2014072200
> AND timestamp <= 2014072300
> AND userAgent like 'WikipediaApp%'  -- registered via the app
> AND event_displayMobile = 0         -- yet flagged as a non-mobile signup
> 
> I'm sure you guys get data inconsistencies like this all the time, but I 
> thought I should at least report it so you're aware.
> 
> Thanks,
> Dan
> 
> -- 
> Dan Garry
> Associate Product Manager for Platform and Mobile Apps
> Wikimedia Foundation
> 
> 
> 
> 
> 
> -- 
> Dan Garry
> Associate Product Manager for Platform and Mobile Apps
> Wikimedia Foundation



Re: [Analytics] Dashboard-like frontend for graphite

2014-07-10 Thread Dario Taraborelli
Much as I love the idea of adding charting capability in MediaWiki (especially 
if it were to be integrated with a data namespace and version-controlled JSON 
annotations) – I agree with Steven that this seems to solve a different problem.

The biggest pain points of using Limn to me (on top of the usability issues 
mentioned in this thread [1]) are its poor information architecture and its 
limited support for data documentation/metadata. We know that it’s hard at the 
moment for people to find the data they are looking for, or to navigate a 
large set of dashboards in any intuitive way. For example: the first 
metric we modeled for the vital signs project (newly registered users), when 
combined with a single breakdown by platform (desktop site, mobile site, apps), 
would result in ~2.5K data series. I can't quite figure out how these series 
would look and be discoverable on Limn.
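The arithmetic behind that estimate, assuming one series per wiki per platform 
(the wiki count below is an approximation, not an exact figure):

    # ~830 public Wikimedia wikis (approximate count at the time) x 3 platforms.
    wikis = 830
    platforms = ["desktop site", "mobile site", "apps"]
    print(wikis * len(platforms))  # ~2490 series for one metric, one breakdown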

I think the best investment of our time would be to: 

(1) give Wikimetrics and EventLogging a standard interface to plug the data and 
metadata into any arbitrary dashboard/visualization frontend – whether 
custom-built, off-the-shelf or even hosted (a minimal sketch of such an 
interface payload follows after this list)

(2) start solving the visualization problem incrementally, moving from the most 
urgent customer needs and evaluating visualization solutions against these 
priorities.
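A minimal sketch of the kind of payload the interface in (1) could hand to any 
frontend; the field names are illustrative, not an agreed-upon schema, and the 
data points are dummies.

    # Illustrative metric payload combining data and metadata.
    metric = {
        "name": "newly_registered_users",
        "description": "Accounts registered per day",
        "unit": "users/day",
        "breakdowns": ["desktop site", "mobile site", "apps"],
        "series": [
            {"wiki": "enwiki", "platform": "apps",
             "points": [["2014-07-01", 0], ["2014-07-02", 0]]},  # dummy values
        ],
    }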

That would give us ample time to bring data (and immediate value) to the users, 
while testing the best approach for visualizing it and supporting more 
sophisticated requirements for presenting and rendering the data (we could 
abandon the first frontend when it stops serving our needs and migrate to 
something more sophisticated).

I like the look and feel of Tessera and the fact that it can easily consume 
Graphite data, but I share Dan’s concerns about storage. 
Dan, I think it would be valuable to put your thoughts on a wiki page, if you 
have bandwidth to do so.

Dario

[1] I also want to add that whatever solution we settle on, it needs to be 
mobile friendly.

On Jul 9, 2014, at 11:55 PM, Dan Andreescu  wrote:

> 
> 
> 
> On Wed, Jul 9, 2014 at 4:23 PM, Steven Walling  wrote:
> 
> On Wed, Jul 9, 2014 at 1:01 PM, Dan Andreescu  
> wrote:
> By the way, if this at all sounds like I'm proposing a "new" monster 
> codebase, that is not at all the case.  Most of the hard problems will be 
> out-sourced to promising projects.  For example, Vega is the top contender to 
> handle the visualizations themselves, and the dashboarding around it will be 
> very simplistic but will solve problems we've encountered with Limn.  But again, very 
> early days.
> 
> Yeah to be honest I'm pretty skeptical of such a plan. 
> 
> To back up... As a consumer of numerous dashboards and someone who has to 
> decide when/how to request creation of them, I care about getting a readable 
> new dashboard set up and maintained to run indefinitely with as little 
> developer or researcher time as possible.
> 
> Agreed, Limn fails at this pretty miserably, and it's definitely one of our 
> top problems to solve.
> 
> The main problem with Limn is that to set up a suite of dashboards takes a 
> very large initial investment.
> 
> There are many other problems, a few relevant examples: discovery of 
> dashboards, documentation of visualization capabilities, lack of annotations, 
> ease of contributing to the code base
>  
> I'm not really sure how shoehorning a dashboard service on top of MediaWiki 
> really solves this problem better than just setting up one of the many 
> existing solutions out there.  I don't care about transparent versioning and 
> authentication, which seems to be the two things that MediaWiki is really 
> good at in this context.
> 
> I'm not sure this is true.  You may not care about it, but storage needs to 
> happen, and I'd rather outsource that problem.  Limn's idea of using 
> file-backed storage made it very inefficient and clumsy to work with.  A 
> custom database, like Tessera is using, is much better but also requires 
> someone to maintain it and manage access, etc.  So more ops burden but less 
> up-front development.  And the definitions would be "further" from our 
> community.  Meaning, for example, if someone defaces a graph, we'd have to 
> build a "watch this page" mechanism to help us deal with it.  I started where 
> you're starting with Tessera and as I thought of these problems I slowly 
> migrated to Mediawiki.  But I'll try to explain below why I don't think this 
> is a big undertaking at all.  MediaWiki is really easy to use as a service.
>  
> Building a custom tool from scratch is also part of what got us in this mess 
> with Limn to begin with. 
> 
> I see that I have caused a bit of a misunderstanding.  So, Limn is well over 
> 10,000 lines of Coco.  This is a dense language that transpiles to roughly 
> 20,000 lines of Javascript.  The tool I'm proposing here is basically 
> ignoring 90% of the problems that Limn dealt with.  Visualization is the main 
> problem, and that is solved by

Re: [Analytics] eventlogging UniversalLanguageSelector-tofu_7629564

2014-07-02 Thread Dario Taraborelli
I have the feeling there's no need to keep 114 GB of raw client-side 
instrumentation data for tofu detection.
Copying Amir, Gilles and Jon who are the respective owners of the schemas in 
Sean’s list. 
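A minimal sketch of the aggregate-then-purge approach implied here, rolling raw 
events up into daily counts before dropping old rows; the aggregate table name 
is illustrative.

    def aggregate_tofu_daily(conn):
        # One row per day; LEFT(timestamp, 8) is the YYYYMMDD prefix of an
        # EventLogging 14-digit timestamp.
        with conn.cursor() as cur:
            cur.execute("""
                INSERT INTO tofu_daily (day, events)
                SELECT LEFT(timestamp, 8), COUNT(*)
                FROM `UniversalLanguageSelector-tofu_7629564`
                GROUP BY LEFT(timestamp, 8)
            """)
        conn.commit()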

On Jul 2, 2014, at 7:44 PM, Oliver Keyes  wrote:

> The odd name is frustrating to me too :/. I'd be interested to see if we need 
> the MV tables (or, the really old data in them): as I understand it those are 
> aggregated for public consumption fairly regularly.
> 
> 
> On 2 July 2014 22:21, Sean Pringle  wrote:
> Hi :)
> 
> The following table is easily the largest in eventlogging and growing fastest:
> 
> 114G UniversalLanguageSelector-tofu_7629564
> 
> Is there a plan for purging old data from this one? I realize it's mostly new 
> data; just wondering if growth will be unbounded.
> 
> Why does it have an odd name "-tofu"? Is it intended?
> 
> There is a duplicate table called UniversalLanguageSelecTor-tofu_7629564 -- 
> note the uppercase T -- with a single row. Is that needed?
> 
> The next biggest are:
> 
> 67G PageContentSaveComplete_5588433.ibd
> 61G MediaViewer_8572637.ibd
> 57G MediaViewer_8245578.ibd
> 33G MobileWebClickTracking_5929948.ibd
> 
> BR
> Sean
> 
> --- 
> DBA @ WMF
> 
> 
> 
> 
> 
> -- 
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation


