[Analytics] Fwd: [Wiki-research-l] [events] Wiki Workshop 2023 Call for Papers

2023-03-02 Thread Leila Zia
Hi all,

Please see the call for papers for the 10th edition of Wiki Workshop below.
The call is for extended abstracts (2 pages) of ongoing or completed work.
The deadline is March 23. Submissions are non-archival, which means you
can also submit work that has already been published! :)

Submit and join us in conversations about research on the Wikimedia
projects.

Best,
Leila


--
Leila Zia
Head of Research
Wikimedia Foundation


---------- Forwarded message ---------
From: Martin Gerlach 
Date: Mon, Feb 20, 2023 at 1:29 AM
Subject: [Wiki-research-l] [events] Wiki Workshop 2023 Call for Papers
To: 


Hi everyone,

The call for papers for the 10th Wiki Workshop in 2023 is out:
https://wikiworkshop.org/2023/#call . Submit your 2-page abstracts by
March 23 (all submissions are non-archival). The workshop will take place
on May 11, 2023. For more information, see the workshop website [1].

If you have questions about the workshop, please let us know on this list
or at wikiworkshop(a)googlegroups.com.

Looking forward to seeing many of you in this year's edition.

Best,

Pablo Aragón, Wikimedia Foundation
Martin Gerlach, Wikimedia Foundation
Evelin Heidel, Wikimedistas de Uruguay
Emily Lescak, Wikimedia Foundation
Francesca Tripodi, University of North Carolina
Bob West, EPFL
Leila Zia, Wikimedia Foundation

[1] https://wikiworkshop.org/2023/

—

We invite contributions to the 10th edition (!) of Wiki Workshop, which
will take place virtually on May 11, 2023 (tentatively 12:00-19:00 UTC).
Wiki Workshop is the largest Wikimedia research event of the year, aimed at
bringing together researchers who study all aspects of Wikimedia projects
(including, but not limited to, Wikipedia, Wikidata, Wikimedia Commons,
Wikisource, and Wiktionary) as well as Wikimedia developers, affiliate
organizations, and volunteer editors. Co-organized by the Wikimedia
Foundation’s Research team and members of the Wikimedia research community,
the workshop facilitates a direct pathway for exchanging ideas between the
organizations that serve Wikimedia projects and the researchers actively
studying them. New this year: building on the successful experiences of
organizing Wiki Workshop in 2015 <https://wikiworkshop.org/2015/>, 2016
<https://wikiworkshop.org/2016/>, 2017 <https://wikiworkshop.org/2017/>,
2018 <https://wikiworkshop.org/2018/>, 2019 <https://wikiworkshop.org/2019/>,
2020 <https://wikiworkshop.org/2020/>, 2021 <https://wikiworkshop.org/2021/>,
and 2022 <https://wikiworkshop.org/2022/>, and based on feedback from
authors and participants over the years, we are introducing a few updates
to the research track of the workshop for 2023:

- This 10th edition will take place as a standalone event (rather than
  co-located with a conference, as in previous years).

- We have changed the format of submissions and will only accept 2-page
  extended abstracts (following the successful IC2S2 model).

- Submissions are non-archival, so we welcome ongoing, completed, and
  already published work.

- We are excited to share that the authors of Wiki Workshop 2023 will have
  the opportunity to receive feedback, improve their work, and submit the
  extended version of their research paper to a special issue of the ACM
  Transactions on the Web, which will have a dedicated open call for papers
  later in 2023.

Topics include, but are not limited to:

- new technologies and initiatives to grow content, quality, equity,
  diversity, and participation across Wikimedia projects

- use of bots, algorithms, and crowdsourcing strategies to curate, source,
  or verify content and structured data

- bias in content and gaps of knowledge on Wikimedia projects

- relation between Wikimedia projects and the broader (open) knowledge
  ecosystem

- exploration of what constitutes a source and how/if the incorporation of
  other kinds of sources (e.g., oral histories, video) is possible

- detection of low-quality, promotional, or fake content (misinformation
  or disinformation), as well as fake accounts (e.g., sock puppets)

- questions related to community health (e.g., sentiment analysis,
  harassment detection, tools that could increase harmony)

- motivations, engagement models, incentives, and needs of editors,
  readers, and/or developers of Wikimedia projects

- innovative uses of Wikipedia and other Wikimedia projects for AI and NLP
  applications and vice versa

- consensus-finding and conflict resolution on editorial issues

- dynamics of content reuse across projects and the impact of policies and
  community norms on reuse

- privacy, security, and trust

- collaborative content creation

- innovative uses of Wikimedia projects' content and consumption patterns
  as sensors for real-world events, culture, etc.

- open-source research code, datasets, and tools

[Analytics] Re: [event] Wiki Workshop 2022 - Registration open

2022-06-06 Thread Leila Zia
Hi all,

For those of you who could not attend Wiki Workshop virtually, you can now
access:

* the recorded sessions at https://wikiworkshop.org/2022/#schedule (The
opening, paper presentations, panel, keynote, and the closing sessions were
recorded.)
* the accepted papers at https://wikiworkshop.org/2022/#papers .

Best,
Leila, on behalf of Wiki Workshop 2022 organizers


On Fri, Apr 8, 2022 at 8:03 AM Leila Zia  wrote:

> Hi all,
>
> The registration for Wiki Workshop 2022 [1] is now open. The event will
> be held virtually on April 25, 12:00-18:30 UTC, as part of The Web
> Conference 2022 [2]. The plenary parts of the event will be recorded
> and shared publicly afterwards.
>
> Wiki Workshop is the largest Wikimedia research event of the year (so
> far ;)) that the Research team at the Wikimedia Foundation co-organizes
> with our Research Fellow, Bob West (EPFL). This year, Srijan Kumar
> (Georgia Tech) joined the organizing team as well. :) The event brings
> together scholars and researchers from across the world who are
> interested in or actively engaged with research and development on the
> Wikimedia projects.
>
> While the details of the schedule are to be finalized and posted in
> the coming week, we expect to generally follow the format of 2021 [3].
> This year we received research submissions from more than 20 countries
> and have accepted 27 research papers whose authors will present their
> work as part of the workshop. (If you are an author of an accepted
> paper: congrats! :) ) Our keynote speaker is Larry Lessig [4], and we
> will have a panel reflecting on the ten-year anniversary of SOPA/PIPA,
> moderated by Erik Moeller (Freedom of the Press Foundation). And of
> course, all the music, games, etc. will remain. :)
>
> If you are interested in participating in the live event, please
> indicate your interest by filling out the registration form [5]. Anyone
> is encouraged to
> register: you don't have to be a researcher. In the registration form,
> please explain why attending the live event will support you in your
> work on the Wikimedia projects and beyond.
>
> If you have questions, please don't hesitate to reach out.
>
> Best,
> Leila
>
> [1] https://wikiworkshop.org/2022/
> [2] https://www2022.thewebconf.org/
> [3] https://wikiworkshop.org/2021/#schedule
> [4] https://hls.harvard.edu/faculty/directory/10519/Lessig
> [5] (privacy statement for the Google form survey [6])
>
> https://docs.google.com/forms/d/e/1FAIpQLSctlkUv8FasB2Nc4RvThnxAbjPzUwmnxB2FwnNkZlKG1NPOTg/viewform
> [6]
> https://foundation.wikimedia.org/wiki/Legal:Wiki_Workshop_Registration_Privacy_Statement
>
> --
> Leila Zia
> Head of Research
> Wikimedia Foundation
>
___
Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-le...@lists.wikimedia.org


[Analytics] [event] Wiki Workshop 2022 - Registration open

2022-04-08 Thread Leila Zia
Hi all,

The registration for Wiki Workshop 2022 [1] is now open. The event will
be held virtually on April 25, 12:00-18:30 UTC, as part of The Web
Conference 2022 [2]. The plenary parts of the event will be recorded
and shared publicly afterwards.

Wiki Workshop is the largest Wikimedia research event of the year (so
far ;)) that the Research team at the Wikimedia Foundation co-organizes
with our Research Fellow, Bob West (EPFL). This year, Srijan Kumar
(Georgia Tech) joined the organizing team as well. :) The event brings
together scholars and researchers from across the world who are
interested in or actively engaged with research and development on the
Wikimedia projects.

While the details of the schedule are to be finalized and posted in
the coming week, we expect to generally follow the format of 2021 [3].
This year we received research submissions from more than 20 countries
and have accepted 27 research papers whose authors will present their
work as part of the workshop. (If you are an author of an accepted
paper: congrats! :) ) Our keynote speaker is Larry Lessig [4], and we
will have a panel reflecting on the ten-year anniversary of SOPA/PIPA,
moderated by Erik Moeller (Freedom of the Press Foundation). And of
course, all the music, games, etc. will remain. :)

If you are interested in participating in the live event, please
indicate your interest by filling out the registration form [5]. Anyone
is encouraged to
register: you don't have to be a researcher. In the registration form,
please explain why attending the live event will support you in your
work on the Wikimedia projects and beyond.

If you have questions, please don't hesitate to reach out.

Best,
Leila

[1] https://wikiworkshop.org/2022/
[2] https://www2022.thewebconf.org/
[3] https://wikiworkshop.org/2021/#schedule
[4] https://hls.harvard.edu/faculty/directory/10519/Lessig
[5] (privacy statement for the Google form survey [6])
https://docs.google.com/forms/d/e/1FAIpQLSctlkUv8FasB2Nc4RvThnxAbjPzUwmnxB2FwnNkZlKG1NPOTg/viewform
[6] 
https://foundation.wikimedia.org/wiki/Legal:Wiki_Workshop_Registration_Privacy_Statement

--
Leila Zia
Head of Research
Wikimedia Foundation


[Analytics] Re: [events] Wiki Workshop 2022 Announcement and Call for Papers

2022-03-04 Thread Leila Zia
Hi all,

A reminder that if you're considering submitting your ongoing or completed
Wikimedia-related research to Wiki Workshop, the deadline for non-archival
submissions is March 10. Submission instructions:
https://wikiworkshop.org/2022/#submission .

Best,
Leila

On Fri, Jan 28, 2022 at 2:34 PM Leila Zia  wrote:

> Reminder: If you're considering submitting your ongoing or completed
> Wikimedia research to Wiki Workshop, note that the deadline for
> submissions to be considered as part of the WWW'2022 proceedings is
> February 3rd. All other submissions are due March 10th. Check out
> https://wikiworkshop.org/2022/#call . See my original email below for
> more details.
>
> Best,
> Leila
>
> On Mon, Dec 20, 2021 at 9:53 PM Leila Zia  wrote:
> >
> > Hi everyone,
> >
> > Summary: Wiki Workshop 2022 [0] will take place virtually as part of
> > The Web Conference 2022 [1]. The call for papers is now open:
> > https://wikiworkshop.org/2022/#call . The deadline for papers to
> > appear in the proceedings of the conference is Feb 3; all other
> > submissions are due March 10. The workshop will take place on April 25, 2022.
> >
> > --
> >
> > We are delighted to announce that Wiki Workshop 2022 [0] will be held
> > virtually on April 25, 2022, as part of the Web Conference 2022 [1].
> >
> > In the past years, Wiki Workshop has traveled to Oxford, Montreal,
> > Cologne, Perth, Lyon, and San Francisco, and (virtually) to Taipei and
> > Ljubljana.
> > Last year, we had more than 150 participants in the workshop along
> > with 22 accepted paper presentations, keynote, panel, music and more.
> > The workshop is now a vibrant event for Wikimedia researchers and
> > those interested in this space to get together on an annual basis.
> >
> > We encourage contributions by all researchers who study the Wikimedia
> > projects. We specifically encourage 1-2 page submissions of
> > preliminary research. You will have the option to publish your work as
> > part of the proceedings of The Web Conference 2022.
> >
> > You can read more about the call for papers and the workshop at
> > http://wikiworkshop.org/2022/#call. Please note that the deadline for
> > the submissions to be considered for proceedings is February 3. All
> > other submissions should be received by March 10.
> >
> > If you have questions about the workshop, please let us know on this
> > list or at wikiworks...@googlegroups.com.
> >
> > Looking forward to seeing many of you in this year's edition.
> >
> > Best,
> > Srijan Kumar, Georgia Tech
> > Emily Lescak, Wikimedia Foundation
> > Miriam Redi, Wikimedia Foundation
> > Bob West, EPFL
> > Leila Zia, Wikimedia Foundation
> >
> > [0] https://wikiworkshop.org/2022/
> > [1] https://www2022.thewebconf.org/
>


[Analytics] Re: [events] Wiki Workshop 2022 Announcement and Call for Papers

2022-01-28 Thread Leila Zia
Reminder: If you're considering submitting your ongoing or completed
Wikimedia research to Wiki Workshop, note that the deadline for
submissions to be considered as part of the WWW'2022 proceedings is
February 3rd. All other submissions are due March 10th. Check out
https://wikiworkshop.org/2022/#call . See my original email below for
more details.

Best,
Leila

On Mon, Dec 20, 2021 at 9:53 PM Leila Zia  wrote:
>
> Hi everyone,
>
> Summary: Wiki Workshop 2022 [0] will take place virtually as part of
> The Web Conference 2022 [1]. The call for papers is now open:
> https://wikiworkshop.org/2022/#call . The deadline for papers to
> appear in the proceedings of the conference is Feb 3; all other
> submissions are due March 10. The workshop will take place on April 25, 2022.
>
> --
>
> We are delighted to announce that Wiki Workshop 2022 [0] will be held
> virtually on April 25, 2022, as part of the Web Conference 2022 [1].
>
> In the past years, Wiki Workshop has traveled to Oxford, Montreal,
> Cologne, Perth, Lyon, and San Francisco, and (virtually) to Taipei and
> Ljubljana.
> Last year, we had more than 150 participants in the workshop along
> with 22 accepted paper presentations, keynote, panel, music and more.
> The workshop is now a vibrant event for Wikimedia researchers and
> those interested in this space to get together on an annual basis.
>
> We encourage contributions by all researchers who study the Wikimedia
> projects. We specifically encourage 1-2 page submissions of
> preliminary research. You will have the option to publish your work as
> part of the proceedings of The Web Conference 2022.
>
> You can read more about the call for papers and the workshop at
> http://wikiworkshop.org/2022/#call. Please note that the deadline for
> the submissions to be considered for proceedings is February 3. All
> other submissions should be received by March 10.
>
> If you have questions about the workshop, please let us know on this
> list or at wikiworks...@googlegroups.com.
>
> Looking forward to seeing many of you in this year's edition.
>
> Best,
> Srijan Kumar, Georgia Tech
> Emily Lescak, Wikimedia Foundation
> Miriam Redi, Wikimedia Foundation
> Bob West, EPFL
> Leila Zia, Wikimedia Foundation
>
> [0] https://wikiworkshop.org/2022/
> [1] https://www2022.thewebconf.org/


[Analytics] The Wikimedia Foundation Research Award of the Year - Call for Nominations

2022-01-07 Thread Leila Zia
[Apologies for cross-posting.]

Hi all,

We invite you to nominate one or more scholarly research publications
to be considered for the Wikimedia Foundation Research Award of the
Year. Learn more below.

=Purpose of the award=
To recognize recent research on or about the Wikimedia projects, or recent
research that is of importance to the Wikimedia projects, and to recognize
the researchers behind that work.

You can learn more about 2021's winners at
https://research.wikimedia.org/awards.html .

=Eligibility criteria=
Your nomination must meet the following criteria:

* The research must be on, about, using data from, and/or of
importance to Wikipedia, Wikidata, Wikisource, Wikimedia Commons or
other Wikimedia projects.

* The publication must be available in English.

* The research must have been published between January 1, 2021 and
December 31, 2021.

=Nomination process=
Submit your nominations by 2022-02-07 through
https://easychair.org/conferences/?conf=wmfray2021 . We will ask you
to provide the following information in your nomination:

* Title of the manuscript
* A copy of the manuscript you are nominating
* A summary of the research and a clear justification for why the work
merits the award (in 350 words or fewer in English).

Note that self-nominations and nominations of others' work are both welcome.

==Winner(s)==
The winner(s) will be announced in a ceremony as part of Wiki Workshop
2022: https://wikiworkshop.org/2022/ .

If you have any questions, please contact us at
wmf-ray-2...@easychair.org or here.

Best,
Benjamin Mako Hill (University of Washington)
Leila Zia (Wikimedia Foundation)


[Analytics] [events] Wiki Workshop 2022 Announcement and Call for Papers

2021-12-20 Thread Leila Zia
Hi everyone,

Summary: Wiki Workshop 2022 [0] will take place virtually as part of
The Web Conference 2022 [1]. The call for papers is now open:
https://wikiworkshop.org/2022/#call . The deadline for papers to
appear in the proceedings of the conference is Feb 3; all other
submissions are due March 10. The workshop will take place on April 25, 2022.

--

We are delighted to announce that Wiki Workshop 2022 [0] will be held
virtually on April 25, 2022, as part of the Web Conference 2022 [1].

In the past years, Wiki Workshop has traveled to Oxford, Montreal,
Cologne, Perth, Lyon, and San Francisco, and (virtually) to Taipei and
Ljubljana.
Last year, we had more than 150 participants in the workshop along
with 22 accepted paper presentations, keynote, panel, music and more.
The workshop is now a vibrant event for Wikimedia researchers and
those interested in this space to get together on an annual basis.

We encourage contributions by all researchers who study the Wikimedia
projects. We specifically encourage 1-2 page submissions of
preliminary research. You will have the option to publish your work as
part of the proceedings of The Web Conference 2022.

You can read more about the call for papers and the workshop at
http://wikiworkshop.org/2022/#call. Please note that the deadline for
the submissions to be considered for proceedings is February 3. All
other submissions should be received by March 10.

If you have questions about the workshop, please let us know on this
list or at wikiworks...@googlegroups.com.

Looking forward to seeing many of you in this year's edition.

Best,
Srijan Kumar, Georgia Tech
Emily Lescak, Wikimedia Foundation
Miriam Redi, Wikimedia Foundation
Bob West, EPFL
Leila Zia, Wikimedia Foundation

[0] https://wikiworkshop.org/2022/
[1] https://www2022.thewebconf.org/


Re: [Analytics] [events] Wiki Workshop 2021 Announcement and Call for Papers

2021-04-09 Thread Leila Zia
Hi all,

This is our final friendly reminder that if you're interested in joining
us for the 8th annual Wiki Workshop (https://wikiworkshop.org/2021/) on
April 14, 12:00-18:30 UTC, you have until April 13 to submit your
indication of interest to participate:
https://docs.google.com/forms/d/e/1FAIpQLSeiq7MUp9ln8Z9KijslxRh18eT0bqCpQqAGjunC4n99WMumSw/viewform
. If you're interested in research on or about the Wikimedia projects,
don't miss it. :)

Best,
Leila, Miriam and Bob

On Fri, Mar 26, 2021 at 5:27 PM Leila Zia  wrote:
>
> Hi everyone,
>
> We are looking forward to hosting you in Wiki Workshop 2021
> (virtually) on April 14. You can now submit your indication of
> interest to participate in the workshop via
> https://docs.google.com/forms/d/e/1FAIpQLSeiq7MUp9ln8Z9KijslxRh18eT0bqCpQqAGjunC4n99WMumSw/viewform
> .
>
> You can also check out the accepted papers and the invited speakers'
> list on the website: https://wikiworkshop.org/2021/
>
> Best,
> Leila, Miriam and Bob
>
> On Wed, Jan 6, 2021 at 8:52 AM Leila Zia  wrote:
> >
> > Hi everyone,
> >
> > We are delighted to announce that Wiki Workshop 2021 will be held
> > virtually in April 2021 as part of the Web Conference 2021 [1]. The
> > exact day is still to be finalized, but it will fall between April 19
> > and 23.
> >
> > In the past years, Wiki Workshop has traveled to Oxford, Montreal,
> > Cologne, Perth, Lyon, and San Francisco, and (virtually) to Taipei.
> > Last year, we had more than 120 participants in the workshop and we
> > are particularly excited about this year's as we will celebrate the
> > 20th birthday of Wikipedia.
> >
> > We encourage contributions by all researchers who study the Wikimedia
> > projects. We specifically encourage 1-2 page submissions of
> > preliminary research. You will have the option to publish your work as
> > part of the proceedings of The Web Conference 2021.
> >
> > You can read more about the call for papers and the workshop at
> > http://wikiworkshop.org/2021/#call. Please note that the deadline for
> > the submissions to be considered for proceedings is January 29. All
> > other submissions should be received by March 1.
> >
> > If you have questions about the workshop, please let us know on this
> > list or at wikiworks...@googlegroups.com.
> >
> > Looking forward to seeing many of you in this year's edition.
> >
> > Best,
> > Miriam Redi, Wikimedia Foundation
> > Bob West, EPFL
> > Leila Zia, Wikimedia Foundation
> >
> > [1] https://www2021.thewebconf.org/



Re: [Analytics] [events] Wiki Workshop 2021 Announcement and Call for Papers

2021-03-26 Thread Leila Zia
Hi everyone,

We are looking forward to hosting you in Wiki Workshop 2021
(virtually) on April 14. You can now submit your indication of
interest to participate in the workshop via
https://docs.google.com/forms/d/e/1FAIpQLSeiq7MUp9ln8Z9KijslxRh18eT0bqCpQqAGjunC4n99WMumSw/viewform
.

You can also check out the accepted papers and the invited speakers'
list on the website: https://wikiworkshop.org/2021/

Best,
Leila, Miriam and Bob

On Wed, Jan 6, 2021 at 8:52 AM Leila Zia  wrote:
>
> Hi everyone,
>
> We are delighted to announce that Wiki Workshop 2021 will be held
> virtually in April 2021 as part of the Web Conference 2021 [1]. The
> exact day is still to be finalized, but it will fall between April 19
> and 23.
>
> In the past years, Wiki Workshop has traveled to Oxford, Montreal,
> Cologne, Perth, Lyon, and San Francisco, and (virtually) to Taipei.
> Last year, we had more than 120 participants in the workshop and we
> are particularly excited about this year's as we will celebrate the
> 20th birthday of Wikipedia.
>
> We encourage contributions by all researchers who study the Wikimedia
> projects. We specifically encourage 1-2 page submissions of
> preliminary research. You will have the option to publish your work as
> part of the proceedings of The Web Conference 2021.
>
> You can read more about the call for papers and the workshop at
> http://wikiworkshop.org/2021/#call. Please note that the deadline for
> the submissions to be considered for proceedings is January 29. All
> other submissions should be received by March 1.
>
> If you have questions about the workshop, please let us know on this
> list or at wikiworks...@googlegroups.com.
>
> Looking forward to seeing many of you in this year's edition.
>
> Best,
> Miriam Redi, Wikimedia Foundation
> Bob West, EPFL
> Leila Zia, Wikimedia Foundation
>
> [1] https://www2021.thewebconf.org/



[Analytics] [events] Wiki Workshop 2021 Announcement and Call for Papers

2021-01-06 Thread Leila Zia
Hi everyone,

We are delighted to announce that Wiki Workshop 2021 will be held
virtually in April 2021 as part of the Web Conference 2021 [1]. The
exact day is still to be finalized, but it will fall between April 19
and 23.

In the past years, Wiki Workshop has traveled to Oxford, Montreal,
Cologne, Perth, Lyon, and San Francisco, and (virtually) to Taipei.
Last year, we had more than 120 participants in the workshop and we
are particularly excited about this year's as we will celebrate the
20th birthday of Wikipedia.

We encourage contributions by all researchers who study the Wikimedia
projects. We specifically encourage 1-2 page submissions of
preliminary research. You will have the option to publish your work as
part of the proceedings of The Web Conference 2021.

You can read more about the call for papers and the workshop at
http://wikiworkshop.org/2021/#call. Please note that the deadline for
the submissions to be considered for proceedings is January 29. All
other submissions should be received by March 1.

If you have questions about the workshop, please let us know on this
list or at wikiworks...@googlegroups.com.

Looking forward to seeing many of you in this year's edition.

Best,
Miriam Redi, Wikimedia Foundation
Bob West, EPFL
Leila Zia, Wikimedia Foundation

[1] https://www2021.thewebconf.org/



Re: [Analytics] Upcoming WMF/Research-Team Office hours on September 1st, 2020

2020-09-01 Thread Leila Zia
A friendly reminder that we will kick off this meeting in a couple of
minutes. Join us if you want to talk about research. :)

On Fri, Aug 28, 2020 at 10:13 AM Martin Gerlach  wrote:
>
> Hi all,
>
> Join the Research Team at the Wikimedia Foundation [1] for their monthly
> office hours on 2020-09-01, 16:00-17:00 UTC.
>
>
> Through these office hours, we aim to make ourselves more available to
> answer some of the research-related questions that you, as Wikimedia
> volunteer editors, organizers, affiliates, staff, and researchers, face
> in your projects and initiatives (*).
>
>
> To participate, join the video call via this Wikimedia Meet link [2].
> There is no set agenda: feel free to add your item to the list of topics
> in the etherpad [3] (you can do this after you join the meeting, too);
> otherwise, you are welcome to just hang out. More detailed information
> (e.g., about how to attend) can be found here [4].
>
>
> The office hours started in early 2020 as an experiment [5]. After the
> first six editions, we evaluated the scope and format of the Research
> office hours; to lower barriers to accessibility and facilitate more
> direct interaction, we have switched the format from IRC to video call.
> We will re-evaluate the current format at the end of the year. We would
> also be glad to hear your feedback and/or comments.
>
>
> (*) Some example cases we hope to be able to support you in:
>
> You have a specific research-related question that you suspect you
> should be able to answer with publicly available data, but you don't
> know how to find the answer, or you just need some more help with it.
> For example: how can I compute the ratio of anonymous to registered
> editors in my wiki?
>
> You run into repetitive or very manual work as part of your Wikimedia
> contributions and wish to find out whether there are ways to use
> machines to improve your workflows. These conversations can sometimes be
> harder to resolve during an office hour; however, discussing them helps
> us understand your challenges better, and we may find ways to work with
> each other to address them in the future.
>
> You want to learn what the Research team at the Wikimedia Foundation
> does and how we can potentially support you. Specifically for
> affiliates: if you are interested in building relationships with the
> academic institutions in your country, we would love to talk with you
> and learn more. We have a series of programs that aim to expand the
> network of Wikimedia researchers globally, and we would love to
> collaborate more closely with those of you interested in this space.
>
> You want to talk with us about one of our existing programs [6].
>
>
> Hope to see many of you,
> Martin (WMF Research Team)
>
>
> [1] https://research.wikimedia.org/team.html
>
> [2] https://meet.wmcloud.org/ResearchOfficeHours
>
> [3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
>
> [4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
>
> [5] 
> https://lists.wikimedia.org/pipermail/wiki-research-l/2019-December/007039.html
>
> [6] https://research.wikimedia.org/projects.html
>
>
>
> --
> Martin Gerlach
> Research Scientist
> Wikimedia Foundation



[Analytics] [job] We're hiring

2020-08-05 Thread Leila Zia
Hi all,

I hope this email finds you well.

We, the Research team at the Wikimedia Foundation [1], are hiring for
two positions. Please review the corresponding job descriptions via the
links below and apply on or before August 31 if you're interested.
Also, please consider spreading the word! :)

Research Scientist (Disinformation):
https://boards.greenhouse.io/wikimedia/jobs/2267633

Research Engineer: https://boards.greenhouse.io/wikimedia/jobs/2267741

If you have questions, please don't hesitate to reach out.

Best,
Leila

[1] https://research.wikimedia.org/team.html

--
Leila Zia
Head of Research
Wikimedia Foundation



Re: [Analytics] Wiki Workshop 2020 Announcement and Call for Papers

2020-04-18 Thread Leila Zia
Hi all,

We are now less than 3 days away from Wiki Workshop 2020 (April 21,
11:45-18:00 UTC, 4:45-11:00 PST), and more than 170 people have registered
to attend. This is your final reminder: if you would like to participate
in the event and you haven't submitted your registration request yet,
please do so in the next 24 hours via
https://docs.google.com/forms/d/e/1FAIpQLSe2ctfYVIokWYsvUfxcdF-FNwrymgzpNZLh25EhU8JbvKp1tA/viewform
.
After that, we will only process requests if we have time to do so, as
our priority will switch to addressing any last-minute needs for the
event. :)

Thanks and looking forward to seeing those of you who will attend. :)

Leila



On Mon, Mar 16, 2020 at 12:42 PM Leila Zia  wrote:

> Hi Pine,
>
> We're considering the options and going through the pros and cons. One
> consideration is that folks whose native language is not English, or who
> don't consider themselves proficient in the language, may feel less
> comfortable talking if the session is recorded. We will get back to you
> all with the decision in the coming weeks.
>
> Thanks,
> Leila
>
> On Sun, Mar 15, 2020 at 10:45 AM Pine W  wrote:
> >
> > Hi Leila,
> >
> > Thank you for the updates. I have one small question. Will the
> > sessions which are available on Zoom also be recorded for later
> > viewing? On occasion, I watch or share presentations after they have
> > occurred.
> >
> > Pine
> > ( https://meta.wikimedia.org/wiki/User:Pine )
> >
> > On Fri, Mar 13, 2020 at 9:43 PM Leila Zia  wrote:
> > >
> > > Hi all,
> > >
> > > We have an update for you regarding Wiki Workshop 2020 [0] in light of
> > > the global health situation related to COVID-19.
> > >
> > > ==Summary==
> > > We have turned Wiki Workshop 2020 from an in-person event into a fully
> > > virtual event. This was not an easy decision for us to make. We know
> > > that past years' attendees gained a lot from the in-person set-up of
> > > the workshop. That said, we're excited about the opportunity of
> > > organizing the workshop in a virtual set-up: it allows us to reduce
> > > our carbon footprint and to let more people benefit from the
> > > workshop.
> > >
> > > For this year's workshop, we have decided to remove the registration
> > > fee, removing one more barrier to participation. The
> > > workshop will take place, as originally planned, on April 21, 2020. We
> > > have moved the time of the workshop from Taipei's local time to
> > > afternoon through evening UTC. (Exact times will be announced in the
> > > coming couple of weeks.) We also want you to know that we're working
> > > hard to transform the workshop program into one that can be engaging in
> > > a virtual set-up. We are making good progress on this front, thanks to
> > > the immense flexibility of everyone who is working with us, including
> > > our speakers and the authors of the papers. Look for more information
> > > in the coming weeks about how to register (for free). If you want to
> > > know more, please read on! :)
> > >
> > > ==Where?==
> > > Wiki Workshop 2020 is going fully virtual. All talks, conversations,
> > > poster sessions, and one on one meetings are moved to a virtual
> > > environment.
> > >
> > > ==When?==
> > > April 21, 2020. We will start in the afternoon UTC and will end in the
> > > evening UTC. Note that this is a change from the original plan to
> > > start at 9:00 local time in Taipei. We expect to be able to finalize
> > > the start and end times of the workshop no later than 2020-03-27.
> > >
> > > ==How?==
> > > We are testing a few different video communication options and most
> > > likely we will go with Zoom [1]. There is no cost for downloading
> > > Zoom, and there is even a web browser version of it. However, some of
> > > the features we will use, such as breakout rooms, will work more
> > > smoothly if you download Zoom. We will send specific instructions for
> > > how to connect to those who register for the event.
> > >
> > > ==Registration==
> > > If you are not an author of an accepted archival paper, you can
> > > request to attend the event for free. We will send the details for how
> > > to submit your request by 2020-03-27. We will review all registration
> > > requests and will let you know if your registration has been confirmed.
> > >
> > > If you are an author of an accepted paper in the workshop, you will
> > > need to make sure at least one of the authors of your paper is
> > > registered for a 1-day (or more) in-person registration option offered
> > > by the Web Conference 2020 organizers or the 5-day virtual attendance
> > > registration for the conference. Link to register:
> > > https://www2020.thewebconf.org/registration . Please note that your
> > > paper will be removed from the proceedings of the conference if you do
> > > not take this step and we, as workshop organizers, don't have any
> > > means to fix that for you.

Re: [Analytics] Wiki Workshop 2020 Announcement and Call for Papers

2020-03-27 Thread Leila Zia
Hi all,

I hope you are all doing well and staying healthy.

A few updates (including registration information) about Wiki Workshop 2020:

* Wiki Workshop 2020 will take place on April 21, 5:00-10:00 PST
(13:00-18:00 UTC).

* We are planning an optional 80-min event after the completion of the
workshop. More information about this once we publish the full
schedule on 2020-03-31.

* We currently have a cap of 300 participants for the workshop. If you
know you want to participate, please register sooner rather than
later. We are still iterating on the "streaming" idea. More updates on
that in the coming weeks.

* If you would like to participate in the workshop, please submit your
registration request via
https://docs.google.com/forms/d/e/1FAIpQLSe2ctfYVIokWYsvUfxcdF-FNwrymgzpNZLh25EhU8JbvKp1tA/viewform
.

Best,
Leila, on behalf of Wiki Workshop 2020 organizers

On Fri, Mar 13, 2020 at 2:42 PM Leila Zia  wrote:
>
> Hi all,
>
> We have an update for you regarding Wiki Workshop 2020 [0] in light of
> the global health situation related to COVID-19.
>
> ==Summary==
> We have turned Wiki Workshop 2020 from an in-person event into a fully
> virtual event. This was not an easy decision for us to make. We know
> that past years' attendees gained a lot from the in-person set-up of
> the workshop. That said, we're excited about the opportunity of
> organizing the workshop in a virtual set-up: it allows us to reduce
> our carbon footprint and to let more people benefit from the
> workshop.
>
> For this year's workshop, we have decided to remove the registration
> fee, removing one more barrier to participation. The
> workshop will take place, as originally planned, on April 21, 2020. We
> have moved the time of the workshop from Taipei's local time to
> afternoon through evening UTC. (Exact times will be announced in the
> coming couple of weeks.) We also want you to know that we're working
> hard to transform the workshop program into one that can be engaging in
> a virtual set-up. We are making good progress on this front, thanks to
> the immense flexibility of everyone who is working with us, including
> our speakers and the authors of the papers. Look for more information
> in the coming weeks about how to register (for free). If you want to
> know more, please read on! :)
>
> ==Where?==
> Wiki Workshop 2020 is going fully virtual. All talks, conversations,
> poster sessions, and one on one meetings are moved to a virtual
> environment.
>
> ==When?==
> April 21, 2020. We will start in the afternoon UTC and will end in the
> evening UTC. Note that this is a change from the original plan to
> start at 9:00 local time in Taipei. We expect to be able to finalize
> the start and end times of the workshop no later than 2020-03-27.
>
> ==How?==
> We are testing a few different video communication options and most
> likely we will go with Zoom [1]. There is no cost for downloading
> Zoom, and there is even a web browser version of it. However, some of
> the features we will use, such as breakout rooms, will work more
> smoothly if you download Zoom. We will send specific instructions for
> how to connect to those who register for the event.
>
> ==Registration==
> If you are not an author of an accepted archival paper, you can
> request to attend the event for free. We will send the details for how
> to submit your request by 2020-03-27. We will review all registration
> requests and will let you know if your registration has been confirmed.
>
> If you are an author of an accepted paper in the workshop, you will
> need to make sure at least one of the authors of your paper is
> registered for a 1-day (or more) in-person registration option offered
> by the Web Conference 2020 organizers or the 5-day virtual attendance
> registration for the conference. Link to register:
> https://www2020.thewebconf.org/registration . Please note that your
> paper will be removed from the proceedings of the conference if you do
> not take this step and we, as workshop organizers, don't have any
> means to fix that for you.
>
> ==Program==
> We traditionally had 5-6 invited talks (45-min each) in Wiki Workshop
> along with a Featured and Lightning Talk session by the authors of the
> accepted papers followed by a poster session. The duration of the
> workshop in the old set-up was 8 hours.
>
> We have no doubt that our traditional model for the program has to
> change for this year. We know that the dynamics of engagement in the
> virtual set-ups are different from the in-person set-ups. Here is what
> we're thinking about the high-level format, the details to be
> announced as we finalize the program in the coming few weeks:
> * Ice-breaker: unchanged.
> * Introductions: som

Re: [Analytics] Analytics/Research Office hours on 2020-03-25 at 17.00-18.00 (UTC)

2020-03-25 Thread Leila Zia
This is happening now and for the next 53 min. ;) Show up if you have
questions for us.

Best,
Leila
--
Leila Zia
Head of Research
Wikimedia Foundation

On Fri, Mar 20, 2020 at 3:03 AM Martin Gerlach  wrote:
>
> Hi all,
>
> join us for our monthly Analytics/Research Office hours on 2020-03-25 at 
> 17.00-18.00 (UTC). Bring all your research questions and ideas to discuss 
> projects, data, analysis, etc…
>
> To participate, please join the IRC channel: #wikimedia-research [1].
>
> More detailed information can be found here [2] or on the etherpad [3] if you 
> would like to add items to agenda or check notes from previous meetings.
>
> Best,
> Martin
>
>
> [1] irc://chat.freenode.net:6667/wikimedia-research
>
> [2] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
>
> [3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
>
>
>
> --
> Martin Gerlach
> Research Scientist
> Wikimedia Foundation


Re: [Analytics] Wiki Workshop 2020 Announcement and Call for Papers

2020-03-16 Thread Leila Zia
Hi Pine,

We're considering the options and going through the pros and cons of
it. One consideration is that folks whose native language is not
English or don't consider themselves proficient in the language may
feel less comfortable to talk if the session is recorded. We will get
back to you all with the decision in the coming weeks.

Thanks,
Leila

On Sun, Mar 15, 2020 at 10:45 AM Pine W  wrote:
>
> Hi Leila,
>
> Thank you for the updates. I have one small question. Will the
> sessions which are available on Zoom also be recorded for later
> viewing? On occasion, I watch or share presentations after they have
> occurred.
>
> Pine
> ( https://meta.wikimedia.org/wiki/User:Pine )
>
> On Fri, Mar 13, 2020 at 9:43 PM Leila Zia  wrote:
> >
> > Hi all,
> >
> > We have an update for you regarding Wiki Workshop 2020 [0] in light of
> > the global health situation related to COVID-19.
> >
> > ==Summary==
> > We have turned Wiki Workshop 2020 from an in-person event into a fully
> > virtual event. This was not an easy decision for us to make. We know
> > that past years' attendees gained a lot from the in-person set-up of
> > the workshop. That said, we're excited about the opportunity of
> > organizing the workshop in a virtual set-up: it allows us to reduce
> > our carbon footprint and to let more people benefit from the
> > workshop.
> >
> > For this year's workshop, we have decided to remove the registration
> > fee, removing one more barrier to participation. The
> > workshop will take place, as originally planned, on April 21, 2020. We
> > have moved the time of the workshop from Taipei's local time to
> > afternoon through evening UTC. (Exact times will be announced in the
> > coming couple of weeks.) We also want you to know that we're working
> > hard to transform the workshop program into one that can be engaging in
> > a virtual set-up. We are making good progress on this front, thanks to
> > the immense flexibility of everyone who is working with us, including
> > our speakers and the authors of the papers. Look for more information
> > in the coming weeks about how to register (for free). If you want to
> > know more, please read on! :)
> >
> > ==Where?==
> > Wiki Workshop 2020 is going fully virtual. All talks, conversations,
> > poster sessions, and one on one meetings are moved to a virtual
> > environment.
> >
> > ==When?==
> > April 21, 2020. We will start in the afternoon UTC and will end in the
> > evening UTC. Note that this is a change from the original plan to
> > start at 9:00 local time in Taipei. We expect to be able to finalize
> > the start and end times of the workshop no later than 2020-03-27.
> >
> > ==How?==
> > We are testing a few different video communication options and most
> > likely we will go with Zoom [1]. There is no cost for downloading
> > Zoom, and there is even a web browser version of it. However, some of
> > the features we will use, such as breakout rooms, will work more
> > smoothly if you download Zoom. We will send specific instructions for
> > how to connect to those who register for the event.
> >
> > ==Registration==
> > If you are not an author of an accepted archival paper, you can
> > request to attend the event for free. We will send the details for how
> > to submit your request by 2020-03-27. We will review all registration
> > requests and will let you know if your registration has been confirmed.
> >
> > If you are an author of an accepted paper in the workshop, you will
> > need to make sure at least one of the authors of your paper is
> > registered for a 1-day (or more) in-person registration option offered
> > by the Web Conference 2020 organizers or the 5-day virtual attendance
> > registration for the conference. Link to register:
> > https://www2020.thewebconf.org/registration . Please note that your
> > paper will be removed from the proceedings of the conference if you do
> > not take this step and we, as workshop organizers, don't have any
> > means to fix that for you.
> >
> > ==Program==
> > We traditionally had 5-6 invited talks (45-min each) in Wiki Workshop
> > along with a Featured and Lightning Talk session by the authors of the
> > accepted papers followed by a poster session. The duration of the
> > workshop in the old set-up was 8 hours.
> >
> > We have no doubt that our traditional model for the program has to
> > change for this year. We know that the dynamics of engagement in the
> > virtual set-ups are different from the in-person set-ups. Here is what
> > we're thinking about the high-level format, the details to be
> > announced as we finalize the program in the coming few weeks:

Re: [Analytics] Wiki Workshop 2020 Announcement and Call for Papers

2020-03-13 Thread Leila Zia
 Workshop organizers [2]

[0] https://www2020.thewebconf.org/
[1] https://en.wikipedia.org/wiki/Zoom_Video_Communications
[2] https://wikiworkshop.org/2020/#organization

On Wed, Nov 27, 2019 at 7:13 PM Leila Zia  wrote:
>
> Hi everyone,
>
> We are delighted to announce that Wiki Workshop 2020 will be held in
> Taipei on April 20 or 21, 2020 (the date to be finalized soon) and as
> part of the Web Conference 2020 [1]. In the past years, Wiki Workshop
> has traveled to Oxford, Montreal, Cologne, Perth, Lyon, and San
> Francisco.
>
> You can read more about the call for papers and the workshops at
> http://wikiworkshop.org/2020/#call. Please note that the deadline for
> the submissions to be considered for proceedings is January 17. All
> other submissions should be received by February 21.
>
> If you have questions about the workshop, please let us know on this
> list or at wikiworks...@googlegroups.com.
>
> Looking forward to seeing you in Taipei.
>
> Best,
> Miriam Redi, Wikimedia Foundation
> Bob West, EPFL
> Leila Zia, Wikimedia Foundation
>
> [1] https://www2020.thewebconf.org/



Re: [Analytics] Announcement - Mediawiki History Dumps

2020-02-11 Thread Leila Zia
Hi Joseph and team,

summary: congratulations and some suggestions/requests.

I second and third Nate and Neil. Congratulations on meeting this
milestone. This effort can empower the research community to spend
less time joining datasets and resolving the existing, complex issues
(known to some) with MediaWiki history data, and more time doing the
research. Nice! :)

I'm eager to see what the dataset(s) will be used for by others. On my
end, I am looking forward to seeing more research on how Wiki(m|p)edia
projects have evolved over almost two decades, now that this data is
more readily available for study. What we learn from the Wikimedia
projects and their evolution can also help us understand the broader
web ecosystem and its evolution (the Web itself is only 30 years old).

I have some requests if I may:

* Pine brings up a good point about licenses. It would be great to
make the license clear in the documentation page(s). There are many
examples of this (which you know better than I do); just in case, I
find the License section of
https://iccl.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en
informative, for example.

* The other request I have is that you make a template for citing
this dataset clear to the end user in your documentation pages
(including the readme). You can do this in a few different ways:

** In the documentation pages, include a suggested citation. For
example (in BibTeX):

@misc{wmfanalytics2020mediawikihistory,
  title = {MediaWiki History},
  author = {nameoftheauthors},
  howpublished = {\url{https://dumps.wikimedia.org/other/mediawiki_history/}},
  note = {Accessed on date x},
  year = {2020}
}

** Upload a paper about the work to arxiv.org. This way, your work
gets a DOI that you can use in your documentation pages for folks to
use for citation. Note that this step can be relatively lightweight:
there is no peer review in this case, and it's relatively quick.

** Submit a paper to a conference. Some conferences have a dataset
paper track where you publish about the dataset you release. Research
is happy to support you with guidance if you need it and choose to go
down this path. This takes more time, and in return it gives you a
"peer-review" stamp and more experience in publishing, if you want
that.

Unless you want to publish your work in a peer-reviewed venue, I
suggest one of the first two approaches.

* I'm not sure if you intend to make the dataset more discoverable
through places such as https://datasetsearch.research.google.com/ .
You may want to consider that.

Thanks,
Leila

--
Leila Zia
Head of Research
Wikimedia Foundation

On Mon, Feb 10, 2020 at 9:28 PM Pine W  wrote:
>
> I was thinking about the licensing issue some more. Apparently there
> was a relevant United States court case regarding metadata several
> years ago in the United States, but it's unclear to me from my brief
> web search whether this holding would apply to metadata from every
> nation. Also, I don't know if the underlying statues have changed
> since the time of that ruling. I think that WMF Legal should be
> consulted regarding the copyright status of the metadata. Also, I
> think that the licensing of metadata should be explicitly addressed in
> the Terms of Use or a similar document which is easily accessible to
> all contributors to Wikimedia sites.
>
> Pine
> ( https://meta.wikimedia.org/wiki/User:Pine )
>
> On Tue, Feb 11, 2020 at 12:17 AM Pine W  wrote:
> >
> > Hi Joseph,
> >
> > Thanks for this announcement.
> >
> > I am looking for license information regarding the dumps, and I'm not
> > finding it in the pages that you linked at [1] or [2]. The license
> > that applies to text on Wikimedia sites is often CC-BY-SA 3.0, and the
> > WMF Terms of Use at https://foundation.wikimedia.org/wiki/Terms_of_Use
> > do not appear to provide any exception for metadata. In the absence of
> > a specific license, I think that the CC-BY-SA or other relevant
> > licenses would apply to the metadata, and that the licensing
> > information should be prominently included on relevant pages and in
> > the dumps themselves.
> >
> > What do you think?
> >
> > Pine
> > ( https://meta.wikimedia.org/wiki/User:Pine )
> >
> > On Mon, Feb 10, 2020 at 4:28 PM Joseph Allemandou
> >  wrote:
> > >
> > > Hi Analytics People,
> > >
> > > The Wikimedia Analytics Team is pleased to announce the release of the 
> > > most complete dataset we have to date to analyze content and contributors 
> > > metadata: Mediawiki History [1] [2].
> > >
> > > Data is in TSV format, released monthly around the 3rd of the month 
> > > usually, and every new release contains the full history of metadata.
> > >
>

Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-07 Thread Leila Zia
On Fri, Feb 7, 2020 at 12:45 PM Nuria Ruiz  wrote:

> > and the verdict (supported by you) was that we should use this list or
> the public IRC channel.
> Indeed, eh? I suggest we revisit that to send questions to
> analytics-internal but if others disagree, I am fine with either.
>

my 2 cents: I prefer the public list, as the conversation can be
relevant to my team (Research) as well. At the moment, if I see that
something is not of immediate interest to me, I mute the thread; that's
quite easy/cheap on my end. If the frequency of this kind of question
on the list increases significantly, I'd suggest adding a tag to the
subject line that allows people to filter appropriately.

Leila


Re: [Analytics] Subject: New Office hours for WMF/Research starting in January 2020

2020-01-17 Thread Leila Zia
A friendly reminder that the first joint Analytics and Research office
hours will take place on 2020-01-22 at 17.00-18.00 (UTC). Bring your
Wikimedia-related research and data questions to us during these
office hours. More at
https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours

--
Leila Zia
Head of Research
Wikimedia Foundation

On Mon, Dec 16, 2019 at 3:39 AM Martin Gerlach  wrote:
>
> Hi all,
>
>
> We, the Research team at Wikimedia Foundation, have received some requests 
> over the past months for making ourselves more available to answer some of 
> the research questions that you as Wikimedia volunteers, affiliates' staff, 
> and researchers face in your projects and initiatives. Starting January 2020, 
> we will experiment with monthly office hours organized jointly by our team 
> and the Analytics team where you can join us and direct your questions to us. 
> We will revisit this experiment in June 2020 to assess whether to continue it 
> or not.
>
>
> The scope
>
> We encourage you to attend the office hour if you have research related 
> questions. These can be questions about our teams, our projects, or more 
> importantly questions about your projects or ideas that we can support you 
> with during the office hours. You can also ask us questions about how to use 
> a specific dataset available to you, to answer a question you have, or some 
> other question. Note that the purpose of the office hours is to answer your 
> questions during the dedicated time of the office hour. Questions that may 
> require many hours of back-and-forth between our team and you are not suited 
> for this forum. For these bigger questions, however, we are happy to 
> brainstorm with you in the office hour and point you to some good directions 
> to explore further on your own (and maybe come back in the next office hour 
> and ask more questions).
>
>
> Time and Location
>
> We meet on the 4th Wednesday of every month 17.00-18.00 (UTC) in 
> #wikimedia-research IRC channel on freenode [1].
>
> The first meeting will be on January 22.
>
> Up-to-date information on mediawiki [2]
>
>
> Archiving
>
> If you miss the office hour, you can read the logs of it at [3].
>
> Future announcements about these office hours will only go to the
> following lists, so please make sure you're subscribed to them if you
> would like to receive a ping:
>
> * wiki-research-l mailing list [4]
>
> * analytics mailing list [5]
>
> * wikidata mailing list [6]
>
> * the Research category in Space [7]
>
>
> on behalf of Research and Analytics at WMF,
>
> Martin
>
>
> [1] irc://irc.freenode.net/wikimedia-research
>
> [2] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
>
> [3] https://wm-bot.wmflabs.org/logs/%23wikimedia-research/
>
> [4] https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
> [5] https://lists.wikimedia.org/mailman/listinfo/analytics
>
> [6] https://lists.wikimedia.org/mailman/listinfo/wikidata
>
> [7] https://discuss-space.wmflabs.org/tags/research
>
>
>
> --
> Martin Gerlach
> Research Scientist
> Wikimedia Foundation


Re: [Analytics] Wiki Workshop 2020 Announcement and Call for Papers

2020-01-14 Thread Leila Zia
A gentle reminder that January 17 is the deadline for submitting your
papers to Wiki Workshop 2020 for the archival publication. February 21
is the last day to submit for non-archival publication.

Your paper can be 1-8 pages long, and we specifically encourage the
submission of preliminary work. Our past participants have reported
benefiting from thoughtful and detailed feedback from the program
committee [1], as well as from discussing their work in its early
stages with other researchers and Wikimedians.

Submit your work via http://wikiworkshop.org/2020/#submission .

If you have questions, let us know.

Best,
Leila

[1] 2020's program committee: http://wikiworkshop.org/2020/#program-committee

On Wed, Nov 27, 2019 at 7:13 PM Leila Zia  wrote:
>
> Hi everyone,
>
> We are delighted to announce that Wiki Workshop 2020 will be held in
> Taipei on April 20 or 21, 2020 (the date to be finalized soon) and as
> part of the Web Conference 2020 [1]. In the past years, Wiki Workshop
> has traveled to Oxford, Montreal, Cologne, Perth, Lyon, and San
> Francisco.
>
> You can read more about the call for papers and the workshops at
> http://wikiworkshop.org/2020/#call. Please note that the deadline for
> the submissions to be considered for proceedings is January 17. All
> other submissions should be received by February 21.
>
> If you have questions about the workshop, please let us know on this
> list or at wikiworks...@googlegroups.com.
>
> Looking forward to seeing you in Taipei.
>
> Best,
> Miriam Redi, Wikimedia Foundation
> Bob West, EPFL
> Leila Zia, Wikimedia Foundation
>
> [1] https://www2020.thewebconf.org/



[Analytics] Wiki Workshop 2020 Announcement and Call for Papers

2019-11-27 Thread Leila Zia
Hi everyone,

We are delighted to announce that Wiki Workshop 2020 will be held in
Taipei on April 20 or 21, 2020 (the date to be finalized soon) and as
part of the Web Conference 2020 [1]. In the past years, Wiki Workshop
has traveled to Oxford, Montreal, Cologne, Perth, Lyon, and San
Francisco.

You can read more about the call for papers and the workshops at
http://wikiworkshop.org/2020/#call. Please note that the deadline for
the submissions to be considered for proceedings is January 17. All
other submissions should be received by February 21.

If you have questions about the workshop, please let us know on this
list or at wikiworks...@googlegroups.com.

Looking forward to seeing you in Taipei.

Best,
Miriam Redi, Wikimedia Foundation
Bob West, EPFL
Leila Zia, Wikimedia Foundation

[1] https://www2020.thewebconf.org/



[Analytics] Fwd: [Wikitech-l] BREAKING CHANGE: schema update, xml dumps

2019-11-27 Thread Leila Zia
FYI

-- Forwarded message -
From: Ariel Glenn WMF 
Date: Wed, Nov 27, 2019 at 5:38 AM
Subject: [Wikitech-l] BREAKING CHANGE: schema update, xml dumps
To: Wikipedia Xmldatadumps-l ,
Wikimedia developers 


We plan to move to the new schema for xml dumps for the February 1, 2020
run. Update your scripts and apps accordingly!

The new schema contains an entry for each 'slot' of content. This means
that, for example, the commonswiki dump will contain MediaInfo information
as well as the usual wikitext. See
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/docs/export-0.11.xsd
for the schema and
https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps
for further explanation and example outputs.
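To make the change concrete for dump consumers: a parser now has to iterate over per-slot `<content>` elements instead of expecting a single text node per revision. Below is a minimal, hedged sketch; the XML fragment and its slot roles are illustrative assumptions loosely following the shape described in the MCR schema RFC, not output from a real dump (the authoritative layout is the export-0.11 XSD linked above).

```python
# Sketch: extracting per-slot content from an MCR-style <revision> fragment.
# The element names (<content>, <role>, <model>, <text>) follow the general
# shape described in the MCR schema RFC; the fragment itself is invented
# for illustration and is not real dump output.
import xml.etree.ElementTree as ET

FRAGMENT = """
<revision>
  <id>12345</id>
  <content>
    <role>main</role>
    <model>wikitext</model>
    <text>Example wikitext body</text>
  </content>
  <content>
    <role>mediainfo</role>
    <model>wikibase-mediainfo</model>
    <text>{"labels": {}}</text>
  </content>
</revision>
"""

def slots(revision_xml: str) -> dict:
    """Map each slot role to its (model, text) pair."""
    rev = ET.fromstring(revision_xml)
    out = {}
    for content in rev.findall("content"):
        role = content.findtext("role")
        out[role] = (content.findtext("model"), content.findtext("text"))
    return out

if __name__ == "__main__":
    for role, (model, text) in slots(FRAGMENT).items():
        print(f"{role}: model={model}, {len(text)} chars")
```

Scripts that previously read exactly one text body per revision would, under a layout like this, pick the slot they care about (e.g. `main`) rather than assuming it is the only one.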

Phabricator task for the update: https://phabricator.wikimedia.org/T238972

PLEASE FORWARD to other lists as you deem appropriate. Thanks!

Ariel Glenn
___
Wikitech-l mailing list
wikitec...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l



Re: [Analytics] Wikimedia Research Showcase - Feedback time

2019-11-15 Thread Leila Zia
Hi all,

A (first and last) friendly reminder that if you haven't participated
in the Research Showcase survey yet, you have one more week to do so
(and we will greatly appreciate your input :).

Link to survey:
https://docs.google.com/forms/d/e/1FAIpQLSecgn8cMu5IfTYRgn93bfOiJVEIL09RRf_WV0dVr6ZnJ8UU_w/viewform

Thanks to those of you who have participated so far.

Thanks,
Leila

On Fri, Nov 8, 2019 at 12:46 PM Leila Zia  wrote:
>
> Hi all,
>
> Wikimedia Research Showcase [1] is almost six years old and we're
> using the birthday opportunity to step back and reflect on the past,
> celebrate the contributions by more than 70 speakers and many of you
> who participated in the discussions, and plan for its future.
>
> We would like to ask for your input as we're thinking about the future
> of the Research Showcases. We want to hear from those of you who
> participated in the showcases and/or watched them, as well as those of
> you who decided this is not something for you. :) In order to gather
> your input, we have put together a survey that we'd appreciate if you
> participate in.
>
> Link to survey (please note that the link will take you to Google
> Forms [2]): 
> https://docs.google.com/forms/d/e/1FAIpQLSecgn8cMu5IfTYRgn93bfOiJVEIL09RRf_WV0dVr6ZnJ8UU_w/viewform
>
> Your contributions to this survey can help us in our thinking as we
> move forward. Please submit your responses by 2019-11-22.
>
> Sincerely,
> Jonathan Morgan and Leila Zia
> Research, Wikimedia Foundation
>
> [1] https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
>
> [2] If you want to participate but not through Google Forms, ping me
> off-list and I'll send you a pdf file you can fill and send back to me
> (it won't be anonymous though. sorry.). I'm not attaching it to this
> email as some lists may put my email in the moderation queue with an
> attachment. (And I don't /think/ I can upload it to Commons.)



[Analytics] Wikimedia Research Showcase - Feedback time

2019-11-08 Thread Leila Zia
Hi all,

Wikimedia Research Showcase [1] is almost six years old, and we're
using this birthday as an opportunity to step back and reflect on the
past, celebrate the contributions of more than 70 speakers and the
many of you who participated in the discussions, and plan for its
future.

We would like to ask for your input as we think about the future of
the Research Showcase. We want to hear from those of you who
participated in the showcases and/or watched them, as well as those of
you who decided this is not something for you. :) To gather your
input, we have put together a survey that we'd appreciate your taking
part in.

Link to survey (please note that the link will take you to Google
Forms [2]): 
https://docs.google.com/forms/d/e/1FAIpQLSecgn8cMu5IfTYRgn93bfOiJVEIL09RRf_WV0dVr6ZnJ8UU_w/viewform

Your contributions to this survey can help us in our thinking as we
move forward. Please submit your responses by 2019-11-22.

Sincerely,
Jonathan Morgan and Leila Zia
Research, Wikimedia Foundation

[1] https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase

[2] If you want to participate but not through Google Forms, ping me
off-list and I'll send you a pdf file you can fill and send back to me
(it won't be anonymous though. sorry.). I'm not attaching it to this
email as some lists may put my email in the moderation queue with an
attachment. (And I don't /think/ I can upload it to Commons.)



Re: [Analytics] [Wiki-research-l] Analytics clients (stat/notebook hosts) and backups of home directories

2019-07-15 Thread Leila Zia
All clear, Luca and Nuria. Thanks!


On Thu, Jul 11, 2019 at 2:55 AM Luca Toscano  wrote:
>
> Hi Leila and Kate,
>
> adding a few words after Nuria's email to clarify my original intentions.
> My point was that any important or vital file that needs to be preserved
> should be stored in HDFS rather than on the stat/notebook hosts, given the
> absence of backups of the home directories. My concern was that people had
> a different understanding about backups, and I wanted to clarify.
> We (the Analytics team) don't currently have any good way to periodically
> scan HDFS and home directories across hosts to find PII data retained
> longer than the allowed period of time. The main reason is that we'd need
> a way to check a huge number of files, with different names and formats,
> and figure out whether the data contained in them is PII and has been
> retained for more than X days. It is not an impossible task, but it is not
> easy or trivial either; in my opinion we'd need a lot more staff to create
> and maintain something like that :) We recently started cleaning up old
> home directories (i.e., those belonging to users who are no longer active),
> and we established a process with SRE to get pinged when a user is
> offboarded so we can verify what data should be kept and what shouldn't (I
> know both of you are aware of this since you have been working with us on
> several tasks; I'm writing it here so other people get the context :).
> This is only a starting point; I really hope to have something more robust
> and complete in the future. In the meantime, I'd say every user is
> responsible for the data they handle on the Analytics infrastructure,
> periodically reviewing it and deleting it when necessary. I don't have a
> specific guideline/process to suggest, but we can definitely have a chat
> together and decide on something shared among our teams!
>
> Let me know if this makes sense or not :)
>
> Thanks,
>
> Luca
>
> On Wed, Jul 10, 2019 at 11:15 PM Nuria Ruiz wrote:
>
> > >I have one question for you: As you allow/encourage for more copies of
> > >the files to exist
> > To be extra clear, we do not encourage keeping data on the notebook
> > hosts at all; they have the capacity neither to process nor to host
> > large amounts of data. Data that you are working with is best placed in
> > your /user/your-username database in Hadoop. So, far from encouraging
> > multiple copies, we are rather encouraging you to keep the data outside
> > the notebook machines.
> >
> > Thanks,
> >
> > Nuria
> >
> > On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman 
> > wrote:
> >
> >> I second Leila's question. The issue of how we flag PII data and ensure
> >> it's appropriately scrubbed came up in our team meeting yesterday. We're
> >> discussing team practices for data/project backups tomorrow and plan to
> >> come out with some proposals, at least for the short term.
> >>
> >> Are there any existing processes or guidelines I should be aware of?
> >>
> >> Thanks!
> >> Kate
> >>
> >> --
> >>
> >> Kate Zimmerman (she/they)
> >> Head of Product Analytics
> >> Wikimedia Foundation
> >>
> >>
> >> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia  wrote:
> >>
> >>> Hi Luca,
> >>>
> >>> Thanks for the heads up. Isaac is coordinating a response from the
> >>> Research side.
> >>>
> >>> I have one question for you: As you allow/encourage for more copies of
> >>> the files to exist, what is the mechanism you'd like to put in place
> >>> for reducing the chances of PII to be copied in new folders that then
> >>> will be even harder (for your team) to keep track of? Having an
> >>> explicit process/understanding about this will be very helpful.
> >>>
> >>> Thanks,
> >>> Leila
> >>>
> >>>
> >>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano 
> >>> wrote:
> >>> >
> >>> > Hi everybody,
> >>> >
> >>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team
> >>> > thought to reach out to everybody to make it clear that all the home
> >>> > directories on the stat/notebook nodes are not backed up periodically.
> >>> > They run on a software RAID configuration spanning multiple disks, so we
> >>> > are resilient to a disk failure, but, even if unlikely, it might happen
> >>> > that a host could lose all its data.

Re: [Analytics] [Wiki-research-l] Analytics clients (stat/notebook hosts) and backups of home directories

2019-07-10 Thread Leila Zia
Hi Luca,

Thanks for the heads up. Isaac is coordinating a response from the
Research side.

I have one question for you: As you allow/encourage for more copies of
the files to exist, what is the mechanism you'd like to put in place
for reducing the chances of PII to be copied in new folders that then
will be even harder (for your team) to keep track of? Having an
explicit process/understanding about this will be very helpful.

Thanks,
Leila


On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano  wrote:
>
> Hi everybody,
>
> as part of https://phabricator.wikimedia.org/T201165 the Analytics team
> thought to reach out to everybody to make it clear that all the home
> directories on the stat/notebook nodes are not backed up periodically. They
> run on a software RAID configuration spanning multiple disks, of course, so
> we are resilient to a disk failure, but, even if unlikely, it might happen
> that a host could lose all its data. Please keep this in mind when working
> on important projects and/or handling important data that you care about.
>
> I just added a warning to
> https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
> If you have really important data that is too big to back up, keep in mind
> that you can use your home directory (/user/your-username) on HDFS (that
> replicates data three times across multiple nodes).
>
> Please let us know if you have comments/suggestions/etc.. in the
> aforementioned task.
>
> Thanks in advance!
>
> Luca (on behalf of the Analytics team)
> ___
> Wiki-research-l mailing list
> wiki-researc...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
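As a concrete illustration of the advice above, here is a minimal sketch (not an official workflow; the file path and username are made up for illustration) of backing a file up from a stat host's home directory into the user's replicated HDFS home via `hdfs dfs -put`:

```python
# Sketch: back up an important file from a stat/notebook home directory into
# the user's HDFS home, which (per the thread) replicates data three times.
# The path and username below are illustrative, not real.
import subprocess

def hdfs_put_command(local_path, username):
    """Build the `hdfs dfs -put` command line for copying a local file
    into /user/<username>/ on HDFS (-f overwrites an existing copy)."""
    return ["hdfs", "dfs", "-put", "-f", local_path, "/user/%s/" % username]

cmd = hdfs_put_command("/home/alice/important-results.tsv", "alice")
print(" ".join(cmd))
# On a cluster host you would then execute it, e.g.:
#   subprocess.run(cmd, check=True)
```

Building the command as a list (rather than a shell string) avoids quoting problems if it is later passed to `subprocess.run`.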



Re: [Analytics] [Wikimedia Research Showcase] June 26, 2019 at 11:30 AM PST, 19:30 UTC

2019-06-27 Thread Leila Zia
Hi James,

Beyond the abstract of the talk at
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2019
and section 7 of the full paper (the full paper is linked from the
abstract as well:
http://www.cs.cornell.edu/~jpchang/papers/recidivism_online.pdf ), I'm
not aware of any other summaries.

Best,
Leila
p.s. inspired by your question, I'm thinking maybe we should ask
Jonathan Chang to write a blog post about it. That's for later though.
:)

--
Leila Zia
Principal Research Scientist, Head of Research
Wikimedia Foundation

On Thu, Jun 27, 2019 at 2:50 PM James Salsman  wrote:
>
> > For those that couldn't make it, is there a summary of what was said?
>
> Full recording: https://www.youtube.com/watch?v=WiUfpmeJG7E
>
> Slides:
>
> https://www.mediawiki.org/wiki/File:Trajectories_of_Blocked_Community_Members_-_Slides.pdf
>
> https://meta.wikimedia.org/wiki/University_of_Virginia/Automatic_Detection_of_Online_Abuse
>



[Analytics] [Wikimedia Research Showcase] March 20 at 11:30 AM PST, 18:30 UTC

2019-03-18 Thread Leila Zia
Hi all,

The next Research Showcase, “Learning How to Correct a Knowledge Base
from the Edit History” and “TableNet: An Approach for Determining
Fine-grained Relations for Wikipedia Tables” will be live-streamed
this Wednesday, March 20, 2019, at 11:30 AM PST/18:30 UTC (Please note
the change in time in UTC due to daylight saving changes in the U.S.).
The first presentation is about using edit history to automatically
correct constraint violations in Wikidata, and the second is about
interlinking Wikipedia tables.

YouTube stream: https://www.youtube.com/watch?v=6p62PMhkVNM

As usual, you can join the conversation on IRC at #wikimedia-research.
You can also watch our past research showcases at
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase .

This month's presentations:

Learning How to Correct a Knowledge Base from the Edit History

By Thomas Pellissier Tanon (Télécom ParisTech), Camille Bourgaux (DI
ENS, CNRS, ENS, PSL Univ. & Inria), Fabian Suchanek (Télécom
ParisTech), WWW'19.

The curation of Wikidata (and other knowledge bases) is crucial to
keep the data consistent, to fight vandalism and to correct good faith
mistakes. However, manual curation of the data is costly. In this
work, we propose to take advantage of the edit history of the
knowledge base in order to learn how to correct constraint violations
automatically. Our method is based on rule mining, and uses the edits
that solved violations in the past to infer how to solve similar
violations in the present. For example, our system is able to learn
that the value of the [[d:Property:P21|sex or gender]] property
[[d:Q467|woman]] should be replaced by [[d:Q6581072|female]]. We
provide [https://tools.wmflabs.org/wikidata-game/distributed/#game=43
a Wikidata game] that suggests our corrections to the users in order
to improve Wikidata. Both the evaluation of our method on past
corrections, and the Wikidata game statistics show significant
improvements over baselines.
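The rule-mining idea in the abstract above can be illustrated with a toy sketch. This is not the authors' code; the rule set and the violating statements below are invented for illustration. Rules learned from past corrective edits map a (property, wrong value) pair to a replacement value, and are then applied to current constraint violations:

```python
# Toy illustration of the paper's idea (not the authors' implementation):
# corrections mined from past edits, expressed as (property, old_value) ->
# new_value rules, applied to current constraint-violating statements.
# Identifiers follow Wikidata style, but the rule table is invented.

learned_rules = {
    # The example from the abstract: for "sex or gender" (P21), the value
    # "woman" (Q467) should be replaced by "female" (Q6581072).
    ("P21", "Q467"): "Q6581072",
}

def correct_statement(prop, value):
    """Return the corrected value if a learned rule matches, else the original."""
    return learned_rules.get((prop, value), value)

violations = [("P21", "Q467"), ("P21", "Q6581072")]
corrected = [(p, correct_statement(p, v)) for p, v in violations]
print(corrected)  # the first value is rewritten, the second left unchanged
```

In the actual system, the rule table is mined from the edit history rather than written by hand, and candidate corrections are surfaced to users through a Wikidata game instead of being applied automatically.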


TableNet: An Approach for Determining Fine-grained Relations for
Wikipedia Tables

By Besnik Fetahu

Wikipedia tables represent an important resource, where information is
organized w.r.t. table schemas consisting of columns. In turn, each
column may contain instance values that point to other Wikipedia
articles or primitive values (e.g., numbers, strings, etc.). In this
work, we focus on the problem of interlinking Wikipedia tables for two
types of table relations: equivalent and subPartOf. Through such
relations, we can further harness semantically related information by
accessing related tables or facts therein. Determining the relation
type of a table pair is not trivial, as it depends on the schemas, the
values therein, and the semantic overlap of the cell values in the
corresponding tables. We propose TableNet, an approach that constructs
a knowledge graph of interlinked tables with subPartOf and equivalent
relations. TableNet consists of two main steps: (i) for any source
table, an efficient algorithm that finds all candidate related tables
with high coverage, and (ii) a neural-based approach that takes into
account the table schemas and the corresponding table data to
determine, with high accuracy, the relation for a table pair. We
perform an extensive experimental evaluation on the entire Wikipedia
with more than 3.2 million tables. We show that we retain more than
88% of the relevant candidate table pairs for alignment. Consequently,
with an accuracy of 90%, we are able to align tables with subPartOf or
equivalent relations. Comparisons with existing competitors show that
TableNet has superior performance in terms of coverage and alignment
accuracy.

Best,
Leila



Re: [Analytics] DEPRECATION WARNING: dbstore1002 is going to be decommissioned on March 4th

2019-02-22 Thread Leila Zia
On Fri, Feb 22, 2019 at 2:45 AM Luca Toscano  wrote:
>
> the Analytics team has been working with the SRE Data Persistence team during 
> the last months to replace dbstore1002 with three brand new nodes, 
> dbstore100[3-5]. We are moving from a single mysql instance (multi-source) to 
> a multi-instance environment.

This has been an incredible amount of work, both in socializing the
idea and in implementing it while making sure, as much as possible,
that workflows don't break. Thank you to all of you who worked on this
over the past months, and to those of you who maintained the single
MySQL instance over all the past years. I hope that the maintenance
workflows become easier for those of you who continue to maintain
these systems for us.

Thank you! :)

Best,
Leila



Re: [Analytics] Farewell, Erik!

2019-02-06 Thread Leila Zia
Erik,

It's been an incredible honor to work with you as a colleague and a
volunteer. Thank you for the stats and all the conversations about
categories, topics, languages, ..., but even more so for showing me
the path and the purpose, time after time. I will dearly miss you at
the Wikimedia Foundation, and I hope that I can be a steward of what
you stood for (or at least I can say that I will continue to try :).

Enjoy your new endeavors and see you around.

Regards,
Leila


On Wed, Feb 6, 2019 at 3:22 PM Christian Aistleitner
 wrote:
>
> Hi Erik,
>
> Thank you for your work!
>
> When I first came across Wikistats, it completely blew my mind. Such a
> huge collection of raw data turned into digestible information. It's
> amazing, stunning, and above all: enlightening.
> I've spent countless hours digging through Wikistats in awe.
>
> But besides the gargantuan effort that Wikistats represents, I even
> more value your passion for the data and information it holds, your
> second-to-none expertise on it, and your willingness to go through the
> details and numbers with each and everyone, regardless of where they come
> from, your openness, your unbiasedness, your constructive approach,
> and your never-shying-away from discussions about the numbers and
> trends.
>
> Enjoy your retirement from WMF, and seeing your blog post and your
> tree mapping project, I'm sure it'll be an amazing "Unruhestand" :-)
>
> Have fun,
> Christian
>
>
>
> On Wed, Feb 06, 2019 at 01:17:48PM -0800, Dario Taraborelli wrote:
> > “[R]ecent revisions of an article can be peeled off to reveal older layers,
> > which are still meaningful for historians. Even graffiti applied by vandals
> > can by its sheer informality convey meaningful information, just like
> > historians learned a lot from graffiti on walls of classic Pompei. Likewise
> > view patterns can tell future historians a lot about what was hot and what
> > wasn’t in our times. Reason why these raw view data are meant to be
> > preserved for a long time.”
> >
> > Erik Zachte wrote these lines in a blog post almost ten years ago, and I
> > cannot find better words to describe the gift he gave us. Erik retired this
> > past Friday, leaving behind an immense legacy. I had the honor to work with
> > him for several years, and I hosted this morning an intimate, tearful
> > celebration of what Erik has represented for the Wikimedia movement.
> >
> > His Wikistats project, with his signature pale yellow background we've
> > known and loved since the mid 2000s, has been much more than an
> > "analytics platform". It's been an individual
> > attempt he initiated, and grew over time, to try and comprehend and make
> > sense of the largest open collaboration project in human history, driven by
> > curiosity and by an insatiable desire to serve data to the communities that
> > most needed it.
> >
> > Through this project, Erik has created a live record of data describing the
> > growth and reach of all Wikimedia communities, across languages and
> > projects, putting multi-lingualism and smaller communities at the very
> > center of his attention. He coined metrics such as "active editors" that
> > defined the benchmark for volunteers, the Wikimedia Foundation, and the
> > academic community to understand some of the growing pains and editor
> > retention issues the movement has faced. He created countless reports—that
> > predate by nearly a decade modern visualizations of online attention—to
> > understand what Wikipedia traffic means in the context of current events
> > like elections or public health crises.
> > He has created countless visualizations that show the enormous gaps in
> > local language content and representation that, as a movement, we face in
> > our efforts to build an encyclopedia for and about everyone. He has also
> > made extensive use of pie charts, which—as friends—we are ready to turn a
> > blind eye towards.
> >
> > Most importantly, the data Erik has brought to life has been cited over
> > 1,000 times
> > 
> > in the 

Re: [Analytics] Wikistats2 - Metrics available for project families

2018-12-12 Thread Leila Zia
On Wed, Dec 12, 2018 at 11:40 AM Nuria Ruiz  wrote:
>
> Hello!
>
> The Analytics team would like to announce that Wikistats2 now has metrics
> available for what we are calling (for lack of a better name) "project
> families": that is, "all wikipedias", "all wikibooks", etc.
>
> See, for example, bytes added by users to all wikibooks in the last month:
> https://stats.wikimedia.org/v2/#/all-wikibooks-projects/content/net-bytes-difference/normal|bar|1-Month|editor_type~user

Thanks! I really missed this one each time I prepared a presentation
in the past months. Great to have it there. :)



Re: [Analytics] Pageviews by agent for May 18-21 2015

2018-11-13 Thread Leila Zia
Hi Jennifer,

Is this question related to the topic we're discussing in a private
email thread as well? I want to make sure I don't misunderstand your
other question, and that we don't all duplicate work on the same
question. It would be great if you could clarify whether answering
that email will address this question, or whether I'm missing
something obvious (apologies in advance if that's the case).

Best,
Leila

--
Leila Zia
Senior Research Scientist, Lead
Wikimedia Foundation

On Tue, Nov 13, 2018 at 6:41 AM Jennifer Pan  wrote:
>
> Hi there,
>
>
> I'm an assistant professor in the Department of Communication at Stanford. My 
> co-author, Molly Roberts (Political Science, UCSD), and I are working on a 
> paper examining the effect of China's 2015 block of Chinese language 
> wikipedia on pageviews, which builds on our previous work on censorship in 
> China.
>
> We are using the block to conduct an interrupted time series design to measure
> the effect of censorship on Chinese users. Our main finding is that Chinese 
> users were using Wikipedia to browse (starting at the home page), and the 
> block influenced users' ability to explore and encounter unexpected 
> information. One question we have is whether the pageviews we observe are 
> driven by bots and spiders. We know that the wikimedia rest api provides this 
> information going back to July 1 2015. Since the China block of Wikipedia was 
> on May 19, 2015, we are wondering if there is pageview data by agent type for 
> zh.wikipedia.org pages (all or some subset like most popular) going back to 
> May 2015 (specifically May 18-21, 2015)? From 
> https://meta.wikimedia.org/wiki/Research:Timeline_of_Wikimedia_analytics,
> it says that pageview data is available in bulk starting on May 1, 2015, so 
> we thought maybe there was some chance this data exists.
>
> Any suggestions would be greatly appreciated, and if this is not possible, 
> please let us know.
>
> Thank you!
> Jennifer Pan
>
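For the period where per-agent data does exist (July 2015 onward), the Wikimedia REST API's per-article endpoint breaks pageviews down by agent type. Below is a small sketch of assembling such a request URL; the endpoint shape follows the public pageviews API, but the article title and date range are examples only, and no request is actually sent:

```python
# Sketch: build a per-article pageviews request URL for the Wikimedia REST
# API, which splits counts by agent (all-agents, user, spider, ...).
# The article and dates are examples; per-agent data starts in July 2015,
# so the May 2015 block window itself is not covered by this endpoint.
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(project, article, start, end, agent="user",
                  access="all-access", granularity="daily"):
    """Assemble the REST endpoint path for one article's pageview counts."""
    return "/".join([BASE, project, access, agent,
                     quote(article, safe=""), granularity, start, end])

url = pageviews_url("zh.wikipedia.org", "维基百科", "20150701", "20150710")
print(url)
# Fetching it would be, e.g.:
#   import urllib.request; data = urllib.request.urlopen(url).read()
```

Percent-encoding the title with `quote(..., safe="")` matters for non-Latin titles like the Chinese example above.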



Re: [Analytics] Statistics about republication of Wikimedia content

2018-10-17 Thread Leila Zia
Hi Pine,


On Tue, Sep 18, 2018 at 12:11 PM Pine W  wrote:
>
> Hi Analytics,
>
> Are views of republished Wikimedia content, such as on Google and Youtube, 
> something that we could include in addition to Wikimedia pageview statistics? 
> I imagine that this would require cooperation from Alphabet and other 
> companies that reuse Wikimedia content. It would be nice if we could get that 
> cooperation.

This is an interesting idea and, as Dan mentioned in his response,
something that we're generally interested in. Measuring re-use can
open up a lot of opportunities for us as a movement: it shows that the
importance of Wikipedia does not end at Wikipedia, and that its
content and knowledge are presented to different audiences through a
variety of channels.

While we may be able to start getting some raw numbers for re-use from
specific platforms (through the kinds of cooperation you called out,
or other means), the problem is much more complex than what those raw
numbers can show, and part of me is interested in addressing that more
fundamental question rather than summarizing the value of Wikipedia
with direct pageviews. We all know that the value of Wikipedia doesn't
end at Wikipedia. For example, the exact (or even rough) value of
Wikipedia for Knowledge Vault [1], which was/is the underlying
mechanism for surfacing search results in Google and other major
websites' products, is unknown to us. It is easier to measure how many
times Wikipedia is directly used in Google Home, Alexa, or
Google/Bing/etc. search, and harder to see the value of Wikipedia for
many of the services we enjoy using today on and outside of the Web
(including search logic, Google/Yandex/etc. machine translation, and
many of the advancements in AI and ML; NLP, for example, has benefited
greatly from Wikipedia and is heavily used across many industries).

From the research perspective, the really interesting and informative
question is: what is the value of Wikipedia, both economic and
otherwise, across the many languages? It would be great to get to the
bottom of this question. If we can measure this, we will have a
powerful way to open more doors for Wikipedia.

Best,
Leila

[1] 
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45634.pdf



Re: [Analytics] [Wiki-research-l] New files for geo coded Wikimedia stats

2018-07-27 Thread Leila Zia
Thanks for this, Erik. This can be helpful for a variety of projects
including
https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour/Robustness_across_languages
and the next steps for this project.

L

On Wednesday, July 11, 2018, Erik Zachte  wrote:

>  Today I released two new json files [2][4].
> Both complement visualization 'Wikipedia Views Visualized' [1] (aka
> WiViVi), but both can be useful in other contexts as well.
> 1) File 'demographics_from_world_bank_for_wikimedia.json' [2] resulted
> from
> harvesting World Bank API files.
> It contains yearly figures for four metrics: (more could be added rather
> easily):
> - population counts,
> - percentage internet users,
> - percentage mobile subscriptions,
> - GDP per capita.
> The following static demographics charts on meta are also based on these
> metrics: [3]
> 2) File 'datamaps-data.json' [4] contains the equivalent of 3 rather
> complex (*) csv files which feed WiViVi. This brings together demographics
> data and pageviews (by country, by region, and by language), and also adds
> additional meta info. This json file is meant for external use, as it's
> much easier to parse than the 3 csv files WiViVi uses itself [5].
> (*) complex, as the csv files use a hierarchy based on nested delimiters
> --
> Details:
> World Bank files have different formats (some csv, some json) and use a
> variety of indexes (some use ISO 3166-1 alpha-2 codes, others ..-alpha-3).
> Script 1) first does normalization, then data are aggregated, filtered,
> indexed.
> Json file 1) replaces two csv files which up to now were filled from
> Wikipedia pages [6][7].
> Also, although Wikipedia lists nowadays also use World Bank data, this is
> not consistently done, see [8][9].
> [1] Viz:
> https://stats.wikimedia.org/wikimedia/animations/wivivi/wivivi.html
> [2] Json:
> https://stats.wikimedia.org/wikimedia/animations/wivivi/world-bank-demographics.json
> Script:
> https://github.com/wikimedia/analytics-wikistats/tree/master/worldbank
> [3] Charts: https://meta.wikimedia.org/wiki/World_Bank_demographics
> [4] Json:
> https://stats.wikimedia.org/wikimedia/animations/wivivi/datamaps-data.json
> Script:
> https://github.com/wikimedia/analytics-wikistats/tree/master/traffic
> [5] Syntax:
> https://stats.wikimedia.org/wikimedia/animations/wivivi/data.html
> [6] Article:
> https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
> [7] Article:
> https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users
> [8] Talk page: https://bit.ly/2L5Z2P4 section 'Wikipedia vs Worldbank
> population counts'
> [9] Talk page: https://bit.ly/2NJUoIu section 'Wikipedia vs Worldbank
> internet percentages'


--
Leila Zia
Senior Research Scientist
Wikimedia Foundation


Re: [Analytics] Statistical point of view to the visitors and promotional user names

2018-06-12 Thread Leila Zia
[coming back from a private response to the public list, with
Ladislav's permission.]

On Fri, Jun 1, 2018 at 5:03 AM Ladislav Nešněra
 wrote:
>
> Re: to https://lists.wikimedia.org/pipermail/analytics/2018-May/006349.html
>
> Hi Leila,
> I'm sorry for the delay, but I'm not a subscriber to this list and
> registration doesn't work for me :-O
> (https://lists.wikimedia.org/mailman/listinfo/analytics), which means I had
> no notification of your answer.

No worries. I understand you're in now. :)

> Yes, I'd like to know something about user behaviour. This paragraph of the
> policy restricts user names based on potential abuse for promotional
> purposes. It would be good to know how many users reach the pages where
> they can see user names (i.e., the discussion page, the history of the
> article, and the history of the discussion), ideally separating human
> readers from editors. Is that possible?

There is no immediately available data I can think of to point you to
(others should, of course, feel free to provide pointers):

* If you're interested in anecdotal evidence: The easiest way I can
think of that you can visually see this information would be using
pageviews analysis tool [1].
* A properly set-up analysis will need to look at the pageviews to the
destinations you mentioned before/after the change, controlling for
seasonality, pageview changes over time, etc.
* Separating editor pageviews vs. reader pageviews will be hard and
that's by design. Even if we can set aside time to run this analysis
for you, this can only be done over the data in the past 90 days (at
most) and I would need to see a relatively strong editor community
support for doing it. (We generally don't do in-depth analysis of what
editors read, so some discussion is needed to make that happen.)
* If you decide to pursue the above, please communicate the priority
of this question on your end to help us prioritize. For example, is
there a community discussion pending on this result? How important is
that discussion? etc.

> And one related question - Wikidata Query and Wikipedia statistics.

[Link: 
https://www.wikidata.org/wiki/Wikidata:Request_a_query#Wikidata_Query_and_Wikipedia_statistics/meta_informations]

I'll let others who know more answer the question above, my guess
would be what's already said in the link above.

Best,
Leila

> Thank you in advance for your time  ;?


[1] 
http://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Cat|Dog

>
>
> On 2018-05-29 00:59, Ladislav Nešněra wrote:
>
> Dear analytic team ;),
>
> I'd like to know if the policy about promotional user names solves a real
> problem or if this problem was only anticipated in 2007.
> Is it possible to get statistics which distinguish between article visitors
> and people who can see potential promotional names, i.e., the discussion,
> the history of the article, and the history of the discussion? It would be
> ideal to exclude editors (it's hard to influence people familiar with the
> subject; on the contrary, they'd be annoyed by the promotion), but I don't
> believe it would make a significant difference.
> Can you help me or can you direct me into the right way, please?
>
> Thank you in advance for your time
>
> Ladislav Nešněra   ;?
> +420 721 658 256
>
>



[Analytics] Open position - Research Scientist

2018-06-11 Thread Leila Zia
[apologies for cross-posting.]

Hi all,

The Research team at the Wikimedia Foundation has opened a Research
Scientist position. Please review the job description at
https://boards.greenhouse.io/wikimedia/jobs/1173279?gh_src=a41847991 ,
apply if you're interested or share it with colleagues and friends.

Best,
Leila

--
Leila Zia
Senior Research Scientist, Lead
Wikimedia Foundation



Re: [Analytics] Statistical point of view to the visitors and promotional user names

2018-05-29 Thread Leila Zia
Hi Ladislav,

We need some more explanation from you to be able to help. Are you
interested in seeing how reader behavior and usage of Wikipedia have
changed as a result of this policy change? If not, can you elaborate?

Thanks,
Leila

--
Leila Zia
Senior Research Scientist, Lead
Wikimedia Foundation


On Mon, May 28, 2018 at 3:59 PM, Ladislav Nešněra
 wrote:
> Dear analytic team ;),
>
> I'd like to know if the policy about promotional user names solves a real
> problem or if this problem was only anticipated in 2007.
> Is it possible to get statistics which distinguish between article visitors
> and people who can see potential promotional names, i.e., the discussion,
> the history of the article, and the history of the discussion? It would be
> ideal to exclude editors (it's hard to influence people familiar with the
> subject; on the contrary, they'd be annoyed by the promotion), but I don't
> believe it would make a significant difference.
> Can you help me or can you direct me into the right way, please?
>
> Thank you in advance for your time
>
> Ladislav Nešněra   ;?
> +420 721 658 256
>
>



Re: [Analytics] Jeff Levesque: List of Articles By Categories (College Project)

2018-05-23 Thread Leila Zia
Hi Jeff and team,

On Wed, May 23, 2018 at 4:57 PM, Jeffrey Levesque <jleve...@syr.edu> wrote:
> Hi Leila,
> I was hoping to try to predict what categories of articles viewers would read:
>
> •   https://en.wikipedia.org/wiki/Category:Main_topic_classifications
>
> But I realized that Wikipedia categories don't have a well-defined
> structure. For example, I think it's possible that articles could have a
> recursive chaining of categories (a subcategory could have many parent
> categories, and this may continue indefinitely), so it seems impossible to
> derive the idea of a "main category". I was previously hoping that if it
> were possible to derive a "main category", I could extend the findings by
> relating them to current socio-political events. To meet my course
> requirements, I may have to adjust our project idea. However, if you have
> any related insights/strategies, they would be very appreciated.

Ok. So we have some things for you:

* Check section 4.3 of https://arxiv.org/pdf/1804.05995.pdf . There
we describe a way to clean the category network. What you will get
is a series of DAGs where cycles have been removed and the remaining
relations are is-a.
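To make the idea concrete, here is a small, hypothetical sketch of how such a cleaned is-a DAG could be used to derive "main categories" for an article. The category names and edges below are invented for illustration; the real network comes from the cleaning pipeline described in the paper.

```python
from collections import deque

# Hypothetical cleaned category network: child -> set of is-a parents.
# Invented for illustration; the real DAGs come from the cleaning
# pipeline described in section 4.3 of the paper.
PARENTS = {
    "Quantum mechanics": {"Physics"},
    "Physics": {"Physical sciences"},
    "Physical sciences": {"Science"},
    "Science": set(),  # no parents: treat as a top-level "main" category
}

def main_categories(category, parents=PARENTS):
    """Walk up the is-a DAG and return the reachable roots
    (categories with no parents)."""
    roots, seen, queue = set(), {category}, deque([category])
    while queue:
        node = queue.popleft()
        node_parents = parents.get(node, set())
        if not node_parents:
            roots.add(node)
        for parent in node_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots

print(main_categories("Quantum mechanics"))  # {'Science'}
```

Because the cleaned network is acyclic, the upward walk is guaranteed to terminate, which is exactly what the recursive chaining in the raw category graph does not give you.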

* We have a research showcase presentation on the above, if that
helps (the first presentation in the video; it runs ~30 min):
https://www.youtube.com/watch?v=ACevHs0sMMw

* The code for removing cycles is at
https://github.com/epfl-dlab/GraphCyclesRemoval

* The code for the pruning method is at https://github.com/epfl-dlab/WCNPruning

* We have done a silent ;) release of the dataset of the paper at
https://figshare.com/articles/Structuring_Wikipedia_Articles_with_Section_Recommendations/6157583

If you want the already-cleaned category network in the form of DAGs
based on a 2017 snapshot (and if it's not already in one of the links
above; I'm blanking right now), we should be able to send it your way.
Just say the word.

If the category prediction becomes too hairy and you have more than a
week left, ;) ping me and I'd be happy to brainstorm other questions
you could consider. (One that comes to mind: characterizing articles,
say in English Wikipedia, that have not been read often in the past
six months, and, if you have time, contrasting them with those that
have been read often.)

> Also, thank you very much for taking the time to respond to me!

No worries. :)

Good luck! This class of yours sounds really exciting.

Leila

>
> Thank you,
> Jeff Levesque
>
> -Original Message-
> From: Leila Zia <le...@wikimedia.org>
> Sent: Wednesday, May 23, 2018 7:34 PM
> To: Jeffrey Levesque <jleve...@syr.edu>
> Cc: Wikimedia Answers <answ...@wikimedia.org>; A mailing list for the 
> Analytics Team at WMF and everybody who has an interest in Wikipedia and 
> analytics. <analytics@lists.wikimedia.org>
> Subject: Re: Jeff Levesque: List of Articles By Categories (College Project)
>
> + Analytics, our public analytics related mailing list [1]
>
> Hi Jeff,
>
> Let me give it a try:
>
> * Re pageviews: a lot has changed since the Kaggle contest days you refer to. 
> :) I highly recommend you check out 
> https://dumps.wikimedia.org/other/pagecounts-ez/ where our hourly pageviews 
> per article live. In case you need it, abbreviations used in the file names 
> are documented. [2]
>
> * Can you expand more on what you are trying to do? The short answer for your 
> category related question is that you have to parse XML dumps, but we may 
> have some good pointers for you to save you from that. If you tell us more, 
> we're more likely to be able to help.
>
> * And, if you decide to continue research on Wiki(m|p)edia data (which I hope 
> you do:), consider signing up in our public research list at 
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>
> Best,
> Leila
>
> [1] https://lists.wikimedia.org/mailman/listinfo/analytics
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews
>
> --
> Leila Zia
> Senior Research Scientist, Lead
> Wikimedia Foundation
>
>
> On Wed, May 23, 2018 at 3:22 PM, Wikimedia Answers <answ...@wikimedia.org> 
> wrote:
>> Forwarding for your evaluation :) Feel free to include the wider
>> Research team.
>>
>> best,
>> Joe
>>
>> -- Forwarded message --
>> From: Jeffrey Levesque <jleve...@syr.edu>
>> Date: Tue, May 22, 2018 at 7:48 AM
>> Subject: Re: Jeff Levesque: List of Articles By Categories (College
>> Project)
>> To: "info...@wikimedia.org" <info...@wikimedia.org>
>> Cc: "answ...@wikimedia.org" <answ...@wikimedia.org>
>>
>>
>> Hi,
>> Is there a known API, where I can supply the arti

Re: [Analytics] Jeff Levesque: List of Articles By Categories (College Project)

2018-05-23 Thread Leila Zia
+ Analytics, our public analytics related mailing list [1]

Hi Jeff,

Let me give it a try:

* Re pageviews: a lot has changed since the Kaggle contest days you
refer to. :) I highly recommend you check out
https://dumps.wikimedia.org/other/pagecounts-ez/ where our hourly
pageviews per article live. In case you need it, abbreviations used in
the file names are documented. [2]
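As a rough illustration of working with those per-article dumps, here is a minimal sketch that tallies the top articles for one project. The `project title total encoded_hourly_counts` line layout is an assumption for the sake of the example; verify the exact field order and abbreviations against the documentation linked above before relying on it.

```python
import heapq

def top_articles(lines, project="en.z", n=3):
    """Toy parser for pagecounts-ez per-article lines.

    Assumes whitespace-separated fields
    `project title total encoded_hourly_counts`; check the
    README at dumps.wikimedia.org for the real layout.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3 or parts[0] != project:
            continue  # other projects / malformed lines
        title, total = parts[1], int(parts[2])
        counts[title] = counts.get(title, 0) + total
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

sample = [
    "en.z Main_Page 900 ...",
    "en.z Physics 400 ...",
    "de.z Physik 300 ...",
]
print(top_articles(sample, n=2))  # [('Main_Page', 900), ('Physics', 400)]
```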

* Can you expand more on what you are trying to do? The short answer for
your category related question is that you have to parse XML dumps,
but we may have some good pointers for you to save you from that. If
you tell us more, we're more likely to be able to help.

* And, if you decide to continue research on Wiki(m|p)edia data (which
I hope you do:), consider signing up in our public research list at
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Best,
Leila

[1] https://lists.wikimedia.org/mailman/listinfo/analytics
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews

--
Leila Zia
Senior Research Scientist, Lead
Wikimedia Foundation


On Wed, May 23, 2018 at 3:22 PM, Wikimedia Answers
<answ...@wikimedia.org> wrote:
> Forwarding for your evaluation :) Feel free to include the wider Research
> team.
>
> best,
> Joe
>
> -- Forwarded message --
> From: Jeffrey Levesque <jleve...@syr.edu>
> Date: Tue, May 22, 2018 at 7:48 AM
> Subject: Re: Jeff Levesque: List of Articles By Categories (College Project)
> To: "info...@wikimedia.org" <info...@wikimedia.org>
> Cc: "answ...@wikimedia.org" <answ...@wikimedia.org>
>
>
> Hi,
> Is there a known API where I can supply the article name and obtain the
> corresponding "category" the article belongs to? I'm thinking I could write
> a python script to iterate over the kaggle dataset, then send a request to
> some existing API, to determine each article's "category".
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
>
> On May 22, 2018, at 10:37 AM, Jeffrey Levesque <jleve...@syr.edu> wrote:
>
> Hi,
> Do you guys have a more recent time series of Wikipedia article traffic? I'm
> noticing that the kaggle dataset does not include a lot of articles that are
> on Wikipedia. Do you have a good idea of how I can categorize the dataset
> I have?
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
>
> On May 22, 2018, at 8:40 AM, Jeffrey Levesque <jleve...@syr.edu> wrote:
>
> Hi,
>
> I am a master's student at Syracuse University. For my data science class, I
> am doing a project trying to analyze traffic patterns for Wikipedia. I’ve
> obtained the Kaggle dataset for 2015-2016 data:
>
>
>
> https://www.kaggle.com/headsortails/wiki-traffic-forecast-exploration-wtf-eda/data
>
>
>
> However, the dataset only provides the frequency of visits to particular
> pages on a given day. Could I request a list of articles grouped
> by “Categories”? I’ve tried to use the API (i.e.
> https://en.wikipedia.org/wiki/Special:Export). But, that doesn’t seem to
> generate a full output. Additionally, the list it supplies includes
> subcategories. So, I tried using the URL API (i.e.
> https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics&format=json).
> But, that also seems to return an even shorter result set:
>
>
>
> {"batchcomplete":"","continue":{"cmcontinue":"page|2d2941313f2b292d3d0447454f31434f39293f011701dc16|55503653","continue":"-||"},"query":{"categorymembers":[{"pageid":22939,"ns":0,"title":"Physics"},{"pageid":24489,"ns":0,"title":"Outline
> of physics"},{"pageid":3445246,"ns":0,"title":"Glossary of classical
> physics"},{"pageid":1653925,"ns":100,"title":"Portal:Physics"},{"pageid":50926902,"ns":0,"title":"Action
> angle
> coordinates"},{"pageid":9079863,"ns":0,"title":"Aerometer"},{"pageid":52657328,"ns":0,"title":"Bayesian
> model of computational anatomy"},{"pageid":49342572,"ns":0,"title":"Group
> actions in computational
> anatomy"},{"pageid":50724262,"ns":0,"title":"Blasius\u2013Chaplygin
> formula"},{"pageid":33327002,"ns":0,"title":"Cabbeling"}]}}
>
>
>
>
>
> Thank you,
>
> Jeff Levesque
>
> (603) 969-5363
>
>
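The short result set in the message above is the API's pagination at work: `list=categorymembers` caps each response and hands back a `cmcontinue` token for the next page. A sketch of following that token until the category is exhausted (the `fetch` hook is only there so the loop can be exercised without network access):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def _http_fetch(params):
    """Issue one API request and decode the JSON response."""
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def category_members(category, fetch=_http_fetch):
    """Yield every member of a category, following `cmcontinue`
    continuation tokens across requests."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(dict(params))
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries cmcontinue forward

# Example use (hits the live API):
# for page in category_members("Category:Physics"):
#     print(page["title"])
```

Note that this enumerates direct members only; subcategories come back as their own entries and would need a separate recursive walk.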



Re: [Analytics] Content of wmf.wdqs_extract

2018-05-08 Thread Leila Zia
A couple of pointers as Stas was not involved in the details of the
extraction.

Adrian: you can dig the history behind the extraction at
https://phabricator.wikimedia.org/T146064

Please also check the code at https://gerrit.wikimedia.org/r/#/c/311964/
for details, specifically wdqs_extract.hql .

Best,
Leila



On Mon, May 7, 2018, 18:15 Andrew Otto  wrote:

> CCing Stas, he might know more.
>
> On Sun, May 6, 2018 at 9:58 AM, Adrian Bielefeldt <
> adrian.bielefe...@mailbox.tu-dresden.de> wrote:
>
>> Hello everyone,
>>
>> I wanted to ask if anyone can tell me what wmf.wdqs_extract contains. I
>> know generally that it is the query log of the SPARQL endpoint. However,
>> I do not know if it is all requests, only uncached requests etc.
>>
>> If anyone knows or knows where I can read up on it that would be great.
>>
>> Greetings,
>>
>> Adrian
>>
>>


Re: [Analytics] How best to accurately record page interactions in Page Previews

2018-04-12 Thread Leila Zia
Thank you, Tilman. This is very helpful.

Leila


On Thu, Feb 8, 2018 at 1:50 AM, Tilman Bayer <tba...@wikimedia.org> wrote:

> Hi Leila,
>
> On Wed, Jan 17, 2018 at 10:46 AM, Leila Zia <le...@wikimedia.org> wrote:
>
>> Hi Sam,
>>
>> On Wed, Jan 17, 2018 at 1:51 AM, Sam Smith <samsm...@wikimedia.org>
>> wrote:
>>
>> > IMO #1 is preferable from the operations and performance perspectives
>> as the
>> > response is always served from the edge and includes very few headers,
>> > whereas the request in #2 may be served by the application servers if
>> the
>> > user is logged in (or in the mobile site's beta cohort). However, the
>> > requests in #2 are already
>>
>> It seems the sentence above is cut, can you resend it?
>>
>> > We're currently considering recording page interactions when previews
>> are
>> > open for longer than 1000 ms. We estimate that this would increase
>> overall
>> > web requests by 0.3% [3].
>>
>> Can you say some words about how the 1000 ms threshold is chosen?
>
> This is a good question, sorry that it got buried earlier. (It's kind of
> orthogonal though to the technical instrumentation questions that have been
> the focus of attention: as indicated by the capital X in Sam's initial
> post, we can still decide to fine-tune that threshold right now, it's just
> a parameter change.)
>
> This kind of threshold necessarily needs to be set somewhat arbitrarily,
> in the sense that there will always be either cases where some content was
> already read/perceived in a preview card shown for a shorter time, or cases
> where a reader needed a longer time to consume any content from the card.
> We picked a time by which we can be reasonably certain that at least some
> readers can consume content (read some words, perceive an image). It's not
> the result of an exact calculation to find the provably best limit. But we
> did have a look at the frequency of the different user actions over time
> during the first seconds after they start to hover over a link. In case
> you're interested, I recently updated those charts with better quality data
> from our latest two tests, e.g:
> https://phabricator.wikimedia.org/F12940888
> https://phabricator.wikimedia.org/F13134460 (a zoomed-in look at the same
> histogram)
>
> The following is just eyeballing and thinking aloud, but one way to view
> this histogram is as the sum of several distributions associated with
> different user intentions:
> 1. Most of the time when our instrumentation registered the cursor moving
> over a link, the user was just on their way to a different part of the
> screen (with no intention of either clicking that link or viewing the
> preview). That's mostly the huge yellow spike on the left -
> "dwelledButAbandoned" meaning that the cursor left the link without either
> clicking it or causing a preview to show. The feature involves a 500ms
> delay before the preview card begins to display, so that we don't bother
> that group too much. (Only the right tail end of that distribution, folks
> moving the cursor very slowly, will be affected, where things morph from
> yellow into purple.)
> 2. Then there are users who want to click the link without viewing the
> preview, forming all of the green part left of 500ms and an unknown portion
> to the right of it (after the card starts to show, some of these "open"
> actions will instead happen after the user intentionally viewed the card,
> case 3.).
> 3. And there are users who intentionally view a preview. The little bump
> in the purple part ("dismissed" meaning that the preview was shown and then
> closed by moving the cursor away) at about 1100ms indicates that the
> distribution for that user group also peaks somewhere there, maybe a few
> 100ms to the right. That would mean that our 1000ms threshold (i.e. only
> counting the part of the histogram right of 1500ms = 500ms + 1000ms as seen
> previews) is actually right of that distribution's peak. I.e. that the
> threshold is in some sense quite conservative.
>
> Like I said, this is all of course still a bit handwavy; it involves some
> assumptions about the form of these distributions, as well as disregarding
> some other information for now that can give a fuller picture (in
> particular the analogous histogram for link interaction behavior without
> page previews being active, which we also have from our A/B tests).
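The thresholds discussed above (a 500 ms delay before the card renders, plus a 1000 ms minimum open time) can be restated as a toy classifier; the label names here are illustrative, not the instrumentation's actual event names.

```python
DISPLAY_DELAY_MS = 500    # card starts rendering after this hover time
SEEN_THRESHOLD_MS = 1000  # card must stay open this long to count as "seen"

def classify_hover(hover_ms, clicked):
    """Toy classifier for a single link interaction, mirroring the
    thresholds described in the thread above."""
    if clicked:
        return "opened link"
    if hover_ms < DISPLAY_DELAY_MS:
        return "dwelled but abandoned"       # card never rendered
    if hover_ms - DISPLAY_DELAY_MS >= SEEN_THRESHOLD_MS:
        return "preview seen"                # card open for >= 1000 ms
    return "preview dismissed early"

print(classify_hover(2000, clicked=False))  # preview seen
```

In this framing, the conservative choice described above is visible directly: only hovers of at least 1500 ms total ever count as a seen preview.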
>
>
>> Is
>> this based (partially) on looking at traces where a user-agent goes to
>> a page and returns to the "source" article?
>>
> We did an analysis of that user behavior, but not regarding

Re: [Analytics] [Services] Getting more than just 1000 top articles from REST API

2018-04-02 Thread Leila Zia
On Mon, Apr 2, 2018 at 7:47 AM, Dan Andreescu <dandree...@wikimedia.org>
wrote:

> Hi Srdjan,
>
> The data pipeline behind the API can't handle arbitrary skip or limit
> parameters, but there's a better way for the kind of question you have.  We
> publish all the pageviews at https://dumps.wikimedia.org/
> other/pagecounts-ez/, look at the "Hourly page views per article"
> section.  I would imagine for your use case one month of data is enough,
> and you can get the top N articles for all wikis this way, where N is
> anything you want.
>

One suggestion here: if you want to find articles that are consistently
high-pageview (and not part of spikes/trending views), increase the
time window to six months or longer.

Best,
Leila

--
Leila Zia
Senior Research Scientist, Lead
Wikimedia Foundation
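A minimal sketch of the "consistently high-pageview" filter suggested above, assuming per-article monthly totals have already been aggregated from the dumps (the article names and the threshold are illustrative):

```python
def consistently_popular(monthly_views, months=6, threshold=10_000):
    """Keep articles whose views meet `threshold` in every one of the
    last `months` months -- high baseline interest rather than a spike."""
    return sorted(
        title for title, views in monthly_views.items()
        if len(views) >= months and min(views[-months:]) >= threshold
    )

views = {
    "Physics":    [12_000, 15_000, 11_000, 13_000, 12_500, 14_000],
    "News_spike": [200, 300, 90_000, 400, 250, 300],
}
print(consistently_popular(views))  # ['Physics']
```

Using the minimum over the window (rather than the mean) is what filters out the spike: one viral month cannot compensate for a low baseline.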


Re: [Analytics] How best to accurately record page interactions in Page Previews

2018-01-17 Thread Leila Zia
Hi Sam,

On Wed, Jan 17, 2018 at 1:51 AM, Sam Smith  wrote:

> IMO #1 is preferable from the operations and performance perspectives as the
> response is always served from the edge and includes very few headers,
> whereas the request in #2 may be served by the application servers if the
> user is logged in (or in the mobile site's beta cohort). However, the
> requests in #2 are already

It seems the sentence above is cut, can you resend it?

> We're currently considering recording page interactions when previews are
> open for longer than 1000 ms. We estimate that this would increase overall
> web requests by 0.3% [3].

Can you say some words about how the 1000 ms threshold is chosen? Is
this based (partially) on looking at traces where a user-agent goes to
a page and returns to the "source" article?

Thanks,
Leila

>
> [0] https://lists.wikimedia.org/pipermail/analytics/2015-March/003633.html
> [1]
> https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb;1bce79d58e03bd02888beef986c41989e8345037$269
> [2] https://wikitech.wikimedia.org/wiki/X-Analytics
> [3] https://phabricator.wikimedia.org/T184793#3901365
>


Re: [Analytics] Wikipedia aggregate clickstream data released

2018-01-16 Thread Leila Zia
Hi all,

For archive happiness:

The clickstream dataset is now being generated on a monthly basis for 5
Wikipedia languages (English, Russian, German, Spanish, and Japanese). You
can access the data at https://dumps.wikimedia.org/other/clickstream/ and
read more about the release and those who contributed to it at
https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/
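As a quick illustration of working with the dump, here is a sketch that finds the most-followed links out of an article, assuming rows in the published (prev, curr, type, n) layout; the example rows and counts are invented.

```python
def top_outlinks(rows, source, n=3):
    """rows: (prev, curr, type, count) tuples as in the clickstream
    dump; returns the most-clicked article-to-article links out of
    `source`, ignoring external referrers."""
    hits = [(count, curr) for prev, curr, kind, count in rows
            if prev == source and kind == "link"]
    return [curr for count, curr in sorted(hits, reverse=True)[:n]]

rows = [
    ("London", "River_Thames", "link", 4200),
    ("London", "England", "link", 9100),
    ("other-search", "London", "external", 500_000),
]
print(top_outlinks(rows, "London"))  # ['England', 'River_Thames']
```

For real use you would stream the (tab-separated) dump file instead of holding tuples in memory, but the aggregation is the same.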

Best,
Leila



--
Leila Zia
Senior Research Scientist
Wikimedia Foundation

On Tue, Feb 17, 2015 at 11:00 AM, Dario Taraborelli <
dtarabore...@wikimedia.org> wrote:

> We’re glad to announce the release of an aggregate clickstream dataset
> extracted from English Wikipedia
>
> http://dx.doi.org/10.6084/m9.figshare.1305770
>
> This dataset contains counts of *(referer, article) *pairs aggregated
> from the HTTP request logs of English Wikipedia. This snapshot captures 22
> million *(referer, article)* pairs from a total of 4 billion requests
> collected during the month of January 2015.
>
> This data can be used for various purposes:
> • determining the most frequent links people click on for a given article
> • determining the most common links people followed to an article
> • determining how much of the total traffic to an article clicked on a
> link in that article
> • generating a Markov chain over English Wikipedia
>
> We created a page on Meta for feedback and discussion about this release:
> https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
>
> Ellery and Dario
>


Re: [Analytics] Research Showcase Wednesday, November 15, 2017 at 11:30 AM (PST) 18:30 UTC

2017-11-15 Thread Leila Zia
On Wed, Nov 15, 2017 at 11:02 AM, Jan Ainali  wrote:
> Wasn't 18:30 UTC 30 minutes ago?

That seems to be a typo. It's at 19:30 UTC. Sorry about that.

Best,
Leila

>
> Med vänliga hälsningar
> Jan Ainali
> http://ainali.com
>
> 2017-11-15 19:53 GMT+01:00 Sarah R :
>>
>> Hi Everyone,
>>
>> Just a reminder that this will start at 11:30 AM (Pacific), 18:30 UTC.
>>
>> Kindly,
>>
>> Sarah R.
>>
>> On Thu, Nov 9, 2017 at 3:34 PM, Sarah R  wrote:
>>>
>>> Hi Everyone,
>>>
>>> The next Research Showcase will be live-streamed this Wednesday, November
>>> 15, 2017 at 11:30 AM (PST) 18:30 UTC.
>>>
>>> YouTube stream:  https://www.youtube.com/watch?v=nMENRAkeHnQ
>>>
>>> As usual, you can join the conversation on IRC at #wikimedia-research.
>>> And, you can watch our past research showcases here.
>>>
>>> This month's presentation:
>>>
>>> Conversation Corpora, Emotional Robots, and Battles with Bias
>>> By Lucas Dixon (Google/Jigsaw)
>>>
>>> I'll talk about interesting experimental setups for
>>> doing large-scale analysis of conversations in Wikipedia, and what it even
>>> means to grapple with the concept of conversation when one is talking about
>>> revisions on talk pages. I'll also describe challenges with having good
>>> conversations at scale, some of the dreams one might have for AI in the
>>> space, and I'll dig into measuring unintended bias in machine learning and
>>> what one can do to make ML more inclusive. This talk will cover work from
>>> the WikiDetox project as well as ongoing research on the nature and impact
>>> of harassment in Wikipedia discussion spaces – part of a collaboration
>>> between Jigsaw, Cornell University, and the Wikimedia Foundation. The ML
>>> model training code, datasets, and the supporting tooling developed as part
>>> of this project are openly available.
>>>
>>>
>>> Many kind regards,
>>>
>>> Sarah R. Rodlund
>>> Senior Project Coordinator-Product & Technology, Wikimedia Foundation
>>> srodl...@wikimedia.org
>>>
>>>
>>>
>>
>>
>>
>> --
>> Sarah R. Rodlund
>> Senior Project Coordinator-Product & Technology, Wikimedia Foundation
>> srodl...@wikimedia.org
>>
>> “Our lives begin to end the day we become silent about things that
>> matter.”  ~ Martin Luther King Jr
>>
>>


Re: [Analytics] research process (was Re: Google Code-in: Get your tasks for young contributors prepared!)

2017-11-03 Thread Leila Zia
Hi Lars,

On Fri, Nov 3, 2017 at 4:46 AM, Lars Noodén <lars.noo...@gmail.com> wrote:

> The research page ( https://meta.wikimedia.org/wiki/Research ) seems to
> be automatically generated.
>
> How would I go about finding the next step towards establishing a
> project?


I assume by establishing a project you mean finding a way to get access to
the data that your research proposal is going to use. If that is correct:



> I now have a preliminary draft of a proposal:
>
> https://meta.wikimedia.org/wiki/Research:Finding_Search_
> Engine_Terms_Used_to_Retrieve_Wikibooks


I will review this page and get back to you next week. To set
expectations: all I can promise is that we will review the page and discuss
if we can find a light-weight format to help you with it. I can't promise
that we can actually make it happen as the resources are very tight on our
end. We will do our best.

The ticket for tracking this task is
https://phabricator.wikimedia.org/T179693 .

Best,
Leila

--
Leila Zia
Senior Research Scientist
Wikimedia Foundation



>
>
> /Lars
>


Re: [Analytics] Analytics project request

2017-07-24 Thread Leila Zia
Hi Daniel,

I reviewed your request.

== Context ==
* The data you're asking for is one of the most frequently requested
data-sets. We also receive quite a bit of interest for that data
specifically for the general research direction you're interested in.
* Resources are highly limited on our end. Every formal collaboration will
need to be created taking into account this constraint and the commitments
we have already made.


== When can Research sign up for formal collaborations? ==
At least one of the conditions below should hold for us to be able to
consider creating a new formal collaboration at this point in time:
* The outside research is (tightly) aligned with one of our annual plan
commitments (for the period of July 1, 2017 to June 30, 2018). [1]
* If a researcher in Research team picks up a specific direction for
exploration based on their expertise/interest.
* If access to data is broadly agreed upon as strategic for humanity. The
examples in this direction are rare, but to give you a sense: if there is
an epidemic and we know, with some certainty, that the data we have can
help control it or help us understand the research and development in that
space.

== Access to data ==
At this point, unfortunately, we cannot create a formal collaboration for
your request. I hope this email can convey our disappointment at having to
deliver this message. :(

That being said, I think there is one dataset that can be helpful for your
research, and that's the Wikipedia Clickstream dataset. [2] You can use
that dataset to compute the transition probability of moving from one
English Wikipedia article to another. The data is not refreshed
frequently, but refreshing it at specific snapshots in time is something
we can consider. Please work with the dataset, if you haven't, and let us
know if it can be of help for you.
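A sketch of the transition-probability computation mentioned above, assuming (prev, curr, count) rows aggregated from the clickstream dump; the article names and counts below are invented for illustration.

```python
from collections import defaultdict

def transition_probabilities(rows):
    """rows: (prev, curr, count) pairs aggregated from the clickstream.
    Returns the maximum-likelihood estimate of P(curr | prev),
    i.e. each pair's count divided by the total clicks out of prev."""
    totals = defaultdict(int)
    for prev, _curr, count in rows:
        totals[prev] += count
    return {(prev, curr): count / totals[prev]
            for prev, curr, count in rows}

rows = [
    ("Physics", "Quantum_mechanics", 300),
    ("Physics", "Isaac_Newton", 100),
]
probs = transition_probabilities(rows)
print(probs[("Physics", "Quantum_mechanics")])  # 0.75
```

These probabilities are exactly the row-normalized Markov chain over articles that the clickstream release notes describe as one of its intended uses.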

Best,
Leila


[1] All programs Research has committed to are listed below. The specific
objectives within each program that Research has signed up for are at
https://phabricator.wikimedia.org/tag/research-programs/

Program 4
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_4:_Technical_community_building

Program 7
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_7._Smart_tools_for_better_data

Program 9
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_9:_Growing_Wikipedia_across_languages_via_recommendations

Program 11
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_11:_Improving_citations_across_Wikimedia_projects

Program 12
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Programs/Technology#Program_12:_Grow_contributor_diversity

CD - Community Health
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Community_Health#Segment_3:_Research_on_harassment

CD - Structured Data
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/Final/Structured_Data#Segment_4:_Programs


[2] https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream


--
Leila Zia
Senior Research Scientist
Wikimedia Foundation

On Mon, Jul 24, 2017 at 9:24 AM, Leila Zia <le...@wikimedia.org> wrote:

> I'll review Daniel's email and will get back to him/you on this list
> in the next day or so.
>
> Leila
>
> --
> Leila Zia
> Senior Research Scientist
> Wikimedia Foundation
>
>
> On Mon, Jul 24, 2017 at 7:59 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
> > Daniel,
> >
> > Signing an NDA is not enough to get access to the data; you also need
> > to be part of a formal research collaboration with our research team.
> > They have a number of those and are not likely to accept any more soon,
> > but you can contact them in that regard:
> > https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
> >
> > Thanks,
> >
> > Nuria
> >
> >
> >
> > On Mon, Jul 24, 2017 at 6:37 AM, Daniel Oberski <
> daniel.ober...@gmail.com>
> > wrote:
> >>
> >> Dear list,
> >>
> >> I'm posting a recent conversation with Dan below, as well as a few
> >> follow-up questions.
> >>
> >> Dan was kind enough to point out this list. I apologize that the post is
> >> "backward" (in
> >> email-thread format) due to my ignorance, will use this list from now
> on.
> >>
> >> Thanks, Daniel
> >>
> >>
> >> 
> >>
> >> Hi Dan
> >>
> >>
> >> Thanks for getting back to me so quickly!
> >>
> >> >Thanks for writing.  In genera

Re: [Analytics] Analytics project request

2017-07-24 Thread Leila Zia
I'll review Daniel's email and will get back to him/you on this list
in the next day or so.

Leila

--
Leila Zia
Senior Research Scientist
Wikimedia Foundation


On Mon, Jul 24, 2017 at 7:59 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
> Daniel,
>
> Signing an NDA is not enough to get access to the data; you also need to
> be part of a formal research collaboration with our research team. They
> have a number of those and are not likely to accept any more soon, but
> you can contact them in that regard:
> https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
>
> Thanks,
>
> Nuria
>
>
>
> On Mon, Jul 24, 2017 at 6:37 AM, Daniel Oberski <daniel.ober...@gmail.com>
> wrote:
>>
>> Dear list,
>>
>> I'm posting a recent conversation with Dan below, as well as a few
>> follow-up questions.
>>
>> Dan was kind enough to point out this list. I apologize that the post is
>> "backward" (in
>> email-thread format) due to my ignorance, will use this list from now on.
>>
>> Thanks, Daniel
>>
>>
>> 
>>
>> Hi Dan
>>
>>
>> Thanks for getting back to me so quickly!
>>
>> >Thanks for writing.  In general these questions are best asked on our
>> > public list, so other
>> >people can see and benefit from any answers:
>> > https://lists.wikimedia.org/mailman/listinfo/
>> >analytics
>>
>> Thanks, I've joined this list and will ask subsequent questions there.
>>
>> >* pairs of pages: we have two datasets that are mentioned in this task
>> > https://
>> >phabricator.wikimedia.org/T158972 which should be very interesting for
>> > this purpose.  They
>> >aren't being updated right now, and the task is to do just that.  We'll
>> > probably get to
>> >that within the next 3 months, but a bunch of us are on paternity leave
>> > this summer, so
>> >things are a little slower than normal
>>
>> This seems close to what I need. From the descriptions I gather the
>> linkage is by session.
>> Is there also a linkage by ip (with IP's removed of course)?
>>
>> >* country data for pageviews: for privacy reasons we only allow access to
>> > this with an
>> >NDA.  We have good data on it, but you need to sign this NDA and use our
>> > cluster to access
>> >it, being careful about what you report about it to the world at large.
>> > Here's information
>> >on that: https://wikitech.wikimedia.org/wiki/Volunteer_NDA
>>
>> I've read this and am happy to sign an NDA. I understand it is best to be
>> as specific as
>> possible about the reasoning, intentions with the data, and permissions
>> required. For me to
>> figure this out it would be useful to know the relevant parts of the
>> database schema, and
>> perhaps a hint as to which data might be most interesting there. Would you
>> be able to point
>> me towards that?
>>
>> >Hope that helps, and feel free to write back to the public list in the
>> > future.
>>
>> Definitely, very helpful and thank you!
>>
>> Best, Daniel
>>
>>
>> On Wed, Jul 19, 2017 at 9:51 AM, Oberski, D.L. (Daniel)
>> <d.l.ober...@uu.nl> wrote:
>> Dear Dan,
>>
>>
>> My name is Daniel Oberski, I'm an associate professor of data science
>> methodology in the
>> department of statistics at Utrecht University in the Netherlands.
>>
>> I've been using your incredibly useful pageviews API to study correlations
>> between the
>> amount of interest people show in a topic (pageviews) with other data such
>> as political
>> party preference over time. That has yielded some interesting results
>> (which I have yet to
>> write up).
>>
>> However, to do a better study it would be very helpful to have slightly
>> more information
>> than is in the API. Specifically, it would be very useful to be able to
>> query, for each
>> _pair_ of pages, how many people (or IP's) viewed _both_ of those pages.
>> That way I can find
>> out which pages are really indicative of interest in a specific common
>> topic, rather than
>> just correlated by accident. In addition, I've found it hard to figure out
>> pageviews for
>> specific pages by country rather than language.
>>
>> My question is, would you happen to know if is there any way to obtain
>> this information?
>> (does not necessarily have to be through the API.) Or do you know if there
>> are people to
>> whom I might talk about this?
>>
>> Thanks for reading (to) the end and best regards,
>>
>> Daniel
>>
>>
>>


Re: [Analytics] new mediawiki_history snapshot available

2017-07-12 Thread Leila Zia
On Wed, Jul 12, 2017 at 12:25 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>>Can you specify what you mean by "next year"? I can think of fiscal,
>>calendar, etc. :)
>
> We are aiming for this data to be public in its current analytics-friendly
> form by the end of 2017 / beginning of 2018.

Thank you!

>
> On Wed, Jul 12, 2017 at 12:22 PM, Leila Zia <le...@wikimedia.org> wrote:
>>
>> On Wed, Jul 12, 2017 at 12:16 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>> > Further clarification that this snapshot of data is not yet public
>> > (meaning
>> > available to the outside world, not just WMF/NAD holders) .
>>
>> Thanks for clarifying this, and for the work you and your team have put
>> into this.
>>
>> > Our team is working towards making this data available next year in labs
>> > in the same
>> > fashion that data is now available on the labs replicas.
>>
>> Can you specify what you mean by "next year"? I can think of fiscal,
>> calendar, etc. :)
>>
>> A big thumbs up for making data public. wiki-research-l list and
>> audience will be happy.
>>
>> Best,
>> Leila
>>
>> >
>> >
>> > Thanks,
>> >
>> > Nuria
>> >
>> > On Wed, Jul 12, 2017 at 9:34 AM, Dan Andreescu
>> > <dandree...@wikimedia.org>
>> > wrote:
>> >>
>> >> Today we announce a new snapshot (named 2017-06) of the mediawiki
>> >> history
>> >> data [1].  It includes these awesome new fields:
>> >>
>> >> event_user_revision_count: 'Cumulative revision count per user for the
>> >> current event_user_id (only available in revision-create events so
>> >> far)'
>> >>
>> >> page_revision_count: 'In revision/page events: Cumulative revision
>> >> count
>> >> per page for the current page_id (only available in revision-create
>> >> events
>> >> so far)'
>> >>
>> >> The event_user_revision_count field is useful as a close estimate of
>> >> user_editcount, but it does not include Flow talk page edits.
>> >> We've also added event_user_seconds_to_previous_revision and
>> >> page_seconds_to_previous_revision, but those are not being computed
>> >> right
>> >> now.
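
As a rough illustration of the semantics of these cumulative fields, a per-user running revision count can be computed from time-ordered revision-create events as below. This is a hypothetical sketch: the tuple layout is invented for the example and is not the actual mediawiki_history schema.

```python
from collections import defaultdict

def cumulative_revision_counts(events):
    """events: (timestamp, user_id) tuples sorted by timestamp.

    Yields (timestamp, user_id, running_count_for_that_user),
    mirroring what event_user_revision_count represents.
    """
    counts = defaultdict(int)
    for ts, user in events:
        counts[user] += 1
        yield ts, user, counts[user]

events = [(1, "alice"), (2, "bob"), (3, "alice"), (4, "alice")]
rows = list(cumulative_revision_counts(events))
# rows[-1] == (4, "alice", 3): alice's third revision-create event
```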
>> >>
>> >> The mediawiki_history dataset is updated every month, but we thought
>> >> we'd
>> >> let you know about this one since it has new goodies.  It's all thanks
>> >> to
>> >> Joseph who did everything but announce this wonderful work and then had
>> >> to
>> >> rush away to welcome his daughter into the world.  Hi Joseph!  Stop
>> >> reading
>> >> work email! :D
>> >>
>> >>
>> >> [1]
>> >>
>> >> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>> >>
>> >> ___
>> >> Analytics mailing list
>> >> Analytics@lists.wikimedia.org
>> >> https://lists.wikimedia.org/mailman/listinfo/analytics
>> >>
>> >
>> >
>> > ___
>> > Analytics mailing list
>> > Analytics@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>> >
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>

___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wiki-research-l] Wikipedia Detox: Scaling up our understanding of harassment on Wikipedia

2017-06-22 Thread Leila Zia
Hi Pine,

On Thu, Jun 22, 2017 at 2:03 AM, Pine W <wiki.p...@gmail.com> wrote:

>
> At the same time, I'd appreciate getting a more precise understanding of
> how WMF is defining the word "harassment".
>

This is a policy question, and the wiki-research-l and analytics mailing
lists are not the best place to discuss it. (I'm not sure where the right
venue is; maybe the public policy list?)

If you are interested in learning how a specific research effort has
handled this question, I suggest you reach out to the researchers in that
effort. In the case of Ex Machina: Personal Attacks Seen at Scale
<https://arxiv.org/abs/1610.08914>, Section 3 gives a relatively detailed
description of how this question was approached.

Best,
Leila

Pine
>
> On Wed, Jun 21, 2017 at 2:08 PM, Leila Zia <le...@wikimedia.org> wrote:
>
>> Hi Dan,
>>
>> Thanks for your note. :)
>>
>> On the Research end, Dario is still a big supporter of the efforts
>> around research to help us better understand harassment (as you
>> noticed in our commitments to the annual plan) and with Ellery's
>> departure, I've been helping him a bit to make sure we can move
>> forward on this front. More specifically, and while we're continuing
>> the research with Nithum and Lucas who were Ellery's collaborators on
>> the Detox project, we recently initiated
>> https://meta.wikimedia.org/wiki/Research:Study_of_harassment
>> _and_its_impact
>> with Cristian and Yiqing from Cornell University. We are very excited
>> about this new collaboration as Cristian has years of experience in
>> spaces that are very relevant to the socio-technical problems related
>> to harassment. I think you will enjoy reading that page, which signals
>> the early directions of the research.
>>
>> The whole harassment research team meets every two weeks; if you're
>> curious about what's going on on this front on our end and you want to
>> listen in, please ping me. And thank you for the offer to help. We
>> may take you up on that. :)
>>
>> Best,
>> Leila
>>
>> --
>> Leila Zia
>> Senior Research Scientist
>> Wikimedia Foundation
>>
>>
>> On Wed, Jun 21, 2017 at 7:55 PM, Toby Negrin <tneg...@wikimedia.org>
>> wrote:
>> > Hi Dan -- we are actually in touch with Detox as part of the Community
>> > Health initiative. They are doing their first quarterly check in this
>> > quarter so expect some updates then. Ping me offlist if you want more
>> info.
>> >
>> > -Toby
>> >
>> > On Wed, Jun 21, 2017 at 10:48 AM, Dan Andreescu <
>> dandree...@wikimedia.org>
>> > wrote:
>> >>
>> >> I'm reflecting on this work and how awesome it was.  I see that it's
>> >> continued in our annual plan under the Community Health Initiative,
>> but I
>> >> am afraid it's taking a secondary role without Ellery and others to
>> drive
>> >> it.  On
>> >> https://meta.wikimedia.org/wiki/Community_health_initiative/
>> AbuseFilter
>> >> it's only featured as a question under the #Functionality section.
>> >>
>> >> I just wanted to point this out and offer to help if I can be of use.
>> >>
>> >> On Tue, Feb 7, 2017 at 5:16 PM, Ellery Wulczyn <ewulc...@wikimedia.org
>> >
>> >> wrote:
>> >>
>> >> > Today we are announcing
>> >> >
>> >> > <https://blog.wikimedia.org/2017/02/07/scaling-understanding
>> -of-harassment/>
>> >> > the
>> >> > first results of the collaboration between Wikimedia Research and
>> Jigsaw
>> >> > on
>> >> > modeling personal attacks and other forms of harassment on English
>> >> > Wikipedia. We have released
>> >> > <https://figshare.com/projects/Wikipedia_Talk/16731> a corpus of 95M
>> >> > user
>> >> > and article talk page comments as well as over 1M human labels
>> produced
>> >> > by
>> >> > 4000 crowd-workers for a set of 100k comments. Documentation on our
>> >> > methodology and future work can be found in our paper Ex Machina:
>> >> > Personal Attacks Seen at Scale <https://arxiv.org/abs/1610.08914>
>> (to
>> >> > appear at WWW2017) and on our project page on meta
>> >> > <https://meta.wikimedia.org/wiki/Research:Detox>. If you are
>> interested
>> >> > in contributing to the project, please get in touch 

Re: [Analytics] web log data

2017-03-06 Thread Leila Zia
Hi Genevieve,

This is Leila from Research. Thanks for reaching out.

Access to non-public data through the Research team happens if we create a
formal research collaboration with you and your team. Whether a formal
collaboration can be created depends on certain requirements being met [1]
and on the Research team's capacity. At the moment, our capacity is very
limited, and you have a specific research question in mind that you want to
address. Unfortunately, I don't see a way for us to accommodate your
request at this point.

As Dan said, we will be looking to improve our algorithms for bot detection
in the next 3-6 months. If you'd like to be informed about that research
when we are closer to picking it up, and you're interested in collaborating
with us on it, please ping me off-list and we will get in touch with you in
a few months.

I'm sorry that we cannot be of more help for your research at this point.

Best,
Leila

[1]
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations#How_are_formal_research_collaborations_created.3F

Leila Zia
Senior Research Scientist
Wikimedia Foundation

On Mon, Mar 6, 2017 at 6:28 AM, Dan Andreescu <dandree...@wikimedia.org>
wrote:

> Hi Genevieve & Jelena,
>
> We have a process for working with external researchers, and it starts
> here: https://meta.wikimedia.org/wiki/Research:Access_to_non-public_data
>
> It certainly sounds like the data we have could help you.  We have some
> requirements listed there and your project should get the approval of the
> research team.  You can also check out what other Research projects are
> happening: https://meta.wikimedia.org/wiki/Research:Projects
>
> We (the Analytics engineering team) are very interested in bot detection
> as well.  It might be useful to collaborate.  We have several important use
> cases for which we need to distinguish bot activity from human activity,
> and we were planning on starting that work within one or two quarters.
>
> On Thu, Mar 2, 2017 at 7:15 PM, Genevieve Bartlett <bartl...@isi.edu>
> wrote:
>
>> Hi All -
>>
>> Emanuele Rocca suggested we reach out to you guys and see if you guys
>> would be willing to share web log/content access data.
>>
>> Jelena and I are network security researchers at University of Southern
>> California's Information Sciences Institute. We're working on a project for
>> application-level DDoS defences, and are evaluating our defences for web
>> applications.
>>
>> Our defences model how legitimate users interact with served content and
>> using these models we attempt to differentiate between legitimate users and
>> any attacking bots during high-load (ie a potential attack). Our models are
>> based on the timing between user requests and the semantic connections (or
>> lack thereof) between content requests. More information on our NSF-funded
>> project can be found here: https://steel.isi.edu/Projects/frade/
>>
>> Right now, to collect data for evaluation, we've mirrored several sites
>> (Wikipedia is one of them :) and hired ~200 users to interact with our
>> mirrored sites (for app-level attack data, we simulate attacks). Of course,
>> this isn't the most ideal way of getting data on human-content interaction
>> and we would be thrilled to augment our evaluation with "real world" data.
>>
>> Wikipedia is of particular interest to us given the number of "good"
>> bots that access content and whose access patterns may not exist in our
>> current models trained on human interactions with mirrored Wikipedia
>> content.
>>
>> We would be extremely grateful for any information you are willing to
>> share. We understand and fully support the need to preserve privacy of
>> Wikipedia and Wikipedia users, and we regularly work with anonymized
>> datasets.  If there's any need for NDAs or similar agreements, we are
>> very open to whatever is necessary. In addition to web/access logs, any
>> information on application-level DoS attacks or flash crowds you have
>> experienced we would be grateful for as well.
>>
>> cheers,
>> gen
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] stats.grok.se used in study about Snowden and internet traffic

2017-01-18 Thread Leila Zia
+ Juliet, as this is something Communications may want to follow up on,
given that stats.grok.se is not maintained by a Wikimedia Foundation member.

Thanks for sharing this.

Leila

Leila Zia
Senior Research Scientist
Wikimedia Foundation

On Wed, Jan 18, 2017 at 9:00 AM, Andrew Otto <aco...@gmail.com> wrote:

> Saw this on reddit:
>
> https://theintercept.com/2016/04/28/new-study-shows-mass-
> surveillance-breeds-meekness-fear-and-self-censorship/
>
> From the paper
> <https://poseidon01.ssrn.com/delivery.php?ID=753074071124064067015124108126089006010045004048003005075122091087089064095002083029126119017036023013055086074065017070067091045045047076049103090096065089019029088069014094126005065070066069027097007119094008090029020087004112003090106003067003084=pdf>
> :
>
> "This case study uses data on English language Wikipedia article view
> counts from the online service stats.grok.se, a portal maintained by a
> Wikimedia Foundation member. This portal provides access to a range of
> Wikipedia analytics, stats, and data.86 In particular, the portal
> aggregates Wikipedia article view data on a daily and monthly basis.87 This
> data at stats.grok.se has been used in a range of research, including
> studies involving market trends, health information access, and
> social-political change.88”
>
> Just thought it might be of interest, especially considering WMF’s NSA
> lawsuit <https://blog.wikimedia.org/2015/03/10/wikimedia-v-nsa/>.
>
> -Ao
>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Fwd: [Query Logs] Research:Understanding Wikidata Queries

2017-01-16 Thread Leila Zia
On Tue, Jan 3, 2017 at 9:30 AM, Stas Malyshev 
wrote:

> Hi!
>
> > 1. Is there a unique key for the query log? The log I am referring to
> > is the *wdqs_extract* table from the Hive database wmf. We would like
> > to be able to permanently link our own computed data with the log
> > entry we computed it from.
>
> I think you can use hostname+sequence (from
> https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest, assuming
> those are preserved in wdqs_extract) as a key.
>

Adrian, you can also consider adding other fields to Stas' recommendation
to create the key, to be more certain of uniqueness: for example, the IP
and UA fields in combination with hostname and sequence (or browser
language, if it's relevant in your case). Let us know on this thread what
you end up using, so we know the answer for the future. :)

Best,
Leila​



>
>
> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Upcoming Research Showcase, November 16, 2016

2016-11-16 Thread Leila Zia
Hi all,

A reminder that this is happening 2 hours from now.

Best,
Leila


On Wed, Nov 9, 2016 at 2:29 PM, Leila Zia <le...@wikimedia.org> wrote:

> [Apologies for cross-posting]
>
> Hi everyone,
>
> Almost a year ago, we [1] embarked on a research project to understand who
> Wikipedia readers are. More specifically, we set a goal for finding a
> taxonomy of Wikipedia readers. In the upcoming Research Showcase, I will
> present the findings of this research.
>
> *Logistics*​
> The Research Showcase will be live-streamed on Wednesday, November 16,
> 2016 at 11:35 (PST) 19:35 (UTC).
>
> YouTube stream: https://www.youtube.com/watch?v=O24F1xkbNwI
>
> As usual, you can join the conversation on IRC (Freenode) at
> #wikimedia-research. And, you can watch our past research showcases at
> https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase.
>
> *Title*
> Why We Read Wikipedia
>
> *Abstract*
> Every day, millions of readers come to Wikipedia to satisfy a broad range
> of information needs; however, little is known about what these needs are.
> In this presentation, I share the results of a research project that aims to help us
> understand Wikipedia readers better. Based on an initial user study on
> English, Persian, and Spanish Wikipedia, we build a taxonomy of Wikipedia
> use-cases along several dimensions, capturing users’ motivations to visit
> Wikipedia, the depth of knowledge they are seeking, and their knowledge of
> the topic of interest prior to visiting Wikipedia. Then, we quantify the
> prevalence of these use-cases via a large-scale user survey conducted on
> English Wikipedia. Our analyses highlight the variety of factors driving
> users to Wikipedia, such as current events, media coverage of a topic,
> personal curiosity, work or school assignments, or boredom. Finally, we
> match survey responses to the respondents’ digital traces in Wikipedia’s
> server logs, enabling the discovery of behavioral patterns associated with
> specific use-cases. Our findings advance our understanding of reader
> motivations and behavior on Wikipedia and have potential implications for
> developers aiming to improve Wikipedia’s user experience, editors striving
> to cater to (a subset of) their readers’ needs, third-party services (such
> as search engines) providing access to Wikipedia content, and researchers
> aiming to build tools such as article recommendation engines.
>
>
> *How to prepare? What to expect?*
> If you decide to attend, here are a few things I would like to ask you to
> keep in mind, especially if this will be your first time at one of our
> research showcases:
>
> * Like many other research projects in fields that are not heavily
> explored, the findings of this research will create more questions than
> they answer. I encourage you to keep these questions in mind throughout the
> presentation and discussion: "What can we do with this finding? What other
> questions can we ask? What other ideas can we try?"
>
> * Be open to ask these questions to yourself, especially if you are a
> Wikipedia editor, even before coming to the showcase: "Why do I edit
> Wikipedia? Who am I writing the content for, if anyone? Will I change the
> way I write content if I know more about who reads it (to encourage or
> discourage certain types of reading or readers)? What needs should an
> encyclopedia serve? What is Wikipedia: a place one can quickly find the answer to
> his/her questions, or a place that one can go to when he/she wants to spend
> a quiet time reading and learning, or a place for both and even more? etc."
>
> * And, see if you would be interested in seeing the results of this study in
> your language. What will be presented is based on research on English,
> Persian, and Spanish Wikipedia (the data from the latter two projects have
> been used only for one part of the research). We are interested in running
> the study on at least 2-3 more languages to understand the robustness of
> some of the results across different languages, and to also help
> communities with having access to the results for their specific language
> project.
>
> ​Looking forward to seeing you there, and if you can't make it, please
> feel free to watch the video later and get in touch with us with
> questions/comments. :)
>
> Best,
> Leila
> --
> Leila Zia
> Senior Research Scientist
> Wikimedia Foundation
>
> ​[1] WMF Research and researchers from three academic institutions: EPFL,
> GESIS, and Stanford University, in collaboration with WMF Reading.
> ​
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] ensuring reader anonymity

2016-11-11 Thread Leila Zia
And one last email from me until Monday ;)

(Thanks to Nuria) We are now tracking the hashing of IPs in webrequest logs
at https://phabricator.wikimedia.org/T150545. I have asked Nuria to give me
two weeks to reach out to the people who work with this data to see if
anyone raises any flags about hashing IP addresses in webrequest logs.
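
For context, one common approach to this kind of task is to replace raw IPs with a keyed hash under a periodically rotated secret, so the addresses themselves need not be retained. This is only an illustrative sketch, not the scheme adopted in the Phabricator task:

```python
import hashlib
import hmac

def hash_ip(ip: str, salt: bytes) -> str:
    """Keyed (HMAC-SHA256) hash of an IP address. With a secret,
    periodically rotated salt, the same IP maps to the same token
    within a rotation window but cannot be trivially reversed or
    enumerated without the salt."""
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()

token = hash_ip("203.0.113.5", b"rotating-secret")
```

Keeping the hash keyed matters because the IPv4 space is small enough to brute-force an unsalted hash; rotating the salt additionally limits how long any one token can be linked across requests.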

Best,
Leila

On Fri, Nov 11, 2016 at 11:16 AM, Leila Zia <le...@wikimedia.org> wrote:

> ​Hi Pine,
>
> On Fri, Nov 11, 2016 at 10:39 AM, Pine W <wiki.p...@gmail.com> wrote:
>
>> On Fri, Nov 11, 2016 at 9:25 AM, Leila Zia <le...@wikimedia.org> wrote:
>>
>>> Nuria, regarding the IP addresses specifically (not the proxy, for
>>> which, I'll need more time to go through the use-cases we've had and see if
>>> we can find work-arounds if we hash proxy information):
>>>
>>> Have we considered in the past to create at least two levels of access
>>> when it comes to the IP addresses? From what you describe, it is clear to
>>> me that your team will need to have access to raw IPs for a certain period
>>> of time. It may be the case that no one else uses that information (for all
>>> of the use-cases of the research I've been involved in, hashed IP works as
>>> well, as long as we have geolocation available to us). By creating two
>>> layers of access, we can make sure that your team has access to raw IP
>>> while everyone else doesn't. Is this an option?
>>>
>>> And one suggestion: if we want to reconsider the way we provide access
>>> to IP address, I'd like to suggest that we step back and reconsider the way
>>> we give access to other fields in the webrequest logs as well. This will be
>>> a longer process, but it may be worthwhile. For example, if we decide that
>>> access to raw IP should be limited even further, do we want to have the
>>> same restrictions applied to access to UAs? It's not obvious to me that the
>>> answer should be no.
>>>
>>> Best,
>>> Leila
>>>
>>>
>> I'd be happy to have Legal and Analytics take a look at what could be
>> done to tighten the screws a bit on who has access to other data in the
>> logs such as UAs. (To follow up on a comment from Wikimedia-l: I'm also
>> very wary of letting people outside of WMF and the community have access to
>> this kind of information, even with a signed NDA.)
>>
>
> I'm not a supporter of the narrative that non-staff folks who have access
> to the webrequest logs should be limited more than staff members who can
> have access to the logs. Some of these folks are highly trained individuals
> (sometimes even more than staff members) and some of them are less
> experienced but work very closely with a staff member who is experienced in
> dealing with sensitive data. They understand the importance of the data
> they work with, and we do our share in onboarding them and making sure we
> are all on the same page about what data they're working with and how they
> should handle it.
>
> Let's step back:
>
> * Subpoena related concerns: the best way to handle this from the data
> storage perspective is to not have the data at all. That is why very
> sensitive data is purged after 60 days at the moment in webrequest logs. As
> Nuria said, this length of time may be shortened by a little, but at least
> because of operational constraints, we won't be able to not store this data
> at all.
>
> * Error related concerns: One way to reduce the errors is to constrain the
> number of people who can access the data (which is already happening, we're
> talking about increasing restrictions here). In this case, there is very
> little difference between staff and non-staff folks who have access to
> webrequest logs at the moment. Mistakes can happen by people in each group.
> I may make a mistake and give an output in my GitHub account with the top
> 10 IP addresses that have accessed WP in the last hour. This mistake can
> happen, by anyone accessing this data. The logical thing to do is to reduce
> the number of people who don't /have to/ have access to that data. If I
> don't /need/ to see the IPs for my work, I shouldn't see them, whether I'm
> a staff member or non-staff under NDA to access this data. If I should,
> then we should accept that mistakes can happen, but we will do our best to
> reduce them.
>
> * I also want to point out prioritization here, which is something Nuria
> and her team should handle (and this will affect Security, Research, and
> Legal as well):
>
> the Analytics team has been allocating resources to transition wikistats.
> This has been a gigantic endeavour by the team. We know that if wi

Re: [Analytics] ensuring reader anonymity

2016-11-11 Thread Leila Zia
​Hi Pine,

On Fri, Nov 11, 2016 at 10:39 AM, Pine W <wiki.p...@gmail.com> wrote:

> On Fri, Nov 11, 2016 at 9:25 AM, Leila Zia <le...@wikimedia.org> wrote:
>
>> Nuria, regarding the IP addresses specifically (not the proxy, for which,
>> I'll need more time to go through the use-cases we've had and see if we can
>> find work-arounds if we hash proxy information):
>>
>> Have we considered in the past to create at least two levels of access
>> when it comes to the IP addresses? From what you describe, it is clear to
>> me that your team will need to have access to raw IPs for a certain period
>> of time. It may be the case that no one else uses that information (for all
>> of the use-cases of the research I've been involved in, hashed IP works as
>> well, as long as we have geolocation available to us). By creating two
>> layers of access, we can make sure that your team has access to raw IP
>> while everyone else doesn't. Is this an option?
>>
>> And one suggestion: if we want to reconsider the way we provide access to
>> IP address, I'd like to suggest that we step back and reconsider the way we
>> give access to other fields in the webrequest logs as well. This will be a
>> longer process, but it may be worthwhile. For example, if we decide that
>> access to raw IP should be limited even further, do we want to have the
>> same restrictions applied to access to UAs? It's not obvious to me that the
>> answer should be no.
>>
>> Best,
>> Leila
>>
>>
> I'd be happy to have Legal and Analytics take a look at what could be done
> to tighten the screws a bit on who has access to other data in the logs
> such as UAs. (To follow up on a comment from Wikimedia-l: I'm also very
> wary of letting people outside of WMF and the community have access to this
> kind of information, even with a signed NDA.)
>

I'm not a supporter of the narrative that non-staff folks who have access
to the webrequest logs should be limited more than staff members who can
have access to the logs. Some of these folks are highly trained individuals
(sometimes even more than staff members) and some of them are less
experienced but work very closely with a staff member who is experienced in
dealing with sensitive data. They understand the importance of the data
they work with, and we do our share in onboarding them and making sure we
are all on the same page about what data they're working with and how they
should handle it.

Let's step back:

* Subpoena related concerns: the best way to handle this from the data
storage perspective is to not have the data at all. That is why very
sensitive data is purged after 60 days at the moment in webrequest logs. As
Nuria said, this length of time may be shortened by a little, but at least
because of operational constraints, we won't be able to not store this data
at all.

* Error related concerns: One way to reduce the errors is to constrain the
number of people who can access the data (which is already happening, we're
talking about increasing restrictions here). In this case, there is very
little difference between staff and non-staff folks who have access to
webrequest logs at the moment. Mistakes can happen by people in each group.
I may make a mistake and give an output in my GitHub account with the top
10 IP addresses that have accessed WP in the last hour. This mistake can
happen, by anyone accessing this data. The logical thing to do is to reduce
the number of people who don't /have to/ have access to that data. If I
don't /need/ to see the IPs for my work, I shouldn't see them, whether I'm
a staff member or non-staff under NDA to access this data. If I should,
then we should accept that mistakes can happen, but we will do our best to
reduce them.

* I also want to point out prioritization here, which is something Nuria
and her team should handle (and this will affect Security, Research, and
Legal as well):

the Analytics team has been allocating resources to transition wikistats.
This has been a gigantic endeavour by the team. We know that if wikistats
data is not generated for a few months, we will have a lot of unhappy
people around. If we are to step back and allocate resources to spend hours
on rethinking how we handle webrequest logs (and I can assure you that this
will require at least 2 full-time work week of many people, likely spread
over months), we will have to slow down some other effort. Also, consider
Security who needs to be involved in this process. You know better than I
that Security has a lot to do with very few people.

imho, we are doing a very good job with the way we are handling webrequest
logs data at the moment given our constraints. Sure, we can and should
improve some steps over time.

Best,
Leila



>
> Pine
>
>

[Analytics] Upcoming Research Showcase, November 16, 2016

2016-11-09 Thread Leila Zia
[Apologies for cross-posting]

Hi everyone,

Almost a year ago, we [1] embarked on a research project to understand who
Wikipedia readers are. More specifically, we set a goal for finding a
taxonomy of Wikipedia readers. In the upcoming Research Showcase, I will
present the findings of this research.

*Logistics*​
The Research Showcase will be live-streamed on Wednesday, November 16, 2016
at 11:35 (PST) 19:35 (UTC).

YouTube stream: https://www.youtube.com/watch?v=O24F1xkbNwI

As usual, you can join the conversation on IRC (Freenode) at
#wikimedia-research. And, you can watch our past research showcases at
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase.

*Title*
Why We Read Wikipedia

*Abstract*
Every day, millions of readers come to Wikipedia to satisfy a broad range
of information needs; however, little is known about what these needs are.
In this presentation, I share the results of a research project that aims to help us
understand Wikipedia readers better. Based on an initial user study on
English, Persian, and Spanish Wikipedia, we build a taxonomy of Wikipedia
use-cases along several dimensions, capturing users’ motivations to visit
Wikipedia, the depth of knowledge they are seeking, and their knowledge of
the topic of interest prior to visiting Wikipedia. Then, we quantify the
prevalence of these use-cases via a large-scale user survey conducted on
English Wikipedia. Our analyses highlight the variety of factors driving
users to Wikipedia, such as current events, media coverage of a topic,
personal curiosity, work or school assignments, or boredom. Finally, we
match survey responses to the respondents’ digital traces in Wikipedia’s
server logs, enabling the discovery of behavioral patterns associated with
specific use-cases. Our findings advance our understanding of reader
motivations and behavior on Wikipedia and have potential implications for
developers aiming to improve Wikipedia’s user experience, editors striving
to cater to (a subset of) their readers’ needs, third-party services (such
as search engines) providing access to Wikipedia content, and researchers
aiming to build tools such as article recommendation engines.


*How to prepare? What to expect?*
If you decide to attend, here are a few things I would like to ask you to
keep in mind, especially if this will be your first time at one of our
research showcases:

* Like many other research projects in fields that are not heavily
explored, the findings of this research will create more questions than
they answer. I encourage you to keep these questions in mind throughout the
presentation and discussion: "What can we do with this finding? What other
questions can we ask? What other ideas can we try?"

* Be open to asking yourself these questions, especially if you are a
Wikipedia editor, even before coming to the showcase: "Why do I edit
Wikipedia? Who am I writing the content for, if anyone? Would I change the
way I write content if I knew more about who reads it (to encourage or
discourage certain types of reading or readers)? What needs should an
encyclopedia serve? What is Wikipedia: a place where one can quickly find
the answer to one's questions, a place one goes to when one wants to spend
quiet time reading and learning, or a place for both and even more? etc."

* And consider whether you would be interested in seeing the results of
this study for your language. What will be presented is based on research
on English, Persian, and Spanish Wikipedia (the data from the latter two
projects has been used for only one part of the research). We are
interested in running the study on at least 2-3 more languages, both to
understand the robustness of some of the results across different
languages and to give communities access to the results for their specific
language project.

Looking forward to seeing you there, and if you can't make it, please feel
free to watch the video later and get in touch with us with
questions/comments. :)

Best,
Leila
--
Leila Zia
Senior Research Scientist
Wikimedia Foundation

[1] WMF Research and researchers from three academic institutions: EPFL,
GESIS, and Stanford University, in collaboration with WMF Reading.
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Fwd: [Research-Internal] Fwd: Dumps Rewrite getting underway (help needed!)

2016-09-13 Thread Leila Zia
FYI

-- Forwarded message --
From: Ariel Glenn WMF 
Date: Mon, Sep 12, 2016 at 9:07 AM
Subject: [Research-Internal] Fwd: Dumps Rewrite getting underway (help
needed!)
To: research-inter...@lists.wikimedia.org


-- Forwarded message --
From: Ariel Glenn WMF 
Date: Mon, Sep 5, 2016 at 2:35 PM
Subject: Dumps Rewrite getting underway (help needed!)
To: Wikipedia Xmldatadumps-l 


Hello folks,

I know a number of you have subscribed to the Dumps Rewrite project (
https://phabricator.wikimedia.org/tag/dumps-rewrite/) but I bet none of you
actually watch it or any of its tasks.  So here's a heads up.

I'm getting started on the job scheduler/workflow manager piece; this
would accept lists of dump tasks (in the current setup, e.g. "dump stubs
for el wikipedia"), call a callback to turn each of them into small jobs
that can be completed in less than an hour, submit and monitor these jobs
with retries, dependencies, etc., call a callback to recombine the outputs
of the jobs, and notify some caller on success of the whole operation.
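For anyone skimming, the flow above can be sketched roughly like this (an illustrative Python sketch, not the actual dumps code; all names here are made up):

```python
# Illustrative sketch of the described flow: split a task into sub-hour
# jobs via a callback, run each job with retries, recombine the outputs
# via a second callback, then notify the caller of success.
def run_dump_task(task, split, run_job, recombine, notify, max_retries=3):
    outputs = []
    for job in split(task):              # e.g. one job per page-id range
        for attempt in range(1, max_retries + 1):
            try:
                outputs.append(run_job(job))
                break
            except Exception:
                if attempt == max_retries:
                    raise                # give up after max_retries failures
    result = recombine(outputs)          # e.g. concatenate partial dump files
    notify(task, result)                 # report success of the whole operation
    return result
```

A real implementation would also track inter-job dependencies and persist state, which is exactly what the package evaluation below is about.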

First up is evaluating existing packages and choosing one to use as a
foundation.  Please contribute!  See the following tasks:

https://phabricator.wikimedia.org/T143205: Draft usage scenarios for
job/workflow manager 
https://phabricator.wikimedia.org/T143206: List requirements needed for
task/job/workflow manager 
https://phabricator.wikimedia.org/T143207: Evaluate software packages for
job/task/workflow management 

Also, can someone please forward this on to analytics-l and research-l?
I'm not on those lists but they will no doubt have a lot of useful
expertise here.

Thanks!

Ariel


___
Research-Internal mailing list
research-inter...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/research-internal


Re: [Analytics] Split testing example implementations

2016-09-07 Thread Leila Zia
Hi Jan,

I don't know of documented examples (the A/B testing design depends on the
question you want to answer). If you want to chat about this more, I'd be
happy to brainstorm with you about your options. Message me off-list and we
can set up a time if that's helpful.
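In the meantime, in case a sketch helps: a common self-implemented pattern (purely illustrative, not a WMF framework or the EventLogging API; all names here are made up) is to persist a random token client-side in a cookie and hash it, together with the experiment name, into a bucket, so assignment is deterministic and needs no server-side state:

```python
import hashlib

def assign_bucket(user_token, experiment, buckets=("A", "B")):
    # Hashing the (experiment, token) pair means the same user always
    # lands in the same bucket without any server-side bookkeeping; the
    # token itself is generated once and stored in a cookie.
    digest = hashlib.sha256(f"{experiment}:{user_token}".encode()).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]
```

On a repeat visit you read the token back from the cookie and recompute the bucket, so no separate "assigned group" needs to be stored.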

Best,
Leila

Leila Zia
Senior Research Scientist
Wikimedia Foundation

On Wed, Sep 7, 2016 at 12:16 AM, Jan Dittrich <jan.dittr...@wikimedia.de>
wrote:

> Hello Analytics,
>
> A while ago I asked about the existence of any A/B-Testing Framework. I
> got to know (Thanks, Nuria!) that https://phabricator.wikimedia.
> org/T135762 is in preparation. However, I assume, that until this is in
> place, we need to use custom solutions which utilize EventLogging.
> Event logging itself is pretty clear to me, but not the splitting/cookie
> logic.
>
> Could anybody link me some examples for such a self implemented way to
> show users their assigned content and, if they are not assigned to a group
> yet, to assign users to A/B… bins?
>
> Jan
>
> --
> Jan Dittrich
> UX Design/ User Research
>
> Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
> Phone: +49 (0)30 219 158 26-0
> http://wikimedia.de
>
> Imagine a world, in which every single human being can freely share in the
> sum of all knowledge. That‘s our commitment.
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
> der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
> Körperschaften I Berlin, Steuernummer 27/029/42207.
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] Getting search engine terms for specific wikibook?

2016-09-06 Thread Leila Zia
Hi Lars,

This is Leila from WMF Research. Recently, we have been receiving a lot of
requests about search queries. Here is a response we gave to another
researcher a few days ago, FYI, and hopefully it will be helpful.

Best,
Leila

--

As you well know, access to the data you're asking for is not
straightforward, and it's a topic that resurfaces every few months, as the
editor community is also very interested in it. See for example a recent
discussion in here
<https://lists.wikimedia.org/pipermail/wikimedia-l/2016-July/084745.html>.

There are a few things the Research team (that I'm a member of) needs to
know before we can say more:

* We need a proposal from you and your collaborators of your project
explaining what the project is, a short description of the methodology
you're proposing or approaches you want to try, and how the project can
contribute to Wikimedia Foundation's mission/plans and/or
Wikimedia/Wikipedia community. If there is something in our annual plan
<https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2016-2017/Final>
that catches your eye as a potential alignment, please bring that up to us
in your proposal.

You can create a page at https://meta.wikimedia.org/wiki/Research for your
project and share a link with us. Note that the proposal shouldn't be long.
See for example the proposal for this research
<https://meta.wikimedia.org/wiki/Research:Increasing_article_coverage>
(search for "Proposal" in the page).

* If you are under time constraints, please be explicit about it in your
proposal. Looking at the current list of our collaborations
<https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations#Current_list_of_formal_collaborators>,
and knowing that there are a few more in the process, you may have to wait
for some time before one of us can work with you to make it happen,
assuming, of course, that your proposal is approved by the team.

* To learn more about our formal collaborations, which are the way such
access to data can be made possible, please read here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations>.


Leila Zia
Senior Research Scientist
Wikimedia Foundation

On Mon, Sep 5, 2016 at 11:19 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:

> >By the way, what about alternate, external methods such as subscribing
> >that particular wikibook to Google Search Console?
> Our privacy policy prevents us from sending data to third parties, so
> sending analytics data to Google is not allowed.
>
> Thanks,
>
> Nuria
>
> On Sun, Sep 4, 2016 at 11:26 PM, Lars Noodén <lars.noo...@gmail.com>
> wrote:
>
>> On 09/05/2016 07:36 AM, Nuria Ruiz wrote:
>> > Lars,
>> >
>> > I am not sure we have the data you are looking for; the data we get
>> > from searches is only available for 60 days or less while it gets
>> > processed and deleted after that. Aggregated pageview data is kept
>> > long term, search data is not.
>>
>> Even the most recent 30 to 60 days worth would help.  The pageview data
>> shows what is used but gives no hint about why.
>>
>> >> So, what would be the process to request access to the raw data and
>> >> what would be the conditions for such access?
>> > Access to raw data is normally restricted to research projects. You can
>> > perhaps do a request for a one-time query but, as I was saying, the data
>> > you are looking for is not available long term.
>>
>> I've made a request in phabricator, if I understand the request
>> procedure properly.
>>
>> https://phabricator.wikimedia.org/T144714
>>
>> > You can read about data access here:
>> > https://meta.wikimedia.org/wiki/Research:FAQ
>>
>> Thanks.  I'm wading through that one and the nearby pages.
>>
>> > Thanks,
>> >
>> > Nuria
>>
>> By the way, what about alternate, external methods such as subscribing
>> that particular wikibook to Google Search Console?  If it is allowed, I
>> might try it to see if it is possible and what it yields.
>>
>> Regards,
>> Lars
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] Analysing link

2016-08-26 Thread Leila Zia
On Fri, Aug 26, 2016 at 1:38 AM, Federico Leva (Nemo) 
wrote:

> Jan Dittrich, 26/08/2016 10:03:
>
>> or even click paths
>>
>
> Do you know about https://meta.wikimedia.org/wik
> i/Research:Improving_link_coverage/Release_page_traces ?
>

and https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors ?

Leila



>
> Nemo
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


Re: [Analytics] [Pageview API] Data Retention Question

2016-07-29 Thread Leila Zia
Dan, Thanks for reaching out.

18 months is enough for my use cases as long as the dumps capture the exact
data structure.

Best,
Leila

--
Leila Zia
Senior Research Scientist
Wikimedia Foundation

On Fri, Jul 29, 2016 at 11:51 AM, Amir E. Aharoni <
amir.ahar...@mail.huji.ac.il> wrote:

> I am now checking traffic data every day to see whether Compact Language
> Links affect it. It makes sense to compare them not only to the previous
> week, but also to the same month in the previous year. So one year is hardly
> enough. 18 months is better, and three years is much better because I'll be
> able to check also the same month in earlier years.
>
> I imagine that this may be useful to all product managers that work on
> features that can affect traffic.
>
> בתאריך 29 ביולי 2016 15:41,‏ "Dan Andreescu" <dandree...@wikimedia.org>
> כתב:
>
>> Dear Pageview API consumers,
>>
>> We would like to plan storage capacity for our pageview API cluster.
>> Right now, with a reliable RAID setup, we can keep *18 months* of data.
>> If you'd like to query further back than that, you can download dump files
>> (which we'll make easier to use with python utilities).
>>
>> What do you think?  Will you need more than 18 months of data?  If so, we
>> need to add more nodes when we get to that point, and that costs money, so
>> we want to check if there is a real need for it.
>>
>> Another option is to start degrading the resolution for older data (only
>> keep weekly or monthly for data older than 1 year for example).  If you
>> need more than 18 months, we'd love to hear your use case and something in
>> the form of:
>>
>> need daily resolution for 1 year
>> need weekly resolution for 2 years
>> need monthly resolution for 3 years
>>
>> Thank you!
>>
>> Dan
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] [Wiki-research-l] question about Pageviews dumps

2016-07-01 Thread Leila Zia
Hi Marc,

On Tue, Jun 28, 2016 at 6:36 AM, Marc Miquel  wrote:


> Since this would be for a research project I might ask funding for it, I
> would like to know if I could count on that, what is the nature of the
> available data, and what would be the procedure to obtain this data and if
> there would be any implication because of privacy concerns.
>

We grant access to webrequest log data and its non-public derivatives
only infrequently. When we do, we do it by setting up formal
collaborations with the researchers. What these collaborations are and how
we set them up is explained at
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations.

To provide more context:

Requiring formal collaborations as a necessary step for accessing the data
means that we cannot scale rapidly, i.e., each researcher on our team is
only able to be involved in so many of them. The practical cap is somewhere
around 3 collaborations per researcher, in my experience. We understand
that this is a problem, as we would like more researchers to work with this
data, and we frequently reconsider ways to expand our capacity to
collaborate. We also always consider releasing more datasets publicly
since, ultimately, that's one of the best ways for us to empower others to
do the work they want to do and find value in.

Best,
Leila


> Thank you very much!
>
> Best,
>
> Marc Miquel
> ᐧ
>
> ___
> Wiki-research-l mailing list
> wiki-researc...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>


Re: [Analytics] Zika

2016-02-14 Thread Leila Zia
Hey Dan,

On Sun, Feb 14, 2016 at 3:02 AM, Dan Andreescu 
wrote:


> So, I felt personally compelled in the case of Zika, and the confusing
> coverage it has seen, to offer to personally help.


Which aspect of the coverage are you referring to as confusing?


> I can run queries, test hypotheses, and help publish data that could back
> up articles.  Privacy of our editors is of course still obviously
> protected, but that's easier to do in a specific case with human review
> than in the general case.
>
>

I'm up for brainstorming about what we can do and helping. Please keep me
in the loop. In general, given that a big chunk of our traffic comes from
Google at the moment, it would be great to work with the researchers in
Google involved in Google's health related initiatives to produce
complementary knowledge to what Google can already tell about Zika (for
example, this ).
I'll reach out to the few people I know to get some more information.
Depending on what complementary knowledge we want to produce, working with
WikiProject Medicine can be helpful, too.

Leila

___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] Canonical location for metrics documentation

2015-10-14 Thread Leila Zia
On Wed, Oct 14, 2015 at 8:05 AM, Dan Andreescu 
wrote:

>
>
> I'm not saying it's easy, but I think having documentation in more than
> one place is an awful experience for newcomers.
>

I second this as a problem. I make a joke of it each time I have to explain
to a newcomer what is documented where; it would be much better if we could
solve it, though.


> I know a lot of research stuff is on meta, so maybe in your case it makes
> sense to standardize on meta and point to it from the other wikis.
>

This can work for the type of documentation I have on Meta. Moving my
current documentation out of Meta is also an option (it's really not
discoverable to newcomers and those outside of the movement), as long as I
can have a bigger picture of how we envision the future of our documentation.

Neil, this conversation may take some time to settle. My recommendation is
that you document your work somewhere that makes sense to you based on the
type of work currently documented in the different places. You can always
move it out of there. Don't let figuring out where to document it become a
barrier to documentation. :-)

Leila


> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] Canonical location for metrics documentation

2015-10-14 Thread Leila Zia
Makes sense to me.

On Wed, Oct 14, 2015 at 11:27 AM, Neil P. Quinn 
wrote:

> Keep in mind that, when I say "metrics documentation", I'm not referring
> to documentation about Hive
> , the webrequest
> logs , or
> EventLogging .
> To my mind, those are infrastructure topics that are relevant mainly to
> Wikimedia (not MediaWiki) engineers, and so belong on Wikitech.
>
> I'm talking about documentation relevant to analysts, researchers, and end
> users of metrics: "this is how we define an *edit*", "this is why we use
> 5 edits as the cutoff for an active editor", "this is sample SQL for
> counting surviving new active editors", and so on. I think that kind of
> information belongs on Meta (and not on mediawiki.org, which was the
> original thrust of my question).
>
> Does that seem like a sensible split to people, or am I just agreeing with
> one side of the debate?
>
> On Wed, Oct 14, 2015 at 8:32 AM, Aaron Halfaker 
> wrote:
>
>> I think having documentation in more than one place is an awful
>>> experience for newcomers.
>>
>>
>> Which newcomers are you referring to?  Newcomers to the WMF engineering
>> staff or newcomers to research/analytics of Wikimedia projects?
>>
>> It's OK to not understand the different purposes of our Wikis right away,
>> but I don't think that is a good reason to undermine their purposes.  I
>> certainly don't see why wikitech is a desirable hub for this kind of
>> information.  From my point of view Wikitech is the *worst* potential hub
>> of information that is not specific to engineering.
>>
>> What, exactly, is the trouble with having metrics documentation on Meta?
>>   How would moving *some of the the documentation* to wikitech help that?
>> (Because you're not going to move research project documentation without
>> even stronger disagreement from the locals.)
>>
>> -Aaron
>>
>> On Wed, Oct 14, 2015 at 11:05 AM, Dan Andreescu > > wrote:
>>
>>>> Strongly oppose moving the Research namespace hosted metrics
>>>> documentation off Meta. It's s'posed to be broadly accessible. Wikitech is
>>>> on few peoples' radar. Mediawiki.org is for software documentation. Meta is
>>>> the central wiki for the movement (however imperfect). - J

>>>
>>> I respect the fact that these kinds of distinctions make sense to people
>>> who are already familiar with the movement and research / analytics.  But
>>> to someone relatively new, and to me for the first year at the foundation,
>>> those distinctions made zero sense.
>>>
>>> I'm not saying it's easy, but I think having documentation in more than
>>> one place is an awful experience for newcomers.  We'll continue to move
>>> things to wikitech and leave nice high level landing pages on the other
>>> wikis.  Others are welcome to act differently if they so see fit.  I know a
>>> lot of research stuff is on meta, so maybe in your case it makes sense to
>>> standardize on meta and point to it from the other wikis.
>>>
>>> You're of course welcome to disagree with me but I'd suggest first
>>> trying to come up with examples of newcomers who understand the purpose of
>>> our different wikis perfectly right away.
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> --
> Neil P. Quinn ,
> product analyst
> Wikimedia Foundation
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] Echo databases on analytics-store?

2015-10-09 Thread Leila Zia
On Fri, Oct 9, 2015 at 1:26 PM, Neil P. Quinn  wrote:

> I'm trying to gather some stats on the use of Echo notifications across
> wikis, and I'd like to join the `echo_events` table with the `user` table
> for a given wiki.
>

I'm not sure what kind of information you need, but there is a chance that
these events are captured via a schema, in which case you can get them
from the Echo schema in the log database and then join that with
enwiki.user. I'm also not sure what info you need from enwiki.user; you
may be able to get more global info from centralauth.globaluser.
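In case it helps to see the shape of such a join, here is a toy, self-contained sketch (SQLite in memory; the table and column names are simplified stand-ins, not the real Echo or enwiki schemas):

```python
import sqlite3

# Toy illustration of joining an events table with a user table; the
# schemas here are made up for the example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE echo_event (event_id INTEGER, event_type TEXT, user_id INTEGER);
    CREATE TABLE user (user_id INTEGER, user_name TEXT);
    INSERT INTO echo_event VALUES (1, 'mention', 10), (2, 'thanks', 11);
    INSERT INTO user VALUES (10, 'Alice'), (11, 'Bob');
""")
rows = con.execute("""
    SELECT e.event_type, u.user_name
    FROM echo_event e JOIN user u ON e.user_id = u.user_id
    ORDER BY e.event_id
""").fetchall()
print(rows)  # [('mention', 'Alice'), ('thanks', 'Bob')]
```

The same SELECT shape would apply whether the user info comes from enwiki.user or centralauth.globaluser; only the join key changes.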

Leila


>
> --
> Neil P. Quinn ,
> product analyst
> Wikimedia Foundation
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] Users changing language version through interwiki links

2015-09-12 Thread Leila Zia
Hi Strainu,

On Sat, Sep 12, 2015 at 5:43 AM, Strainu  wrote:

>
> I think for smaller wikis this would be an interesting way to know which
> domains/articles to work on.
>

What I'm saying is not directly related to your data request but to your
comment above:

We've been working on a project to understand gaps in Wikipedia and
increase content coverage. You can read more about it here
. As
part of the project, we are developing a tool that will provide article
recommendations (for translation or creation from scratch) based on
articles available in a source language x and missing in a destination
language y and the user's interest model. You can check out the current
state of the tool here  (note that the tool is not yet ready for public
consumption). The phab tickets for the tool
are here
. :-).
You can read more about the tool here
.
This project can help those who are interested in recommendations to
receive recommendations about what articles can be created next in their
local language.

Your question about tracking the change in language versions while reading
the article is definitely interesting for us given that it can be used as a
signal of demand in the destination language.

Best,
Leila


>
> Thanks,
>   Strainu
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


Re: [Analytics] [Survey] Pageview API

2015-09-11 Thread Leila Zia
It's getting exciting. :-)

I'd go with choice 2 since it gives more control to the user while offering
what the user can get through choice 1 as well.
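To make that concrete, here is a small sketch (the base URL below is a placeholder, not the real API host) showing that choice 1 is just a special case of choice 2:

```python
from datetime import date, timedelta

BASE = "https://example.org/top"  # placeholder base URL, not the real endpoint

def top_url(project, access, start, end):
    # Choice 2: explicit date range.
    return f"{BASE}/{project}/{access}/{start.isoformat()}/{end.isoformat()}"

def top_url_last_days(project, access, days, today):
    # Choice 1 ("last N days") reduces to choice 2 with a computed range.
    return top_url(project, access, today - timedelta(days=days), today)
```

Anyone who wants the choice-1 behavior can compute the range client-side, while the reverse is not possible.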

Question: will we get page_ids or page_titles or both? It's good to have
both.

Leila

On Fri, Sep 11, 2015 at 3:00 PM, Dan Andreescu 
wrote:

> Hi everyone.  End of quarter is rapidly approaching and I wanted to ask a
> quick question about one of the endpoints we want to push out.  We want to
> let you ask "what are the top articles" but we're not sure how to structure
> the URL so it's most useful to you.  Here are the choices:
>
> Choice 1. /top/{project}/{access}/{days-in-the-past}
>
> Example: top articles via all en.wikipedia sites for the past 30 days:
> /top/en.wikipedia/all-access/30
>
>
> Choice 2. /top/{project}/{access}/{start}/{end}
>
> Example: top articles via all en.wikipedia sites from June 12th, 2014 to
> August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30
>
>
> (in all of those,
>
> * {project} means en.wikipedia, commons.wikimedia, etc.
> * {access} means access method as in desktop, mobile web, mobile app
>
> )
>
> Which do you prefer?  Would any other query style be useful?
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] Breakdown of unique visitors by country (and by project)

2015-09-08 Thread Leila Zia
Hi Cristian,

On Tue, Sep 8, 2015 at 10:42 AM, Cristian Consonni 
wrote:

>
> we (Wikimedia Italia) are starting writing a proposal for a EU project
>

disclaimer: I'm not in Analytics.

We don't have unique counts (per country/project, or otherwise) as you
already guessed. If you can tell us more about your proposal in a
paragraph, we may be able to find some other ways through which you can
show the potential impact of your project through the data available, if
that's helpful to you.

Best,
Leila


> Thank you.
>
> Cristian
>
> [1] https://reportcard.wmflabs.org/
> [2] https://stats.wikimedia.org/EN/SummaryEN.htm
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


Re: [Analytics] user table information

2015-06-29 Thread Leila Zia
Thanks a lot everyone. :-)

On Mon, Jun 29, 2015 at 6:21 PM, Gergo Tisza gti...@wikimedia.org wrote:

 On Sat, Jun 27, 2015 at 2:30 PM, Leila Zia le...@wikimedia.org wrote:

 For the article recommendation test, we queried user table to get
 editors' email addresses. We then excluded the emails that were not
 verified. We've received a comment here
 https://meta.wikimedia.org/wiki/Research_talk:Increasing_article_coverage#Usage_of_user_database
 that suggests the user has changed his/her email address and we have
 somehow retained the old email address. I'd like to get to the bottom of
 this problem. Can someone help with this, in the Talk page or here? Are we
 looking at the wrong table? And in general, how can old information be in
 the user table?


 Now that SUL finalization
 https://www.mediawiki.org/wiki/SUL_finalisation has completed, you
 should use the global user table for email addresses
 (centralauth.globaluser.gu_email). When the user changes their email, it is
 not updated on all 800-something projects where they might have an account,
 only locally and centrally. User data on other wikis will be updated
 whenever the user next visits them.

 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics




[Analytics] user table information

2015-06-27 Thread Leila Zia
Hi,

For the article recommendation test, we queried user table to get editors'
email addresses. We then excluded the emails that were not verified. We've
received a comment here
https://meta.wikimedia.org/wiki/Research_talk:Increasing_article_coverage#Usage_of_user_database
that suggests the user has changed his/her email address and we have
somehow retained the old email address. I'd like to get to the bottom of
this problem. Can someone help with this, in the Talk page or here? Are we
looking at the wrong table? And in general, how can old information be in
the user table?

Sorry for sending an email over the weekend. A response on Monday would be
great. :-)

Thank you!

Leila


Re: [Analytics] [Research-Internal] Revision history of deleted pages

2015-06-25 Thread Leila Zia
Aaron, any chance you know the answer to this question? I have a vague
memory that we talked about deleted pages and their text some time back.
This data should live somewhere, right, given that deleted pages can be
restored?

Thanks,
Leila

On Wed, Jun 24, 2015 at 2:03 PM, Leila Zia le...@wikimedia.org wrote:

 switching to the public list with Bob's permission.

 On Wed, Jun 24, 2015 at 1:58 PM, Robert West robert.bob.w...@gmail.com
 wrote:

 Hi everyone,

 I'd like to find all enwiki articles that were ever marked with the
 {{hoax}} template. Pages with that template mostly end up being deleted, so
 they're not available in the public revision dumps
 https://dumps.wikimedia.org/enwiki/20150602/.

 Hence my question:
 Is there a way of getting access to the full enwiki revision dump
 including all deleted pages?
 I don't know yet which deleted articles I'm interested in, but will only
 know that after having done a pass over the full revision history.

 I know that viewing deleted content is problematic
 https://en.wikipedia.org/wiki/Wikipedia:Viewing_deleted_content (hence
 I'm sending this request to this internal research list), but I signed an
 NDA and have access to data on HDFS via stat1002, so there might be a way
 for me to access that data?

 I'm also aware of a list of archived hoaxes
 https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wikipedia,
 but many shorter-lived hoaxes that got deleted fast are not included there.

 Thanks -- any pointers welcome!
 Bob


 --
 Up for a little language game? -- http://www.unfun.me

 ___
 Research-Internal mailing list
 research-inter...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/research-internal





Re: [Analytics] [Research-Internal] Revision history of deleted pages

2015-06-24 Thread Leila Zia
switching to the public list with Bob's permission.

On Wed, Jun 24, 2015 at 1:58 PM, Robert West robert.bob.w...@gmail.com
wrote:

 Hi everyone,

 I'd like to find all enwiki articles that were ever marked with the
 {{hoax}} template. Pages with that template mostly end up being deleted, so
 they're not available in the public revision dumps
 https://dumps.wikimedia.org/enwiki/20150602/.

 Hence my question:
 Is there a way of getting access to the full enwiki revision dump
 including all deleted pages?
 I don't know yet which deleted articles I'm interested in, but will only
 know that after having done a pass over the full revision history.

 I know that viewing deleted content is problematic
 https://en.wikipedia.org/wiki/Wikipedia:Viewing_deleted_content (hence
 I'm sending this request to this internal research list), but I signed an
 NDA and have access to data on HDFS via stat1002, so there might be a way
 for me to access that data?

 I'm also aware of a list of archived hoaxes
 https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wikipedia,
 but many shorter-lived hoaxes that got deleted fast are not included there.

 Thanks -- any pointers welcome!
 Bob


 --
 Up for a little language game? -- http://www.unfun.me

 ___
 Research-Internal mailing list
 research-inter...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/research-internal




Re: [Analytics] Search dashboards are now running on live data

2015-05-22 Thread Leila Zia
On Fri, May 22, 2015 at 3:14 PM, Luis Villa lvi...@wikimedia.org wrote:

 68,000 searches/day seems *really* low,


right, but I'm not sure search sessions per day is the same as the number
of searches per day.
Oliver, what definition of a search session do you use? How do you
compute it?

Leila


 Luis

 On Fri, May 22, 2015 at 2:56 PM, Michael Holloway mhollo...@wikimedia.org
  wrote:

 Awesome.
 -m.

 On Fri, May 22, 2015 at 5:35 PM, Oliver Keyes oke...@wikimedia.org
 wrote:

 http://searchdata.wmflabs.org/ - boop! This was my Friday. Previously
 we were playing around with them and testing what we needed with a
 static snapshot; these dashboards will now update once a day with new
 information.

 It has turned up some bugs (is the mobile schema just not running?)
 and there are more metrics to add. But for the time being, it's progress
 :)

 --
 Oliver Keyes
 Research Analyst
 Wikimedia Foundation








 --
 Luis Villa
 Sr. Director of Community Engagement
 Wikimedia Foundation
 *Working towards a world in which every single human being can freely
 share in the sum of all knowledge.*





Re: [Analytics] clicks on red links

2015-05-22 Thread Leila Zia
Hi Amir,

   As far as I know and as mentioned by others, the exact statistics you're
looking for don't exist. More comments in-line.

On Wed, May 20, 2015 at 10:37 PM, Amir E. Aharoni amir.ahar...@mail.huji
.ac.il wrote:

 Hi,

 Are there statistics about the number of people who click on red links in
 Wikimedia projects?


This you can get from the logs, for the past 30 days. I'm assuming you are
not very strict about the definition of "people" and that as long as you can
factor out spiders and bots to a good extent, you're fine.

And about what they do as the next step - go back, close the page, create
 an article, something else?


There are two ways this can potentially be done: EventLogging, or, if you
are not concerned about actions like closing the page, the logs. Both
require quite some work, so my question is: what do you need this
information for? We may have other results that can help you answer your
questions in some other way.

Best,
Leila



 --
 Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
 http://aharoni.wordpress.com
 ‪“We're living in pieces,
 I want to live in peace.” – T. Moore‬





[Analytics] May 2015 research showcase

2015-05-11 Thread Leila Zia
Hi everyone,

The next research showcase will be live-streamed this Wednesday, May 13 at
11.30 PT. The streaming link will be posted on the lists a few minutes
before the showcase starts and as usual, you can join the conversation on
IRC at #wikimedia-research.

We look forward to seeing you!

Leila

This month

*The people's classifier: Towards an open model for algorithmic
infrastructure*
By Aaron Halfaker https://www.mediawiki.org/wiki/User:Halfak_(WMF)

Recent research has suggested that Wikipedia's algorithmic infrastructure
is perpetuating social issues. However, these same algorithmic tools are
critical to maintaining efficiency of open projects like Wikipedia at
scale. But rather than simply critiquing algorithmic wiki-tools and calling
for less algorithmic infrastructure, I'll propose a different strategy --
an open approach to building this algorithmic infrastructure. In this
presentation, I'll demo a set of services that are designed to open up a
critical part of Wikipedia's quality control infrastructure -- machine
classifiers. I'll also discuss how this strategy unites critical/feminist
HCI with more dominant narratives about efficiency and productivity.

*Social transparency online*
By Jennifer Marlow http://www.aboutjmarlow.com/ and Laura Dabbish
http://www.lauradabbish.com/

An emerging Internet trend is greater social transparency, such as the use
of real names in social networking sites, feeds of friends' activities,
traces of others' re-use of content, and visualizations of team
interactions. There is a potential for this transparency to radically
improve coordination, particularly in open collaboration settings like
Wikipedia. In this talk, we will describe some of our research identifying
how transparency influences collaborative performance in online work
environments. First, we have been studying professional social networking
communities. Social media allows individuals in these communities to create
an interest network of people and digital artifacts, and get
moment-by-moment updates about actions by those people or changes to those
artifacts. It affords an unprecedented level of transparency about the
actions of others over time. We will describe qualitative work examining
how members of these communities use transparency to accomplish their
goals. Second, we have been looking at the impact of making workflows
transparent. In a series of field experiments we are investigating how
socially transparent interfaces, and activity trace information in
particular, influence perceptions and behavior towards others and
evaluations of their work.


Re: [Analytics] [Wiki-research-l] April 2015 research showcase: remix and reuse in collaborative communities; the oral citations debate

2015-04-30 Thread Leila Zia
A reminder that this event will start in 10 minutes. You can watch the
event on YouTube here http://youtu.be/upQXecRNcdw. As usual, we will be
in #wikimedia-research for questions and chat. :-)

On Thu, Apr 16, 2015 at 12:43 PM, Dario Taraborelli 
dtarabore...@wikimedia.org wrote:

 I am thrilled to announce our speaker lineup for this month’s research
 showcase
 https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase#April_2015.


 *Jeff Nickerson* (Stevens Institute of Technology) will talk about remix
 and reuse in collaborative communities; *Heather Ford* (Oxford Internet
 Institute) will present an overview of the oral citations debate in the
 English Wikipedia.

 The showcase will be recorded and publicly streamed at 11.30 PT on *Thursday,
 April 30 *(livestream link will follow). We’ll hold a discussion and take
 questions from remote attendees via the Wikimedia Research IRC channel (
 #wikimedia-research
 http://webchat.freenode.net/?channels=wikimedia-research on freenode)
 as usual.

 Looking forward to seeing you there.

 Dario


 *Creating, remixing, and planning in open online communities*
 *Jeff Nickerson*
 Paradoxically, users in remixing communities don't remix very
 much. But an analysis of one remix community, Thingiverse, shows that those
 who actively remix end up producing work that is in turn more likely to be
 remixed. What does this suggest about Wikipedia editing? Wikipedia allows
 more types of contribution, because creating and editing pages are done in
 a planning context: plans are discussed on particular loci, including
 project talk pages. Plans on project talk pages lead to both creation and
 editing; some editors specialize in making article changes and others, who
 tend to have more experience, focus on planning rather than acting.
 Contributions can happen at the level of the article and also at a series
 of meta levels. Some patterns of behavior – with respect to creating versus
 editing and acting versus planning – are likely to lead to more sustained
 engagement and to higher-quality work. Experiments are proposed to test
 these conjectures.

 *Authority, power and culture on Wikipedia: The oral citations debate*
 *Heather Ford*
 In 2011, Wikimedia Foundation Advisory
 Board member, Achal Prabhala was funded by the WMF to run a project called
 'People are knowledge' or the Oral citations project
 https://meta.wikimedia.org/wiki/Research:Oral_Citations. The goal of
 the project was to respond to the dearth of published material about topics
 of relevance to communities in the developing world and, although the
 majority of articles in languages other than English remain intact, the
 English editions of these articles have had their oral citations removed. I
 ask why this happened, what the policy implications are for oral citations
 generally, and what steps can be taken in the future to respond to the
 problem that this project (and more recent versions of it
 https://meta.wikimedia.org/wiki/Research:Indigenous_Knowledge) set out
 to solve. This talk comes out of an ethnographic project in which I have
 interviewed some of the actors involved in the original oral citations
 project, including the majority of editors of the surr
 https://en.wikipedia.org/wiki/surr article that I trace in a chapter of
 my PhD[1] http://www.oii.ox.ac.uk/people/?id=286.


 ___
 Wiki-research-l mailing list
 wiki-researc...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l




Re: [Analytics] Research Showcase Starting in 8 minutes!

2015-03-25 Thread Leila Zia
The youtube link has changed to: http://youtu.be/PHQqicVoVx4

On Wed, Mar 25, 2015 at 11:22 AM, Ellery Wulczyn ewulc...@wikimedia.org
wrote:

 Today we will have two presentations:

  1. User Session Identification by Aaron Halfaker
  2. Mining Missing Hyperlinks in Wikipedia by Bob West.

 You can follow the talk on youtube http://youtu.be/CgkwLXbALQg.

 We will hold a discussion and take questions from remote participants via
 the Wikimedia Research IRC channel (#wikimedia-research on freenode).

 See you there,
 Ellery





[Analytics] [Announcement] March 2015 Research Showcase

2015-03-20 Thread Leila Zia
Hi,

This month's research showcase
https://www.mediawiki.org/w/index.php?title=Analytics/Research_and_Data/Showcase#March_2015
is scheduled for Wednesday, March 25, 11:30 (PST).

We will have two presentations on user session identification by Aaron
Halfaker, and mining missing hyperlinks in Wikipedia by Bob West.

As usual, the event will be recorded and publicly streamed on YouTube
(links will follow). We will hold a discussion and take questions from
remote participants via the Wikimedia Research IRC channel
(#wikimedia-research on freenode).

Looking forward to seeing you there.

Leila


[Analytics] [Technical] which pageview definition

2015-03-15 Thread Leila Zia
Hi,

   I'm trying to figure out which of the two pageview definitions we
currently have I can use for a question Bob and I are trying to address. It
would be great if you could share your thoughts. If you choose to do so, please
do it by Tuesday, eod, PST.

More details:


*What are we doing?*
We are building an edit recommendation system that identifies the missing
articles in Wikipedia that have a corresponding page in at least one of the
top 50 Wikipedia languages, ranks them, and recommends the ranked articles
to editors whom the algorithm assesses as likely to want to edit the
article.


*Where does pageview definition come into play?*
When we want to rank missing articles. To do the ranking, we want to
consider the pageviews to the article in the languages the article exists
in, and using this information estimate what the traffic is expected to be
in the language the article is missing in.
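As a sketch of that estimation step (illustrative only: the numbers and the
share-of-traffic normalization below are invented, not the project's actual
ranking method):

```python
# Estimate expected traffic for a missing article from its pageviews in
# languages where it exists. All figures are made up for illustration.
views = {"fr": 4000, "de": 2500, "es": 1500}   # article's daily views per language
totals = {"fr": 20_000_000, "de": 25_000_000, "es": 15_000_000}  # project-wide daily views
target_total = 250_000_000   # project-wide daily views of the target language

# Average share of project traffic the article captures where it exists...
share = sum(views[lang] / totals[lang] for lang in views) / len(views)

# ...scaled to the target project's overall traffic.
expected_views = share * target_total   # ~33,000 views/day expected
```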


*Why does it matter which pageview definition we use?*
We would like to use webstatscollector pageview definition since the hourly
data we have based on this definition goes back to roughly September 2014.
If we go with the new pageview definition, we will have data for the past
2.5 months. The longer period of time we have data for, the better.


*Why don't you then just use webstatscollector data?*
We're inclined to do that but we need to make sure that data works for the
kind of analysis we want to do. Per discussions with Oliver,
webstatscollector data has a lot of pageviews from bots and spiders. The
question is: is the effect of bot/spider traffic, i.e., the number of
pageviews they add to each page, roughly uniform across all pages? If that
is the case, webstatscollector definition will be our choice.
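One rough way to probe that uniformity question, assuming per-page counts
are available under both definitions (all numbers invented):

```python
# If bot/spider traffic adds a roughly constant share per page, the ratio
# of webstatscollector counts to the new definition's counts should have
# low spread across pages. Counts here are made up for illustration.
webstats = {"A": 1100, "B": 560, "C": 2200, "D": 330}
new_def = {"A": 1000, "B": 500, "C": 2000, "D": 300}

ratios = [webstats[p] / new_def[p] for p in webstats]
mean = sum(ratios) / len(ratios)
spread = max(ratios) - min(ratios)
# A small spread relative to the mean suggests the bot effect is
# near-uniform, so rankings under either definition would largely agree.
```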

I appreciate your thoughts on this.

Best,
Leila


Re: [Analytics] [Cluster] Monitoring the impact Hive jobs have on the Analytics cluster

2015-03-08 Thread Leila Zia
This is really useful, Christian. Thanks for explaining and documenting it.

Leila

On Sat, Mar 7, 2015 at 6:14 AM, Christian Aistleitner 
christ...@quelltextlich.at wrote:

 Hi,

 around running jobs on the Analytics cluster, I've sometimes seen
 people say in IRC: “Let's run this heavy job. I'll keep an eye on it”.

 But more often than not, this seems to have meant:
 “Let's just run this heavy job and wait. If QChris joins IRC, let's
 hope he doesn't ping us about having overloaded the cluster.”

 That's not nice^Wscalable ;-)

 So just in case someone is vague on how to “keep an eye on it”, I did
 a short write-up at:

   https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load

 which details how to detect how the cluster is doing at a very high
 level.
 Especially, it allows you to detect if the cluster got stalled, and if
 it did, it tells you what to do.

 Have fun,
 Christian

 P.S.: The above URL has diagrams! Click the URL!

 --
  quelltextlich e.U.  \\  Christian Aistleitner 
Companies' registry: 360296y in Linz
 Christian Aistleitner
 Kefermarkterstrasze 6a/3 Email:  christ...@quelltextlich.at
 4293 Gutau, Austria  Phone:  +43 7946 / 20 5 81
  Fax:+43 7946 / 20 5 81
  Homepage: http://quelltextlich.at/
 ---





Re: [Analytics] analytics-store replag

2015-03-05 Thread Leila Zia
Hi Sean,

   Thanks for the email. The two create queries are mine. Should I kill one?

Leila
On Mar 5, 2015 7:09 AM, Sean Pringle sprin...@wikimedia.org wrote:

 Just a heads-up:

 Analytics-store is seeing several hours of replag on s1, s4, and s6.

 s4 is me doing a commonswiki schema change, which should be done
 shortly. s1 and s6 are lagging due to load from queries like:

 create table staging.enwiki_intra select a.pl_from as page_id_from,
 a.pl_title as page_title_to, b.page_id as page_id_to from
 enwiki.pagelinks a left join enwiki.page b on a.pl_title=b.page_title
 where a.pl_namespace=0 and a.pl_from_namespace=0;

 Some very large cross joins there. No doubt it's intended activity.
 And if not, now at least you know what's happening ;-)

 BR
 Sean




Re: [Analytics] analytics-store replag

2015-03-05 Thread Leila Zia
Hi Sean,

On Thu, Mar 5, 2015 at 9:59 PM, Sean Pringle sprin...@wikimedia.org wrote:

 Hi Leila

 On Fri, Mar 6, 2015 at 1:38 AM, Leila Zia le...@wikimedia.org wrote:
  Hi Sean,
 
 Thanks for the email. The two create queries are mine. Should I kill
 one?

 Lag has now reached 24h for s1 and s6, plus I found a few other
 'research' user connections apparently fully blocked waiting on
 staging metadata, so I've killed your queries.


Thanks for letting me know.


 I recommend building tables like that using batched inserts, so both
 replication and other people can co-exist with you :-) Super large
 write-transactions running for more than an hour or two are never a
 good idea.


makes sense. will follow your advice.

Leila
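For what it's worth, the batched-insert pattern Sean describes can be
sketched as follows (sqlite3 as a stand-in for the real MariaDB staging
database; table and column names echo the earlier query but the data is
invented):

```python
import sqlite3

# Hypothetical miniature of the staging setup; the real case was MariaDB.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pagelinks (pl_from INTEGER, pl_title TEXT)")
cur.executemany("INSERT INTO pagelinks VALUES (?, ?)",
                [(i, "T%d" % (i % 97)) for i in range(10_000)])
cur.execute("CREATE TABLE enwiki_intra (page_id_from INTEGER, page_title_to TEXT)")

BATCH = 1_000  # rows per write transaction; small enough to keep each one short
offset = 0
while True:
    # Read one slice of the source table...
    rows = cur.execute(
        "SELECT pl_from, pl_title FROM pagelinks LIMIT ? OFFSET ?",
        (BATCH, offset),
    ).fetchall()
    if not rows:
        break
    # ...and write it in its own short transaction, so replication
    # (and other users) can keep up between batches.
    cur.executemany("INSERT INTO enwiki_intra VALUES (?, ?)", rows)
    conn.commit()
    offset += BATCH
```

The point is the commit per batch: one multi-hour CREATE ... SELECT holds a
single giant write transaction, which is what stalled replication.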



 BR
 Sean




Re: [Analytics] February 2015 Research Showcase: Global South survey results; data imports in OpenStreetMap

2015-02-18 Thread Leila Zia
This is happening in 15 minutes.

Here is the link for watching it: http://youtu.be/yaj9dfHjkOA

We will be in IRC channel #wikimedia-research for taking your questions. :-)

On Wed, Feb 11, 2015 at 5:21 PM, Dario Taraborelli 
dtarabore...@wikimedia.org wrote:

 I am thrilled to announce our speaker lineup for this month’s research
 showcase
 https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase#February_2015.


 Our own *Haitham Shammaa* will present results from the Global South
 survey. We also invited Stamen’s *Alan McConchie*, an OpenStreetMap
 expert, to talk about the challenges the OSM community is facing with
 external data imports.

 The showcase will be recorded and publicly streamed at 11.30 PT on *Wednesday,
 February 18 *(livestream link will follow). We’ll hold a discussion and
 take questions from remote participants via the Wikimedia Research IRC
 channel (#wikimedia-research
 http://webchat.freenode.net/?channels=wikimedia-research on freenode).

 Looking forward to seeing you there.

 Dario


 *Global South User Survey 2014*
 By *Haitham Shammaa https://meta.wikimedia.org/wiki/User:HaithamS_(WMF)*
 Users' trends in
 the Global South have significantly changed over the past two years, and
 given the increase in interest in Global South communities and their
 activities, we wanted this survey to focus on understanding the statistics
 and needs of our users (both readers, and editors) in the regions listed in
 the WMF's New Global South Strategy
 https://m.mediawiki.org/wiki/File:WMF%27s_New_Global_South_Strategy.pdf.
 This survey aims to provide a better understanding of the specific needs of
 local user communities in the Global South, as well as provide data that
 supports product and program development decision making process.
 *Ingesting Open Geodata: Observations from OpenStreetMap*
 By *Alan McConchie* http://stamen.com/studio/alan
 As Wikidata grapples with the
 challenges of ingesting external data sources such as Freebase, what
 lessons can we learn from other open knowledge projects that have had
 similar experiences? OpenStreetMap, often called "The Wikipedia of Maps,"
 is a crowdsourced geospatial data project covering the entire world. Since
 the earliest years of the project, OSM has combined user contributions with
 existing data imported from external sources. Within the OSM community,
 these imports have been controversial; some core OSM contributors complain
 that imported data is lower quality than user-contributed data, or that it
 discourages the growth of local mapping communities. In this talk, I'll
 review the history of data imports in OSM, and describe how OSM's
 best-practices have evolved over time in response to these critiques.






Re: [Analytics] Welcome Joseph

2015-02-18 Thread Leila Zia
Welcome to the team, Joseph!

b.t.w., I didn't know you had a background in NLP. That skill may come in
handy soon. ;-)

On Wed, Feb 18, 2015 at 6:37 PM, Toby Negrin tneg...@wikimedia.org wrote:

 Hi Everyone,

 I'd like to welcome Joseph Allemandou to the Analytics team! We are really
 excited to get someone of Joseph's calibre to help take our analytics work to
 the next level.

 In his own words:

 Joseph's experiences were mostly with private companies and almost always
 involved open source software. After a M.S. in Computer Science with a
 specialization in programming languages theory and a PhD in the Natural
 Language Processing and Dialog Systems fields, Joseph worked four years
 in Ireland. He spent two years at IBM learning and applying project
 management and process improvement methodologies, and two other years
 building a start-up to help English as a foreign language teachers find
 up-to-date teaching material. Then he moved back to France and worked for
 Criteo as a specialist in scalability for one year, and as a manager for
 another year. Lastly Joseph worked with Fotolia, where he built the
 analytics architecture and team. Working with the Wikimedia Foundation
 allows him to really apply his energy and skills in the direction he wishes
 the world to move.

 Joseph is based in Brittany, France. Welcome Joseph!

 -Toby





Re: [Analytics] DNT, standards, and expectations

2015-01-16 Thread Leila Zia
On Fri, Jan 16, 2015 at 4:56 PM, Ori Livneh o...@wikimedia.org wrote:


 On Fri, Jan 16, 2015 at 4:25 PM, Dario Taraborelli 
 dtarabore...@wikimedia.org wrote:

 I second Aaron’s concerns, which I previously expressed during the
 consultation about the new privacy policy. My main objection to the
 proposed solution is that by saying “Wikimedia honors DNT headers” we imply
 – by the most popular/de facto interpretation of DNT – that we *do* 3rd
 party tracking but we* allow* users to opt out, which puts WMF on par
 with aggressive tracking practices adopted by most sites.


 But we wouldn't say that; that would be silly. We'd make it completely
 clear that we are making use of the header that we think is consistent with
 the expectation of users but which departs from the standard in a
 significant way.


I, too, agree that this is something we (Comms) can handle through proper
communications. It's not a big concern for me.

Leila


Re: [Analytics] most clicked links in articles

2015-01-12 Thread Leila Zia
Hi Amir,

   We're working on a link improvement project [1] that will answer your
first two questions. The first round of tests will be on ptwiki, then
enwiki, and depending on the results we may add more languages. The
algorithm used is robust to the choice of language; its accuracy, however,
depends on the traffic the language receives.

   We will continue to update the project page as more results become
available.

Best,
Leila

[1] https://meta.wikimedia.org/wiki/Research:Improving_link_coverage

On Mon, Jan 12, 2015 at 5:44 AM, Amir E. Aharoni 
amir.ahar...@mail.huji.ac.il wrote:

 Hi,

 Are there metrics about which links in each article are the most clicked?

 I can think there's a lot to be learned from it:
 * Data-driven suggestions for manual of style about linking (too much and
 too few links are a perennial topic of argument)
 * How do people traverse between topics.
 * Which terms in the article may need a short explanation in parentheses
 rather than just a link.
 * How far down into the article do people bother to read.

 Anyway, I can think that accessibility to such data can optimize both
 readership and editing.

 And maybe this can be just taken right from the logs, without any
 additional EventLogging.

 --
 Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
 http://aharoni.wordpress.com
 ‪“We're living in pieces,
 I want to live in peace.” – T. Moore‬





Re: [Analytics] Per-namespace daily edit numbers

2015-01-08 Thread Leila Zia
Gergo, this table has edits per namespace aggregated by month. In your
original email, you ask for edit count and time of edit. If that's the
case, this table can't help (but how Aaron has generated this table can).

mmonth: last day of the month, as a date (the month column is in YYYY-MM form)
reverted: total number of reverted revisions done by the user (or reverts
by the user, I'm not 100% sure right now, but given your questions, you can
safely ignore this column).
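To make the table's use concrete, here is a toy sketch of aggregating
per-namespace monthly edit counts from a table shaped like
staging.editor_month_by_namespace (sqlite3 stand-in; the rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical miniature of staging.editor_month_by_namespace
cur.execute("""CREATE TABLE editor_month_by_namespace (
    wiki TEXT, month TEXT, user_id TEXT,
    page_namespace INTEGER, archived INTEGER, revisions INTEGER)""")
cur.executemany("INSERT INTO editor_month_by_namespace VALUES (?,?,?,?,?,?)", [
    ("enwiki", "2014-12", "42", 0, 0, 30),   # user 42: 30 article edits
    ("enwiki", "2014-12", "42", 6, 0, 5),    # user 42: 5 File: edits
    ("enwiki", "2014-12", "99", 6, 1, 7),    # user 99: 7 File: edits
])
# Total edits per namespace per month, across editors
rows = cur.execute("""SELECT month, page_namespace, SUM(revisions)
                      FROM editor_month_by_namespace
                      WHERE wiki = 'enwiki'
                      GROUP BY month, page_namespace
                      ORDER BY page_namespace""").fetchall()
```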

On Thu, Jan 8, 2015 at 10:30 AM, Aaron Halfaker ahalfa...@wikimedia.org
wrote:

 It turns out that I did do some pre-computing here.  See
 db1047.eqiad.wmnet:staging.editor_month_by_namespace

 [staging] explain editor_month_by_namespace;
 +----------------+--------------+------+-----+---------+-------+
 | Field          | Type         | Null | Key | Default | Extra |
 +----------------+--------------+------+-----+---------+-------+
 | wiki           | varchar(50)  | NO   | PRI |         |       |
 | month          | varbinary(7) | NO   | PRI |         |       |
 | user_id        | varchar(255) | NO   | PRI |         |       |
 | page_namespace | int(11)      | NO   | PRI | 0       |       |
 | archived       | int(11)      | YES  |     | NULL    |       |
 | revisions      | int(11)      | YES  |     | NULL    |       |
 | mmonth         | date         | YES  |     | NULL    |       |
 | reverted       | int(11)      | YES  |     | NULL    |       |
 +----------------+--------------+------+-----+---------+-------+
 8 rows in set (0.01 sec)

 As you'll notice, the table has a column for Wiki -- which means you can
 use it to do cross-wiki analysis.

 mmonth and reverted were added by Leila, so she'll need to comment on
 that.

 Otherwise:

- wiki - wikidb name (e.g. enwiki)
- month - YYYY-MM
- user_id - corresponds to user table
- page_namespace - namespace ID number
- archived - # of revisions to deleted pages
- revisions - # of all revisions (archived or not)

 -Aaron

 On Thu, Jan 8, 2015 at 9:00 AM, Dan Andreescu dandree...@wikimedia.org
 wrote:



 On Thu, Jan 8, 2015 at 2:33 AM, Oliver Keyes oke...@wikimedia.org
 wrote:

 On 8 January 2015 at 02:31, Gergo Tisza gti...@wikimedia.org wrote:
  On Wed, Jan 7, 2015 at 6:26 PM, Oliver Keyes oke...@wikimedia.org
 wrote:
 
  places to get edits? Wellthe revision table? I'm sort of confused
  as to what you're looking for, I guess, that the db wouldn't have.
 
 
  There are a thousand or so wikis; it would be nice if there was a
 single
  table with all the edits. I guess I can generate a query with a
 thousand
  unions...


 We agree.  And that's why we're building a data warehouse.  We are
 currently going back and forth with Sean vetting a load process that
 creates exactly the edit table as you describe it.  The nice thing about
 the schema we are putting together is that not only would you be able to
 see the namespace of the page at the time of query but also throughout the
 page's history (as it moves from draft to main, etc.)

 
  The harder problem is that it would be nice to group by editor activity
  levels. One of the concerns about MediaViewer was that it makes harder
 for
  new editors to understand file pages and start editing them; so it
 would be
  a plausible hypothesis that the number of file edits by new editors
 would
  drop sharply after making MV default, but the total file edit count
 wouldn't
  be visibly affected because it would be dominated by power users who
 already
  know how to curate image metadata.
 
  So I would like to look at something like the number of first edits per
  month, or the number of edits by editors who at the time had less than
 10
  edits... recovering that kind of data from the revision table seems
  extremely difficult.

 Yeah, that is difficult. Aaron has, I believe, precomputed some things;
 Aaron?


 IANAA (I am not an Aaron) but I'm happy to help with the query.  I know
 of most of the stuff Aaron pre-computed as of a couple of months ago and
 this specific thing wasn't done.  Gergo, if you could precisely spell out a
 few queries you'd like to do, I can translate to SQL and use the experience
 to inform our data warehouse work.








Re: [Analytics] WikiGrok and EventLogging

2015-01-07 Thread Leila Zia
Thanks everyone for chiming in. Your comments were very helpful. :-)

Nuria, I checked the per second pageview count for the pages wikigrok will
be live on, for 3 hours on 2015-01-07 (as a sample). We're talking about a
total of ~170 events per sec for these pages. Of course major events can
affect this number. This number added to the current 270 events per sec you
mentioned will send us over the 350 events per sec limit (if it's a hard
limit). What do you think?

Leila



On Wed, Jan 7, 2015 at 10:13 AM, Nuria Ruiz nu...@wikimedia.org wrote:

 Given that information, do you have any idea if we are in danger of
 overloading EventLogging?
 Logging broad events (such as a page load) 1-to-1 might run into problems,
 as our traffic is high enough that even events sampled at 1/1000 still occur
 in very large numbers.

 Some numbers (oversimplyfying and rounding)

 We have about 200 million visits per day for the enwiki mobile site. This
 means about 2300 pageviews per sec; if we send 1 load event per
 pageview, EL will (sadly) die, most likely.

 If we assume EL handles up to 350 events per second (and now we are at 270
 events per sec), I would think that sending 10 events per sec in your case
 would be pretty safe. That would mean sampling at about 1/200 for a load event
 on every pageview. This seems like a good upper bound.

 Now, since there are no constraints as to how long you keep your experiment
 running you can try a lower sampling ratio, say, 1/1000 and keep the
 experiment running for longer.
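The arithmetic behind these figures can be checked quickly (a sketch using
the thread's rounded numbers; the 350 events/sec ceiling is an assumption
from this discussion, not a documented limit):

```python
# Back-of-the-envelope check of the sampling numbers quoted above.
SECONDS_PER_DAY = 86_400
pageviews_per_sec = 200_000_000 / SECONDS_PER_DAY   # ~2315 enwiki mobile views/sec

capacity = 350       # assumed EventLogging ceiling, events/sec
current_load = 270   # events/sec already being logged
headroom = capacity - current_load                  # 80 events/sec to spare

budget = 10          # events/sec the new instrumentation allows itself
sampling_denominator = pageviews_per_sec / budget   # sample ~1 in 231 pageviews,
                                                    # i.e. roughly the 1/200 above
```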






 On Tue, Jan 6, 2015 at 5:50 PM, Ryan Kaldari rkald...@wikimedia.org
 wrote:

 The highest volume events we are going to log will be:
 1. For each of the 166,000 articles, one event when the page loads
 2. For each of the 166,000 articles, one event when the WikiGrok widget
 enters the viewport (about half as often as #1)

 These will be active for all mobile users, logged in and logged out,
 including many high pageview articles.

 Given that information, do you have any idea if we are in danger of
 overloading EventLogging? If so, do you have recommendations on sampling?
 So far, everyone has said not to worry about it, but it would be good to
 get a sanity check for this test specifically.

 Kaldari

 On Tue, Jan 6, 2015 at 4:57 PM, Nuria Ruiz nu...@wikimedia.org wrote:

 (cc-ing mobile-tech)

 Since we do not know the details of how WikiGrok is used or its throughput
 of requests, we cannot estimate sampling ourselves. I imagine WikiGrok has
 been deployed to a number of users, and it is with that usage that the mobile
 team could estimate the total expected throughput; with that throughput we
 can recommend sampling ratios.


 Thanks for asking about this before deploying!


 On Tue, Jan 6, 2015 at 4:55 PM, Ryan Kaldari rkald...@wikimedia.org
 wrote:

 I can elaborate on this after I finish the SWAT deployment. Gimme
 30 minutes or so.

 On Tue, Jan 6, 2015 at 4:51 PM, Leila Zia le...@wikimedia.org wrote:

 Hi,

   The mobile team is planning to switch WikiGrok on for non-logged in
 users next week (2015-01-12). The widget will be live on 166,029 article
 pages in enwiki. There are two EventLogging schema that may collect data
 heavily and we want to make sure EL can handle the influx of data.

 The two schema collecting data are:
 https://meta.wikimedia.org/wiki/Schema:MobileWebWikiGrok
 https://meta.wikimedia.org/wiki/Schema:MobileWebWikiGrokError
 and the list of pages affected is in:
 wgq_page in enwiki.wikigrok_questions.

It would be great if someone from the dev side could let us know whether
 we will need sampling.

 Thanks,
 Leila



 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics













[Analytics] WikiGrok and EventLogging

2015-01-06 Thread Leila Zia
Hi,

  The mobile team is planning to switch WikiGrok on for non-logged-in users
next week (2015-01-12). The widget will be on 166,029 article pages on
enwiki. There are two EventLogging schemas that may collect data heavily,
and we want to make sure EL can handle the influx of data.

The two schemas collecting data are:
https://meta.wikimedia.org/wiki/Schema:MobileWebWikiGrok
https://meta.wikimedia.org/wiki/Schema:MobileWebWikiGrokError
and the list of pages affected is in:
wgq_page in enwiki.wikigrok_questions.

   It would be great if someone from the dev side could let us know whether
we will need sampling.

Thanks,
Leila


Re: [Analytics] Getting Access to Wikipedia Database

2014-12-24 Thread Leila Zia
Hi Neta,

On Wed, Dec 24, 2014 at 7:19 AM, Neta Livneh neta.liv...@gmail.com wrote:


 Actually, this is a great opportunity to say that I would love to get you
 guys involved or at least hear insights from the analytics team regarding
 the project's direction.


Feel free to keep me in the loop for the latter.

Best,
Leila




 On Wed, Dec 24, 2014 at 4:39 PM, Aaron Halfaker ahalfa...@wikimedia.org
 wrote:

 Here's the instructions that Christian gave with some screenshots and
 discussion:
 https://meta.wikimedia.org/wiki/Research:Labs2/Getting_started_with_Tool_Labs

 If you're just looking to run a few queries, you might consider
 http://quarry.wmflabs.org which requires no shell access -- just a
 Wikimedia sites account.

 -Aaron

 On Wed, Dec 24, 2014 at 7:22 AM, Christian Aistleitner 
 christ...@quelltextlich.at wrote:

 Hi Neta,

 On Wed, Dec 24, 2014 at 11:28:33AM +0200, Neta Livneh wrote:
  For my project, we will need to run SQL queries on current Wikipedia data
  (mostly the revision history table).
 
  I already have a Gerrit account. Can I get SSH access for running such
  queries?

 It sounds like the redacted labs databases would nicely fit your use
 case. The easiest way to get access there is to apply for Tool Labs [1].

 To get access, please file a request through


 https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request

 (Many parts around the WMF are currently getting migrated to
 phabricator.wikimedia.org, so if someone knows a phabricator procedure
 for that please chime in!)


 Once you've got Tool Labs [1] access, you can ssh to

   tools-login.wmflabs.org

 and run

   sql enwiki

 on that host; it connects you to labsdb's enwiki database, where you can
 run your queries (similarly for other wikis).
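Once connected, revision-history queries of the kind Neta describes can be run directly. A minimal Python sketch that just composes such a query string; the page id, column list, and limit are illustrative (column names follow the standard MediaWiki `revision` table of the time):

```python
def revision_history_query(page_id: int, limit: int = 100) -> str:
    """Compose a revision-history query against the MediaWiki
    `revision` table, as exposed on the redacted labs replicas."""
    return (
        "SELECT rev_id, rev_timestamp, rev_user_text "
        "FROM revision "
        f"WHERE rev_page = {page_id} "
        "ORDER BY rev_timestamp DESC "
        f"LIMIT {limit};"
    )

# The resulting string can be pasted into the `sql enwiki` prompt.
query = revision_history_query(12345)
```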

 Have fun,
 Christian



 [1] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs
 has more information and links about Tool Labs.


 --
  quelltextlich e.U.  \\  Christian Aistleitner 
Companies' registry: 360296y in Linz
 Christian Aistleitner
 Kefermarkterstrasze 6a/3 Email:  christ...@quelltextlich.at
 4293 Gutau, Austria  Phone:  +43 7946 / 20 5 81
  Fax:+43 7946 / 20 5 81
  Homepage: http://quelltextlich.at/
 ---










Re: [Analytics] Switching the RD team to Phabricator

2014-12-15 Thread Leila Zia
On Mon, Dec 15, 2014 at 11:01 AM, Grace Gellerman ggeller...@wikimedia.org
wrote:

 I meant talking about Phab in RD Staff at 1:30pm Pacific tomorrow,


We haven't been holding the 1:30pm meeting in the weeks when we have the
combined staff meeting, which is once a month (it shouldn't be in the
calendar, but it is).

Leila


 On Mon, Dec 15, 2014 at 10:57 AM, Toby Negrin tneg...@wikimedia.org
 wrote:

 Sure -- I need to send out an agenda for consolidated staff tomorrow, but
 perhaps you and Kevin can share your experiences and talk about the dev
 process and use of fab?

 [Note -- we will have staff tomorrow]

 -Toby

 On Mon, Dec 15, 2014 at 10:53 AM, Grace Gellerman 
 ggeller...@wikimedia.org wrote:

 Should we talk more about this in our Research staff meeting on Tuesday?

  I agree that we need to figure out the prioritization process first. Also,
  keeping two systems active could lead to requests going into two places,
  which could affect the prioritization workflow.

 I see the advantages to using Phab, but haste can sometimes, ya know,
 make waste.  So let's talk about this more...

 On Mon, Dec 15, 2014 at 10:48 AM, Toby Negrin tneg...@wikimedia.org
 wrote:

  To be clear - I do not want to move to Phabricator without reviewing our
  prioritization process.

 Shall we make this a Q3 goal since people seem really into it?

 On Dec 15, 2014, at 10:44 AM, Leila Zia le...@wikimedia.org wrote:

 Hi Oliver,

I'd like to give Phabricator a try. I suggest the following steps if
 we decide to do it:

1. We block a 15-min team time in December during which RD will
play with Phabricator in https://phab-01.wmflabs.org/ If we all
feel reasonably comfortable, then,
2. We switch to Phabricator in January, at the beginning of Q3, and
we aim to try it for one quarter [1]. If it works, great, if not,
3. We go back to Trello.

 Leila

 [1] During the quarter, we keep Trello but we don't update it. If we
 figure out some time during the quarter that we can't work with Phabricator
 at all, we switch back to Trello and all we need to do is to add Q3's tasks
 back to it.

 On Mon, Dec 15, 2014 at 7:48 AM, Dan Andreescu 
 dandree...@wikimedia.org wrote:

 So those are the reasons I have, off the top of my head. Other
 reasons? Counter-arguments? Post em here.


 Agreed, Phabricator is a good tool, and great if it creates a single
 place where we manage our projects.

  It's not all great though.  Trello has a much more refined notification
  system and a much prettier interface, Mingle has more powerful queries,
  Asana is more to the point, etc.  We've been over these and decided as an
  organization to move towards Phabricator, and I'm certainly not trying to
  dig up old wounds.  I do want to say, though: bring those positive
  experiences with you when you switch from a system you like.  Phabricator
  is open source, and the Facebook team that maintains it has been super
  friendly and helpful to us during our migration.  If we criticize it
  constructively, Phabricator will only get better.
















Re: [Analytics] Switching the RD team to Phabricator

2014-12-15 Thread Leila Zia
On Mon, Dec 15, 2014 at 10:48 AM, Toby Negrin tneg...@wikimedia.org wrote:


 Shall we make this a Q3 goal since people seem really into it?


I'm not sure. If it involves figuring out prioritization, it can be a good
idea.



 On Dec 15, 2014, at 10:44 AM, Leila Zia le...@wikimedia.org wrote:

 Hi Oliver,

I'd like to give Phabricator a try. I suggest the following steps if we
 decide to do it:

1. We block a 15-min team time in December during which RD will play
with Phabricator in https://phab-01.wmflabs.org/ If we all feel
reasonably comfortable, then,
2. We switch to Phabricator in January, at the beginning of Q3, and we
aim to try it for one quarter [1]. If it works, great, if not,
3. We go back to Trello.

 Leila

 [1] During the quarter, we keep Trello but we don't update it. If we
 figure out some time during the quarter that we can't work with Phabricator
 at all, we switch back to Trello and all we need to do is to add Q3's tasks
 back to it.

 On Mon, Dec 15, 2014 at 7:48 AM, Dan Andreescu dandree...@wikimedia.org
 wrote:

 So those are the reasons I have, off the top of my head. Other reasons?
 Counter-arguments? Post em here.


 Agreed, Phabricator is a good tool, and great if it creates a single
 place where we manage our projects.

 It's not all great though.  Trello has a much more refined notification
 system and much prettier interface.  Mingle has more powerful queries,
 Asana is more to the point, etc.  We've been over these and decided as an
 organization to move towards Phabricator, I'm certainly not trying to dig
 up old wounds.  I do want to say though: bring those positive experiences
 when you switch from a system you like.  Because Phabricator is open source
 and the Facebook team that maintains it has been super friendly and helpful
 to us during our migration.  If we criticize it constructively, Phabricator
 will only get better.








Re: [Analytics] EventLogging data QA

2014-12-15 Thread Leila Zia
On Mon, Dec 15, 2014 at 10:06 AM, Toby Negrin tneg...@wikimedia.org wrote:

 I share Christian's concerns -

 Dario/Leila - can you comment based on your recent experiences with
 WikiGrok?


I agree with Christian.

QA in beta labs is good but not enough. We still need to do QA when a
feature goes to production and currently, it's very hard to figure out if
there's a problem with logging. An example:

While testing WikiGrok in production, we learned that at some point events
from the Firefox browser on my machine stopped being logged. We did not get
any errors for this. I only found out because I was trying to manually trace
my activities and see if I could stitch them together and make sense of
them. We eventually figured out what was going on in that case [1], but it
concerns me that there may be other important events that we never log to
the DB, without ever knowing that we're not logging them.

Leila
[1]
https://lists.wikimedia.org/pipermail/analytics/2014-December/002864.html



 Thanks

 -Toby


  On Dec 15, 2014, at 9:42 AM, Christian Aistleitner 
 christ...@quelltextlich.at wrote:
 
  Hi,
 
  On Mon, Dec 15, 2014 at 08:34:39AM -0800, Kevin Leduc wrote:
  I closed the Phabricator task with links to this thread and the wikitech
  doc for testing on the beta cluster.
 
  I am fine with keeping the task closed.
 
  But I am somewhat surprised to see beta mentioned in the
  resolution. Note that Dario's request set scope as [1]
 
   However, there are types of data quality issues that we only
   discover when collecting data at scale and in the wild (on
   browsers/platforms that we don’t necessarily test for internally).
 
  . That's a valid scope, but from my point of view, beta does not match
  that scope.
 
  Neither is beta large scale, nor is it hammered on with crazy devices.
 
  Beta just halves the distance between EventLogging's dev server
  (Vagrant!) and production.
 
  Have fun,
  Christian
 
 
 
  [1]
 https://lists.wikimedia.org/pipermail/analytics/2014-December/002884.html
 
 
 




Re: [Analytics] EventLogging data QA

2014-12-15 Thread Leila Zia
On Monday, December 15, 2014, Kevin Leduc ke...@wikimedia.org wrote:


 I'd like to move this to a video conference call between analytics
 developers and analytics engineering to come to a mutual understanding of
 what the current pain points are and what's the biggest priority.


It probably makes sense to have someone from RD with experience in QA in
that meeting (Dario if you want a more experienced person, myself
otherwise). Not sure if you meant the same when you said analytics
engineering.

Leila






 On Mon, Dec 15, 2014 at 4:37 PM, Nuria Ruiz nu...@wikimedia.org wrote:

 QA in beta labs is good but not enough. We still need to do QA when a
 feature goes to production and currently
 This is true, but at the same time I do not see anything in the
 description of your FF events that could not be tested on beta labs. If we
 are talking about ad-blocking, that can be tested even earlier; Vagrant
 would be a fine venue. All the issues related to the client (browser) not
 emitting events can easily be tested in the development environment.



 On Mon, Dec 15, 2014 at 4:18 PM, Leila Zia le...@wikimedia.org wrote:


 On Mon, Dec 15, 2014 at 10:06 AM, Toby Negrin tneg...@wikimedia.org wrote:

 I share Christian's concerns -

 Dario/Leila - can you comment based on your recent experiences with
 WikiGrok?


 I agree with Christian.

 QA in beta labs is good but not enough. We still need to do QA when a
 feature goes to production and currently, it's very hard to figure out if
 there's a problem with logging. An example:

  While testing WikiGrok in production, we learned that at some point events
  from the Firefox browser on my machine stopped being logged. We did not get
  any errors for this. I only found out because I was trying to manually
  trace my activities and see if I could stitch them together and make sense
  of them. We eventually figured out what was going on in that case [1], but
  it concerns me that there may be other important events that we never log
  to the DB, without ever knowing that we're not logging them.

 Leila
 [1]
 https://lists.wikimedia.org/pipermail/analytics/2014-December/002864.html



 Thanks

 -Toby


   On Dec 15, 2014, at 9:42 AM, Christian Aistleitner 
  christ...@quelltextlich.at wrote:
 
  Hi,
 
  On Mon, Dec 15, 2014 at 08:34:39AM -0800, Kevin Leduc wrote:
   I closed the Phabricator task with links to this thread and the wikitech
   doc for testing on the beta cluster.
 
  I am fine with keeping the task closed.
 
  But I am somewhat surprised to see beta mentioned in the
  resolution. Note that Dario's request set scope as [1]
 
   However, there are types of data quality issues that we only
   discover when collecting data at scale and in the wild (on
   browsers/platforms that we don’t necessarily test for internally).
 
  . That's a valid scope, but from my point of view, beta does not match
  that scope.
 
  Neither is beta large scale, nor is it hammered on with crazy devices.
 
   Beta just halves the distance between EventLogging's dev server
   (Vagrant!) and production.
 
  Have fun,
  Christian
 
 
 
  [1]
 https://lists.wikimedia.org/pipermail/analytics/2014-December/002884.html
 
 
 








[Analytics] EventLogging and Adblock on Linux/Firefox

2014-12-11 Thread Leila Zia
Hi everyone,

   From some initial tests it appears to me that EventLogging is not
logging events from Linux/Firefox when Adblock is enabled. I'm on Ubuntu
14.04, Firefox 34.0, and Adblock Plus 2.6.6. When I disable Adblock, I see
event.gif?{...} in Console, when I enable it, I don't. Just to make sure,
I've checked the EL tables and my events don't get registered there.

   I'd be happy to sit with someone to troubleshoot before 4pm (PST), after
5:30pm (PST) or tomorrow.

Leila


Re: [Analytics] EventLogging and Adblock on Linux/Firefox

2014-12-11 Thread Leila Zia
Good catch; I didn't know there are several of them.

My test was with Adblock Plus on Linux/Firefox. I installed Adblock Plus on
Chrome just now and tested. Linux/Chrome logs events without a problem.

On Thu, Dec 11, 2014 at 3:00 PM, Federico Leva (Nemo) nemow...@gmail.com
wrote:

 Is everyone talking of the same extension here? Adblock Plus is not the
 same as AdBlock.
 https://en.wikipedia.org/wiki/Adblock_Plus
 https://en.wikipedia.org/wiki/AdBlock
 (Adblock was yet another.)

 Nemo





Re: [Analytics] EventLogging and Adblock on Linux/Firefox

2014-12-11 Thread Leila Zia
Ori, I have the /event/gif? filter rule enabled under EasyPrivacy:
https://easylist-downloads.adblockplus.org/easyprivacy.txt

On Thu, Dec 11, 2014 at 3:01 PM, Ori Livneh o...@wikimedia.org wrote:

 The version of AdBlock itself is not significant; what is significant is
 the filter subscription, which is a set of URL patterns and element
 selectors that AdBlock will block. The most popular one (and the one that
 is installed by default) is EasyList, and I do not see EventLogging in its
 list:

 https://easylist-downloads.adblockplus.org/easylist.txt

 Other filter subscriptions are listed here:
 https://easylist.adblockplus.org/en/

 I checked the popular English-language ones and did not see EventLogging
 filtering. bits.wikimedia.org/geoiplookup and
 meta.wikimedia.org/wiki/Special:RecordImpression are filtered, but the
 fundraising team should be aware of that already.
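A filter subscription is essentially a list of URL patterns. A heavily simplified sketch of the matching logic (real Adblock filter syntax also supports anchors such as `||` and `|`, wildcards, and `$` options; the beacon URL below is an assumed example of the EventLogging request shape, not a captured request):

```python
def matches_filter(url: str, rule: str) -> bool:
    """Heavily simplified Adblock-style check: a plain rule like
    "/event.gif?" is treated as a substring match against the URL."""
    return rule in url

# Assumed beacon URL shape for illustration only.
beacon = "https://bits.wikimedia.org/event.gif?%7B%22event%22%3A%7B%7D%7D"
assert matches_filter(beacon, "/event.gif?")       # EasyPrivacy-style rule hits
assert not matches_filter(beacon, "Special:RecordImpression")
```

This is why the subscription in use matters more than the extension version: the same browser blocks or passes the beacon depending purely on which rule lists are enabled.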

 On Thu, Dec 11, 2014 at 2:58 PM, Dan Garry dga...@wikimedia.org wrote:

 Mac OS X
 Chrome 39.0.2171.71 (64-bit)
 Adblock Plus 1.8.8

 Dan

 On 11 December 2014 at 14:57, Toby Negrin tneg...@wikimedia.org wrote:

 Dan -- what platform?

 On Thu, Dec 11, 2014 at 2:56 PM, Dan Garry dga...@wikimedia.org wrote:

 And, from looking at the tables, I can confirm my action was logged
 successfully.

 Dan

 On 11 December 2014 at 14:50, Dan Garry dga...@wikimedia.org wrote:

 FWIW, I tested this quickly by going to en.m.wikipedia.org on my
 browser (which has Adblock Plus 1.8.8) and clicking the left nav, which I
 know has EventLogging attached to it on mobile web. I saw event.gif in the
 console, transmitting the following JSON:


  {"event": {"name": "hamburger-settings",
    "destination": "/w/index.php?title=Special:MobileOptions&returnto=Main+Page",
    "mobileMode": "stable",
    "username": "Deskana (WMF)",
    "userEditCount": 223},
   "clientValidated": true,
   "revision": 5929948,
   "schema": "MobileWebClickTracking",
   "webHost": "en.m.wikipedia.org",
   "wiki": "enwiki"}


 As far as I know, I use default Adblock settings.

 If you want to do some troubleshooting with my browser as a control, I
 am in the office chilling on the hammock, in spite of the rain [1].

 Thanks,
 Dan

 [1]: Yes, that's what it is. Rain. You Californians might call it
 stormageddon or whatever, but the rest of the world calls it rain. ;-)

 On 11 December 2014 at 14:15, Leila Zia le...@wikimedia.org wrote:

 Hi everyone,

From some initial tests it appears to me that EventLogging is not
 logging events from Linux/Firefox when Adblock is enabled. I'm on Ubuntu
 14.04, Firefox 34.0, and Adblock Plus 2.6.6. When I disable Adblock, I 
 see
 event.gif?{...} in Console, when I enable it, I don't. Just to make sure,
 I've checked the EL tables and my events don't get registered there.

I'd be happy to sit with someone to troubleshoot before 4pm (PST),
 after 5:30pm (PST) or tomorrow.

 Leila






 --
 Dan Garry
 Associate Product Manager, Mobile Apps
 Wikimedia Foundation















