Re: [Analytics] [Discussion] User agent data releases

2015-03-05 Thread Dario Taraborelli
Heads up: after a review with Legal, we decided that we should not release
the sampled raw dataset. Oliver is now working on making parsed UA data
available.

> On Mar 5, 2015, at 10:52 AM, Oliver Keyes  wrote:
> 
> Just a clarifying note: Dario still needs to review the actual
> methodology. While Legal have approved it from their end, they've also
> made clear that this is contingent on the anonymisation methodology
> passing muster from an R&D point of view.
> 
> On 5 March 2015 at 12:39, Oliver Keyes  wrote:
>> Just an FYI that Legal have approved this release under the
>> anonymisation procedures we've set out (thanks Michelle!) on the
>> condition that Dario, too, is comfortable with them. Dario?
>> 
>> On 4 March 2015 at 17:16, Oliver Keyes  wrote:
>>> So it's distinct people, globally - and I deliberately made it woolly
>>> by operating over username, which means the threshold is fuzzy
>>> (i.e., at a minimum it's 50. At a maximum it's 50x[number of wikis]).
>>> 
>>> It's very deliberately dimension-free: user_agent,
>>> edit_count_in_non_specified_90_day_period, and that's it.

Re: [Analytics] [Discussion] User agent data releases

2015-03-05 Thread Oliver Keyes
Just a clarifying note: Dario still needs to review the actual
methodology. While Legal have approved it from their end, they've also
made clear that this is contingent on the anonymisation methodology
passing muster from an R&D point of view.

On 5 March 2015 at 12:39, Oliver Keyes  wrote:
> Just an FYI that Legal have approved this release under the
> anonymisation procedures we've set out (thanks Michelle!) on the
> condition that Dario, too, is comfortable with them. Dario?
>
> On 4 March 2015 at 17:16, Oliver Keyes  wrote:
>> So it's distinct people, globally - and I deliberately made it woolly
>> by operating over username, which means the threshold is fuzzy
>> (i.e., at a minimum it's 50. At a maximum it's 50x[number of wikis]).
>>
>> It's very deliberately dimension-free: user_agent,
>> edit_count_in_non_specified_90_day_period, and that's it.

Re: [Analytics] [Discussion] User agent data releases

2015-03-05 Thread Oliver Keyes
Just an FYI that Legal have approved this release under the
anonymisation procedures we've set out (thanks Michelle!) on the
condition that Dario, too, is comfortable with them. Dario?

On 4 March 2015 at 17:16, Oliver Keyes  wrote:
> So it's distinct people, globally - and I deliberately made it woolly
> by operating over username, which means the threshold is fuzzy
> (i.e., at a minimum it's 50. At a maximum it's 50x[number of wikis]).
>
> It's very deliberately dimension-free: user_agent,
> edit_count_in_non_specified_90_day_period, and that's it.

Re: [Analytics] [Discussion] User agent data releases

2015-03-04 Thread Oliver Keyes
So it's distinct people, globally - and I deliberately made it woolly
by operating over username, which means the threshold is fuzzy
(i.e., at a minimum it's 50. At a maximum it's 50x[number of wikis]).

It's very deliberately dimension-free: user_agent,
edit_count_in_non_specified_90_day_period, and that's it.
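
For readers following along, here is a minimal sketch of the thresholding described above. It is not the code actually used for the release. It assumes the input is a list of (username, user_agent) pairs, one per edit, pooled across wikis; the function name and the example agent strings are hypothetical. It keeps only user agents seen with at least 50 distinct usernames and emits just the agent and its edit count.

```python
from collections import defaultdict

MIN_DISTINCT_USERNAMES = 50  # threshold quoted in the thread


def filter_user_agents(edits, min_users=MIN_DISTINCT_USERNAMES):
    """Return {user_agent: edit_count} for sufficiently common agents.

    `edits` is an iterable of (username, user_agent) pairs, one per edit
    (an assumed input shape, not the real checkuser schema).
    """
    usernames = defaultdict(set)    # user_agent -> distinct usernames seen
    edit_counts = defaultdict(int)  # user_agent -> number of edits
    for username, user_agent in edits:
        usernames[user_agent].add(username)
        edit_counts[user_agent] += 1
    # Drop any agent used by fewer than `min_users` distinct usernames.
    return {
        agent: edit_counts[agent]
        for agent, names in usernames.items()
        if len(names) >= min_users
    }


if __name__ == "__main__":
    sample = [(f"user{i}", "CommonBrowser/1.0") for i in range(60)]
    sample += [(f"user{i}", "RareBrowser/0.1") for i in range(3)]
    print(filter_user_agents(sample))
```

Because the output carries no wiki, date, or user dimension, a rare agent cannot be tied back to a small project from this table alone.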

On 4 March 2015 at 17:12, Aaron Halfaker  wrote:
> Assuming this was public, I could use this data on seldom edited Wikis to
> find out which editors likely have old browser/OS versions with
> vulnerabilities that I could attack[1].  This would be easier and easier the
> more dimensions you add to the data.
>
> 
>
> OK.  The anonymization strategy for dropping records that represent < 50
> distinct editors seems to address this concern.   50 edits is a lot.  So
> this data wouldn't be too terribly useful for under-active wikis.  Then
> again, if you just want to get a sense for what the dominant browser/OS pairs
> are, then they will likely represent > 50 unique editors on most projects.
>
> 1. Props to Matt Flaschen and Dan Andreescu for helping me work through the
> implications of that one.

Re: [Analytics] [Discussion] User agent data releases

2015-03-04 Thread Aaron Halfaker
Assuming this was public, I could use this data on seldom edited Wikis to
find out which editors likely have old browser/OS versions with
vulnerabilities that I could attack[1].  This would be easier and easier
the more dimensions you add to the data.



OK.  The anonymization strategy for dropping records that represent < 50
distinct editors seems to address this concern.   50 edits is a lot.  So
this data wouldn't be too terribly useful for under-active wikis.  Then
again, if you just want to get a sense for what the dominant browser/OS pairs
are, then they will likely represent > 50 unique editors on most projects.

1. Props to Matt Flaschen and Dan Andreescu for helping me work through the
implications of that one.

On Tue, Mar 3, 2015 at 9:59 PM, Oliver Keyes  wrote:

> Yeah, makes sense.
>
> On 3 March 2015 at 20:38, Nuria Ruiz  wrote:
> >>Agreed. Do we have a way of syncing files to Labs yet?
> > No need to sync if file is available in an endpoint like
> > http://some-data-here

Re: [Analytics] [Discussion] User agent data releases

2015-03-03 Thread Oliver Keyes
Yeah, makes sense.

On 3 March 2015 at 20:38, Nuria Ruiz  wrote:
>>Agreed. Do we have a way of syncing files to Labs yet?
> No need to sync if file is available in an endpoint like
> http://some-data-here
>
> On Tue, Mar 3, 2015 at 4:50 PM, Oliver Keyes  wrote:
>>
>> On 3 March 2015 at 19:35, Nuria Ruiz  wrote:
>> >>Erik has asked me to write an exploratory app for user-agent data. The
>> >>idea is to enable Product Managers and engineers to easily explore
>> >>what users use so they know what to support. I've thrown up an example
>> >>screenshot at http://ironholds.org/agents_example_screen.png
>> >
>> > I cannot speak to the community's interest in this data, but for
>> > developers and PMs we should make sure we have a solid way to update any
>> > data we put up. User agent data is outdated as soon as a new version of
>> > Android or iOS is released, a new popular phone comes along, or a new
>> > autoupdate ships for popular browsers. Not only that, if we make changes
>> > to, say, redirect all iPad users to the desktop site, we want to assess
>> > the effect of those changes as soon as possible. A monthly update will be
>> > a must. Also, distinguishing between browser percentages on the desktop
>> > site versus the mobile site versus apps is a must for this data to be
>> > really useful for PMs and developers (especially for bug triage).
>> >
>>
>> Yes! However, I am addressing a specific ad-hoc request. If there is a
>> need for this (I agree there is) I hope Toby and Kevin can eke out the
>> time on the Analytics Engineering schedule to work on it; y'all are a
>> lot better at infrastructure work than me :).
>>
>> >
>> > We have a couple of backlog items to make monthly reports in this regard.
>> > A UI on top of them will be superb.
>> >
>>
>> Agreed. Do we have a way of syncing files to Labs yet? That's the
>> biggest blocker. The UI doesn't care what the file contains as long as
>> it's a TSV with a header row - I've deliberately built it so that
>> things like the download links are dynamic and can change.
>>
>> >
>> >
>> >
>> >
>> > On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes 
>> > wrote:
>> >>
>> >> Hey all,
>> >>
>> >> (Sending this to the public list because it's more transparent and I'd
>> >> like people who think this data is useful to be able to shout out)
>> >>
>> >> Erik has asked me to write an exploratory app for user-agent data. The
>> >> idea is to enable Product Managers and engineers to easily explore
>> >> what users use so they know what to support. I've thrown up an example
>> >> screenshot at http://ironholds.org/agents_example_screen.png  (I'd
>> >> host it on Commons, inb4Dario, but I'm not sure the copyright status
>> >> of the UI)
>> >>
>> >> One side-effect of this is that we end up with files of common user
>> >> agents, split between {readers,editors} and {mobile, desktop}, parsed
>> >> and unparsed. I'd like to release these files. The reuse potential is
>> >> twofold; researchers and engineers can use the parsed files to see
>> >> what browser penetration looks like globally and what browsers should
>> >> be supported at a top-10, and software engineers can use the unparsed
>> >> files to improve detection rates.
>> >>
>> >> The privacy implications /should/ be minimal, because of how this data
>> >> is gathered. The editor data is gathered from the checkuser table,
>> >> globally, and automatically excludes any user agent used by fewer than
>> >> 50 distinct usernames. The reader data is gathered from a month of
>> >> 1:1000 sampled log files, and excludes any agent responsible for fewer
>> >> than 500 pageviews in a 24 hour period (except, sampled. So,
>> >> practically speaking, that's 500,000 pageviews)
>> >>
>> >> What do people think about making this a data release? Would people
>> >> get value from the data, as well as the tool?
>> >>
>> >> --
>> >> Oliver Keyes
>> >> Research Analyst
>> >> Wikimedia Foundation
>> >>
>> >> ___
>> >> Analytics mailing list
>> >> Analytics@lists.wikimedia.org
>> >> https://lists.wikimedia.org/mailman/listinfo/analytics



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation



Re: [Analytics] [Discussion] User agent data releases

2015-03-03 Thread Nuria Ruiz
>Agreed. Do we have a way of syncing files to Labs yet?
No need to sync if file is available in an endpoint like
http://some-data-here

On Tue, Mar 3, 2015 at 4:50 PM, Oliver Keyes  wrote:

> On 3 March 2015 at 19:35, Nuria Ruiz  wrote:
> >>Erik has asked me to write an exploratory app for user-agent data. The
> >>idea is to enable Product Managers and engineers to easily explore
> >>what users use so they know what to support. I've thrown up an example
> >>screenshot at http://ironholds.org/agents_example_screen.png
> >
> > I cannot speak as to the interest of community about this data but for
> > developers and PM we should make sure we have a solid way to update any
> data
> > we put up. User Agent data is outdated as soon as a new version of
> android
> > or iOs is released, a new popular phone comes along or a new autoupdate
> for
> > popular browsers. Not only that, if we make changes to, say, redirect all
> > iPad users to the desktop site we want to asses effect of those changes
> as
> > soon as possible. A monthly update will be a must. Also distinguishing
> > between browser percentages on desktop site versus mobile site versus
> apps
> > is a must for this data to be real useful for PMs and developers
> (specially
> > for bug triage).
> >
>
> Yes! However, I am addressing a specific ad-hoc request. If there is a
> need for this (I agree there is) I hope Toby and Kevin can eke out the
> time on the Analytics Engineering schedule to work on it; y'all are a
> lot better at infrastructure work than me :).
>
> >
> > We have couple backlog items to make monthly reports on this regard. A
> UI on
> > top of them will be superb.
> >
>
> Agreed. Do we have a way of syncing files to Labs yet? That's the
> biggest blocker. The UI doesn't care what the file contains as long as
> it's a TSV with a header row - I've deliberately built it so that
> things like the download links are dynamic and can change.
>
> >
> >
> >
> >
> > On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes 
> wrote:
> >>
> >> Hey all,
> >>
> >> (Sending this to the public list because it's more transparent and I'd
> >> like people who think this data is useful to be able to shout out)
> >>
> >> Erik has asked me to write an exploratory app for user-agent data. The
> >> idea is to enable Product Managers and engineers to easily explore
> >> what users use so they know what to support. I've thrown up an example
> >> screenshot at http://ironholds.org/agents_example_screen.png  (I'd
> >> host it on Commons, inb4Dario, but I'm not sure the copyright status
> >> of the UI)
> >>
> >> One side-effect of this is that we end up with files of common user
> >> agents, split between {readers,editors} and {mobile, desktop}, parsed
> >> and unparsed. I'd like to release these files. The reuse potential is
> >> twofold; researchers and engineers can use the parsed files to see
> >> what browser penetration looks like globally and what browsers should
> >> be supported at a top-10, and software engineers can use the unparsed
> >> files to improve detection rates.
> >>
> >> The privacy implications /should/ be minimal, because of how this data
> >> is gathered. The editor data is gathered from the checkuser table,
> >> globally, and automatically excludes any user agent used by fewer than
> >> 50 distinct usernames. The reader data is gathered from a month of
> >> 1:1000 sampled log files, and excludes any agent responsible for fewer
> >> than 500 sampled pageviews in a 24 hour period (since the logs are
> >> sampled, that's practically speaking 500,000 pageviews).
> >>
> >> What do people think about making this a data release? Would people
> >> get value from the data, as well as the tool?
> >>
> >> --
> >> Oliver Keyes
> >> Research Analyst
> >> Wikimedia Foundation
> >>
> >> ___
> >> Analytics mailing list
> >> Analytics@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation


Re: [Analytics] [Discussion] User agent data releases

2015-03-03 Thread Oliver Keyes
On 3 March 2015 at 19:35, Nuria Ruiz  wrote:
>>Erik has asked me to write an exploratory app for user-agent data. The
>>idea is to enable Product Managers and engineers to easily explore
>>what users use so they know what to support. I've thrown up an example
>>screenshot at http://ironholds.org/agents_example_screen.png
>
> I cannot speak to the community's interest in this data, but for
> developers and PMs we should make sure we have a solid way to update any
> data we put up. User agent data is outdated as soon as a new version of
> Android or iOS is released, a new popular phone comes along, or a new
> auto-update ships for popular browsers. Not only that: if we make changes
> to, say, redirect all iPad users to the desktop site, we want to assess
> the effect of those changes as soon as possible. A monthly update will be
> a must. Also, distinguishing between browser percentages on the desktop
> site versus the mobile site versus apps is a must for this data to be
> really useful for PMs and developers (especially for bug triage).
>

Yes! However, I am addressing a specific ad-hoc request. If there is a
need for this (I agree there is) I hope Toby and Kevin can eke out the
time on the Analytics Engineering schedule to work on it; y'all are a
lot better at infrastructure work than me :).

>
> We have a couple of backlog items to make monthly reports in this regard.
> A UI on top of them will be superb.
>

Agreed. Do we have a way of syncing files to Labs yet? That's the
biggest blocker. The UI doesn't care what the file contains as long as
it's a TSV with a header row - I've deliberately built it so that
things like the download links are dynamic and can change.
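
As a minimal sketch of what "a TSV with a header row" implies for a consumer, the Python below reads column names from the first line instead of hard-coding them. The column names and values in the sample are hypothetical; the thread does not specify the layout of the real files, and this is not the code behind the app.

```python
import csv
import io


def read_tsv(handle):
    """Return (column_names, rows) from a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)  # each row is a dict keyed by the header names
    return reader.fieldnames, rows


# Hypothetical sample data; real column names are an assumption.
sample = io.StringIO(
    "browser\tos\trequests\n"
    "Firefox\tWindows\t1200\n"
    "Safari\tiOS\t800\n"
)
columns, rows = read_tsv(sample)
print(columns)
print(rows[0]["browser"])
```

Because nothing depends on specific column names, a file with different or additional columns can be swapped in without changing the reader.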

>
>
>
>
> On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes  wrote:
>>
>> Hey all,
>>
>> (Sending this to the public list because it's more transparent and I'd
>> like people who think this data is useful to be able to shout out)
>>
>> Erik has asked me to write an exploratory app for user-agent data. The
>> idea is to enable Product Managers and engineers to easily explore
>> what users use so they know what to support. I've thrown up an example
>> screenshot at http://ironholds.org/agents_example_screen.png (I'd
>> host it on Commons, inb4Dario, but I'm not sure of the copyright
>> status of the UI).
>>
>> One side-effect of this is that we end up with files of common user
>> agents, split between {readers,editors} and {mobile, desktop}, parsed
>> and unparsed. I'd like to release these files. The reuse potential is
>> twofold; researchers and engineers can use the parsed files to see
>> what browser penetration looks like globally and what browsers should
>> be supported at a top-10, and software engineers can use the unparsed
>> files to improve detection rates.
>>
>> The privacy implications /should/ be minimal, because of how this data
>> is gathered. The editor data is gathered from the checkuser table,
>> globally, and automatically excludes any user agent used by fewer than
>> 50 distinct usernames. The reader data is gathered from a month of
>> 1:1000 sampled log files, and excludes any agent responsible for fewer
>> than 500 sampled pageviews in a 24 hour period (since the logs are
>> sampled, that's practically speaking 500,000 pageviews).
>>
>> What do people think about making this a data release? Would people
>> get value from the data, as well as the tool?
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation



Re: [Analytics] [Discussion] User agent data releases

2015-03-03 Thread Nuria Ruiz
>Erik has asked me to write an exploratory app for user-agent data. The
>idea is to enable Product Managers and engineers to easily explore
>what users use so they know what to support. I've thrown up an example
>screenshot at http://ironholds.org/agents_example_screen.png

I cannot speak to the community's interest in this data, but for
developers and PMs we should make sure we have a solid way to update any
data we put up. User agent data is outdated as soon as a new version of
Android or iOS is released, a new popular phone comes along, or a new
auto-update ships for popular browsers. Not only that: if we make changes
to, say, redirect all iPad users to the desktop site, we want to assess
the effect of those changes as soon as possible. A monthly update will be
a must. Also, distinguishing between browser percentages on the desktop
site versus the mobile site versus apps is a must for this data to be
really useful for PMs and developers (especially for bug triage).
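
To make the desktop/mobile/apps breakdown concrete, here is a rough Python sketch that turns per-site request counts into browser percentages. The site labels, browser names, and counts are made up for illustration; none of them come from real data or from an existing report.

```python
from collections import Counter, defaultdict


def browser_shares(rows):
    """rows: iterable of (site, browser, count). Returns {site: {browser: percent}}."""
    totals = defaultdict(Counter)
    for site, browser, count in rows:
        totals[site][browser] += count

    shares = {}
    for site, counts in totals.items():
        site_total = sum(counts.values())
        if site_total == 0:
            continue  # nothing to report for an empty site
        shares[site] = {b: 100.0 * c / site_total for b, c in counts.items()}
    return shares


# Hypothetical counts, for illustration only.
rows = [
    ("desktop", "Chrome", 600),
    ("desktop", "Firefox", 400),
    ("mobile", "Safari", 750),
    ("mobile", "Chrome", 250),
    ("app", "Android app", 100),
]
shares = browser_shares(rows)
print(shares["desktop"])
print(shares["mobile"])
```

Percentages are computed within each site, so a browser that dominates the mobile site is not diluted by desktop traffic.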


We have a couple of backlog items to make monthly reports in this regard.
A UI on top of them will be superb.





On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes  wrote:

> Hey all,
>
> (Sending this to the public list because it's more transparent and I'd
> like people who think this data is useful to be able to shout out)
>
> Erik has asked me to write an exploratory app for user-agent data. The
> idea is to enable Product Managers and engineers to easily explore
> what users use so they know what to support. I've thrown up an example
> screenshot at http://ironholds.org/agents_example_screen.png (I'd
> host it on Commons, inb4Dario, but I'm not sure of the copyright
> status of the UI).
>
> One side-effect of this is that we end up with files of common user
> agents, split between {readers,editors} and {mobile, desktop}, parsed
> and unparsed. I'd like to release these files. The reuse potential is
> twofold; researchers and engineers can use the parsed files to see
> what browser penetration looks like globally and what browsers should
> be supported at a top-10, and software engineers can use the unparsed
> files to improve detection rates.
>
> The privacy implications /should/ be minimal, because of how this data
> is gathered. The editor data is gathered from the checkuser table,
> globally, and automatically excludes any user agent used by fewer than
> 50 distinct usernames. The reader data is gathered from a month of
> 1:1000 sampled log files, and excludes any agent responsible for fewer
> than 500 sampled pageviews in a 24 hour period (since the logs are
> sampled, that's practically speaking 500,000 pageviews).
>
> What do people think about making this a data release? Would people
> get value from the data, as well as the tool?
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation


[Analytics] [Discussion] User agent data releases

2015-03-03 Thread Oliver Keyes
Hey all,

(Sending this to the public list because it's more transparent and I'd
like people who think this data is useful to be able to shout out)

Erik has asked me to write an exploratory app for user-agent data. The
idea is to enable Product Managers and engineers to easily explore
what users use so they know what to support. I've thrown up an example
screenshot at http://ironholds.org/agents_example_screen.png (I'd
host it on Commons, inb4Dario, but I'm not sure of the copyright
status of the UI).

One side-effect of this is that we end up with files of common user
agents, split between {readers,editors} and {mobile, desktop}, parsed
and unparsed. I'd like to release these files. The reuse potential is
twofold; researchers and engineers can use the parsed files to see
what browser penetration looks like globally and what browsers should
be supported at a top-10, and software engineers can use the unparsed
files to improve detection rates.

The privacy implications /should/ be minimal, because of how this data
is gathered. The editor data is gathered from the checkuser table,
globally, and automatically excludes any user agent used by fewer than
50 distinct usernames. The reader data is gathered from a month of
1:1000 sampled log files, and excludes any agent responsible for fewer
than 500 sampled pageviews in a 24 hour period (since the logs are
sampled, that's practically speaking 500,000 pageviews).
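
The two thresholds above can be sketched roughly as follows in Python. This is an illustration of the rule as described, not the actual query used; the function names and the (username, user_agent) record shape are assumptions, and the agent strings are placeholders.

```python
from collections import defaultdict

MIN_DISTINCT_USERS = 50       # editor threshold described above
SAMPLING_RATE = 1000          # logs are sampled at 1:1000
MIN_SAMPLED_PAGEVIEWS = 500   # reader threshold, counted in sampled requests


def frequent_editor_agents(records, min_users=MIN_DISTINCT_USERS):
    """records: iterable of (username, user_agent) pairs.

    Keep only agents used by at least min_users distinct usernames,
    i.e. drop any agent used by fewer than min_users.
    """
    users_by_agent = defaultdict(set)
    for username, agent in records:
        users_by_agent[agent].add(username)
    return {
        agent: len(users)
        for agent, users in users_by_agent.items()
        if len(users) >= min_users
    }


def estimated_pageviews(sampled_count, rate=SAMPLING_RATE):
    """Scale a sampled request count back up to an estimate of real pageviews."""
    return sampled_count * rate


# Hypothetical records: AgentA has 60 distinct users, AgentB only 10.
records = [("user%d" % i, "AgentA") for i in range(60)]
records += [("user%d" % i, "AgentB") for i in range(10)]
records += [("user0", "AgentB")] * 100  # repeat rows from one user do not count

kept = frequent_editor_agents(records)
print(kept)
print(estimated_pageviews(MIN_SAMPLED_PAGEVIEWS))
```

Counting distinct usernames rather than rows is what keeps one very active editor from pushing a rare agent over the threshold.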

What do people think about making this a data release? Would people
get value from the data, as well as the tool?

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
