Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-18 Thread Kerry Raymond
The thing about sockpuppets is that we only know about the ones that have been 
detected (and some of them have been large groups of 100s of accounts). The 
problem is that we don’t know about the undetected ones. I am sure many of us 
have had suspicions about the behaviour of certain accounts but to request a 
sockpuppet investigation requires a level of evidence above suspicious 
behaviour (specifically identifying another account). New users with 
sophisticated editing skills and writing on topics associated with living 
individuals, businesses or products in a positive way often seem to me to be 
the kind of account likely to be doing undisclosed paid editing, and therefore
almost certainly a sockpuppet of a paid PR person, but if each account
writes about a different topic, it is difficult to work out what the other 
accounts might be to look for evidence of sockpuppeting.

 

How far underwater does the iceberg go?

 

Kerry 

 

From: Giovanni Luca Ciampaglia [mailto:glciamp...@gmail.com] 
Sent: Tuesday, 19 March 2019 11:37 AM
To: Research into Wikimedia content and communities 

Cc: Kerry Raymond 
Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia

 

Does anybody know how prevalent sockpuppets are? Has anybody tried estimating
the percentage of editors that have created at least one additional account?
(Legitimate or otherwise.)

 

Giovanni 

 

On Mon, Mar 18, 2019, 20:20 Stuart A. Yeates <syea...@gmail.com> wrote:

In addition to Kerry's excellent examples there are users editing
Wikipedia through Tor, the anonymity and censorship circumvention
network. These users face extra scrutiny.

cheers
stuart


--
...let us be heard from red core to black sky

On Tue, 19 Mar 2019 at 13:04, Kerry Raymond <kerry.raym...@gmail.com> wrote:
>
> Apart from the legitimate alternate accounts and the illegitimate sockpuppet 
> accounts, there are other ways that alternate accounts exist.
>
> Occasional contributors often forget their username and/or password. Password 
> recovery isn't possible unless you provide an email address at sign-up (it's 
> optional, but you can add it later). So what such people then do is just
> create a new user account (I'm not sure there is anything else they can do). 
> I see this sort of behaviour a lot at events. The other variation of the 
> problem is that they did provide an email address but it is one not easily 
> accessible to them at the event (e.g. a librarian who signed up with a work
> email address that cannot be accessed outside of the organisation).
>
> The other group of people with multiple accounts are those who edit 
> anonymously as serial IPs. The same person can use a number of IP numbers 
> over time. Often you don't realise it is the same person unless you see a lot 
> of their work and can see a pattern in it. For example, at the moment, there 
> is a person using a series of IP addresses who is changing a common section
> of a Queensland place article to be a subsection of another, whom I notice on
> my watchlist. This person appears to acquire a new IP address every week or
> so, but the pattern of editing makes it obvious it's the same person behind 
> it. Whether or not an IP address can be considered "an account" depends on 
> your purposes. The one IP address can also be used by multiple people (e.g. 
> coming through a shared organisational network in a library or school). It is 
> claimed by some people that many new users do their first edits anonymously, 
> so if you are serious about studying "new contributors", then maybe you have 
> to look at anonymous editing. And also even regular contributors may 
> sometimes choose to edit anonymously, e.g. being in an insecure IT
> environment and reluctant to use their username/password in that situation 
> (particularly people with administrator or other significant access rights).
>
> Because I do outreach, I look for new accounts that turn up on my watchlist 
> and send them welcome messages etc. Because I also do training, I see a lot 
> of genuinely new people in action where I can observe their edits. So when I 
> see new accounts or IPs doing far more "sophisticated" edits than I see new 
> users do, I tend to suspect they are not genuinely new contributors.
>
> I think the best you can do is look for new accounts and be prepared to omit 
> any that show signs of sophisticated editing (either in terms of what they are
> doing technically or what they say on Talk pages or in edit summaries). For
> example, no genuine new user will mention a policy (they don't know policies
> exist). Also, genuine new users don't tend to edit that quickly, so any rapid-fire
> series of successful edits is unlikely to come from a genuine new user. I
> think this inability to know if a new account represent

Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-18 Thread Giovanni Luca Ciampaglia
Does anybody know how prevalent sockpuppets are? Has anybody tried
estimating the percentage of editors that have created at least one
additional account? (Legitimate or otherwise.)

Giovanni


On Mon, Mar 18, 2019, 20:20 Stuart A. Yeates  wrote:

> In addition to Kerry's excellent examples there are users editing
> Wikipedia through Tor, the anonymity and censorship circumvention
> network. These users face extra scrutiny.
>
> cheers
> stuart
>
>
> --
> ...let us be heard from red core to black sky
>
> On Tue, 19 Mar 2019 at 13:04, Kerry Raymond 
> wrote:
> >
> > Apart from the legitimate alternate accounts and the illegitimate
> sockpuppet accounts, there are other ways that alternate accounts exist.
> >
> > Occasional contributors often forget their username and/or password.
> Password recovery isn't possible unless you provide an email address at
> sign-up (it's optional, but you can add it later). So what such people
> then do is just create a new user account (I'm not sure there is anything
> else they can do). I see this sort of behaviour a lot at events. The other
> variation of the problem is that they did provide an email address but it
> is one not easily accessible to them at the event (e.g. a librarian who
> signed up with a work email address that cannot be accessed outside of the
> organisation).
> >
> > The other group of people with multiple accounts are those who edit
> anonymously as serial IPs. The same person can use a number of IP numbers
> over time. Often you don't realise it is the same person unless you see a
> lot of their work and can see a pattern in it. For example, at the moment,
> there is a person using a series of IP addresses who is changing a common
> section of a Queensland place article to be a subsection of another, whom I
> notice on my watchlist. This person appears to acquire a new IP address
> every week or so, but the pattern of editing makes it obvious it's the same
> person behind it. Whether or not an IP address can be considered "an
> account" depends on your purposes. The one IP address can also be used by
> multiple people (e.g. coming through a shared organisational network in a
> library or school). It is claimed by some people that many new users do
> their first edits anonymously, so if you are serious about studying "new
> contributors", then maybe you have to look at anonymous editing. And also
> even regular contributors may sometimes choose to edit anonymously, e.g.
> being in an insecure IT environment and reluctant to use their
> username/password in that situation (particularly people with administrator
> or other significant access rights).
> >
> > Because I do outreach, I look for new accounts that turn up on my
> watchlist and send them welcome messages etc. Because I also do training, I
> see a lot of genuinely new people in action where I can observe their
> edits. So when I see new accounts or IPs doing far more "sophisticated"
> edits than I see new users do, I tend to suspect they are not genuinely new
> contributors.
> >
> > I think the best you can do is look for new accounts and be prepared to
> omit any that show signs of sophisticated editing (either in terms of what they
> are doing technically or what they say on Talk pages or in edit summaries).
> For example, no genuine new user will mention a policy (they don't know
> policies exist). Also, genuine new users don't tend to edit that quickly, so any
> rapid-fire series of successful edits is unlikely to come from a genuine new
> user. I think this inability to know if a new account represents a
> genuinely new user is an inherent limitation for your research and should
> be documented as such explaining the many circumstances in which new
> accounts might belong to non-new users.
> >
> > Kerry
> >
> > -Original Message-
> > From: Wiki-research-l [mailto:wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of Pine W
> > Sent: Tuesday, 19 March 2019 5:27 AM
> > To: Research into Wikimedia content and communities <wiki-research-l@lists.wikimedia.org>
> > Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
> >
> > Hi Haifeng,
> >
> > Some users will state on user pages that an account is an alternate
> account. However, this practice is not followed by everyone, and those who
> do follow this practice aren't required to do so in a uniform way.
> >
> > Alternate accounts which are not labeled as such, and which are used for
> illegitimate purposes such as double voting, are an ongoing problem. You
> might be interested in the English Wikipedia page
> https://en.wikipedia.org/wiki/Wikipedia:Sock_puppetry.

Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-18 Thread Adam Jenkins
A quick and dirty solution might be to use the HostBot list from the
Teahouse:
https://en.wikipedia.org/wiki/Wikipedia:Teahouse/Hosts/Database_reports

The list is regularly refreshed, so you could pull the account names from there
over the course of a month and then randomly select your sample, noting
that it is biased towards new editors who have made more than 10 edits.
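
The sampling step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the account names are made-up placeholders, and how the names are actually collected from the report page is left out entirely. The helper name `sample_accounts` is hypothetical.

```python
import random


def sample_accounts(names, k=100, seed=None):
    """Return up to k distinct account names chosen uniformly at random."""
    # Drop duplicates that appear when the same account shows up in several
    # snapshots of the report; sorting makes seeded runs reproducible.
    unique = sorted(set(names))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


# Hypothetical names gathered over a month of report snapshots.
collected = ["ExampleUserA", "ExampleUserB", "ExampleUserA", "ExampleUserC"]
picked = sample_accounts(collected, k=2, seed=2019)
```

Fixing the seed makes the draw repeatable, which may help when documenting the sample. The bias towards accounts with more than 10 edits noted above still applies to whatever is drawn this way.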

Otherwise perhaps using recent changes, but filtering for logged actions by
new users?
https://en.wikipedia.org/wiki/Special:RecentChanges?userExpLevel=newcomer=1=1=1=1=1=50=7=2


On Wed, 13 Mar 2019 at 04:49, Haifeng Zhang  wrote:

> Hi folks,
>
> My work needs to randomly sample new editors in each month, e.g., 100
> editors per month.
>
> Do any of you have good suggestions for how to do this efficiently?
>
> I could think of using the dump files, but wonder are there other options?
>
>
> Thanks,
>
> Haifeng Zhang
> ___
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>


Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-18 Thread Stuart A. Yeates
In addition to Kerry's excellent examples there are users editing
Wikipedia through Tor, the anonymity and censorship circumvention
network. These users face extra scrutiny.

cheers
stuart


--
...let us be heard from red core to black sky

On Tue, 19 Mar 2019 at 13:04, Kerry Raymond  wrote:
>
> Apart from the legitimate alternate accounts and the illegitimate sockpuppet 
> accounts, there are other ways that alternate accounts exist.
>
> Occasional contributors often forget their username and/or password. Password 
> recovery isn't possible unless you provide an email address at sign-up (it's 
> optional, but you can add it later). So what such people then do is just
> create a new user account (I'm not sure there is anything else they can do). 
> I see this sort of behaviour a lot at events. The other variation of the 
> problem is that they did provide an email address but it is one not easily 
> accessible to them at the event (e.g. a librarian who signed up with a work
> email address that cannot be accessed outside of the organisation).
>
> The other group of people with multiple accounts are those who edit 
> anonymously as serial IPs. The same person can use a number of IP numbers 
> over time. Often you don't realise it is the same person unless you see a lot 
> of their work and can see a pattern in it. For example, at the moment, there 
> is a person using a series of IP addresses who is changing a common section
> of a Queensland place article to be a subsection of another, whom I notice on
> my watchlist. This person appears to acquire a new IP address every week or
> so, but the pattern of editing makes it obvious it's the same person behind 
> it. Whether or not an IP address can be considered "an account" depends on 
> your purposes. The one IP address can also be used by multiple people (e.g. 
> coming through a shared organisational network in a library or school). It is 
> claimed by some people that many new users do their first edits anonymously, 
> so if you are serious about studying "new contributors", then maybe you have 
> to look at anonymous editing. And also even regular contributors may 
> sometimes choose to edit anonymously, e.g. being in an insecure IT
> environment and reluctant to use their username/password in that situation 
> (particularly people with administrator or other significant access rights).
>
> Because I do outreach, I look for new accounts that turn up on my watchlist 
> and send them welcome messages etc. Because I also do training, I see a lot 
> of genuinely new people in action where I can observe their edits. So when I 
> see new accounts or IPs doing far more "sophisticated" edits than I see new 
> users do, I tend to suspect they are not genuinely new contributors.
>
> I think the best you can do is look for new accounts and be prepared to omit 
> any that show signs of sophisticated editing (either in terms of what they are
> doing technically or what they say on Talk pages or in edit summaries). For
> example, no genuine new user will mention a policy (they don't know policies
> exist). Also, genuine new users don't tend to edit that quickly, so any rapid-fire
> series of successful edits is unlikely to come from a genuine new user. I
> think this inability to know if a new account represents a genuinely new user 
> is an inherent limitation for your research and should be documented as such 
> explaining the many circumstances in which new accounts might belong to 
> non-new users.
>
> Kerry
>
> -Original Message-
> From: Wiki-research-l [mailto:wiki-research-l-boun...@lists.wikimedia.org] On 
> Behalf Of Pine W
> Sent: Tuesday, 19 March 2019 5:27 AM
> To: Research into Wikimedia content and communities 
> 
> Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
>
> Hi Haifeng,
>
> Some users will state on user pages that an account is an alternate account. 
> However, this practice is not followed by everyone, and those who do follow 
> this practice aren't required to do so in a uniform way.
>
> Alternate accounts which are not labeled as such, and which are used for 
> illegitimate purposes such as double voting, are an ongoing problem. You 
> might be interested in the English Wikipedia page 
> https://en.wikipedia.org/wiki/Wikipedia:Sock_puppetry.
>
> Alternate accounts can also be used for legitimate purposes, such as people 
> who have one account for their professional or academic activities and 
> another account for their personal use.
>
> Good luck with your project.
>
> Pine
> ( https://meta.wikimedia.org/wiki/User:Pine )
>
>
> On Thu, Mar 14, 2019 at 1:30 PM Haifeng Zhang 
> wrote:
>
> > Stuart,
> >
> > I'm building an agent

Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-18 Thread Pine W
Hi Haifeng,

Some users will state on user pages that an account is an alternate
account. However, this practice is not followed by everyone, and those who
do follow this practice aren't required to do so in a uniform way.

Alternate accounts which are not labeled as such, and which are used for
illegitimate purposes such as double voting, are an ongoing problem. You
might be interested in the English Wikipedia page
https://en.wikipedia.org/wiki/Wikipedia:Sock_puppetry.

Alternate accounts can also be used for legitimate purposes, such as people
who have one account for their professional or academic activities and
another account for their personal use.

Good luck with your project.

Pine
( https://meta.wikimedia.org/wiki/User:Pine )


On Thu, Mar 14, 2019 at 1:30 PM Haifeng Zhang 
wrote:

> Stuart,
>
> I'm building an agent-based simulation of Wikipedia collaboration.
>
> I would like my model to be empirically grounded, so I need to collect
> data for new editors.
>
> Alternative accounts can be an issue, but I wonder is there a way to
> identify editors who have multiple accounts?
>
>
> Thanks,
>
> Haifeng Zhang
>


Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-14 Thread Haifeng Zhang
Stuart,

I'm building an agent-based simulation of Wikipedia collaboration.

I would like my model to be empirically grounded, so I need to collect data for 
new editors.

Alternative accounts can be an issue, but I wonder is there a way to identify 
editors who have multiple accounts?


Thanks,

Haifeng Zhang

From: Wiki-research-l  on behalf 
of Stuart A. Yeates 
Sent: Wednesday, March 13, 2019 6:31:26 PM
To: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia

On Thu, 14 Mar 2019 at 09:16, Haifeng Zhang  wrote:
>
> Thanks a lot for the help, Finn. Now my query can draw a sample of newly
> registered editors.

To repeat a point I made earlier in the thread: this query deals with
accounts not editors. Many at the coalface consider this to be a very
important difference. You appear not to have shared enough of your
research project for us to tell whether it's going to matter for you.

cheers
stuart



Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-13 Thread Stuart A. Yeates
On Thu, 14 Mar 2019 at 09:16, Haifeng Zhang  wrote:
>
> Thanks a lot for the help, Finn. Now my query can draw a sample of newly
> registered editors.

To repeat a point I made earlier in the thread: this query deals with
accounts not editors. Many at the coalface consider this to be a very
important difference. You appear not to have shared enough of your
research project for us to tell whether it's going to matter for you.

cheers
stuart



Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-13 Thread Haifeng Zhang
Thanks a lot for the help, Finn. Now my query can draw a sample of newly
registered editors.


Best,

Haifeng Zhang

From: Wiki-research-l  on behalf 
of f...@imm.dtu.dk 
Sent: Wednesday, March 13, 2019 12:01:59 PM
To: wiki-research-l@lists.wikimedia.org
Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia

Haifeng,


On 13/03/2019 15:56, Haifeng Zhang wrote:
> Thanks for pointing me to Quarry, Finn.
>
> I tried a couple of queries, but I'm not sure why they all took forever to
> return results.

I am not familiar with Quarry. It might have a timeout. The user table
associated with the English Wikipedia is quite large, so any operation
on that may take a long time.

You might be able to get "timein" with a simplified SQL query. For instance,
the query below takes 52.35 seconds:

USE enwiki_p;

SELECT user_id, user_name, user_registration, user_editcount
FROM user
LIMIT 1000
OFFSET 3200;



> Is it possible to download the relevant MediaWiki database tables (e.g., user,
> user_groups, logging) and run SQL on my local machine?

There are SQL files available at
https://dumps.wikimedia.org/enwiki/20190301/ but I do not think the user
table is there; at least I cannot identify it. Perhaps other people
would know.

You might be able to try Toolforge ( https://tools.wmflabs.org/ ). You
should be able to access the tables via mysql at the prompt.

Log in to dev.tools.wmflabs.org
Then do "sql enwiki"

Read more about Toolforge here:
https://wikitech.wikimedia.org/wiki/Help:Toolforge


/Finn

>
> Thanks,
>
> Haifeng Zhang
> 
> From: Wiki-research-l  on behalf 
> of f...@imm.dtu.dk 
> Sent: Tuesday, March 12, 2019 7:25:53 PM
> To: wiki-research-l@lists.wikimedia.org
> Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
>
> Haifeng ,
>
>
> While some suggest the dumps or noticeboards, my immediate thought was
> a database query, e.g., through Quarry. It just happens that Jonathan T.
> Morgan has created a query there:
>
> https://quarry.wmflabs.org/query/310
>
> SELECT user_id, user_name, user_registration, user_editcount
>  FROM enwiki_p.user
>  WHERE user_registration > DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 1
> DAY),'%Y%m%d%H%i%s')
>  AND user_editcount > 10
>  AND user_id NOT IN (SELECT ug_user FROM enwiki_p.user_groups WHERE
> ug_group = 'bot')
>  AND user_name not in (SELECT REPLACE(log_title,"_"," ") from
> enwiki_p.logging
>  where log_type = "block" and log_action = "block"
>  and log_timestamp >  DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 2
> DAY),'%Y%m%d%H%i%s'));
>
>
> You may fork that query. R. Stuart Geiger (Staeiou)'s fork at
> https://quarry.wmflabs.org/query/34256 , which queries by month, is
> another example.
>
>
>
> Finn Årup Nielsen
> http://people.compute.dtu.dk/faan/
>
>
> On 12/03/2019 19:18, Haifeng Zhang wrote:
>> Hi folks,
>>
>> My work needs to randomly sample new editors in each month, e.g., 100 
>> editors per month.
>>
>> Do any of you have good suggestions for how to do this efficiently?
>>
>> I could think of using the dump files, but wonder are there other options?
>>
>>
>> Thanks,
>>
>> Haifeng Zhang


Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-13 Thread fn

Haifeng,


On 13/03/2019 15:56, Haifeng Zhang wrote:

Thanks for pointing me to Quarry, Finn.

I tried a couple of queries, but I'm not sure why they all took forever to
return results.


I am not familiar with Quarry. It might have a timeout. The user table 
associated with the English Wikipedia is quite large, so any operation 
on that may take a long time.


You might be able to get "timein" with a simplified SQL query. For instance,
the query below takes 52.35 seconds:


USE enwiki_p;

SELECT user_id, user_name, user_registration, user_editcount
FROM user
LIMIT 1000
OFFSET 3200;




Is it possible to download the relevant MediaWiki database tables (e.g., user,
user_groups, logging) and run SQL on my local machine?


There are SQL files available at
https://dumps.wikimedia.org/enwiki/20190301/ but I do not think the user
table is there; at least I cannot identify it. Perhaps other people
would know.


You might be able to try Toolforge ( https://tools.wmflabs.org/ ). You
should be able to access the tables via mysql at the prompt.


Log in to dev.tools.wmflabs.org
Then do "sql enwiki"

Read more about Toolforge here: 
https://wikitech.wikimedia.org/wiki/Help:Toolforge



/Finn



Thanks,

Haifeng Zhang

From: Wiki-research-l  on behalf of 
f...@imm.dtu.dk 
Sent: Tuesday, March 12, 2019 7:25:53 PM
To: wiki-research-l@lists.wikimedia.org
Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia

Haifeng ,


While some suggest the dumps or noticeboards, my immediate thought was
a database query, e.g., through Quarry. It just happens that Jonathan T.
Morgan has created a query there:

https://quarry.wmflabs.org/query/310

SELECT user_id, user_name, user_registration, user_editcount
 FROM enwiki_p.user
 WHERE user_registration > DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 1
DAY),'%Y%m%d%H%i%s')
 AND user_editcount > 10
 AND user_id NOT IN (SELECT ug_user FROM enwiki_p.user_groups WHERE
ug_group = 'bot')
 AND user_name not in (SELECT REPLACE(log_title,"_"," ") from
enwiki_p.logging
 where log_type = "block" and log_action = "block"
 and log_timestamp >  DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 2
DAY),'%Y%m%d%H%i%s'));


You may fork that query. R. Stuart Geiger (Staeiou)'s fork at
https://quarry.wmflabs.org/query/34256 , which queries by month, is
another example.



Finn Årup Nielsen
http://people.compute.dtu.dk/faan/


On 12/03/2019 19:18, Haifeng Zhang wrote:

Hi folks,

My work needs to randomly sample new editors in each month, e.g., 100 editors 
per month.

Do any of you have good suggestions for how to do this efficiently?

I could think of using the dump files, but wonder are there other options?


Thanks,

Haifeng Zhang


Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-13 Thread Haifeng Zhang
This can be a good option too. Thanks, Isaac.


Haifeng Zhang

Postdoctoral Research Fellow
Human-Computer Interaction Institute
Carnegie Mellon University

From: Wiki-research-l  on behalf 
of Isaac Johnson 
Sent: Tuesday, March 12, 2019 5:21:11 PM
To: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia

Hey Haifeng,
If you decide to process the dumps, you should be able to easily repurpose
some quick code that I wrote for a similar project:
https://github.com/geohci/miscellaneous-wikimedia/tree/master/editor-turnover

Notably, I'd suggest using the stub history dumps as they are much smaller
because they do not include the actual content. For instance, for March 1st
and English Wikipedia (https://dumps.wikimedia.org/enwiki/20190301/), this
file would be enwiki-20190301-stub-meta-history.xml.gz and is 57.9 GB.
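
For readers who do not want to start from the linked repository, the dump-based approach might look roughly like the sketch below. It is not the code from that repository. It assumes that "new editor" means "registered account whose earliest revision falls in a given month", it ignores IP edits, and the tiny XML sample is invented purely so the example runs on its own. The function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, e.g. '{ns}revision' -> 'revision'."""
    return tag.rsplit("}", 1)[-1]


def first_edit_by_user(stream):
    """Map each registered username to the timestamp of its earliest revision."""
    first = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = local(elem.tag)
        if tag == "page":
            elem.clear()  # release the finished page's children
            continue
        if tag != "revision":
            continue
        timestamp = None
        username = None
        for child in elem:
            name = local(child.tag)
            if name == "timestamp":
                timestamp = child.text
            elif name == "contributor":
                for sub in child:
                    if local(sub.tag) == "username":
                        username = sub.text
        # ISO 8601 timestamps in UTC compare correctly as plain strings.
        if username and timestamp:
            if username not in first or timestamp < first[username]:
                first[username] = timestamp
        elem.clear()
    return first


def new_editors_by_month(first):
    """Group usernames by the 'YYYY-MM' of their earliest revision."""
    by_month = {}
    for username, timestamp in first.items():
        by_month.setdefault(timestamp[:7], []).append(username)
    return {month: sorted(names) for month, names in sorted(by_month.items())}


# Invented sample data, shaped like a stub history export.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <id>1</id>
      <timestamp>2019-02-03T10:00:00Z</timestamp>
      <contributor><username>Alice</username><id>11</id></contributor>
    </revision>
    <revision>
      <id>2</id>
      <timestamp>2019-03-01T09:00:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
    </revision>
  </page>
  <page>
    <title>Other</title>
    <revision>
      <id>3</id>
      <timestamp>2019-01-15T08:00:00Z</timestamp>
      <contributor><username>Alice</username><id>11</id></contributor>
    </revision>
    <revision>
      <id>4</id>
      <timestamp>2019-03-05T08:00:00Z</timestamp>
      <contributor><username>Bob</username><id>12</id></contributor>
    </revision>
  </page>
</mediawiki>"""

first = first_edit_by_user(io.BytesIO(SAMPLE.encode("utf-8")))
months = new_editors_by_month(first)
```

On the real file, the stream would presumably come from something like `gzip.open("enwiki-20190301-stub-meta-history.xml.gz", "rb")`. Note that the dictionary holds one entry per username, so memory use on the full English Wikipedia history may be substantial.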

Best,
Isaac

On Tue, Mar 12, 2019 at 3:56 PM Pine W  wrote:

> Hi Haifeng, thanks for the information. I think that your idea of looking
> in the dumps makes sense. Am I understanding correctly that you would like
> advice regarding how to do that in the most efficient way?
>
> Hi Leila, I believe that I asked for more information regarding Haifeng's
> work. There has been discussion on English Wikipedia regarding volunteers
> being unhappy with the interventions or proposed interventions of
> researchers. I think that asking about the nature of Haifeng's research is
> legitimate, and I tried to provide some examples of possible types of
> research. I'm trying to protect the community from problematic
> interventions, while also welcoming research that is accepted by the
> community.
>
> Pine
> ( https://meta.wikimedia.org/wiki/User:Pine )
>
>
> On Tue, Mar 12, 2019 at 8:00 PM Haifeng Zhang 
> wrote:
>
> > Pine and Stuart,
> >
> > I meant extracting a random sample of new editors (month by month) from
> > Wikipedia edit history.
> >
> > It is not about survey of new editors, but still thanks for your
> > suggestions.
> >
> >
> > Thanks,
> > Haifeng Zhang
> >
> > Postdoctoral Research Fellow
> > Human-Computer Interaction Institute
> > Carnegie Mellon University
> > 
> > From: Wiki-research-l  on
> > behalf of Stuart A. Yeates 
> > Sent: Tuesday, March 12, 2019 3:46:19 PM
> > To: Research into Wikimedia content and communities
> > Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
> >
> > There are a number of new-editor-heavy noticeboards. I would suggest
> > posting an invite there to your survey (or whatever). If you ask for
> > editors' usernames, you can filter out those who don't meet your
> > definition of 'new'.
> >
> > I'm thinking of places like:
> > https://en.wikipedia.org/wiki/Wikipedia:Teahouse and
> > https://en.wikipedia.org/wiki/Wikipedia:Help_desk
> >
> > cheers
> > stuart
> >
> >
> > --
> > ...let us be heard from red core to black sky
> >
> > On Wed, 13 Mar 2019 at 08:37, Leila Zia  wrote:
> > >
> > > Hi Pine,
> > >
> > > Haifeng has a simple question about how to sample editors other than
> > > via dumps. It would be great if someone who knows the answer could help
> > > them move forward.
> > >
> > > If you are interested to learn more about their research, instead of
> > > answering their question, my recommendation would be to start the
> > > conversation with: "can you tell us more about your research?" kind of
> > > question. I find the current way of communication very speculative,
> > > and that is not good for making a vibrant research community that can
> > > help us address some of our big questions.
> > >
> > > Best,
> > > Leila
> > >
> > > On Tue, Mar 12, 2019 at 12:08 PM Pine W  wrote:
> > > >
> > > > Hi, can you expand on what you mean by "sample"? If you're referring
> > > > to analyzing users' edit histories then that should be fine. However,
> > > > if you're planning to send surveys or messages to them, sending them
> > > > barnstars, or otherwise manipulating their on-wiki experience, that
> > > > would be problematic.
> > > >
> > > > Pine
> > > > ( https://meta.wikimedia.org/wiki/User:Pine )
> > > >
> > > >
> > > > On Tue, Mar 12, 2019 at 6:19 PM Haifeng Zhang <haife...@andrew.cmu.edu>
> > > > wrote:
> > > >
> > > > >

Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-13 Thread Haifeng Zhang
Thanks for pointing me to Quarry, Finn.

I tried a couple of queries, but I'm not sure why they all took forever to
return results.

Is it possible to download the relevant MediaWiki database tables (e.g., user,
user_groups, logging) and run SQL on my local machine?


Thanks,

Haifeng Zhang

From: Wiki-research-l  on behalf 
of f...@imm.dtu.dk 
Sent: Tuesday, March 12, 2019 7:25:53 PM
To: wiki-research-l@lists.wikimedia.org
Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia

Haifeng ,


While some suggest the dumps or noticeboards, my immediate thought was
a database query, e.g., through Quarry. It just happens that Jonathan T.
Morgan has created a query there:

https://quarry.wmflabs.org/query/310

SELECT user_id, user_name, user_registration, user_editcount
FROM enwiki_p.user
WHERE user_registration > DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 1
DAY),'%Y%m%d%H%i%s')
AND user_editcount > 10
AND user_id NOT IN (SELECT ug_user FROM enwiki_p.user_groups WHERE
ug_group = 'bot')
AND user_name not in (SELECT REPLACE(log_title,"_"," ") from
enwiki_p.logging
where log_type = "block" and log_action = "block"
and log_timestamp >  DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 2
DAY),'%Y%m%d%H%i%s'));


You may fork that query. R. Stuart Geiger (Staeiou)'s fork at
https://quarry.wmflabs.org/query/34256 , which queries by month, is
another example.



Finn Årup Nielsen
http://people.compute.dtu.dk/faan/


On 12/03/2019 19:18, Haifeng Zhang wrote:
> Hi folks,
>
> My work needs to randomly sample new editors in each month, e.g., 100 editors 
> per month.
>
> Do any of you have good suggestions for how to do this efficiently?
>
> I could think of using the dump files, but wonder are there other options?
>
>
> Thanks,
>
> Haifeng Zhang
> ___
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread fn

Haifeng ,


While some suggest the dumps or notice boards, my immediate thought was 
a database query, e.g., through Quarry. It just happens that Jonathan T. 
Morgan has created a query there:


https://quarry.wmflabs.org/query/310

SELECT user_id, user_name, user_registration, user_editcount
FROM enwiki_p.user
-- accounts registered within the last day
WHERE user_registration > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 DAY), '%Y%m%d%H%i%s')
  AND user_editcount > 10
  -- exclude bot accounts
  AND user_id NOT IN (SELECT ug_user FROM enwiki_p.user_groups WHERE ug_group = 'bot')
  -- exclude accounts blocked within the last two days
  AND user_name NOT IN (
    SELECT REPLACE(log_title, '_', ' ')
    FROM enwiki_p.logging
    WHERE log_type = 'block' AND log_action = 'block'
      AND log_timestamp > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 2 DAY), '%Y%m%d%H%i%s')
  );



You may fork that query. As another example, R. Stuart Geiger (Staeiou)'s 
fork at https://quarry.wmflabs.org/query/34256 queries for a month.
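
For the month-by-month sampling asked about in this thread, a per-month variant of the query above could be generated along the following lines. This is only a sketch: the helper name monthly_sample_query, the sample size of 100, and the use of ORDER BY RAND() with LIMIT are assumptions for illustration, not part of either Quarry query, and it has not been run against the replicas. It relies on user_registration being stored in the same YYYYMMDDHHMMSS form that the DATE_FORMAT call above produces.

```python
def monthly_sample_query(year, month, n=100):
    """Build a SQL string that draws n random accounts registered in one month.

    Hypothetical helper: the table and column names are taken from the
    Quarry query in this thread; everything else is an assumption.
    """
    # Half-open interval: [start of this month, start of next month).
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    start = f"{year:04d}{month:02d}01000000"
    end = f"{next_year:04d}{next_month:02d}01000000"
    return (
        "SELECT user_id, user_name, user_registration, user_editcount\n"
        "FROM enwiki_p.user\n"
        f"WHERE user_registration >= '{start}'\n"
        f"  AND user_registration < '{end}'\n"
        "ORDER BY RAND()\n"
        f"LIMIT {int(n)};"
    )
```

Calling monthly_sample_query(2019, 2) would give a query for accounts registered in February 2019. Like the original, it samples accounts rather than people, and it leaves out the bot and block filters, which could be added back as needed.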




Finn Årup Nielsen
http://people.compute.dtu.dk/faan/


On 12/03/2019 19:18, Haifeng Zhang wrote:

Hi folks,

My work needs to randomly sample new editors in each month, e.g., 100 editors 
per month.

Do any of you have good suggestions for how to do this efficiently?

I could think of using the dump files, but wonder are there other options?


Thanks,

Haifeng Zhang


Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Leila Zia
Let's do it.


On Tue, Mar 12, 2019 at 3:04 PM Pine W  wrote:
>
> Leila, can we discuss this off list?
>
> Thanks,
>
> Pine
> ( https://meta.wikimedia.org/wiki/User:Pine )
>
>
> On Tue, Mar 12, 2019 at 9:29 PM Leila Zia  wrote:
>
> > On Tue, Mar 12, 2019 at 1:56 PM Pine W  wrote:
> > >
> > > Hi Leila, I believe that I asked for more information regarding Haifeng's
> > > work.
> >
> > You stated
> >
> > "However, if you're planning to send surveys or messages to them,
> > sending them barnstars, or otherwise manipulating their on-wiki
> > experience, that would be problematic."
> >
> > and I'm suggesting that you enter from a question angle, please.
> >
>
> > > There has been discussion on English Wikipedia regarding volunteers
> > > being unhappy with the interventions or proposed interventions of
> > > researchers. I think that asking about the nature of Haifeng's research
> > is
> > > legitimate, and I tried to provide some examples of possible types of
> > > research.
> >
> > Please check your email. There was no question there in the part
> > related to this discussion. Also, even if there was a question posed,
> > I highly recommend you enter from a different angle to these
> > conversations. There are many reasons someone may need the sampled
> > data of newcomers. A few examples: they may want to test the
> > assumption whether the arrivals (registrations) to a specific
> > Wikipedia language follow a Poisson process or not, they may want to
> > learn about the distribution of topics editors in a given language
> > edit in the first 24 hours after they open the account, they may want
> > to build a prediction model to predict whether the editor will make
> > the n-th edit or not given that they have started at time x, they may
> > want to see whether external events have strong correlations with
> > account registration and Wikipedia activity, they may want to see if
> > the change to HTTPS had impact on registrations, etc. There are
> > literally millions of questions people may ask (given that the data is
> > available to them) with respect to Wikipedia. The answer to some of
> > them may require interaction with Wikipedia editors, the answer to
> > some may not. So the safest bet to start having a fruitful
> > conversation is to ask: can you tell us more about what you're trying
> > to do?
> >
> > > I'm trying to protect the community from problematic
> > > interventions, while also welcoming research that is accepted by the
> > > community.
> >
> > I understand and I'm looking forward to having conversations with you
> > all about how to achieve that.
> >
> > Best,
> > Leila
> >


Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Pine W
Leila, can we discuss this off list?

Thanks,

Pine
( https://meta.wikimedia.org/wiki/User:Pine )


On Tue, Mar 12, 2019 at 9:29 PM Leila Zia  wrote:

> On Tue, Mar 12, 2019 at 1:56 PM Pine W  wrote:
> >
> > Hi Leila, I believe that I asked for more information regarding Haifeng's
> > work.
>
> You stated
>
> "However, if you're planning to send surveys or messages to them,
> sending them barnstars, or otherwise manipulating their on-wiki
> experience, that would be problematic."
>
> and I'm suggesting that you enter from a question angle, please.
>

> > There has been discussion on English Wikipedia regarding volunteers
> > being unhappy with the interventions or proposed interventions of
> > researchers. I think that asking about the nature of Haifeng's research
> is
> > legitimate, and I tried to provide some examples of possible types of
> > research.
>
> Please check your email. There was no question there in the part
> related to this discussion. Also, even if there was a question posed,
> I highly recommend you enter from a different angle to these
> conversations. There are many reasons someone may need the sampled
> data of newcomers. A few examples: they may want to test the
> assumption whether the arrivals (registrations) to a specific
> Wikipedia language follow a Poisson process or not, they may want to
> learn about the distribution of topics editors in a given language
> edit in the first 24 hours after they open the account, they may want
> to build a prediction model to predict whether the editor will make
> the n-th edit or not given that they have started at time x, they may
> want to see whether external events have strong correlations with
> account registration and Wikipedia activity, they may want to see if
> the change to HTTPS had impact on registrations, etc. There are
> literally millions of questions people may ask (given that the data is
> available to them) with respect to Wikipedia. The answer to some of
> them may require interaction with Wikipedia editors, the answer to
> some may not. So the safest bet to start having a fruitful
> conversation is to ask: can you tell us more about what you're trying
> to do?
>
> > I'm trying to protect the community from problematic
> > interventions, while also welcoming research that is accepted by the
> > community.
>
> I understand and I'm looking forward to having conversations with you
> all about how to achieve that.
>
> Best,
> Leila
>


Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Stuart A. Yeates
There are thousands and thousands of editors with multiple accounts.
Those who have bothered to add a category are listed at
https://en.wikipedia.org/wiki/Category:Wikipedians_with_alternative_accounts

Many editors who engage in outreach are advised to create new accounts
for themselves regularly, simply because the experience of new account
creation changes over time and helping users streamline that
(especially in situations such as editathons) requires thorough
knowledge of account creation and the things that can make it go
wrong. This is pretty much a prerequisite for the old accountcreator
userright https://en.wikipedia.org/wiki/Wikipedia:Account_creator
(which I've had on several occasions) and the new eventcoordinator
userright https://en.wikipedia.org/wiki/Wikipedia:Event_coordinator
(which is too new for me to have had yet).

cheers
stuart
--
...let us be heard from red core to black sky

On Wed, 13 Mar 2019 at 10:40, Isaac Johnson  wrote:
>
> Yes, thanks for the clarification Stuart. I don't know of any statistics to
> suggest how widespread this is, but it might be worth checking, especially
> if you are focusing on editors with higher edit counts (who I suspect are
> more likely to have multiple accounts for licit reasons).
>
> On Tue, Mar 12, 2019 at 4:34 PM Stuart A. Yeates  wrote:
>
> > Note that this code deals with accounts, not editors, which is what
> > Haifeng asked for.
> >
> > There are many reasons, both licit and illicit for editors to have
> > more than one account. I know I have more than ten for
> > policy-compliant reasons.
> >
> > cheers
> > stuart
> >
> >
> > --
> > ...let us be heard from red core to black sky
> >
> > On Wed, 13 Mar 2019 at 10:21, Isaac Johnson  wrote:
> > >
> > > Hey Haifeng,
> > > If you decide to process the dumps, you should be able to easily
> > repurpose
> > > some quick code that I wrote for a similar project:
> > >
> > https://github.com/geohci/miscellaneous-wikimedia/tree/master/editor-turnover
> > >
> > > Notably, I'd suggest using the stub history dumps as they are much
> > smaller
> > > because they do not include the actual content. For instance, for March
> > 1st
> > > and English Wikipedia (https://dumps.wikimedia.org/enwiki/20190301/),
> > this
> > > file would be enwiki-20190301-stub-meta-history.xml.gz and is 57.9 GB.
> > >
> > > Best,
> > > Isaac
> > >
> > > On Tue, Mar 12, 2019 at 3:56 PM Pine W  wrote:
> > >
> > > > Hi Haifeng, thanks for the information. I think that your idea of
> > looking
> > > > in the dumps makes sense. Am I understanding correctly that you would
> > like
> > > > advice regarding how to do that in the most efficient way?
> > > >
> > > > Hi Leila, I believe that I asked for more information regarding
> > Haifeng's
> > > > work. There has been discussion on English Wikipedia regarding
> > volunteers
> > > > being unhappy with the interventions or proposed interventions of
> > > > researchers. I think that asking about the nature of Haifeng's
> > research is
> > > > legitimate, and I tried to provide some examples of possible types of
> > > > research. I'm trying to protect the community from problematic
> > > > interventions, while also welcoming research that is accepted by the
> > > > community.
> > > >
> > > > Pine
> > > > ( https://meta.wikimedia.org/wiki/User:Pine )
> > > >
> > > >
> > > > On Tue, Mar 12, 2019 at 8:00 PM Haifeng Zhang  > >
> > > > wrote:
> > > >
> > > > > Pine and Stuart,
> > > > >
> > > > > I meant extracting a random sample of new editors (month by month)
> > from
> > > > > Wikipedia edit history.
> > > > >
> > > > > It is not about survey of new editors, but still thanks for your
> > > > > suggestions.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Haifeng Zhang
> > > > >
> > > > > Postdoctoral Research Fellow
> > > > > Human-Computer Interaction Institute
> > > > > Carnegie Mellon University
> > > > > 
> > > > > From: Wiki-research-l 
> > on
> > > > > behalf of Stuart A. Yeates 
> > > > > Sent: Tuesday, March 12, 2019 3:46:19 PM
> > > > > To: Research into Wikimedia content and communities
> &

Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Isaac Johnson
Yes, thanks for the clarification Stuart. I don't know of any statistics to
suggest how widespread this is, but it might be worth checking, especially
if you are focusing on editors with higher edit counts (who I suspect are
more likely to have multiple accounts for licit reasons).

On Tue, Mar 12, 2019 at 4:34 PM Stuart A. Yeates  wrote:

> Note that this code deals with accounts, not editors, which is what
> Haifeng asked for.
>
> There are many reasons, both licit and illicit for editors to have
> more than one account. I know I have more than ten for
> policy-compliant reasons.
>
> cheers
> stuart
>
>
> --
> ...let us be heard from red core to black sky
>
> On Wed, 13 Mar 2019 at 10:21, Isaac Johnson  wrote:
> >
> > Hey Haifeng,
> > If you decide to process the dumps, you should be able to easily
> repurpose
> > some quick code that I wrote for a similar project:
> >
> https://github.com/geohci/miscellaneous-wikimedia/tree/master/editor-turnover
> >
> > Notably, I'd suggest using the stub history dumps as they are much
> smaller
> > because they do not include the actual content. For instance, for March
> 1st
> > and English Wikipedia (https://dumps.wikimedia.org/enwiki/20190301/),
> this
> > file would be enwiki-20190301-stub-meta-history.xml.gz and is 57.9 GB.
> >
> > Best,
> > Isaac
> >
> > On Tue, Mar 12, 2019 at 3:56 PM Pine W  wrote:
> >
> > > Hi Haifeng, thanks for the information. I think that your idea of
> looking
> > > in the dumps makes sense. Am I understanding correctly that you would
> like
> > > advice regarding how to do that in the most efficient way?
> > >
> > > Hi Leila, I believe that I asked for more information regarding
> Haifeng's
> > > work. There has been discussion on English Wikipedia regarding
> volunteers
> > > being unhappy with the interventions or proposed interventions of
> > > researchers. I think that asking about the nature of Haifeng's
> research is
> > > legitimate, and I tried to provide some examples of possible types of
> > > research. I'm trying to protect the community from problematic
> > > interventions, while also welcoming research that is accepted by the
> > > community.
> > >
> > > Pine
> > > ( https://meta.wikimedia.org/wiki/User:Pine )
> > >
> > >
> > > On Tue, Mar 12, 2019 at 8:00 PM Haifeng Zhang  >
> > > wrote:
> > >
> > > > Pine and Stuart,
> > > >
> > > > I meant extracting a random sample of new editors (month by month)
> from
> > > > Wikipedia edit history.
> > > >
> > > > It is not about survey of new editors, but still thanks for your
> > > > suggestions.
> > > >
> > > >
> > > > Thanks,
> > > > Haifeng Zhang
> > > >
> > > > Postdoctoral Research Fellow
> > > > Human-Computer Interaction Institute
> > > > Carnegie Mellon University
> > > > 
> > > > From: Wiki-research-l 
> on
> > > > behalf of Stuart A. Yeates 
> > > > Sent: Tuesday, March 12, 2019 3:46:19 PM
> > > > To: Research into Wikimedia content and communities
> > > > Subject: Re: [Wiki-research-l] Sampling new editors in English
> Wikipedia
> > > >
> > > > There are a number of new-editor-heavy noticeboards. I would suggest
> > > > posting an invite there to your survey (or whatever) If you ask for
> > > > editor's usernames you can filter out those who don't meet your
> > > > definition of 'new'
> > > >
> > > > I'm thinking of places like:
> > > > https://en.wikipedia.org/wiki/Wikipedia:Teahouse and
> > > > https://en.wikipedia.org/wiki/Wikipedia:Help_desk
> > > >
> > > > cheers
> > > > stuart
> > > >
> > > >
> > > > --
> > > > ...let us be heard from red core to black sky
> > > >
> > > > On Wed, 13 Mar 2019 at 08:37, Leila Zia  wrote:
> > > > >
> > > > > Hi Pine,
> > > > >
> > > > > Haifeng has a simple question about how to sample editors other
> than
> > > > > via dumps. It would be great if someone who knows the answer to
> help
> > > > > them to move forward.
> > > > >
> > > > > If you are interested to learn more about their research, instead
> of
> > > > > answering their question, 

Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Stuart A. Yeates
Note that this code deals with accounts, whereas Haifeng asked about
editors.

There are many reasons, both licit and illicit, for editors to have
more than one account. I know I have more than ten for
policy-compliant reasons.

cheers
stuart


--
...let us be heard from red core to black sky

On Wed, 13 Mar 2019 at 10:21, Isaac Johnson  wrote:
>
> Hey Haifeng,
> If you decide to process the dumps, you should be able to easily repurpose
> some quick code that I wrote for a similar project:
> https://github.com/geohci/miscellaneous-wikimedia/tree/master/editor-turnover
>
> Notably, I'd suggest using the stub history dumps as they are much smaller
> because they do not include the actual content. For instance, for March 1st
> and English Wikipedia (https://dumps.wikimedia.org/enwiki/20190301/), this
> file would be enwiki-20190301-stub-meta-history.xml.gz and is 57.9 GB.
>
> Best,
> Isaac
>
> On Tue, Mar 12, 2019 at 3:56 PM Pine W  wrote:
>
> > Hi Haifeng, thanks for the information. I think that your idea of looking
> > in the dumps makes sense. Am I understanding correctly that you would like
> > advice regarding how to do that in the most efficient way?
> >
> > Hi Leila, I believe that I asked for more information regarding Haifeng's
> > work. There has been discussion on English Wikipedia regarding volunteers
> > being unhappy with the interventions or proposed interventions of
> > researchers. I think that asking about the nature of Haifeng's research is
> > legitimate, and I tried to provide some examples of possible types of
> > research. I'm trying to protect the community from problematic
> > interventions, while also welcoming research that is accepted by the
> > community.
> >
> > Pine
> > ( https://meta.wikimedia.org/wiki/User:Pine )
> >
> >
> > On Tue, Mar 12, 2019 at 8:00 PM Haifeng Zhang 
> > wrote:
> >
> > > Pine and Stuart,
> > >
> > > I meant extracting a random sample of new editors (month by month) from
> > > Wikipedia edit history.
> > >
> > > It is not about survey of new editors, but still thanks for your
> > > suggestions.
> > >
> > >
> > > Thanks,
> > > Haifeng Zhang
> > >
> > > Postdoctoral Research Fellow
> > > Human-Computer Interaction Institute
> > > Carnegie Mellon University
> > > 
> > > From: Wiki-research-l  on
> > > behalf of Stuart A. Yeates 
> > > Sent: Tuesday, March 12, 2019 3:46:19 PM
> > > To: Research into Wikimedia content and communities
> > > Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
> > >
> > > There are a number of new-editor-heavy noticeboards. I would suggest
> > > posting an invite there to your survey (or whatever) If you ask for
> > > editor's usernames you can filter out those who don't meet your
> > > definition of 'new'
> > >
> > > I'm thinking of places like:
> > > https://en.wikipedia.org/wiki/Wikipedia:Teahouse and
> > > https://en.wikipedia.org/wiki/Wikipedia:Help_desk
> > >
> > > cheers
> > > stuart
> > >
> > >
> > > --
> > > ...let us be heard from red core to black sky
> > >
> > > On Wed, 13 Mar 2019 at 08:37, Leila Zia  wrote:
> > > >
> > > > Hi Pine,
> > > >
> > > > Haifeng has a simple question about how to sample editors other than
> > > > via dumps. It would be great if someone who knows the answer to help
> > > > them to move forward.
> > > >
> > > > If you are interested to learn more about their research, instead of
> > > > answering their question, my recommendation would be to start the
> > > > conversation with: "can you tell us more about your research?" kind of
> > > > question. I find the current way of communication very speculative,
> > > > and that is not good for making a vibrant research community that can
> > > > help us address some of our big questions.
> > > >
> > > > Best,
> > > > Leila
> > > >
> > > > On Tue, Mar 12, 2019 at 12:08 PM Pine W  wrote:
> > > > >
> > > > > Hi, can you expand on what you mean by "sample"? If you're referring
> > to
> > > > > analyzing users' edit histories then that should be fine. However, if
> > > > > you're planning to send surveys or messages to them, sending them
> > > > > barnstars, or otherwis

Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Leila Zia
On Tue, Mar 12, 2019 at 1:56 PM Pine W  wrote:
>
> Hi Leila, I believe that I asked for more information regarding Haifeng's
> work.

You stated

"However, if you're planning to send surveys or messages to them,
sending them barnstars, or otherwise manipulating their on-wiki
experience, that would be problematic."

and I'm suggesting that you enter from a question angle, please.

> There has been discussion on English Wikipedia regarding volunteers
> being unhappy with the interventions or proposed interventions of
> researchers. I think that asking about the nature of Haifeng's research is
> legitimate, and I tried to provide some examples of possible types of
> research.

Please check your email. There was no question there in the part
related to this discussion. Also, even if there was a question posed,
I highly recommend you enter from a different angle to these
conversations. There are many reasons someone may need the sampled
data of newcomers. A few examples: they may want to test the
assumption whether the arrivals (registrations) to a specific
Wikipedia language follow a Poisson process or not, they may want to
learn about the distribution of topics editors in a given language
edit in the first 24 hours after they open the account, they may want
to build a prediction model to predict whether the editor will make
the n-th edit or not given that they have started at time x, they may
want to see whether external events have strong correlations with
account registration and Wikipedia activity, they may want to see if
the change to HTTPS had an impact on registrations, etc. There are
literally millions of questions people may ask (given that the data is
available to them) with respect to Wikipedia. The answer to some of
them may require interaction with Wikipedia editors, the answer to
some may not. So the safest bet to start having a fruitful
conversation is to ask: can you tell us more about what you're trying
to do?

> I'm trying to protect the community from problematic
> interventions, while also welcoming research that is accepted by the
> community.

I understand and I'm looking forward to having conversations with you
all about how to achieve that.

Best,
Leila



Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Isaac Johnson
Hey Haifeng,
If you decide to process the dumps, you should be able to easily repurpose
some quick code that I wrote for a similar project:
https://github.com/geohci/miscellaneous-wikimedia/tree/master/editor-turnover

Notably, I'd suggest using the stub history dumps as they are much smaller
because they do not include the actual content. For instance, for March 1st
and English Wikipedia (https://dumps.wikimedia.org/enwiki/20190301/), this
file would be enwiki-20190301-stub-meta-history.xml.gz and is 57.9 GB.
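
As a rough illustration of processing the stub history dump mentioned above (a sketch only, not the code in the linked repository): the function below streams the XML and records the earliest revision timestamp seen for each registered username. The function name first_edit_by_user is hypothetical, and the element names (page, revision, timestamp, contributor, username) are assumed from the MediaWiki XML export format.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def first_edit_by_user(stream):
    """Map each registered username to the timestamp of its earliest revision.

    `stream` is a binary file-like object containing export XML.
    Anonymous (IP) edits and revisions without a username are skipped.
    """
    first = {}
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end":
            continue
        tag = _local(elem.tag)
        if tag == "revision":
            timestamp = username = None
            for child in elem:
                child_tag = _local(child.tag)
                if child_tag == "timestamp":
                    timestamp = child.text
                elif child_tag == "contributor":
                    for sub in child:
                        if _local(sub.tag) == "username":
                            username = sub.text
            # ISO 8601 timestamps in one format sort correctly as strings.
            if timestamp and username:
                if username not in first or timestamp < first[username]:
                    first[username] = timestamp
            elem.clear()  # drop this revision's children to save memory
        elif tag == "page":
            root.clear()  # drop finished pages so the tree does not grow
    return first
```

Under these assumptions it could be fed the file named above with gzip.open("enwiki-20190301-stub-meta-history.xml.gz", "rb"). The resulting dictionary holds one entry per username, so memory use on the full English Wikipedia dump may still be substantial.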

Best,
Isaac

On Tue, Mar 12, 2019 at 3:56 PM Pine W  wrote:

> Hi Haifeng, thanks for the information. I think that your idea of looking
> in the dumps makes sense. Am I understanding correctly that you would like
> advice regarding how to do that in the most efficient way?
>
> Hi Leila, I believe that I asked for more information regarding Haifeng's
> work. There has been discussion on English Wikipedia regarding volunteers
> being unhappy with the interventions or proposed interventions of
> researchers. I think that asking about the nature of Haifeng's research is
> legitimate, and I tried to provide some examples of possible types of
> research. I'm trying to protect the community from problematic
> interventions, while also welcoming research that is accepted by the
> community.
>
> Pine
> ( https://meta.wikimedia.org/wiki/User:Pine )
>
>
> On Tue, Mar 12, 2019 at 8:00 PM Haifeng Zhang 
> wrote:
>
> > Pine and Stuart,
> >
> > I meant extracting a random sample of new editors (month by month) from
> > Wikipedia edit history.
> >
> > It is not about survey of new editors, but still thanks for your
> > suggestions.
> >
> >
> > Thanks,
> > Haifeng Zhang
> >
> > Postdoctoral Research Fellow
> > Human-Computer Interaction Institute
> > Carnegie Mellon University
> > 
> > From: Wiki-research-l  on
> > behalf of Stuart A. Yeates 
> > Sent: Tuesday, March 12, 2019 3:46:19 PM
> > To: Research into Wikimedia content and communities
> > Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
> >
> > There are a number of new-editor-heavy noticeboards. I would suggest
> > posting an invite there to your survey (or whatever) If you ask for
> > editor's usernames you can filter out those who don't meet your
> > definition of 'new'
> >
> > I'm thinking of places like:
> > https://en.wikipedia.org/wiki/Wikipedia:Teahouse and
> > https://en.wikipedia.org/wiki/Wikipedia:Help_desk
> >
> > cheers
> > stuart
> >
> >
> > --
> > ...let us be heard from red core to black sky
> >
> > On Wed, 13 Mar 2019 at 08:37, Leila Zia  wrote:
> > >
> > > Hi Pine,
> > >
> > > Haifeng has a simple question about how to sample editors other than
> > > via dumps. It would be great if someone who knows the answer to help
> > > them to move forward.
> > >
> > > If you are interested to learn more about their research, instead of
> > > answering their question, my recommendation would be to start the
> > > conversation with: "can you tell us more about your research?" kind of
> > > question. I find the current way of communication very speculative,
> > > and that is not good for making a vibrant research community that can
> > > help us address some of our big questions.
> > >
> > > Best,
> > > Leila
> > >
> > > On Tue, Mar 12, 2019 at 12:08 PM Pine W  wrote:
> > > >
> > > > Hi, can you expand on what you mean by "sample"? If you're referring
> to
> > > > analyzing users' edit histories then that should be fine. However, if
> > > > you're planning to send surveys or messages to them, sending them
> > > > barnstars, or otherwise manipulating their on-wiki experience, that
> > would
> > > > be problematic.
> > > >
> > > > Pine
> > > > ( https://meta.wikimedia.org/wiki/User:Pine )
> > > >
> > > >
> > > > On Tue, Mar 12, 2019 at 6:19 PM Haifeng Zhang <
> haife...@andrew.cmu.edu
> > >
> > > > wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > My work needs to randomly sample new editors in each month, e.g.,
> 100
> > > > > editors per month.
> > > > >
> > > > > Do any of you have good suggestions for how to do this efficiently?
> > > > >
> > > > > I could think of using the dump files, but wonder are there other
&

Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Pine W
Hi Haifeng, thanks for the information. I think that your idea of looking
in the dumps makes sense. Am I understanding correctly that you would like
advice regarding how to do that in the most efficient way?

Hi Leila, I believe that I asked for more information regarding Haifeng's
work. There has been discussion on English Wikipedia regarding volunteers
being unhappy with the interventions or proposed interventions of
researchers. I think that asking about the nature of Haifeng's research is
legitimate, and I tried to provide some examples of possible types of
research. I'm trying to protect the community from problematic
interventions, while also welcoming research that is accepted by the
community.

Pine
( https://meta.wikimedia.org/wiki/User:Pine )


On Tue, Mar 12, 2019 at 8:00 PM Haifeng Zhang 
wrote:

> Pine and Stuart,
>
> I meant extracting a random sample of new editors (month by month) from
> Wikipedia edit history.
>
> It is not about survey of new editors, but still thanks for your
> suggestions.
>
>
> Thanks,
> Haifeng Zhang
>
> Postdoctoral Research Fellow
> Human-Computer Interaction Institute
> Carnegie Mellon University
> 
> From: Wiki-research-l  on
> behalf of Stuart A. Yeates 
> Sent: Tuesday, March 12, 2019 3:46:19 PM
> To: Research into Wikimedia content and communities
> Subject: Re: [Wiki-research-l] Sampling new editors in English Wikipedia
>
> There are a number of new-editor-heavy noticeboards. I would suggest
> posting an invite there to your survey (or whatever) If you ask for
> editor's usernames you can filter out those who don't meet your
> definition of 'new'
>
> I'm thinking of places like:
> https://en.wikipedia.org/wiki/Wikipedia:Teahouse and
> https://en.wikipedia.org/wiki/Wikipedia:Help_desk
>
> cheers
> stuart
>
>
> --
> ...let us be heard from red core to black sky
>
> On Wed, 13 Mar 2019 at 08:37, Leila Zia  wrote:
> >
> > Hi Pine,
> >
> > Haifeng has a simple question about how to sample editors other than
> > via dumps. It would be great if someone who knows the answer to help
> > them to move forward.
> >
> > If you are interested to learn more about their research, instead of
> > answering their question, my recommendation would be to start the
> > conversation with: "can you tell us more about your research?" kind of
> > question. I find the current way of communication very speculative,
> > and that is not good for making a vibrant research community that can
> > help us address some of our big questions.
> >
> > Best,
> > Leila
> >
> > On Tue, Mar 12, 2019 at 12:08 PM Pine W  wrote:
> > >
> > > Hi, can you expand on what you mean by "sample"? If you're referring to
> > > analyzing users' edit histories then that should be fine. However, if
> > > you're planning to send surveys or messages to them, sending them
> > > barnstars, or otherwise manipulating their on-wiki experience, that
> would
> > > be problematic.
> > >
> > > Pine
> > > ( https://meta.wikimedia.org/wiki/User:Pine )
> > >
> > >
> > > On Tue, Mar 12, 2019 at 6:19 PM Haifeng Zhang  >
> > > wrote:
> > >
> > > > Hi folks,
> > > >
> > > > My work needs to randomly sample new editors in each month, e.g., 100
> > > > editors per month.
> > > >
> > > > Do any of you have good suggestions for how to do this efficiently?
> > > >
> > > > I could think of using the dump files, but wonder are there other
> options?
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Haifeng Zhang


Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Haifeng Zhang
Pine and Stuart,

I meant extracting a random sample of new editors (month by month) from 
Wikipedia edit history.

It is not about surveying new editors, but thanks all the same for your suggestions.
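
A minimal sketch of that month-by-month extraction, assuming the edit history has already been reduced to (username, first-edit timestamp) pairs, for example from the dump files. The function name, the record format, and the default of 100 per month are illustrative assumptions, not an existing tool. It keeps one fixed-size reservoir per month, so the records can be streamed once without holding them all in memory:

```python
import random
from collections import defaultdict
from datetime import datetime


def sample_new_editors(records, per_month=100, seed=0):
    """Return {"YYYY-MM": [usernames]} with at most per_month names per month.

    records: iterable of (username, first_seen) pairs, where first_seen is an
    ISO 8601 string such as "2019-03-12T18:19:00". This input format is an
    assumption; it has to be derived from the dumps or another source first.
    """
    rng = random.Random(seed)        # seeded so the sample is reproducible
    reservoirs = defaultdict(list)   # month -> current sample
    seen = defaultdict(int)          # month -> number of editors seen so far

    for username, first_seen in records:
        month = datetime.fromisoformat(first_seen).strftime("%Y-%m")
        seen[month] += 1
        bucket = reservoirs[month]
        if len(bucket) < per_month:
            # Fill the reservoir until it holds per_month names.
            bucket.append(username)
        else:
            # Reservoir sampling (Algorithm R): the newcomer replaces a
            # random slot with probability per_month / seen[month].
            j = rng.randrange(seen[month])
            if j < per_month:
                bucket[j] = username
    return dict(reservoirs)
```

Each username is assumed to appear only once in the input, under the month of its first edit. Months with fewer than per_month new editors are returned in full.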


Thanks,
Haifeng Zhang

Postdoctoral Research Fellow
Human-Computer Interaction Institute
Carnegie Mellon University




Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Stuart A. Yeates
There are a number of new-editor-heavy noticeboards. I would suggest
posting an invite to your survey (or whatever) there. If you ask for
editors' usernames, you can filter out those who don't meet your
definition of 'new'.

I'm thinking of places like:
https://en.wikipedia.org/wiki/Wikipedia:Teahouse and
https://en.wikipedia.org/wiki/Wikipedia:Help_desk

cheers
stuart


--
...let us be heard from red core to black sky




Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Leila Zia
Hi Pine,

Haifeng has a simple question about how to sample editors other than
via dumps. It would be great if someone who knows the answer could help
them move forward.

If you are interested in learning more about their research, rather than
answering their question, my recommendation would be to start the
conversation with a "can you tell us more about your research?" kind of
question. I find the current way of communicating very speculative,
and that is not good for building a vibrant research community that can
help us address some of our big questions.

Best,
Leila




Re: [Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Pine W
Hi, can you expand on what you mean by "sample"? If you're referring to
analyzing users' edit histories, then that should be fine. However, if
you're planning to send them surveys or messages, send them barnstars,
or otherwise manipulate their on-wiki experience, that would be
problematic.

Pine
( https://meta.wikimedia.org/wiki/User:Pine )




[Wiki-research-l] Sampling new editors in English Wikipedia

2019-03-12 Thread Haifeng Zhang
Hi folks,

My work needs to randomly sample new editors in each month, e.g., 100 editors 
per month.

Do any of you have good suggestions for how to do this efficiently?

I could use the dump files, but I wonder whether there are other options.


Thanks,

Haifeng Zhang