[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2023-07-07 Thread Manuel
Manuel closed this task as "Resolved".

TASK DETAIL
  https://phabricator.wikimedia.org/T282563

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic, Manuel
Cc: Esh77, Pablo, Mohammed_Sadat_WMDE, Tobi_WMDE_SW, MGerlach, awight, 
WMDE-leszek, Manuel, Lydia_Pintscher, Aklapper, Jan_Dittrich, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-11-01 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich Do we need this ticket anymore?



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-31 Thread Maintenance_bot
Maintenance_bot removed a project: Patch-For-Review.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-31 Thread gerritbot
gerritbot added a comment.


  Change 735736 **merged** by GoranSMilovanovic:
  
  [analytics/wmde/WD/WikidataAdHocAnalytics@master] T282563
  
  https://gerrit.wikimedia.org/r/735736



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-31 Thread gerritbot
gerritbot added a project: Patch-For-Review.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-31 Thread gerritbot
gerritbot added a comment.


  Change 735736 had a related patch set uploaded (by GoranSMilovanovic; author: 
GoranSMilovanovic):
  
  [analytics/wmde/WD/WikidataAdHocAnalytics@master] T282563
  
  https://gerrit.wikimedia.org/r/735736



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-27 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @MGerlach @Jan_Dittrich
  
  I have used XGBoost to train a `leave` vs `stay` binary classifier on our 
data.
  
  I did not go into elaborate cross-validation: I used a single train and a 
single test dataset, downsampled by a huge factor (because of a huge class 
imbalance), used `scale_pos_weight` to upweight them, and cross-validated 
across `eta`, `max_depth` and `subsample` only. Only shallow trees (`max_depth` 
of 5 or 10) were used - and lots of them (`n_rounds` set to 10,000).
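  
  As an illustration only (not the code used for this analysis), the 
downsample-then-upweight step could be sketched in plain Python as follows. The 
function name, the toy labels and the keep fraction are hypothetical, and the 
sketch assumes that `leave` is both the majority class and the positive class; 
the returned factor is the kind of value one might then pass as 
`scale_pos_weight` to compensate for the downsampling.

```python
import random


def downsample_majority(labels, majority=1, keep_fraction=0.1, seed=42):
    """Keep all minority rows and a random share of the majority rows.

    Returns the sorted row indices to keep and the factor by which the
    majority class was shrunk (original count / kept count).
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    n_keep = max(1, round(len(majority_idx) * keep_fraction))
    kept_majority = rng.sample(majority_idx, n_keep)
    shrink_factor = len(majority_idx) / n_keep
    return sorted(minority_idx + kept_majority), shrink_factor


# Toy labels (hypothetical): 1 = "leave" (majority), 0 = "stay" (minority).
labels = [1] * 920 + [0] * 80
kept, upweight = downsample_majority(labels, majority=1, keep_fraction=0.1)
print(len(kept), upweight)
```

  Upweighting by the shrink factor restores the original class proportions in 
the loss; whether that is the right target depends on the evaluation goal.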
  
  **Results**
  
  - The best model I've found had an `AUC = 0.9689391` - a bit better than what 
is reported in the paper that used the DeepFM architecture and which @MGerlach 
shared in T282563#7419722; however, the models are not really comparable since 
the authors of the DeepFM paper used a different criterion to define "leave" 
than we did (5 months of inactivity);
  - Since we get `P(Leave)` from XGBoost, I have performed a full ROC analysis. 
When the decision criterion (or boundary, if you prefer) is set to be very high 
(`0.999001`), we obtain the following characteristics:
    - TPR = 0.911067
    - FNR = 0.088933
    - FPR = 0.1030485
    - TNR = 0.8969515
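  
  For readers less familiar with these rates, here is a minimal sketch of how 
they follow from thresholding `P(Leave)`. The function name and the toy 
probabilities are hypothetical; only the threshold value is taken from the 
text, and `leave` is assumed to be the positive class.

```python
def rates_at_threshold(p_leave, y_true, threshold):
    """Confusion-matrix rates when predicting "leave" for p >= threshold.

    y_true uses 1 for "leave" (positive class) and 0 for "stay".
    """
    tp = fp = tn = fn = 0
    for p, y in zip(p_leave, y_true):
        predicted_leave = p >= threshold
        if predicted_leave and y == 1:
            tp += 1
        elif predicted_leave and y == 0:
            fp += 1
        elif not predicted_leave and y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "TPR": tp / (tp + fn),
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),
    }


# Toy scores and labels (hypothetical), evaluated at the threshold from the text.
p_leave = [0.9995, 0.9999, 0.5, 0.9992, 0.2, 0.9993]
y_true = [1, 1, 1, 0, 0, 1]
rates = rates_at_threshold(p_leave, y_true, threshold=0.999001)
print(rates)
```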
  
  That seems quite satisfactory, especially if we consider a simple Bayesian 
analysis:
  
  - starting from an a priori `P(Leave) = .92` - and we know that from our 
data,
  - the a posteriori `P(Leave|Model says "Leave") = .99`.
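  
  The posterior above can be reproduced with Bayes' rule from the prior and the 
TPR and FPR reported in this comment. This is a minimal sketch; the function 
name is hypothetical and the three input numbers are the ones quoted in the 
text.

```python
def posterior_leave(prior, tpr, fpr):
    """P(Leave | model says "Leave") via Bayes' rule.

    prior: P(Leave); tpr: P(says Leave | Leave); fpr: P(says Leave | Stay).
    """
    evidence = prior * tpr + (1.0 - prior) * fpr
    return prior * tpr / evidence


post = posterior_leave(prior=0.92, tpr=0.911067, fpr=0.1030485)
print(round(post, 2))  # 0.99
```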
  
  I hope this addresses all remarks made in T282563#7252149 and 
T282563#7254863.
  
  I guess an even better model could be built with XGBoost - after all, I have 
searched in a rather constrained parameter space only - but I do not think that 
I have enough time to run tons of ML cycles until Sunday, October 31 when we 
need to present at WikidataCon 2021. These results will enter our WikidataCon 
2021 slides.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-26 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @MGerlach @Jan_Dittrich
  
  It is a power-law (and thus Lindy) after all:
  
"H0: data IS generated from a power law distribution; 
 H1: data IS NOT generated from a power law distribution."
  
  (from: Fitting Heavy Tailed Distributions: The poweRlaw Package, Colin S. 
Gillespie, Newcastle University, Journal of Statistical Software, February 
2015, Volume 64, Issue 2).
  
  ***I was reading the bootstrap p-values incorrectly*** - our findings say 
that we **cannot** reject H0, and under this hypothesis testing framework H0 
means: it's a power-law.
  
  See our slides for test results.
  
  @MGerlach I will test this against Poisson of course. I did not forget about 
your earlier remark in T282563#7254843.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-19 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @MGerlach
  
  The authors of the paper that you have cited in T282563#7419722 use a 
similar - if not the same - approach to feature engineering for the prediction 
task as I have used in T282563#7251679 w. the RF classifier, p. 4 in the PDF:

  
3. Diversity of edit actions (Divedit·act). To capture the diversity of 
different types of edit actions (see Section 4), we use the Shannon-Entropy [25] 
of different edit actions in the same manner as in [24] as: 
H(T) = -\sum_{i=1}^{n} P(t_i) \cdot \log P(t_i), where T indicates different 
types of edit actions, and |T| = n.

4. Diversity of entities (Divent). We measure the diversity of edited 
entities
of a user using the Shannon-Entropy. The intuition is that the diversity of
edited entities of a user could also be different across active and inactive
editors.
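
  A minimal sketch of the Shannon-entropy diversity feature quoted above, 
computed from one editor's list of edit action types. The function name, the 
toy action labels and the choice of log base 2 are assumptions; the quoted 
passage does not state the base.

```python
import math
from collections import Counter


def shannon_entropy(events, base=2.0):
    """H(T) = -sum_i P(t_i) * log P(t_i) over the observed event types."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log(c / total, base) for c in counts.values()
    )


# Toy edit histories (hypothetical action names).
focused_editor = ["add_label"] * 8
diverse_editor = ["add_label", "add_claim", "add_label", "add_claim"]
h_focused = shannon_entropy(focused_editor)
h_diverse = shannon_entropy(diverse_editor)
print(h_focused, h_diverse)
```

  The same function applies to the diversity of edited entities by passing 
entity identifiers instead of action types.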



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-18 Thread Jan_Dittrich
Jan_Dittrich added a comment.


  sure sure, go ahead :)



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-18 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich Sounds great, but I think @MGerlach and I would like to add some 
modeling efforts to see if we can predict whether a user stays or not.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-18 Thread Jan_Dittrich
Jan_Dittrich added a comment.


  What about:
  
  **Proposal title:** Wikidata user retention over time?
  **Session type:** Lightning or short, I guess?
  **Abstract:** People leave online communities after some time. However, the 
likelihood that a particular user leaves the project depends on the time 
they have already spent on the project: People who have only spent a brief time 
in the project are more likely to leave than people who are long-term members. 
This is similar to the so-called "Lindy Effect". We modeled the curve for the 
likelihood of leaving the project depending on the time of past participation 
and will present methods, outcomes and practical relevance. 
  **Notes:** We would show some slides
  **Session Image:** Maybe the diagram from the original issue?



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-18 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @MGerlach @Jan_Dittrich
  
  We also need to decide on the following in order to submit our session 
proposal to WikidataCon 2021:
  
  - Proposal title
  - Session type (please take a look at the submission form for options)
  - Abstract
  - Notes
  - Session Image
  
  I will add a submission once we meet and figure out the format and everything 
else + add you as co-speakers.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-18 Thread GoranSMilovanovic
GoranSMilovanovic added a subscriber: Esh77.
GoranSMilovanovic added a comment.


  @Esh77 @MGerlach @Jan_Dittrich
  
  Martin and Jan: thank you for your readiness to present our findings on 
Wikidata User Retention in the WikidataCon 2021 Education & science track (see 
WikidataCon 2021 program, Sunday, October 31st from 11:00 to 18:00 UTC)!
  
  @awight Adam, you have contributed too, and there is still time to join us to 
prepare the session for the WikidataCon!
  
  @MGerlach @Jan_Dittrich
  
  If you agree:
  
  - I would prepare a synthesis of the work done here so far, and then
  - share an (R Markdown, rendered to html) Notebook with you;
  - we could use the Notebook as a starting point to develop the session.
  
  @MGerlach
  
  - Thank you for sharing the paper in T282563#7419722;
  - I will take a look, but I doubt that I will have enough time until 
WikidataCon 2021 to experiment with any other model except for Random Forest 
(almost implemented) and XGBoost (in preparation).
  
  I also suggest that we schedule a concise meeting on this.
  
  Thank you - I am looking forward to seeing you soon!



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-10-12 Thread MGerlach
MGerlach added a comment.


  @GoranSMilovanovic regarding the prediction model, a recent paper from this 
year's ISWC conference might be very interesting (e.g. to see which features 
they use and to compare prediction performance):
  
  **Learning to Predict the Departure Dynamics of Wikidata Editors**
  
  > Wikidata as one of the largest open collaborative knowledge bases has drawn 
much attention from researchers and practitioners since its launch in 2012. As 
it is collaboratively developed and maintained by a community of a great number 
of volunteer editors, understanding and predicting the departure dynamics of 
those editors are crucial but have not been studied extensively in previous 
works. In this paper, we investigate the synergistic effect of two different 
types of features: statistical and pattern-based ones with DeepFM as our 
classification model which has not been explored in a similar context and 
problem for predicting whether a Wikidata editor will stay or leave the 
platform. Our experimental results show that using the two sets of features 
with DeepFM provides the best performance regarding AUROC (0.9561) and F1 score 
(0.8843), and achieves substantial improvement compared to using either of the 
sets of features and over a wide range of baselines.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-05 Thread Manuel
Manuel added a project: Wikidata Analytics.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-03 Thread Maintenance_bot
Maintenance_bot removed a project: Patch-For-Review.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-03 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  Here's the ETL code.
  I will add modeling and power law estimation as soon as I complete all 
additional steps as suggested.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-03 Thread gerritbot
gerritbot added a comment.


  Change 709690 **merged** by GoranSMilovanovic:
  
  [analytics/wmde/WD/WikidataAdHocAnalytics@master] T282563
  
  https://gerrit.wikimedia.org/r/709690



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-03 Thread gerritbot
gerritbot added a project: Patch-For-Review.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-03 Thread gerritbot
gerritbot added a comment.


  Change 709690 had a related patch set uploaded (by GoranSMilovanovic; author: 
GoranSMilovanovic):
  
  [analytics/wmde/WD/WikidataAdHocAnalytics@master] T282563
  
  https://gerrit.wikimedia.org/r/709690



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-03 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @MGerlach
  
  First of all, thank you very much for the insights that you have provided.
  
  **On Power Laws and Lindy:**
  
  > One possible path out of this is to slightly change the question. Instead 
of asking whether the data is perfectly described by a powerlaw (in most cases 
it is not), it might be more interesting to know whether a powerlaw describes 
the data better than another distribution.
  
  I agree completely, and that is what I am about to do here next.
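  
  One way such a comparison could look is a log-likelihood ratio between a 
power law and an alternative fitted to the same tail. The sketch below is an 
assumption-laden illustration, not the analysis done here: it treats account 
age as continuous, uses a shifted exponential as the alternative, and runs on 
synthetic Pareto data; all function names are hypothetical. A real comparison 
would also need to test whether the ratio differs significantly from zero.

```python
import math
import random


def loglik_power_law(tail, xmin):
    """MLE exponent and log-likelihood of a continuous power law on x >= xmin."""
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum
    return alpha, n * math.log((alpha - 1.0) / xmin) - alpha * log_sum


def loglik_exponential(tail, xmin):
    """MLE rate and log-likelihood of an exponential shifted to start at xmin."""
    n = len(tail)
    lam = n / sum(x - xmin for x in tail)
    # At the MLE, lam * sum(x - xmin) equals n.
    return lam, n * math.log(lam) - n


# Synthetic tail drawn from a Pareto law by inverse-transform sampling.
rng = random.Random(0)
xmin, true_alpha = 69.0, 2.5
sample = [
    xmin * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
    for _ in range(5000)
]

alpha_hat, ll_power = loglik_power_law(sample, xmin)
lam_hat, ll_exp = loglik_exponential(sample, xmin)
log_ratio = ll_power - ll_exp  # positive values favour the power law
print(alpha_hat, log_ratio)
```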
  
  > How can the x_min be so large (estimated or not)? My understanding of the 
parameter x_min is that we fit a powerlaw distribution to all x>x_min. Thus we 
only fit the powerlaw for account ages with more than 69 or 153 months, 
respectively. From the plots you showed above, this applies only to a small 
fraction of accounts.
  
  From Clauset, Shalizi & Newman (2009), Power-Law Distributions in Empirical 
Data:
  
  > In practice, few empirical phenomena obey power laws for all values of x. 
More often the power law applies only for values greater than some minimum 
xmin. In such cases we say that the tail of the distribution follows a power 
law.
  
  and the {poweRlaw} package - which implements the estimation approach of 
Clauset, Shalizi & Newman - estimates xmin to be as large as 153. Let me remind 
you that I have also tried with xmin set to the minimum of the empirical 
observations (that would be 69 in our dataset) - essentially what you have also 
suggested (see T282563#7250712).
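  
  To make the xmin estimate less of a black box: the approach of Clauset, 
Shalizi & Newman picks the xmin whose fitted power law is closest to the 
empirical tail in Kolmogorov-Smirnov distance. The sketch below illustrates 
that idea in plain Python under simplifying assumptions (continuous data, a 
hypothetical minimum tail size of 50, synthetic input); it is not the 
{poweRlaw} implementation and all function names are made up.

```python
import math
import random


def fit_alpha(tail, xmin):
    """Continuous power-law MLE for the exponent, given xmin."""
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    ordered = sorted(tail)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        worst = max(worst, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return worst


def estimate_xmin(data, min_tail=50):
    """Scan candidate xmin values; return (ks, xmin, alpha) with the smallest KS."""
    best = None
    for xmin in sorted(set(data)):
        tail = [x for x in data if x >= xmin]
        if len(tail) < min_tail:
            break
        if max(tail) <= xmin:
            continue  # degenerate tail: the exponent cannot be estimated
        alpha = fit_alpha(tail, xmin)
        ks = ks_distance(tail, xmin, alpha)
        if best is None or ks < best[0]:
            best = (ks, xmin, alpha)
    return best


# Synthetic data: a uniform "body" below 20 plus a Pareto tail starting at 20.
rng = random.Random(1)
body = [rng.uniform(1.0, 20.0) for _ in range(400)]
tail = [20.0 * (1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(600)]
data = body + tail
ks_hat, xmin_hat, alpha_hat = estimate_xmin(data)
print(xmin_hat, alpha_hat, ks_hat)
```

  Because the scan is free to discard the body of the distribution, a large 
xmin simply means that only the far tail resembles a power law.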
  
  > This means that the powerlaw-distribution is rejected for the data. 
However, this is not surprising - real data is messy and this type of 
hypothesis test rejects even if we have really strong reasons to believe it 
should follow the powerlaw distribution, e.g. due to small correlations etc 
(you can read in more detail about this argument in a paper we wrote some time 
ago).
  
  The paper you mention, Gerlach & Altmann (2019), Testing statistical laws in 
complex systems, is **overkill** for me. If you promise to find some time to 
meet and provide a translation into plain English, I promise to be all ears.
  
  **Now, as of the Random Forest classifier:**
  
  > The high accuracy is not to be taken at face value as the positive/negative 
groups are probably highly imbalanced (not sure if this is true but it looks 
like most accounts stop editing very quickly).
  
  Yes, the high accuracy at face value does not tell us a thing, **but** we have 
a Hit rate (the model predicts "stay" and the editor "stays") at 90% and a 
False Alarm rate (the model says "stay" but the editor "leaves") at "only" 
2.8%. Some would say "not great, not terrible", but given that this is our 
first attempt at the problem at hand I would really say that is not bad at all.
  
  > using a balanced test-set such that you have the same number of positive 
and negative examples (for example via downsampling the majority class or vice 
versa)
  
  Instead of using upsampling or downsampling, I have controlled for the priors 
in classification to account for the (huge) imbalance in the distribution of 
the outcome (see the `classwt` argument of randomForest() in {randomForest}).
  
  > compare with a baseline predictor that does not use any of the features. 
This could be either a random guess (for example based on the Lindy-curve) or 
simply always guessing the majority-class
  
  Definitely. Will do.
  
  Thanks again @MGerlach



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-03 Thread MGerlach
MGerlach added a comment.


  In T282563#7252149, @awight wrote:
  
  > In T282563#7251679, @GoranSMilovanovic wrote:
  >
  >> Anyways, following a series of cross-validations and tricks to account for 
a highly imbalanced dataset, one Random Forest classifier was able to predict 
leave vs stay in Wikidata with:
  >>
  >> - **Accuracy of 97%**,
  >> - **Hit rate (True Positive Rate, TPR) of 90%**,
  >> - and a **False Alarm (False Positive Rate, FPR) of only 2.8%**.
  >
  > What about the true/false negative rate?  To my untrained eye, these 
numbers look typical for an imbalanced training/test set, where we have a lot 
of people abandoning so it's really easy for a classifier to accurately predict 
that a user will leave, but probably much less accurate at predicting that a 
person will stay.
  
  I agree with @awight. The high accuracy is not to be taken at face value as 
the positive/negative groups are probably highly imbalanced (not sure if this 
is true but it looks like most accounts stop editing very quickly). Two options 
to make the numbers more interpretable:
  
  - compare with a baseline predictor that does not use any of the features. 
This could be either a random guess (for example based on the Lindy-curve) or 
simply always guessing the majority-class
  - using a balanced test-set such that you have the same number of positive 
and negative examples (for example via downsampling the majority class or vice 
versa)
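  
  The two options above can be sketched as follows. This is a minimal Python illustration on made-up labels, not the evaluation that was run for this task: the 94/6 class split, the label names `leave` and `stay`, and all helper names are assumptions chosen for the example. Balanced accuracy (the mean of the true positive and true negative rates) stands in here for the balanced test set, since it weights both classes equally without discarding data.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive="stay"):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred, positive="stay"):
    """Mean of TPR and TNR; assumes both classes occur in y_true."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2

def majority_class(labels):
    """Most frequent label, used as a feature-free baseline prediction."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical, highly imbalanced test set (assumed 94% leave, 6% stay).
y_true = ["leave"] * 94 + ["stay"] * 6
baseline_pred = [majority_class(y_true)] * len(y_true)

baseline_accuracy = accuracy(y_true, baseline_pred)
baseline_balanced = balanced_accuracy(y_true, baseline_pred)
print(baseline_accuracy, baseline_balanced)
```

  On such data the always-"leave" baseline already reaches a high plain accuracy while its balanced accuracy stays at chance level, which is why the classifier's numbers are easier to interpret next to these two reference values.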



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-03 Thread MGerlach
MGerlach added a comment.


  @GoranSMilovanovic
  
  In T282563#7250712, @GoranSMilovanovic wrote:
  
  > @Jan_Dittrich **Do we really find a Lindy effect in the Wikidata account age 
distribution?**
  >
  > **Assumption.** As demonstrated in Eliazar, Iddo (November 2017). "Lindy's 
Law". Physica A: Statistical Mechanics and Its Applications. 486: 797–805, if 
the Lindy effect holds then the survival function of the account age is Pareto. 
So, we need to test if the Wikidata account age follows a power-law or not.
  >
  > Now, this is a bit tricky, so let's go one step at a time:
  >
  > - the data are the frequencies of Wikidata account ages;
  >
  > - the age of the account is the number of months since the registration 
until the first sequence of five inactive months (when we pronounce an editor 
officially inactive by convention)
  >
  > - Bots are filtered out in the ETL phase;
  >
  > - following a power-law estimation in R from {poweRlaw}, documentation: 
https://cran.r-project.org/web/packages/poweRlaw/index.html, essentially based 
on power-law estimates derived in Clauset, Shalizi & Newman (2007). "Power-law 
distributions in empirical data": https://arxiv.org/pdf/0706.1062.pdf
  >
  > - the `x_min` of the account age is estimated to be `153` with an `alpha` 
of `2.217158`, indicating a power-law behavior with divergent second and 
higher-order moments (also see Gillespie (2017). Fitting Heavy Tailed 
Distributions: The poweRlaw Package: 
https://cran.r-project.org/web/packages/poweRlaw/vignettes/d_jss_paper.pdf, 
page 3);
  >
  > - if the `x_min` is set to the de facto minimum of the account age (which 
is `69`; no `x_min` estimation), then we have a power-law behavior with an 
estimate of `alpha` found at `1.626341` - a power-law behavior with all moments 
diverging.
  
  How can the `x_min` be so large (estimated or not)?  My understanding of the 
parameter `x_min` is that we fit a power-law distribution to all `x > x_min`. 
Thus we only fit the power law for account ages of more than 69 or 153 months, 
respectively. From the plots you showed above, this applies only to a small 
fraction of accounts. This is problematic because your fitted distribution does 
not try to describe anything that happens at `x < x_min`.
  
  > **However**, following the recommendations of the authors of {poweRlaw}, 
the bootstrap analysis shows that in neither of the two cases the power-law is 
really present (see the Hypothesis Testing framework implemented in {poweRlaw}, 
2. Examples using the poweRlaw package: 
https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf,
 pages 4 - 5).
  
  This means that the power-law distribution is rejected for the data. However, 
this is not surprising - real data is messy and this type of hypothesis test 
rejects even if we have really strong reasons to believe the data should follow 
a power-law distribution, e.g. due to small correlations (you can read about 
this argument in more detail in a paper we wrote some time ago).
  
  One possible path out of this is to slightly change the question. Instead of 
asking whether the data is perfectly described by a power law (in most cases it 
is not), it might be more interesting to know whether a power law describes the 
data better than another distribution. This is also described in the package 
you mention (3. Comparing distributions with the poweRlaw package). For example, 
one could compare the fit of a power law with a Poisson. The 
latter is an interesting comparison because the Poisson follows if the 
probability of stopping is independent of the time an editor has already been 
around. In contrast, the power law follows if the probability of stopping 
decreases with time (in a specific way). If the power law fits better than the 
Poisson, this would then be evidence that the probability of stopping does 
depend (somehow) on the time an editor has already been around.
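  
  The contrast between a time-independent and a decreasing probability of stopping can be illustrated with a small Python sketch. It is not taken from the analysis in this task: the parameter values (`p_stop = 0.2`, `x_min = 1`, `alpha = 2.2`) and the function names are assumptions. A constant per-month stopping probability gives a geometric lifetime (the discrete counterpart of the exponential waiting time of a Poisson process), while a power law with density exponent `alpha` has the survival function `(x / x_min) ** -(alpha - 1)`.

```python
def geometric_survival(month, p_stop):
    """P(still active after `month` months) with constant stopping probability."""
    return (1.0 - p_stop) ** month

def pareto_survival(month, x_min, alpha):
    """Power-law survival function, valid for month >= x_min."""
    return (month / x_min) ** (-(alpha - 1.0))

def monthly_hazard(survival, month):
    """P(stopping during the next month | active up to `month`)."""
    return 1.0 - survival(month + 1) / survival(month)

months = [1, 6, 12, 24, 48]

# Assumed illustrative parameters, not estimates from the Wikidata data.
constant_hazard = [
    monthly_hazard(lambda m: geometric_survival(m, 0.2), m) for m in months
]
decreasing_hazard = [
    monthly_hazard(lambda m: pareto_survival(m, 1.0, 2.2), m) for m in months
]

for m, c, d in zip(months, constant_hazard, decreasing_hazard):
    print(f"month {m:2d}: constant={c:.3f} power-law={d:.3f}")
```

  The first column stays flat whatever the account age, whereas the second shrinks as the account gets older, which is the Lindy-type behavior the comparison of fits is meant to detect.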



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-02 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @awight
  
  First of all, I might have failed to mention that the outcome variable (i.e. 
what we are predicting) is **"stay"**, not "leave".  My bad.
  
  > I'm unsure whether "positive" here means the classifier identifies a person 
who will leave or stay, btw., can you share more about the test results?
  
  These terms have one and the same meaning in Statistical Decision Theory and 
ML, always, see ROC Analysis from Wikipedia.
  
  > What about the true/false negative rate?
  
  Well they are just 1 - their positive counterparts, right?
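  
  As a worked illustration of that relation, the following Python sketch derives the negative-class rates, the overall accuracy and the precision from the reported TPR and FPR. The positive class is "stay", as stated above. The 6% share of stayers is an assumption borrowed from the ~6% retained figure reported elsewhere in this task; the class balance of the classifier's actual test set may differ, and the function name is hypothetical.

```python
def rates_to_summary(tpr, fpr, positive_share):
    """Derive complementary rates, accuracy and precision from TPR and FPR.

    positive_share is the assumed fraction of positive ("stay") cases.
    """
    tnr = 1.0 - fpr  # true negative rate
    fnr = 1.0 - tpr  # false negative rate
    true_pos = tpr * positive_share            # share of all cases
    false_pos = fpr * (1.0 - positive_share)   # share of all cases
    true_neg = tnr * (1.0 - positive_share)    # share of all cases
    return {
        "tnr": tnr,
        "fnr": fnr,
        "accuracy": true_pos + true_neg,
        "precision": true_pos / (true_pos + false_pos),
    }

# Reported rates plus an ASSUMED 6% share of "stay" cases.
summary = rates_to_summary(tpr=0.90, fpr=0.028, positive_share=0.06)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

  Under these assumed numbers the accuracy comes out close to the reported 97%, while roughly two out of three "stay" predictions would be correct. The rates alone do not determine the precision, which depends on the class balance.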
  
  > To my untrained eye, these numbers look typical for an imbalanced 
training/test set, where we have a lot of people abandoning so it's really easy 
for a classifier to accurately predict that a user will leave, but probably 
much less accurate at predicting that a person will stay.
  
  To the contrary, the reported low FA rate means that the model is good at 
avoiding the Type I Error, i.e. predicting that someone would stay while 
they actually left. And the dataset is still very imbalanced - but there are 
techniques to deal with it, and I've used some of them here.
  
  > I like your "median length of inactivity" measure, that could be a good 
single-parameter predictor.
  
  Could be, don't know yet.
  
  > Of course, there is some risk of this being tautological: e.g. if a user is 
absent for a median of 5 months then they are roughly 50% likely to be absent 
for another 5 months (therefore considered "abandoned") in the future.
  
  Wouldn't that hold only if Lindy and Power-Law hold too? But I think they do 
not, see T282563#7250712.
  
  **N.B.** I am still experimenting to see if the feature engineering process 
can give us even more information than we are using now. Then I will share the 
code and the data so that anyone can play with the model or build their own.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-02 Thread awight
awight added a comment.


  In T282563#7251679, @GoranSMilovanovic wrote:
  
  > Anyways, following a series of cross-validations and tricks to account for 
a highly imbalanced dataset, one Random Forest classifier was able to predict 
leave vs stay in Wikidata with:
  >
  > - **Accuracy of 97%**,
  > - **Hit rate (True Positive Rate, TPR) of 90%**,
  > - and a **False Alarm (False Positive Rate, FPR) of only 2.8%**.
  
  What about the true/false negative rate?  To my untrained eye, these numbers 
look typical for an imbalanced training/test set, where we have a lot of people 
abandoning so it's really easy for a classifier to accurately predict that a 
user will leave, but probably much less accurate at predicting that a person 
will stay.  I'm unsure whether "positive" here means the classifier identifies 
a person who will leave or stay, btw., can you share more about the test 
results?
  
  > The model encompasses the following features (MeanDecreaseGini is a measure 
of variable importance in Random Forests):
  
  Thanks for including the relative importance of each feature.  I like your 
"median length of inactivity" measure, that could be a good single-parameter 
predictor.  Of course, there is some risk of this being tautological: e.g. if a 
user is absent for a median of 5 months then they are roughly 50% likely to be 
absent for another 5 months (therefore considered "abandoned") in the future.  
Maybe it would help the exploration to run a tool like LIME on the model to 
learn more about how features are related to the prediction.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-02 Thread Jan_Dittrich
Jan_Dittrich added a comment.


  Thanks, super interesting! Some things are beyond my data-skills, thus, I 
look forward to feedback from other people!



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-02 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich @awight @Lydia_Pintscher @Manuel @Tobi_WMDE_SW
  
  Probably of interest to all of you, because we have a quite interesting - and 
potentially very useful - outcome here.
  
  As a side project to this ticket, I have trained a Random Forest classifier, 
following some feature engineering steps first, to predict which editors would 
probably continue to work on Wikidata vs. who would probably leave.
  
  All features are derived from user revision histories coded as 
`000101010101111110001100...`, where `1` represents an active month 
(>=5 edits) and `0` an inactive month. 
  All user revision histories for those who are officially and by convention 
//absent// in the present moment (i.e., their revision history ends in 
`...0+$` - five or more consecutive months of inactivity //now//) were truncated 
to end in four consecutive months of inactivity - simply because we would like 
to predict what would happen to a user who is still an active editor, and not 
do so once we have already pronounced them inactive.
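  
  The coding and truncation steps described above could look roughly like this in Python. This is a sketch under stated assumptions, not the code used for the model: the function names are hypothetical, and the input is assumed to be one edit count per month since registration.

```python
import re

ACTIVE_THRESHOLD = 5  # a month counts as active with >= 5 edits (from the text)
ABSENT_AFTER = 5      # five or more trailing inactive months means "absent"

def encode_history(monthly_edits, threshold=ACTIVE_THRESHOLD):
    """Turn per-month edit counts into a string of 1 (active) and 0 (inactive)."""
    return "".join("1" if n >= threshold else "0" for n in monthly_edits)

def truncate_if_absent(history, absent_after=ABSENT_AFTER):
    """Cut a trailing run of >= absent_after zeros down to absent_after - 1 zeros.

    Histories that do not end in such a run are returned unchanged.
    """
    match = re.search(r"0+$", history)
    if match and len(match.group()) >= absent_after:
        return history[: match.start()] + "0" * (absent_after - 1)
    return history

# Hypothetical edit counts for one editor, oldest month first.
edits = [0, 7, 5, 4, 12, 0, 0, 0, 0, 0, 0, 0]
history = encode_history(edits)
print(history, "->", truncate_if_absent(history))
```

  After truncation every "absent" editor is represented as they looked one month before being declared inactive, so the features are computed from information that would have been available at prediction time.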
  
  Anyways, following a series of cross-validations and tricks to account for a 
highly imbalanced dataset, one Random Forest classifier was able to predict 
leave vs stay in Wikidata with:
  
  - Accuracy of 97%,
  - Hit rate (True Positive Rate, TPR) of 90%,
  - and a False Alarm (False Positive Rate, FPR) of only 2.8%.
  
  This means that we can recognize, with decent accuracy and a low level of 
false alarms, those editors who are on a streak to continue contributing to 
Wikidata in the future, and think of how to use that information in community 
building and to improve our sustainability.
  
  The result should be taken as preliminary, but these initial tests were 
already quite extensive (8 - 10 h of processing, model selection among 240 
cross-validated Random Forest classifiers...).
  
  The model encompasses the following features (MeanDecreaseGini is a measure 
of variable importance in Random Forests):
  
                                  MeanDecreaseGini
    med_inact                           12274.1092
    sumActiveMonths                      7676.7991
    mean_inact                           6686.6961
    accountAge                           5541.5158
    averageRevisionsPerMonth             3875.9850
    pActiveMonth                         3692.2568
    numRevisions                         3618.5379
    H                                    2269.2940
    reactivationsN                       2145.5995
    averageTalkRevisionsPerMonth          552.7711
    talkrevisions                         384.1718
  
  Feature Vocabulary:
  
  - **med_inact** - the median of the length of user's periods of inactivity in 
months (say we find `000`, `000`,  ``, `00`, `0`, `00`, in a particular 
user's revision history somewhere - we take the median of the interval lengths)
  - **sumActiveMonths** - the count of active months in a particular user's 
revision history
  - **mean_inact** - the average length of user's periods of inactivity in 
months (say we find `000`, `000`,  ``, `00`, `0`, `00`, in a particular 
user's revision history somewhere - we take the average of the interval lengths)
  - **accountAge** - the length of user's revision history in months, since 
user registration and up to the present moment
  - **averageRevisionsPerMonth** - the average number of revisions in the 
namespaces 0, 120, 146
  - **pActiveMonth** - the proportion of active months in a particular user's 
revision history (i.e. the probability of an active month for a user)
  - **numRevisions** - the total number of revisions in the namespaces 0, 120, 
146
  - **H** - the Shannon Diversity Index derived from the user's revision 
history (i.e. entropy normalized by Hmax)
  - **reactivationsN** - the number of reactivations of the user (slightly 
problematic from a methodological viewpoint: if the user is currently 
inactive, and we observe their inactivity for the first time, it is zero by 
definition, and then there is also the question of whether we focus on that 
user's data in the future or not)
  - **averageTalkRevisionsPerMonth** - the average number of edits in the Talk 
namespaces
  - **talkrevisions** - the total number of edits in the Talk namespaces
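  
  A few of the features listed above can be derived directly from the coded revision history, as in this minimal Python sketch. It is an illustration only: the function name and the example history are made up, every run of zeros (including any before the first active month) is assumed to count as an inactivity period, and features that need more than the coded string (revision counts, talk revisions, `H`, `reactivationsN`) are left out.

```python
import re
from statistics import mean, median

def history_features(history):
    """Compute inactivity and activity features from a 0/1 month string."""
    gaps = [len(run) for run in re.findall(r"0+", history)]
    active_months = history.count("1")
    return {
        # None mirrors the NA's reported for editors without inactive months.
        "med_inact": median(gaps) if gaps else None,
        "mean_inact": mean(gaps) if gaps else None,
        "sumActiveMonths": active_months,
        "accountAge": len(history),
        "pActiveMonth": active_months / len(history),
    }

# Hypothetical history with inactivity periods of 3, 3, 2, 1 and 2 months.
example = "100010001100101001"
features = history_features(example)
for name, value in features.items():
    print(name, value)
```

  Since `med_inact`, `mean_inact`, `sumActiveMonths` and `pActiveMonth` are all functions of the same string, the sketch also makes the redundancy among the features easy to see.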
  
  These features are somewhat redundant (Random Forests do not care much 
about collinearity and similar issues, however), so the prospects are good that 
we can develop a more efficient/lighter and yet successful model in the future.
  
  All computations were performed on DataKolektiv's servers on a dataset with 
anonymized user ids.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-08-01 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich **Do we really find a Lindy effect in the Wikidata account age 
distribution?**
  
  **Assumption.** As demonstrated in Eliazar, Iddo (November 2017). "Lindy's 
Law". Physica A: Statistical Mechanics and Its Applications. 486: 797–805, if 
the Lindy effect holds then the survival function of the account age is Pareto. 
So, we need to test if the Wikidata account age follows a power-law or not.
  
  Now, this is a bit tricky, so let's go one step at a time:
  
  - the data are the frequencies of Wikidata account ages;
  
  - the age of the account is the number of months since the registration until 
the first sequence of five inactive months (when we pronounce an editor 
officially inactive by convention)
  
  - Bots are filtered out in the ETL phase;
  
  - following a power-law estimation in R from {poweRlaw}, documentation: 
https://cran.r-project.org/web/packages/poweRlaw/index.html, essentially based 
on power-law estimates derived in Clauset, Shalizi & Newman (2007). "Power-law 
distributions in empirical data": https://arxiv.org/pdf/0706.1062.pdf
  
  - the `x_min` of the account age is estimated to be `153` with an `alpha` of 
`2.217158`, indicating a power-law behavior with divergent second and 
higher-order moments (also see Gillespie (2017). Fitting Heavy Tailed 
Distributions: The poweRlaw Package: 
https://cran.r-project.org/web/packages/poweRlaw/vignettes/d_jss_paper.pdf, 
page 3);
  
  - if the `x_min` is set to the de facto minimum of the account age (which is 
`69`; no `x_min` estimation), then we have a power-law behavior with an 
estimate of `alpha` found at `1.626341` - a power-law behavior with all moments 
diverging.
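  
  How `alpha` relates to `x_min` and to the diverging moments can be sketched in Python. This is not the {poweRlaw} procedure used above: it uses the closed-form maximum-likelihood estimate for the continuous power law given in Clauset, Shalizi & Newman, `alpha = 1 + n / sum(ln(x_i / x_min))`, as an approximation, whereas account ages in months are discrete. The function names and the synthetic sample are assumptions. For a power law with density exponent `alpha`, the moment of order `m` is finite only if `m < alpha - 1`.

```python
import math

def alpha_mle(xs, x_min):
    """Continuous power-law MLE of alpha, using only observations >= x_min.

    Assumes at least one observation lies strictly above x_min.
    """
    tail = [x for x in xs if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

def finite_moment_orders(alpha, max_order=3):
    """Orders m (1..max_order) whose moment is finite, i.e. m < alpha - 1."""
    return [m for m in range(1, max_order + 1) if m < alpha - 1.0]

# Synthetic sample from a power law with alpha = 2.5 and x_min = 153,
# drawn by inverse-transform sampling on an evenly spaced grid.
true_alpha, x_min, n = 2.5, 153.0, 10000
sample = [
    x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (true_alpha - 1.0))
    for i in range(n)
]

print(alpha_mle(sample, x_min))
print(finite_moment_orders(2.217158))  # estimated x_min case
print(finite_moment_orders(1.626341))  # x_min fixed at the observed minimum
```

  With `alpha` near 2.2 only the first moment (the mean) is finite, and with `alpha` near 1.6 none is, which matches the two cases described in the list above.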
  
  **However**, following the recommendations of the authors of {poweRlaw}, the 
bootstrap analysis shows that in neither of the two cases the power-law is 
really present (see the Hypothesis Testing framework implemented in {poweRlaw}, 
2. Examples using the poweRlaw package: 
https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf,
 pages 4 - 5).
  
  **So, it does not seem to be a case of the Lindy effect after all.** The code 
will be shared on Gerrit soon and referenced from Phab.
  
  I would also feel at least a bit more confident than I am now if @MGerlach 
could find some time to take a look at the data.
  
  It is methodologically problematic, at least in my view, to try to 
establish whether the power-law (and thus Lindy) holds for the total number of 
//active months// (obtained by neglecting all inactive months in the editor's 
revision history). However, we can try.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-07-30 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich @awight
  
  Finally, regarding
  
  > ... user behavior on talk pages
  
  F34570923: 07_RevisionTalkNamespacesVSLeftWikidata.png 

  
  but please take into consideration that the distributions are somewhat 
misleading, since 368,410 (out of 383,045 in total) of the editors considered 
have never made a single edit in any talk namespace on Wikidata.
  
  In fact, revisions in talk namespaces matter a lot with respect to whether 
the editor is currently found to be active or not:
  
  **Left Wikidata**
  
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
       0.0000    0.0000    0.0000    0.2381    0.0000 2258.0000 
  
  **Active on Wikidata**
  
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
         0.00      0.00      0.00     10.13      0.00  11943.00 
  
  We can see that those who are still active make an order of magnitude more 
edits in talk namespaces than those who have left (given the current definition 
of "left Wikidata", of course).



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-07-30 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich @awight
  
  In reference to T282563#7186386 and T282563#7226336:
  
  - I have used a fresh dataset, relying on the `2021-06` snapshot of the 
`wmf.mediawiki_history` table;
  - the results are fully replicated (in a qualitative sense, of course);
  - I have also filtered out all editors who have fewer than six (6) months of 
presence in Wikidata, simply because they never really had a chance to leave 
(where "left Wikidata" is defined as five (5) months of inactivity).
  
  **The Lindy Effect**
  
  I have used several different operational definitions of the "length of past 
activity" to illustrate the Lindy Effect in Wikidata editing.
  
  **A. The total number of active months in editor's revision history**
  
  An editor can be active and inactive now and then; this measure of 
"length of past activity" is defined as the count of months in which an editor 
was active over the whole course of their presence in Wikidata since 
registration.
  The vertical axis represents the probability to leave Wikidata given the 
count of active months.
  
  F34570577: 01_LindyA.png 
  
  **B. The probability of an active month**
  
  The previous measure could be criticized on the grounds that it is not the 
same if (a) someone has ten active months while being registered a year ago and 
if (b) someone has ten active months while being registered three years ago. I 
have turned the absolute counts of active months per editor into proportions of 
their total stay in Wikidata since registration (effectively calculating the 
probability of any given month in the editor's revision history being an active 
month). 
  The horizontal axis is the probability of having an active month in the 
course of one's revision history, binned into 100 intervals. The vertical axis 
represents the probability to leave Wikidata given that proportion of active 
months.
  
  F34570592: 02_LindyA.png 
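  
  The binning behind a plot of this kind can be sketched in Python as below. This is not the analytics code for the figure: the function name, the input layout (one pair of active-month proportion and a "left Wikidata" flag per editor) and the example records are assumptions.

```python
from collections import defaultdict

def leave_probability_by_bin(records, n_bins=100):
    """Estimate P(left | bin) for the proportion of active months.

    records: iterable of (p_active, left) with 0 <= p_active <= 1 and
    left being True if the editor is currently considered to have left.
    Returns {bin_index: share of editors in that bin who left}.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [number left, total]
    for p_active, left in records:
        # p_active == 1.0 would map to bin n_bins, so clamp to the last bin.
        bin_index = min(int(p_active * n_bins), n_bins - 1)
        counts[bin_index][0] += int(left)
        counts[bin_index][1] += 1
    return {
        bin_index: n_left / total
        for bin_index, (n_left, total) in sorted(counts.items())
    }

# Hypothetical editors: (proportion of active months, has left Wikidata).
records = [
    (0.052, True),
    (0.054, True),
    (0.058, False),
    (0.953, False),
    (1.0, False),
]
by_bin = leave_probability_by_bin(records)
print(by_bin)
```

  Bins without any editors are simply absent from the result, so sparsely populated bins should be read with care when plotting.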
  
  **C. The age of the account**
  This is simple yet probably inconclusive in respect to the Lindy Effect 
itself: how old is their account vs what is the probability that they have left 
Wikidata (i.e. are now inactive for five months at least)?
  
  F34570590: 03_LindyA.png 
  
  **The distribution of the number of revisions vs left or did not leave 
Wikidata**
  The horizontal axis represents the log of the number of revisions, while the 
vertical axis is probability density. Obviously, those who are still with us 
are those who made more edits until now - as expected.
  
  F34570594: 04_RevisionsVSLeftWikidata.png 

  
  Here are the descriptive statistics on revisions:
  
  **Left Wikidata:**
  
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
           1        1        2      203        7  5891740 
  
  **Active on Wikidata:**
  
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
           2       19      108    15268      720 31003903 
  
  **The distribution of the length of inactivity periods vs left or did not 
leave Wikidata**
  A single editor can have several periods of inactivity of varying length in 
months. I have analyzed the distribution of both mean and median length of 
inactivity periods per user, grouped according to whether they are still 
editing or not.
  
  Mean length of inactivity periods first:
  
  F34570597: 05_MeanLengthInactiveVSLeftWikidata.png 

  
  Obviously, the editors who are still active typically have much shorter 
sequences of inactive months.
  
  The descriptive statistics on mean length of inactivity periods:
  
  **Left Wikidata:**
  
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       1.429   14.500   30.000   37.185   56.000  105.000 
  
  **Active on Wikidata:**
  
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
       1.000    1.875    3.000    4.942    5.600   77.000       88 
  
  **N.B.** `NA's` represent those editors who did not have a single inactive 
month in their revision history.
  
  And now for the median length of inactivity periods:
  
  F34570602: 06_MedianLengthInactiveVSLeftWikidata.png 

  
  The descriptive statistics:
  
  **Left Wikidata:**
  
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
        1.00    13.00    30.00    36.52    56.00   105.00 
  
  **Active on Wikidata:**
  
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
       1.000    1.000    2.000    3.609    4.000   77.000       88 
  
  **N.B.** `NA's` represent those editors who did not have a single inactive 
month in their revision history.
  
  My present conclusions:
  
  - The Lindy Effect holds in Wikidata editing: the longer the past editing 
behavior, the higher the chances that it will persist;
  - As expe

[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-07-30 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  - Re-work on a fresh dataset (the `2021-06` snapshot of the 
`wmf.mediawiki_history` table) is underway;
  - Reporting: until tonight (hopefully);
  - @Jan_Dittrich I will be getting in touch via e-mail about the 
research/paper part later during the day.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-07-28 Thread Jan_Dittrich
Jan_Dittrich added a comment.


  Seems also relevant: 
https://eprints.whiterose.ac.uk/140352/1/evolution-wikidata-editors.pdf "The 
evolution of power and standard Wikidata editors: comparing editing behavior 
over time to predict lifespan and volume of edits"



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-07-21 Thread awight
awight added a comment.


  I really like where this is going.
  
  Maybe also look for patterns in the 94% who have dropped off, for example any 
variables that negatively correlate with longevity.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-06-30 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich Following our 20210630 discussion:
  
  **Additional questions**
  
  - for those ~ 6% who are still with us: can we find any interesting patterns
    - the distribution of the length of their periods of inactivity
    - the distribution of their usage counts
    - user behavior on talk pages
  
  - compare the ~ 6% retained group vs. others



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-06-15 Thread Jan_Dittrich
Jan_Dittrich updated the task description.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-06-15 Thread Jan_Dittrich
Jan_Dittrich updated the task description.



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-06-15 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich
  
  **Please disregard all previous findings**. The following is based on:
  
  - the definition of editor inactivity in T282563#7124389,
  - and the two important corrections in the analytics code:
    - one to guard against what @awight has observed on the regex in 
T282563#7110253,
    - and the other, even more important one, that I had to introduce to fix 
a fatal flaw in the existing analysis (having to do with considering only 
months with > 0 edits at all: my incorrect assumption about the structure of the 
ETL result set).
  
  We consider only editors that were active at some point in time.
  
  > How high is the likelihood to become an active editor again?
  
  Here we consider those users who have had at least one period of inactivity 
(5 months w/o edits): the probability of editor reactivation is `0.208437`:
  
  - around 21% of users who were active at least at some point in time and had 
at least one period of inactivity became active editors again at some point in 
time.
  
  The estimate of the probability of reactivation is based on `358,823` 
editors. Out of these `358,823` editors, `338,058` (~ 94%) have eventually left 
Wikidata, while `20,765` (~ 6%) are still with us. 
  The definition of **"has left Wikidata"** used here is the following one: the 
editor currently has five or more months of inactivity.
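
  A minimal sketch of how these definitions could be applied to one editor, assuming each editor's history is a string with one character per month since registration (`1` = active, `0` = inactive). This is not the R code used for the analysis (it is not shown in this thread); the names `classify` and `reactivation_probability` are hypothetical, and ignoring the months before the first edit is an assumption.

```python
import re

INACTIVITY_MONTHS = 5  # threshold taken from the definition in this comment


def classify(history: str, gap: int = INACTIVITY_MONTHS) -> dict:
    """Classify one editor from a month-by-month '1'/'0' activity string."""
    # Assumption: months before the first active month are not inactivity.
    trimmed = history.lstrip("0")
    if not trimmed:
        raise ValueError("editor was never active")
    zeros = "0" * gap
    return {
        # at least one run of `gap` or more inactive months
        "had_inactivity": zeros in trimmed,
        # such a run is followed by an active month again
        "reactivated": re.search("0{%d,}1" % gap, trimmed) is not None,
        # "has left Wikidata": the history currently ends in `gap`+ inactive months
        "has_left": trimmed.endswith(zeros),
    }


def reactivation_probability(histories) -> float:
    """Share of reactivated editors among those with an inactivity period."""
    flags = [classify(h) for h in histories]
    eligible = [f for f in flags if f["had_inactivity"]]
    return sum(f["reactivated"] for f in eligible) / len(eligible)
```

  For example, `classify("11000001")` marks the editor as reactivated and not as having left, while `classify("1100000")` marks the editor as having left without reactivation.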
  
  > What is the relation between past length of participation in the Wikidata 
community and the likelihood to stop participating?
  
  Again, we define the **"past length of participation"** as the total number 
of active months since registration, and **"to stop participating"** as **"has 
left Wikidata"** (see above): the editor currently has five or more months of 
inactivity.
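
  The relation described here can be sketched as a grouped proportion: for each total number of active months, the share of editors who currently have five or more months of inactivity. This is only an illustration under the same assumption of `1`/`0` monthly activity strings; `leave_rate_by_active_months` is a hypothetical name, not a function from the analysis code.

```python
from collections import defaultdict


def leave_rate_by_active_months(histories, gap=5):
    """Map total active months -> share of editors who have 'left Wikidata'."""
    totals = defaultdict(int)
    left = defaultdict(int)
    for history in histories:
        active = history.count("1")  # past length of participation
        if active == 0:
            continue  # never-active editors are not considered
        totals[active] += 1
        if history.endswith("0" * gap):  # currently `gap`+ inactive months
            left[active] += 1
    return {months: left[months] / totals[months] for months in sorted(totals)}
```

  A Lindy-like pattern would show up as a rate that decreases as the number of active months grows; with few editors in the high-activity groups, those rates are noisy.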
  
  @Jan_Dittrich And here goes a beautiful illustration of the Lindy Effect for 
Wikidata editing (the data labels referring to how many editors are 
represented):
  
  F34501266: WikidataUserRetention_NEW_LindyEffect.png 



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-06-02 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich Happening now:
  
  - the incorporation of the new inactivity criterion mentioned in 
T282563#7124389  (thanks 
@MGerlach), and
  - checking the completeness of my technical procedures in accordance with 
T282563#7110253  (thanks 
@awight)
  
  Reporting back as soon as I have something.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-31 Thread Jan_Dittrich
Jan_Dittrich added a comment.


  From: "How Long Do Wikipedia Editors Keep Active?"
  
  > …specifically, we consider an editor to be “dead” or inactive if he 
did not make any edit for a certain period of time. Here we set the threshold 
of inactivity to be 5 months, since it reflects WMF’s concern as demonstrated in 
the recent Wikipedia Participation Challenge
  
  Not sure if that is ideal, but it is certainly simpler.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-31 Thread Jan_Dittrich
Jan_Dittrich added a subscriber: MGerlach.
Jan_Dittrich added a comment.


  Some new pointers (via @MGerlach ):
  
  - Understanding Editor Drop-off 

  - How long do Wikipedia Editors Keep Active 



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-25 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich @awight
  
  - I need to re-adjust the regular expression for editor reactivations now, as 
suggested by Adam in T282563#7110253.
  
  > I am beginning to think that we might need to readjust the definition of 
when we consider the editor to have left Wikidata as I have proposed it. What do 
you think? Any ideas?
  >
  >> Would probably make sense to adjust it, but I have currently no good idea 
how. I'll keep thinking.
  
  There is a range of criteria that we can test and then see which one fits our 
needs the best (i.e. what exactly do we want to learn and understand). 
  Also, I still have not found the time to go through the papers that you have 
provided - maybe someone has already figured out some good, working criteria.
  This is a first shot at the data and analysis that we are discussing here. I 
don't see the question posed as simple in any sense. Maybe a concise 
meeting/brainstorming on this would do us good? Let me know.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-25 Thread Jan_Dittrich
Jan_Dittrich added a comment.


  > I am beginning to think that we might need to readjust the definition of 
when we consider the editor to have left Wikidata as I have proposed it. What do 
you think? Any ideas?
  
  Would probably make sense to adjust it, but I have currently no good idea 
how. I'll keep thinking.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-25 Thread Jan_Dittrich
Jan_Dittrich added a comment.


  >> Also, I have a vague and anecdotal memory that Wikidata has a lot of users 
who are semi- or fully-automated bots but are not registered as such. Is this 
the case? If so, is there some other heuristic like overly rapid editing that 
we can use to filter out these users or analyze them separately?
  
  
  
  > It is easier said than done but I also have a vague memory that once I did 
something similar for the purposes of some analysis. Let me think for a while 
and try to remember what exactly was done to account for such 
semi-human-semi-bot editors.
  
  Good point, I guess @Manuel has also some ideas on this.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-25 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @awight
  
  > If the zero reactivations category includes anyone for whom the 1+0+1+ 
regex doesn't match, doesn't this also include active editors who simply have a 
history like, 000111..., and who are still active?
  
  Bravo... Will take a look at it and correct the analysis if necessary. Thanks 
again!
  
  > Also, I have a vague and anecdotal memory that Wikidata has a lot of users 
who are semi- or fully-automated bots but are not registered as such. Is this 
the case? If so, is there some other heuristic like overly rapid editing that 
we can use to filter out these users or analyze them separately?
  
  It is easier said than done but I also have a vague memory that once I did 
something similar for the purposes of some analysis. Let me think for a while 
and try to remember what exactly was done to account for such 
semi-human-semi-bot editors.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-25 Thread awight
awight added a comment.


  This is looking great!  I mean, it's a discouraging phenomenon but a 
promising analysis :-)
  
  I have a question about how to interpret the "0" reactivations category, 
which is stated above to be "a vast majority of editors [who] never reactivate 
following one month of inactivity".  If the zero reactivations category 
includes anyone for whom the `1+0+1+` regex doesn't match, doesn't this also 
include active editors who simply have a history like, `000111...`, and who are 
still active?
  
  Also, I have a vague and anecdotal memory that Wikidata has a lot of users 
who are semi- or fully-automated bots but are not registered as such.  Is this 
the case?  If so, is there some other heuristic like overly rapid editing that 
we can use to filter out these users or analyze them separately?


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich
  
  > For people who become active editors again, it would be interesting to 
understand the patterns: Do they leave for a year and start again? (Like 
parents taking a baby break) Do they stop for a month and continue? (maybe they 
were sick or so) etc. Different thresholds were proposed in this paper, an 
extensive analysis of inter-activity time is published here
  
  I guess the first step - before introducing any hypotheses on semantics 
("parents taking a baby break", "maybe they were sick or so", etc) - is to take 
a look at the distributions of the length of sequences of active and inactive 
months. 
  Here's the chart:
  
  F34466243: Activity_Inactivity_SeqLength.png 

  
  And there is already something interesting to observe: see how (a) on shorter 
activity/inactivity sequences (x-axis, represents the length of `000...` or 
`111...`) we observe more inactivity periods than activity periods, while (b) 
the situation switches in favor of activity periods as the length of the 
sequences increases?
  
  Is this an illustration of Lindy indeed?
  
  - The lengthier the observed sequence, the more likely it is to be a sequence 
of active rather than inactive months;
  - Vice versa, the shorter the observed sequence, the more likely it is to be a 
sequence of inactive rather than active months?
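
  The distributions behind such a chart can be sketched by run-length encoding each editor's monthly history and counting run lengths separately for active (`111...`) and inactive (`000...`) sequences. This is a minimal illustration, not the code that produced the chart; `run_length_counts` is a hypothetical name, and counting the leading and trailing runs like any other run is an assumption.

```python
from collections import Counter
from itertools import groupby


def run_length_counts(histories):
    """Count how often each run length occurs for active and inactive runs."""
    active, inactive = Counter(), Counter()
    for history in histories:
        # groupby yields maximal runs of identical characters
        for symbol, run in groupby(history):
            length = sum(1 for _ in run)
            if symbol == "1":
                active[length] += 1
            else:
                inactive[length] += 1
    return active, inactive
```

  Comparing `active[n]` with `inactive[n]` for increasing `n` reproduces the comparison described above, at least on a toy scale.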


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich This is also interesting: the higher the number of reactivations 
in editing behavior, the higher the probability of leaving Wikidata.
  
  F34466163: NumReactivations_ProbabilityLeave.png 



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich This might also help, a larger version of the chart in 
T282563#7107757  with the 
number of editors on each level of activity (x-axis) included. Of course we get 
to observe fewer editors with prolonged periods of activity, that is simply the 
nature of the data here. But it also suggests that we should be careful in 
drawing any conclusions on whether the Lindy Effect holds or not following 
prolonged periods of Wikidata editing. Please take a look and let me know what 
you think.
  
  F34466145: TotalActive_ProbabilityLeave_Large.png 



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich
  
  > Question: What is the relation between past length of participation in the 
Wikidata community and the likelihood to stop participating?
  
  @Jan_Dittrich I had to impose some definitions in order to be able to 
precisely formulate the effect that we are looking for. Please feel free to 
suggest any changes of the following methodological approach.
  
  - **Definitions**:
    - **Q1. When do we say that a user has left Wikidata?**
    - **A1.** (1) We look at the succession of months since user registration, 
each month coded as active (`1`) or inactive (`0`); (2) we look at all 
sequences of inactive months in a particular user's revision history and find 
the lengthiest one; (3) if the respective user's revision history ends in a 
sequence of inactive months, we compare the length of that latest period of 
inactivity with the previous lengthiest period of inactivity; (4) if the former 
is lengthier than the latter, we declare that the user has left Wikidata.
    - **Q2. How do we measure how long the user was active before 
leaving Wikidata, given that any user can have interspersed periods of activity 
and inactivity?**
    - **A2.** We count all active months for a user before it was declared that 
the user has stopped editing. Motivation: that is how we account for the 
difference between (a) a user who has edited for six months, *each month*, and 
left, i.e. `111111000...`, and (b) a user who has edited here and there for six 
months and then left, e.g. `101001000...`.
    - **Q3. What about the users who are still active (in the sense of not 
meeting the definition given in **A1**)?**
    - **A3.** We say that their measure of the length of activity is simply the 
count of their active months, and thus the measures in **A3** and **A2** are 
comparable.
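
  The definitions in A1 to A3 can be sketched as follows, assuming a monthly `1`/`0` activity string per user. The function names are hypothetical, and two details are assumptions: months between registration and the first edit count as an earlier inactive sequence, and "lengthier" is read as strictly longer.

```python
from itertools import groupby


def has_left_a1(history: str) -> bool:
    """A1: the trailing inactive run is longer than any earlier inactive run."""
    runs = [(symbol, len(list(group))) for symbol, group in groupby(history)]
    if not runs or runs[-1][0] != "0":
        return False  # history does not end in inactive months
    trailing = runs[-1][1]
    earlier = [length for symbol, length in runs[:-1] if symbol == "0"]
    # Assumption: with no earlier inactive run, any trailing inactivity counts.
    return trailing > max(earlier, default=0)


def active_months(history: str) -> int:
    """A2/A3: the activity measure is the count of active months."""
    return history.count("1")
```

  Under this reading, `101001000` counts as having left with three active months, while `1000100` does not, because its earlier pause of three months is longer than the current one of two.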
  
  **N.B.** All registered users that have never edited at all are filtered out 
in this analysis.
  
  Here is what it looks like:
  
  F34466137: TotalActive_ProbabilityLeave.png 



[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @WMDE-leszek Thank you. Don't worry, I will request the repo: we need one for 
this kind of one-shot task anyway.
  
  @awight @Jan_Dittrich
  
  The following is based on `395,680` Wikidata editors and follows the 
corrections suggested by @awight in T282563#7104580:
  
  - The probability of editor reactivation is `0.08548827`;
  - Reactivation is defined in the following way: we look at consecutive months 
since user registration and mark active months (>= 5 edits) as `1` and 
non-active months as `0`, so the whole user revision history becomes a string 
e.g. `01000100101...`
  - We search for a regex pattern `1+0+1+` in the revision histories; each 
match is recognized as a reactivation.
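
  The coding and matching steps above can be sketched as follows. The R code is not shown in this thread, so this is only an illustration of how such counting might behave; the function names are hypothetical. One thing the sketch makes visible is that counting non-overlapping matches of `1+0+1+` can differ from counting every inactive gap that is both preceded and followed by activity.

```python
import re


def to_history(monthly_edits, threshold=5):
    """Code each month as '1' (>= threshold edits) or '0' (otherwise)."""
    return "".join("1" if n >= threshold else "0" for n in monthly_edits)


def count_nonoverlapping(history: str) -> int:
    """Count non-overlapping matches of the pattern 1+0+1+."""
    return len(re.findall(r"1+0+1+", history))


def count_gaps(history: str) -> int:
    """Count inactive runs with activity on both sides (lookaround version)."""
    return len(re.findall(r"(?<=1)0+(?=1)", history))
```

  For `10101` the first function returns 1 and the second returns 2, because the active month that closes one gap is consumed by the first match and cannot open the next. A history such as `000111` yields 0 in both cases even though the editor is active.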
  
  Here is the distribution of the number of reactivations:
  
         0      1      2      3      4      5      6      7      8      9 
    361854  18324   6660   3580   2194   1363    869    447    236    101 
        10     11     12 
        37     12      3 
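
  As a cross-check, the reported probability follows directly from this distribution: it is the share of editors with at least one reactivation. The counts below are read off the table above (0 to 12 reactivations); the variable names are only for illustration.

```python
# Number of editors per number of reactivations, as read from the table
counts = {0: 361854, 1: 18324, 2: 6660, 3: 3580, 4: 2194, 5: 1363, 6: 869,
          7: 447, 8: 236, 9: 101, 10: 37, 11: 12, 12: 3}

total = sum(counts.values())          # all editors in the analysis
reactivated = total - counts[0]       # editors with one or more reactivations
p_reactivation = reactivated / total  # probability of editor reactivation

print(total, reactivated, round(p_reactivation, 8))
```

  The total comes to 395,680 editors, of whom 33,826 reactivated at least once, which agrees with the stated `0.08548827`.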
  
  Obviously a vast majority of editors never reactivate following one month of 
inactivity, implying that user retention in Wikidata is a serious problem 
indeed.
  
  Again, only item, property, and lexeme namespaces are considered.
  
  @awight Thanks again for T282563#7104580!


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-24 Thread WMDE-leszek
WMDE-leszek added a comment.


  > A Gerrit repo could take weeks
  
  I have no opinion on whether it would be overkill or not, but for what it 
is worth, I can create Gerrit repos when needed, and hopefully can do it 
on the same working day as requested.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-23 Thread awight
awight added a comment.


  >> Can you share more about the query that produced reactivations.csv? I 
can't tell from the information provided what counts as a "period of 
inactivity".
  >
  > R code. If you still would like me to share it with you I will open a 
Gerrit repo for this ticket.
  
  A Gerrit repo could take weeks, and probably overkill in this case.  If it's 
a single file, pasting  
is a good option, or you could create a GitLab repo if it's several files.  I 
am curious, but please only bother posting the files if you'd like review.  
Most of all I was wondering whether one month below the active threshold is 
long enough to count as a period of inactivity, and whether multiple 
reactivations are counted for a user, if they oscillate between inactive and 
active.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-22 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @awight
  
  > This line can be removed,
  > event_user_id != 0
  
  Indeed.
  
  > why would we need to check the historical column? If the user was 
classified as a bot at a time but now is not, shouldn't we respect the updated 
classification?
  
  Because in the data collection we want to be conservative and make sure that 
if we talk about human and not bot editors we certainly talk about human and 
not bot editors. To put it simply: it reduces the uncertainty in relation to 
our data.
  
  > The text says your filter will include the Item namespace (0), but the 
query only includes the talk pages: page_namespace = 1. Maybe this explains why 
there's such a low user count? I would have expected to see virtually all 
non-bot users who have edited wikidata.
  
  Good catch, thank you! I will re-run the ETL now.
  
  > Can you share more about the query that produced reactivations.csv? I can't 
tell from the information provided what counts as a "period of inactivity".
  
  R code. If you still would like me to share it with you I will open a Gerrit 
repo for this ticket.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-21 Thread awight
awight added a comment.


  This line can be removed,
  
  >   event_user_id != 0
  
  Anonymous users are already filtered out with `event_user_is_anonymous = 
FALSE`, and anyway `event_user_id` is set to `null` rather than `0` for 
anonymous (or revision-deleted) users.
  
  I think the bot tests can be simplified to, `NOT 
ARRAY_CONTAINS(event_user_groups, \'bot\')`, why would we need to check the 
historical column?  If the user was classified as a bot at a time but now is 
not, shouldn't we respect the updated classification?  And the specific tests 
against the event_user_is_bot_by columns seem to be redundant.
  
  The text says your filter will include the Item namespace (0), but the query 
only includes the talk pages: `page_namespace = 1`.  Maybe this explains why 
there's such a low user count?  I would have expected to see virtually all 
non-bot users who have edited wikidata.
  
  Can you share more about the query that produced `reactivations.csv`?  I 
can't tell from the information provided what counts as a "period of 
inactivity".


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-21 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich
  
  To answer the following, simple question:
  
  > How high is the likelihood to become an active editor again?
  
  please find a dataset attached: `userId` is a fake (but unique) Wikidata user 
ID, `reactivationsN` is the number of the times when the respective user 
started editing again following a period of inactivity.
  
  F34462941: reactivations.csv 
  
  The probability of becoming an editor again after a period of inactivity 
(counting whether that ever happened for a particular user, not how many 
times) is `0.104778` (approx. 10.5%). We are considering `11825` users in this 
analysis (see the ETL step in T282563#7094294 to understand what was 
filtered out).
  
  The distribution of the number of "comebacks" is the following one (first 
row: how many reactivations, second row: how many users did that):
  
        0      1      2      3      4      5      6      7      8      9 
    10586    720    224    136     79     36     22     15      5      2 
  
  Now the Lindy Effect.


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-17 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.


  @Jan_Dittrich
  
  Please find the analytics dataset attached.
  
  Columns:
  
  - **userId**: the anonymized Wikidata user Id
  - **registrationYM**: the `YYYY-MM` timestamp of user registration on Wikidata
  - **revisionYM**: the `YYYY-MM` timestamp of user revisions on Wikidata
  - **revisions**: the count of revisions made in the **revisionYM** month.
  
  **Next steps:**
  
  - We will proceed to test the Lindy effect for Wikidata by (a) calculating 
the difference in months since user registration and revisions, and (b) 
searching for pauses in editing behavior.
  - All hypotheses/research questions will be addressed from the derived time 
lags between user revisions and registrations.
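
  A minimal sketch of step (a), assuming the columns described above: the month lag between registration and each revision month is used to place activity on a timeline that runs from registration to the snapshot month. Months that do not appear in the dataset (no revisions) have to be filled in explicitly as inactive. The function names and the `threshold` parameter are hypothetical, and dropping revisions dated before registration is an assumption.

```python
def month_index(ym: str) -> int:
    """Convert a 'YYYY-MM' string to a running month number."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month) - 1


def activity_string(registration_ym, revisions_by_month, snapshot_ym, threshold=1):
    """Build a '1'/'0' string with one character per month since registration."""
    start = month_index(registration_ym)
    end = month_index(snapshot_ym)
    months = ["0"] * (end - start + 1)  # months absent from the data stay '0'
    for ym, revisions in revisions_by_month.items():
        lag = month_index(ym) - start  # months since registration
        if 0 <= lag < len(months) and revisions >= threshold:
            months[lag] = "1"
    return "".join(months)
```

  For a user registered in `2020-11` with revisions only in `2020-11` and `2021-02`, the timeline up to the `2021-04` snapshot has six months, two of them active.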
  
  F34458261: WD_UserRetention.csv 
  
  **Notes.**
  
  - Bot and anonymous revisions were filtered out.
  - Only item (0), property (120), and lexeme (146) namespaces are taken into 
account.
  
  The ETL was performed in HiveQL from wmf.mediawiki_history's current 
snapshot (and that would be `2021-04`). Here's the query:
  
    USE wmf;
    SELECT
      event_user_id,
      event_user_registration_timestamp,
      substring(event_timestamp, 1, 4) AS year,
      substring(event_timestamp, 6, 2) AS month,
      COUNT(*) AS revisions
    FROM mediawiki_history
    WHERE (
      event_entity = 'revision' AND
      event_type = 'create' AND
      wiki_db = 'wikidatawiki' AND
      -- exclude anonymous and bot revisions
      event_user_is_anonymous = FALSE AND
      NOT ARRAY_CONTAINS(event_user_is_bot_by, 'name') AND
      NOT ARRAY_CONTAINS(event_user_is_bot_by, 'group') AND
      NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'name') AND
      NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'group') AND
      NOT ARRAY_CONTAINS(event_user_groups, 'bot') AND
      NOT ARRAY_CONTAINS(event_user_groups_historical, 'bot') AND
      event_user_id != 0 AND
      page_is_redirect = FALSE AND
      revision_is_deleted_by_page_deletion = FALSE AND
      -- item (0), property (120), and lexeme (146) namespaces
      page_namespace IN (0, 120, 146) AND
      snapshot = '2021-04'
    )
    GROUP BY
      event_user_id,
      event_user_registration_timestamp,
      substring(event_timestamp, 1, 4),
      substring(event_timestamp, 6, 2);
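
  The query returns the full registration timestamp and separate year and 
month fields, while the attached dataset uses `registrationYM` and 
`revisionYM`. A sketch of how one result row might be mapped to the dataset 
columns, assuming the registration timestamp is a string that starts with 
`YYYY-MM` (as the `substring` calls on `event_timestamp` suggest). The helper 
name is hypothetical, and the anonymization of `userId` is not shown.

```python
def to_dataset_row(event_user_id, registration_timestamp, year, month, revisions):
    """Map one query result row to (userId, registrationYM, revisionYM, revisions)."""
    # Keep only the 'YYYY-MM' prefix; a missing registration timestamp stays None.
    registration_ym = registration_timestamp[:7] if registration_timestamp else None
    # Rebuild the revision month from the separate year and month fields.
    revision_ym = f"{int(year):04d}-{int(month):02d}"
    return (event_user_id, registration_ym, revision_ym, int(revisions))
```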

TASK DETAIL
  https://phabricator.wikimedia.org/T282563

To: GoranSMilovanovic
Cc: Manuel, Lydia_Pintscher, Aklapper, Jan_Dittrich, Invadibot, maantietaja, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-11 Thread GoranSMilovanovic
GoranSMilovanovic added projects: WMDE-Analytics-Engineering, 
User-GoranSMilovanovic.

TASK DETAIL
  https://phabricator.wikimedia.org/T282563

To: GoranSMilovanovic
Cc: Manuel, Lydia_Pintscher, Aklapper, Jan_Dittrich, Invadibot, maantietaja, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-11 Thread Jan_Dittrich
Jan_Dittrich updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T282563

To: GoranSMilovanovic, Jan_Dittrich
Cc: Manuel, Lydia_Pintscher, Aklapper, Jan_Dittrich, Invadibot, maantietaja, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

2021-05-11 Thread Jan_Dittrich
Jan_Dittrich renamed this task from "User Retention Wikidata: Exploring the 
resons for patterns in the 2021 Wikidata Community Survey" to "User Retention 
Wikidata: A model for "participating since" patterns in the 2021 Wikidata 
Community Survey".

TASK DETAIL
  https://phabricator.wikimedia.org/T282563

To: GoranSMilovanovic, Jan_Dittrich
Cc: Manuel, Lydia_Pintscher, Aklapper, Jan_Dittrich, Invadibot, maantietaja, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331