[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-26 Thread Mahir256
Mahir256 added a comment.
Concern has been raised about semi-protection of these items here.

TASK DETAIL
https://phabricator.wikimedia.org/T210664

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic, Mahir256
Cc: Mahir256, Andreasmperu, AfroThundr3007730, GoranSMilovanovic, Lydia_Pintscher, abian, Aklapper, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, Wikidata-bugs, aude, Mbch331
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-26 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.
@abian The code is in place. Just ping me somewhere whenever you need an update.


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-26 Thread Lydia_Pintscher
Lydia_Pintscher added a comment.
Thank you, Goran! That's useful information for discussing this further.
And I agree with abian that protecting more than 3% of all items is too costly when taking into account all the other factors mentioned.
I think we can close this task now?


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-25 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.
@abian You're welcome! Your analysis is awesome and could be used to exemplify how a Client/Manager/Editor/Owner should introduce the problem to a Data Scientist/Analyst!


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-25 Thread abian
abian added a comment.
Thank you for such an accurate analysis! And for offering your own computing resources.

About the results

I would read the results in a negative way: I would say the conclusion is that Items used on fewer than 9 pages aren't used enough to be protected for this reason, while the rest of the Items may or may not be protected. More variables than the number of uses are relevant when deciding whether to protect an Item, and some of them aren't easily quantifiable, for example:


the opportunity cost of each potential good edit prevented because of a semi-protection,
the value we give to preventing a bad edit,
the ratio bad edits/total edits by non-confirmed users,
the ratio edits by non-confirmed users/total edits,
the visibility that vandalism on an Item has per Item use,
the completeness and timelessness of an Item (or the opposite, the potential of an Item to be improved),
etc.


Taking these other variables (subjectively) into account, I wouldn't feel comfortable protecting all those Items. In any case, that would be a decision I would have to agree on with many users. The important thing is that now we objectively know more than before. Thanks again, @GoranSMilovanovic!


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-24 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.
@abian @Lydia_Pintscher We have the results.

Method


The power-law was estimated from 27,394,027 WD items that are currently used across the Wikimedia websites;
that is approximately 50% of the items now present in WD (54,195,898 as of today);
the statistic from which the power-law was estimated is the number of pages that make use of a particular item;
estimation procedures from the {poweRlaw} R package were used.


Results


Power-law behavior cannot be excluded,
with the value of the scaling parameter (alpha) of 2.050451 (infinite distribution variance), and
the value of the xmin parameter of 9 (in effect, this means: the distribution for all items with usage frequency >=9 does exhibit a power-law behavior).
The following is the log(Rank) vs log(Pages) plot for all WD items with usage frequency >= 9 across the pages in our projects:


F28030400: logRank-logPages.png
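
The "infinite distribution variance" remark in the results above can be checked with a short calculation. This is only a sketch and assumes the continuous approximation of the power law (the fit itself was made on discrete page counts, so the constants are indicative rather than exact):

```latex
p(x) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha},
\quad x \ge x_{\min}
\qquad\Longrightarrow\qquad
\langle x^m \rangle
= \int_{x_{\min}}^{\infty} x^m \, p(x) \, dx
= \frac{\alpha - 1}{\alpha - 1 - m} \, x_{\min}^{m}
\quad \text{for } m < \alpha - 1 .
```

With alpha of about 2.05 the condition holds for m = 1 (finite mean) but fails for m = 2, so the second moment, and with it the variance, diverges.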

Recommendation


Protect all items that are used on 9 or more pages across the Wikimedia websites.
There are 1,656,137 such items, which makes only 3.06% of the total number of items in WD, and only 6.05% of WD items that are currently in use.
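
The percentages quoted above follow directly from the three counts given in this comment. A minimal sketch of the arithmetic (the variable names are illustrative, not taken from the actual analysis script):

```python
# Counts quoted in the comment above (as of 2019-01-24).
total_items = 54_195_898   # all items present in WD
used_items = 27_394_027    # items used on at least one page
recommended = 1_656_137    # items used on 9 or more pages

# Shares expressed as percentages.
used_share = 100 * used_items / total_items
share_of_total = 100 * recommended / total_items
share_of_used = 100 * recommended / used_items

print(f"items in use:              {used_share:.2f}% of all items")
print(f"recommended for protection: {share_of_total:.2f}% of all items")
print(f"recommended for protection: {share_of_used:.2f}% of items in use")
```

Rounded to two decimals, the last two values reproduce the 3.06% and 6.05% figures stated in the recommendation.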


Discussion


If you can automate this, protecting 1,656,137 items should not be a problem, I guess.
Currently, the list of items that are recommended for protection encompasses only item IDs and the number of pages that make use of them;
the list will be shared with @Lydia_Pintscher;
it would take some time/engineering to get the English labels in, and
a procedure that regenerates this list on a regular daily basis would take approx. 3-4 hours per run, but
it cannot be established on our infrastructure before we have R upgraded, see my request in T214598.


So, until we have R upgraded on our systems, I recommend you ask for an updated list whenever you need one.


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-12 Thread Lydia_Pintscher
Lydia_Pintscher added a comment.
Alright. Let's give it a go :)


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-10 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.
@Lydia_Pintscher Hey, what is your take on this ticket and especially T210664#4860427? Thanks.


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-07 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.
@abian Ok, but I need @Lydia_Pintscher to give me a go for this first.


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-07 Thread abian
abian added a comment.
Looks good to me; if this actually isn't going to take you much effort, let's see what the results are and we'll be able to make a more informed decision.


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-07 Thread GoranSMilovanovic
GoranSMilovanovic added a comment.
@abian @Lydia_Pintscher It would be very difficult to define a rational criterion of how many top frequently used WD items to protect.

But maybe there is a way. Namely, the distribution of item usage, as you can observe, almost certainly follows a power law (Zipf). There are maximum-likelihood estimation methods that can determine both the scaling parameter ('alpha' - it essentially determines whether the distribution has infinite moments) and the "cut-off" ('x-min') parameter. Only beyond a certain "cut-off" value does the given probability density (or mass) function really begin to exhibit power-law behavior; see https://arxiv.org/abs/0706.1062.
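
To make the estimation step concrete, here is a minimal sketch of the scaling-parameter estimate for a fixed cut-off. It uses the approximate maximum-likelihood estimator for discrete data described in the paper linked above, alpha ≈ 1 + n / Σ ln(x_i / (x_min − 1/2)). The function name `estimate_alpha` is hypothetical, the choice of x-min itself (usually made by scanning candidate values and comparing goodness of fit) is left out, and none of this is claimed to be what any particular R or MATLAB implementation does.

```python
import math
from typing import Iterable


def estimate_alpha(usage_counts: Iterable[int], xmin: int) -> float:
    """Approximate MLE of the power-law scaling parameter for a given xmin.

    Only observations at or above xmin enter the estimate. The formula is
    the approximation for discrete data, which is reasonable when xmin is
    not too small (assumption: xmin >= 1).
    """
    if xmin < 1:
        raise ValueError("xmin must be at least 1")
    tail = [x for x in usage_counts if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    # Every term is positive because x >= xmin > xmin - 0.5.
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum
```

An estimate below 3 would indicate infinite variance, and an estimate below 2 an infinite mean as well, which is what "infinite moments" refers to above.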

Now, this is my idea: (1) fit the power-law to our empirical data on WD usage; (2) protect only those observations (i.e. items) that are not found in the rarely used end of the distribution (i.e. protect those whose observed usage frequency is at or above the estimated x-min value). As a consequence, we would need to protect (a) only the most frequently used WD items, and (b) those items would at the same time be a part of the "stable" region of the distribution of WD items usage.
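
Step (2) then reduces to a threshold filter once x-min is known. A small sketch, assuming the usage data is available as a mapping from item ID to number of pages (the function name and the example IDs and counts are made up for illustration):

```python
from typing import Dict, List


def select_items_to_protect(usage: Dict[str, int], xmin: int) -> List[str]:
    """Return item IDs used on at least xmin pages, most used first."""
    selected = [item for item, pages in usage.items() if pages >= xmin]
    selected.sort(key=lambda item: usage[item], reverse=True)
    return selected


# Hypothetical usage counts, not real Wikidata figures.
example_usage = {"Q100": 3, "Q200": 9, "Q300": 120, "Q400": 8}
candidates = select_items_to_protect(example_usage, xmin=9)
print(candidates)
```

Items below the threshold ("Q100" and "Q400" in this toy input) are simply left out of the candidate list.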

Empirical drawback: if the value of x-min turns out to be large, and that is not impossible, we would need to protect a large number of WD items. 
Methodological drawback: well, we have already cut off the tail of the distribution: you are observing only the top 100,000 most frequently used WD items. If we estimate x-min from this sample only, it would be different from the x-min estimated from all WD items. If we choose to go for all items, the estimation procedure would take quite some time to complete its run.

Resources: the statistical estimation procedures that I was referring to are already available in R and MATLAB (and maybe Python), so it would take me only a minimal amount of time to develop a script for this. The estimation procedure in itself is computationally demanding but it's not like our number crunching servers could not deal with it.


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2019-01-07 Thread abian
abian added a comment.
For the record: I've got some data (thanks!) and, based on it, I'm going to run a script to semi-protect the 500 (a quite arbitrary number) most used Items in random order (although several of these Items are already semi-protected), which represents 0.0009% of our total number of Items. Despite what it may look like on the chart, the 500th most used Item still has more than 45,000 uses. This measure isn't a panacea, and several thousand Items with several thousand uses each will remain unprotected, but I do think this can be a reasonable middle ground for now.

F27814726: xy.png

Some Items can enter or leave the top-500 ranking at some point as a result of small changes in Wikipedia templates, so I'll keep this task open while we look for an official, probably lower-priority, monitoring solution. Thanks again!


[Wikidata-bugs] [Maniphest] [Commented On] T210664: List of most used Wikidata entities

2018-11-29 Thread Lydia_Pintscher
Lydia_Pintscher added a comment.
@GoranSMilovanovic that'd be lovely. Can you password-protect it for now? We should make sure this isn't used as an easy way to find out how to create a lot of harm.