Re: [Wikidata] [Wikitech-l] Wikidata vandalism dashboard (for Wikipedians)

2018-01-29 Thread Amir Ladsgroup
It's tracked in: https://github.com/Ladsgroup/Vandalism-dashboard/issues/2

On Sun, Jan 28, 2018 at 11:26 PM Magnus Manske 
wrote:

> Quick note: Looks great, but "Changes in descriptions" is always on, even
> after it is clicked off...
>
> On Sun, Jan 28, 2018 at 5:54 PM Amir Ladsgroup 
> wrote:
>
> > Hello,
> > People often ask me how they can patrol edits that affect their Wikipedia
> > or their language. The proper way to do so is to use the watchlist and
> > recent changes (with the "Show Wikidata edits" option enabled) on the
> > Wikipedias, but these sometimes show too many unrelated changes.
> >
> > It is also worth patrolling edits for languages you know, because
> > descriptions are shown and editable in the Wikipedia app, which makes them
> > vulnerable to vandalism (much of the vandalism in this area goes unnoticed
> > for a while and sometimes gets fixed by another reader, which is
> > suboptimal).
> >
> > So Lucas [1] and I had a pet project to let you see unpatrolled edits
> > related to a language on Wikidata. It has some basic integration with
> > ORES, and if you see a good edit and mark it as patrolled, it disappears
> > from the list. What I usually do is check the page twice a day for
> > Persian, which, given the language's size, is enough.
> >
> > The tool is at https://tools.wmflabs.org/wdvd/index.php, the source code
> > is at https://github.com/Ladsgroup/Vandalism-dashboard, and you can report
> > issues, bugs, and feature requests at
> > https://github.com/Ladsgroup/Vandalism-dashboard/issues
> >
> > Please spread the word; any feedback about this tool is very welcome :)
> >
> > [1]: https://www.wikidata.org/wiki/User:Lucas_Werkmeister_(WMDE)
> > Best
> > ___
> > Wikitech-l mailing list
> > wikitec...@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> ___
> Wikitech-l mailing list
> wikitec...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Wikidata vandalism dashboard (for Wikipedians)

2018-01-28 Thread Amir Ladsgroup
Hello,
People often ask me how they can patrol edits that affect their Wikipedia
or their language. The proper way to do so is to use the watchlist and
recent changes (with the "Show Wikidata edits" option enabled) on the
Wikipedias, but these sometimes show too many unrelated changes.

It is also worth patrolling edits for languages you know, because
descriptions are shown and editable in the Wikipedia app, which makes them
vulnerable to vandalism (much of the vandalism in this area goes unnoticed
for a while and sometimes gets fixed by another reader, which is
suboptimal).

So Lucas [1] and I had a pet project to let you see unpatrolled edits
related to a language on Wikidata. It has some basic integration with ORES,
and if you see a good edit and mark it as patrolled, it disappears from the
list. What I usually do is check the page twice a day for Persian, which,
given the language's size, is enough.
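
For anyone curious about the general approach, here is a minimal sketch
using the public API (this is not the dashboard's actual code; the
autosummary format, the use of the requests library, and the need for
patrol rights for rcshow=!patrolled are my assumptions):

# Sketch: list unpatrolled Wikidata edits whose autosummary targets a given
# language, roughly the kind of filtering the dashboard does for Persian.
# Assumes Wikibase autosummaries like "/* wbsetdescription-add:1|fa */ ...".
import re
import requests

LANG = "fa"
API = "https://www.wikidata.org/w/api.php"

params = {
    "action": "query",
    "list": "recentchanges",
    "rcshow": "!patrolled|!bot",   # needs an account with patrol rights
    "rcprop": "ids|title|comment|user",
    "rclimit": 50,
    "format": "json",
}
changes = requests.get(API, params=params).json()["query"]["recentchanges"]

# Keep only edits whose autosummary touches labels, descriptions or aliases
# in the language we care about.
pattern = re.compile(r"/\* wbset(label|description|aliases)[^:]*:\d+\|%s \*/" % LANG)
for change in changes:
    if pattern.search(change.get("comment", "")):
        print(change["title"], change["revid"], change["comment"])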

The tool is at https://tools.wmflabs.org/wdvd/index.php, the source code is
at https://github.com/Ladsgroup/Vandalism-dashboard, and you can report
issues, bugs, and feature requests at
https://github.com/Ladsgroup/Vandalism-dashboard/issues

Please spread the word; any feedback about this tool is very welcome :)

[1]: https://www.wikidata.org/wiki/User:Lucas_Werkmeister_(WMDE)


Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Wikitech-l] Sitelink removal in Wikidata

2017-04-27 Thread Amir Ladsgroup
Hey Tilman,

Here you are:
http://tools.wmflabs.org/dexbot/tools/changed_descriptions.php?lang=fa&limit=10
That actually helped me revert five highly visible pieces of vandalism on
Wikidata just now. Thanks for pointing it out.

Best

On Thu, Apr 27, 2017 at 2:06 AM Tilman Bayer  wrote:

> This looks great! Would it be possible to make a modified version that
> lists description edits in a particular language (instead of sitelink edits
> for a particular project)? This should be very useful for targeted
> patrolling too; it came up as a suggestion in the Wikidata project chat
> last week.
>
> Technically, it should be a relatively simple change, querying for
> "wbsetdescription" instead of "wbsetsitelink-remove".
>
> (More generally, there is https://phabricator.wikimedia.org/T141866
> "Provide a way to filter Wikidata's recent changes for language-dependent
> content in specific languages", but that looks like a larger project.)
>
> On Tue, Apr 25, 2017 at 8:30 PM, Amir Ladsgroup 
> wrote:
>
>> Hey,
>> One common form of vandalism on Wikidata is removing sitelinks (we already
>> have an abuse filter flagging such edits).
>> One of my friends on Persian Wikipedia (who is not a Wikidata editor and
>> only cares about Persian Wikipedia) asked me to write a tool that lists all
>> Persian Wikipedia sitelink removals. So I wrote something small and fast,
>> but it's usable for any wiki. For example, English Wikipedia:
>> http://tools.wmflabs.org/dexbot/tools/deleted_sitelinks.php?wiki=enwiki
>>
>> It's slow due to the nature of the database query, but once it responds,
>> you might find good things to revert.
>>
>> Since this is most useful for Wikipedia editors who don't want to
>> patrol Wikidata (in that case, this query
>> <https://www.wikidata.org/w/index.php?namespace=&tagfilter=new+editor+removing+sitelink&translations=noaction&title=Special%3ARecentChanges>
>> is the most useful), I'm reaching out to wider audiences. Sorry for spamming.
>>
>> HTH
>> Best
>> ___
>> Wikitech-l mailing list
>> wikitec...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Sitelink removal in Wikidata

2017-04-25 Thread Amir Ladsgroup
Hey,
One common form of vandalism on Wikidata is removing sitelinks (we already
have an abuse filter flagging such edits).
One of my friends on Persian Wikipedia (who is not a Wikidata editor and
only cares about Persian Wikipedia) asked me to write a tool that lists all
Persian Wikipedia sitelink removals. So I wrote something small and fast,
but it's usable for any wiki. For example, English Wikipedia:
http://tools.wmflabs.org/dexbot/tools/deleted_sitelinks.php?wiki=enwiki

It's slow due to the nature of the database query, but once it responds,
you might find good things to revert.
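
As a side note, these edits can also be recognised from their edit
autosummaries; here is a minimal sketch of that matching (an approximation
on my part, not the tool's actual database query, and the summary format
shown is an assumption):

# Sketch: decide whether an edit summary records a sitelink removal for a
# given wiki. Wikibase autosummaries for such edits are assumed to look like
# "/* wbsetsitelink-remove:1|fawiki */ Page title".
import re

def is_sitelink_removal(comment, wiki="fawiki"):
    return re.search(r"/\* wbsetsitelink-remove:\d+\|%s \*/" % re.escape(wiki),
                     comment or "") is not None

print(is_sitelink_removal("/* wbsetsitelink-remove:1|fawiki */ Tehran"))  # True
print(is_sitelink_removal("/* wbsetsitelink-set:1|fawiki */ Tehran"))     # False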

Since this is most useful for Wikipedia editors who don't want to
patrol Wikidata (in that case, this query
<https://www.wikidata.org/w/index.php?namespace=&tagfilter=new+editor+removing+sitelink&translations=noaction&title=Special%3ARecentChanges>
is the most useful), I'm reaching out to wider audiences. Sorry for spamming.

HTH
Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Wiki-research-l] The basis for Wikidata quality

2017-03-22 Thread Amir Ladsgroup
I was mentioned as "the developer of ORES", so let me comment on that. Aaron
Halfaker is the creator of ORES; it has been his work night and day for a
few years now. I've contributed around 20% of the code base, but let's be
clear: ORES is his brainchild. There is also an army of other developers who
have contributed, e.g. He7d3r, Jonas.agx, Aetilley, Danilo, Yuvipanda,
Awight, Kenrick95, NealMCB, and countless translators. The idea that a
single person could develop something like a production machine learning
service alone... yikes.

See:
https://github.com/wiki-ai/revscoring/graphs/contributors (the modeling
library)
https://github.com/wiki-ai/ores/graphs/contributors (the hosting service)
https://github.com/wiki-ai/ores-wmflabs-deploy/graphs/contributors (our
server configuration)
https://github.com/wiki-ai/wikilabels/graphs/contributors (the labeling
system)
https://github.com/wiki-ai/editquality/graphs/contributors (the set of
damage/vandalism detection models)
https://github.com/wikimedia/mediawiki-extensions-ORES/graphs/contributors
(the MediaWiki extension that highlights edits based on ORES predictions)

Also, I fail to see how running a labeling script relates to what ORES is
doing.

Best

On Wed, Mar 22, 2017 at 8:51 PM John Erling Blad  wrote:

> Only using sitelinks as a weak indication of quality seems correct to me,
> as does the idea that some languages are more important than others, and
> some large languages more important than others. I would really like it if
> the reasoning behind the classes and the features could be spelled out.
>
> I have serious issues with the ORES training sets, but that is another
> discussion. ;/ (There are a lot of similar bot edits in the sets, and that
> will train a bot detector, which is not what we need! Grumpf…)
>
> On Wed, Mar 22, 2017 at 3:33 PM, Aaron Halfaker 
> wrote:
>
> Hey wiki-research-l folks,
>
> Gerard didn't actually link you to the quality criteria he takes issue
> with.  See https://www.wikidata.org/wiki/Wikidata:Item_quality  I think
> Gerard's argument basically boils down to Wikidata != Wikipedia, but it's
> unclear how that is relevant to the goal of measuring the quality of
> items.  This is something I've been talking to Lydia about for a long
> time.  It's been great for the few Wikis where we have models deployed in
> ORES[1] (English, French, and Russian Wikipedia).  So we'd like to have the
> same for Wikidata.   As Lydia said, we do all sorts of fascinating things
> with a model like this.  Honestly, I think the criteria is coming together
> quite nicely and we're just starting a pilot labeling campaign to work
> through a set of issues before starting the primary labeling drive.
>
> 1. https://ores.wikimedia.org
>
> -Aaron
>
>
>
> On Wed, Mar 22, 2017 at 6:39 AM, Gerard Meijssen <
> gerard.meijs...@gmail.com> wrote:
>
> Hoi,
> What I have read is that it will be individual items that are graded. That
> is not what helps you determine which items are lacking in something. When
> you want to determine whether something is lacking, you need a relational
> approach. Take an award like this one [1]: it was added to make the item
> for a person [2] more complete. No real importance is given to this award;
> just a few more people were added because they are part of a group that
> gets more attention from me [3]. For yet another award [4], I added all the
> people who received the award because I was told, on someone's expert
> opinion, that they were all notable (in the Wikipedia sense of the word). I
> added several of these people to Wikidata. Arguably, the Wikidata quality
> of the item for the award is great; it has no article associated with it on
> Wikipedia, but that has nothing to do with the quality of the information
> it provides. It is easy and obvious to recognise that one level deeper,
> quality issues arise; the info for several people is meagre at best. You
> cannot deny their relevance though; removing them destroys the quality of
> the award.
>
> The point is that with relations you can describe quality; in the grading
> that is proposed there is nothing really actionable.
>
> When you add links to the mix, these same links have no bearing on the
> quality of the Wikidata item. Why would it? Links only become interesting
> when you compare the statements in Wikidata with the links to other
> articles in the same Wikipedia. This is not what this approach brings.
>
> Really, how will grading items make a difference? How will it help us
> understand that "items relating to railroads are lacking"? It does not.
>
> When you want indicators for quality, here is one: an author (and its
> subclasses) should have a VIAF identifier. An artist with objects in the
> Getty Museum should have a ULAN number. The lack of such information is
> actionable. The number of interwiki links is not, the number of statements
> is not, and even references are not that convincing.
> Thanks,
>   GerardM
>
> [1] https://tools.w

[Wikidata] Wikidata logo

2016-10-27 Thread Amir Ladsgroup
There was a survey on Wikimedia branding (including the branding of its
projects) at Wikimania 2016. I was reading the results and found it
fascinating that the Wikidata logo is the most liked among community members.

Notes:
"The barcode is great"
"It is quite clever, It spells wiki in Morse code" (I didn't know that. I
checked it and it's true. WOW)
"In terms of semantics, It's perfect. The color, font, barcode"
"It is simple and obvious"

Read more:
https://upload.wikimedia.org/wikipedia/commons/0/0f/Wikimedia_Brands_Community_Perceptions_Report_2016.pdf

https://meta.wikimedia.org/wiki/Communications/Wikimedia_brands/Perceptions/2016


Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Fwd: [Wikitech-l] Wikis back to 1.28.0-wmf.18 (was: Re: Upgrade of 1.28.0-wmf.19 to group 1 is on hold)

2016-09-16 Thread Amir Ladsgroup
FYI

-- Forwarded message -
From: Antoine Musso 
Date: Fri, Sep 16, 2016 at 6:00 PM
Subject: [Wikitech-l] Wikis back to 1.28.0-wmf.18 (was: Re: Upgrade of
1.28.0-wmf.19 to group 1 is on hold)
To: 


On 14/09/16 23:05, Antoine Musso wrote:
> 
> For now. I am holding the train.  Will reassess tomorrow and ideally
> push group1 at 19:00 UTC then follow with group2 at 20:00UTC.
..
>
> Up-to-date status:
> https://tools.wmflabs.org/versions/
>
> MW-1.28.0-wmf.19 deployment blockers
> https://phabricator.wikimedia.org/T143328

Hello,

I pushed 1.28.0-wmf.19 on Thursday at 19:10 UTC. It was quickly noticed
that Wikidata was no longer able to dispatch updates to the wikis, which is
tracked in:

https://phabricator.wikimedia.org/T145819

This Friday, I was busy debugging the issue and did not notice a task
about account creation being broken:
  https://phabricator.wikimedia.org/T145839

As soon as I saw that, I reverted to 1.28.0-wmf.18 at 12:50 UTC.

Account creation was impossible from Thursday 19:10 UTC until
Friday 12:50 UTC. Please pass the word around as needed.

My deep apologies for not having found out earlier that the impact was so
significant.

I will write an incident report this afternoon with actionables.

--
Antoine "hashar" Musso


___
Wikitech-l mailing list
wikitec...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] ORES review tool default sensitivity changed now

2016-08-24 Thread Amir Ladsgroup
Hey all,
We just deployed a change that switches the default sensitivity of the ORES
review tool from "hard" to "soft" (meaning recall drops from 0.9 to 0.75,
but the percentage of false positives drops too). You can still change it
back in your preferences (Recent changes tab).

Please come to us with any issues or questions.

Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] "Damaging" and "goodfaith" models for Wikidata are now online

2016-05-01 Thread Amir Ladsgroup
Hello,
TLDR: Vandalism detection model for Wikidata just got much more accurate.

Longer version:
ORES is designed to handle different types of classification. For example,
one classification type under development is "wikiclass", which determines
the type of an edit: whether it adds content, fixes a mistake, etc.

ORES' most mature classification is edit quality: whether an edit is
vandalism or not. We usually have three models. For the "reverted" model,
the training data is obtained automatically: we sample around 20K edits (for
Wikidata it was different) and consider an edit vandalism if it is reverted
within a certain time period after the edit (7 days for Wikidata).
The "damaging" and "goodfaith" models, on the other hand, are more accurate:
we sample about 20K edits, pre-label the edits made by trusted users such as
admins and bots as not harmful to Wikidata/Wikipedia, and then ask users to
label the rest (for Wikidata that was around 4K edits). Since most edits on
Wikidata are made by bots and trusted users, we altered this method a bit
for Wikidata, but the overall process is the same. Because they are based on
human judgement, these models are more accurate and more useful for damage
detection. The ORES extension uses the "damaging" model and not the
"reverted" model, so having the "damaging" model online is a requirement for
deploying the extension.
People label each edit on two questions: whether it is damaging to Wikidata
and whether it was made with good intentions. So we have three cases: 1) an
edit that is harmful to Wikidata but made with good intentions, i.e. an
honest/newbie mistake; 2) an edit that is harmful and made with bad
intentions, i.e. vandalism; 3) an edit that is well-intentioned and
productive, i.e. a "good" edit.
The biggest reason to distinguish between honest mistakes and vandalism is
that using anti-vandalism bots has reduced new-user retention on the wikis
[1]. Future anti-vandalism bots should therefore not revert good-faith
mistakes, but report them for human review.

One of the good things about the Wikidata damage detection labeling process
is that so many people were involved (we had 38 labelers for Wikidata [2]).
Another good thing is that the model's fitness is very high in AI terms [3].
But since the numbers of damaging and non-damaging edits are not the same,
the scores it gives to edits are not intuitive. Let me give you an example:
with our damaging model, if an edit is scored below 80% it's probably not
vandalism. In fact, in a very large sample of human edits we had for the
reverted model, we couldn't find a bad edit with a score lower than 93%;
i.e. if an edit is scored 92% by the reverted model, you can be pretty sure
it's not vandalism. Please reach out to us if you have any questions about
using these scores, or any questions in general ;)

In terms of needed changes, the ScoredRevision gadget is automatically set
to prefer the damaging model, and I just changed my bot in the
#wikidata-vandalism channel to use damaging instead of reverted.

If you want to use these models, check out our docs [4].
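
As a starting point, fetching a score over HTTP looks roughly like this (a
sketch only; the exact URL pattern and the response layout are assumptions
on my side, so double-check them against the docs [4]):

# Sketch: fetch the "damaging" score for one Wikidata revision from ORES.
import json
import requests

rev_id = 123456789  # hypothetical revision ID
url = "https://ores.wmflabs.org/v2/scores/wikidatawiki/damaging/%d/" % rev_id  # assumed URL pattern

response = requests.get(url).json()
print(json.dumps(response, indent=2))

# As noted above, with the "damaging" model an edit scored below roughly 0.8
# is most likely not vandalism, so a patrolling tool would only surface
# revisions whose "true" probability is above such a threshold.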

Sincerely,
Revision scoring team [5]

[1]: Halfaker, A.; Geiger, R. S.; Morgan, J. T.; Riedl, J. (28 December
2012). "The Rise and Decline of an Open Collaboration System: How
Wikipedia's Reaction to Popularity Is Causing Its Decline". *American
Behavioral Scientist* *57* (5): 664–688.
[2]: https://labels.wmflabs.org/campaigns/wikidatawiki/?campaigns=stats
[3]: https://ores.wmflabs.org/scores/wikidatawiki/?model_info
[4]: https://ores.wmflabs.org/v2/
[5]:
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service#Team
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Fight vandalism

2016-02-27 Thread Amir Ladsgroup
It would be; I don't mind. It's just a one- or two-line change in the source
code. It depends on whether people want it or not.

Best

On Sun, Feb 28, 2016 at 12:45 AM Leon Liesener 
wrote:

> With #cvn-wikidata another countervandalism channel already exists.
> Perhaps it would be better to report your results there as well?
>
> Op 27 feb. 2016 om 18:25 heeft Amir Ladsgroup  het
> volgende geschreven:
>
> Hey all,
> I've been working to make vandalism detection easier, and I was able to
> build an AI tool that scores each and every edit for how likely it is to be
> vandalism. It's one step forward but not enough, since using these scores
> is hard. So I'm working on building tools that make these scores more
> usable for you. The long-term goal is to use the ORES extension. It is
> working now on the beta cluster [1] and you can test it at [2] as well.
> Once we are done with Wikidata:Edit labels [3] we are good to deploy this
> as a beta feature on Wikidata (we are waiting for something else too, but
> hopefully it's done very soon).
>
> But I wanted something helpful in the short term too. So this weekend*, I
> registered the #wikidata-vandalism IRC channel on freenode and made a bot
> that, every six minutes, reports unpatrolled edits with high scores made in
> the last six minutes. So if you join that channel you can see the bot add
> edits that need review very soon after they have been made (basically,
> it's live reporting of edits needing review).
>
> The source code is on GitHub and PRs are very welcome :) [4]
>
> [1]:
> https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:RecentChanges
> [2]: https://mw-revscoring.wmflabs.org
> [3]: https://www.wikidata.org/wiki/Wikidata:Edit_labels
> [4]: https://github.com/Ladsgroup/ORES-IRC-Wikidata-bot
>
> * My weekend is Thursday and Friday :)
> Best
>
> ___
>
>
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Fight vandalism

2016-02-27 Thread Amir Ladsgroup
Hey all,
I've been working to make vandalism detection easier, and I was able to
build an AI tool that scores each and every edit for how likely it is to be
vandalism. It's one step forward but not enough, since using these scores is
hard. So I'm working on building tools that make these scores more usable
for you. The long-term goal is to use the ORES extension. It is working now
on the beta cluster [1] and you can test it at [2] as well. Once we are done
with Wikidata:Edit labels [3] we are good to deploy this as a beta feature
on Wikidata (we are waiting for something else too, but hopefully it's done
very soon).

But I wanted something helpful in the short term too. So this weekend*, I
registered the #wikidata-vandalism IRC channel on freenode and made a bot
that, every six minutes, reports unpatrolled edits with high scores made in
the last six minutes. So if you join that channel you can see the bot add
edits that need review very soon after they have been made (basically, it's
live reporting of edits needing review).
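
For illustration, the loop described above looks roughly like this (a sketch
only; the bot's real code is in the repository linked at [4], and the ORES
URL pattern and the patrol-rights requirement for rcshow=!patrolled are
assumptions):

# Sketch: every six minutes, fetch unpatrolled Wikidata edits from the last
# six minutes and look up their ORES scores.
import datetime
import requests

API = "https://www.wikidata.org/w/api.php"
ORES = "https://ores.wmflabs.org/scores/wikidatawiki/reverted/"  # assumed URL pattern

def recent_unpatrolled(minutes=6):
    start = datetime.datetime.utcnow() - datetime.timedelta(minutes=minutes)
    params = {
        "action": "query", "list": "recentchanges", "format": "json",
        "rcshow": "!patrolled", "rcprop": "ids|title",
        "rcend": start.strftime("%Y%m%d%H%M%S"), "rclimit": 500,
    }
    return requests.get(API, params=params).json()["query"]["recentchanges"]

for change in recent_unpatrolled():
    score = requests.get(ORES + str(change["revid"]) + "/").json()
    # A real bot would report to IRC only when the score is high enough.
    print(change["title"], change["revid"], score)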

The source code is on GitHub and PRs are very welcome :) [4]

[1]:
https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:RecentChanges
[2]: https://mw-revscoring.wmflabs.org
[3]: https://www.wikidata.org/wiki/Wikidata:Edit_labels
[4]: https://github.com/Ladsgroup/ORES-IRC-Wikidata-bot

* My weekend is Thursday and Friday :)
Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Fwd: ORES extension soon be deployed, help us test it

2016-02-19 Thread Amir Ladsgroup
We are also in the process of deploying this extension for Wikidata in the
near future, so your help would be appreciated.

-- Forwarded message -
From: Amir Ladsgroup 
Date: Sat, Feb 20, 2016 at 2:05 AM
Subject: ORES extension soon be deployed, help us test it
To: wikitech-l , 


Hey all,
TLDR: The ORES extension [1], which integrates the ORES service [2] with
Wikipedia to make fighting vandalism easier and more efficient, is in the
process of being deployed. You can test it at
https://mw-revscoring.wmflabs.org (enable it in your preferences first).

You probably know ORES. It's an API service that gives the probability of an
edit being vandalism; it also does other AI-related things like guessing the
quality of Wikipedia articles. We have a nice post on the Wikimedia Blog [3]
and the media paid some attention to it [4]. Thanks to Aaron Halfaker and
others [5] for their work in building this service. There are several tools
using ORES to highlight possible vandalism: Huggle, gadgets like
ScoredRevisions, etc. But an extension does this job much more efficiently.

The extension, which is being developed by Adam Wight, Kunal Mehta and me,
highlights unpatrolled edits in recent changes, watchlists, related changes
and, in the future, user contributions, if the ORES score of those edits
passes a certain threshold. The GUI design is by May Galloway. The ORES API
(ores.wmflabs.org) only gives you a score between 0 and 1: zero means it's
not vandalism at all and one means it's vandalism for sure. You can test its
simple GUI at https://ores.wmflabs.org/ui/. It's possible to change the
threshold in your preferences in the recent changes tab (you get options
instead of numbers because we thought numbers are not very intuitive).
We also enabled it on a test wiki so you can try it:
https://mw-revscoring.wmflabs.org. You need to make an account (use a dummy
password) and then enable it in the beta features tab. Note that building an
AI tool to detect vandalism on a test wiki sounds a little bit silly ;) so
we set up a dummy model in which the probability of an edit being vandalism
is the last two digits of the diff ID reversed (e.g. diff id 12345 = score
54%). On the more technical side, we store these scores in the
ores_classification table so we can do a lot more analysis with them once
the extension is deployed: fun use cases such as the average score of a
certain page, of a user's contributions, or of the members of a category,
etc.
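
In code, the dummy scoring rule mentioned above is simply (a sketch, not the
extension's actual implementation):

# Dummy model: the "vandalism probability" of a diff is its last two digits
# reversed, e.g. diff id 12345 -> "45" -> 54%.
def dummy_score(diff_id):
    last_two = "%02d" % (diff_id % 100)
    return int(last_two[::-1]) / 100.0

print(dummy_score(12345))  # 0.54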

We passed the security review and we have consensus to enable it on Persian
Wikipedia; we are only blocked on ORES moving from Labs to production
(T106867 [6]). The next wiki is Wikidata: we are good to go once the
community finishes labeling edits so we can build the "damaging" model. We
can enable it on Portuguese and Turkish Wikipedia after March, because s2
and s3 have database storage issues right now. For other wikis, you need to
check whether ORES supports the wiki and whether the community has finished
labeling edits for ORES (check out the table at [2]).
If you want to report bugs or request features, you can do so here [7].

[1]: https://www.mediawiki.org/wiki/Extension:ORES
[2]: https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service
[3]:
https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/
[4]:
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Media
[5]:
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service#Team
[6]: https://phabricator.wikimedia.org/T106867
[7]: https://phabricator.wikimedia.org/tag/mediawiki-extensions-ores/

Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Teaching machines to make your life easier – quality work on Wikidata

2016-01-02 Thread Amir Ladsgroup
Hey,
New blog post

and some analysis is out. It turns out if we use ORES properly it can
reduce the workload of reviewing human edits by 99%. If you help and label
edits in Edit labels ,
We can both efficiently revert vandalism and route good-faith newcomers to
support and training.

Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

2015-12-16 Thread Amir Ladsgroup
Obviously the data itself can't be licensed, but graphs and other parts can
be copyrighted. I'm just trying to make reusability easier.

Best

On Wed, Dec 16, 2015 at 4:14 PM Gerard Meijssen 
wrote:

> Hoi,
> What is achieved in this way and, on what basis can you license the output
> of a tool?
> Thanks,
> GerardM
>
> On 16 December 2015 at 12:58, Amir Ladsgroup  wrote:
>
>> Content created by this tools is licensed under CC-BY v4.0. I made it
>> explicit now :)
>>
>> Best
>>
>> On Wed, Dec 16, 2015 at 3:11 PM Jane Darnell  wrote:
>>
>>> Amir,
>>> Thanks for your work! I like this one showing how our
>>> Sum-of-all-Paintings project is doing compared to sculptures (which have
>>> many copyright issues, but you could still put the data on Wikidata)
>>> http://tools.wmflabs.org/wd-analyst/index.php?p=p31&q=Q3305213%7CQ860861
>>>
>>> Jane
>>>
>>> On Wed, Dec 16, 2015 at 12:23 PM, Amir Ladsgroup 
>>> wrote:
>>>
>>>> Hey,
>>>> Thanks for your feedback. That's exactly what I'm looking for.
>>>>
>>>> On Mon, Dec 14, 2015 at 5:29 PM Paul Houle  wrote:
>>>>
>>>>> It's a step in the right direction,  but it took a very long time to
>>>>> load on my computer.
>>>>>
>>>>  It's maybe related to labs recent issues. Now I get reasonable time:
>>>> http://tools.pingdom.com/fpt/#!/eq1i3s/http://tools.wmflabs.org/wd-analyst/index.php
>>>>
>>>>>
>>>>> After the initial load,  it was pretty peppy,  then I ran the default
>>>>> example that is grayed in but not active (I had to retype it)
>>>>>
>>>>
>>>> I made some modifications that might help;
>>>>
>>>>> Then I get the page that says "results are ready" and how cool they
>>>>> are,  then it takes me a while to figure out what I am looking at and
>>>>> finally realize it is a comparison of data quality metrics (which I think
>>>>> are all fact counts) between all of the P31 predicates and the Q5.
>>>>>
>>>> I made some changes so you can see things easier. I appreciate if you
>>>> suggest some words I put in the description;
>>>>
>>>>
>>>>> The use of the graphic on the first row complicated this for me.
>>>>>
>>>>> Please sugest something I write there for people :);
>>>>
>>>>> There are a lot of broken links on this page too such as
>>>>>
>>>>> http://tools.wmflabs.org/wd-analyst/sitelink.php
>>>>> https://www.wikidata.org/wiki/P31
>>>>>
>>>>
>>>> The property broken should be fixed by now and sitelink is broken
>>>> because It's not there yet. I'll make it very soon;
>>>>
>>>>>
>>>>>
>>>>> and of course no merged in documentation about what P31 and Q5 are.
>>>>> Opaque identifiers are necessary for your project,  but
>>>>>
>>>>> Also some way to find the P's and Q's hooked up to this would be most
>>>>> welcome.
>>>>>
>>>>> Done, Now we have label for everything;
>>>>
>>>>> It's a great start and is completely in the right direction but it
>>>>> could take many sprints of improvement.
>>>>>
>>>>> On Wed, Dec 9, 2015 at 4:36 AM, Gerard Meijssen <
>>>>> gerard.meijs...@gmail.com> wrote:
>>>>>
>>>>>> Hoi,
>>>>>> What would be nice is to have an option to understand progress from
>>>>>> one dump to the next like you can with the Statistics by Magnus. Magnus
>>>>>> also has data on sources but this is more global.
>>>>>> Thanks,
>>>>>>  GerardM
>>>>>>
>>>>>> On 8 December 2015 at 21:41, Markus Krötzsch <
>>>>>> mar...@semantic-mediawiki.org> wrote:
>>>>>>
>>>>>>> Hi Amir,
>>>>>>>
>>>>>>> Very nice, thanks! I like the general approach of having a
>>>>>>> stand-alone tool for analysing the data, and maybe pointing you to 
>>>>>>> issues.
>>>>>>> Like a dashboard for Wikidata editors.
>>>>>>>
>>>>>>> What backend technology are 

Re: [Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

2015-12-16 Thread Amir Ladsgroup
Content created by this tool is licensed under CC-BY v4.0. I made that
explicit now :)

Best

On Wed, Dec 16, 2015 at 3:11 PM Jane Darnell  wrote:

> Amir,
> Thanks for your work! I like this one showing how our Sum-of-all-Paintings
> project is doing compared to sculptures (which have many copyright issues,
> but you could still put the data on Wikidata)
> http://tools.wmflabs.org/wd-analyst/index.php?p=p31&q=Q3305213%7CQ860861
>
> Jane
>
> On Wed, Dec 16, 2015 at 12:23 PM, Amir Ladsgroup 
> wrote:
>
>> Hey,
>> Thanks for your feedback. That's exactly what I'm looking for.
>>
>> On Mon, Dec 14, 2015 at 5:29 PM Paul Houle  wrote:
>>
>>> It's a step in the right direction,  but it took a very long time to
>>> load on my computer.
>>>
>>  It's maybe related to labs recent issues. Now I get reasonable time:
>> http://tools.pingdom.com/fpt/#!/eq1i3s/http://tools.wmflabs.org/wd-analyst/index.php
>>
>>>
>>> After the initial load,  it was pretty peppy,  then I ran the default
>>> example that is grayed in but not active (I had to retype it)
>>>
>>
>> I made some modifications that might help;
>>
>>> Then I get the page that says "results are ready" and how cool they are,
>>>  then it takes me a while to figure out what I am looking at and finally
>>> realize it is a comparison of data quality metrics (which I think are all
>>> fact counts) between all of the P31 predicates and the Q5.
>>>
>> I made some changes so you can see things easier. I appreciate if you
>> suggest some words I put in the description;
>>
>>
>>> The use of the graphic on the first row complicated this for me.
>>>
>>> Please sugest something I write there for people :);
>>
>>> There are a lot of broken links on this page too such as
>>>
>>> http://tools.wmflabs.org/wd-analyst/sitelink.php
>>> https://www.wikidata.org/wiki/P31
>>>
>>
>> The property broken should be fixed by now and sitelink is broken because
>> It's not there yet. I'll make it very soon;
>>
>>>
>>>
>>> and of course no merged in documentation about what P31 and Q5 are.
>>> Opaque identifiers are necessary for your project,  but
>>>
>>> Also some way to find the P's and Q's hooked up to this would be most
>>> welcome.
>>>
>>> Done, Now we have label for everything;
>>
>>> It's a great start and is completely in the right direction but it could
>>> take many sprints of improvement.
>>>
>>> On Wed, Dec 9, 2015 at 4:36 AM, Gerard Meijssen <
>>> gerard.meijs...@gmail.com> wrote:
>>>
>>>> Hoi,
>>>> What would be nice is to have an option to understand progress from one
>>>> dump to the next like you can with the Statistics by Magnus. Magnus also
>>>> has data on sources but this is more global.
>>>> Thanks,
>>>>  GerardM
>>>>
>>>> On 8 December 2015 at 21:41, Markus Krötzsch <
>>>> mar...@semantic-mediawiki.org> wrote:
>>>>
>>>>> Hi Amir,
>>>>>
>>>>> Very nice, thanks! I like the general approach of having a stand-alone
>>>>> tool for analysing the data, and maybe pointing you to issues. Like a
>>>>> dashboard for Wikidata editors.
>>>>>
>>>>> What backend technology are you using to produce these results? Is
>>>>> this live data or dumped data? One could also get those numbers from the
>>>>> SPARQL endpoint, but performance might be problematic (since you compute
>>>>> averages over all items; a custom approach would of course be much faster
>>>>> but then you have the data update problem).
>>>>>
>>>>> An obvious feature request would be to display entity ids as links to
>>>>> the appropriate page, and maybe with their labels (in a language of your
>>>>> choice).
>>>>>
>>>>> But overall very nice.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>> On 08.12.2015 18:48, Amir Ladsgroup wrote:
>>>>>
>>>>>> Hey,
>>>>>> There has been several discussion regarding quality of information in
>>>>>> Wikidata. I wanted to work on quality of wikidata but we don't have
>>>>>> any
>&

Re: [Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

2015-12-16 Thread Amir Ladsgroup
Hey,
Thanks for your feedback. That's exactly what I'm looking for.

On Mon, Dec 14, 2015 at 5:29 PM Paul Houle  wrote:

> It's a step in the right direction,  but it took a very long time to load
> on my computer.
>
It's maybe related to the recent Labs issues. Now I get a reasonable time:
http://tools.pingdom.com/fpt/#!/eq1i3s/http://tools.wmflabs.org/wd-analyst/index.php

>
> After the initial load,  it was pretty peppy,  then I ran the default
> example that is grayed in but not active (I had to retype it)
>

I made some modifications that might help;

> Then I get the page that says "results are ready" and how cool they are,
>  then it takes me a while to figure out what I am looking at and finally
> realize it is a comparison of data quality metrics (which I think are all
> fact counts) between all of the P31 predicates and the Q5.
>
I made some changes so you can see things more easily. I'd appreciate it if
you could suggest some wording I can put in the description.


> The use of the graphic on the first row complicated this for me.
>
Please suggest something I can write there for people :)

> There are a lot of broken links on this page too such as
>
> http://tools.wmflabs.org/wd-analyst/sitelink.php
> https://www.wikidata.org/wiki/P31
>

The broken property links should be fixed by now, and the sitelink page is
broken because it's not there yet. I'll make it very soon.

>
>
> and of course no merged in documentation about what P31 and Q5 are.
> Opaque identifiers are necessary for your project,  but
>
> Also some way to find the P's and Q's hooked up to this would be most
> welcome.
>
Done, now we have labels for everything.

> It's a great start and is completely in the right direction but it could
> take many sprints of improvement.
>
> On Wed, Dec 9, 2015 at 4:36 AM, Gerard Meijssen  > wrote:
>
>> Hoi,
>> What would be nice is to have an option to understand progress from one
>> dump to the next like you can with the Statistics by Magnus. Magnus also
>> has data on sources but this is more global.
>> Thanks,
>>  GerardM
>>
>> On 8 December 2015 at 21:41, Markus Krötzsch <
>> mar...@semantic-mediawiki.org> wrote:
>>
>>> Hi Amir,
>>>
>>> Very nice, thanks! I like the general approach of having a stand-alone
>>> tool for analysing the data, and maybe pointing you to issues. Like a
>>> dashboard for Wikidata editors.
>>>
>>> What backend technology are you using to produce these results? Is this
>>> live data or dumped data? One could also get those numbers from the SPARQL
>>> endpoint, but performance might be problematic (since you compute averages
>>> over all items; a custom approach would of course be much faster but then
>>> you have the data update problem).
>>>
>>> An obvious feature request would be to display entity ids as links to
>>> the appropriate page, and maybe with their labels (in a language of your
>>> choice).
>>>
>>> But overall very nice.
>>>
>>> Regards,
>>>
>>> Markus
>>>
>>>
>>> On 08.12.2015 18:48, Amir Ladsgroup wrote:
>>>
>>>> Hey,
>>>> There has been several discussion regarding quality of information in
>>>> Wikidata. I wanted to work on quality of wikidata but we don't have any
>>>> source of good information to see where we are ahead and where we are
>>>> behind. So I thought the best thing I can do is to make something to
>>>> show people how exactly sourced our data is with details. So here we
>>>> have *http://tools.wmflabs.org/wd-analyst/index.php*
>>>>
>>>> You can give only a property (let's say P31) and it gives you the four
>>>> most used values + analyze of sources and quality in overall (check this
>>>> out <http://tools.wmflabs.org/wd-analyst/index.php?p=P31>)
>>>>   and then you can see about ~33% of them are sources which 29.1% of
>>>> them are based on Wikipedia.
>>>> You can give a property and multiple values you want. Let's say you want
>>>> to compare P27:Q183 (Country of citizenship: Germany) and P27:Q30 (US)
>>>> Check this out
>>>> <http://tools.wmflabs.org/wd-analyst/index.php?p=P27&q=Q30|Q183>. And
>>>> you can see US biographies are more abundant (300K over 200K) but German
>>>> biographies are more descriptive (3.8 description per item over 3.2
>>>> description over item)
>>>>
>>>> One important note: Compare P31:Q5 (a trivial statement) 46% of t

Re: [Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

2015-12-12 Thread Amir Ladsgroup
Hey,
I made some significant changes based on your feedback:

* Per Nemo_bis's suggestion, I added reference-based analysis. Here's
<http://tools.wmflabs.org/wd-analyst/ref.php?p=P143&q=Q328|Q11920&pp=P31>
an example.
* I added a limit parameter with which you can get more results if you want
(both for reference-based and property-based analysis), for example:
http://tools.wmflabs.org/wd-analyst/index.php?p=P31&q=&limit=50 (the maximum
accepted value is 50).
* Per André's suggestion, I added a column to the database and the results
that gives you the percentage of unsourced statements. Obviously it doesn't
apply to reference-based analysis. For example,
https://tools.wmflabs.org/wd-analyst/index.php?p=P1082&q= shows that only 2%
of population statements are unsourced.

Regarding Gerard's suggestion: it's definitely a good idea, but the problem
is that it's technically hard, because every week it makes the database
twice as big. We could store only a limited number of dumps (e.g. the last
three weeks) or apply this to a limited number of value-pair properties. I'm
looking into which one is better.

Best


On Thu, Dec 10, 2015 at 12:13 AM André Costa 
wrote:

> Nice tool!
>
> To understand the statistics better.
> If a claim has two sources, one wikipedia and one other, how does that
> show up in the statistics?
>
> The reason I'm wondering is because I would normally care if a claim is
> sourced or not (but not by how many sources) and whether it is sourced by
> only Wikipedias or anything else.
>
> E.g.
> 1) a statement with 10 claims each sourced is "better" than one with 10
> claims where one claim has 10 sources.
> 2) a statement with a wiki source + another source is "better" than one
> with just a wiki source, and just as "good" as one without the wiki source.
>
> Also is wiki ref/source Wikipedia only or any Wikimedia project? Whilst
> (last I checked) the others were only 70,000 refs compared to the 21
> million from Wikipedia they might be significant for certain domains and
> are just as "bad".
>
> Cheers,
> André
> On 9 Dec 2015 10:37, "Gerard Meijssen"  wrote:
>
>> Hoi,
>> What would be nice is to have an option to understand progress from one
>> dump to the next like you can with the Statistics by Magnus. Magnus also
>> has data on sources but this is more global.
>> Thanks,
>>  GerardM
>>
>> On 8 December 2015 at 21:41, Markus Krötzsch <
>> mar...@semantic-mediawiki.org> wrote:
>>
>>> Hi Amir,
>>>
>>> Very nice, thanks! I like the general approach of having a stand-alone
>>> tool for analysing the data, and maybe pointing you to issues. Like a
>>> dashboard for Wikidata editors.
>>>
>>> What backend technology are you using to produce these results? Is this
>>> live data or dumped data? One could also get those numbers from the SPARQL
>>> endpoint, but performance might be problematic (since you compute averages
>>> over all items; a custom approach would of course be much faster but then
>>> you have the data update problem).
>>>
>>> An obvious feature request would be to display entity ids as links to
>>> the appropriate page, and maybe with their labels (in a language of your
>>> choice).
>>>
>>> But overall very nice.
>>>
>>> Regards,
>>>
>>> Markus
>>>
>>>
>>> On 08.12.2015 18:48, Amir Ladsgroup wrote:
>>>
>>>> Hey,
>>>> There has been several discussion regarding quality of information in
>>>> Wikidata. I wanted to work on quality of wikidata but we don't have any
>>>> source of good information to see where we are ahead and where we are
>>>> behind. So I thought the best thing I can do is to make something to
>>>> show people how exactly sourced our data is with details. So here we
>>>> have *http://tools.wmflabs.org/wd-analyst/index.php*
>>>>
>>>> You can give only a property (let's say P31) and it gives you the four
>>>> most used values + analyze of sources and quality in overall (check this
>>>> out <http://tools.wmflabs.org/wd-analyst/index.php?p=P31>)
>>>>   and then you can see about ~33% of them are sources which 29.1% of
>>>> them are based on Wikipedia.
>>>> You can give a property and multiple values you want. Let's say you want
>>>> to compare P27:Q183 (Country of citizenship: Germany) and P27:Q30 (US)
>>>> Check this out
>>>> <http://tools.wmflabs.org/wd-analyst/index.php?p=P27&q=Q30|Q183>. And
>>>> y

Re: [Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

2015-12-08 Thread Amir Ladsgroup
I also published the source code (it's based on Python and PHP); PRs are
welcome:
https://github.com/Ladsgroup/wd-analyst

On Wed, Dec 9, 2015 at 7:20 AM Amir Ladsgroup  wrote:

> Hey Markus,
>
> On Wed, Dec 9, 2015 at 12:12 AM Markus Krötzsch <
> mar...@semantic-mediawiki.org> wrote:
>
>> Hi Amir,
>>
>> Very nice, thanks! I like the general approach of having a stand-alone
>> tool for analysing the data, and maybe pointing you to issues. Like a
>> dashboard for Wikidata editors.
>>
>> What backend technology are you using to produce these results? Is this
>> live data or dumped data? One could also get those numbers from the
>> SPARQL endpoint, but performance might be problematic (since you compute
>> averages over all items; a custom approach would of course be much
>> faster but then you have the data update problem).
>>
> I build a database based on weekly JSON dumps. we would have some delay in
> the data but computationally it's fast. Using Wikidata database directly
> makes performance so poor that it becomes a good attack point.
>
>
>> An obvious feature request would be to display entity ids as links to
>> the appropriate page, and maybe with their labels (in a language of your
>> choice).
>>
>> Done. :)
>
>> But overall very nice.
>>
>> Regards,
>>
>> Markus
>>
>>
>> On 08.12.2015 18:48, Amir Ladsgroup wrote:
>> > Hey,
>> > There has been several discussion regarding quality of information in
>> > Wikidata. I wanted to work on quality of wikidata but we don't have any
>> > source of good information to see where we are ahead and where we are
>> > behind. So I thought the best thing I can do is to make something to
>> > show people how exactly sourced our data is with details. So here we
>> > have *http://tools.wmflabs.org/wd-analyst/index.php*
>> >
>> > You can give only a property (let's say P31) and it gives you the four
>> > most used values + analyze of sources and quality in overall (check this
>> > out <http://tools.wmflabs.org/wd-analyst/index.php?p=P31>)
>> >   and then you can see about ~33% of them are sources which 29.1% of
>> > them are based on Wikipedia.
>> > You can give a property and multiple values you want. Let's say you want
>> > to compare P27:Q183 (Country of citizenship: Germany) and P27:Q30 (US)
>> > Check this out
>> > <http://tools.wmflabs.org/wd-analyst/index.php?p=P27&q=Q30|Q183>. And
>> > you can see US biographies are more abundant (300K over 200K) but German
>> > biographies are more descriptive (3.8 description per item over 3.2
>> > description over item)
>> >
>> > One important note: Compare P31:Q5 (a trivial statement) 46% of them are
>> > not sourced at all and 49% of them are based on Wikipedia **but* *get
>> > this statistics for population properties (P1082
>> > <http://tools.wmflabs.org/wd-analyst/index.php?p=P1082>) It's not a
>> > trivial statement and we need to be careful about them. It turns out
>> > there are slightly more than one reference per statement and only 4% of
>> > them are based on Wikipedia. So we can relax and enjoy these
>> > highly-sourced data.
>> >
>> > Requests:
>> >
>> >   * Please tell me whether do you want this tool at all
>> >   * Please suggest more ways to analyze and catch unsourced materials
>> >
>> > Future plan (if you agree to keep using this tool):
>> >
>> >   * Support more datatypes (e.g. date of birth based on year,
>> coordinates)
>> >   * Sitelink-based and reference-based analysis (to check how much of
>> > articles of, let's say, Chinese Wikipedia are unsourced)
>> >
>> >   * Free-style analysis: There is a database for this tool that can be
>> > used for way more applications. You can get the most unsourced
>> > statements of P31 and then you can go to fix them. I'm trying to
>> > build a playground for this kind of tasks)
>> >
>> > I hope you like this and rock on!
>> > <http://tools.wmflabs.org/wd-analyst/index.php?p=P136&q=Q11399>
>> > Best
>> >
>> >
>> > ___
>> > Wikidata mailing list
>> > Wikidata@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wikidata
>> >
>>
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

2015-12-08 Thread Amir Ladsgroup
Hey Markus,

On Wed, Dec 9, 2015 at 12:12 AM Markus Krötzsch <
mar...@semantic-mediawiki.org> wrote:

> Hi Amir,
>
> Very nice, thanks! I like the general approach of having a stand-alone
> tool for analysing the data, and maybe pointing you to issues. Like a
> dashboard for Wikidata editors.
>
> What backend technology are you using to produce these results? Is this
> live data or dumped data? One could also get those numbers from the
> SPARQL endpoint, but performance might be problematic (since you compute
> averages over all items; a custom approach would of course be much
> faster but then you have the data update problem).
>
I build a database from the weekly JSON dumps. We have some delay in the
data, but computationally it's fast. Querying the Wikidata database directly
makes performance so poor that it would become a good attack point.
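
For illustration, a pass over the dump of the kind that could feed such a
database might look like this (a sketch only, not the tool's actual code;
the dump file name and the use of P143, "imported from Wikimedia project",
as the "based on Wikipedia" marker are assumptions):

# Sketch: count, for one property, how many statements exist, how many have
# references, and how many have at least one reference that points to a
# Wikimedia project (P143).
import gzip
import json

PROP = "P31"
total = sourced = wiki_sourced = 0

with gzip.open("wikidata-latest-all.json.gz", "rt") as dump:  # hypothetical file name
    for line in dump:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        entity = json.loads(line)
        for statement in entity.get("claims", {}).get(PROP, []):
            total += 1
            refs = statement.get("references", [])
            if refs:
                sourced += 1
            if any("P143" in ref.get("snaks", {}) for ref in refs):
                wiki_sourced += 1

print(PROP, total, sourced, wiki_sourced)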


> An obvious feature request would be to display entity ids as links to
> the appropriate page, and maybe with their labels (in a language of your
> choice).
>
Done. :)

> But overall very nice.
>
> Regards,
>
> Markus
>
>
> On 08.12.2015 18:48, Amir Ladsgroup wrote:
> > Hey,
> > There has been several discussion regarding quality of information in
> > Wikidata. I wanted to work on quality of wikidata but we don't have any
> > source of good information to see where we are ahead and where we are
> > behind. So I thought the best thing I can do is to make something to
> > show people how exactly sourced our data is with details. So here we
> > have *http://tools.wmflabs.org/wd-analyst/index.php*
> >
> > You can give only a property (let's say P31) and it gives you the four
> > most used values + analyze of sources and quality in overall (check this
> > out <http://tools.wmflabs.org/wd-analyst/index.php?p=P31>)
> >   and then you can see about ~33% of them are sources which 29.1% of
> > them are based on Wikipedia.
> > You can give a property and multiple values you want. Let's say you want
> > to compare P27:Q183 (Country of citizenship: Germany) and P27:Q30 (US)
> > Check this out
> > <http://tools.wmflabs.org/wd-analyst/index.php?p=P27&q=Q30|Q183>. And
> > you can see US biographies are more abundant (300K over 200K) but German
> > biographies are more descriptive (3.8 description per item over 3.2
> > description over item)
> >
> > One important note: Compare P31:Q5 (a trivial statement) 46% of them are
> > not sourced at all and 49% of them are based on Wikipedia **but* *get
> > this statistics for population properties (P1082
> > <http://tools.wmflabs.org/wd-analyst/index.php?p=P1082>) It's not a
> > trivial statement and we need to be careful about them. It turns out
> > there are slightly more than one reference per statement and only 4% of
> > them are based on Wikipedia. So we can relax and enjoy these
> > highly-sourced data.
> >
> > Requests:
> >
> >   * Please tell me whether do you want this tool at all
> >   * Please suggest more ways to analyze and catch unsourced materials
> >
> > Future plan (if you agree to keep using this tool):
> >
> >   * Support more datatypes (e.g. date of birth based on year,
> coordinates)
> >   * Sitelink-based and reference-based analysis (to check how much of
> > articles of, let's say, Chinese Wikipedia are unsourced)
> >
> >   * Free-style analysis: There is a database for this tool that can be
> > used for way more applications. You can get the most unsourced
> > statements of P31 and then you can go to fix them. I'm trying to
> > build a playground for this kind of tasks)
> >
> > I hope you like this and rock on!
> > <http://tools.wmflabs.org/wd-analyst/index.php?p=P136&q=Q11399>
> > Best
> >
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
> >
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

2015-12-08 Thread Amir Ladsgroup
Hey Jane,
Yes. Exactly :)

Best

On Tue, Dec 8, 2015 at 9:37 PM Jane Darnell  wrote:

> Very useful, Amir, thanks! I just ran it for occupation=painter
>  (p=P106&q=Q1028181)
> Am I correct in my interpretation that in general painters have fewer
> claims than the entire population of items with the property occupation?
>
> On Tue, Dec 8, 2015 at 6:48 PM, Amir Ladsgroup 
> wrote:
>
>> Hey,
>> There has been several discussion regarding quality of information in
>> Wikidata. I wanted to work on quality of wikidata but we don't have any
>> source of good information to see where we are ahead and where we are
>> behind. So I thought the best thing I can do is to make something to show
>> people how exactly sourced our data is with details. So here we have 
>> *http://tools.wmflabs.org/wd-analyst/index.php
>> <http://tools.wmflabs.org/wd-analyst/index.php>*
>>
>> You can give only a property (let's say P31) and it gives you the four
>> most used values + analyze of sources and quality in overall (check this
>> out <http://tools.wmflabs.org/wd-analyst/index.php?p=P31>)
>>  and then you can see about ~33% of them are sources which 29.1% of them
>> are based on Wikipedia.
>> You can give a property and multiple values you want. Let's say you want
>> to compare P27:Q183 (Country of citizenship: Germany) and P27:Q30 (US)
>> Check this out
>> <http://tools.wmflabs.org/wd-analyst/index.php?p=P27&q=Q30%7CQ183>. And
>> you can see US biographies are more abundant (300K over 200K) but German
>> biographies are more descriptive (3.8 description per item over 3.2
>> description over item)
>>
>> One important note: Compare P31:Q5 (a trivial statement) 46% of them are
>> not sourced at all and 49% of them are based on Wikipedia **but* *get
>> this statistics for population properties (P1082
>> <http://tools.wmflabs.org/wd-analyst/index.php?p=P1082>) It's not a
>> trivial statement and we need to be careful about them. It turns out there
>> are slightly more than one reference per statement and only 4% of them are
>> based on Wikipedia. So we can relax and enjoy these highly-sourced data.
>>
>> Requests:
>>
>>- Please tell me whether do you want this tool at all
>>- Please suggest more ways to analyze and catch unsourced materials
>>
>> Future plan (if you agree to keep using this tool):
>>
>>- Support more datatypes (e.g. date of birth based on year,
>>coordinates)
>>- Sitelink-based and reference-based analysis (to check how much of
>>articles of, let's say, Chinese Wikipedia are unsourced)
>>
>>
>>- Free-style analysis: There is a database for this tool that can be
>>used for way more applications. You can get the most unsourced statements
>>of P31 and then you can go to fix them. I'm trying to build a playground
>>for this kind of tasks)
>>
>> I hope you like this and rock on!
>> <http://tools.wmflabs.org/wd-analyst/index.php?p=P136&q=Q11399>
>> Best
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

2015-12-08 Thread Amir Ladsgroup
Hey,
There have been several discussions regarding the quality of information in
Wikidata. I wanted to work on the quality of Wikidata, but we don't have any
good source of information to see where we are ahead and where we are
behind. So I thought the best thing I could do is make something that shows
people, in detail, how well sourced our data is. So here we have
http://tools.wmflabs.org/wd-analyst/index.php

You can give only a property (let's say P31) and it gives you the four most
used values + analyze of sources and quality in overall (check this out
)
 and then you can see about ~33% of them are sources which 29.1% of them
are based on Wikipedia.
You can give a property and multiple values you want. Let's say you want to
compare P27:Q183 (Country of citizenship: Germany) and P27:Q30 (US)
Check this out
. And you
can see US biographies are more abundant (300K over 200K) but German
biographies are more descriptive (3.8 description per item over 3.2
description over item)

One important note: for P31:Q5 (a trivial statement), 46% of the statements
are not sourced at all and 49% of them are based on Wikipedia, *but* get the
same statistics for the population property (P1082
<http://tools.wmflabs.org/wd-analyst/index.php?p=P1082>). It's not a trivial
statement and we need to be careful about it. It turns out there is slightly
more than one reference per statement and only 4% of them are based on
Wikipedia, so we can relax and enjoy this highly-sourced data.
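
For anyone who wants to open these comparison pages programmatically, here is
a minimal sketch. It only assumes the p/q URL parameters visible in the links
above (q values are pipe-separated); the tool itself returns an HTML page, not
a JSON API.

from urllib.parse import urlencode

BASE = "http://tools.wmflabs.org/wd-analyst/index.php"

def analyst_url(prop, values=None):
    # Build a dashboard URL; "|" between values is encoded as %7C by urlencode.
    params = {"p": prop}
    if values:
        params["q"] = "|".join(values)
    return BASE + "?" + urlencode(params)

print(analyst_url("P31"))                   # four most-used values of P31
print(analyst_url("P27", ["Q30", "Q183"]))  # compare US and German citizenship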

Requests:

   - Please tell me whether you want this tool at all
   - Please suggest more ways to analyze and catch unsourced materials

Future plan (if you agree to keep using this tool):

   - Support more datatypes (e.g. date of birth based on year, coordinates)
   - Sitelink-based and reference-based analysis (to check how many of the
   articles of, let's say, Chinese Wikipedia are unsourced)


   - Free-style analysis: there is a database behind this tool that can be
   used for many more applications. You can get the most unsourced statements
   of P31 and then go and fix them. I'm trying to build a playground
   for these kinds of tasks.

I hope you like this and rock on!

Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Help needed to improve anti-vandalism tools

2015-12-06 Thread Amir Ladsgroup
Hello, you may know ORES. We use ORES to build anti-vandalism tools (learn
more). Based on automatic revert detection we were able to build an MVP, and
we have some high-quality classifiers online that you can use (WD:ORES).
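
If you prefer to query the classifiers from a script, something like the
sketch below should work. The v3 endpoint path, the "wikidatawiki" context and
the "damaging" model name are assumptions of mine; check the WD:ORES page for
the models and endpoints that are actually deployed.

import json
import urllib.request

def ores_score(rev_id, model="damaging"):
    # Ask ORES to score one Wikidata revision with one model.
    url = ("https://ores.wikimedia.org/v3/scores/wikidatawiki/"
           "{rev}/{model}".format(rev=rev_id, model=model))
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# Hypothetical revision id, just to show the call shape.
print(ores_score(123456789))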

In order to improve the anti-vandalism classifier, we need you to go through
some edits and determine whether they are damaging to Wikidata and whether
they are ill-intended edits or just newbie/honest mistakes.

This would help us distinguish between newbies and vandals and also improve
our data so we can build a precise and reliable vandalism-detection
classifier. Please go to Wikidata:Edit labels, install the gadget, and do a
workset.

Thanks
Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Three birthday gifts for wikidata's third birthday

2015-10-29 Thread Amir Ladsgroup
Hey,
It's Wikidata's third birthday, right? So I prepared three gifts for you:
1- The AI-based anti-vandalism classifier is ready after four months of work,
thanks to Aaron Halfaker. It's so big that I can't write it all here; this is
the link to the announcement.
2- Remember Kian? Using a bot I already added 100K statements where Kian had
high certainty, but there are far more to add that need human review. Thanks
to Magnus Manske we now have a game that suggests statements based on Kian,
and you can simply add them. What I did was populate a database with
suggestions and build an API around it. There are 2.6 million suggestions in
17 languages based on 53 models. I can easily add more languages and models,
just name them :)
3- There are still lots of old interwiki links (in case you don't remember,
things like [[en:Foo]], ewww) in small wikis, especially in the template
namespace, and there is a steady flow of them being added in medium-sized
wikis. Also, in the future we will need to clean them from Wiktionary \o/. Now
we have a script in pywikibot named interwikidata.py,
merged ten hours ago thanks to jayvdb and xzise. It cleans pages, adds links
to Wikidata and creates items for pages on your wiki; i.e. it's interwiki.py,
but for wikis that use Wikidata.
Just run:
python pwb.py scripts/interwikidata.py -unconnectedpages -clean -create
Or, if you are a little more advanced with pywikibot, write a script based on
it and handle interwiki conflicts (more help in the source code).
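
As a very rough illustration (this is not part of interwikidata.py, and the
pywikibot names are from memory, so treat it as a sketch to adapt to your
pywikibot version):

import pywikibot
from pywikibot import textlib

site = pywikibot.Site('fa', 'wikipedia')      # example wiki, pick your own
page = pywikibot.Page(site, 'Example')        # hypothetical page title

# Old-style [[xx:Foo]] links still sitting in the local wikitext.
local_links = textlib.getLanguageLinks(page.text, insite=site)
print(page.title(), 'has', len(local_links), 'local interwiki links')

try:
    # The Wikidata item the page is connected to, if any.
    item = pywikibot.ItemPage.fromPage(page)
    print('connected to', item.getID())
except pywikibot.NoPage:
    print('not connected to Wikidata yet')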

Happy birthday!
Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Italian Wikipedia imports gone haywire ?

2015-09-27 Thread Amir Ladsgroup
Italian Wikipedia says it's correct: https://it.wikipedia.org/wiki/Hefei



On Sun, Sep 27, 2015 at 8:21 PM Thad Guidry  wrote:

> OK, it seems that Wikipedia does have a few nice features. :)
>
> I was able to quickly search History on the entity and find that Dexbot
> had imported the erroneous statements
> https://www.wikidata.org/wiki/User:Dexbot  <-- pretty cool options there,
> Good Job whomever !
>
> and let the User (owner of the bot) know of the problem.
> https://www.wikidata.org/wiki/User_talk:Ladsgroup
>
>
> Thad
> +ThadGuidry 
>
> On Sun, Sep 27, 2015 at 11:38 AM, Thad Guidry 
> wrote:
>
>> I had to clean up this entity, whose Sister City property was filled in
>> with lots of erroneous statements, which I removed.
>>
>> https://www.wikidata.org/wiki/Q185684
>>
>> How can I figure out where the import went wrong, how it happened, and
>> how to ensure it doesn't happen again? How does one look at Wikidata bots
>> and their efficiency or incorrectness?
>>
>> Trying to learn more,
>>
>> Thad
>> +ThadGuidry 
>>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Units are live! \o/

2015-09-09 Thread Amir Ladsgroup
I probably won't sleep tonight :)

Best

On Thu, Sep 10, 2015 at 12:48 AM Stryn  wrote:

> Wow, finally, great!
> I've been waiting for units so long.  I'm already in my bed, so will try
> tomorrow then :-)
>
> Stryn 🐼
> Sent from Windows Phone
> --
> From: Lydia Pintscher 
> Sent: 9.9.2015 23:00
> To: Discussion list for the Wikidata project.
> 
> Subject: [Wikidata] Units are live! \o/
>
> Hey everyone :)
>
> As promised we just enabled support for quantities with units on
> Wikidata. So from now on you'll be able to store fancy things like the
> height of a mountain or the boiling point of an element.
>
> Quite a few properties have been waiting on unit support before being
> created. I assume they will be created in the next few hours, and then
> you can go ahead and add all of the measurements.
>
>
> Cheers
> Lydia
>
> --
> Lydia Pintscher - http://about.me/lydia.pintscher
> Product Manager for Wikidata
>
> Wikimedia Deutschland e.V.
> Tempelhofer Ufer 23-24
> 10963 Berlin
> www.wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
> unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
> Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] New version of Kian, faster, general, more accurate

2015-08-31 Thread Amir Ladsgroup
Hey Tom, thanks for your review. Note that this is a list of *possible*
errors and it doesn't mean all of the entries are wrong :) (if that were the
case, I would go ahead and remove them all).

On Mon, Aug 31, 2015 at 1:22 AM Tom Morris  wrote:

> After glancing at
> https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frFilm,
> it doesn't appear to me that either Wikidata type hierarchy or Wikipedia
> category hierarchy is being considered when evaluating type mismatches.  Is
> that intentional?
>
> Not yet, but it can be done with some programming hassle. If people like the
reports and are willing to work on them, I promise to take that into
account.


> For example
>
> Grave of the Fireflies (Q274520) <https://www.wikidata.org/wiki/Q274520>: No / Yes
> (0.731427666987)
>
>
> is an instance of animated film which is a subtype of film.
>
> Conversely, this telefilm d'horreur
>
> Le Collectionneur de cerveaux (Q579355)
> <https://www.wikidata.org/wiki/Q579355>: Yes / No (0.239868037957)
>
> is part of a subcategory of film d'horreur -> film de fiction
>
> The one other that I glanced at,
> https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frHuman,
> seems to have systematic issues with correct classification of Wikipedia
> pages about multiple people (e.g. brothers) which Wikidata correctly
> identifies as not people.
>
That can be considered a mis-classification of articles in Wikipedia, even
though I'm not a big fan of this idea. It seems these Wikipedia articles lack
proper categories, such as sibling and duo-related categories, and if those
categories were there Kian would know.

>
> It also, strangely, seems to think that Wikidata atomic elements are
> humans and I can't see why:
>
> calcium (Q706) <https://www.wikidata.org/wiki/Q706>: Yes / No (0.0225392419603)
>
> That's a bug in autolist; I don't know why autolist included Q706 in
humans. Maybe Magnus can tell. I need to dig deeper.


> Have you considered using other signals as inputs to your models?  For
> example, Freebase types should be a pretty reliable signal for things like
> humans and films.
>
> No, but I will think about and investigate using them :)

> Tom
>
>
> On Sun, Aug 30, 2015 at 11:56 AM, Amir Ladsgroup 
> wrote:
>
>> Thanks Nemo!
>>
>> I added new reports:
>> https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes
>>
>> If you check them, you can easily find tons of errors: some of them are
>> mis-categorizations in Wikipedia, some are mistakes in connecting an
>> article from Wikipedia to the wrong item, some are vandalism in
>> Wikidata, and some are mistakes by bots or Widar users. Please check
>> them if you want better quality in Wikidata.
>>
>> Best
>>
>> On Sun, Aug 30, 2015 at 12:16 PM Federico Leva (Nemo) 
>> wrote:
>>
>>> Amir Ladsgroup, 28/08/2015 20:17:
>>> >
>>> > Another thing I did is reporting possible mistakes, when Wikipedia and
>>> > Wikidata don't agree on one statement,
>>>
>>> Nice, with this Wikidata has better quality control systems than
>>> Wikipedia. ;-)
>>>
>>> Nemo
>>>
>>> ___
>>> Wikidata mailing list
>>> Wikidata@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] New version of Kian, faster, general, more accurate

2015-08-30 Thread Amir Ladsgroup
Thanks Nemo!

I added new reports:
https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes

If you check them, you can easily find tons of errors: some of them are
mis-categorizations in Wikipedia, some are mistakes in connecting an article
from Wikipedia to the wrong item, some are vandalism in Wikidata, and some
are mistakes by bots or Widar users. Please check them if you want better
quality in Wikidata.

Best

On Sun, Aug 30, 2015 at 12:16 PM Federico Leva (Nemo) 
wrote:

> Amir Ladsgroup, 28/08/2015 20:17:
> >
> > Another thing I did is reporting possible mistakes, when Wikipedia and
> > Wikidata don't agree on one statement,
>
> Nice, with this Wikidata has better quality control systems than
> Wikipedia. ;-)
>
> Nemo
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] New version of Kian, faster, general, more accurate

2015-08-28 Thread Amir Ladsgroup
Hello,
Yesterday I published a new version of Kian. I ran it to add statements to
claimless items from the Japanese and German Wikipedias, and it is working.

I'm planning to add the French and English Wikipedias; you can install it and
run it too.

Another thing I did is reporting possible mistakes, i.e. cases where Wikipedia
and Wikidata don't agree on a statement. These are the results, and I was able
to find all kinds of errors, such as: human in Wikidata but a disambiguation
page in German Wikipedia (it's in this list), or film in Wikidata but TV
series in ja.wp (the item seems to be a mess actually, from this list), or
Iranian mythological character in several wikis but "actor and model from
U.S." in Wikidata (came from this list).
Please go through the lists and fix as much as you can; also give me
suggestions on which wikis and statements to run this code on.

How does the new version of Kian work? I introduced the concept of a "model"
in Kian. A model consists of four properties: wiki (such as "enwikivoyage" or
"fawiki"), name (an arbitrary name), property (like "P31" or "P27") and a
value of that property (like "Q5" for P31 or "Q31" for P27). Kian then trains
that model, and once the model is ready you can use it to add statements to
any kind of list of articles (more technically, page generators of
pywikibot), for example adding the statement to new articles by running
something like this:
python scripts/parser.py -lang:ja -newpages:100 -n jaHuman
where jaHuman is the name of that model. Kian caches all data related to that
model in data/jaHuman/.
Or find possible mistakes in that wiki:
python scripts/possible_mistakes.py -n jaHuman
etc.
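
To make the four properties concrete, here is a tiny paraphrase of what a
model holds (illustration only, not Kian's actual class; the deGermans name is
made up):

from collections import namedtuple

# A Kian model, as described above: which wiki, an arbitrary name,
# a property, and one value of that property.
Model = namedtuple('Model', ['wiki', 'name', 'prop', 'value'])

ja_human = Model(wiki='jawiki', name='jaHuman', prop='P31', value='Q5')
de_citizens = Model(wiki='dewiki', name='deGermans', prop='P27', value='Q183')

# Everything belonging to a model is cached under data/<name>/,
# e.g. data/jaHuman/ for the first model above.
print(ja_human)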

Other things worth mentioning:
* Kian's scripts and the library (the part that actually does the work) are
separated, so you can easily write your own scripts for Kian.
* Since it uses autolists to train and find possible mistakes, results are
live.
* Kian now caches the results of SQL queries in a per-model folder, so the
first model you build for Spanish Wikipedia may take a while to complete,
but the second model for Spanish Wikipedia takes much less time.
* I doubled the number of features in a way that makes Kian's accuracy really
high [1] (e.g. P31:Q5 for German Wikipedia has an AUC of 99.75%, and precision
and recall are 99.11% and 98.31% at a threshold of 63%).
* Thresholds are chosen automatically based on F-beta scores, to get optimum
accuracy and high recall (see the sketch after this list).
* It can give results in different classes of certainty, and we can send
these results to semi-automated tools. If anyone is willing to help, please do
tell.
* I try to follow dependency injection principles, so it is possible to
train any kind of model using Kian and get the results (since we don't have
really good libraries for ANN training).
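
The threshold bullet above, in sketch form (my own simplification, not Kian's
code; beta > 1 weights recall more heavily than precision):

def fbeta(precision, recall, beta=2.0):
    # F-beta score; beta > 1 favours recall.
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def pick_threshold(scores, labels, beta=2.0):
    # Try thresholds between 1% and 99% and keep the one with the best F-beta.
    best_t, best_f = 0.5, -1.0
    for t in (i / 100 for i in range(1, 100)):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, l in zip(predicted, labels) if p and l)
        fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
        fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = fbeta(precision, recall, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t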

A crazy idea: what do you think if I made a webservice for Kian, so you could
go to a page on Labs, register a model and after a while get the results, or
use OAuth to add statements?

Last thing: suggest models to me and I will work on them :)


[1]: the old Kian worked this way: it labeled each category based on the
percentage of its members that already have the statement, then labeled
articles based on the number of categories the article has in each class. The
new Kian does this but additionally labels categories based on the percentage
of members that have the property but not that value (e.g. "Category:Fictional
characters" would have a high percentage in the model for P31:Q5), and again
labels articles based on the number of categories in each class.
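
In code-ish form, the scheme in [1] looks roughly like this (variable names
and class boundaries are mine, not Kian's):

def category_share(members_with_statement, members_total):
    # Share of a category's members that already carry P:V (e.g. P31:Q5).
    return members_with_statement / members_total if members_total else 0.0

def article_features(article_categories, category_shares, bins=(0.2, 0.4, 0.6, 0.8)):
    # Count how many of the article's categories fall into each share class;
    # these per-class counts are the features fed to the classifier.
    counts = [0] * (len(bins) + 1)
    for cat in article_categories:
        share = category_shares.get(cat, 0.0)
        klass = sum(share > b for b in bins)
        counts[klass] += 1
    return counts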
Best
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata