Amir,

Thanks for your work! I like this one, showing how our Sum-of-all-Paintings project is doing compared to sculptures (which have many copyright issues, but you could still put the data on Wikidata):
http://tools.wmflabs.org/wd-analyst/index.php?p=p31&q=Q3305213%7CQ860861
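(If anyone wants to generate comparison links like this in bulk, the URL scheme looks like it is just p=<property> and q=<pipe-separated values>. A quick Python sketch, guessing the scheme from the link above:)

    from urllib.parse import urlencode

    def wd_analyst_url(prop, values):
        # Build a wd-analyst comparison link (assumed parameter scheme:
        # p = property ID, q = values joined by "|", percent-encoded).
        base = "http://tools.wmflabs.org/wd-analyst/index.php"
        return base + "?" + urlencode({"p": prop, "q": "|".join(values)})

    # paintings (Q3305213) vs. sculptures (Q860861)
    print(wd_analyst_url("p31", ["Q3305213", "Q860861"]))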
Jane

On Wed, Dec 16, 2015 at 12:23 PM, Amir Ladsgroup <ladsgr...@gmail.com> wrote:
> Hey,
> Thanks for your feedback. That's exactly what I'm looking for.
>
> On Mon, Dec 14, 2015 at 5:29 PM Paul Houle <ontolo...@gmail.com> wrote:
>
>> It's a step in the right direction, but it took a very long time to load
>> on my computer.
>
> That's probably related to the recent Labs issues. Now I get reasonable
> load times:
> http://tools.pingdom.com/fpt/#!/eq1i3s/http://tools.wmflabs.org/wd-analyst/index.php
>
>> After the initial load it was pretty peppy. Then I ran the default
>> example that is grayed in but not active (I had to retype it).
>
> I made some modifications that might help.
>
>> Then I get the page that says "results are ready" and how cool they are.
>> It takes me a while to figure out what I am looking at, and I finally
>> realize it is a comparison of data quality metrics (which I think are
>> all fact counts) between all of the P31 predicates and the Q5.
>
> I made some changes so you can see things more easily. I'd appreciate it
> if you could suggest some wording I could put in the description.
>
>> The use of the graphic on the first row complicated this for me.
>
> Please suggest something I can write there for people :)
>
>> There are a lot of broken links on this page too, such as
>>
>> http://tools.wmflabs.org/wd-analyst/sitelink.php
>> https://www.wikidata.org/wiki/P31
>
> The broken property links should be fixed by now, and the sitelink page
> is broken because it's not there yet. I'll add it very soon.
>
>> and of course no merged-in documentation about what P31 and Q5 are.
>> Opaque identifiers are necessary for your project, but
>>
>> Also, some way to find the P's and Q's hooked up to this would be most
>> welcome.
>
> Done. Now we have labels for everything.
>
>> It's a great start and is completely in the right direction, but it
>> could take many sprints of improvement.
>>
>> On Wed, Dec 9, 2015 at 4:36 AM, Gerard Meijssen <gerard.meijs...@gmail.com> wrote:
>>
>>> Hoi,
>>> What would be nice is an option to understand progress from one dump
>>> to the next, like you can with the Statistics by Magnus. Magnus also
>>> has data on sources, but this is more global.
>>> Thanks,
>>> GerardM
>>>
>>> On 8 December 2015 at 21:41, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote:
>>>
>>>> Hi Amir,
>>>>
>>>> Very nice, thanks! I like the general approach of having a stand-alone
>>>> tool for analysing the data, and maybe pointing you to issues. Like a
>>>> dashboard for Wikidata editors.
>>>>
>>>> What backend technology are you using to produce these results? Is
>>>> this live data or dumped data? One could also get those numbers from
>>>> the SPARQL endpoint, but performance might be problematic (since you
>>>> compute averages over all items; a custom approach would of course be
>>>> much faster, but then you have the data update problem).
>>>>
>>>> An obvious feature request would be to display entity ids as links to
>>>> the appropriate page, and maybe with their labels (in a language of
>>>> your choice).
>>>>
>>>> But overall very nice.
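>>>>
>>>> For concreteness, getting those numbers from the SPARQL endpoint
>>>> could look roughly like the untested sketch below (standard
>>>> query.wikidata.org prefixes assumed; aggregating over all items like
>>>> this may well hit the timeout, which is exactly the performance
>>>> concern I mean):
>>>>
>>>>     import requests
>>>>
>>>>     # Count P31 statements and how many references they carry.
>>>>     QUERY = """
>>>>     SELECT (COUNT(DISTINCT ?st) AS ?statements)
>>>>            (COUNT(?ref) AS ?references)
>>>>     WHERE {
>>>>       ?item p:P31 ?st .
>>>>       OPTIONAL { ?st prov:wasDerivedFrom ?ref }
>>>>     }
>>>>     """
>>>>
>>>>     r = requests.get("https://query.wikidata.org/sparql",
>>>>                      params={"query": QUERY, "format": "json"})
>>>>     print(r.json()["results"]["bindings"])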
>>>>
>>>> Regards,
>>>>
>>>> Markus
>>>>
>>>> On 08.12.2015 18:48, Amir Ladsgroup wrote:
>>>>
>>>>> Hey,
>>>>> There have been several discussions regarding the quality of
>>>>> information in Wikidata. I wanted to work on Wikidata quality, but we
>>>>> didn't have any good source of information to see where we are ahead
>>>>> and where we are behind. So I thought the best thing I could do is
>>>>> build something that shows people, in detail, exactly how well
>>>>> sourced our data is. So here we have
>>>>> http://tools.wmflabs.org/wd-analyst/index.php
>>>>>
>>>>> You can give it just a property (let's say P31) and it gives you the
>>>>> four most used values, plus an analysis of sources and overall
>>>>> quality (check this out:
>>>>> http://tools.wmflabs.org/wd-analyst/index.php?p=P31). You can see
>>>>> that about ~33% of these statements are sourced, of which 29.1% are
>>>>> based on Wikipedia.
>>>>> You can also give a property and multiple values you want to compare.
>>>>> Let's say you want to compare P27:Q183 (country of citizenship:
>>>>> Germany) and P27:Q30 (US). Check this out:
>>>>> http://tools.wmflabs.org/wd-analyst/index.php?p=P27&q=Q30|Q183
>>>>> You can see that US biographies are more abundant (300K versus 200K)
>>>>> but German biographies are more descriptive (3.8 descriptions per
>>>>> item versus 3.2).
>>>>>
>>>>> One important note: compare P31:Q5 (a trivial statement), where 46%
>>>>> of statements are not sourced at all and 49% are based on Wikipedia,
>>>>> with the statistics for the population property (P1082:
>>>>> http://tools.wmflabs.org/wd-analyst/index.php?p=P1082). Population is
>>>>> not a trivial statement, and we need to be careful about those. It
>>>>> turns out there is slightly more than one reference per statement,
>>>>> and only 4% of them are based on Wikipedia. So we can relax and enjoy
>>>>> this highly-sourced data.
>>>>>
>>>>> Requests:
>>>>>
>>>>> * Please tell me whether you want this tool at all.
>>>>> * Please suggest more ways to analyze the data and catch unsourced
>>>>>   material.
>>>>>
>>>>> Future plans (if you agree to keep using this tool):
>>>>>
>>>>> * Support more datatypes (e.g. date of birth based on year,
>>>>>   coordinates).
>>>>> * Sitelink-based and reference-based analysis (to check how many
>>>>>   articles of, let's say, Chinese Wikipedia are unsourced).
>>>>> * Free-style analysis: there is a database behind this tool that can
>>>>>   be used for many more applications. For example, you could get the
>>>>>   most unsourced statements of P31 and then go fix them (see the
>>>>>   sketch below). I'm trying to build a playground for these kinds of
>>>>>   tasks.
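>>>>>
>>>>> To give a taste of that free-style analysis, pulling unsourced P31
>>>>> statements could look roughly like this (a hypothetical sketch using
>>>>> the public SPARQL endpoint rather than the tool's own database, which
>>>>> isn't exposed yet):
>>>>>
>>>>>     import requests
>>>>>
>>>>>     # Find P31 statements that carry no reference at all
>>>>>     # (standard query.wikidata.org prefixes assumed).
>>>>>     QUERY = """
>>>>>     SELECT ?item ?value WHERE {
>>>>>       ?item p:P31 ?st .
>>>>>       ?st ps:P31 ?value .
>>>>>       FILTER NOT EXISTS { ?st prov:wasDerivedFrom ?ref }
>>>>>     }
>>>>>     LIMIT 100
>>>>>     """
>>>>>
>>>>>     r = requests.get("https://query.wikidata.org/sparql",
>>>>>                      params={"query": QUERY, "format": "json"})
>>>>>     for row in r.json()["results"]["bindings"]:
>>>>>         print(row["item"]["value"], row["value"]["value"])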
>>>>>
>>>>> I hope you like this and rock on!
>>>>> <http://tools.wmflabs.org/wd-analyst/index.php?p=P136&q=Q11399>
>>>>>
>>>>> Best
>>
>> --
>> Paul Houle
>>
>> Applying Schemas for Natural Language Processing, Distributed Systems,
>> Classification and Text Mining and Data Lakes
>>
>> (607) 539 6254    paul.houle on Skype    ontolo...@gmail.com
>>
>> :BaseKB -- Query Freebase Data With SPARQL
>> http://basekb.com/gold/
>>
>> Legal Entity Identifier Lookup
>> https://legalentityidentifier.info/lei/lookup/
>>
>> Join our Data Lakes group on LinkedIn
>> https://www.linkedin.com/grp/home?gid=8267275
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata