Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata

2019-08-29 Thread Imre Samu
Hi Sebastian,

>Is there a list of geodata issues, somewhere? Can you give some example?

My main "pain" points:

- the cebuano geo duplicates:
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/10#Cebuano
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/06#A_proposed_course_of_action_for_dealing_with_cebwiki/svwiki_geographic_duplicates

- detecting "anonym" editings  of the wikidata labels from wikidata JSON
dumps.  As I know - Now it is impossible, - no similar  information in the
JSON dump, so I cant' create a score.
  This is similar problem like the original posts ; ( ~ quality score )
 but I would like to use the original editing history and
implementing/tuning my scoring algorithm.

  When somebody renaming some city names (trolls) , then my matching
algorithm not find them,
  and in this cases I can use the previous "better" state of the wikidata.
  It is also important for merging openstreetmap place-names with wikidata
labels for end users.



> Do you have a reference dataset as well, or would that be NaturalEarth
itself?

Sorry, I don't have a reference dataset, and NaturalEarth is only a
subset of "reality": it does not contain all cities, rivers, etc.
OpenStreetMap is probably the best resource you could use.
Sometimes I add Wikidata concordances to the https://www.whosonfirst.org/
(WOF) gazetteer, but that data originates mostly from similar sources
(GeoNames, ...), so it can't be used as a quality indicator.

If you need an easy example, airports are probably a good start for
checking Wikidata completeness; see the sketch below.
(P238 IATA airport code; P239 ICAO airport code; P240 FAA airport code;
P931 place served by transport hub; P131 located in the administrative
territorial entity)
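For instance, a completeness check of this kind can be expressed as a
single count query against the public Wikidata Query Service. The sketch
below is only an illustration (not my actual tooling); it assumes
Q1248784 as the airport class and counts items that lack P238.

# Minimal sketch: count airports in Wikidata without an IATA code (P238),
# as one crude completeness indicator. Assumes the public WDQS endpoint
# and Q1248784 ("airport") as the class of interest.
import requests

WDQS = "https://query.wikidata.org/sparql"

QUERY = """
SELECT (COUNT(?airport) AS ?missing) WHERE {
  ?airport wdt:P31 wd:Q1248784 .                 # instance of: airport
  FILTER NOT EXISTS { ?airport wdt:P238 [] }     # no IATA airport code
}
"""

def count_airports_missing_iata():
    resp = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "completeness-sketch/0.1"})
    resp.raise_for_status()
    return int(resp.json()["results"]["bindings"][0]["missing"]["value"])

if __name__ == "__main__":
    print("Airports without an IATA code:", count_airports_missing_iata())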

> What would help you to measure completeness for adding concordances to
NaturalEarth.

I have created my own tools/scripts, because waiting for the community
to fix the cebwiki data problems takes a very long time.

I import the Wikidata JSON dumps into PostGIS (SPARQL is not flexible or
scalable enough for geo matching), then
- add some scoring based on cebwiki/srwiki/...,
- and create some sheets for manual checking.
This process is like a "fuzzy left join", with a lot of hacky code and
manual tuning; a rough sketch of the name-matching step is below.
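A minimal sketch of the name-matching part (not my actual code): it
compares one NaturalEarth place name against a set of Wikidata labels
with a simple string-similarity ratio. The inputs are hypothetical, and a
real pipeline would also use coordinates and languages.

# Sketch of the "fuzzy left join" idea: pick the Wikidata label closest to
# a NaturalEarth name, if it clears a similarity threshold. Real matching
# would also compare coordinates, languages and feature types.
from difflib import SequenceMatcher

def best_label_match(ne_name, wikidata_labels, threshold=0.85):
    """Return (qid, label, score) of the closest Wikidata label, or None."""
    best = None
    for qid, label in wikidata_labels.items():
        score = SequenceMatcher(None, ne_name.lower(), label.lower()).ratio()
        if best is None or score > best[2]:
            best = (qid, label, score)
    return best if best and best[2] >= threshold else None

# Hypothetical usage:
labels = {"Q1781": "Budapest", "Q36": "Poland"}
print(best_label_match("Budapest", labels))   # -> ('Q1781', 'Budapest', 1.0)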

If I can't find a NaturalEarth/WOF object in Wikidata, I have to debug
manually. The most common problems are:
- different transliterations / spellings / English vs. local names;
- trolling by anonymous users (mostly from mobile phones);
- problems with the GPS coordinates;
- changes in the real world (cities merging or splitting), which require
a lot of background research.

best,
Imre

Sebastian Hellmann wrote (on 28 Aug 2019, Wed, 11:11):

> Hi Imre,
>
> we can encode these rules using the JSON MongoDB database we created in
> GlobalFactSync project (
> https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE).
> As  basis for the GFS Data Browser. The database has open read access.
>
> Is there a list of geodata issues, somewhere? Can you give some example?
> GFS focuses on both: overall quality measures and very domain specific
> adaptations. We will also try to flag these issues for Wikipedians.
>
> So I see that there is some notion of what is good and what not by source.
> Do you have a reference dataset as well, or would that be NaturalEarth
> itself? What would help you to measure completeness for adding concordances
> to NaturalEarth.
>
> -- Sebastian
> On 24.08.19 21:26, Imre Samu wrote:
>
> For geodata ( human settlements/rivers/mountains/... )  ( with GPS
> coordinates ) my simple rules:
> - if it has a  "local wikipedia pages" or  any big
> lang["EN/FR/PT/ES/RU/.."]  wikipedia page ..  than it is OK.
> - if it is only in "cebuano" AND outside of "cebuano BBOX" ->  then 
> this is lower quality
> - only:{shwiki+srwiki} AND outside of "sh"&"sr" BBOX ->  this is lower
> quality
> - only {huwiki} AND outside of CentralEuropeBBOX -> this is lower quality
> - geodata without GPS coordinate ->  ...
> - 
> so my rules based on wikipedia pages and languages areas ...  and I prefer
> wikidata - with local wikipedia pages.
>
> This is based on my experience - adding Wikidata ID concordances to
> NaturalEarth ( https://www.naturalearthdata.com/blog/ )
>
> --
> All the best,
> Sebastian Hellmann
>
> Director of Knowledge Integration and Linked Data Technologies (KILT)
> Competence Center
> at the Institute for Applied Informatics (InfAI) at Leipzig University
> Executive Director of the DBpedia Association
> Projects: http://dbpedia.org, http://nlp2rdf.org,
> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> 
> Homepage: http://aksw.org/SebastianHellmann
> Research Group: http://aksw.org
>


Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata - 3rd round

2019-08-29 Thread Uwe Jung
Hello,

thank you very much for your contributions and comments. I would endorse
most of your remarks without hesitation.

But I would like to clarify a few things:

   - The importance of Wikidata grows with its acceptance by a
   "non-specialist" audience. This includes many people who decide about
   project funds or donations. As a rule, they have little time to inform
   themselves sufficiently about the problems of measuring data quality. In
   these hectic times it is unfortunately common for such an audience to
   demand solutions that are as simple and quick to read as possible. (I
   will leave that last sentence here as a hypothesis.) I think it is
   important to try to meet these expectations.
   - Recoin is known. And yes, it addresses only the dimension of
   *relative* completeness. At present, however, it is aimed primarily at
   people who enter data manually, so it remains invisible or unusable for
   many others. To stick with the idea: would it not be possible to
   calculate a one- or multi-dimensional value from the Recoin information,
   which could then be stored as a literal on the item via a property
   "relative completeness"? The advantage would be that this value could be
   queried via SPARQL together with the item (a rough sketch of such a
   score follows after this list). Decision-makers from the field of, say,
   "jam science" could thus gain an overview of how complete the data from
   that field is in Wikidata and for which data-completion projects funds
   may still have to be provided. As described in my last e-mail, a single
   property "relative completeness" is not sufficient to describe data
   quality.
   - I am sorry if I expressed myself in a misleading way. I use this
   mailing list to get feedback on an idea. It may be "my" idea (or not),
   but it is far from being "my" project. However, if the idea should ever
   be realized by anyone in any way, I would be interested in making my
   small, modest contribution.
   - It's true that the current number of Wikidata items is hard to
   imagine. If a single instance needed one minute per item to calculate
   the various quality scores, processing all of them would take roughly
   113 years (about 60,000,000 items x 60 seconds, i.e. roughly 3.6e9
   seconds). The fact that many items are modified over and over again and
   therefore have to be recalculated is not yet taken into account. An
   implementation would therefore have to use strategies that make the
   first results visible with less effort. One possibility is to
   concentrate initially on the part of the data that is actually being
   used; here we are back at the question of dynamic quality.
   - People need support so that they can use the data and find and fix
   its flaws. In the foreseeable future there will not be enough
   supporters to manually check all 60 million items for errors. This is
   another reason why information about the quality of the data should be
   queryable together with the data.
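To make the Recoin point more concrete, here is a minimal sketch (my own
wording, not Recoin's actual algorithm) of how such a relative-completeness
value could be computed from an item's claims: the score is simply the
fraction of an assumed expected-property list for a class that the item
actually has. The expected-property list is a placeholder that a real
system would derive per class.

# Sketch: fraction of "expected" properties present on an item. NOT
# Recoin's algorithm; the expected-property list below is a placeholder
# (some common properties for humans, Q5).
import json
import urllib.request

EXPECTED_FOR_HUMAN = ["P21", "P569", "P19", "P106", "P27", "P18", "P734", "P735"]

def fetch_claims(qid):
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    req = urllib.request.Request(url, headers={"User-Agent": "completeness-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["entities"][qid].get("claims", {})

def relative_completeness(qid, expected=EXPECTED_FOR_HUMAN):
    claims = fetch_claims(qid)
    present = sum(1 for p in expected if p in claims)
    return present / len(expected)

print("Q42 relative completeness:", round(relative_completeness("Q42"), 2))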


Thanks

Uwe Jung


Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata (next round)

2019-08-28 Thread Andrew Gray
Hi Uwe,

I would agree with Gerard's concern about resources. Actually
embedding it within Wikidata - stored on the item with a property and
queryable by SPARQL - implies that we handle it as a statement. So
each edit that materially changed the quality score would prompt
another edit to update the scoring. Presumably not all edits would
change anything (e.g. label changes wouldn't be relevant), but even if
only 10% made a material difference, that's basically 10% more edits,
10% more contribution to query service updates, etc. And that's quite
a substantial chunk of resources for a "nice to have" feature!

So... maybe this suggests a different approach.

You could set up a separate Wikibase installation (or any other kind
of linked-data store) to hold the quality ratings, and make that
accessible through federated SPARQL queries. The WDQS is capable of
handling federated queries reasonably efficiently (see e.g.
https://w.wiki/7at), so you could allow people to do a search using
both sets of data ("find me all ABCs on Wikidata, and only return
those with a value of X > 0.5 on Wikidata-Scores"); a sketch of such
a query is below.
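As an illustration only: the "Wikidata-Scores" endpoint and its score
predicate below are hypothetical stand-ins for such a separate store, and
note that WDQS restricts SERVICE federation to an allowlist of endpoints,
so a real store would need to be approved (or the query run from the
other side).

# Sketch of a federated WDQS query against a hypothetical external score
# store. The endpoint https://scores.example.org/sparql and the predicate
# ex:qualityScore do not exist; they stand in for the separate Wikibase.
import requests

WDQS = "https://query.wikidata.org/sparql"

FEDERATED_QUERY = """
PREFIX ex: <https://scores.example.org/ontology#>
SELECT ?item ?score WHERE {
  ?item wdt:P31 wd:Q3305213 .                    # e.g. all paintings
  SERVICE <https://scores.example.org/sparql> {  # hypothetical score store
    ?item ex:qualityScore ?score .
  }
  FILTER(?score > 0.5)
}
LIMIT 100
"""

resp = requests.get(WDQS, params={"query": FEDERATED_QUERY, "format": "json"},
                    headers={"User-Agent": "federated-score-sketch/0.1"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["score"]["value"])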

Andrew.


On Tue, 27 Aug 2019 at 20:50, Uwe Jung  wrote:
>
> Hello,
>
> many thanks for the answers to my contribution from 24.8.
> I think that all four opinions contain important things to consider.
>
> @David Abián
> I have read the article and agree that in the end the users decide which data 
> is good for them or not.
>
> @GerardM
> It is true that in a possible implementation of the idea, the aspect of 
> computing load must be taken into account right from the beginning.
>
> Please check that I have not given up on the idea yet. With regard to the 
> acceptance of Wikidata, I consider a quality indicator of some kind to be 
> absolutely necessary. There will be a lot of ordinary users who would like to 
> see something like this.
>
> At the same time I completely agree with David;(almost) every chosen 
> indicator is subject to a certain arbitrariness in the selection. There won't 
> be one easy to understand super-indicator.
> So, let's approach things from the other side. Instead of a global indicator, 
> a separate indicator should be developed for each quality dimension to be 
> considered. With some dimensions this should be relatively easy. For others 
> it could take years until we have agreed on an algorithm for their 
> calculation.
>
> Furthermore, the indicators should not represent discrete values but a 
> continuum of values. No traffic light statements (i.e.: good, medium, bad) 
> should be made. Rather, when displaying the qualifiers, the value could be 
> related to the values of all other objects (e.g. the value x for the current 
> data object in relation to the overall average for all objects for this 
> indicator). The advantage here is that the total average can increase over 
> time, meaning that the position of the value for an individual object can 
> also decrease over time.
>
> Another advantage: Users can define the required quality level themselves. 
> If, for example, you have high demands on accuracy but few demands on the 
> completeness of the statements, you can do this.
>
> However, it remains important that these indicators (i.e. the evaluation of 
> the individual item) must be stored together with the item and can be queried 
> together with the data using SPARQL.
>
> Greetings
>
> Uwe Jung
>



--
- Andrew Gray
  and...@generalist.org.uk



Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata (next round)

2019-08-28 Thread Ettore RIZZA
@Uwe: I'm sorry if I'm stating the obvious, but are you familiar with the
Recoin tool [1]? It seems to be quite close to what you describe, though
only for the data quality dimension of completeness (or more precisely,
*relative* completeness), and it could perhaps serve as a model for what
you are considering. It is also a good example of a data quality tool that
is directly useful to editors, as it often allows them to identify and add
missing statements to an item.

Regards,

Ettore Rizza

[1] https://www.wikidata.org/wiki/Wikidata:Recoin



On Tue, 27 Aug 2019 at 21:49, Uwe Jung  wrote:

> Hello,
>
> many thanks for the answers to my contribution from 24.8.
> I think that all four opinions contain important things to consider.
>
> @David Abián
> I have read the article and agree that in the end the users decide which
> data is good for them or not.
>
> @GerardM
> It is true that in a possible implementation of the idea, the aspect of
> computing load must be taken into account right from the beginning.
>
> Please check that I have not given up on the idea yet. With regard to the
> acceptance of Wikidata, I consider a quality indicator of some kind to be
> absolutely necessary. There will be a lot of ordinary users who would like
> to see something like this.
>
> At the same time I completely agree with David;(almost) every chosen
> indicator is subject to a certain arbitrariness in the selection. There
> won't be one easy to understand super-indicator.
> So, let's approach things from the other side. Instead of a global
> indicator, a separate indicator should be developed for each quality
> dimension to be considered. With some dimensions this should be relatively
> easy. For others it could take years until we have agreed on an algorithm
> for their calculation.
>
> Furthermore, the indicators should not represent discrete values but a
> continuum of values. No traffic light statements (i.e.: good, medium, bad)
> should be made. Rather, when displaying the qualifiers, the value could be
> related to the values of all other objects (e.g. the value x for the
> current data object in relation to the overall average for all objects for
> this indicator). The advantage here is that the total average can increase
> over time, meaning that the position of the value for an individual object
> can also decrease over time.
>
> Another advantage: Users can define the required quality level themselves.
> If, for example, you have high demands on accuracy but few demands on the
> completeness of the statements, you can do this.
>
> However, it remains important that these indicators (i.e. the evaluation
> of the individual item) must be stored together with the item and can be
> queried together with the data using SPARQL.
>
> Greetings
>
> Uwe Jung
>
> On Sat, 24 Aug 2019 at 13:54, Uwe Jung wrote:
>
>> Hello,
>>
>> As the importance of Wikidata increases, so do the demands on the quality
>> of the data. I would like to put the following proposal up for discussion.
>>
>> Two basic ideas:
>>
>>1. Each Wikidata page (item) is scored after each editing. This score
>>should express different dimensions of data quality in a quickly 
>> manageable
>>way.
>>2. A property is created via which the item refers to the score
>>value. Certain qualifiers can be used for a more detailed description 
>> (e.g.
>>time of calculation, algorithm used to calculate the score value, etc.).
>>
>>
>> The score value can be calculated either within Wikibase after each data
>> change or "externally" by a bot. For the calculation can be used among
>> other things: Number of constraints, completeness of references, degree of
>> completeness in relation to the underlying ontology, etc. There are already
>> some interesting discussions on the question of data quality which can be
>> used here ( see  https://www.wikidata.org/wiki/Wikidata:Item_quality;
>> https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc).
>>
>> Advantages
>>
>>- Users get a quick overview of the quality of a page (item).
>>- SPARQL can be used to query only those items that meet a certain
>>quality level.
>>- The idea would probably be relatively easy to implement.
>>
>>
>> Disadvantage:
>>
>>- In a way, the data model is abused by generating statements that no
>>longer describe the item itself, but make statements about the
>>representation of this item in Wikidata.
>>- Additional computing power must be provided for the regular
>>calculation of all changed items.
>>- Only the quality of pages is referred to. If it is insufficient,
>>the changes still have to be made manually.
>>
>>
>> I would now be interested in the following:
>>
>>1. Is this idea suitable to effectively help solve existing quality
>>problems?
>>2. Which quality dimensions should the score value represent?
>>3. Which quality dimension can be calculated with reasonable effort?
>>4. How to calculate and 

Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata

2019-08-28 Thread Sebastian Hellmann

Hi Imre,

we can encode these rules using the JSON MongoDB database we created in
the GlobalFactSync project
(https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE),
which serves as the basis for the GFS Data Browser. The database has open
read access.


Is there a list of geodata issues somewhere? Can you give some examples?
GFS focuses on both overall quality measures and very domain-specific
adaptations. We will also try to flag these issues for Wikipedians.


So I see that there is some notion of what is good and what is not, by
source. Do you have a reference dataset as well, or would that be
NaturalEarth itself? What would help you measure completeness for
adding concordances to NaturalEarth?


-- Sebastian

On 24.08.19 21:26, Imre Samu wrote:
For geodata ( human settlements/rivers/mountains/... )  ( with GPS 
coordinates ) my simple rules:
- if it has a  "local wikipedia pages" or  any big 
lang["EN/FR/PT/ES/RU/.."]  wikipedia page ..  than it is OK.
- if it is only in "cebuano" AND outside of "cebuano BBOX" ->  then 
 this is lower quality
- only:{shwiki+srwiki} AND outside of "sh"&"sr" BBOX -> this is lower 
quality

- only {huwiki} AND outside of CentralEuropeBBOX -> this is lower quality
- geodata without GPS coordinate ->  ...
- 
so my rules based on wikipedia pages and languages areas ... and I 
prefer wikidata - with local wikipedia pages.


This is based on my experience - adding Wikidata ID concordances to 
NaturalEarth ( https://www.naturalearthdata.com/blog/ )

--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org


Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata (next round)

2019-08-27 Thread Gerard Meijssen
Hoi,
This is shopping for recognition for "your" project, making it an issue
that has to affect everything else because the quality of "your" project
is highly problematic given a few basic facts. Wikidata has 59,284,641
items; this effort is about 7,500 people. They drown in a sea of other
people and items. Statistically, the numbers involved are insignificant.

HOWEVER, when your effort has a practical application, it is that
application, the use of the data, that ensures the data will be
maintained and, hopefully, that the quality of this subset is
maintained. When you want quality, static quality is achieved by
restricting at the gate; dynamic quality is achieved by making sure that
the data is actually used. Scholia is an example of functionality that
supports existing data, and everyone who uses it will see the flaws in
the data. That is why we need to import additional data, merge
scientists when there are duplicates, and add additional authorities.

Yes, we need to achieve quality results. They will be achieved when people
use the data, find its flaws, and consequently append and amend. Quality
is best recognised by supporting and highlighting the application of our
data, and by being thankful for the consequential updates we receive. The
users who help us do better are our partners; all others ensure our
relevance.
Thanks,
  GerardM

On Wed, 28 Aug 2019 at 00:52, Magnus Sälgö  wrote:

> Uwe I feel this is more and more important with quality and provenance and
> also communicate inside Wikidata the quality of our data.
>
>  I have added maybe the best source for biographies in Sweden P3217 in
> Wikidata on 7500 person. In Wikipedia those 7500 objects are used on > 200
> different languages in Wikipedia we need to have a ”layer” explaining that
> data confirmed  with P3217 ”SBL from Sweden” has very high trust
>
> See https://phabricator.wikimedia.org/T222142
>
> I can also see this quality problem that  Nobelprize.org and Wikidata has
> > 30 differencies and its sometimes difficult to understand the quality of
> the sources in Wikidata plus that Nobelprize.com has no sources makes the
> equation difficult
> https://phabricator.wikimedia.org/T200668
>
> Regards
> Magnus Sälgö
> 0046-705937579
> salg...@msn.com
>
> A blogpost I wrote
> https://minancestry.blogspot.com/2018/04/wikidata-has-design-problem.html
> 
>
> On 28 Aug 2019 at 03:49, Uwe Jung wrote:
>
> Hello,
>
> many thanks for the answers to my contribution from 24.8.
> I think that all four opinions contain important things to consider.
>
> @David Abián
> I have read the article and agree that in the end the users decide which
> data is good for them or not.
>
> @GerardM
> It is true that in a possible implementation of the idea, the aspect of
> computing load must be taken into account right from the beginning.
>
> Please check that I have not given up on the idea yet. With regard to the
> acceptance of Wikidata, I consider a quality indicator of some kind to be
> absolutely necessary. There will be a lot of ordinary users who would like
> to see something like this.
>
> At the same time I completely agree with David;(almost) every chosen
> indicator is subject to a certain arbitrariness in the selection. There
> won't be one easy to understand super-indicator.
> So, let's approach things from the other side. Instead of a global
> indicator, a separate indicator should be developed for each quality
> dimension to be considered. With some dimensions this should be relatively
> easy. For others it could take years until we have agreed on an algorithm
> for their calculation.
>
> Furthermore, the indicators should not represent discrete values but a
> continuum of values. No traffic light statements (i.e.: good, medium, bad)
> should be made. Rather, when displaying the qualifiers, the value could be
> related to the values of all other objects (e.g. the value x for the
> current data object in relation to the overall average for all objects for
> this indicator). The advantage here is that the total average can increase
> over time, meaning that the position of the value for an individual object
> can also decrease over time.
>
> Another advantage: Users can define the required quality level themselves.
> If, for example, you have high demands on accuracy but few demands on the
> completeness of the statements, you can do this.
>
> However, it remains important that these indicators (i.e. the evaluation
> of the individual item) must be stored together with the item and can be
> queried together with the data using SPARQL.
>
> Greetings
>
> Uwe Jung
>
> On Sat, 24 Aug 2019 at 13:54, Uwe Jung wrote:
>
>> Hello,
>>
>> As the importance of Wikidata increases, so do the demands on the quality
>> of the data. I would like to put the following proposal up for discussion.
>>
>> Two basic ideas:
>>
>>1. Each Wikidata page (item) is scored after each edit

Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata

2019-08-27 Thread Sebastian Hellmann

Hi David,

On 24.08.19 21:23, David Abián wrote:

Hi,

If we accept that the quality of the data is the "fitness for use",
which in my opinion is the best and most commonly used definition (as
stated in the article linked by Ettore), then it will never be possible
to define a number that objectively represents data quality. We can
define a number that is the result of an arbitrary weighted average of
different metrics related to various dimensions of quality arbitrarily
captured and transformed, and we can fool ourselves by saying that this
number represents data quality, but it will not, nor will it be an
approximation of what data quality means, nor will this number be able
to order Wikidata entities matching any common, understandable,
high-level criterion. The quality of the data depends on the use, it's
relative to each user, and can't be measured globally and objectively in
any way that is better than another.


True that, but there are two aspects here:

1. What you describe sounds more like an inherent problem of "measuring"
quality: as soon as you measure it, it becomes a quantity. This is not
specific to data. Still, it is a viable helping construct, i.e. you
measure something and then, in the interpretation, you can judge quality
for better or for worse. There is some merit in quantification for data,
e.g. [1] and [2].


[1] SHACL predecessor: 
http://svn.aksw.org/papers/2014/WWW_Databugger/public.pdf


[2] https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf

2. Data is not well understood yet, in my opinion. There is no good model
yet for measuring its value once it becomes information.


Please see below:



As an alternative, however, I can suggest that you separately study some
quality dimensions assuming a particular use case for your study; this
will be correct, doable and greatly appreciated. :-) Please feel free to
ask for help in case you need it, either personally or via this list or
other means. And thanks for your interest in improving Wikidata!


We are studying this in GlobalFactSync at the moment:
https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE/SyncTargets

The next step here is to define 10 or more sync targets for Wikidata and
then assess them as a baseline. Then we will extend the prototype so that
these sync targets are:

- in sync with Wikipedia's infoboxes, i.e. all information is transferred
in terms of data and citations on the wiki pages;

- near perfect, if we find a good reference source to integrate.

Does anyone have a suggestion, e.g. a certain part of Wikidata that could
serve as a testbed and that we can improve?

Please send it to us.

-- Sebastian




Regards,
David


On 8/24/19 13:54, Uwe Jung wrote:

Hello,

As the importance of Wikidata increases, so do the demands on the
quality of the data. I would like to put the following proposal up for
discussion.

Two basic ideas:

  1. Each Wikidata page (item) is scored after each editing. This score
 should express different dimensions of data quality in a quickly
 manageable way.
  2. A property is created via which the item refers to the score value.
 Certain qualifiers can be used for a more detailed description (e.g.
 time of calculation, algorithm used to calculate the score value, etc.).


The score value can be calculated either within Wikibase after each data
change or "externally" by a bot. For the calculation can be used among
other things: Number of constraints, completeness of references, degree
of completeness in relation to the underlying ontology, etc. There are
already some interesting discussions on the question of data quality
which can be used here ( see
https://www.wikidata.org/wiki/Wikidata:Item_quality;
https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc).

Advantages

   * Users get a quick overview of the quality of a page (item).
   * SPARQL can be used to query only those items that meet a certain
 quality level.
   * The idea would probably be relatively easy to implement.


Disadvantage:

   * In a way, the data model is abused by generating statements that no
 longer describe the item itself, but make statements about the
 representation of this item in Wikidata.
   * Additional computing power must be provided for the regular
 calculation of all changed items.
   * Only the quality of pages is referred to. If it is insufficient, the
 changes still have to be made manually.


I would now be interested in the following:

  1. Is this idea suitable to effectively help solve existing quality
 problems?
  2. Which quality dimensions should the score value represent?
  3. Which quality dimension can be calculated with reasonable effort?
  4. How to calculate and represent them?
  5. Which is the most suitable way to further discuss and implement this
 idea?


Many thanks in advance.

Uwe Jung  (UJung )
www.archivfuehrer-kolonialzeit.de/thesau

Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata (next round)

2019-08-27 Thread Magnus Sälgö
Uwe, I feel this is more and more important: quality and provenance, and
also communicating the quality of our data inside Wikidata.

I have added what is maybe the best source for biographies in Sweden,
P3217, to about 7,500 person items in Wikidata. In Wikipedia those 7,500
objects are used in more than 200 different language editions, so we need
a "layer" explaining that data confirmed with P3217 ("SBL" from Sweden)
has very high trust.

See https://phabricator.wikimedia.org/T222142

I can also see this quality problem in that Nobelprize.org and Wikidata
have more than 30 differences, and it is sometimes difficult to understand
the quality of the sources in Wikidata; the fact that Nobelprize.org cites
no sources makes the equation difficult.
https://phabricator.wikimedia.org/T200668

Regards
Magnus Sälgö
0046-705937579
salg...@msn.com

A blogpost I wrote 
https://minancestry.blogspot.com/2018/04/wikidata-has-design-problem.html

On 28 Aug 2019 at 03:49, Uwe Jung wrote:

Hello,

many thanks for the answers to my contribution from 24.8.
I think that all four opinions contain important things to consider.

@David Abián
I have read the article and agree that in the end the users decide which data 
is good for them or not.

@GerardM
It is true that in a possible implementation of the idea, the aspect of 
computing load must be taken into account right from the beginning.

Please check that I have not given up on the idea yet. With regard to the 
acceptance of Wikidata, I consider a quality indicator of some kind to be 
absolutely necessary. There will be a lot of ordinary users who would like to 
see something like this.

At the same time I completely agree with David;(almost) every chosen indicator 
is subject to a certain arbitrariness in the selection. There won't be one easy 
to understand super-indicator.
So, let's approach things from the other side. Instead of a global indicator, a 
separate indicator should be developed for each quality dimension to be 
considered. With some dimensions this should be relatively easy. For others it 
could take years until we have agreed on an algorithm for their calculation.

Furthermore, the indicators should not represent discrete values but a 
continuum of values. No traffic light statements (i.e.: good, medium, bad) 
should be made. Rather, when displaying the qualifiers, the value could be 
related to the values of all other objects (e.g. the value x for the current 
data object in relation to the overall average for all objects for this 
indicator). The advantage here is that the total average can increase over 
time, meaning that the position of the value for an individual object can also 
decrease over time.

Another advantage: Users can define the required quality level themselves. If, 
for example, you have high demands on accuracy but few demands on the 
completeness of the statements, you can do this.

However, it remains important that these indicators (i.e. the evaluation of the 
individual item) must be stored together with the item and can be queried 
together with the data using SPARQL.

Greetings

Uwe Jung

On Sat, 24 Aug 2019 at 13:54, Uwe Jung wrote:
Hello,

As the importance of Wikidata increases, so do the demands on the quality of 
the data. I would like to put the following proposal up for discussion.

Two basic ideas:

  1.  Each Wikidata page (item) is scored after each editing. This score should 
express different dimensions of data quality in a quickly manageable way.
  2.  A property is created via which the item refers to the score value. 
Certain qualifiers can be used for a more detailed description (e.g. time of 
calculation, algorithm used to calculate the score value, etc.).

The score value can be calculated either within Wikibase after each data change 
or "externally" by a bot. For the calculation can be used among other things: 
Number of constraints, completeness of references, degree of completeness in 
relation to the underlying ontology, etc. There are already some interesting 
discussions on the question of data quality which can be used here ( see  
https://www.wikidata.org/wiki/Wikidata:Item_quality; 
https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc).

Advantages

  *   Users get a quick overview of the quality of a page (item).
  *   SPARQL can be used to query only those items that meet a certain quality 
level.
  *   The idea would probably be relatively easy to implement.

Disadvantage:

  *   In a way, the data model is abused by generating statements that no 
longer describe the item itself, but make statements about the representation 
of this item in Wikidata.
  *   Additional computing power must be provided for the regular calculation 
of all changed items.
  *   Only the quality of pages is referred to. If it is insuffic

Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata (next round)

2019-08-27 Thread Uwe Jung
Hello,

many thanks for the answers to my contribution from 24 August.
I think that all four opinions contain important points to consider.

@David Abián
I have read the article and agree that, in the end, the users decide which
data is good for them and which is not.

@GerardM
It is true that in a possible implementation of the idea, the aspect of
computing load must be taken into account right from the beginning.

Please note that I have not given up on the idea yet. With regard to the
acceptance of Wikidata, I consider a quality indicator of some kind to be
absolutely necessary. There will be a lot of ordinary users who would like
to see something like this.

At the same time I completely agree with David: (almost) every chosen
indicator is subject to a certain arbitrariness in its selection. There
won't be one easy-to-understand super-indicator.
So let's approach things from the other side. Instead of a global
indicator, a separate indicator should be developed for each quality
dimension to be considered. For some dimensions this should be relatively
easy; for others it could take years until we have agreed on an algorithm
for their calculation.

Furthermore, the indicators should not be discrete values but a continuum
of values. No traffic-light statements (i.e. good, medium, bad) should be
made. Rather, when displaying the indicators, the value could be related
to the values of all other objects (e.g. the value x for the current data
object in relation to the overall average across all objects for this
indicator). The advantage here is that the overall average can increase
over time, meaning that the relative position of an individual object's
value can also decrease over time.

Another advantage: users can define the required quality level themselves.
If, for example, you have high demands on accuracy but few demands on the
completeness of the statements, you can express exactly that.

However, it remains important that these indicators (i.e. the evaluation
of the individual item) be stored together with the item and be queryable
together with the data using SPARQL.

Greetings

Uwe Jung

On Sat, 24 Aug 2019 at 13:54, Uwe Jung wrote:

> Hello,
>
> As the importance of Wikidata increases, so do the demands on the quality
> of the data. I would like to put the following proposal up for discussion.
>
> Two basic ideas:
>
>1. Each Wikidata page (item) is scored after each editing. This score
>should express different dimensions of data quality in a quickly manageable
>way.
>2. A property is created via which the item refers to the score value.
>Certain qualifiers can be used for a more detailed description (e.g. time
>of calculation, algorithm used to calculate the score value, etc.).
>
>
> The score value can be calculated either within Wikibase after each data
> change or "externally" by a bot. For the calculation can be used among
> other things: Number of constraints, completeness of references, degree of
> completeness in relation to the underlying ontology, etc. There are already
> some interesting discussions on the question of data quality which can be
> used here ( see  https://www.wikidata.org/wiki/Wikidata:Item_quality;
> https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc).
>
> Advantages
>
>- Users get a quick overview of the quality of a page (item).
>- SPARQL can be used to query only those items that meet a certain
>quality level.
>- The idea would probably be relatively easy to implement.
>
>
> Disadvantage:
>
>- In a way, the data model is abused by generating statements that no
>longer describe the item itself, but make statements about the
>representation of this item in Wikidata.
>- Additional computing power must be provided for the regular
>calculation of all changed items.
>- Only the quality of pages is referred to. If it is insufficient, the
>changes still have to be made manually.
>
>
> I would now be interested in the following:
>
>1. Is this idea suitable to effectively help solve existing quality
>problems?
>2. Which quality dimensions should the score value represent?
>3. Which quality dimension can be calculated with reasonable effort?
>4. How to calculate and represent them?
>5. Which is the most suitable way to further discuss and implement
>this idea?
>
>
> Many thanks in advance.
>
> Uwe Jung  (UJung )
> www.archivfuehrer-kolonialzeit.de/thesaurus
>
>
>


Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata

2019-08-24 Thread Imre Samu
TL;DR: it would be useful, but it is extremely hard to create rules for
every domain.

>4. How to calculate and represent them?

imho: it depends on the data domain.

For geodata (human settlements/rivers/mountains/...) with GPS
coordinates, my simple rules are:
- if it has a local Wikipedia page or a page in any big language
["EN/FR/PT/ES/RU/..."], then it is OK;
- if it is only in "cebuano" AND outside the "cebuano" BBOX -> lower
quality;
- only {shwiki+srwiki} AND outside the "sh"&"sr" BBOX -> lower quality;
- only {huwiki} AND outside the CentralEuropeBBOX -> lower quality;
- geodata without GPS coordinates -> ...
- ...
So my rules are based on Wikipedia pages and language areas, and I prefer
Wikidata items with local Wikipedia pages.

This is based on my experience adding Wikidata ID concordances to
NaturalEarth ( https://www.naturalearthdata.com/blog/ ); see the sketch
below.
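A minimal sketch of these heuristics, under assumptions: "sitelinks" is
the set of Wikipedia site codes an item has (e.g. {"cebwiki"}), lon/lat
come from the item's coordinate (P625), the bounding boxes are rough
illustrative values only, and "local Wikipedia page" is not modelled.

# Sketch of the BBOX-based scoring rules above (illustrative only).
BIG_WIKIS = {"enwiki", "frwiki", "ptwiki", "eswiki", "ruwiki", "dewiki"}

# (min_lon, min_lat, max_lon, max_lat) -- rough, illustrative boxes only.
BBOX = {
    "cebwiki": (116.0, 4.0, 127.0, 21.5),    # roughly the Philippines
    "sh_sr":   (13.0, 40.5, 23.5, 47.0),     # roughly the western Balkans
    "huwiki":  (5.0, 42.0, 30.0, 55.0),      # roughly Central Europe
}

def in_bbox(lon, lat, box):
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def geo_quality(sitelinks, lon=None, lat=None):
    """Return a coarse quality label following the rules above."""
    if lon is None or lat is None:
        return "no-coordinates"
    if sitelinks & BIG_WIKIS:
        return "ok"                          # has a big-language Wikipedia page
    if sitelinks == {"cebwiki"} and not in_bbox(lon, lat, BBOX["cebwiki"]):
        return "lower-quality"
    if sitelinks and sitelinks <= {"shwiki", "srwiki"} and not in_bbox(lon, lat, BBOX["sh_sr"]):
        return "lower-quality"
    if sitelinks == {"huwiki"} and not in_bbox(lon, lat, BBOX["huwiki"]):
        return "lower-quality"
    return "unclassified"

# Hypothetical usage: a cebwiki-only item located in South America.
print(geo_quality({"cebwiki"}, lon=-60.0, lat=-3.0))   # -> "lower-quality"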


>5. Which is the most suitable way to further discuss and implement this
idea?

imho: load the Wikidata dump into a local database and create
- some "proof of concept" quality data indicators,
- some "meta" rules,
- some "real" statistics,
so the community can decide whether it is useful or not. (A sketch of the
dump-streaming step is below.)
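Only as a sketch of the first step: the standard Wikidata JSON dump
(latest-all.json.gz) can be streamed entity by entity without loading it
all into memory; counting items with coordinates (P625) is used here as a
stand-in for a real indicator.

# Sketch: stream entities from the Wikidata JSON dump (one JSON object per
# line, wrapped in a top-level array) and count items with a P625
# coordinate. Illustrative only; real indicators would go here instead.
import gzip
import json

def iter_entities(path="latest-all.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):   # skip the array brackets
                continue
            yield json.loads(line)

def count_items_with_coordinates(path="latest-all.json.gz"):
    with_coords = total = 0
    for entity in iter_entities(path):
        if entity.get("type") != "item":
            continue
        total += 1
        if "P625" in entity.get("claims", {}):
            with_coords += 1
    return with_coords, total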



Imre

Uwe Jung wrote (on 24 Aug 2019, Sat, 14:55):

> Hello,
>
> As the importance of Wikidata increases, so do the demands on the quality
> of the data. I would like to put the following proposal up for discussion.
>
> Two basic ideas:
>
>1. Each Wikidata page (item) is scored after each editing. This score
>should express different dimensions of data quality in a quickly manageable
>way.
>2. A property is created via which the item refers to the score value.
>Certain qualifiers can be used for a more detailed description (e.g. time
>of calculation, algorithm used to calculate the score value, etc.).
>
>
> The score value can be calculated either within Wikibase after each data
> change or "externally" by a bot. For the calculation can be used among
> other things: Number of constraints, completeness of references, degree of
> completeness in relation to the underlying ontology, etc. There are already
> some interesting discussions on the question of data quality which can be
> used here ( see  https://www.wikidata.org/wiki/Wikidata:Item_quality;
> https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc).
>
> Advantages
>
>- Users get a quick overview of the quality of a page (item).
>- SPARQL can be used to query only those items that meet a certain
>quality level.
>- The idea would probably be relatively easy to implement.
>
>
> Disadvantage:
>
>- In a way, the data model is abused by generating statements that no
>longer describe the item itself, but make statements about the
>representation of this item in Wikidata.
>- Additional computing power must be provided for the regular
>calculation of all changed items.
>- Only the quality of pages is referred to. If it is insufficient, the
>changes still have to be made manually.
>
>
> I would now be interested in the following:
>
>1. Is this idea suitable to effectively help solve existing quality
>problems?
>2. Which quality dimensions should the score value represent?
>3. Which quality dimension can be calculated with reasonable effort?
>4. How to calculate and represent them?
>5. Which is the most suitable way to further discuss and implement
>this idea?
>
>
> Many thanks in advance.
>
> Uwe Jung  (UJung )
> www.archivfuehrer-kolonialzeit.de/thesaurus
>
>
>


Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata

2019-08-24 Thread David Abián
Hi,

If we accept that the quality of the data is the "fitness for use",
which in my opinion is the best and most commonly used definition (as
stated in the article linked by Ettore), then it will never be possible
to define a number that objectively represents data quality. We can
define a number that is the result of an arbitrary weighted average of
different metrics related to various dimensions of quality arbitrarily
captured and transformed, and we can fool ourselves by saying that this
number represents data quality, but it will not, nor will it be an
approximation of what data quality means, nor will this number be able
to order Wikidata entities matching any common, understandable,
high-level criterion. The quality of the data depends on the use, it's
relative to each user, and can't be measured globally and objectively in
any way that is better than another.

As an alternative, however, I can suggest that you separately study some
quality dimensions assuming a particular use case for your study; this
will be correct, doable and greatly appreciated. :-) Please feel free to
ask for help in case you need it, either personally or via this list or
other means. And thanks for your interest in improving Wikidata!

Regards,
David


On 8/24/19 13:54, Uwe Jung wrote:
> Hello,
> 
> As the importance of Wikidata increases, so do the demands on the
> quality of the data. I would like to put the following proposal up for
> discussion.
> 
> Two basic ideas:
> 
>  1. Each Wikidata page (item) is scored after each editing. This score
> should express different dimensions of data quality in a quickly
> manageable way.
>  2. A property is created via which the item refers to the score value.
> Certain qualifiers can be used for a more detailed description (e.g.
> time of calculation, algorithm used to calculate the score value, etc.).
> 
> 
> The score value can be calculated either within Wikibase after each data
> change or "externally" by a bot. For the calculation can be used among
> other things: Number of constraints, completeness of references, degree
> of completeness in relation to the underlying ontology, etc. There are
> already some interesting discussions on the question of data quality
> which can be used here ( see 
> https://www.wikidata.org/wiki/Wikidata:Item_quality;
> https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc).
> 
> Advantages
> 
>   * Users get a quick overview of the quality of a page (item).
>   * SPARQL can be used to query only those items that meet a certain
> quality level.
>   * The idea would probably be relatively easy to implement.
> 
> 
> Disadvantage:
> 
>   * In a way, the data model is abused by generating statements that no
> longer describe the item itself, but make statements about the
> representation of this item in Wikidata.
>   * Additional computing power must be provided for the regular
> calculation of all changed items.
>   * Only the quality of pages is referred to. If it is insufficient, the
> changes still have to be made manually.
> 
> 
> I would now be interested in the following:
> 
>  1. Is this idea suitable to effectively help solve existing quality
> problems?
>  2. Which quality dimensions should the score value represent?
>  3. Which quality dimension can be calculated with reasonable effort?
>  4. How to calculate and represent them?
>  5. Which is the most suitable way to further discuss and implement this
> idea?
> 
> 
> Many thanks in advance.
> 
> Uwe Jung  (UJung )
> www.archivfuehrer-kolonialzeit.de/thesaurus
> 
> 
> 
> 
> 

-- 
David Abián



Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata

2019-08-24 Thread Gerard Meijssen
Hoi,
What is it that you hope to achieve by this? It will add to the time it
takes to process an edit, which is a luxury we cannot afford. It is also
not something that would influence my edits.
Thanks,
 GerardM

On Sat, 24 Aug 2019 at 13:55, Uwe Jung  wrote:

> Hello,
>
> As the importance of Wikidata increases, so do the demands on the quality
> of the data. I would like to put the following proposal up for discussion.
>
> Two basic ideas:
>
>1. Each Wikidata page (item) is scored after each editing. This score
>should express different dimensions of data quality in a quickly manageable
>way.
>2. A property is created via which the item refers to the score value.
>Certain qualifiers can be used for a more detailed description (e.g. time
>of calculation, algorithm used to calculate the score value, etc.).
>
>
> The score value can be calculated either within Wikibase after each data
> change or "externally" by a bot. For the calculation can be used among
> other things: Number of constraints, completeness of references, degree of
> completeness in relation to the underlying ontology, etc. There are already
> some interesting discussions on the question of data quality which can be
> used here ( see  https://www.wikidata.org/wiki/Wikidata:Item_quality;
> https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc).
>
> Advantages
>
>- Users get a quick overview of the quality of a page (item).
>- SPARQL can be used to query only those items that meet a certain
>quality level.
>- The idea would probably be relatively easy to implement.
>
>
> Disadvantage:
>
>- In a way, the data model is abused by generating statements that no
>longer describe the item itself, but make statements about the
>representation of this item in Wikidata.
>- Additional computing power must be provided for the regular
>calculation of all changed items.
>- Only the quality of pages is referred to. If it is insufficient, the
>changes still have to be made manually.
>
>
> I would now be interested in the following:
>
>1. Is this idea suitable to effectively help solve existing quality
>problems?
>2. Which quality dimensions should the score value represent?
>3. Which quality dimension can be calculated with reasonable effort?
>4. How to calculate and represent them?
>5. Which is the most suitable way to further discuss and implement
>this idea?
>
>
> Many thanks in advance.
>
> Uwe Jung  (UJung )
> www.archivfuehrer-kolonialzeit.de/thesaurus
>
>
>


Re: [Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata

2019-08-24 Thread Ettore RIZZA
Hello,

Very interesting idea. Just to feed the discussion, here is a very recent
literature survey on data quality in Wikidata:
https://opensym.org/wp-content/uploads/2019/08/os19-paper-A17-piscopo.pdf

Cheers,

Ettore Rizza



On Sat, 24 Aug 2019 at 13:55, Uwe Jung  wrote:

> Hello,
>
> As the importance of Wikidata increases, so do the demands on the quality
> of the data. I would like to put the following proposal up for discussion.
>
> Two basic ideas:
>
>1. Each Wikidata page (item) is scored after each editing. This score
>should express different dimensions of data quality in a quickly manageable
>way.
>2. A property is created via which the item refers to the score value.
>Certain qualifiers can be used for a more detailed description (e.g. time
>of calculation, algorithm used to calculate the score value, etc.).
>
>
> The score value can be calculated either within Wikibase after each data
> change or "externally" by a bot. For the calculation can be used among
> other things: Number of constraints, completeness of references, degree of
> completeness in relation to the underlying ontology, etc. There are already
> some interesting discussions on the question of data quality which can be
> used here ( see  https://www.wikidata.org/wiki/Wikidata:Item_quality;
> https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc).
>
> Advantages
>
>- Users get a quick overview of the quality of a page (item).
>- SPARQL can be used to query only those items that meet a certain
>quality level.
>- The idea would probably be relatively easy to implement.
>
>
> Disadvantage:
>
>- In a way, the data model is abused by generating statements that no
>longer describe the item itself, but make statements about the
>representation of this item in Wikidata.
>- Additional computing power must be provided for the regular
>calculation of all changed items.
>- Only the quality of pages is referred to. If it is insufficient, the
>changes still have to be made manually.
>
>
> I would now be interested in the following:
>
>1. Is this idea suitable to effectively help solve existing quality
>problems?
>2. Which quality dimensions should the score value represent?
>3. Which quality dimension can be calculated with reasonable effort?
>4. How to calculate and represent them?
>5. Which is the most suitable way to further discuss and implement
>this idea?
>
>
> Many thanks in advance.
>
> Uwe Jung  (UJung )
> www.archivfuehrer-kolonialzeit.de/thesaurus
>
>
>


[Wikidata] Proposal for the introduction of a practicable Data Quality Indicator in Wikidata

2019-08-24 Thread Uwe Jung
Hello,

As the importance of Wikidata increases, so do the demands on the quality
of the data. I would like to put the following proposal up for discussion.

Two basic ideas:

   1. Each Wikidata page (item) is scored after each edit. This score
   should express different dimensions of data quality in a quickly
   manageable way.
   2. A property is created via which the item refers to the score value.
   Certain qualifiers can be used for a more detailed description (e.g.
   time of calculation, algorithm used to calculate the score value, etc.).


The score value can be calculated either within Wikibase after each data
change or "externally" by a bot. Among other things, the following can be
used for the calculation: number of constraint violations, completeness of
references, degree of completeness in relation to the underlying ontology,
etc. There are already some interesting discussions on the question of
data quality which can be used here (see
https://www.wikidata.org/wiki/Wikidata:Item_quality and
https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc.); a
rough sketch of one such externally calculated indicator follows below.
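As one possible illustration of the "externally by a bot" variant (a
sketch, not a worked-out proposal), the snippet below computes a single
crude dimension, the share of statements that carry at least one
reference, from an item's JSON; a real scorer would combine several such
dimensions.

# Crude single-dimension indicator: share of statements with at least one
# reference. Illustrative only; a real score would combine several
# dimensions (constraint violations, ontology completeness, ...).
import json
import urllib.request

def fetch_entity(qid):
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    req = urllib.request.Request(url, headers={"User-Agent": "score-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["entities"][qid]

def referenced_share(qid):
    claims = fetch_entity(qid).get("claims", {})
    statements = [s for group in claims.values() for s in group]
    if not statements:
        return 0.0
    referenced = sum(1 for s in statements if s.get("references"))
    return referenced / len(statements)

print("Share of referenced statements on Q64:", round(referenced_share("Q64"), 2))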

Advantages

   - Users get a quick overview of the quality of a page (item).
   - SPARQL can be used to query only those items that meet a certain
   quality level.
   - The idea would probably be relatively easy to implement.


Disadvantages:

   - In a way, the data model is abused by generating statements that no
   longer describe the item itself, but make statements about the
   representation of this item in Wikidata.
   - Additional computing power must be provided for the regular
   calculation of all changed items.
   - The score only reports on the quality of a page; if it is
   insufficient, the changes still have to be made manually.


I would now be interested in the following:

   1. Is this idea suitable to effectively help solve existing quality
   problems?
   2. Which quality dimensions should the score value represent?
   3. Which quality dimensions can be calculated with reasonable effort?
   4. How should they be calculated and represented?
   5. What is the most suitable way to further discuss and implement this
   idea?


Many thanks in advance.

Uwe Jung (UJung)
www.archivfuehrer-kolonialzeit.de/thesaurus