Tcp-ip created this task.
Tcp-ip added a project: Wikidata.
Herald added a subscriber: Aklapper.

TASK DESCRIPTION

I am querying Wikidata for the number of speakers of each language that has a Wikipedia, and I have found that my query returns some duplicated entries that I cannot explain.
For example, the following entry is returned twice:
{'language': 'http://www.wikidata.org/entity/Q29921', 'languageLabel': 'Inuktitut', 'nSpeakers': '30000'}
This happens although I can find no duplication in the item for that language, and the DISTINCT in my query should have eliminated it anyway.
I suspect that this is a bug in the query service rather than a mistake on my part, because the same query returns different results when executed a few seconds apart. Data about Inuktitut and Esperanto seem to be always affected, whereas data about other languages are duplicated only sometimes. For example, running the same query ten times, five seconds apart, I obtained the following numbers of duplicates: 34, 34, 24, 24, 6, 6, 6, 24, 6, 33.

Here is a Python 3 script that reproduces the problem. It executes 10 queries 5 seconds apart, printing the duplicated entries for the first query and only the duplicate count for the others:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import time

import requests


def queryWikidata(query):
    """Run a SPARQL query and return each row as a flat {variable: value} dict."""
    WIKIDATAQUERYURL = 'https://query.wikidata.org/sparql'
    data = requests.get(WIKIDATAQUERYURL,
                        params={'format': 'json', 'query': query}).json()
    # The SPARQL JSON response keeps the result rows under results -> bindings.
    data = data['results']['bindings']
    cleanData = []
    for i in data:
        cleanData.append({x: i[x]['value'] for x in i})
    return cleanData


def testQuery(echo):
    """Run the query once and return the number of distinct duplicated rows."""
    QUERY = """SELECT DISTINCT ?language ?languageLabel ?nSpeakers ?Lx ?LxLabel ?time ?country
WHERE
{
    ?language wdt:P31/wdt:P279* wd:Q34770.
    ?language p:P1098 ?nSpeakersStatement.
    ?nSpeakersStatement ps:P1098 ?nSpeakers
    optional {?nSpeakersStatement pq:P518 ?Lx}.
    optional {?nSpeakersStatement pq:P585 ?time}.
    optional {?nSpeakersStatement pq:P17 ?country}.
    FILTER EXISTS {?wikipedia wdt:P407 ?language}.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }

}"""
    foundDuplicates = []
    data = queryWikidata(QUERY)
    data.sort(key=lambda x: (x['language'], x['nSpeakers']))
    for i in data:
        # A row is a duplicate if an identical row occurs more than once.
        if data.count(i) > 1 and i not in foundDuplicates:
            if echo:
                print(i)
            foundDuplicates.append(i)
    return len(foundDuplicates)


duplicateCount = []
# First query: print the duplicated rows themselves.
duplicates = testQuery(True)
print('\nQuery', 1)
print('Duplicates:', duplicates)
duplicateCount.append(duplicates)
# Queries 2 to 10: print only the duplicate count, waiting 5 seconds between runs.
for i in range(2, 11):
    time.sleep(5)
    print('Query', i)
    duplicates = testQuery(False)
    print('Duplicates:', duplicates)
    duplicateCount.append(duplicates)

print(duplicateCount)
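
The duplicate check itself can also be tried offline, without contacting the query service. The following is only a minimal sketch, assuming rows shaped like the dictionaries returned by queryWikidata. The helper name findDuplicates and the sample rows are hypothetical: the Inuktitut row is the one quoted above, and the second row is made up purely for illustration. It uses collections.Counter so that every row is counted in a single pass instead of calling list.count once per row.

```python
from collections import Counter


def findDuplicates(rows):
    """Return {row_as_tuple: count} for every row that occurs more than once."""
    # Dicts are not hashable, so each row becomes a sorted tuple of its items.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Row quoted in the task description.
inuktitut = {'language': 'http://www.wikidata.org/entity/Q29921',
             'languageLabel': 'Inuktitut', 'nSpeakers': '30000'}
# Hypothetical row, made up for illustration only.
other = {'language': 'http://example.org/entity/X',
         'languageLabel': 'Example', 'nSpeakers': '1'}

sampleRows = [dict(inuktitut), other, dict(inuktitut)]
duplicates = findDuplicates(sampleRows)
for key, n in duplicates.items():
    print(dict(key), 'appears', n, 'times')
print('Duplicates:', len(duplicates))
```

Because the comparison is on the full row, two rows count as duplicates only if every returned variable has the same value, which is the same condition that SELECT DISTINCT is expected to enforce.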


TASK DETAIL
https://phabricator.wikimedia.org/T153108

To: Tcp-ip
Cc: Aklapper, Tcp-ip, D3r1ck01, Izno, Wikidata-bugs, aude, Mbch331