Tilomi created this task.
Tilomi added projects: Pywikibot, Wikidata.
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.

TASK DESCRIPTION
  Hi, I'm currently working on an open-source university project 
(https://openartbrowser.org/, 
https://github.com/hochschule-darmstadt/openartbrowser).
  
  We query several datasets about art from Wikidata. In our ETL process we 
fetch the data with the help of the pywikibot library. The library loads the 
Wikidata pages by their qIds, which we query beforehand with SPARQL 
(pagegenerators.WikidataSPARQLPageGenerator). The whole process of extracting 
around 150,000 entries took 47 hours when we last measured it at the end of 
October.
  
  We want to improve our crawler performance so that we can test new features 
faster.
  
  Our implementation can be viewed here 
https://github.com/hochschule-darmstadt/openartbrowser/blob/staging/scripts/Wikidata%20crawler/ArtOntologyCrawler.py
 in the extract_artworks function. This function first queries the qIds of all 
paintings, drawings and sculptures. Next we iterate over the items returned by 
the page generator. When measuring, I found that each item.get() call takes 
between 0.5 and 3-4 seconds. I assume that this is a page load for all the data 
on the page of a Wikidata entity.
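  
  A minimal sketch of how such a per-call measurement can be taken. The helper 
name timed_call is hypothetical and not part of pywikibot or of our crawler; it 
only wraps an arbitrary callable with time.perf_counter.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds).

    Hypothetical helper for illustration; not part of pywikibot.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

  Inside the crawler loop this would be used as 
`data, seconds = timed_call(item.get)`, assuming item is one of the objects 
returned by the page generator.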
  
  The only way I currently see to speed up this extraction is multi-threading, 
because Wikidata allows five queries in parallel (which equals five page 
loads). Querying the data directly with SPARQL does not seem possible because 
requests are very limited (see 
https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits).
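  
  A minimal sketch of the multi-threaded idea, using only the Python standard 
library. The names load_items_in_parallel and load_item are hypothetical; 
load_item stands for whatever performs one page load (for example a small 
wrapper around item.get()). Whether pywikibot is thread-safe, or whether its 
own request throttling would serialise the calls anyway, is an open assumption 
here and has not been verified.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption taken from the description above: five loads in parallel.
MAX_PARALLEL_LOADS = 5


def load_items_in_parallel(items, load_item, max_workers=MAX_PARALLEL_LOADS):
    """Apply load_item to every item using at most max_workers threads.

    Results are returned in the same order as the input items.
    load_item is a hypothetical callable that performs one page load.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order and re-raises the first exception
        # from a worker when its result is consumed.
        return list(pool.map(load_item, items))
```

  For scale: 47 hours for about 150,000 entries is roughly 1.1 seconds per 
entry on average. With five loads in flight the best case would be about a 
fifth of that runtime, roughly 9.4 hours, assuming the loads parallelise 
perfectly.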
  
  Maybe there is another way of solving this performance issue. 
  Best regards.

TASK DETAIL
  https://phabricator.wikimedia.org/T238471

To: Tilomi
Cc: Aklapper, pywikibot-bugs-list, Tilomi, Zkhalido, darthmon_wmde, Viztor, 
DannyS712, Nandana, Wenyi, Lahi, Gq86, GoranSMilovanovic, QZanden, Tbscho, 
MayS, LawExplorer, Mdupont, JJMC89, Dvorapa, _jensen, rosalieper, Altostratus, 
Avicennasis, Scott_WUaS, mys_721tx, Wikidata-bugs, aude, jayvdb, Dalba, Masti, 
Alchimista, Mbch331, Rxy