This is a progress report about my attempt to match Wikidata items and OSM objects automatically.
Here are some pages about adding Wikidata identifiers to OSM:

http://wiki.openstreetmap.org/wiki/Wikidata
http://wiki.openstreetmap.org/wiki/Proposed_features/Wikidata

The list is available here, split up by English Wikipedia category:

http://edwardbetts.com/osm-wikidata/

Some OSM/Wikidata items will appear in multiple categories. Each page of results is sorted by distance, then by the English Wikidata label. The results include links to Wikidata, the location on OSM from Wikidata and the matched OSM object.

A quick recap of how my system works.

I have a list of categories on Wikipedia with the appropriate tags on OpenStreetMap. For example, articles in the subcategories of the category "Airports by Country" should appear on the map tagged as aeroway=aerodrome.

I use a Wikimedia Labs tool called CatScan to get a list of every article in the category or its subcategories:

https://tools.wmflabs.org/catscan2/catscan2.php

For each article in English Wikipedia there is a matching item in Wikidata. I use the Wikidata API to find the Wikidata items within the category. Items without coordinates are skipped.

Once all the categories are processed I have a list of Wikidata items that include coordinates and labels in multiple languages. I split this list up by coordinates into half-degree squares.

I use the Overpass API to look for OSM objects (nodes, ways and relations) with a name and the expected tags. The acceptable distance for most objects is 1 km; for some entity types it has been increased further. I've included a distance field in my results, so you can see how far apart the matched items are.

The names in the OSM object are compared with the labels and aliases in the Wikidata item. The code looks at the various name keys listed on the http://wiki.openstreetmap.org/wiki/Key:name page. I exclude old_name from the comparison.
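
To make the name comparison above concrete, here is a minimal sketch in Python. It is not the actual matching code. The function names, the normalisation (lowercasing, dropping punctuation), the rule that any key ending in _name counts as a name key, and the splitting of values on ";" are all assumptions made for illustration. The tags in the usage lines are made up.

```python
import re

# Assumption: only old_name is excluded, as stated in the post.
EXCLUDED_KEYS = {"old_name"}


def normalize(text):
    """Lowercase and replace runs of punctuation/whitespace with one space."""
    return re.sub(r"[^\w]+", " ", text.lower()).strip()


def osm_names(tags):
    """Collect the values of name-like keys (name, name:en, alt_name, ...)."""
    names = set()
    for key, value in tags.items():
        base = key.split(":")[0]  # "name:en" -> "name"
        if base in EXCLUDED_KEYS:
            continue
        if base == "name" or base.endswith("_name"):
            # Assumption: several values may be separated by ';'
            names.update(v.strip() for v in value.split(";") if v.strip())
    return names


def names_match(tags, labels_and_aliases):
    """True if any OSM name equals any Wikidata label or alias after normalising."""
    wanted = {normalize(n) for n in labels_and_aliases}
    return any(normalize(n) in wanted for n in osm_names(tags))


# Illustrative data only, not taken from the real results.
tags = {"name": "Example Airport", "old_name": "Old Field", "aeroway": "aerodrome"}
print(names_match(tags, ["Example Airport", "Example Intl"]))  # True
print(names_match(tags, ["Old Field"]))  # False, old_name is ignored
```

Exact comparison after normalisation is the simplest choice. A real matcher would probably need to be more forgiving, for example about a leading "The" or a trailing entity type such as "Airport".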

The matching code considers addr:housename and can match buildings with Wikidata item labels that are street addresses to the addr:housenumber and addr:street tags. For example, "8 Canada Square" will match a building tagged with "addr:housenumber=8" and "addr:street=Canada Square".

The Overpass API can calculate the centroid of an OSM object; this is what I used in the past. I've switched to using the bounding box of the object, which gives better results for large objects like lakes and forests.

The result is that I now have a list of 176,794 OSM objects and matching Wikidata items.

The whole process of extracting the data and looking for matches takes about three days to run. This is after quite a few changes to speed it up. I think there are still more improvements possible. I will post the code on GitHub soon.

It has been suggested that I shouldn't be using Wikipedia at all and should instead be looking at the 'instance of' property in Wikidata. Using English Wikipedia introduces an English-language bias, because there are items in Wikidata without an associated article in English Wikipedia. The reason for using Wikipedia categories is that use of the 'instance of' property is very patchy. The majority of the items in my result list don't include the 'instance of' property. A related piece of work will be to populate this field in Wikidata, but for now I'm focused on linking OSM and Wikidata.

The system gets confused by chains of restaurants and shops. The Wikidata item will often include the coordinates of the headquarters, and the name will match a nearby store. I should be able to fix this by filtering out Wikidata chain store items.

Example: John Lewis - UK department store chain
https://www.wikidata.org/wiki/Q1918981

Wikidata coordinates are 51.497, -0.144, near Victoria station.
https://www.openstreetmap.org/?mlat=51.497&mlon=-0.14434#map=16/51.4970/-0.1443

The match is for the flagship store in Oxford Circus, 2 km from the HQ.
http://www.openstreetmap.org/node/31314236

Some of the coordinates in Wikipedia and Wikidata are wrong. There are many cases where the location in Wikidata is 5 km or more from where it should be.

London Hackspace moved from Islington to Hackney in 2009. The location has been updated on OSM, but Wikidata still has the old location:

http://wikidata.org/wiki/Q6670461
http://www.openstreetmap.org/browse/node/2218654057

There are two pubs in London called Barley Mow that are less than 1 km apart, and both are mapped on OSM. One of the pubs has an item in Wikidata (Q17985738). My code is matching it to the wrong pub. I will fix this.

http://wikidata.org/wiki/Q17985738
is http://www.openstreetmap.org/way/148011247
not http://www.openstreetmap.org/node/462025244

When checking the results for fountains I found that the Butt-Millet Memorial Fountain is mapped twice in different locations:

http://wikidata.org/wiki/Q5002757
http://www.openstreetmap.org/way/238456703
http://www.openstreetmap.org/node/358955161

There are already 25k objects with a wikidata tag in OSM. When I compare this list with my generated list I find 2,000 cases where a given Wikidata ID is assigned to a different OSM object from the one picked by my system. In addition, there are 26 OSM objects with a wikidata tag pointing at a different Wikidata item from the one my system matched.

Many of these mismatches are villages, towns and municipalities in Germany. One possible way to solve this is to exclude settlements in Germany from my list. It looks like Germany doesn't need an automated import of Wikidata tags.

There are many more OSM objects tagged with a wikipedia tag. I haven't tried comparing them to my results.

I'm going to continue to refine my results and reduce the number of false positives. Once I'm happy with the list I'll post it here. When we have reached consensus I'll add the Wikidata tags to OSM. I won't upload my results as a single changeset; I'll split it up by region, maybe into one-degree squares.

-- Edward.
_______________________________________________
talk mailing list
talk@openstreetmap.org
https://lists.openstreetmap.org/listinfo/talk