This is a progress report about my attempt to match Wikidata items and OSM
objects automatically.

Here are some pages about adding Wikidata identifiers to OSM:
http://wiki.openstreetmap.org/wiki/Wikidata
http://wiki.openstreetmap.org/wiki/Proposed_features/Wikidata

The list of matches is available here, split up by English Wikipedia
category:

http://edwardbetts.com/osm-wikidata/

Some OSM/Wikidata items will appear in multiple categories.

Each page of results is sorted by distance, then by the English Wikidata
label. The results include links to the Wikidata item, the Wikidata location
shown on an OSM map, and the matched OSM object.

A quick recap about how my system works. I have a list of categories on
Wikipedia with the appropriate tags on OpenStreetMap. For example, articles in
the subcategories of the category "Airports by Country" should appear on the
map tagged as aeroway=aerodrome.
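
To illustrate, such a mapping could be sketched in Python as below. This is
not the actual code: the names CATEGORY_TAGS, parse_tag and expected_tags are
made up for this example, and only the airports entry comes from the
description above.

```python
# Illustrative mapping from a Wikipedia category to the OSM tags expected
# on matching objects. Only the airports entry is taken from the text;
# the lighthouse entry is a hypothetical example.
CATEGORY_TAGS = {
    "Airports by Country": ["aeroway=aerodrome"],
    "Lighthouses by country": ["man_made=lighthouse"],  # hypothetical
}


def parse_tag(tag):
    """Split a 'key=value' string into a (key, value) pair."""
    key, _, value = tag.partition("=")
    return key, value


def expected_tags(category):
    """Return the (key, value) pairs expected for a category, or []."""
    return [parse_tag(tag) for tag in CATEGORY_TAGS.get(category, [])]
```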

I use a Wikimedia Labs tool called CatScan to get a list of every article in
the category or subcategory: https://tools.wmflabs.org/catscan2/catscan2.php

For each article in English Wikipedia there is a matching item in Wikidata. I
use the Wikidata API to find the Wikidata items within the category. Items
without coordinates are skipped.
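
As a rough sketch of the "skip items without coordinates" step, the function
below reads coordinates from a response shaped like the output of the Wikidata
API's wbgetentities action, where coordinates are stored under property P625
(coordinate location). The function names are invented for this example, and
it is an assumption that only the first P625 claim matters.

```python
def extract_coordinates(entity):
    """Return (lat, lon) from the first usable P625 claim, or None."""
    for claim in entity.get("claims", {}).get("P625", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "novalue" and "somevalue" snaks carry no coordinates
        value = snak["datavalue"]["value"]
        return value["latitude"], value["longitude"]
    return None


def items_with_coordinates(response):
    """Return a list of (qid, lat, lon), skipping items without coordinates."""
    found = []
    for qid, entity in response.get("entities", {}).items():
        coords = extract_coordinates(entity)
        if coords is not None:
            found.append((qid, coords[0], coords[1]))
    return found
```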

Once all the categories are processed I have a list of Wikidata items that 
include coordinates and the label in multiple languages. I split this list up 
by coordinates into half degree squares. I use the Overpass API to look for OSM 
objects (nodes, ways and relations) with a name and the expected tags.
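
The two steps above can be sketched as follows: find the half degree square
that contains a point, then build an Overpass QL query for named objects with
a given tag inside that square. This is a simplified illustration, not the
real code; the function names and the choice of "out tags bb" (tags plus
bounding box) are assumptions.

```python
import math

CELL_SIZE = 0.5  # size of a grid square in degrees


def cell_bbox(lat, lon, size=CELL_SIZE):
    """Return (south, west, north, east) of the grid square holding a point."""
    south = math.floor(lat / size) * size
    west = math.floor(lon / size) * size
    return south, west, south + size, west + size


def overpass_query(bbox, key, value):
    """Build an Overpass QL query for named objects with key=value in bbox."""
    south, west, north, east = bbox
    area = f"({south},{west},{north},{east})"
    selector = f'["{key}"="{value}"]["name"]'
    return (
        "[out:json];\n"
        "(\n"
        f"  node{selector}{area};\n"
        f"  way{selector}{area};\n"
        f"  relation{selector}{area};\n"
        ");\n"
        "out tags bb;"  # return tags and each object's bounding box
    )
```

For example, overpass_query(cell_bbox(51.497, -0.144), "aeroway", "aerodrome")
gives a query covering the square from 51.0,-0.5 to 51.5,0.0.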

The acceptable distance for most objects is 1km; for some entity types it has
been increased further. I've included a distance field in my results, so you
can see how far apart the matched items are.
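
One common way to compute such a distance is the haversine formula, sketched
below. Whether the real code uses haversine is an assumption, and the "lake"
limit is a hypothetical value to show a per-type override; only the 1km
default comes from the text.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Maximum allowed distance per entity type. The "lake" value is hypothetical.
MAX_DISTANCE_KM = {"default": 1.0, "lake": 5.0}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_range(distance_km, entity_type="default"):
    """True if the distance is acceptable for this entity type."""
    limit = MAX_DISTANCE_KM.get(entity_type, MAX_DISTANCE_KM["default"])
    return distance_km <= limit
```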

The names in the OSM object are compared with the labels and aliases in the
Wikidata item. The code looks at the various name keys listed in the
http://wiki.openstreetmap.org/wiki/Key:name page. I exclude old_name from the
comparison.
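
A simplified sketch of this comparison is below. It is not the actual
matching code: the set of name keys is a subset picked for illustration, and
exact comparison after lower-casing is an assumption (the real comparison may
well be looser).

```python
# Subset of the name keys from the Key:name wiki page; old_name is
# deliberately left out, so old_name and old_name:<lang> are ignored.
NAME_KEYS = {"name", "alt_name", "int_name", "loc_name", "nat_name",
             "official_name", "reg_name", "short_name"}


def normalize(text):
    """Lower-case and collapse whitespace for comparison."""
    return " ".join(text.casefold().split())


def osm_names(tags):
    """Collect values of name-like tags, including name:<lang> variants."""
    names = set()
    for key, value in tags.items():
        base = key.split(":", 1)[0]  # "name:en" -> "name"
        if base in NAME_KEYS:
            # Multiple values in one tag are separated by semicolons.
            names.update(v.strip() for v in value.split(";") if v.strip())
    return names


def names_match(tags, labels):
    """True if any OSM name equals any Wikidata label or alias."""
    wanted = {normalize(label) for label in labels}
    return any(normalize(name) in wanted for name in osm_names(tags))
```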

The matching code also considers addr:housename. When a Wikidata item label
is a street address, it can be matched against a building's addr:housenumber
and addr:street tags. For example, "8 Canada Square" will match a building
tagged with "addr:housenumber=8" and "addr:street=Canada Square".
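
A minimal sketch of the address comparison, assuming the label has the form
"<housenumber> <street>" (which suits English-style addresses) and that a
case-insensitive exact comparison is enough. The function names are invented
for this example.

```python
def address_candidates(tags):
    """Build the strings a Wikidata label could match from addr:* tags."""
    candidates = set()
    housename = tags.get("addr:housename")
    if housename:
        candidates.add(housename)
    number = tags.get("addr:housenumber")
    street = tags.get("addr:street")
    if number and street:
        candidates.add(f"{number} {street}")
    return candidates


def address_match(tags, label):
    """True if the label equals the house name or 'number street'."""
    wanted = label.casefold()
    return any(c.casefold() == wanted for c in address_candidates(tags))
```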

The Overpass API can calculate the centroid of an OSM object, which is what I
used in the past. I've switched to using the bounding box of the object, which
gives better results for large objects like lakes and forests.
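
One way to use a bounding box, sketched below, is to measure from the
Wikidata point to the nearest point of the box, so that a point anywhere
inside a large lake counts as distance zero. This is an assumption about how
the box could be used, not a description of the real code. It also ignores
boxes that cross the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_bbox_km(lat, lon, bbox):
    """Distance from a point to a (south, west, north, east) box; 0 if inside."""
    south, west, north, east = bbox
    # Clamp the point into the box to find the nearest point of the box.
    nearest_lat = min(max(lat, south), north)
    nearest_lon = min(max(lon, west), east)
    return haversine_km(lat, lon, nearest_lat, nearest_lon)
```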

The result is that I now have a list of 176,794 OSM objects and matching
Wikidata items. The whole process of extracting the data and looking for
matches takes about three days to run. This is after quite a few changes 
to speed it up. I think there are still more improvements possible. I will
post the code on GitHub soon.

It has been suggested that I shouldn't be using Wikipedia at all, and that I
should instead be looking at the 'instance of' property in Wikidata. Using
English Wikipedia introduces an English-language bias, because there are items
in Wikidata without an associated article in English Wikipedia. I use
Wikipedia categories because use of the 'instance of' property is very
patchy. The majority of the items in my result list don't include the
'instance of' property. A related piece of work will be to populate this
field in Wikidata, but for now I'm focused on linking OSM and Wikidata.

The system gets confused by chains of restaurants and shops. The Wikidata item
will often include the coordinates of the headquarters. The name will match
with a nearby store. I should be able to fix this by filtering out Wikidata
chain store items.

Example: John Lewis - UK department store chain
https://www.wikidata.org/wiki/Q1918981
Wikidata coordinates are 51.497, -0.144 near Victoria station.
https://www.openstreetmap.org/?mlat=51.497&mlon=-0.14434#map=16/51.4970/-0.1443
The match is for the flagship store at Oxford Circus, 2km from the HQ.
http://www.openstreetmap.org/node/31314236

Some of the coordinates in Wikipedia and Wikidata are wrong. There are many
cases where the location in Wikidata is 5km or more from where it should be.

London Hackspace moved from Islington to Hackney in 2009. The location has
been updated on OSM, but Wikidata still has the old location:

http://wikidata.org/wiki/Q6670461
http://www.openstreetmap.org/browse/node/2218654057

There are two pubs in London called Barley Mow that are less than 1km apart;
both are mapped on OSM. One of the pubs has an item in Wikidata (Q17985738).
My code is matching it to the wrong pub. I will fix this.

http://wikidata.org/wiki/Q17985738
is
http://www.openstreetmap.org/way/148011247
not
http://www.openstreetmap.org/node/462025244

When checking the results for fountains I found that the Butt-Millet Memorial 
Fountain is mapped twice in different locations:

http://wikidata.org/wiki/Q5002757
http://www.openstreetmap.org/way/238456703
http://www.openstreetmap.org/node/358955161

There are already 25k things with a Wikidata tag in OSM. When I compare this
list with my generated list I find 2,000 cases where a given Wikidata ID is
assigned to a different OSM object from the one picked by my system. In
addition there are 26 OSM objects with a wikidata tag pointing at a different
Wikidata item.
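
The comparison above could be done along these lines. This is a sketch with
invented names and identifiers; it assumes both lists are available as
(OSM object, Wikidata ID) pairs and that an OSM object carries at most one
wikidata tag.

```python
from collections import defaultdict


def find_conflicts(existing, generated):
    """Compare existing wikidata tags in OSM with generated matches.

    Both arguments are lists of (osm_id, qid) pairs. Returns two lists:
    (qid, osm_id) pairs where the Wikidata ID is already on a different OSM
    object, and (osm_id, qid) pairs where the OSM object is already tagged
    with a different Wikidata ID.
    """
    osm_by_qid = defaultdict(set)
    qid_by_osm = {}
    for osm_id, qid in existing:
        osm_by_qid[qid].add(osm_id)
        qid_by_osm[osm_id] = qid

    different_object = []
    different_item = []
    for osm_id, qid in generated:
        if qid in osm_by_qid and osm_id not in osm_by_qid[qid]:
            different_object.append((qid, osm_id))
        if osm_id in qid_by_osm and qid_by_osm[osm_id] != qid:
            different_item.append((osm_id, qid))
    return different_object, different_item
```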

Many of these mismatches are villages, towns and municipalities in Germany.
One possible solution is to exclude settlements in Germany from my list. It
looks like Germany doesn't need an automated import of Wikidata tags.

There are many more OSM objects tagged with a wikipedia tag. I haven't tried
comparing them to my results.

I'm going to continue to refine my results and reduce the number of false
positives. Once I'm happy with the list I'll post it here. When we have
reached consensus I'll add the Wikidata tags to OSM. I won't upload my results
as a single changeset; I'll split them up by region, maybe in one degree
squares.

-- 
Edward.

_______________________________________________
talk mailing list
talk@openstreetmap.org
https://lists.openstreetmap.org/listinfo/talk
