> Tomas, this is what I understand from what you are saying:
> * You download a geotagging wikidata dump and generate a table with
> latitude, longitude, and a wiki page title.
> * You also generate the same table from OSM for all nodes, ways (using geo
> centroid?), and relations (using ??)
> * you compare article titles between the two, and when OSM has something
> that Wikipedia doesn't, you search automatically by geo proximity, or you
> let users fix it or ??

  Relations (abstract collections, not multipolygons as such) and long
ways, such as rivers, are ignored. There are different mechanisms to
sort those out.

  Found problems are placed on a problem list, which is then reviewed
by users; research is done and, when possible, the problem is fixed on
the offending side (OSM or Wikipedia).
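  The comparison step above can be sketched roughly like this (a
minimal illustration only; the field names and sample titles are
hypothetical, not the actual tooling):

```python
# Sketch: compare article titles between an OSM extract and a
# Wikipedia geo dump, and collect mismatches into a problem list
# for human review. Data here is hand-made for illustration.

osm_articles = {
    # title -> (lat, lon), e.g. from tagged nodes or way centroids
    "Gedimino pilis": (54.6868, 25.2905),
}
wiki_articles = {
    # title -> (lat, lon), from the geotagged article dump
    "Gedimino pilis": (54.6867, 25.2904),
    "Trakai Island Castle": (54.6525, 24.9343),
}

# Articles tagged in OSM but absent from the Wikipedia geo dump
missing_in_wiki = set(osm_articles) - set(wiki_articles)

# Articles with coordinates in Wikipedia but no matching OSM object
missing_in_osm = set(wiki_articles) - set(osm_articles)

problems = [("missing in wikipedia", t) for t in sorted(missing_in_wiki)]
problems += [("missing in osm", t) for t in sorted(missing_in_osm)]
```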

> If I understood you correctly (and please correct my understanding if I did
> not), it wouldn't work for the whole planet, simply because the average
> distance between what OSM has and what Wikidata has is far too great to be
> useful.

  If the coordinates are too far apart, it is reported as an error and
has to be fixed. Usually this is a case of incorrect coordinates in
Wikipedia, caused by copying another article about a similar object
(say, a hillfort or a lake) and forgetting to update the coordinates.
There were cases when objects in Lithuania had coordinates in Africa
:-) Such cases were identified just as successfully as "closer"
mismatches. It does not matter whether the distance is 5 km or 5000 km.
  The approximation we use is something like 1 km, which is way
smaller than Lithuania :-) So I do not see why this mechanism would
not work globally.
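  The distance check itself is simple; a sketch using the standard
haversine great-circle formula (the 1 km threshold is from above, the
function names are my own):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

THRESHOLD_KM = 1.0  # the ~1 km approximation mentioned above

def coords_match(osm, wiki):
    """True if the OSM and Wikipedia coordinates agree within the threshold."""
    return haversine_km(*osm, *wiki) <= THRESHOLD_KM
```

Whether the mismatch is 5 km or 5000 km, the same check flags it:
`coords_match((54.69, 25.29), (-1.28, 36.82))` (Vilnius vs. Nairobi)
is just as false as a 5 km offset.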

> current state of the world OSM data is that there are only 17% of nodes are
> within 10 meters of their Wikidata counterpart.

  It is not important for the coordinates to be exactly the same. For
example, if you have a coordinate for a lake or even a hillfort, any
coordinate within a radius of a hundred metres (for a hillfort) or
even more (for a lake) is perfectly OK. You can tell from the
Wikipedia data what type of object it is: a waterbody or something
else. So it is possible to adjust the proximity setting per object
type.
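  Per-type thresholds could look something like this (the type names,
radii other than the two mentioned above, and the fallback value are
all assumptions for illustration):

```python
# Hypothetical per-type proximity thresholds, keyed on the object type
# inferred from the Wikipedia data (e.g. waterbody vs. something else).
RADIUS_KM = {
    "waterbody": 1.0,   # lakes: a loose match is fine
    "hillfort": 0.1,    # ~100 m, as in the example above
}
DEFAULT_RADIUS_KM = 0.5  # fallback for unclassified objects (assumed)

def allowed_radius_km(object_type):
    """Proximity threshold in km for a given object type."""
    return RADIUS_KM.get(object_type, DEFAULT_RADIUS_KM)
```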

> If we count ways and
> relations, it drops to 11% -- http://tinyurl.com/ybp4tp7a

  This is what we've seen in the beginning before starting to fix the data.

> In other words, with your approach, you can detect when OSM's wikipedia tag
> is no longer correct, because Wikipedia geo dump no longer has it. But
> afterwards you have to go and fix it by hand.  And this is pretty much the
> only operation you can do with this approach.  You cannot analyze tens of
> thousands of existing wikipedia tags that are pointing to links, disambigs,
> people, tree species, places of business - you can simply mark them as "geo
> missing in Wikipedia".

  Identifying them as "missing in wikipedia" proved to be enough.

> I took a quick look at the various quality control queries I built on the
> cleanup page.  Lithuania does seem pretty clean, with only one
> disambiguation at the moment (has been there for 4 months) -
> https://www.openstreetmap.org/node/1717783246 - but both have the same
> location, two airports that point to a list -
> https://www.openstreetmap.org/node/1042034645 and
> https://www.openstreetmap.org/node/1042034660 . None of these issues are
> possible to find with your approach, or detect renaming. For the rest of the
> world, the situation is much worse.

  All three are successfully identified in a large (435-item) problem
list of "objects with a wikipedia tag where the wikipedia article does
not have coordinates, or the coordinates are too far apart".

-- 
Tomas

_______________________________________________
talk mailing list
talk@openstreetmap.org
https://lists.openstreetmap.org/listinfo/talk