Archer <arc...@gulli.com> wrote:
> 2014-08-31 20:19 GMT+02:00 Edward Betts <edw...@4angle.com>:
> 
> > Archer <arc...@gulli.com> wrote:
> > > Please don’t get me wrong. I’m a big fan of Wikidata, but I'm against
> > > an automated import. The mismatches list gives good examples showing
> > > that your matching algorithm doesn't work very well:
> > > http://edwardbetts.com/osm-wikidata/mismatches.html
> > >
> > > Some examples:
> > >
> > > 1. Isar Nuclear Power Plant <http://wikidata.org/wiki/Q569510>: your
> > > algorithm matches only one reactor of the power plant: Isar 2
> > > <http://www.openstreetmap.org/way/32918120> but the right matching
> > > would be Kernkraftwerke
> > > Isar <http://www.openstreetmap.org/way/23802422>
> >
> > Q569510 is matching Isar 2 (Way 32918120) because Isar 2 is in the list of
> > German aliases in the Wikidata object:
> >
> > [ "KKW Isar", "AKW Isar", "Isar 2", "Kernkraftwerk Isar I", "Isar 1",
> >   "Atomkraftwerk Isar" ]
> >
> > The German label on the Wikidata item is "Kernkraftwerke Isar", notice the
> > extra 'e' on the end of the first word.
> >
> > I could add Levenshtein distance calculations to my matching; we could say
> > that names match if they differ by a single character. With this change
> > both OSM objects would match and my code would skip the Wikidata item.
> >
> > The problem with this change is that hill and hall would match.
> >
> Ok, but the Wikidata object describes the whole power plant and not only
> one reactor.
> 
> I'd propose to take "is a" (WD-Property: P31) into account. For example, in
> Wikidata Q569510 is classified as a nuclear power plant (Q134447), so the
> match algorithm should look for the corresponding OSM tags. For power plants
> the right tag would be power=plant. Otherwise there should be no match.

Thanks, that's the solution. My matching criteria included the
power=generator tag; I'll remove it, so that the only matches for a power
station are power=station and power=plant.
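
A minimal sketch of that tag check, assuming the matcher has already decided
what kind of thing the Wikidata item is. The table layout, the function name
and the example tag dictionaries are made up for illustration; only the
power=station, power=plant and power=generator values come from this thread.

```python
# Hypothetical table: kind of thing -> OSM (key, value) pairs that may match.
# power=generator is deliberately absent, so a single reactor is rejected.
ALLOWED_TAGS = {
    "power station": {("power", "station"), ("power", "plant")},
}


def tags_match(kind, osm_tags):
    """Return True if at least one OSM tag is acceptable for this kind."""
    allowed = ALLOWED_TAGS.get(kind, set())
    return any((key, value) in allowed for key, value in osm_tags.items())


# Illustrative tag sets, not copied from the real OSM objects.
single_reactor = {"name": "Isar 2", "power": "generator"}
whole_plant = {"name": "Kernkraftwerke Isar", "power": "plant"}

print(tags_match("power station", single_reactor))  # False
print(tags_match("power station", whole_plant))     # True
```

With a check like this the reactor is no longer a candidate, so only the
object for the whole plant is left to match on name.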

I'm not looking at P31 (instance of) because many of the Wikidata items don't
include it. I depend on the article categories from English Wikipedia.

> > > 2. Heligoland <http://wikidata.org/wiki/Q3038>: you’ve matched the island
> > > Heligoland <http://www.openstreetmap.org/relation/3787052> but the right
> > > match would be the municipality Heligoland
> > > <http://www.openstreetmap.org/relation/1157962> (for the island there
> > > exists a different object in Wikidata)
> >
> > I can't find the Wikidata item that represents the island.
> >
> 
> 
> island: https://www.wikidata.org/wiki/Q3129772
> municipality: https://www.wikidata.org/wiki/Q3038
> archipelago: https://www.wikidata.org/wiki/Q17515918

Thanks.

> 
> 
> > > I also don’t understand why you prefer nodes over ways or relations.
> > > Ways and relations provide more information (e.g. extent of an area) than
> > > nodes. The matching algorithm should first look for relations; when
> > > there’s no relation it should search for ways. Nodes should come last.
> >
> > The matching algorithm is only considering objects within 400m, so the
> > nodes happen to be close, but the centre of the relation is more than
> > 400m from the location in Wikidata.
> >
> > I've modified my matching algorithm to use much larger distances for some
> > types of object; it is running now. My hope is that when it is finished
> > the code will detect the presence of both the node and the relation and
> > skip the Wikidata item. Most of these node vs relation mismatches should
> > disappear.
> >
> 
> The radius for natural and administrative features should be much bigger.
> For example if you want to find the island Hispaniola you'll need a radius
> of 93 km. There are also big glaciers, lakes, etc.
> 
> 
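
A rough sketch of a per-type search radius, assuming both the Wikidata item
and the OSM object are reduced to a single coordinate. The 400 m default is
the figure mentioned in this thread; the larger radii and all the names are
hypothetical placeholders, not tuned values.

```python
import math

DEFAULT_RADIUS_M = 400  # radius mentioned in this thread

# Hypothetical overrides for large features; the numbers are illustrative.
RADIUS_BY_KIND_M = {
    "island": 100_000,
    "lake": 50_000,
    "glacier": 50_000,
}


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6371 km."""
    earth_radius_m = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def within_radius(kind, wikidata_coord, osm_coord):
    """Compare the distance against the radius configured for this kind."""
    limit = RADIUS_BY_KIND_M.get(kind, DEFAULT_RADIUS_M)
    return haversine_m(*wikidata_coord, *osm_coord) <= limit


# Two points about 1.1 km apart: too far for the default, fine for an island.
print(within_radius("building", (0.0, 0.0), (0.01, 0.0)))  # False
print(within_radius("island", (0.0, 0.0), (0.01, 0.0)))    # True
```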
> >
> > > What does your matching algorithm do when a Wikidata object describes
> > > different objects and therefore should be split?
> > >
> > > A good example for this is the Wikidata object for Thasos
> > > <https://www.wikidata.org/wiki/Q204096> (currently it describes both the
> > > island and the municipality “Thasos”), but the object has to be split into two
> > > Wikidata objects so that you can say “the island Thasos lies in the
> > > administrative division Thasos”. There are also other examples like mixed
> > > up nature reserves, lakes and administrative divisions in Wikidata which
> > > you have to solve before you can import the IDs into OSM.
> >
> > My code doesn't do anything special with a Wikidata item that represents
> > multiple things like islands and municipalities. If Wikidata/Wikipedia
> > claim a thing is an island, and in OSM there is a thing tagged with
> > place=island and the same name, they will match.
> >
> > OSM objects can be tagged as both an island and a municipality.
> 
> I'd propose to drop Wikidata objects which have the following property
> combinations:
> "is a" island and at the same time administrative division
> "is a" nature reserve and administrative division
> "is a" lake and administrative division
> "is a" forest and administrative division
> These are the combinations where I've encountered problems in Wikidata so far.

Thanks, this is good to know, I'll investigate these combinations.
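
As a starting point, the check could look something like this. It is only a
sketch: it assumes each Wikidata item has already been reduced to a set of
plain kind names, and the function and constant names are hypothetical. The
four combinations are the ones listed above.

```python
ADMIN = "administrative division"

# Combinations suggesting that one item conflates two different things.
CONFLICTING = {
    frozenset({kind, ADMIN})
    for kind in ("island", "nature reserve", "lake", "forest")
}


def is_conflated(kinds):
    """Return True if the item's kinds contain any conflicting combination."""
    kinds = set(kinds)
    return any(pair <= kinds for pair in CONFLICTING)


print(is_conflated({"island", "administrative division"}))  # True
print(is_conflated({"island"}))                             # False
```

Items flagged this way would be skipped rather than matched, until they have
been split on the Wikidata side.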

> Another problem here: municipality Langeneß:
> https://www.wikidata.org/wiki/Q29931 the algorithm matches the island which
> is also called "Langeneß". But the island has its own WD-object:
> https://www.wikidata.org/wiki/Q13747872 OSM tags and Wikidata properties
> (P39) should be compared and only if the attributes match there should be a
> match.

You're right. I'm only looking at the categories on English Wikipedia, where
there is a single article that represents both the municipality and the
island, but is linked to the Wikidata item Q29931, which only represents the
municipality. The only Wikipedia with separate articles for island and
municipality is Dutch Wikipedia.

> Or Mawson Peak: http://www.openstreetmap.org/node/2774722248 the match of
> the algorithm was Big Ben (volcano) https://www.wikidata.org/wiki/Q858516
> but it should be Mawson Peak: https://www.wikidata.org/wiki/Q2114101
> (Mawson Peak is the highest point of the volcano "Big Ben"). It seems that
> the algorithm focuses too much on aliases in Wikidata.

These are the aliases catching me out again. On Wikidata, Q858516 has a Slovak
alias of Mawson Peak.
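
One way to make this visible would be to record whether a name matched the
label or only an alias, so alias-only matches can be reviewed separately. The
sketch below also includes the single-character edit distance idea from
earlier in the thread. All function names and the max_distance parameter are
hypothetical, and this is not the code that produced the mismatches list.

```python
def edit_distance(a, b):
    """Levenshtein distance using the standard dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete a character from a
                cur[j - 1] + 1,            # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def match_kind(osm_name, label, aliases, max_distance=0):
    """Return 'label', 'alias' or None, so alias-only matches can be flagged."""
    if edit_distance(osm_name, label) <= max_distance:
        return "label"
    if osm_name in aliases:
        return "alias"
    return None


# One extra 'e' is a distance of 1, but so is hill versus hall.
print(edit_distance("Kernkraftwerk Isar", "Kernkraftwerke Isar"))  # 1
print(edit_distance("hill", "hall"))                               # 1
print(match_kind("Isar 2", "Kernkraftwerke Isar", ["KKW Isar", "Isar 2"]))  # alias
```

Allowing max_distance=1 would accept the extra 'e', at the cost of the hill
and hall false positive, so it probably only makes sense for longer names.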

-- 
Edward.

_______________________________________________
talk mailing list
talk@openstreetmap.org
https://lists.openstreetmap.org/listinfo/talk
