Re: Geo indexing Wikidata

Marco Neumann Mon, 21 Feb 2022 03:03:51 -0800

I have a vague memory of running into this issue in the past, Lorenz. The
Wikidata RDF serialization does not conform strictly to the OGC GeoSPARQL
data model, nor do the geospatial access methods of the Blazegraph
extension attempt to exhibit any standards-compliant features. That's
something the Wikidata community had to address. We should not assume OGC
GeoSPARQL compliance here, even though it might be desirable from a
Semantic Web point of view.
I think your workaround is sound as far as OGC GeoSPARQL is concerned. As
I am sure you know, the basic OGC GeoSPARQL data model requires a spatial
object to connect via hasGeometry to an instance of type Geometry that
carries a WKT serialization, which is "just" another literal after all.
The current Apache Jena implementation follows the OGC standard here more
or less strictly. An alleviating factor seems to be that the Wikidata
project only makes use of Point "geometry" at the moment. Of course there
is nothing that prevents you or anyone else from picking up alternative
shapes in the data, as we did in the past with W3C:WGS84 and the Lucene
Spatial Jena geospatial implementation. Apache Jena now provides the
spatialF:convertLatLon function to transform such data into WKT. But
rather than hacking the data into a suitable form for Apache Jena, I'd
recommend discussing this further with a Wikidata-focused group of people
who would like to improve the Wikidata data model. It's also possible that
the Apache Jena community will provide an alternative loading syntax for
Wikidata in the future.

I have noticed that the Wikidata community has recently started to
introduce planet-specific / non-terrestrial location identifiers as
well.[1][2] As far as I understand, Apache Jena GeoSPARQL currently only
supports reference systems relating to Earth via Apache SIS and the EPSG
geodetic dataset.
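
To make the non-terrestrial case concrete, here is a minimal Python sketch
(not part of Jena or Wikidata tooling) that splits a geo:wktLiteral lexical
form into its optional leading IRI and the WKT text. The function names and
the rule "a leading Wikidata entity IRI means a non-Earth body" are my own
assumptions for illustration, based on the example literal quoted below.

```python
import re
from typing import Optional, Tuple

# Prefix of Wikidata entity IRIs. Assumption: when such an IRI appears in the
# CRS position of a wktLiteral, it names a celestial body other than Earth.
WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"

# Optional "<IRI>" followed by whitespace, then the WKT text itself.
_WKT_LITERAL = re.compile(r"^\s*(?:<([^>]*)>\s+)?(.*\S)\s*$", re.DOTALL)


def split_wkt_literal(lexical: str) -> Tuple[Optional[str], str]:
    """Split a wktLiteral lexical form into (leading IRI or None, WKT text)."""
    match = _WKT_LITERAL.match(lexical)
    if match is None:
        raise ValueError(f"empty wktLiteral: {lexical!r}")
    return match.group(1), match.group(2)


def is_terrestrial(lexical: str) -> bool:
    """True unless the leading IRI is a Wikidata entity IRI (hypothetical rule)."""
    iri, _ = split_wkt_literal(lexical)
    return iri is None or not iri.startswith(WIKIDATA_ENTITY)
```

A filter like this could be applied to the dump before loading, so that only
literals Jena can be expected to parse end up in the geometry graph.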


The GeoLD2022 - 5th International Workshop on Geospatial Linked Data at
ESWC 2022 might be a good place to discuss such an effort.
https://i3mainz.github.io/GeoLD2022/

[1]
https://www.wikidata.org/wiki/Wikidata:Property_proposal/planetary_coordinates
[2] https://arxiv.org/ftp/arxiv/papers/1706/1706.02683.pdf


On Mon, Feb 21, 2022 at 9:08 AM Lorenz Buehmann <
buehm...@informatik.uni-leipzig.de> wrote:

> Hi,
>
> we can use this as a complementary thread for the ongoing "loading
> Wikidata" threads, this time with a focus on the geospatial part.
>
> Joachim already did the same for the text index and it works as
> expected, though loading time could still be improved.
>
>
> For the geospatial index things are different, summary and current state
> here:
>
> - Wikidata stores the coordinates in the wdt:P625 property
>
> - the literal values are of type geo:wktLiteral
>
> - so far so good? well, not really ... the Jena geospatial components
> expect to have the data either
>
>   a) following the GeoSPARQL standard, i.e. having a geometry object
> with a serialization linking to the WKT literal
>
>   b) having the data as WGS lat/lon literals and doing the conversion
> before indexing
>
> - apparently neither a) nor b) holds for Wikidata, as the WKT literal is
> simply attached directly to an entity via the wdt:P625 property, so we do
> not have it in a form like
>
> > wd:Q3150 geo:hasDefaultGeometry
> >     [ geo:asWKT "Point(11.586388888 50.927222222)"^^geo:wktLiteral ] .
>
> nor do we have it as
>
> > wd:Q3150 wgs:lon "11.586388888"^^xsd:double ;
> >          wgs:lat "50.927222222"^^xsd:double .
>
> all we have is
>
> > wd:Q3150 wdt:P625 "Point(11.586388888 50.927222222)"^^geo:wktLiteral .
>
> So what does this mean? Well, you'll see the following output when
> starting Fuseki with the GeoSPARQL assembler:
>
> > ./fuseki-server --conf ~/fuseki-wikidata-geosparql-assembler.ttl
> > 09:20:46 WARN  system          :: The “SIS_DATA” environment variable
> > is not set.
> > 09:20:46 INFO  Server          :: Apache Jena Fuseki 4.5.0-SNAPSHOT
> > 2022-02-17T09:59:26Z
> > 09:20:46 INFO  Config          ::
> > FUSEKI_HOME=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/.
> > 09:20:46 INFO  Config          ::
> > FUSEKI_BASE=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run
> > 09:20:46 INFO  Config          :: Shiro file:
> > file:///home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run/shiro.ini
> > 09:20:47 INFO  GeoSPARQLOperations :: Find Mode SRS - Started
> > 09:20:47 INFO  GeoSPARQLOperations :: Find Mode SRS - Completed
> > 09:20:47 WARN  GeoAssembler    :: No SRS found. Check
> > 'http://www.opengis.net/ont/geosparql#hasSerialization' or
> > 'http://www.w3.org/2003/01/geo/wgs84_pos#lat'/'http://www.w3.org/2003/01/geo/wgs84_pos#lon'
> > predicates are present in the source data. Hint: Inferencing with
> > GeoSPARQL schema may be required.
> > 09:20:47 INFO  Server          :: Path = /ds
> > 09:20:47 INFO  Server          :: System
> > 09:20:47 INFO  Server          ::   Memory: 4.0 GiB
> > 09:20:47 INFO  Server          ::   Java:   11.0.11
> > 09:20:47 INFO  Server          ::   OS:     Linux 5.4.0-90-generic amd64
> > 09:20:47 INFO  Server          ::   PID:    1866352
> > 09:20:47 INFO  Server          :: Started 2022/02/19 09:20:47 CET on
> > port 3030
> so technically nothing happens because the data is not in one of the
> expected formats.
>
> So what could be a workaround here? Clearly, we could execute a SPARQL
> Update statement to add the expected triples. There are ~9 million
> wdt:P625 triples, which means 18 million additional triples (each
> variant takes 2 triples per entity), or 36 million triples if we want
> to avoid the need for inference (2 more triples are necessary to add
> geo:Feature and geo:Geometry). So for each entity we add triples like
>
> > wd:Q3150 a geo:Feature ;
> >     geo:hasDefaultGeometry [ a geo:Geometry ;
> >         geo:asWKT "Point(11.586388888 50.927222222)"^^geo:wktLiteral ] .
> >
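
The rewrite described above can be sketched outside SPARQL as well. The
following Python is only an illustration under my own assumptions: the
function name, the `index` parameter and the blank node labels are
hypothetical, and the WKT text is assumed to be a single line. It emits
N-Triples lines for one entity, with or without the two type triples.

```python
from typing import List

GEO = "http://www.opengis.net/ont/geosparql#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def geosparql_triples(subject: str, wkt: str, index: int,
                      materialize_types: bool = False) -> List[str]:
    """Return N-Triples lines linking `subject` to a geometry node carrying `wkt`.

    `index` only serves to mint a distinct blank node label per entity.
    """
    geom = f"_:geom{index}"
    # Escape backslashes and double quotes for the N-Triples string syntax.
    lexical = wkt.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        f"<{subject}> <{GEO}hasDefaultGeometry> {geom} .",
        f'{geom} <{GEO}asWKT> "{lexical}"^^<{GEO}wktLiteral> .',
    ]
    if materialize_types:
        # The two extra type triples make domain/range inference unnecessary.
        lines.append(f"<{subject}> <{RDF_TYPE}> <{GEO}Feature> .")
        lines.append(f"{geom} <{RDF_TYPE}> <{GEO}Geometry> .")
    return lines
```

With roughly 9 million wdt:P625 values, the two variants yield 2 and 4
lines per entity, which matches the 18 million and 36 million figures.
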
> If we do not care about the dataset size, this is not a big deal, I
> guess? But for querying it matters, as we have to consider this
> different, non-Wikidata format. Indeed, we don't have to write
>
> > ?subj geo:hasDefaultGeometry ?subjGeom . ?subjGeom geo:asWKT ?subjLit .
> > ?obj geo:hasDefaultGeometry ?objGeom . ?objGeom geo:asWKT ?objLit .
> > FILTER(geof:sfContains(?subjLit, ?objLit))
> because, when query rewriting is enabled, the topological functions are
> covered and we can simply write
>
> > ?subj geo:sfContains ?obj .
>
> But for queries using non-topological functions like distance it
> matters, i.e. we either have to follow the full path from entity to
> literal, or we just take the literal from the original wdt:P625 triple,
> which would then be fine. Here I noticed that no spatial index is used
> for distance functions.
>
> So far so good. Now we come to the data quality which I didn't check in
> the first step. Some observations I made and which have to be considered:
>
> - there are some non-literal wdt:P625 triples, those should be omitted
> in the SPARQL Update statement
> - some wdt:P625 triples, or rather their WKT literal values, refer to
> coordinates on other planets, like Mars and the like. The problem here
> is that Wikidata decided to use the corresponding Wikidata planet entity
> URI as the CRS inside the WKT literal. Clearly Jena can't parse those,
> as the entity URI does not identify a CRS. There are some CRS URIs for
> planets, but those haven't been used for the non-terrestrial geo
> literals. For example
>
> > wd:Q2267142 wdt:P31 wd:Q1439394 ; wdt:P376 wd:Q3303 ;
> >     wdt:P2824 "2727" ;
> >     wdt:P625 "<http://www.wikidata.org/entity/Q3303> Point(-358.26 11.3)"^^geo:wktLiteral ;
> denoting a scarp on Saturn's moon Enceladus. Not sure if it's worth
> spending time fixing this, nor do I know if using an appropriate CRS
> would bring any benefit. Like computing the distance from any city on
> Earth to this moon, would this work? And then with which metric?
>
>
> At this point I dumped the geospatial data, stopped Fuseki, and used
> tdbloader to load it into a separate graph. I used this query to
> produce the minimal triples only, because hey, there is inference ...
>
> > PREFIX geo: <http://www.opengis.net/ont/geosparql#>
> > PREFIX wd: <http://www.wikidata.org/entity/>
> > PREFIX wdt: <http://www.wikidata.org/prop/direct/>
> >
> > CONSTRUCT {
> >   ?s geo:hasDefaultGeometry [geo:asWKT ?o] .
> > } WHERE {
> >   ?s wdt:P625 ?o
> >   FILTER(isLiteral(?o))
> > }
>
> Then restarted Fuseki and surprisingly got an OOM because something in
> the GeoSPARQL setup seems to consume more memory than expected. Not a
> big deal, increased Xmx to 8GB - start works.
>
> Did some querying, works. Then I wanted to use the fancy short form for
> topological functions, so I need inferencing, which I had disabled until
> now, so I enabled it in the assembler with
>
> > geosparql:inference            true ;
> Restarted Fuseki. Waited 10min, still computing the inferences ... I
> guess everything happens in-memory, so let's wait some more minutes ... OOM
>
> > java.lang.OutOfMemoryError: Java heap space
> >     at java.util.HashMap.resize(HashMap.java:699) ~[?:?]
> >     at java.util.HashMap.putVal(HashMap.java:658) ~[?:?]
> >     at java.util.HashMap.put(HashMap.java:607) ~[?:?]
> >     at java.util.HashSet.add(HashSet.java:220) ~[?:?]
> >     at org.apache.jena.util.iterator.UniqueFilter.test(UniqueFilter.java:38) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.util.iterator.FilterIterator.hasNext(FilterIterator.java:56) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.util.iterator.WrappedIterator.hasNext(WrappedIterator.java:103) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.util.iterator.FilterIterator.hasNext(FilterIterator.java:55) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.util.IteratorCollection.iteratorToList(IteratorCollection.java:63) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.graph.GraphUtil.addIteratorWorker(GraphUtil.java:144) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.graph.GraphUtil.addInto(GraphUtil.java:139) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.rdf.model.impl.ModelCom.add(ModelCom.java:195) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:332) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:277) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:259) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> >     at org.apache.jena.geosparql.assembler.GeoAssembler.createDataset(GeoAssembler.java:140) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
> Looks like 8GB RAM is not enough for RDFS on the 18 million triples
> graph. Did not try to increase the RAM as it takes too much time
> anyway, and given that this would happen on every startup I do not
> consider this a good option.
>
> Next I tried the RDFS dataset feature Andy introduced, which allows for
> "RDFS Simple" inference (subClassOf, subPropertyOf, domain, range).
> This should be sufficient, as all we need is domain/range reasoning. So I
> added it to the assembler and forwarded this RDFS dataset to the
> GeoSPARQL dataset. Restarted Fuseki and got instant access to the endpoint.
>
> Tried a very simple query to see if inference works as I need
>
> > select * {
> >   graph <http://wikidata.org/geosparql> {
> >     ?s1 a geo:Feature
> >   }
> > }
> > limit 10
> Took ~100s on the initial query, quite long? Ok, maybe there is some
> initial setup and later requests will be fast. But it doesn't seem so:
> I did a follow-up request, just renaming ?s1 to ?s2 to avoid caching,
> and still got ~75s response time. I'm not sure if this is the
> performance we can get or if I did something wrong?
>
> Anyways, currently I'm loading the materialized variant of the dataset,
> i.e. with geo:Feature and geo:Geometry types attached such that
> inference can be disabled entirely. Then I'll try some more queries.
>
>
> Did anybody else already try to set up the spatial index on Wikidata? Any
> experience or comments so far? Any suggestions how to handle the core
> data and also how to work on the non-terrestrial data? Should we avoid
> inference here (18M vs 36M triples)?
>
>
> Cheers,
>
> Lorenz
>
>

-- 


---
Marco Neumann
KONA
