Hi,

we can use this as a complementary thread for the ongoing "loading Wikidata" threads, this time with a focus on the geospatial part.

Joachim already did the same for the text index and it works as expected, though loading time could still be improved.


For the geospatial index things are different, summary and current state here:

- Wikidata stores the coordinates in the wdt:P625 property

- the literal values are of type geo:wktLiteral

- so far so good? Well, not really ... the Jena geospatial components expect to have the data either

 a) following the GeoSPARQL standard, i.e. having a geometry object with a serialization linking to the WKT literal

 b) having the data as WGS lat/lon literals and doing the conversion before indexing

- apparently neither a) nor b) holds for Wikidata, as the WKT literal is simply attached directly to an entity via the wdt:P625 property, so we do not have it in a form like

wd:Q3150 geo:hasDefaultGeometry [ geo:asWKT "Point(11.586388888 50.927222222)"^^geo:wktLiteral ] .

nor do we have it as

wd:Q3150 wgs:lon "11.586388888"^^xsd:double ;
         wgs:lat "50.927222222"^^xsd:double .

all we have is

wd:Q3150 wdt:P625 "Point(11.586388888 50.927222222)"^^geo:wktLiteral .

So what does this mean? Well, you'll see the following output when starting Fuseki with the GeoSPARQL assembler:

./fuseki-server --conf ~/fuseki-wikidata-geosparql-assembler.ttl
09:20:46 WARN  system          :: The “SIS_DATA” environment variable is not set.
09:20:46 INFO  Server          :: Apache Jena Fuseki 4.5.0-SNAPSHOT 2022-02-17T09:59:26Z
09:20:46 INFO  Config          :: FUSEKI_HOME=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/.
09:20:46 INFO  Config          :: FUSEKI_BASE=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run
09:20:46 INFO  Config          :: Shiro file: file:///home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run/shiro.ini
09:20:47 INFO  GeoSPARQLOperations :: Find Mode SRS - Started
09:20:47 INFO  GeoSPARQLOperations :: Find Mode SRS - Completed
09:20:47 WARN  GeoAssembler    :: No SRS found. Check 'http://www.opengis.net/ont/geosparql#hasSerialization' or 'http://www.w3.org/2003/01/geo/wgs84_pos#lat'/'http://www.w3.org/2003/01/geo/wgs84_pos#lon' predicates are present in the source data. Hint: Inferencing with GeoSPARQL schema may be required.
09:20:47 INFO  Server          :: Path = /ds
09:20:47 INFO  Server          :: System
09:20:47 INFO  Server          ::   Memory: 4.0 GiB
09:20:47 INFO  Server          ::   Java:   11.0.11
09:20:47 INFO  Server          ::   OS:     Linux 5.4.0-90-generic amd64
09:20:47 INFO  Server          ::   PID:    1866352
09:20:47 INFO  Server          :: Started 2022/02/19 09:20:47 CET on port 3030

so technically nothing happens because the data is not in one of the expected formats.

So what could be a workaround here? Clearly, we could execute a SPARQL Update statement to add the expected triples. There are ~9 million wdt:P625 triples, which means 18 million additional triples (it takes 2 triples per entity in either variant), or 36 million triples if we want to avoid the need for inference (2 more triples are necessary to add the geo:Feature and geo:Geometry types). So for each entity we add triples like

wd:Q3150 a geo:Feature ; geo:hasDefaultGeometry [ a geo:Geometry ; geo:asWKT "Point(11.586388888 50.927222222)"^^geo:wktLiteral ] .
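
The update mentioned above could be sketched roughly as follows. This is a minimal sketch and not something I have run against the full dataset: it assumes the data sits in the default graph (add a GRAPH clause otherwise) and it relies on the blank node in the INSERT template being created fresh for every solution. Entities with more than one wdt:P625 value would end up with more than one default geometry.

```sparql
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Materialized variant: 4 new triples per wdt:P625 literal,
# so no inference is needed afterwards.
INSERT {
  ?s a geo:Feature ;
     geo:hasDefaultGeometry [ a geo:Geometry ; geo:asWKT ?o ] .
}
WHERE {
  ?s wdt:P625 ?o .
  # skip the non-literal wdt:P625 values
  FILTER(isLiteral(?o))
}
```

Dropping the two rdf:type triples from the template gives the smaller 2-triples-per-entity variant, which then depends on inference for geo:Feature and geo:Geometry.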

If we do not care about the dataset size we could say it's not a big deal, I guess? But for querying it matters, as we have to consider this different, non-Wikidata format. Indeed we don't have to do

?subj geo:hasDefaultGeometry ?subjGeom .
?subjGeom geo:asWKT ?subjLit .
?obj geo:hasDefaultGeometry ?objGeom .
?objGeom geo:asWKT ?objLit .
FILTER(geof:sfContains(?subjLit, ?objLit))

because when query rewriting is enabled, the topological functions are fine, as we can write

?subj geo:sfContains ?obj .

But for queries using non-topological functions like distance it matters, i.e. we either have to go the full path from entity to literal, or we just get the literal from the original wdt:P625 triple, which would be fine as well. Here I noticed that no spatial index is used for distance functions.
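
As an illustration of the second option, a distance query that takes the literals straight from the original wdt:P625 triples might look like the sketch below. It assumes the standard geof:distance function with the OGC metre unit IRI; the 10 km threshold and the LIMIT are arbitrary values picked for the example. Since no spatial index is involved, wd:Q3150 is compared against every other coordinate literal.

```sparql
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX uom:  <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

SELECT ?other ?dist
WHERE {
  wd:Q3150 wdt:P625 ?from .
  ?other   wdt:P625 ?to .
  FILTER(isLiteral(?from) && isLiteral(?to) && ?other != wd:Q3150)
  # distance in metres between the two WKT literals
  BIND(geof:distance(?from, ?to, uom:metre) AS ?dist)
  # arbitrary 10 km cut-off, for illustration only
  FILTER(?dist < 10000)
}
ORDER BY ?dist
LIMIT 10
```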

So far so good. Now we come to the data quality, which I didn't check in the first step. Some observations I made that have to be considered:

- there are some non-literal wdt:P625 triples, those should be omitted in the SPARQL Update statement

- some wdt:P625 triples, or rather their WKT literal values, refer to coordinates on other planets, like Mars and the like. The problem here is that Wikidata decided to use the corresponding Wikidata planet entity URI as CRS inside the WKT literal. Clearly Jena can't parse those, as an entity URI in the CRS position is misleading. There are some CRS URIs for planets, but those haven't been used for the non-terrestrial geo literals. For example

wd:Q2267142 wdt:P31 wd:Q1439394 ; wdt:P376 wd:Q3303 ; wdt:P2824 "2727" ; wdt:P625 "<http://www.wikidata.org/entity/Q3303> Point(-358.26 11.3)"^^geo:wktLiteral ;

denoting a scarp on Saturn's moon Enceladus. Not sure if it's worth spending time on fixing this, nor do I know if using an appropriate CRS would bring any benefits. Like computing the distance from any city on Earth to this moon, would this work? And then with which metric?
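
To get a feeling for how many literals are affected, something like the following could be used. It is only a sketch, under the assumption that all non-terrestrial literals follow the pattern of the example above, i.e. the lexical form starts with a Wikidata entity URI in angle brackets.

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Count wdt:P625 literals per Wikidata entity used in the CRS position
SELECT ?globe (COUNT(*) AS ?n)
WHERE {
  ?s wdt:P625 ?o .
  FILTER(isLiteral(?o) && STRSTARTS(STR(?o), "<http://www.wikidata.org/entity/"))
  # extract the URI between the leading "<" and the first ">"
  BIND(IRI(STRBEFORE(STRAFTER(STR(?o), "<"), ">")) AS ?globe)
}
GROUP BY ?globe
ORDER BY DESC(?n)
```

Negating the STRSTARTS condition in the WHERE clause of an update or CONSTRUCT would leave those literals out of the geospatial triples.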


At this point I dumped the geospatial data, stopped Fuseki, and used tdbloader to load it into a separate graph. I used this query to produce the minimal triples only because hey, there is inference ...

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

CONSTRUCT {
  ?s geo:hasDefaultGeometry [geo:asWKT ?o] .
} WHERE {
  ?s wdt:P625 ?o
  FILTER(isLiteral(?o))
}

Then restarted Fuseki and surprisingly got an OOM because something in the geosparql setup seems to consume more memory than expected. Not a big deal, increased Xmx to 8GB - start works.

Did some querying, works. Then I wanted to use the fancy short form for topological functions, so I need inferencing, which I had disabled until now, so I enabled it in the assembler with

geosparql:inference            true ;

Restarted Fuseki. Waited 10min, still computing the inferences ... I guess everything happens in-memory, so let's wait some more minutes ... OOM

java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(HashMap.java:699) ~[?:?]
    at java.util.HashMap.putVal(HashMap.java:658) ~[?:?]
    at java.util.HashMap.put(HashMap.java:607) ~[?:?]
    at java.util.HashSet.add(HashSet.java:220) ~[?:?]
    at org.apache.jena.util.iterator.UniqueFilter.test(UniqueFilter.java:38) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.util.iterator.FilterIterator.hasNext(FilterIterator.java:56) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.util.iterator.WrappedIterator.hasNext(WrappedIterator.java:103) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.util.iterator.FilterIterator.hasNext(FilterIterator.java:55) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.util.IteratorCollection.iteratorToList(IteratorCollection.java:63) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.graph.GraphUtil.addIteratorWorker(GraphUtil.java:144) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.graph.GraphUtil.addInto(GraphUtil.java:139) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.rdf.model.impl.ModelCom.add(ModelCom.java:195) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:332) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:277) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:259) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.geosparql.assembler.GeoAssembler.createDataset(GeoAssembler.java:140) ~[fuseki-server.jar:4.5.0-SNAPSHOT]

Looks like 8GB RAM is not enough for RDFS on the 18 million triples graph. I did not try to increase the RAM, as it takes too much time anyway, and given that this would happen on every startup I do not consider this a good option.

Next I tried the RDFS dataset feature Andy introduced, which allows for "RDFS Simple" inference (subClassOf, subPropertyOf, domain, range). This should be sufficient, as all we need is domain/range reasoning. So I added it to the assembler and passed this RDFS dataset on to the GeoSPARQL dataset. Restarted Fuseki and got instant access to the endpoint.

Tried a very simple query to see if inference works as I need

select * {
  graph <http://wikidata.org/geosparql> {
    ?s1 a geo:Feature
  }
}
limit 10

Took ~100s on the initial query, quite long? Ok, maybe there is some initial setup and later requests will be fast. But it doesn't seem so: I did a follow-up request, just renaming ?s1 to ?s2 to avoid caching, and still got ~75s response time. I'm not sure if this performance is what we can get or if I did something wrong?

Anyways, currently I'm loading the materialized variant of the dataset, i.e. with geo:Feature and geo:Geometry types attached, such that inference can be disabled altogether. Then I'll try some more queries.


Did anybody else already try to set up the spatial index on Wikidata? Any experience or comments so far? Any suggestions on how to handle the core data and also how to work with the non-terrestrial data? Should we avoid inference here (18M vs 36M triples)?


Cheers,

Lorenz
