Geo indexing Wikidata

Lorenz Buehmann Mon, 21 Feb 2022 01:08:17 -0800

Hi,

we can use this as a complementary thread for the ongoing "loading Wikidata" threads, this time with a focus on the geospatial part.

Joachim already did the same for the text index and it works as expected, though the loading time could still be improved.

For the geospatial index things are different. Summary and current state here:


- Wikidata stores the coordinates in the wdt:P625 property

- the literal values are of type geo:wktLiteral

- so far so good? Well, not really ... the Jena geospatial component expects the data to be either

a) following the GeoSPARQL standard, i.e. having a geometry object with a serialization linking to the WKT literal

b) given as WGS lat/lon literals, with the conversion done before indexing

- apparently neither a) nor b) holds for Wikidata, as the WKT literal is simply attached directly to an entity via the wdt:P625 property, so we do not have it in a form like

wd:Q3150 geo:hasDefaultGeometry [ geo:asWKT "Point(11.586388888 50.927222222)"^^geo:wktLiteral ] .

nor do we have it as

wd:Q3150 wgs:lon "11.586388888"^^xsd:double ;
         wgs:lat "50.927222222"^^xsd:double .

all we have is

wd:Q3150 wdt:P625 "Point(11.586388888 50.927222222)"^^geo:wktLiteral .

So what does this mean? Well, you'll see the following output when starting Fuseki with the GeoSPARQL assembler:

./fuseki-server --conf ~/fuseki-wikidata-geosparql-assembler.ttl
09:20:46 WARN  system          :: The “SIS_DATA” environment variable is not set.
09:20:46 INFO  Server          :: Apache Jena Fuseki 4.5.0-SNAPSHOT 2022-02-17T09:59:26Z
09:20:46 INFO  Config          :: FUSEKI_HOME=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/.
09:20:46 INFO  Config          :: FUSEKI_BASE=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run
09:20:46 INFO  Config          :: Shiro file: file:///home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run/shiro.ini
09:20:47 INFO  GeoSPARQLOperations :: Find Mode SRS - Started
09:20:47 INFO  GeoSPARQLOperations :: Find Mode SRS - Completed
09:20:47 WARN  GeoAssembler    :: No SRS found. Check 'http://www.opengis.net/ont/geosparql#hasSerialization' or 'http://www.w3.org/2003/01/geo/wgs84_pos#lat'/'http://www.w3.org/2003/01/geo/wgs84_pos#lon' predicates are present in the source data. Hint: Inferencing with GeoSPARQL schema may be required.
09:20:47 INFO  Server          :: Path = /ds
09:20:47 INFO  Server          :: System
09:20:47 INFO  Server          ::   Memory: 4.0 GiB
09:20:47 INFO  Server          ::   Java:   11.0.11
09:20:47 INFO  Server          ::   OS:     Linux 5.4.0-90-generic amd64
09:20:47 INFO  Server          ::   PID:    1866352
09:20:47 INFO  Server          :: Started 2022/02/19 09:20:47 CET on port 3030

so technically nothing happens because the data is not in one of the expected formats.

So what could be a workaround here? Clearly, we could execute a SPARQL Update statement to add the expected triples. There are ~9 million wdt:P625 triples, which means 18 million additional triples (it takes 2 triples per entity for either variant), or 36 million triples if we want to avoid the need for inference (2 more triples per entity are necessary to add geo:Feature and geo:Geometry). So for each entity we add triples like

wd:Q3150 a geo:Feature ;
         geo:hasDefaultGeometry [ a geo:Geometry ; geo:asWKT "Point(11.586388888 50.927222222)"^^geo:wktLiteral ] .
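
For illustration, a SPARQL Update along the following lines could add this materialized variant. It is only a sketch: the target graph name is an assumption (any graph, or the default graph, would do), and the isLiteral guard is there simply to skip wdt:P625 values that are not literals.

```sparql
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Sketch only: adds a typed feature plus a typed blank-node geometry
# for every coordinate literal found via wdt:P625.
INSERT {
  GRAPH <http://wikidata.org/geosparql> {   # assumed target graph name
    ?s a geo:Feature ;
       geo:hasDefaultGeometry [ a geo:Geometry ;
                                geo:asWKT ?o ] .
  }
}
WHERE {
  ?s wdt:P625 ?o .
  FILTER(isLiteral(?o))   # skip non-literal values
}
```

The blank node in the INSERT template is instantiated freshly for each solution, so an entity with several coordinate values ends up with several geometries.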

If we do not care about the dataset size, we could say this is not a big deal, I guess? But for querying it matters, as we have to take this different, non-Wikidata format into account. Admittedly, we don't have to write

?subj geo:hasDefaultGeometry ?subjGeom .
?subjGeom geo:asWKT ?subjLit .
?obj geo:hasDefaultGeometry ?objGeom .
?objGeom geo:asWKT ?objLit .
FILTER(geof:sfContains(?subjLit, ?objLit))

because, when query rewriting is enabled, the topological functions are fine and we can simply write

?subj geo:sfContains ?obj .

But for queries using non-topological functions like distance it matters, i.e. we either have to go the full path from entity to literal, or we just take the literal from the original wdt:P625 triple, which would then be fine. Here I noticed that no spatial index is used for distance functions.
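
As a sketch of the second option, a distance query can read the literals straight from wdt:P625. geof:distance and the OGC metre unit URI come from GeoSPARQL; the choice of wd:Q3150 as the reference entity and the 10 km threshold are arbitrary example values, not something taken from the actual setup.

```sparql
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX uom:  <http://www.opengis.net/def/uom/OGC/1.0/>

# Sketch only: entities within 10 km of wd:Q3150, using the
# original Wikidata triples instead of the GeoSPARQL geometry path.
SELECT ?other ?dist
WHERE {
  wd:Q3150 wdt:P625 ?refWkt .
  ?other   wdt:P625 ?otherWkt .
  FILTER(?other != wd:Q3150)
  BIND(geof:distance(?refWkt, ?otherWkt, uom:metre) AS ?dist)
  FILTER(?dist < 10000)   # example threshold in metres
}
```

Since no spatial index backs the distance function, a query of this shape has to evaluate the function for every wdt:P625 value.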

So far so good. Now we come to the data quality, which I didn't check in the first step. Some observations I made and which have to be considered:

- there are some non-literal wdt:P625 triples; those should be omitted in the SPARQL Update statement

- some wdt:P625 triples, or rather their WKT literal values, refer to coordinates on other planets, like Mars and the like. The problem here is that Wikidata decided to use the corresponding Wikidata planet entity URI as the CRS inside the WKT literal. Clearly Jena can't parse those, as this is misleading. There are some CRS URIs for planets, but those haven't been used for the non-terrestrial geo literals. For example

wd:Q2267142 wdt:P31 wd:Q1439394 ;
            wdt:P376 wd:Q3303 ;
            wdt:P2824 "2727" ;
            wdt:P625 "<http://www.wikidata.org/entity/Q3303>Point(-358.26 11.3)"^^geo:wktLiteral ;

denoting a scarp on Saturn's moon Enceladus. I'm not sure whether it's worth spending time on fixing this, nor do I know whether using an appropriate CRS would bring any benefit. Like computing the distance from any city on Earth to this moon, would this work? And then with which metric?
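
One possible way to find the affected literals is to look for lexical forms that start with a Wikidata entity URI in the CRS position. The following is a sketch under that assumption; it presumes the CRS URI is the very first thing in the literal, as in the example above, and the LIMIT is just there to keep the sample small.

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Sketch only: sample wdt:P625 literals whose lexical form begins
# with a Wikidata entity URI used as CRS.
SELECT ?s ?o
WHERE {
  ?s wdt:P625 ?o .
  FILTER(isLiteral(?o))
  FILTER(STRSTARTS(STR(?o), "<http://www.wikidata.org/entity/"))
}
LIMIT 100
```

Negating the STRSTARTS filter would instead keep only the literals without such a CRS prefix.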

At this point I dumped the geospatial data, stopped Fuseki, and used tdbloader to load it into a separate graph. I used this query to produce the minimal triples only, because hey, there is inference ...

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

CONSTRUCT {
  ?s geo:hasDefaultGeometry [geo:asWKT ?o] .
} WHERE {
  ?s wdt:P625 ?o
  FILTER(isLiteral(?o))
}

Then I restarted Fuseki and surprisingly got an OOM, because something in the GeoSPARQL setup seems to consume more memory than expected. Not a big deal, I increased Xmx to 8GB and the start works.

Did some querying, works. Then I wanted to use the fancy short form for topological functions, so I needed inferencing, which I had disabled until now, so I enabled it in the assembler with

geosparql:inference            true ;

Restarted Fuseki. Waited 10min, still computing the inferences ... I guess everything happens in-memory, so let's wait some more minutes ... OOM

java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(HashMap.java:699) ~[?:?]
    at java.util.HashMap.putVal(HashMap.java:658) ~[?:?]
    at java.util.HashMap.put(HashMap.java:607) ~[?:?]
    at java.util.HashSet.add(HashSet.java:220) ~[?:?]
    at org.apache.jena.util.iterator.UniqueFilter.test(UniqueFilter.java:38) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.util.iterator.FilterIterator.hasNext(FilterIterator.java:56) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.util.iterator.WrappedIterator.hasNext(WrappedIterator.java:103) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.util.iterator.FilterIterator.hasNext(FilterIterator.java:55) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.util.IteratorCollection.iteratorToList(IteratorCollection.java:63) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.graph.GraphUtil.addIteratorWorker(GraphUtil.java:144) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.graph.GraphUtil.addInto(GraphUtil.java:139) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.rdf.model.impl.ModelCom.add(ModelCom.java:195) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:332) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:277) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:259) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
    at org.apache.jena.geosparql.assembler.GeoAssembler.createDataset(GeoAssembler.java:140) ~[fuseki-server.jar:4.5.0-SNAPSHOT]

Looks like 8GB RAM is not enough for RDFS on the 18 million triple graph. I did not try to increase the RAM, as it takes too much time anyway, and given that this would happen on every startup I do not consider this a good option.

Next I tried the RDFS dataset feature Andy introduced, which allows for "RDFS Simple" inference (subClassOf, subPropertyOf, domain, range). This should be sufficient, as all we need is domain/range reasoning. So I added it to the assembler and passed this RDFS dataset on to the GeoSPARQL dataset. Restarted Fuseki and got instant access to the endpoint.


Tried a very simple query to see if inference works as I need:

PREFIX geo: <http://www.opengis.net/ont/geosparql#>

select * {
  graph <http://wikidata.org/geosparql> {
    ?s1 a geo:Feature
  }
}
limit 10

Took ~100s on the initial query, quite long? OK, maybe there is some initial setup and later requests will be fast. But it doesn't seem so: I did a follow-up request, just renaming ?s1 to ?s2 to avoid caching, and still got a ~75s response time. I'm not sure whether this performance is what we can expect or whether I did something wrong.

Anyway, I'm currently loading the materialized variant of the dataset, i.e. with the geo:Feature and geo:Geometry types attached, such that inference can be disabled altogether. Then I'll try some more queries.

Did anybody else already try to set up the spatial index on Wikidata? Any experience or comments so far? Any suggestions on how to handle the core data and also how to deal with the non-terrestrial data? Should we avoid inference here (18M vs 36M triples)?



Cheers,

Lorenz
