Hi,
we can use this as a complementary thread to the ongoing "loading
Wikidata" threads, this time with a focus on the geospatial part.
Joachim already did the same for the text index and it works as
expected, though the loading time could still be improved.
For the geospatial index things are different; a summary and the current
state follow:
- Wikidata stores the coordinates in the wdt:P625 property
- the literal values are of type geo:wktLiteral
- so far so good? Well, not really ... the Jena geospatial components
expect the data to either
a) follow the GeoSPARQL standard, i.e. have a geometry object
with a serialization linking to the WKT literal, or
b) be given as WGS lat/lon literals, with the conversion done
before indexing
- apparently neither a) nor b) holds for Wikidata, as the WKT literal is
simply attached directly to an entity via the wdt:P625 property, so we do
not have it in a form like

wd:Q3150 geo:hasDefaultGeometry [ geo:asWKT "Point(11.586388888 50.927222222)"^^geo:wktLiteral ] .

nor do we have it as

wd:Q3150 wgs:long "11.586388888"^^xsd:double ;
wgs:lat "50.927222222"^^xsd:double .

all we have is

wd:Q3150 wdt:P625 "Point(11.586388888 50.927222222)"^^geo:wktLiteral .

So what does this mean? Well, you'll see the following output when
starting Fuseki with the GeoSPARQL assembler:
./fuseki-server --conf ~/fuseki-wikidata-geosparql-assembler.ttl
09:20:46 WARN system :: The “SIS_DATA” environment variable is not set.
09:20:46 INFO Server :: Apache Jena Fuseki 4.5.0-SNAPSHOT 2022-02-17T09:59:26Z
09:20:46 INFO Config :: FUSEKI_HOME=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/.
09:20:46 INFO Config :: FUSEKI_BASE=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run
09:20:46 INFO Config :: Shiro file: file:///home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run/shiro.ini
09:20:47 INFO GeoSPARQLOperations :: Find Mode SRS - Started
09:20:47 INFO GeoSPARQLOperations :: Find Mode SRS - Completed
09:20:47 WARN GeoAssembler :: No SRS found. Check 'http://www.opengis.net/ont/geosparql#hasSerialization' or 'http://www.w3.org/2003/01/geo/wgs84_pos#lat'/'http://www.w3.org/2003/01/geo/wgs84_pos#lon' predicates are present in the source data. Hint: Inferencing with GeoSPARQL schema may be required.
09:20:47 INFO Server :: Path = /ds
09:20:47 INFO Server :: System
09:20:47 INFO Server :: Memory: 4.0 GiB
09:20:47 INFO Server :: Java: 11.0.11
09:20:47 INFO Server :: OS: Linux 5.4.0-90-generic amd64
09:20:47 INFO Server :: PID: 1866352
09:20:47 INFO Server :: Started 2022/02/19 09:20:47 CET on port 3030
so technically nothing happens because the data is not in one of the
expected formats.
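
To double-check the warning against the data, one could count the triples per relevant predicate. This is only a sketch: it assumes the Wikidata triples are visible in the default graph (or that the union default graph is enabled), and the wgs: prefix is just a local abbreviation chosen here.

```sparql
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX wgs: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Count how many triples exist for each predicate of interest.
SELECT ?p (COUNT(*) AS ?triples)
WHERE {
  VALUES ?p { geo:hasSerialization geo:asWKT wgs:lat wdt:P625 }
  ?s ?p ?o
}
GROUP BY ?p
```

Predicates without any triples do not show up in the result at all, so on the plain Wikidata dump one would expect a single row for wdt:P625.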
So what could be a workaround here? Clearly, we could execute a SPARQL
Update statement to add the expected triples. There are ~9 million
wdt:P625 triples, which means 18 million additional triples (either
variant takes 2 triples per entity), or 36 million triples if we want to
avoid the need for inference (2 more triples per entity are necessary to
add the geo:Feature and geo:Geometry types). So for each entity we add
triples like
wd:Q3150 a geo:Feature ; geo:hasDefaultGeometry [ a geo:Geometry ;
geo:asWKT "Point(11.586388888 50.927222222)"^^geo:wktLiteral ] .
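
A minimal sketch of such an update for the fully materialized variant (4 triples per coordinate) could look as follows. The target graph name is an assumption; any graph the GeoSPARQL dataset reads from would do. The blank node in the INSERT template is created fresh for every matched wdt:P625 value.

```sparql
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Assumption: <http://wikidata.org/geosparql> is the graph holding the geo triples.
INSERT {
  GRAPH <http://wikidata.org/geosparql> {
    ?s a geo:Feature ;
       geo:hasDefaultGeometry [ a geo:Geometry ; geo:asWKT ?wkt ] .
  }
}
WHERE {
  ?s wdt:P625 ?wkt .
  FILTER(isLiteral(?wkt))
}
```

An entity with several wdt:P625 values ends up with several default geometries this way, which may or may not be what is wanted.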
If we do not care about the dataset size, this is not a big deal, I
guess. But for querying it matters, as we have to take this different,
non-Wikidata format into account. Admittedly, we don't have to write

?subj geo:hasDefaultGeometry ?subjGeom . ?subjGeom geo:asWKT ?subjLit .
?obj geo:hasDefaultGeometry ?objGeom . ?objGeom geo:asWKT ?objLit .
FILTER(geof:sfContains(?subjLit, ?objLit))

because with query rewriting enabled the topological functions can be
used in the short form:

?subj geo:sfContains ?obj .

But for queries using non-topological functions like distance it
matters: we either have to follow the full path from entity to literal,
or we take the literal from the original wdt:P625 triple, which works
just as well. Here I noticed that no spatial index is used for
distance functions.
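
For illustration, a distance query that takes the literals straight from wdt:P625 might look like this. It is a sketch only: the 10 km threshold and the LIMIT are arbitrary, and since no spatial index is involved it has to compare the start point against every coordinate in the dataset.

```sparql
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX uom:  <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

SELECT ?other ?dist
WHERE {
  wd:Q3150 wdt:P625 ?here .
  ?other wdt:P625 ?there .
  FILTER(isLiteral(?there) && ?other != wd:Q3150)
  # Distance in metres between the two WKT literals.
  BIND(geof:distance(?here, ?there, uom:metre) AS ?dist)
  # Arbitrary example threshold: 10 km.
  FILTER(?dist < 10000)
}
ORDER BY ?dist
LIMIT 10
```

Literals that cannot be parsed leave ?dist unbound, so the final FILTER drops them silently.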
So far so good. Now we come to the data quality, which I didn't check in
the first step. Some observations I made that have to be considered:
- there are some non-literal wdt:P625 triples; those should be omitted
in the SPARQL Update statement
- some wdt:P625 triples, or rather their WKT literal values, refer to
coordinates on other planets, like Mars and the like. The problem here
is that Wikidata decided to use the corresponding Wikidata planet entity
URI as the CRS inside the WKT literal. Clearly Jena can't parse those,
since an entity URI is not an actual CRS. There are some CRS URIs for
planets, but those haven't been used for the non-terrestrial geo
literals. For example
wd:Q2267142 wdt:P31 wd:Q1439394 ; wdt:P376 wd:Q3303 ; wdt:P2824
"2727" ; wdt:P625 "<http://www.wikidata.org/entity/Q3303>
Point(-358.26 11.3)"^^geo:wktLiteral ;
denoting a scarp on Saturn's moon Enceladus. I'm not sure whether it's
worth spending time on fixing this, nor do I know whether using an
appropriate CRS would bring any benefits. Like computing the distance
from any city on Earth to this moon, would this work? And then with
which metric?
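
To get a feeling for how many values fall into each of these categories, a classification query along the following lines could be used. It assumes that every non-terrestrial literal starts with a Wikidata entity URI in angle brackets, as in the Enceladus example; that assumption was not checked against the full dump.

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?kind (COUNT(*) AS ?n)
WHERE {
  ?s wdt:P625 ?o .
  # STR() is only evaluated for literals, because IF evaluates
  # just the branch that is selected.
  BIND(
    IF(!isLiteral(?o), "non-literal",
      IF(STRSTARTS(STR(?o), "<http://www.wikidata.org/entity/"),
         "non-Earth CRS", "plain WKT"))
    AS ?kind)
}
GROUP BY ?kind
```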
At this point I dumped the geospatial data, stopped Fuseki, and used
tdbloader to load it into a separate graph. I used this query to
produce only the minimal triples, because hey, there is inference ...
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
CONSTRUCT {
  ?s geo:hasDefaultGeometry [ geo:asWKT ?o ] .
} WHERE {
  ?s wdt:P625 ?o
  FILTER(isLiteral(?o))
}
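
If the literals carrying a Wikidata entity URI as CRS should be skipped as well, the FILTER could be extended as sketched below. The prefix test is an assumption based on the single Enceladus example and may not cover every case in the dump.

```sparql
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

CONSTRUCT {
  ?s geo:hasDefaultGeometry [ geo:asWKT ?o ] .
} WHERE {
  ?s wdt:P625 ?o
  # Keep literals only, and drop those that start with an entity URI as CRS.
  FILTER(isLiteral(?o) && !STRSTARTS(STR(?o), "<http://www.wikidata.org/entity/"))
}
```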
Then I restarted Fuseki and, surprisingly, got an OOM because something
in the GeoSPARQL setup seems to consume more memory than expected. Not a
big deal, increased Xmx to 8GB - start works.
Did some querying, works. Then I wanted to use the fancy short form for
topological functions, so I needed inferencing, which I had disabled
until now, so I enabled it in the assembler with
geosparql:inference true ;
Restarted Fuseki. Waited 10min, still computing the inferences ... I
guess everything happens in-memory, so let's wait some more minutes ... OOM
java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:699) ~[?:?]
at java.util.HashMap.putVal(HashMap.java:658) ~[?:?]
at java.util.HashMap.put(HashMap.java:607) ~[?:?]
at java.util.HashSet.add(HashSet.java:220) ~[?:?]
at org.apache.jena.util.iterator.UniqueFilter.test(UniqueFilter.java:38) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.util.iterator.FilterIterator.hasNext(FilterIterator.java:56) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.util.iterator.WrappedIterator.hasNext(WrappedIterator.java:103) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.util.iterator.FilterIterator.hasNext(FilterIterator.java:55) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.util.IteratorCollection.iteratorToList(IteratorCollection.java:63) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.graph.GraphUtil.addIteratorWorker(GraphUtil.java:144) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.graph.GraphUtil.addInto(GraphUtil.java:139) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.rdf.model.impl.ModelCom.add(ModelCom.java:195) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:332) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:277) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.geosparql.configuration.GeoSPARQLOperations.applyInferencing(GeoSPARQLOperations.java:259) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
at org.apache.jena.geosparql.assembler.GeoAssembler.createDataset(GeoAssembler.java:140) ~[fuseki-server.jar:4.5.0-SNAPSHOT]
Looks like 8GB of RAM is not enough for RDFS inference on the 18 million
triple graph. I did not try to increase the RAM further, as it takes too
much time anyway, and given that this would happen on every startup I do
not consider this a good option.
Next I tried the RDFS dataset feature Andy introduced, which allows for
"RDFS Simple" inference (subClassOf, subPropertyOf, domain, range).
This should be sufficient, as all we need is domain/range reasoning. So I
added it to the assembler and passed this RDFS dataset on to the
GeoSPARQL dataset. Restarted Fuseki and got instant access to the endpoint.
Tried a very simple query to see if inference works as I need:
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
select * {
  graph <http://wikidata.org/geosparql> {
    ?s1 a geo:Feature
  }
}
limit 10
Took ~100s on the initial query, quite long? OK, maybe there is some
initial setup and later requests will be fast. But it doesn't seem so: I
did a follow-up request, just renaming ?s1 to ?s2 to avoid caching, and
still got a ~75s response time. I'm not sure whether this is the
performance we can expect or whether I did something wrong.
Anyway, I'm currently loading the materialized variant of the dataset,
i.e. with the geo:Feature and geo:Geometry types attached, so that
inference can be disabled altogether. Then I'll try some more queries.
Has anybody else already tried to set up the spatial index on Wikidata?
Any experience or comments so far? Any suggestions on how to handle the
core data and also how to deal with the non-terrestrial data? Should we
avoid inference here (18M vs 36M triples)?
Cheers,
Lorenz