Re: Geo indexing Wikidata

2022-03-02 Thread Greg

Hi Lorenz,

Thanks for the info. I'll look into adding options to disable these
dynamic conversions between units and geometries, since they can create
such large errors over large distances. Users could then choose whether to
switch them on.

Thanks,

Greg



Re: Re: Geo indexing Wikidata

2022-02-23 Thread Lorenz Buehmann

Hi Greg,

thanks for providing such an informative answer.

On 23.02.22 17:59, Greg wrote:

> Hi Lorenz,
>
> Regarding your final point on the use of Euclidean distance for
> geof:distance, this is derived from Requirement A.3.1.4 on page 38 of the
> GeoSPARQL standard (quoted below). The definition of the distance and
> other query functions follows that of the Simple Features standard (ISO
> 19125-1). The Simple Features standard uses a two-dimensional planar
> approach: the distance calculation is Euclidean, and Great Circle is out
> of scope. Applying the distance function to non-planar SRS coordinates
> is regarded as an acceptable error.

Ok, I see - now I'm understanding better.

> The Jena implementation follows the GeoSPARQL standard by converting the
> second Geometry Literal's coordinates to the first Geometry Literal's
> SRS, if required.
>
> A Great Circle distance filter function has been provided as
> *spatialF:greatCircleGeom(...)*
> (https://jena.apache.org/documentation/geosparql/). This is an extension
> namespace for Jena as it is outside the GeoSPARQL standard.

Yep, as you can see from my query, I used spatialF:distance in the end,
which maps to Haversine.

> Could you provide the WKT Geometry Literals returned by your query, so
> that they can be tested directly for the asymmetry?


Sure, the data comes from Wikidata, but here is a self-contained query
with just the WKT literals in the VALUES clause:

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX spatialF: <http://jena.apache.org/function/spatial#>
PREFIX afn: <http://jena.apache.org/ARQ/function#>

SELECT * {
  VALUES ?wkt1 {"Point(11.4167 53.6333)"^^geo:wktLiteral "Point(11.575 48.1375)"^^geo:wktLiteral}
  VALUES ?wkt2 {"Point(11.4167 53.6333)"^^geo:wktLiteral "Point(11.575 48.1375)"^^geo:wktLiteral}

  #FILTER(?wkt1 != ?wkt2 && str(?wkt1) < str(?wkt2))
  BIND(geof:distance(?wkt1, ?wkt2, uom:kilometer) as ?d1)
  BIND(geof:distance(?wkt2, ?wkt1, uom:kilometer) as ?d2)
  BIND(abs(?d1 - ?d2) as ?diff_d1_d2)
  BIND(spatialF:distance(?wkt1, ?wkt2, uom:kilometer) as ?d_hav)
  BIND(afn:max(abs(?d1 - ?d_hav), abs(?d2 - ?d_hav)) as ?diff_eucl_hav)
}

Note that I commented out the filter because there must be some bug here:
FILTER(?wkt1 != ?wkt2) always leads to an error or false. Can somebody
verify this?


I also checked the source code: indeed, the raw Euclidean measure is the
same for two points p1 and p2, but the post-processing that maps the value
to a unit like kilometers does more math and depends on the starting
longitude value.
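
To make the asymmetry concrete, here is a minimal Python sketch. It is not Jena's code: it assumes the degree-to-kilometre conversion scales the planar distance by the length of one degree at the first point's latitude, which is an inference from the numbers reported in this thread. The constants and the function names `planar_km` and `haversine_km` are hypothetical choices for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088       # mean Earth radius (assumed constant)
KM_PER_DEGREE_EQUATOR = 111.3195  # length of one degree at the equator (assumed)


def planar_km(p_from, p_to):
    """Euclidean distance in degrees, scaled to km at p_from's latitude."""
    (lon1, lat1), (lon2, lat2) = p_from, p_to
    degrees = math.hypot(lon2 - lon1, lat2 - lat1)  # symmetric in the two points
    # The scale factor depends only on the first point, hence the asymmetry.
    km_per_degree = KM_PER_DEGREE_EQUATOR * math.cos(math.radians(lat1))
    return degrees * km_per_degree


def haversine_km(p1, p2):
    """Great-circle distance on a sphere (symmetric)."""
    (lon1, lat1), (lon2, lat2) = p1, p2
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The two WKT points from the query above, as (lon, lat) pairs.
p1 = (11.4167, 53.6333)
p2 = (11.575, 48.1375)

d1 = planar_km(p1, p2)
d2 = planar_km(p2, p1)
d_hav = haversine_km(p1, p2)
print(f"d1={d1:.1f} km  d2={d2:.1f} km  haversine={d_hav:.1f} km")
```

Under these assumptions the sketch gives roughly 363 km and 408 km for the two directions and roughly 611 km for Haversine, so the planar figures differ from each other and both fall well short of the great-circle distance.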






Re: Geo indexing Wikidata

2022-02-23 Thread Greg

Hi Lorenz,

Regarding your final point on the use of Euclidean distance for
geof:distance, this is derived from Requirement A.3.1.4 on page 38 of the
GeoSPARQL standard (quoted below). The definition of the distance and
other query functions follows that of the Simple Features standard (ISO
19125-1). The Simple Features standard uses a two-dimensional planar
approach: the distance calculation is Euclidean, and Great Circle is out
of scope. Applying the distance function to non-planar SRS coordinates
is regarded as an acceptable error.

The Jena implementation follows the GeoSPARQL standard by converting the
second Geometry Literal's coordinates to the first Geometry Literal's
SRS, if required.

A Great Circle distance filter function has been provided as
*spatialF:greatCircleGeom(...)*
(https://jena.apache.org/documentation/geosparql/). This is an extension
namespace for Jena as it is outside the GeoSPARQL standard.

Could you provide the WKT Geometry Literals returned by your query, so
that they can be tested directly for the asymmetry?

Thanks,

Greg

*A.3.1.4 /conf/geometry-extension/query-functions*

Requirement: /req/geometry-extension/query-functions
Implementations shall support geof:distance, geof:buffer,
geof:convexHull, geof:intersection, geof:union, geof:difference,
geof:symDifference, geof:envelope and geof:boundary as SPARQL extension
functions, consistent with the definitions of the corresponding
functions (distance, buffer, convexHull, intersection, difference,
symDifference, envelope and boundary respectively) in Simple Features
[ISO 19125-1].




Re: Re: Geo indexing Wikidata

2022-02-23 Thread Lorenz Buehmann
Thanks both for your very helpful input - I'm still a GeoSPARQL novice
and trying to learn stuff and, first of all, just use the Jena
implementation as efficiently as possible.


On 21.02.22 15:22, Andy Seaborne wrote:

> On 21/02/2022 09:07, Lorenz Buehmann wrote:
>
> > Any experience or comments so far?
>
> Using SubsystemLifecycle, could make the conversions by
>
>     GeoSPARQLOperations.convertGeoPredicates
>
> extensible.
>
>     Andy
>
> But having coordinate location (P625), located on astronomical body
> (P376) as properties of a thing, is dangerous because of monotonicity
> in RDF:
>
>    SELECT * { ?x wdt:P625 ?coords }
>
> the association of P625 and P376 is lost.

Yep, I could simply omit the extraterrestrial entities for now when
storing the GeoSPARQL-conformant triples in the separate graph. Clearly,
this would need the full Wikidata dump, as qualifiers are not contained
in truthy.

As Marco pointed out, there is an ongoing discussion in the Wikidata community:
https://www.wikidata.org/wiki/Wikidata:Property_proposal/planetary_coordinates

> What is the range of P625? It is not "earth geometry" any more.
> What if there is no P376 on ?x?

Wikidata doesn't really have a concept of range, or let's say they do not
make use of RDFS at all. They use "property constraints", and if I look
at https://www.wikidata.org/wiki/Property:P625 they more or less define
some kind of domain: "not being human or company or railway" and some
weirder ones like "not being a female given name" etc. I can't see any
range, at least not in a structured data format; maybe in some discussion
only.

Currently, I'd treat absence of P376 as "on Earth", but that's just my
interpretation.

> As with any n-ary-like relationship, the indirection keeps the related
> properties together.
>
> This is not unique to geo. Temperatures with units for example



-

This brings me to another "issue", or let's call it unexpected behavior,
which for me is counter-intuitive:

I used the geof:distance function, which according to the GeoSPARQL
standard is defined as

    Returns the shortest distance in units between any two Points in the
    two geometric objects as calculated in the spatial reference system
    of geom1.

so I'd expect a metric appropriate to the CRS in use, and if the CRS is
absent it should be CRS84. But Jena implements just the Euclidean distance
according to the source code; is this intended? Here is an example of a few
cities in Germany with their pairwise distances as well as the Haversine
distance:



PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX spatialF: <http://jena.apache.org/function/spatial#>
PREFIX afn: <http://jena.apache.org/ARQ/function#>

SELECT ?s ?o ?d1 ?d2 ?diff_d1_d2 ?d_hav ?diff_eucl_hav {
  VALUES ?s {wd:Q1709 wd:Q64 wd:Q1729 wd:Q1718 wd:Q1726}
  VALUES ?o {wd:Q1709 wd:Q64 wd:Q1729 wd:Q1718 wd:Q1726}
  ?s wdt:P625 ?wkt1 .
  ?o wdt:P625 ?wkt2 .
  FILTER(?s != ?o && str(?s) < str(?o))
  BIND(geof:distance(?wkt1, ?wkt2, uom:kilometer) as ?d1)
  BIND(geof:distance(?wkt2, ?wkt1, uom:kilometer) as ?d2)
  BIND(abs(?d1 - ?d2) as ?diff_d1_d2)
  BIND(spatialF:distance(?wkt1, ?wkt2, uom:kilometer) as ?d_hav)
  BIND(afn:max(abs(?d1 - ?d_hav), abs(?d2 - ?d_hav)) as ?diff_eucl_hav)
}

with result


+----------+----------+--------------+--------------+------------------+--------------+------------------+
| s        | o        | d1           | d2           | diff_d1_d2       | d_hav        | diff_eucl_hav    |
+----------+----------+--------------+--------------+------------------+--------------+------------------+
| wd:Q1709 | wd:Q64   | 149.280218e0 | 153.202637e0 | 3.92241900019e0  | 180.75785e0  | 31.477632e0      |
| wd:Q1709 | wd:Q1729 | 177.123944e0 | 188.077111e0 | 10.9531678e0     | 296.42569e0  | 119.3017459998e0 |
| wd:Q1709 | wd:Q1718 | 345.13558e0  | 364.477344e0 | 19.34176400012e0 | 412.752229e0 | 67.616649e0      |
| wd:Q1709 | wd:Q1726 | 362.915021e0 | 408.448278e0 | 45.533256e0      | 611.210126e0 | 248.2951049992e0 |
| wd:Q1729 | wd:Q64   | 197.116217e0 | 190.514338e0 | 6.6018787e0      | 235.639289e0 | 45.1249509998e0  |
| wd:Q1718 | wd:Q64   | 469.456614e0 | 456.224537e0 | 13.2320774e0     | 475.626349e0 | 19.4018127e0     |
| wd:Q1718 | wd:Q1729 | 297.248804e0 | 298.880777e0 | 1.631973000163e0 | 298.493316e0 | 1.24451199986e0  |
| wd:Q1718 | wd:Q1726 | 398.21636e0  | 424.395158e0 | 26.17879799972e0 | 487.365165e0 | 89.1488049998e0  |
| wd:Q1726 | wd:Q64   | 351.968792e0 | 320.94899e0  | 31.01980200027e0 | 503.534544e0 | 182.585554e0     |
| wd:Q1726 | wd:Q1729 | 214.882078e0 | 202.734071e0 | 12.1480077e0     | 318.297549e0 | 115.563478e0     |
+----------+----------+--------------+--------------+------------------+--------------+------------------+


Re: Geo indexing Wikidata

2022-02-21 Thread Andy Seaborne




On 21/02/2022 09:07, Lorenz Buehmann wrote:

> Any experience or comments so far?

Using SubsystemLifecycle, could make the conversions by

    GeoSPARQLOperations.convertGeoPredicates

extensible.

    Andy

But having coordinate location (P625), located on astronomical body 
(P376) as properties of a thing, is dangerous because of monotonicity in 
RDF:


   SELECT * { ?x wdt:P625 ?coords }

the association of P625 and P376 is lost.

What is the range of P625? It is not "earth geometry" any more.
What if there is no P376 on ?x?

As with any n-ary-like relationship, the indirection keeps the related 
properties together.


This is not unique to geo. Temperatures with units for example.


Re: Geo indexing Wikidata

2022-02-21 Thread Marco Neumann
I have a vague memory of running into this issue in the past, Lorenz. The
Wikidata RDF serialization does not conform strictly to the OGC GeoSPARQL
data model, nor do the geospatial access methods of the Blazegraph extension
attempt to exhibit any of the standards-compliant features. That's
something the Wikidata community had to address. We should not assume OGC
GeoSPARQL compliance here even though from a Semantic Web point of view it
might be desirable.

I think your workaround is sound as far as OGC GeoSPARQL is concerned. But
I am sure you are familiar with the basic OGC GeoSPARQL data model, which
requires a spatial object to connect via hasGeometry to a Geometry
that contains a WKT, which is "just" another literal after all. The current
Apache Jena implementation follows the OGC standard here more or less
strictly. An alleviating factor seems to be that the Wikidata project
only makes use of Point "geometry" at the moment. Of course there is
nothing that prevents you or anyone from picking up alternative shapes in
the data, as we did in the past with the W3C:WGS84 and Lucene Spatial
Jena geospatial implementation. Apache Jena now provides the
spatialF:convertLatLon function to transform the data into WKT.

But rather than hacking the data into a suitable form for Apache Jena, I'd
recommend discussing this further with a Wikidata-focused group of people
that would like to improve the Wikidata data model. It's also possible that
the Apache Jena community provides an alternative loading syntax for
Wikidata in the future.

I have noticed that the Wikidata community has recently started to
introduce planetary-specific / non-terrestrial location identifiers as
well.[1][2] As far as I understand, Apache Jena GeoSPARQL
currently only supports reference systems relating to Earth via Apache SIS
and the EPSG geodetic dataset.

The GeoLD2022 - 5th International Workshop on Geospatial Linked Data at
ESWC 2022 might be a good place to discuss such an effort.
https://i3mainz.github.io/GeoLD2022/

[1]
https://www.wikidata.org/wiki/Wikidata:Property_proposal/planetary_coordinates
[2] https://arxiv.org/ftp/arxiv/papers/1706/1706.02683.pdf



Geo indexing Wikidata

2022-02-21 Thread Lorenz Buehmann

Hi,

we can use this as a complementary thread to the ongoing "loading
Wikidata" threads, this time with a focus on the geospatial part.

Joachim already did the same for the text index and it works as
expected, though loading time could still be improved.



For the geospatial index things are different. Summary and current state:

- Wikidata stores the coordinates in the wdt:P625 property

- the literal values are of type geo:wktLiteral

- so far so good? Well, not really ... The Jena geospatial components
expect the data to either

 a) follow the GeoSPARQL standard, i.e. have a geometry object
with a serialization linking to the WKT literal, or

 b) be given as WGS lat/lon literals, with the conversion done
before indexing

- apparently neither a) nor b) holds for Wikidata, as the WKT literal is
simply attached directly to an entity via the wdt:P625 property, so we do
not have it in a form like

    wd:Q3150 geo:hasDefaultGeometry [ geo:asWKT "Point(11.58638 50.92722)"^^geo:wktLiteral ] .

nor do we have it as

    wd:Q3150 wgs:lon "11.58638"^^xsd:double ;
             wgs:lat "50.92722"^^xsd:double .

All we have is

    wd:Q3150 wdt:P625 "Point(11.58638 50.92722)"^^geo:wktLiteral .

So what does this mean? Well, you'll see the following output when
starting a Fuseki with geosparql assembler used:

./fuseki-server --conf ~/fuseki-wikidata-geosparql-assembler.ttl
09:20:46 WARN  system  :: The “SIS_DATA” environment variable is not set.
09:20:46 INFO  Server  :: Apache Jena Fuseki 4.5.0-SNAPSHOT 2022-02-17T09:59:26Z
09:20:46 INFO  Config  :: FUSEKI_HOME=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/.
09:20:46 INFO  Config  :: FUSEKI_BASE=/home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run
09:20:46 INFO  Config  :: Shiro file: file:///home/user/apache-jena-fuseki-4.5.0-SNAPSHOT/run/shiro.ini
09:20:47 INFO  GeoSPARQLOperations :: Find Mode SRS - Started
09:20:47 INFO  GeoSPARQLOperations :: Find Mode SRS - Completed
09:20:47 WARN  GeoAssembler    :: No SRS found. Check 'http://www.opengis.net/ont/geosparql#hasSerialization' or 'http://www.w3.org/2003/01/geo/wgs84_pos#lat'/'http://www.w3.org/2003/01/geo/wgs84_pos#lon' predicates are present in the source data. Hint: Inferencing with GeoSPARQL schema may be required.
09:20:47 INFO  Server  :: Path = /ds
09:20:47 INFO  Server  :: System
09:20:47 INFO  Server  ::   Memory: 4.0 GiB
09:20:47 INFO  Server  ::   Java:   11.0.11
09:20:47 INFO  Server  ::   OS: Linux 5.4.0-90-generic amd64
09:20:47 INFO  Server  ::   PID:    1866352
09:20:47 INFO  Server  :: Started 2022/02/19 09:20:47 CET on port 3030

so technically nothing happens because the data is not in one of the
expected formats.


So what could be a workaround here? Clearly, we could execute a SPARQL
Update statement to add the expected triples. There are ~9 million
wdt:P625 triples, which means 18 million additional triples (either
variant takes 2 triples per entity), or 36 million triples if we want to
avoid the need for inference (2 more triples per entity are necessary to
add the geo:Feature and geo:Geometry types). So for each entity we add
triples like

    wd:Q3150 a geo:Feature ;
             geo:hasDefaultGeometry [ a geo:Geometry ;
                                      geo:asWKT "Point(11.58638 50.92722)"^^geo:wktLiteral ] .
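
The expansion described above can be sketched in Python as follows. This is only an illustration of the four triples per entity, written out as N-Triples instead of being produced by a SPARQL Update; the function name `geosparql_triples` and the blank-node labelling scheme are hypothetical, and the WKT string is assumed to contain no characters that need escaping.

```python
GEO = "http://www.opengis.net/ont/geosparql#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def geosparql_triples(entity_iri, wkt):
    """Return the four N-Triples lines for one entity and its WKT string."""
    # Hypothetical blank-node label derived from the last IRI segment.
    geom = "_:geom_" + entity_iri.rsplit("/", 1)[-1]
    # Assumes the WKT text needs no escaping (no quotes or backslashes).
    literal = f'"{wkt}"^^<{GEO}wktLiteral>'
    return [
        f"<{entity_iri}> <{RDF_TYPE}> <{GEO}Feature> .",
        f"<{entity_iri}> <{GEO}hasDefaultGeometry> {geom} .",
        f"{geom} <{RDF_TYPE}> <{GEO}Geometry> .",
        f"{geom} <{GEO}asWKT> {literal} .",
    ]


lines = geosparql_triples("http://www.wikidata.org/entity/Q3150",
                          "Point(11.58638 50.92722)")
print("\n".join(lines))

# Four triples per entity for ~9 million wdt:P625 values.
print(9_000_000 * len(lines))
```

Dropping the two type triples gives the 18 million figure; keeping them gives the 36 million mentioned above.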


If we do not care about the dataset size, this is not a big deal, I
guess. But for querying it matters, as we have to consider this different,
non-Wikidata format. Admittedly, we don't have to write

    ?subj geo:hasDefaultGeometry ?subjGeom .
    ?subjGeom geo:asWKT ?subjLit .
    ?obj geo:hasDefaultGeometry ?objGeom .
    ?objGeom geo:asWKT ?objLit .
    FILTER(geof:sfContains(?subjLit, ?objLit))

because with query rewriting enabled the topological functions
are fine, as we can write

    ?subj geo:sfContains ?obj .


But for queries using non-topological functions like distance it
matters: we either have to go the full path from entity to literal,
or we just take the literal from the original wdt:P625 triple, which
would be fine. I also noticed that no spatial index is used for
distance functions.


So far so good. Now we come to the data quality, which I didn't check in
the first step. Some observations I made that have to be considered:


- there are some non-literal wdt:P625 triples; those should be omitted
in the SPARQL Update statement
- some wdt:P625 triples, or rather their WKT literal values, refer to
coordinates on other planets, such as Mars. The problem here
is that Wikidata decided to use the corresponding Wikidata planet entity
URI as the CRS inside the WKT literal. Clearly Jena can't parse those, as
this is misleading. There are some CRS URIs for planets, but those
haven't been used for the non-terrestrial geo literals. For example


    wd:Q2267142 wdt:P31 wd:Q1439394 ; wdt:P376 wd:Q3303 ; wdt:P2824 "2727" ;