Re: GeoSPARQL process

Greg Albiston Sun, 14 Apr 2019 03:22:22 -0700

Hi,

There are a lot of permutations that a GeoSPARQL query could take which
can generate different values that may or may not be useful later on.
The general strategy is to keep what is generated for a while and, if it
isn't reused, then drop it. I don't think any of the Cache implementations
offer this or a suitable alternative.


The expiring-map removes entries that haven't been reused after a period
of time. The duration to retain, rate of checking and maximum size can
all be set. It is used for three purposes:

- The Geometry Wrapper object resulting from de-serialising the Geometry
Literals.
- The transformed Geometry Wrapper object from changing the spatial
reference system.
- The result of a spatial relation between two Geometry Literals to
avoid re-testing when Query Re-writing is applied.
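
To make the retention idea concrete, here is a minimal sketch. It is not
the expiring-map code or its API: the class name ExpiringCacheSketch, its
methods and the injectable clock are all hypothetical and only illustrate
"keep an entry while it is reused, drop it once it has gone unused for
the retention period". The rate of checking would come from whatever
calls sweep(), for example a scheduled task.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch only; not the API of the expiring-map library. */
class ExpiringCacheSketch<K, V> {

    /** A cached value plus the time it was last written or read. */
    private static final class Entry<V> {
        final V value;
        volatile long lastUsedMillis;

        Entry(V value, long now) {
            this.value = value;
            this.lastUsedMillis = now;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long retainMillis;   // how long an unused entry is kept
    private final LongSupplier clock;  // injectable so the sketch is testable

    ExpiringCacheSketch(long retainMillis, LongSupplier clock) {
        this.retainMillis = retainMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong()));
    }

    /** Returns the cached value and refreshes its timestamp, or null if absent. */
    V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) {
            return null;
        }
        e.lastUsedMillis = clock.getAsLong();
        return e.value;
    }

    /** Removes entries unused for longer than the retention period. */
    int sweep() {
        long cutoff = clock.getAsLong() - retainMillis;
        int removed = 0;
        for (Iterator<Entry<V>> it = entries.values().iterator(); it.hasNext();) {
            if (it.next().lastUsedMillis < cutoff) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return entries.size();
    }
}
```

An entry that is read before the sweep survives it, while one that was
only written and never read again is dropped.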

Most of the GeoSPARQL functions are between two Geometry Literals, so
one could be needed in the next iteration of the query and the other
could be needed later.

The first purpose has the biggest impact on performance, as it avoids
additional de-serialising of the Geometry Literals while Jena is
processing the query. Complex shapes, e.g. polygons, can be very costly
to extract.

The second purpose offers most benefit when complex shapes need
transforming. These transformations may be needed again during this
query but not the next. For example, the dataset is in SRS A. Query 1 is
a comparison with a set of values in SRS B. Query 2 is then a comparison
with a set of values in SRS C. The transformations cached for Query 1
are useless to Query 2 and may never be needed again.

The third purpose is due to GeoSPARQL allowing query re-writing where
the Geometry Literal isn't specified and instead Features and Geometries
are used, so a single query could test the same spatial relation up to
four times depending on bindings.
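
As an illustration of why caching the relation result helps, the sketch
below memoises a spatial test keyed on the relation name and the two
literals. Everything in it is an assumption made for the example: the
class RelationCacheSketch, the use of plain strings for the literals and
the caller-supplied predicate standing in for the real spatial test. It
does not show how Jena or the GeoSPARQL module actually stores results.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiPredicate;

/** Hypothetical sketch of memoising a spatial relation test. */
class RelationCacheSketch {

    // Key is (relation, source, target). Order matters because relations
    // such as "contains" are not symmetric.
    private final Map<List<String>, Boolean> results = new ConcurrentHashMap<>();
    private final AtomicInteger evaluations = new AtomicInteger();

    /**
     * Returns the outcome for this combination, running the supplied
     * (placeholder) spatial test only the first time it is requested.
     */
    boolean test(String relation, String sourceLiteral, String targetLiteral,
                 BiPredicate<String, String> spatialTest) {
        List<String> key = Arrays.asList(relation, sourceLiteral, targetLiteral);
        return results.computeIfAbsent(key, k -> {
            evaluations.incrementAndGet();
            return spatialTest.test(sourceLiteral, targetLiteral);
        });
    }

    /** Number of times the underlying spatial test was really executed. */
    int evaluations() {
        return evaluations.get();
    }
}
```

If re-writing causes the same pair to be tested four times, the
expensive test runs once and the other three calls are map lookups.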

The expiring-map is allowed to fill up while the query is processing and
then drops entries that aren't reused (in batches) or once the query
completes. Once it is full, new entries are quickly rejected but space
is freed up later from those entries not being re-used. A user with a
small dataset can cache everything while a large dataset can choose to
constrain it to get some benefit from caching without consuming vast
chunks of memory.
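
The "reject when full" behaviour can be sketched as below. Again this is
only an illustration under my own assumptions: BoundedAdmissionSketch and
its offer/remove methods are hypothetical names, and remove() stands in
for the periodic expiry that frees space later. The point is that a full
map turns a new entry away immediately instead of hunting for an
existing entry to evict.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a size-bounded map that rejects rather than evicts. */
class BoundedAdmissionSketch<K, V> {

    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final int maxSize;

    BoundedAdmissionSketch(int maxSize) {
        this.maxSize = maxSize;
    }

    /**
     * Stores the entry unless the map is full. Nothing is evicted to make
     * room. The size check is not atomic with the put, so under heavy
     * concurrency the bound is approximate in this sketch.
     */
    boolean offer(K key, V value) {
        if (entries.size() >= maxSize && !entries.containsKey(key)) {
            return false; // full: reject quickly, no search for a victim
        }
        entries.put(key, value);
        return true;
    }

    V get(K key) {
        return entries.get(key);
    }

    /** Stand-in for the later expiry sweep that frees up space. */
    void remove(K key) {
        entries.remove(key);
    }

    int size() {
        return entries.size();
    }
}
```

With a maximum size of two, a third entry is refused until one of the
first two has been removed, after which it is accepted.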

I tried using the Apache Commons Collections 4 LRUMap and it made
performance worse once it was filled (at a guess due to "one out, one
in" and constant searching). I only found one Java implementation of a
time-based cache. It seemed excessive to take on the whole dependency
for one class, and it wasn't as flexible as required.

Hopefully this clarifies why the expiring-map approach was adopted.

Thanks,

Greg

On 10/04/2019 16:50, ajs6f wrote:

Just out of curiosity, Greg, what is the functionality offered by Expiring Map 
that isn't offered by Jena's already-extant oaj.atlas.lib.Cache 
implementations? Is it the ability to manually trigger expirations?

ajs6f

On Apr 9, 2019, at 12:02 PM, Andy Seaborne <[email protected]> wrote:

[INFO] |  \- io.github.galbiston:expiring-map:jar:1.0.2:compile
