[ 
https://issues.apache.org/jira/browse/JENA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512113#comment-17512113
 ] 

Greg Albiston commented on JENA-2311:
-------------------------------------

I can see why the string concatenation would cause issues with large numbers of 
complex polygons. 

However, its purpose is to create a reproducible identifier. The query re-write 
mechanism seeks to replace the `Feature` and `Geometry` classes with the 
underlying `GeometryLiteral` they represent, so that the result can be re-used 
later, since a query could reach the same conclusion in multiple ways after 
re-writing.

The string concatenation needs to be looked at again and replaced with a test 
against an alternative representation of the triple: either the three original 
strings (i.e. geometry literal, property URI, geometry literal) or the wrapping 
objects returned in the query (provided the equality/equivalence of the objects 
is consistent, e.g. the same objects are returned).
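
As a rough sketch of the first option, the three strings could be held by 
reference in a small key object that is compared field by field, rather than 
being joined into one new string per lookup. The `TripleKey` and 
`TripleKeyDemo` names and the plain `HashMap` are hypothetical stand-ins for 
illustration only. They are not the existing `QueryRewriteIndex` code, and I 
have not measured whether this actually reduces memory use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical composite key: keeps the three parts by reference, no concatenation. */
final class TripleKey {
    private final String subjectLexicalForm;
    private final String predicateUri;
    private final String objectLexicalForm;

    TripleKey(String subjectLexicalForm, String predicateUri, String objectLexicalForm) {
        this.subjectLexicalForm = subjectLexicalForm;
        this.predicateUri = predicateUri;
        this.objectLexicalForm = objectLexicalForm;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof TripleKey)) {
            return false;
        }
        TripleKey that = (TripleKey) other;
        // Field-by-field comparison, so two keys built from equal strings are equal.
        return subjectLexicalForm.equals(that.subjectLexicalForm)
                && predicateUri.equals(that.predicateUri)
                && objectLexicalForm.equals(that.objectLexicalForm);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subjectLexicalForm, predicateUri, objectLexicalForm);
    }
}

class TripleKeyDemo {
    public static void main(String[] args) {
        // A plain map stands in for the real cache here (assumption for illustration).
        Map<TripleKey, Boolean> index = new HashMap<>();
        String polygon = "POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))";
        String point = "POINT(5 5)";
        String sfContains = "http://www.opengis.net/ont/geosparql#sfContains";

        index.put(new TripleKey(polygon, sfContains, point), Boolean.TRUE);

        // An independently built key for the same triple finds the cached entry.
        System.out.println(index.get(new TripleKey(polygon, sfContains, point))); // true
        // Swapping subject and object is a different triple, so there is no hit.
        System.out.println(index.get(new TripleKey(point, sfContains, polygon))); // null
    }
}
```

The key stays reproducible because equality depends only on the three values, 
which is the property the concatenated string was meant to provide.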

In terms of the proposed solution, it seems to be using a counter as the 
identifier. Is this not going to return a unique identifier for every result 
and so never have any cache hits? The memory consumption is stable because the 
cached data is constantly expiring and the cache is not assisting performance.

> query rewrite index does too expensive caching on geo literals
> --------------------------------------------------------------
>
>                 Key: JENA-2311
>                 URL: https://issues.apache.org/jira/browse/JENA-2311
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: GeoSPARQL
>    Affects Versions: Jena 4.4.0
>            Reporter: Lorenz Bühmann
>            Priority: Major
>
> Using a GeoSPARQL query with a geospatial property function, e.g.
> {code:java}
> SELECT * {
> :x geo:hasGeometry ?geo1 .
> ?s2 geo:hasGeometry ?geo2 .
> ?geo1 geo:sfContains ?geo2
> }
> {code}
> leads to heavy memory consumption for larger datasets - and we're not talking 
> about big data at all. Imagine taking a polygon and checking millions of 
> geometries for containment in it.
> In the {{QueryRewriteIndex}} class, a key is generated for caching, but 
> this is horribly expensive given that the string representation of geometries 
> is requested millions of times, leading to millions of byte arrays being 
> created and possibly to an OOM exception - we got it with 8GB assigned.
> The key generation for reference:
> {code:java}
> String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR + 
> predicate.getURI() + KEY_SEPARATOR + 
> objectGeometryLiteral.getLiteralLexicalForm();
> {code}
> My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}?) Guava 
> cache and use the integer values instead to generate the cache key. Or any 
> other more efficient data structure; I am not even sure a String is necessary.
> We tried a fix which works for us and keeps the memory consumption stable:
> {code:java}
>  private LoadingCache<Node, Integer> nodeIDCache;
>  private AtomicInteger cacheCounter;
> ...
>         cacheCounter = new AtomicInteger(0);
>         CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder();
>         if (maxSize > 0) {
>             builder = builder.maximumSize(maxSize);
>         }
>         if (expiryInterval > 0) {
>             builder = builder.expireAfterWrite(expiryInterval, 
> TimeUnit.MILLISECONDS);
>         }
>         nodeIDCache = builder.build(
>                         new CacheLoader<>() {
>                             public Integer load(Node key) {
>                                 return cacheCounter.incrementAndGet();
>                             }
>                         });
> {code}


