We have outstanding:

https://github.com/apache/jena/pull/47

which changes the cache to LRU from fixed.
That does not fix any memory leaks but might mitigate them.
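
For anyone unfamiliar with the distinction, here is a minimal sketch of LRU eviction in plain Java. It is not the code in the PR; the class name LruCacheSketch and its maxEntries parameter are made up for illustration, and it only relies on java.util.LinkedHashMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the cache implementation from the PR.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; evict the least recently used entry once over capacity.
        return size() > maxEntries;
    }
}
```

The bound keeps the cache from growing without limit, but anything still referenced from elsewhere stays reachable, which is why this can only mitigate a leak.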

There are two FIXMEs in the PR which could do with looking at.

    Andy

On 21/06/16 09:28, Dave Reynolds wrote:
Hi Martynas,

On 20/06/16 22:18, Martynas Jusevičius wrote:
Hey,

after using GenericRuleReasoner and InfModel more extensively, we
started experiencing memory leaks that eventually kill our webapp
because it runs out of heap space. Jena version is 2.11.0.

After some profiling, it seems that RETEEngine.clauseIndex and/or
RETEEngine.infGraph are retaining a lot of references. It might be
related to this report, but I'm not sure:
https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%3c5319b4e0.4060...@gmail.com%3E


If it is related to that then it is not a leak, it is "just" memory use.

A leak implies that when you turn over data then unused internal state
objects are not reclaimed. Are you continuously adding and deleting
data? If so then the delete should release the whole of the RETEEngine
state and start over. If that isn't happening then that's a bug but you
could work around with an explicit reset() or even delete and recreate
your InfGraph at that stage. A delete loses all the state anyway.

The suggestion was to use backward rules instead of forward rules.
I have read the following:
https://jena.apache.org/documentation/inference/#rules

But still I fail to understand in which situations backward rules
can/should be used instead of forward rules?

Forward rules are generally faster because they keep all that partially
matched state. So if you have stable data or just add triples
monotonically, and have a lot of queries, then generally use forward
rules for performance.

Backward rules (without tabling) keep no state so there's less memory
overhead and no cost for delete but they are slow and have to redo the
work for every query.

Strictly the performance trade-off is a bit more subtle than that.
Forward rules will try to work out all the entailments whereas backward
rules are just responding to specific queries. So if your queries only
touch a small part of the possible space then backward rules could be
more efficient. However in practice RDF rules seem to involve a lot of
unground terms and lots of rules match nearly every query.
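
As a rough illustration of that trade-off, here is a toy sketch in plain Java (no Jena involved) for a single rdfs9-style rule. The class ChainingSketch and its methods are invented for the example and say nothing about how the RETE or backward engines are actually implemented. The forward method materialises every entailed type up front and holds on to it; the backward method keeps nothing and redoes the walk for each query.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of one rule: (?x subClassOf ?y), (?a type ?x) -> (?a type ?y)
class ChainingSketch {
    // subClassOf edges: class -> direct superclasses
    final Map<String, Set<String>> superClasses = new HashMap<>();
    // asserted type facts: instance -> directly asserted classes
    final Map<String, Set<String>> assertedTypes = new HashMap<>();

    void subClassOf(String c, String parent) {
        superClasses.computeIfAbsent(c, k -> new HashSet<>()).add(parent);
    }

    void type(String instance, String c) {
        assertedTypes.computeIfAbsent(instance, k -> new HashSet<>()).add(c);
    }

    // Forward style: compute and retain every entailed (instance, class) pair.
    Map<String, Set<String>> forwardClosure() {
        Map<String, Set<String>> inferred = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : assertedTypes.entrySet()) {
            Set<String> all = new HashSet<>();
            Deque<String> todo = new ArrayDeque<>(e.getValue());
            while (!todo.isEmpty()) {
                String c = todo.pop();
                if (all.add(c)) {
                    todo.addAll(superClasses.getOrDefault(c, Collections.emptySet()));
                }
            }
            inferred.put(e.getKey(), all);
        }
        return inferred;
    }

    // Backward style: no retained state, the hierarchy is walked again per query.
    boolean backwardHasType(String instance, String target) {
        Deque<String> todo =
            new ArrayDeque<>(assertedTypes.getOrDefault(instance, Collections.emptySet()));
        Set<String> seen = new HashSet<>();
        while (!todo.isEmpty()) {
            String c = todo.pop();
            if (c.equals(target)) {
                return true;
            }
            if (seen.add(c)) {
                todo.addAll(superClasses.getOrDefault(c, Collections.emptySet()));
            }
        }
        return false;
    }
}
```

If only a handful of instances are ever queried, the backward method touches far less of the data than the full closure does; with many repeated queries the retained closure wins.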

Tabling allows you to selectively cache certain predicates which can
enable you to get more reasonable performance while keeping memory use
under control. You can also do some tuning of how the rules execute by
testing if variables are bound or not and using different clause
orderings for different query patterns.
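
The idea of selective caching can be sketched in plain Java as below. This is only an analogy for tabling, not Jena's implementation: TablingSketch, table() and solve() are hypothetical names, and the "rule" is just a function from a predicate name to a list of answers. Only predicates that were explicitly tabled have their answers retained; everything else is recomputed on each call.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical analogy for tabling: cache answers for chosen predicates only.
class TablingSketch {
    private final Set<String> tabled = new HashSet<>();
    private final Map<String, List<String>> table = new HashMap<>();
    int evaluations = 0; // counts how often the rule body really ran

    // Mark a predicate as tabled, so its answers are kept after the first evaluation.
    void table(String predicate) {
        tabled.add(predicate);
    }

    List<String> solve(String predicate, Function<String, List<String>> rule) {
        if (tabled.contains(predicate)) {
            List<String> cached = table.get(predicate);
            if (cached != null) {
                return cached; // answered from the table, no rule evaluation
            }
        }
        evaluations++;
        List<String> result = rule.apply(predicate);
        if (tabled.contains(predicate)) {
            table.put(predicate, result); // memory is spent only on tabled predicates
        }
        return result;
    }
}
```

The memory cost is limited to the tabled predicates, which is the performance/space dial described above.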

I guess simply replacing -> with <- will not be enough?

Unless you use non-monotonic predicates (which, sadly, you do) that
would be enough to get something working. In fact you don't even need to
do that. If you create a pure backward reasoner instance (as opposed to
the hybrid reasoner) it'll read forward-syntax rules but treat them as
backward.

The actual rules in question look like
this:

[gp:    (?class rdf:type
<http://www.w3.org/2000/01/rdf-schema#Class>), (?class ?p ?o), (?p
rdf:type owl:AnnotationProperty), (?p rdfs:isDefinedBy
<http://graphity.org/gp#>), (?subClass rdfs:subClassOf ?class),
(?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
noValue(?subClass ?p) -> (?subClass ?p ?o) ]

That's a horrible rule from the engine's point of view. The head is
completely ungrounded so when running backwards then it will need to run
for *every* triple pattern. [It also makes no sense to me as a use of
owl:AnnotationProperty but whatever.] You could try it backwards but put
the clauses in a more efficient order:

(?subClass ?p ?o) <-
      (?p rdf:type owl:AnnotationProperty),
      (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
      (?subClass rdfs:subClassOf ?class), (?class ?p ?o) .

The rdf:type rdfs:Class constraints are pointless since those are
implied by rdfs:subClassOf anyway. The noValue check is probably best
avoided for both cases.

Alternatively, depending on the nature of your space leak you could use
hybrid rules:

   (?p rdf:type owl:AnnotationProperty),
   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
     ->
       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
                              (?class ?p ?o) ]

That way the forward engine is only looking at your annotations and the
backward engine then has rules that have grounded predicates. You could
also table those predicates:

   (?p rdf:type owl:AnnotationProperty),
   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
     ->
       table(?p),
       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
                               (?class ?p ?o) ]

[gcdm:  (?template rdf:type <http://graphity.org/gp#Template>),
(?template <http://graphity.org/gc#defaultMode> ?o), (?subClass
rdfs:subClassOf ?template), (?subClass rdf:type
<http://graphity.org/gp#Template>), noValue(?subClass
<http://graphity.org/gc#defaultMode>) -> (?subClass
<http://graphity.org/gc#defaultMode> ?o) ]
[gcsm:  (?template rdf:type <http://graphity.org/gp#Template>),
(?template <http://graphity.org/gc#supportedMode> ?supportedMode),
(?subClass rdfs:subClassOf ?template), (?subClass rdf:type
<http://graphity.org/gp#Template>) -> (?subClass
<http://graphity.org/gc#supportedMode> ?supportedMode) ]

These two are more reasonable and could be used backwards or hybrid.

[rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]

That would work backwards. Depending on the scale of your data you might
want to table rdf:type for performance/space tradeoff.

Can these be rewritten as backward rules instead?

Sure, the challenge is performance tuning as noted above.

Does it involve code changes, such as calling reset() etc?

Shouldn't do.

Dave
