Hi Martynas,

If it really is a different schema and data each time then you can't cache.

If you only have a small number of schemas then you could use bindSchema to generate a set of partially-evaluated reasoners, cache those, and pick the right one to use for a given set of message headers.
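
For illustration only, a minimal sketch of that idea (the class and cache key are made up; in your case the key could be whatever identifies the schema in the response headers):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import com.hp.hpl.jena.rdf.model.InfModel;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import com.hp.hpl.jena.reasoner.Reasoner;
  import com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner;

  public class BoundReasonerCache
  {
      // one partially-evaluated (schema-bound) reasoner per schema
      private final Map<String, Reasoner> bound = new ConcurrentHashMap<>();
      private final GenericRuleReasoner base;

      public BoundReasonerCache(GenericRuleReasoner base)
      {
          this.base = base;
      }

      // bindSchema() runs once per schema key; later requests reuse the result
      public InfModel infModelFor(String schemaKey, Model schema, Model data)
      {
          Reasoner bnd = bound.computeIfAbsent(schemaKey,
                  key -> base.bindSchema(schema.getGraph()));
          return ModelFactory.createInfModel(bnd, data);
      }
  }

That way the cost of bindSchema()/prepare() is paid once per schema rather than once per request.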

The other option (apart from stopping using rules) would be to use backward rules. As already discussed the forward engine does all the inferences up front whereas the backward rules do them on demand.

Dave

On 23/06/16 21:38, Martynas Jusevičius wrote:
Hey again,

I have profiled the CPU time, and it seems that a lot of it (93.5%
after some 22500 HTTP requests) is spent in the following methods:

com.hp.hpl.jena.rdf.model.ModelFactory.createInfModel(com.hp.hpl.jena.reasoner.Reasoner, com.hp.hpl.jena.rdf.model.Model, com.hp.hpl.jena.rdf.model.Model)
  com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner.bindSchema(com.hp.hpl.jena.graph.Graph)
    com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.prepare()
      com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.fastInit(com.hp.hpl.jena.reasoner.Finder)

It's probably not smart to create an InfModel with every
request/response, but in my case it is built from the HTTP response
body and metadata only: the Model from the response body, and the
schema OntModel from the header metadata, so I'm not sure how it could
be cached. Here is the code:
https://github.com/AtomGraph/Processor/blob/master/src/main/java/org/graphity/processor/filter/response/HypermediaFilter.java#L107

I would appreciate suggestions on how to improve performance.

Martynas

On Tue, Jun 21, 2016 at 10:28 AM, Dave Reynolds
<dave.e.reyno...@gmail.com> wrote:
Hi Martynas,

On 20/06/16 22:18, Martynas Jusevičius wrote:

Hey,

after using GenericRuleReasoner and InfModel more extensively, we
started experiencing memory leaks that eventually kill our webapp
because it runs out of heap space. Jena version is 2.11.0.

After some profiling, it seems that RETEEngine.clauseIndex and/or
RETEEngine.infGraph are retaining a lot of references. It might be
related to this report, but I'm not sure:

https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%3c5319b4e0.4060...@gmail.com%3E


If it is related to that then it is not a leak, it is "just" memory use.

A leak implies that when you turn over data then unused internal state
objects are not reclaimed. Are you continuously adding and deleting data? If
so then the delete should release the whole of the RETEEngine state and
start over. If that isn't happening then that's a bug, but you could work
around it with an explicit reset(), or even delete and recreate your InfGraph at
that stage. A delete loses all the state anyway.
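
As a sketch of that work-around (names are illustrative, not taken from your code; assumes you hold an InfModel rather than a raw InfGraph):

  import com.hp.hpl.jena.rdf.model.InfModel;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import com.hp.hpl.jena.reasoner.Reasoner;

  public class InfModelRefresh
  {
      // Option 1: clear the rule engine's internal state in place
      public static void resetInPlace(InfModel inf)
      {
          inf.reset();   // drop cached internal state
          inf.rebind();  // re-prepare against the raw data on the next query
      }

      // Option 2: discard the old InfModel and build a fresh one
      public static InfModel recreate(Reasoner reasoner, Model rawData)
      {
          return ModelFactory.createInfModel(reasoner, rawData);
      }
  }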

The suggestion was to use backward rules instead of forward rules.
I have read the following:
https://jena.apache.org/documentation/inference/#rules

But still I fail to understand in which situations backward rules
can/should be used instead of forward rules?


Forward rules are generally faster because they keep all that partially
matched state. So if you have stable data or just add triples monotonically,
and have a lot of queries, then generally use forward rules for performance.

Backward rules (without tabling) keep no state so there's less memory
overhead and no cost for delete but they are slow and have to redo the work
for every query.

Strictly the performance trade-off is a bit more subtle than that. Forward
rules will try to work out all the entailments whereas backward rules are
just responding to specific queries. So if your queries only touch a small
part of the possible space then backward rules could be more efficient.
However in practice RDF rules seem to involve a lot of unground terms and
lots of rules end up matching nearly every query.

Tabling allows you to selectively cache certain predicates which can enable
you to get more reasonable performance while keeping memory use under
control. You can also do some tuning of how the rules execute by testing if
variables are bound or not and using different clause orderings for
different query patterns.
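
As a rough illustration only (the rule is a simplified, untested variant of your gcdm rule below), a table directive can go straight into the rule text and the reasoner can be run in hybrid mode:

  import com.hp.hpl.jena.rdf.model.InfModel;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner;
  import com.hp.hpl.jena.reasoner.rulesys.Rule;

  public class TabledRuleExample
  {
      public static InfModel build(Model data)
      {
          // table(...) asks the LP engine to memoize results for this predicate
          String rules =
                "-> table(<http://graphity.org/gc#defaultMode>). "
              + "[gcdm: (?subClass <http://graphity.org/gc#defaultMode> ?o) <- "
              + "       (?subClass rdfs:subClassOf ?template), "
              + "       (?template <http://graphity.org/gc#defaultMode> ?o) ] ";

          GenericRuleReasoner reasoner = new GenericRuleReasoner(Rule.parseRules(rules));
          reasoner.setMode(GenericRuleReasoner.HYBRID); // forward + tabled backward (LP) engines
          return ModelFactory.createInfModel(reasoner, data);
      }
  }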

  I guess simply replacing
-> with <- will not be enough?


Unless you use non-monotonic predicates (which, sadly, you do), that would be
enough to get something working. In fact you don't even need to do that: if
you create a pure backward reasoner instance (as opposed to the hybrid
reasoner) it'll read forward syntax rules but treat them as backward.
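
As a minimal sketch of that (the rules file name is made up, and this ignores the noValue issue for the moment):

  import java.util.List;
  import com.hp.hpl.jena.rdf.model.InfModel;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner;
  import com.hp.hpl.jena.reasoner.rulesys.Rule;

  public class BackwardReasonerExample
  {
      public static InfModel build(Model data)
      {
          // load the existing "... -> ..." rules unchanged
          List<Rule> rules = Rule.rulesFromURL("file:rules.txt");

          // a pure backward reasoner keeps no RETE state and only does
          // inference work in response to queries
          GenericRuleReasoner reasoner = new GenericRuleReasoner(rules);
          reasoner.setMode(GenericRuleReasoner.BACKWARD);

          return ModelFactory.createInfModel(reasoner, data);
      }
  }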

The actual rules in question look like this:

[gp:    (?class rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
        (?class ?p ?o),
        (?p rdf:type owl:AnnotationProperty),
        (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
        (?subClass rdfs:subClassOf ?class),
        (?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
        noValue(?subClass ?p)
        -> (?subClass ?p ?o) ]


That's a horrible rule from the engine's point of view. The head is
completely ungrounded, so when running backwards it will need to run for
*every* triple pattern. [It also makes no sense to me as a use of
owl:AnnotationProperty but whatever.] You could try it backwards but put the
clauses in a more efficient order:

(?subClass ?p ?o) <-
      (?p rdf:type owl:AnnotationProperty),
      (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
      (?subClass rdfs:subClassOf ?class), (?class ?p ?o) .

The rdf:type rdfs:Class constraints are pointless since those are implied by
rdfs:subClassOf anyway. The noValue check is probably best avoided for both
cases.

Alternatively, depending on the nature of your space leak you could use
hybrid rules:

   (?p rdf:type owl:AnnotationProperty),
   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
     ->
       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
                              (?class ?p ?o) ]

That way the forward engine is only looking at your annotations and the
backward engine then has rules that have grounded predicates. You could also
table those predicates:

   (?p rdf:type owl:AnnotationProperty),
   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
     ->
       table(?p),
       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
                               (?class ?p ?o) ]

[gcdm:  (?template rdf:type <http://graphity.org/gp#Template>),
        (?template <http://graphity.org/gc#defaultMode> ?o),
        (?subClass rdfs:subClassOf ?template),
        (?subClass rdf:type <http://graphity.org/gp#Template>),
        noValue(?subClass <http://graphity.org/gc#defaultMode>)
        -> (?subClass <http://graphity.org/gc#defaultMode> ?o) ]

[gcsm:  (?template rdf:type <http://graphity.org/gp#Template>),
        (?template <http://graphity.org/gc#supportedMode> ?supportedMode),
        (?subClass rdfs:subClassOf ?template),
        (?subClass rdf:type <http://graphity.org/gp#Template>)
        -> (?subClass <http://graphity.org/gc#supportedMode> ?supportedMode) ]


These two are more reasonable and could be used backwards or hybrid.

[rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]


That would work backwards. Depending on the scale of your data you might
want to table rdf:type as a performance/space tradeoff.

Can these be rewritten as backward rules instead?


Sure, the challenge is performance tuning as noted above.

Does it involve code changes, such as calling reset() etc?

Shouldn't do.

Dave
