Hey again,

I have profiled the CPU time, and it seems that a lot of it (93.5%
after some 22500 HTTP requests) is spent in the following methods:

com.hp.hpl.jena.rdf.model.ModelFactory.createInfModel(com.hp.hpl.jena.reasoner.Reasoner, com.hp.hpl.jena.rdf.model.Model, com.hp.hpl.jena.rdf.model.Model)
  com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner.bindSchema(com.hp.hpl.jena.graph.Graph)
    com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.prepare()
      com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.fastInit(com.hp.hpl.jena.reasoner.Finder)

It is probably not smart to create an InfModel for every
request/response. But in my case it is built from the HTTP response
body and metadata only: the Model comes from the response body and the
schema OntModel from header metadata, so I'm not sure how it could be
cached. Here is the code:
https://github.com/AtomGraph/Processor/blob/master/src/main/java/org/graphity/processor/filter/response/HypermediaFilter.java#L107
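
A rough sketch of what caching might look like here, not a tested fix. It
assumes the schema can be identified by a stable key, such as an ontology
URI taken from the response headers (an assumption, the linked filter may
not expose one). The idea is to run the expensive bindSchema step once per
distinct schema and reuse the result, so that only the per-response Model
is bound on each request. The class names SchemaBoundCache and
SchemaBoundCacheDemo are hypothetical, and the binder below is a stand-in
that does not call Jena at all.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Caches an expensive "schema-bound" object per schema key, so that the
 * binding step runs once per distinct schema instead of once per request.
 * K and V are placeholders: in a Jena setup K might be an ontology URI and
 * V the Reasoner returned by reasoner.bindSchema(schema) (assumption).
 */
final class SchemaBoundCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> binder;

    SchemaBoundCache(Function<K, V> binder) {
        this.binder = binder;
    }

    /** Returns the cached value for the key, binding it on first use only. */
    V get(K schemaKey) {
        return cache.computeIfAbsent(schemaKey, binder);
    }

    /** Drops a cached entry, e.g. when the schema behind the key changes. */
    void invalidate(K schemaKey) {
        cache.remove(schemaKey);
    }
}

final class SchemaBoundCacheDemo {
    public static void main(String[] args) {
        AtomicInteger bindCalls = new AtomicInteger();
        // Stand-in for the expensive bindSchema step; a real binder would
        // load the schema for the URI and bind it to the rule reasoner.
        SchemaBoundCache<String, String> cache = new SchemaBoundCache<>(uri -> {
            bindCalls.incrementAndGet();
            return "bound:" + uri;
        });

        // Simulate many requests that all refer to the same schema.
        for (int i = 0; i < 1000; i++) {
            cache.get("http://example.org/ontology#"); // hypothetical URI
        }
        System.out.println("bind calls: " + bindCalls.get()); // prints "bind calls: 1"
    }
}
```

With a cache like this, the per-request step would presumably shrink to
ModelFactory.createInfModel(boundReasoner, model) with the cached bound
reasoner. Whether that removes most of the fastInit cost depends on how
much of it comes from the schema rather than the response data.
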

I would appreciate suggestions on how to improve performance.

Martynas

On Tue, Jun 21, 2016 at 10:28 AM, Dave Reynolds
<dave.e.reyno...@gmail.com> wrote:
> Hi Martynas,
>
> On 20/06/16 22:18, Martynas Jusevičius wrote:
>>
>> Hey,
>>
>> after using GenericRuleReasoner and InfModel more extensively, we
>> started experiencing memory leaks that eventually kill our webapp
>> because it runs out of heap space. Jena version is 2.11.0.
>>
>> After some profiling, it seems that RETEEngine.clauseIndex and/or
>> RETEEngine.infGraph are retaining a lot of references. It might be
>> related to this report, but I'm not sure:
>>
>> https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%3c5319b4e0.4060...@gmail.com%3E
>
>
> If it is related to that then it is not a leak, it is "just" memory use.
>
> A leak implies that when you turn over data then unused internal state
> objects are not reclaimed. Are you continuously adding and deleting data? If
> so then the delete should release the whole of the RETEEngine state and
> start over. If that isn't happening then that's a bug but you could work
> around with an explicit reset() or even delete and recreate your InfGraph at
> that stage. A delete loses all the state anyway.
>
>> The suggestion was to use backward rules instead of forward rules.
>> I have read the following:
>> https://jena.apache.org/documentation/inference/#rules
>>
>> But still I fail to understand in which situations backward rules
>> can/should be used instead of forward rules?
>
>
> Forward rules are generally faster because they keep all that partially
> matched state. So if you have stable data or just add triples monotonically,
> and have a lot of queries, then generally use forward rules for performance.
>
> Backward rules (without tabling) keep no state so there's less memory
> overhead and no cost for delete but they are slow and have to redo the work
> for every query.
>
> Strictly the performance trade-off is a bit more subtle than that. Forward
> rules will try to work out all the entailments whereas backward rules are
> just responding to specific queries. So if your queries only touch a small
> part of the possible space then backward rules could be more efficient.
> However in practice RDF rules seem to involve a lot of unground terms and lots
> of rules match nearly every query.
>
> Tabling allows you to selectively cache certain predicates which can enable
> you to get more reasonable performance while keeping memory use under
> control. You can also do some tuning of how the rules execute by testing if
> variables are bound or not and using different clause orderings for
> different query patterns.
>
>>  I guess simply replacing
>> -> with <- will not be enough?
>
>
> Unless you use non-monotonic predicates (which, sadly, you do) then that
> would be enough to get something working. In fact you don't even need to do
> that. If you create a pure backward reasoner instance (as opposed to the
> hybrid reasoner) it'll read forward syntax rules but treat them as backward.
>
>> The actual rules in question look like
>> this:
>>
>> [gp:    (?class rdf:type
>> <http://www.w3.org/2000/01/rdf-schema#Class>), (?class ?p ?o), (?p
>> rdf:type owl:AnnotationProperty), (?p rdfs:isDefinedBy
>> <http://graphity.org/gp#>), (?subClass rdfs:subClassOf ?class),
>> (?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
>> noValue(?subClass ?p) -> (?subClass ?p ?o) ]
>
>
> That's a horrible rule from the engine's point of view. The head is
> completely ungrounded so when running backwards then it will need to run for
> *every* triple pattern. [It also makes no sense to me as a use of
> owl:AnnotationProperty but whatever.] You could try it backwards but put the
> clauses in a more efficient order:
>
> (?subClass ?p ?o) <-
>      (?p rdf:type owl:AnnotationProperty),
>      (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
>      (?subClass rdfs:subClassOf ?class), (?class ?p ?o) .
>
> The rdf:type rdfs:Class constraints are pointless since those are implied by
> rdfs:subClassOf anyway. The noValue check is probably best avoided for both
> cases.
>
> Alternatively, depending on the nature of your space leak you could use
> hybrid rules:
>
>   (?p rdf:type owl:AnnotationProperty),
>   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>     ->
>       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>                              (?class ?p ?o) ]
>
> That way the forward engine is only looking at your annotations and the
> backward engine then has rules that have grounded predicates. You could also
> table those predicates:
>
>   (?p rdf:type owl:AnnotationProperty),
>   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>     ->
>       table(?p),
>       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>                               (?class ?p ?o) ]
>
>> [gcdm:  (?template rdf:type <http://graphity.org/gp#Template>),
>> (?template <http://graphity.org/gc#defaultMode> ?o), (?subClass
>> rdfs:subClassOf ?template), (?subClass rdf:type
>> <http://graphity.org/gp#Template>), noValue(?subClass
>> <http://graphity.org/gc#defaultMode>) -> (?subClass
>> <http://graphity.org/gc#defaultMode> ?o) ]
>> [gcsm:  (?template rdf:type <http://graphity.org/gp#Template>),
>> (?template <http://graphity.org/gc#supportedMode> ?supportedMode),
>> (?subClass rdfs:subClassOf ?template), (?subClass rdf:type
>> <http://graphity.org/gp#Template>) -> (?subClass
>> <http://graphity.org/gc#supportedMode> ?supportedMode) ]
>
>
> These two are more reasonable and could be used backwards or hybrid.
>
>> [rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]
>
>
> That would work backwards. Depending on the scale of your data you might
> want to table rdf:type for performance/space tradeoff.
>
>> Can these be rewritten as backward rules instead?
>
>
> Sure, the challenge is performance tuning as noted above.
>
>> Does it involve code changes, such as calling reset() etc?
>
> Shouldn't do.
>
> Dave
