Hey again,

I have profiled the CPU time, and it seems that a lot of it (93.5% after some 22500 HTTP requests) is spent in the following methods:

com.hp.hpl.jena.rdf.model.ModelFactory.createInfModel (com.hp.hpl.jena.reasoner.Reasoner, com.hp.hpl.jena.rdf.model.Model, com.hp.hpl.jena.rdf.model.Model)
com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner.bindSchema (com.hp.hpl.jena.graph.Graph)
com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.prepare ()
com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.fastInit (com.hp.hpl.jena.reasoner.Finder)

Probably not so smart to create an InfModel with every request/response. But in my case it is created using HTTP response body and metadata only: Model from response body, and schema OntModel from headers metadata, so I'm not sure how it could be cached.

Here is the code:
https://github.com/AtomGraph/Processor/blob/master/src/main/java/org/graphity/processor/filter/response/HypermediaFilter.java#L107

I would appreciate suggestions on how to improve performance.

Martynas

On Tue, Jun 21, 2016 at 10:28 AM, Dave Reynolds <dave.e.reyno...@gmail.com> wrote:
> Hi Martynas,
>
> On 20/06/16 22:18, Martynas Jusevičius wrote:
>>
>> Hey,
>>
>> after using GenericRuleReasoner and InfModel more extensively, we
>> started experiencing memory leaks that eventually kill our webapp
>> because it runs out of heap space. Jena version is 2.11.0.
>>
>> After some profiling, it seems that RETEEngine.clauseIndex and/or
>> RETEEngine.infGraph are retaining a lot of references. It might be
>> related to this report, but I'm not sure:
>>
>> https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%3c5319b4e0.4060...@gmail.com%3E
>
>
> If it is related to that then it is not a leak it is "just" memory use.
>
> A leak implies that when you turn over data then unused internal state
> objects are not reclaimed. Are you continuously adding and deleting data? If
> so then the delete should release the whole of the RETEEngine state and
> start over.
> If that isn't happening then that's a bug but you could work
> around with an explicit reset() or even delete and recreate your InfGraph at
> that stage. A delete loses all the state anyway.
>
>> The suggestion was to use backward rules instead of forward rules.
>> I have read the following:
>> https://jena.apache.org/documentation/inference/#rules
>>
>> But still I fail to understand in which situations backward rules
>> can/should be used instead of forward rules?
>
>
> Forward rules are generally faster because they keep all that partially
> matched state. So if you have stable data or just add triples monotonically,
> and have a lot of queries, then generally use forward rules for performance.
>
> Backward rules (without tabling) keep no state so there's less memory
> overhead and no cost for delete but they are slow and have to redo the work
> for every query.
>
> Strictly, the performance trade-off is a bit more subtle than that. Forward
> rules will try to work out all the entailments whereas backward rules are
> just responding to specific queries. So if your queries only touch a small
> part of the possible space then backward rules could be more efficient.
> However, in practice RDF rules seem to involve a lot of unground terms and lots
> of rules match nearly every query.
>
> Tabling allows you to selectively cache certain predicates which can enable
> you to get more reasonable performance while keeping memory use under
> control. You can also do some tuning of how the rules execute by testing if
> variables are bound or not and using different clause orderings for
> different query patterns.
>
>> I guess simply replacing
>> -> with <- will not be enough?
>
>
> Unless you use non-monotonic predicates (which, sadly, you do) then that
> would be enough to get something working. In fact you don't even need to do
> that.
> If you create a pure backward reasoner instance (as opposed to the
> hybrid reasoner) it'll read forward syntax rules but treat them as backward.
>
>> The actual rules in question look like
>> this:
>>
>> [gp: (?class rdf:type
>> <http://www.w3.org/2000/01/rdf-schema#Class>), (?class ?p ?o), (?p
>> rdf:type owl:AnnotationProperty), (?p rdfs:isDefinedBy
>> <http://graphity.org/gp#>), (?subClass rdfs:subClassOf ?class),
>> (?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
>> noValue(?subClass ?p) -> (?subClass ?p ?o) ]
>
>
> That's a horrible rule from the engine's point of view. The head is
> completely ungrounded so when running backwards it will need to run for
> *every* triple pattern. [It also makes no sense to me as a use of
> owl:AnnotationProperty but whatever.] You could try it backwards but put the
> clauses in a more efficient order:
>
> (?subClass ?p ?o) <-
>     (?p rdf:type owl:AnnotationProperty),
>     (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
>     (?subClass rdfs:subClassOf ?class), (?class ?p ?o) .
>
> The rdf:type rdfs:Class constraints are pointless since those are implied by
> rdfs:subClassOf anyway. The noValue check is probably best avoided for both
> cases.
>
> Alternatively, depending on the nature of your space leak you could use
> hybrid rules:
>
> (?p rdf:type owl:AnnotationProperty),
> (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>     ->
>     [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>         (?class ?p ?o) ]
>
> That way the forward engine is only looking at your annotations and the
> backward engine then has rules that have grounded predicates.
> You could also table those predicates:
>
> (?p rdf:type owl:AnnotationProperty),
> (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>     ->
>     table(?p),
>     [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>         (?class ?p ?o) ]
>
>> [gcdm: (?template rdf:type <http://graphity.org/gp#Template>),
>> (?template <http://graphity.org/gc#defaultMode> ?o), (?subClass
>> rdfs:subClassOf ?template), (?subClass rdf:type
>> <http://graphity.org/gp#Template>), noValue(?subClass
>> <http://graphity.org/gc#defaultMode>) -> (?subClass
>> <http://graphity.org/gc#defaultMode> ?o) ]
>> [gcsm: (?template rdf:type <http://graphity.org/gp#Template>),
>> (?template <http://graphity.org/gc#supportedMode> ?supportedMode),
>> (?subClass rdfs:subClassOf ?template), (?subClass rdf:type
>> <http://graphity.org/gp#Template>) -> (?subClass
>> <http://graphity.org/gc#supportedMode> ?supportedMode) ]
>
>
> These two are more reasonable and could be used backwards or hybrid.
>
>> [rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]
>
>
> That would work backwards. Depending on the scale of your data you might
> want to table rdf:type for performance/space tradeoff.
>
>> Can these be rewritten as backward rules instead?
>
>
> Sure, the challenge is performance tuning as noted above.
>
>> Does it involve code changes, such as calling reset() etc?
>
> Shouldn't do.
>
> Dave
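
To make the forward/backward trade-off described in the quoted reply concrete, here is a minimal plain-Java sketch of the rdfs9 rule evaluated both ways. It deliberately does not use Jena and says nothing about how RETEEngine or the backward engine are implemented; the class name Rdfs9Sketch and all of its methods are hypothetical, chosen only for illustration. The forward variant derives every entailed rdf:type up front and holds on to the result, while the backward variant proves one goal on demand and keeps no state between queries.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration only: NOT Jena code. All names here are hypothetical.
final class Rdfs9Sketch {
    // Direct rdfs:subClassOf edges: subclass -> its direct superclasses.
    private final Map<String, Set<String>> superClasses = new HashMap<>();
    // Asserted rdf:type facts: instance -> asserted classes.
    private final Map<String, Set<String>> assertedTypes = new HashMap<>();

    void addSubClassOf(String sub, String sup) {
        superClasses.computeIfAbsent(sub, k -> new HashSet<>()).add(sup);
    }

    void addType(String instance, String cls) {
        assertedTypes.computeIfAbsent(instance, k -> new HashSet<>()).add(cls);
    }

    // Forward style: derive every entailed rdf:type for every instance up front.
    // The returned map is the state a forward engine would have to retain.
    Map<String, Set<String>> forwardClosure() {
        Map<String, Set<String>> closure = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : assertedTypes.entrySet()) {
            Set<String> all = new HashSet<>();
            Deque<String> todo = new ArrayDeque<>(e.getValue());
            while (!todo.isEmpty()) {
                String cls = todo.pop();
                if (all.add(cls)) { // only expand classes not seen before
                    todo.addAll(superClasses.getOrDefault(cls, Collections.<String>emptySet()));
                }
            }
            closure.put(e.getKey(), all);
        }
        return closure;
    }

    // Backward style: prove the single goal (instance rdf:type target) on demand.
    // Nothing is cached, so the work is redone for every query.
    boolean backwardHasType(String instance, String target) {
        return prove(instance, target, new HashSet<>());
    }

    private boolean prove(String instance, String target, Set<String> visited) {
        if (assertedTypes.getOrDefault(instance, Collections.<String>emptySet()).contains(target)) {
            return true; // the goal matches an asserted fact
        }
        // rdfs9 read backwards: (a type y) <- (x subClassOf y), (a type x)
        for (Map.Entry<String, Set<String>> e : superClasses.entrySet()) {
            String sub = e.getKey();
            // 'visited' guards against subClassOf cycles and repeated sub-goals
            if (e.getValue().contains(target) && visited.add(sub) && prove(instance, sub, visited)) {
                return true;
            }
        }
        return false;
    }
}
```

With Dog subClassOf Mammal, Mammal subClassOf Animal and rex typed as Dog, forwardClosure() holds all three types for rex whether or not anyone asks for them, whereas backwardHasType("rex", "Animal") walks the hierarchy only for that one question. Caching the answers for selected predicates would sit between the two, which is roughly the role tabling plays in the reply above.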