On 27/08/2021 21:09, Brandon Sara wrote:
I’ve finally tracked down the problem (at least at a high level). When using the Transitive Reasoner, there is a block of code which caches all sub class triples (https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/reasoner/transitiveReasoner/TransitiveEngine.java#L316-L326). Part of this code searches for all sub properties of `subClassOf` and begins caching triples for those sub-properties. In my situation, I’ve added `owl:equivalentClass` manually (since only `TransitiveReasoner` is being used) and manually made it a sub property of `subClassOf`. The data that I’m uploading right now has a lot of equivalent class triples (>300k). It seems, if I’m understanding the code correctly as I’ve been debugging it, that not only is the triple cached…but a traversal of many other triples occurs when the caching occurs for even a single triple, is that correct? This would explain why (1) it never seems to finish what it is doing and (2) the memory grows very, very large while doing it. I ran a single query last night and after more than 6 hours, 8 CPUs, and 20GB of RAM, it still never finished loading the cache. It seems as though the runtime of this could be exponential in nature.

Indeed it can be expensive. The transitive reasoner is doing a transitive reduction (finding direct links), not just a transitive closure. If I remember correctly, this is somewhere between quadratic and cubic (something like O(|V|(|V| + |E|)) in the best case). It uses a standard but rather old algorithm for this, but I think the algorithm is still polynomial, not exponential. However, (a) there could be some cases that throw it off and (b) at 20m records even quadratic would be high cost, and it's likely to be closer to a power of 2.5.
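
To make the closure/reduction distinction concrete, here is a minimal sketch in plain Python. It is not the algorithm Jena's transitive reasoner uses; the function names are made up for illustration, and the reduction step assumes the graph is acyclic. It runs one graph search per node, which is where a cost of roughly O(|V|(|V| + |E|)) comes from.

```python
from collections import defaultdict

def transitive_closure(edges):
    """Return every (a, b) pair where b is reachable from a."""
    succ = defaultdict(set)
    for a, b in edges:
        succ[a].add(b)
    closure = set()
    for start in list(succ):
        # One depth-first search per start node: O(|V| + |E|) each.
        stack = list(succ[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(succ.get(node, ()))
    return closure

def transitive_reduction(edges):
    """Return only the direct links (assumes an acyclic graph)."""
    reach = defaultdict(set)
    for a, b in transitive_closure(edges):
        reach[a].add(b)
    # An edge (a, b) is redundant if some other node c reachable
    # from a can itself reach b.
    return {
        (a, b)
        for a, b in set(edges)
        if not any(b in reach.get(c, ()) for c in reach[a] if c != b)
    }

# Hypothetical three-class hierarchy with one redundant assertion.
edges = [("A", "B"), ("B", "C"), ("A", "C")]
closure = transitive_closure(edges)
direct = transitive_reduction(edges)
```

Here `closure` contains all three pairs, while `direct` drops ("A", "C") because it is implied by the path through "B".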

My dataset is well over 20 million records (maybe even more, I still haven’t 
gotten a full count yet, but I know for a fact that it is well over 10 million 
and believe it to be well more than 20 million). Like I’ve mentioned before, 
there are basically no individuals in the dataset, it’s all ontology because it 
is health care industry coding systems and classifications.

Another strange thing, which I’ve mentioned before, is that I don’t have any of 
these issues when I initially load the data, I can load everything with just 4 
GB of RAM, it loads in a reasonable amount of time, and I can submit queries of 
pretty much any complexity after the upload is complete with no issues, and 
they are very fast too. This only occurs when the server has been restarted and 
the first query that actually pulls something from the dataset (I.E. not an 
empty query) is submitted (no matter how simple or complex that query may be).

Can't explain that.

Is this a bug or should `owl:equivalentClass` work without my own manual 
specification of it?

Depends what you mean by "work".

I'm afraid I've not been following your earlier posts so I'm not clear on what you are trying to achieve.

If you want to deduce new subClassOf (and perhaps equivalentClass) relationships from a mix of subClassOf and equivalentClass assertions just using the transitive reasoner, with no rule engine, then you would have to insert that manual relationship.
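
As a rough illustration of why that manual relationship matters, the sketch below uses plain Python rather than the Jena API; the helper names and the `ex:` classes are hypothetical. It treats any predicate declared as a sub-property of `rdfs:subClassOf` as contributing edges to the same graph before following links transitively.

```python
SUBCLASS = "rdfs:subClassOf"
EQUIV = "owl:equivalentClass"

def subclass_edges(triples, sub_properties):
    """Collect edges asserted via subClassOf or a declared sub-property of it."""
    predicates = {SUBCLASS} | set(sub_properties)
    return {(s, o) for s, p, o in triples if p in predicates}

def reachable(edges, start):
    """Return every node reachable from start by following edges."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    seen, stack = set(), [start]
    while stack:
        for nxt in succ.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

triples = [
    ("ex:A", EQUIV, "ex:B"),
    ("ex:B", SUBCLASS, "ex:C"),
]

# Without the manual link, the equivalentClass triple is ignored.
without_link = reachable(subclass_edges(triples, []), "ex:A")
# With equivalentClass declared a sub-property of subClassOf, it joins the graph.
with_link = reachable(subclass_edges(triples, [EQUIV]), "ex:A")
```

In this toy case `without_link` is empty, while `with_link` contains both `ex:B` and `ex:C`. Note that a sub-property declaration only adds edges in the asserted direction, so this sketch does not model the symmetry of `owl:equivalentClass`.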

If you just want to store and retrieve the equivalentClass relationships with no inference then clearly you wouldn't need that extra assertion.

Dave
