I guess I had the wrong impression of how TDB is implemented. I thought that on 64-bit environments some of the files were memory-mapped directly into the virtual memory space of the application process.
I'll be having a conversation tomorrow with the architect of AllegroGraph. A few more questions:

1. What I am seeing is that there is a LONG wait the first time I access the OntModel. It seems that all of the reasoning/inferencing work is being done in bulk at that time, because after that first call things are fast. Is this upfront, all-at-once reasoning an aspect of all OWL reasoners, i.e. a necessity of the OWL language itself, or is it specific to the built-in Jena reasoners? Are there third-party Jena reasoners that do more of a "lazy evaluation" of entailments? Or are there reasoners that do it all upfront but are much faster than the built-in ones?

2. The ICD9 ontology is fully self-contained; it does not depend on anything outside of itself. I put it in its own model, also created an OntModel of it, used writeAll to output the OntModel, and then placed the result in a new model in Jena (a rough sketch of this step is included below). I did this because I thought it would speed things up considerably. There should not be any additional reasoning that needs to be done on this particular model; it is read-only and fully "reasoned". It would be really nice if there were a performance hint that could be given to the OntModel saying that this particular ICD9 model is fully reasoned, self-contained, and requires no further reasoning. I understand this is not currently supported, but I am suggesting that it could be a very useful feature for improving performance. We will eventually be including many other similar biomedical ontologies that are read-only and fully self-contained, for which we can generate a reasoned model as I have done for ICD9.

-----Original Message-----
From: Dave Reynolds [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 9:23 AM
To: [email protected]
Subject: RE: benchmarking

Hi,

On Mon, 2011-09-19 at 18:16 +0000, David Jordan wrote:
> One question is what level of inferencing is really necessary for things like the following:
> It is not very clear to me yet which OWL constructs require particular inferencing levels.
>
> :Prostate_Cancer a owl:Class ;
>     owl:unionOf (
>         HOM_ICD9:HOM_ICD_10566   # V10.46  These identify ICD9 codes for
>         HOM_ICD9:HOM_ICD_1767    # 233.4   Prostate Cancer, which exist
>         HOM_ICD9:HOM_ICD_1343    # 185     in a large class hierarchy
>     ) .
>
> :Cohort1 a owl:Class ;
>     owl:equivalentClass [
>         a owl:Restriction ;
>         owl:onProperty patient:hasDiagnosis ;   # which patients have a diagnosis
>         owl:someValuesFrom :Prostate_Cancer     # associated with prostate cancer?
>     ] .
>
> Except for taking the ICD9 class hierarchy into account, this is not really much more than a simple database query.

Actually it is a lot more than a database query. For example, if you now assert (:eg a :Cohort1) then a DL reasoner will "know" that :eg has a diagnosis which is within one of the three members of the :Prostate_Cancer union, but not which one. If you then add additional information that constrains some of the other cases you may be able to determine which ICD9 code it is. By using owl:equivalentClass you are enabling reasoning in either direction across that equivalence, including reasoning by cases.

To determine the level of inference required you either have to assume complete inference and profile the language (DL, EL, QL etc.) or you need to *also* specify what specific queries you might want to ask.

> The nice aspect of doing this in OWL is that we can define these sets, like :Prostate_Cancer and :Cohort1, and then ask other questions of these sets.

Agreed.
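(Coming back to question 2 above: for concreteness, the precompute step I described looks roughly like the following. This is only a sketch against the Jena 2.x API; the file names "icd9.owl" and "icd9-closure.ttl" are placeholders from my setup, and OWL_MEM_MICRO_RULE_INF is simply the profile I happen to be using.)

import java.io.FileOutputStream;
import java.io.OutputStream;

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.FileManager;

public class MaterializeICD9 {
    public static void main(String[] args) throws Exception {
        // Load the asserted-only ICD9 ontology into memory.
        Model raw = FileManager.get().loadModel("icd9.owl");

        // Run the rule reasoner once, offline, over the in-memory model.
        OntModel inf = ModelFactory.createOntologyModel(
                OntModelSpec.OWL_MEM_MICRO_RULE_INF, raw);

        // writeAll (not plain write) serializes the base triples plus the
        // entailments, i.e. the fully "reasoned" closure.
        OutputStream out = new FileOutputStream("icd9-closure.ttl");
        inf.writeAll(out, "TURTLE", null);
        out.close();
    }
}

The closure file can then be loaded into a plain model, or bulk-loaded into TDB, with no reasoner attached at runtime.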
> I thought that with TDB running on a 64-bit Linux box, doing memory-mapped I/O, that TDB could efficiently pull everything into memory quickly, avoiding doing lots of fine-grained SQL calls to a MySQL server.

Think of it as more like a halfway house. Each query to TDB involves walking over the appropriate B-tree indexes, just like in a database. Given enough memory and an operating system good enough at disc block caching you might end up with everything, or near everything, paged into memory, but that is not guaranteed. Lots of factors come into play.

> I did use writeAll for writing the OntModel.
>
> Relative to your suggestion of
> (1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.
>
> Would I need to do this for EVERYTHING, including the declarations above for Prostate_Cancer and Cohort1?

Yes. Essentially create an in-memory model containing everything you want to reason over. Then either materialize everything or, if you know the query patterns you are interested in, ask those queries and materialize the results.

Alternatively you may want to consider the commercial triple stores that offer inference at scale with Jena compatibility.

Dave

> -----Original Message-----
> From: Dave Reynolds [mailto:[email protected]]
> Sent: Monday, September 19, 2011 11:39 AM
> To: [email protected]
> Subject: Re: benchmarking
>
> Hi,
>
> On Mon, 2011-09-19 at 15:04 +0000, David Jordan wrote:
> > I have switched over from SDB to TDB to see if I can get better performance.
> > In the following, Database is a class of mine that insulates the code from knowing whether it is SDB or TDB.
> >
> > I do the following, which combines 2 models I have stored in TDB and then reads a third small model from a file that contains some classes I want to "test". I then have some code that times how long it takes to get a particular class and list its instances.
> >
> > Model model1 = Database.getICD9inferredModel();
> > Model model2 = Database.getPatientModel();
> > OntModel omodel = ModelFactory.createOntologyModel(
> >         OntModelSpec.OWL_MEM_MICRO_RULE_INF, model1);
> > omodel.add(model2);
>
> That is running a full rule reasoner over the TDB model. As I've mentioned before, the rule inference engines store everything in memory, so that doesn't give you any scaling over simply loading the file into memory and doing inference over that, it just goes very, very slowly!
>
> > InputStream in = FileManager.get().open(fileName);
> > omodel.read(in, baseName, "TURTLE");
> >
> > OntClass oclass = omodel.getOntClass(line);   // access the class
> >
> > On the first call to getOntClass, I have been seeing a VERY long wait (around an hour) before I get a response.
> > Then after that first call, subsequent calls are much faster.
> > But I started looking at the CPU utilization. After the call to getOntClass, CPU utilization is very close to 0.
> > Is this to be expected?
>
> Seems plausible, the inference engines are in effect doing a huge number of triple queries to TDB, which will spend most of its time waiting for the disk.
>
> If you really need to run live inference over the entire dataset then load it into a memory model first, then construct your inference model over that.
>
> > Is there any form of tracing/logging that can be turned on to determine what (if anything) is happening?
> >
> > Is there something I am doing wrong in setting up my models?
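[A sketch of the "load it into a memory model first" suggestion, applied to the setup code quoted above. This is an outline only: the two TDB-backed models, fileName and baseName are passed in as parameters to stand for the Database helper calls and variables in the quoted snippet, and the package names are those of the Jena 2.x line.]

import java.io.InputStream;

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.FileManager;

public class InMemoryInference {

    static OntModel buildInferenceModel(Model tdbIcd9, Model tdbPatients,
                                        String fileName, String baseName) {
        // Copy everything out of TDB into a single in-memory model first.
        Model mem = ModelFactory.createDefaultModel();
        mem.add(tdbIcd9);
        mem.add(tdbPatients);

        // Attach the rule reasoner to the in-memory copy, so the engine's
        // many fine-grained triple lookups never have to wait on the disk.
        OntModel omodel = ModelFactory.createOntologyModel(
                OntModelSpec.OWL_MEM_MICRO_RULE_INF, mem);

        // Read the small "test" model on top, exactly as before.
        InputStream in = FileManager.get().open(fileName);
        omodel.read(in, baseName, "TURTLE");
        return omodel;
    }
}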
> > For the ICD9 ontology I am using, I had read in the OWL data, created an OntModel with it, and wrote this OntModel data out.
> > Then I store the data from the OntModel into TDB, so it supposedly does not have to do as much work at runtime.
>
> As Chris says, make sure you are using writeAll, not just plain write, to store the OntModel.
>
> That aside, this doesn't necessarily save you much work, because the rules have to run anyway; they are just not discovering much that is new.
>
> In the absence of a highly scalable inference solution for Jena (something which can't be done without resourcing) your two good options are:
>
> (1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.
>
> (2) Load all the data into memory and run inference over that.
>
> Dave
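For completeness, the runtime half of option (1) would look roughly like the following once the precomputed closure has been bulk-loaded into a TDB store. This is a sketch only: the TDB directory, the use of the default graph, and the :Cohort1 URI are placeholders, and OWL_MEM here just means "OntModel API with no reasoner attached".

import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;

public class QueryStoredClosure {
    public static void main(String[] args) {
        // Open the TDB store that already holds the materialized closure.
        Dataset dataset = TDBFactory.createDataset("/data/tdb-closure");
        Model stored = dataset.getDefaultModel();

        // OWL_MEM attaches no reasoner: the OntModel API still works, but
        // every answer comes straight from the stored (pre-inferred) triples.
        OntModel omodel = ModelFactory.createOntologyModel(
                OntModelSpec.OWL_MEM, stored);

        OntClass cohort = omodel.getOntClass("http://example.org/Cohort1");
        if (cohort != null) {
            ExtendedIterator<?> it = cohort.listInstances();
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        }
        dataset.close();
    }
}

With no reasoner attached, getOntClass and listInstances answer directly from the stored triples, so the long first call should disappear; the cost moves into the offline materialization step instead.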
