On Tue, 2011-09-20 at 17:22 +0000, David Jordan wrote: 
> I guess I had the wrong impression of how TDB is implemented. I thought on 64 
> bit environments that some of the files were memory-mapped directly into the 
> virtual memory space of the application process.

They are memory mapped but that doesn't mean they are all in memory
unless you have enough memory available, have a cooperative OS, and have
set appropriate options such as -XX:MaxDirectMemorySize depending on
your JVM.

> A few more questions:
> 1. What I am seeing is that there is a LONG wait the first time I access the 
> OntModel. It seems that all of the reasoning/inferencing work is being done 
> in bulk at that time, because after that first call, things are fast.

Actually not quite. In the rules there is a split between forward (up
front, eager) inference and backward (on demand) inference, though the
split errs towards forward inference since that makes post-prepare
queries more predictable.

You can create custom rule sets using entirely backward rules.
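For illustration, here's a minimal sketch of a purely backward rule set
(the rule and the data model are made up; package names are for Jena 2.x):

import java.util.List;

import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner;
import com.hp.hpl.jena.reasoner.rulesys.Rule;

public class BackwardOnlyExample {
    public static void main(String[] args) {
        Model data = ModelFactory.createDefaultModel();   // your base data here

        // Backward rules are written "conclusion <- premises" and only fire
        // when a query needs them, so there is no big up-front prepare() cost.
        List<Rule> rules = Rule.parseRules(
            "[subClass: (?a rdf:type ?c) <- (?a rdf:type ?b), (?b rdfs:subClassOf ?c)]");

        GenericRuleReasoner reasoner = new GenericRuleReasoner(rules);
        reasoner.setMode(GenericRuleReasoner.BACKWARD);    // pure backward chaining

        InfModel inf = ModelFactory.createInfModel(reasoner, data);
        // Queries against inf now trigger on-demand (backward) inference.
    }
}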

> Is this upfront all-at-once reasoning an aspect of all OWL reasoners, i.e. it 
> is a necessity of the OWL language itself
> or is this just specific to the built-in Jena reasoners?

It's not a necessity of the language; there are lots of ways you can
perform the inference, and different trade-offs are possible.

> Are there third party Jena reasoners that do more of a "lazy evaluation" of 
> entailments?

C&P's new Stardog store does query time reasoning I believe. Certainly
for OWL QL.

Virtuoso can support some backward rules, but I don't know about its
OWL support.

> Or are there reasoners that do it upfront, but are much faster than the 
> built-in reasoners?

For DL, Pellet can outperform the built-in OWL_MICRO for complex
ontologies, though the reverse can also be true in simpler cases where
OWL_MICRO's coverage is sufficient.

There is also BigOWLIM, though I don't know the state of its Jena
support.
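In case it helps, the usual way to plug Pellet in is through its Jena
adapter, roughly like this (a sketch based on Pellet 2.x; check the
package and constant names against the version you have):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

import org.mindswap.pellet.jena.PelletReasonerFactory;

public class ReasonerComparison {
    public static void main(String[] args) {
        Model base = ModelFactory.createDefaultModel();   // your ontology + instance data

        // Built-in rule reasoner: limited OWL coverage, often enough for
        // simpler ontologies.
        OntModel micro = ModelFactory.createOntologyModel(
                OntModelSpec.OWL_MEM_MICRO_RULE_INF, base);

        // Pellet DL reasoner attached through the OntModelSpec its Jena
        // adapter provides.
        OntModel pellet = ModelFactory.createOntologyModel(
                PelletReasonerFactory.THE_SPEC, base);
    }
}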

> 2. The ICD9 ontology is a fully self-contained ontology. It does not depend 
> on anything outside of itself.
> I put this in its own model, I also created an OntModel of it, used writeAll 
> to output the OntModel, then placed this in a new model in Jena. I did this 
> because I thought it would speed things up considerably. There should not be 
> any additional reasoning that needs to be done on this particular model. It 
> is read-only and fully "reasoned". 

"read only" isn't meaningful or relevant here, the reasoners don't
modify the original data.

> It would be real nice if there was a performance hint that could be given to 
> the OntModel that this particular ICD9 model is fully reasoned, 
> self-contained and requires no further reasoning. I understand this is not 
> currently supported, but I am suggesting that this may be a very useful 
> feature for improving performance.

I can see the attraction, but it's not obvious how to implement it :(

> We will eventually be including many other similar biomedical ontologies that 
> are read-only, fully self-contained, that we can generate a reasoned model 
> like I have done for ICD9.

If your aim is to reason over patient data with those ontologies as
background, then you may want a custom rule set: perform just the
inferences you want, using the precomputed closures, rather than doing
complete OWL inference over the entire merged ontology set plus instance
data. That's an option that has worked for me in the past, though I don't
have an automated way of setting such a thing up.
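As a rough sketch of what I mean (the file names, prefixes and the rule
itself are invented for illustration):

import java.util.List;

import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner;
import com.hp.hpl.jena.reasoner.rulesys.Rule;
import com.hp.hpl.jena.util.FileManager;

public class CohortRulesExample {
    public static void main(String[] args) {
        // Precomputed ICD9 closure plus patient data, no OWL reasoner attached.
        Model merged = ModelFactory.createDefaultModel();
        merged.add(FileManager.get().loadModel("file:icd9-closure.ttl"));
        merged.add(FileManager.get().loadModel("file:patients.ttl"));

        // cohort.rules would hold just the entailments you care about, with
        // @prefix lines declaring pat: and ex:, e.g.
        //   [cohort1: (?p pat:hasDiagnosis ?d), (?d rdf:type ?c),
        //             (?c rdfs:subClassOf ex:Prostate_Cancer)
        //             -> (?p rdf:type ex:Cohort1)]
        List<Rule> rules = Rule.rulesFromURL("file:cohort.rules");
        GenericRuleReasoner reasoner = new GenericRuleReasoner(rules);
        reasoner.setMode(GenericRuleReasoner.HYBRID);   // mix of forward and backward rules

        InfModel inf = ModelFactory.createInfModel(reasoner, merged);
        // Query inf as usual; only your chosen rules run, over the
        // already-materialized background closures.
    }
}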

Dave


> -----Original Message-----
> From: Dave Reynolds [mailto:[email protected]] 
> Sent: Tuesday, September 20, 2011 9:23 AM
> To: [email protected]
> Subject: RE: benchmarking
> 
> Hi,
> 
> On Mon, 2011-09-19 at 18:16 +0000, David Jordan wrote: 
> > One question is what level of inferencing is really necessary for things 
> > like the following:
> > It is not very clear to me yet which OWL constructs require particular 
> > inferencing levels.
> > 
> > :Prostate_Cancer a owl:Class ;
> >     owl:unionOf (
> >             HOM_ICD9:HOM_ICD_10566  # V10.46  These identify ICD9 codes for 
> > Prostate Cancer
> >             HOM_ICD9:HOM_ICD_1767   # 233.4     which exist in a large 
> > class hierarchy
> >             HOM_ICD9:HOM_ICD_1343   # 185
> >     ) .
> > 
> > :Cohort1 a owl:Class ;
> >     owl:equivalentClass [
> >             a owl:Restriction ;
> >             owl:onProperty patient:hasDiagnosis ;  # which patients have a 
> > diagnosis associated with prostate cancer?
> >             owl:someValuesFrom :Prostate_Cancer
> >     ] .
> > 
> > Except for taking the ICD9 class hierarchy into account, this is not really 
> > much more than a simple database query.
> 
> Actually it is a lot more than a database query. For example, if you now 
> assert (:eg a :Cohort1) then a DL reasoner will "know" that :eg has a 
> diagnosis which is within one of the three members of the :Prostate_Cancer 
> union but not which one.  If you then add additional information that 
> constrains some of the other cases you may be able to determine which ICD9 
> code it is. By using owl:equivalentClass you are enabling reasoning in either 
> direction across that equivalence, including reasoning by cases.
> 
> To determine the level of inference required you either have to assume 
> complete inference and profile the language (DL, EL, QL etc) or you need to 
> *also* specify what specific queries you might want to ask.
> 
> > The nice aspect of doing this in OWL is that we can define these sets, like 
> > :Prostate_Cancer and :Cohort1, and then ask other questions of these sets.
> 
> Agreed.
> 
> > I thought that with TDB running on a 64 bit Linux box, doing memory mapped 
> > I/O, that TDB could efficiently pull everything into memory quickly, 
> > avoiding doing lots of fine grained SQL calls to a MySQL server.
> 
> Think of it as more like a halfway house. Each query to TDB involves walking 
> over the appropriate B-tree indexes, just like in a database.
> Given enough memory and an operating system good enough at disc block caching 
> you might end up with everything, or near everything, paged into memory, but 
> that is not guaranteed. Lots of factors come into play.
> 
> > I did use writeAll for writing the OntModel.
> > 
> > Relative to your suggestion of
> > (1) Precompute all inferences, store those, then at runtime work with plain 
> > (no inference at all) models over that stored closure.
> > 
> > Would I need to do this for EVERYTHING, including the declarations above 
> > for Prostate_Cancer and Cohort1?
> 
> Yes. Essentially create an in-memory model containing everything you want to 
> reason over. Then either materialize everything or, if you know the query 
> patterns you are interested in, then ask those queries and materialize the 
> results.
> 
> Alternatively you may want to consider the commercial triple stores that 
> offer inference at scale with Jena compatibility.
> 
> Dave
> 
> > 
> > 
> > -----Original Message-----
> > From: Dave Reynolds [mailto:[email protected]]
> > Sent: Monday, September 19, 2011 11:39 AM
> > To: [email protected]
> > Subject: Re: benchmarking
> > 
> > Hi,
> > 
> > On Mon, 2011-09-19 at 15:04 +0000, David Jordan wrote: 
> > > I have switched over from SDB to TDB to see if I can get better performance.
> > > In the following, Database is a class of mine that insulates the code 
> > > from knowing whether it is SDB or TDB.
> > > 
> > > I do the following, which combines 2 models I have stored in TDB and then 
> > > reads a third small model from a file that contains some classes I want 
> > > to “test”. I then have some code that times how long it takes to get a 
> > > particular class and list its instances.
> > > 
> > > Model model1 = Database.getICD9inferredModel();
> > > Model model2 = Database.getPatientModel();
> > > OntModel omodel = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, model1);
> > > omodel.add(model2);
> > 
> > That is running a full rule reasoner over the TDB model. As I've mentioned 
> > before, the rule inference engines store everything in memory, so that 
> > doesn't give you any scaling over simply loading the file into memory and 
> > doing inference over that; it just goes very, very slowly!
> > 
> > > InputStream in = FileManager.get().open(fileName);
> > > omodel.read(in, baseName, "TURTLE");
> > > 
> > > OntClass oclass = omodel.getOntClass(line);   // access the class
> > > 
> > > On the first call to getOntClass, I have been seeing a VERY long wait 
> > > (around an hour) before I get a response.
> > > Then after that first call, subsequent calls are much faster.
> > > But I started looking at the CPU utilization. After the call to 
> > > getOntClass, CPU utilization is very close to 0.
> > > Is this to be expected?
> > 
> > Seems plausible; the inference engines are in effect issuing a huge number 
> > of triple queries to TDB, which will spend most of its time waiting for the 
> > disk.
> > 
> > If you really need to run live inference over the entire dataset then load 
> > it into a memory model first, then construct your inference model over that.
> > 
> > > Is there any form of tracing/logging that can be turned on to determine 
> > > what (if anything) is happening?
> > > 
> > > Is there something I am doing wrong in setting up my models?
> > > For the ICD9 ontology I am using, I had read in the OWL data, created an 
> > > OntModel with it, wrote this OntModel data out.
> > > Then I store the data from the OntModel into TDB, so it supposedly does 
> > > not have to do as much work at runtime.
> > 
> > As Chris says, make sure you are using writeAll, not just plain write, to 
> > store the OntModel.
> > 
> > That aside, this doesn't necessarily save you much work, because the rules 
> > have to run anyway; they just don't discover much that is new.
> > 
> > In the absence of a highly scalable inference solution for Jena (something 
> > which can't be done without resourcing), your two good options are:
> > 
> > (1) Precompute all inferences, store those, then at runtime work with plain 
> > (no inference at all) models over that stored closure.
> > 
> > (2) Load all the data into memory and run inference over that.
> > 
> > Dave
> > 
> > 
> > 