Hi Andy,

> > Is it the case that the default inference engine of Jena requires all
> > triples to be in-memory? Is it not possible to do this on disk? If
> > this is so, what would be the fair way to benchmark the system?
>
> There are a couple of dimensions to think about:
>
> 1/ Do you want to test LUBM or more general data?
> 2/ What level of inference do you wish to test?
>
> (1) => For LUBM, there is no inference across universities so you can
> generate the data for one university, run the forward chain inference on it
> and move on to the next university knowing that no triples will be generated
> later that affect the university you have just processed (and so don't need
> to retain state for it).

At the moment we are going only for LUBM. In one month we will move on to more
complex benchmarks. However, our target is always limited expressivity,
specifically RDFS and OWL 2 QL inference, which don't require complex
reasoning. We would like to be as efficient as possible for those. Ideally,
we don't want the test case to rely on particular tricks at loading time; it
should be a generic one-shot procedure (if possible).

> (2) => Inference for LUBM only needs one data triple and access to the
> ontology to calculate the inferences. Once a triple has been processed, you
> can emit the inferred triples and move on. Again, no data-related state is
> needed.
>
> The Jena rules-based reasoner, which is RETE-based, is more powerful than is
> needed for RDFS or LUBM, including rules based on multiple data triples and
> retraction, but the cost is that it stores internal state in-memory scaling
> with the size of the data.
>
> There is also a stream-based forward chaining engine, riotcmd.infer, that
> keeps the RDFS schema in memory but not the state of the data so it uses a
> fixed amount of space and does not increase with data size.
>
> This is probably the best way to infer over LUBM at scale.
>
> This is exploiting the features of LUBM (you only need one university). I
> don't have figures, but I'd expect riotcmd.infer to be faster as it's less
> general.
>
> The flow is:
>
> infer --rdfs=VOCAB DATA | tdbloader2 --loc DB
>
> on a 64bit system. Linux is faster than Windows.
>
> (tdbloader2 only runs on linux currently - Paolo has a pure java version on
> github)

This is great info! It sounds exactly like what we are looking for. We'll
spend some time studying it, and if there are any questions I'll get back
here. We had no idea this existed.

Thank you very much for the advice and the info, Andy.

Best regards,
Mariano
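
For anyone following the thread, the "one data triple plus the schema" idea
described above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not how riotcmd.infer is implemented:
triples are plain (subject, predicate, object) string tuples, the names
Schema, infer and infer_stream are hypothetical, only the subClassOf,
subPropertyOf, domain and range rules are covered, and range is applied
without checking whether the object is a literal.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def _closure(edges):
    """Map each node to the set of all its ancestors (transitive closure)."""
    closed = {}
    for start in edges:
        seen, stack = set(), list(edges[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(edges.get(node, ()))
        closed[start] = seen
    return closed


class Schema:
    """Holds the vocabulary in memory; its size is independent of the data."""

    def __init__(self, schema_triples):
        sub_c, sub_p = defaultdict(set), defaultdict(set)
        self.domain, self.range = defaultdict(set), defaultdict(set)
        for s, p, o in schema_triples:
            if p == SUBCLASS:
                sub_c[s].add(o)
            elif p == SUBPROP:
                sub_p[s].add(o)
            elif p == DOMAIN:
                self.domain[s].add(o)
            elif p == RANGE:
                self.range[s].add(o)
        self.super_classes = _closure(sub_c)
        self.super_props = _closure(sub_p)


def infer(schema, triple):
    """Return the triple plus everything it entails, using only the schema."""
    s, p, o = triple
    out = {triple}
    typed = set()  # (resource, class) pairs to expand through subClassOf
    for q in {p} | schema.super_props.get(p, set()):
        out.add((s, q, o))
        typed.update((s, c) for c in schema.domain.get(q, ()))
        typed.update((o, c) for c in schema.range.get(q, ()))
    if p == RDF_TYPE:
        typed.add((s, o))
    for resource, cls in typed:
        out.add((resource, RDF_TYPE, cls))
        for sup in schema.super_classes.get(cls, ()):
            out.add((resource, RDF_TYPE, sup))
    return out


def infer_stream(schema, triples):
    """Process triples one at a time; no per-triple state is retained."""
    for triple in triples:
        yield from sorted(infer(schema, triple))
```

Because nothing is remembered between data triples, memory use depends only
on the schema. The trade-off is that the same inferred triple may be emitted
more than once, so duplicates are left to whatever consumes the stream.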
