On 05/12/11 13:58, Mariano Rodriguez wrote:
> In this first round of benchmarks we want to avoid any Hadoop or
> map-reduce approaches. The reason is that we want raw numbers for the
> core reasoning techniques, in this case forward chaining vs. backward
> chaining and our technique, called semantic indexes, which is a bit like
> backward chaining but with a little extra work at loading time. We want
> to avoid measuring benefits that come from the architecture of the system
> (map-reduce, for example) because the technique that we are testing can
> also be extended with map-reduce and a parallel architecture.

In the past, I've experimented with forward-chaining the schema and
doing one step of backward chaining in the query.
Merely forward chaining everything (even just the useful subclass,
subproperty, domain and range as is done by riotcmd.infer) causes triple
bloat and, at scale, the bloat can reduce the effectiveness of disk caching.
But pure backward chaining has a horrible access pattern on the data
(walking arbitrary-length paths for every candidate):
?x rdf:type/rdfs:subClassOf* :type
?x ?p ?v . ?p rdfs:subPropertyOf* :property
(obviously you don't have to do it this way - this is just the naive way
and it can be written in SPARQL 1.1 - it's even in the spec).
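
To make the access pattern concrete, here is a minimal sketch of the naive
evaluation of the first pattern over an in-memory list of triples. It is not
how any particular store implements it; the function name, the tuple
representation and the example data are all made up for illustration.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def instances_backward(triples, target):
    """Naively evaluate  ?x rdf:type/rdfs:subClassOf* target.

    Hypothetical helper: triples are plain (s, p, o) string tuples.
    """
    supers = defaultdict(set)  # class -> direct superclasses
    typed = []                 # (instance, asserted class)
    for s, p, o in triples:
        if p == SUBCLASS:
            supers[s].add(o)
        elif p == RDF_TYPE:
            typed.append((s, o))

    result = set()
    for x, cls in typed:
        # One path walk per asserted type: this is the costly part.
        # Zero steps also match, because of the * operator.
        stack, seen = [cls], set()
        while stack:
            c = stack.pop()
            if c == target:
                result.add(x)
                break
            if c in seen:
                continue  # guard against cycles in the schema
            seen.add(c)
            stack.extend(supers[c])
    return result


# Made-up example data.
EXAMPLE = [
    (":GradStudent", SUBCLASS, ":Student"),
    (":Student", SUBCLASS, ":Person"),
    (":alice", RDF_TYPE, ":GradStudent"),
    (":bob", RDF_TYPE, ":Person"),
    (":course1", RDF_TYPE, ":Course"),
]

people = instances_backward(EXAMPLE, ":Person")
```

Every rdf:type triple triggers its own walk up the class hierarchy, which on
disk means a series of dependent lookups per candidate.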
Assuming the schema is small compared to the data and fixed,
preprocessing the schema to have a single table of (type, supertype)
with the transitive closure turns it into two patterns:
?x rdf:type ?var . table(?var, :type)
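
A rough sketch of that preprocessing, assuming the schema fits in memory.
The names are hypothetical, and I've made the table reflexive (each class is
its own supertype) so that directly asserted types still match, which is
what the * in the path gave us before.

```python
def subclass_closure(schema_pairs):
    """Build the (type, supertype) table: reflexive-transitive closure.

    schema_pairs: iterable of (subclass, superclass) pairs.
    Simple fixpoint, fine for a small schema; not tuned for large ones.
    """
    pairs = set(schema_pairs)
    classes = {c for pair in pairs for c in pair}
    closure = {(c, c) for c in classes} | pairs
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def instances_via_table(type_pairs, table, target):
    """Evaluate  ?x rdf:type ?var . table(?var, target)  as a join."""
    subtypes = {t for t, sup in table if sup == target}
    subtypes.add(target)  # target may not appear in the schema at all
    return {x for x, cls in type_pairs if cls in subtypes}


# Made-up schema and data.
SCHEMA = [(":GradStudent", ":Student"), (":Student", ":Person")]
TYPES = [(":alice", ":GradStudent"), (":bob", ":Person"), (":course1", ":Course")]

TABLE = subclass_closure(SCHEMA)
people = instances_via_table(TYPES, TABLE, ":Person")
```

The path walking happens once, over the schema, instead of once per
candidate in the data. The query itself is a lookup in the small table plus
an ordinary join on rdf:type.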
LUBM is unusual in several ways. All the systems I know of load faster on
LUBM than on any other benchmark because it has a low node-to-triple ratio
(i.e. it is very interconnected within each university). RDFS-level
inference increases this effect because inference can add triples but
not create new RDF terms. Loading a node means storing the bytes of its
URI or literal, which is extra work.
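
A small illustration of the ratio argument, on toy data rather than LUBM.
It forward-chains only the subclass rule (so it is much less than what
riotcmd.infer does) and counts every distinct term in any position as a
node. The helper names are made up.

```python
def forward_chain_types(triples):
    """Materialize  x rdf:type C, C rdfs:subClassOf D  =>  x rdf:type D."""
    out = set(triples)
    changed = True
    while changed:
        changed = False
        sub = [(s, o) for s, p, o in out if p == "rdfs:subClassOf"]
        typ = [(s, o) for s, p, o in out if p == "rdf:type"]
        for x, c in typ:
            for a, b in sub:
                if a == c and (x, "rdf:type", b) not in out:
                    out.add((x, "rdf:type", b))
                    changed = True
    return out


def terms(triples):
    """Distinct RDF terms (subjects, predicates, objects)."""
    return {term for t in triples for term in t}


def node_triple_ratio(triples):
    return len(terms(triples)) / len(triples)


# Made-up example data.
DATA = {
    (":GradStudent", "rdfs:subClassOf", ":Student"),
    (":Student", "rdfs:subClassOf", ":Person"),
    (":alice", "rdf:type", ":GradStudent"),
    (":bob", "rdf:type", ":Person"),
    (":course1", "rdf:type", ":Course"),
}

INFERRED = forward_chain_types(DATA)
before = node_triple_ratio(DATA)
after = node_triple_ratio(INFERRED)
```

Here the triple count goes from 5 to 7 while the set of terms is unchanged,
so the ratio drops: more triples to index, but no new node bytes to store.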
It would be easy to add this schema table to TDB (the prototyping was for
SDB, where it's more important due to JDBC-isms) - doing it as part of the
more general property tables would be interesting.
TDB scales much better than SDB (load and query).
Andy