On 03/11/11 13:19, Paolo Castagna wrote:

 From my experience with tdbloader3 and parallel processing, I'd say
that the fact that the current node ids (64 bits) are offsets into the
nodes.dat file is a big "impediment" to distributed/parallel processing.
Mainly because, whatever you do, you first need to build a node
dictionary, and that is not trivial to do in parallel.

Loading, I agree; but general distributed/parallel processing?

However, if we could generate a node id from an RDF node value with a
hash function (sufficiently big that the probability of a collision is
lower than that of being hit by an asteroid: 128 bits?), then tdbloader3
could be massively simplified and merging TDB indexes directly would
become trivial (as it is for Lucene indexes)... my life at work would be
so much simpler!
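
A minimal sketch of the idea in Java (names hypothetical; MD5 is used
here only because it is a readily available 128-bit digest, not a
recommendation):

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class HashNodeId {
        // Hash the canonical lexical form of a node (e.g. its
        // N-Triples rendering) to a fixed 128-bit id.
        public static byte[] nodeId(String canonicalForm) {
            try {
                MessageDigest md = MessageDigest.getInstance("MD5");
                return md.digest(canonicalForm.getBytes(StandardCharsets.UTF_8));
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e); // MD5 is mandated by the JDK
            }
        }

        public static void main(String[] args) {
            // Two loaders on different machines derive the same id for
            // the same node, with no shared dictionary to coordinate.
            System.out.printf("%032x%n",
                    new BigInteger(1, nodeId("<http://example/s>")));
        }
    }

The point is that the id becomes a pure function of the node, so
partitions can assign ids independently and indexes can be merged
without re-mapping ids.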

Are you going to try out using hashes as node ids in TDB?

It needs someone to actually try it out in an experimental branch. What
about the consequences of turning ids back into Nodes for results? With
offsets, id-to-Node is a direct read from nodes.dat; with a hash, it
needs a separate index (though that lookup could be done in parallel
with much of query evaluation).

The drawback of 128-bit node ids is that you might suddenly need to
double your RAM to achieve the same performance (to be proven and
verified with experiments). On the other hand, there are many other
useful things you can fit into 128 bits. For example, I am no longer
sure that an optimization such as the one proposed in JENA-144 is
possible without ensuring that all node values can be encoded in the
bits available in the node id:
https://issues.apache.org/jira/browse/JENA-144

Do the calculation on clash probabilities.

e.g.
doi:10.1.1.100.4934
doi:10.1.1.58.1011
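
A back-of-the-envelope version of that calculation, using the standard
birthday-bound approximation (the figures below are illustrative, not
measurements): for n distinct nodes and b-bit ids, the probability of
at least one collision is roughly

    p ~= n^2 / 2^(b+1)

With b = 128 and n = 10^12 nodes (a very large store):

    p ~= (10^12)^2 / 2^129 ~= 10^24 / 6.8x10^38 ~= 1.5x10^-15

which is comfortably in asteroid territory. With b = 64 the same n
gives n^2 / 2^65 ~= 10^24 / 3.7x10^19, far above 1, i.e. a collision is
effectively certain; that is why 64-bit hashes would not do.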

Re: JENA-144:

?? Use the same scheme as at present: section the id space into inline
values and, separately, hashes. Cost: one bit.
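
One way to read that suggestion (all names hypothetical, not TDB's
actual NodeId layout): reserve the top bit of the 128-bit id as a
discriminator, so inline-encoded values and hashed nodes live in
disjoint halves of the id space.

    public final class SectionedId {
        private static final long HASH_BIT = 0x8000_0000_0000_0000L;

        final long hi;  // top 64 bits, including the discriminator bit
        final long lo;  // bottom 64 bits

        private SectionedId(long hi, long lo) { this.hi = hi; this.lo = lo; }

        // Value encoded directly in the remaining 127 bits (JENA-144 style).
        static SectionedId inlineValue(long valueHi, long valueLo) {
            return new SectionedId(valueHi & ~HASH_BIT, valueLo);
        }

        // Id taken from a 128-bit hash, losing one bit to the discriminator.
        static SectionedId hashed(long hashHi, long hashLo) {
            return new SectionedId(hashHi | HASH_BIT, hashLo);
        }

        boolean isHash() { return (hi & HASH_BIT) != 0; }
    }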

        Andy
