sure, you do md5(SPO)md5(POS)md5(OPS) or something like md5(S) md5(P)md5(O) ...?
On Thu, May 15, 2014 at 9:23 AM, Andy Seaborne <[email protected]> wrote: > On 14/05/14 19:10, Yi Liao wrote: >> >> Hi Andy, >> >> Thanks for your answer. >> >> In you answer "Node ->(by calculation) 128 bit value", did you use >> MD5 hash for this step? Is MD5 collision resistant? I guess it is >> highly unlikely for two nodes to be hashed into the same value, so we >> might just take the risk? >> >> Thanks, Yi Liao > > > Hi there, > > Yes - it's MD5. > > 128 bits is a simply huge number. > > http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5 > > 2^64 different RDF terms. That's RDF terms (Nodes) - not triples. Triples > aren't id'ed by hash. > > If you are worried by using hashing and assuming uniqueness in 128 bits, > then there are a few others things to worry about: RAM corruption, even > in error-detecting RAM; ethernet, and indeed any data storage or transfer. > > And coding bugs :-) > > Andy > > > >> >> >> -----Original Message----- From: Andy Seaborne >> [mailto:[email protected]] Sent: Tuesday, November 26, 2013 3:39 PM To: >> [email protected] Subject: Re: how does the node to nodeId mapping >> work? >> >> On 26/11/13 19:47, Yi Liao wrote: >>> >>> Hi, >> >> >> Hi there, >> >>> >>> Can anybody explain to me how does Jena map node to nodeId? The >>> following is stated in >>> http://jena.apache.org/documentation/tdb/architecture.html >>> >>> >>> "The Node to NodeId mapping is based on hash of the Node (a 128 >>> bit MD5 hash - the length was found not to major performance >>> factor). >>> >>> The default storage of the node table is a sequential access file >>> for the NodeId to Node mapping and a B+Tree for the Node to NodeId >>> mapping." >>> >>> My understanding is that Jena hashes the node into a long integer, >> >> >> Node ->(by calculation) 128 bit value ->(by index) file offset >> >>> and somehow converts the hashed value into an address offset to >>> the node table, and the node information is stored at the address >>> offset in the node table. >> >> >> There is a hash to offset index. >> >> The NodeTable itself is heavily cached. >> >>> >>> Is my understanding correct? >> >> >> Yes! >> >>> How does Jena converts the hashed value into an address offset? How >>> is B+ tree used in this process? >> >> >> TDB uses a B+tree for the hash to address offset. While it only >> needed to be a pure key->value mapping, the B+Tree code is used as >> it's heavily tested. >> >> There is in the codebase an external hash table which is pure >> key->value. Using it did not make an observable difference (see teh >> cache) so using the B+Tree code was easy and it doesn't have the >> reallocate burstiness of the external hash table. >> >>> >>> Thanks! Yi Liao >>> >> Andy >> >> >> >> > -- --- Marco Neumann KONA
