Re: [DISCUSS] Rebase CouchDB on top of FoundationDB

Robert Samuel Newson Thu, 24 Jan 2019 09:28:23 -0800

Hi,

In theory, yes. But using a SHA256 key would be bad in practice for at least 
the same reasons a fully random doc id is bad in CouchDB (and why our default 
"sequential" algorithm works the way it does). Specifically, no two related 
keys would be likely to land anywhere near each other in the keyspace, so 
reassembling a document would involve consulting nodes far and wide. FoundationDB optimizes 
for access to adjacent keys (those with the same prefix) and we would play to 
that strength. It is one of the key improvements over the 2.0 sharding 
architecture.
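
As a rough illustration of the locality point, here is a minimal sketch. It is
not how CouchDB or FoundationDB lay out keys: the helper names (flatten,
path_key, hashed_key), the "doc1"/"doc2" ids and the "docid/path" key layout
are all made up for the example, and a sorted Python list merely stands in for
an ordered keyspace.

```python
import hashlib

def flatten(doc, prefix=()):
    """Yield (path_tuple, scalar) pairs for a nested dict."""
    for name, value in doc.items():
        path = prefix + (name,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value

def path_key(doc_id, path):
    # Hypothetical key layout: doc id first, then the dotted path.
    return doc_id + "/" + ".".join(path)

def hashed_key(doc_id, path):
    # The alternative under discussion: hash the whole key with SHA256.
    return hashlib.sha256(path_key(doc_id, path).encode()).hexdigest()

docs = {
    "doc1": {"foo": 12, "bar": {"baz": 13}},
    "doc2": {"foo": 99, "bar": {"baz": 100}},
}

plain = sorted(path_key(d, p) for d, doc in docs.items() for p, _ in flatten(doc))
hashed = sorted(hashed_key(d, p) for d, doc in docs.items() for p, _ in flatten(doc))

# Path keys: each document's fields sort next to each other, so a single
# range read over the "doc1/" prefix would cover the whole document.
print(plain)
# Hashed keys: the sort order no longer depends on which document a key
# belongs to, so related keys generally end up scattered.
print(hashed)
```

With the path-based keys, everything belonging to one document shares a prefix
and sorts together; with the hashed keys, that grouping is lost.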


B.

On Thu, 24 Jan 2019, at 12:46, Michael Fair wrote:
> On Thu, Jan 24, 2019 at 2:11 AM Robert Samuel Newson <[email protected]>
> wrote:
> 
>> 
>> We’d expand each document into a series of key-value pairs, where the key
>> is the full path into the object and the value is the scalar value. E.g.,
>> 
>> {"foo": 12, "bar": {"baz": 13}}
>> 
>> Would be
>> 
>> foo => 12
>> bar.baz => 13
> 
> 
> I realize this quickly belongs in its own thread for later discussion, but
> I wanted to point out/ask that by "interning the path strings" or using
> some kind of deterministic hash algorithm, like SHA256 (or something
> faster), on the "key path", couldn't you turn all variable-length string
> paths into a fixed-size, integer-typed field id?
> 
> This eliminates the "length" of the path string concern and keeps every
> document field a straight three entry path:
> docid.revisionid.fieldid => [removed?, value]
> 
> where:
> * docid is the unique document identifier
> * revisionid is obvious
> * fieldid is the id of the path string (if a deterministic hash is used,
> it's computed; if indexed, it's looked up/retrieved)
> 
> This idea assumes that the "path.string" <-> fieldid correlation is also
> managed by interning those strings somewhere.
> 
> By adding the removed bit flag, a document becomes simply the aggregation
> of all the latest revisionids for each distinct fieldid lower than the
> revisionid requested; eliminating all duplicate storage requirements for
> non-changing fields.
> 
> When a document update comes in, it breaks the document down into its
> constituent fields, and only needs to add an entry if the state of a field
> has somehow changed from its previous revision.
> 
> It seems like this whole idea might be optimally and transparently handled
> directly inside FDB if FDB was aware of this revisionid "idea".  I'm of
> course not sure which system is expected to handle the described document
> deconstruction.
> 
> 
> ======
> This "fieldid hash" idea is also related to how the IPLD project creates
> "pointers" to JSON documents inside its distributed p2p system to
> hierarchically link portions of different documents together.
> 
> Since a particular docid.revisionid represents a fixed point/state of a
> document in the database, they use that reference as the "value" of a
> special JSON Object that wants to "include"/"point to" the referenced
> document.
> The special JSON Object they used to create a "document link" looks like
> this: {"/": "documenthashid"}
> 
> The uploading document must explicitly put that reference in its own
> document where it wants the system to link in the referenced document.
> This hijacks this form of a JSON Object for this specific purpose and
> prevents all higher level applications of IPLD from using it for any other
> purpose.
> 
> If desirable, the equivalent idea for CouchDB might be: {"_/":
> "docid.revisionid.fieldid"}
> 
> ======
> 
> I'm not saying any of this is a good idea, simply that (1) the string
> length concerns could be eliminated by using interned strings (which likely
> would also improve performance); and (2) this field level storage in FDB
> could enable a basis for adding "document pointers" which I'm sure many
> people would appreciate.
> 
> 
> Mike
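
For reference, the field-level revision scheme quoted above could be sketched
roughly as follows. This is only one reading of the proposal, under loud
assumptions: revisions are plain increasing integers (CouchDB revision ids are
not), field ids are just the dotted path strings rather than hashes or interned
ids, a Python dict stands in for the key-value store, and the function names
write_revision and read_revision are invented for the example.

```python
def write_revision(store, doc_id, rev, new_doc, old_doc):
    """Add entries only for fields whose state changed since old_doc.

    Keys are (docid, revisionid, fieldid); values are (removed, value).
    """
    for field, value in new_doc.items():
        if field not in old_doc or old_doc[field] != value:
            store[(doc_id, rev, field)] = (False, value)
    for field in old_doc:
        if field not in new_doc:
            # Tombstone: the field was removed in this revision.
            store[(doc_id, rev, field)] = (True, None)

def read_revision(store, doc_id, rev):
    """Rebuild a document from the latest entry per field at or below rev."""
    doc = {}
    # Sorting by key visits revisions in ascending order, so later
    # entries for the same field overwrite earlier ones.
    for (d, r, field), (removed, value) in sorted(store.items()):
        if d != doc_id or r > rev:
            continue
        if removed:
            doc.pop(field, None)
        else:
            doc[field] = value
    return doc

store = {}
v1 = {"foo": 12, "bar.baz": 13}
v2 = {"foo": 12, "bar.baz": 14}   # only bar.baz changes
v3 = {"bar.baz": 14}              # foo is removed

write_revision(store, "doc1", 1, v1, {})
write_revision(store, "doc1", 2, v2, v1)
write_revision(store, "doc1", 3, v3, v2)

print(len(store))                        # entries written across three revisions
print(read_revision(store, "doc1", 2))
```

Unchanged fields are never rewritten, so the three revisions here cost four
entries in total rather than five.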
