Hi,
This isn’t the thread to get into this level of detail (yet!), but I
do have some thoughts.
The two uses of sha256 here seem inappropriate to me. Users will typically
choose short, readable names for both user_name and db_name, and this would
force long, random-looking strings on them, which reduces simplicity and
increases key size, the opposite of what we want to do.
Instead, I think we should enforce a limit of a few hundred characters on each item.
If a user really can’t work within that constraint they can run the name
through a message digest algorithm and deal with the fallout of that
obfuscation themselves. Users that can name a database succinctly would not be
penalised.
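To put rough numbers on the key-size point, here is a minimal Python sketch.
It assumes raw 32-byte SHA-256 digests (the proposal doesn't say raw or hex)
and a hypothetical 256-byte limit standing in for "a few hundred characters";
validate_name, hashed_name and MAX_NAME_BYTES are illustrative names only.

```python
import hashlib

# Hypothetical limit; the thread only says "a few hundred characters".
MAX_NAME_BYTES = 256


def validate_name(name: str) -> bytes:
    """Return the UTF-8 bytes of a name, rejecting empty or over-long names."""
    encoded = name.encode("utf-8")
    if not encoded:
        raise ValueError("name must not be empty")
    if len(encoded) > MAX_NAME_BYTES:
        raise ValueError(f"name is {len(encoded)} bytes; limit is {MAX_NAME_BYTES}")
    return encoded


def hashed_name(name: str) -> bytes:
    """Return the raw SHA-256 digest of a name (always 32 bytes)."""
    return hashlib.sha256(name.encode("utf-8")).digest()


user_name, db_name = "bob", "orders"

# Bytes contributed to every key by the two name components.
plain_size = len(validate_name(user_name)) + len(validate_name(db_name))
hashed_size = len(hashed_name(user_name)) + len(hashed_name(db_name))

print(plain_size, hashed_size)  # 9 64
```

Short names stay short and readable in the key, and only a user who exceeds
the limit has to hash the name themselves.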
I do agree on the {NS} piece. We should not assume that we’re the only
application inside the FoundationDB database. Indeed, the FoundationDB
documentation regards this as a best practice
(https://apple.github.io/foundationdb/api-python.html#subspaces: "As a best
practice, API clients should use at least one subspace for application data.").
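As a toy illustration of why the {NS} prefix matters, the sketch below models
keys as plain Python tuples. It is not the FoundationDB API: real code would
use fdb.Subspace and the tuple layer for packing. The Namespace class, the
"couchdb" prefix and the sample keys are all assumptions for illustration.

```python
from typing import Dict, Tuple

Key = Tuple[str, ...]


class Namespace:
    """Toy stand-in for a subspace: prepends a fixed prefix to every key."""

    def __init__(self, *prefix: str) -> None:
        self.prefix: Key = tuple(prefix)

    def key(self, *parts: str) -> Key:
        """Build a full key inside this namespace."""
        return self.prefix + tuple(parts)

    def contains(self, key: Key) -> bool:
        """True if the key lives under this namespace's prefix."""
        return key[: len(self.prefix)] == self.prefix


couch = Namespace("couchdb")         # {NS}; the name is an assumption
other = Namespace("some_other_app")  # another application on the same cluster

# Both applications share one key space; plain names are used instead of hashes.
store: Dict[Key, bytes] = {
    couch.key("bob", "orders", "data", "docs", "idx_by_docid", "doc1"): b"...",
    other.key("bob", "orders"): b"...",
}

# Scanning by prefix returns only our keys, even though names overlap.
ours = sorted(k for k in store if couch.contains(k))
print(ours)
```

Without the prefix, the two applications' ("bob", "orders", ...) keys would
interleave in the same range.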
B.
> On 24 Jan 2019, at 20:16, Ilya Khlopotov <[email protected]> wrote:
>
> First I apologize if you receive it twice (slightly different versions as
> well). It looks like my email is misconfigured since reply to
> [email protected] from mail client didn't go through.
>
>> This eliminates the "length" of the path string concern and keeps every
>> document field a straight three entry path:
>> docid.revisionid.fieldid => [removed?, value]
>
> Michael this is a very good idea. I was working on proposal to use something
> like the following:
>
> * {NS} / sha256(user_name) / sha256(db_name) / index / by_seq / {update_seq}
> * {NS} / sha256(user_name) / sha256(db_name) / index / by_vsn / {vsn}
> * {NS} / sha256(user_name) / sha256(db_name) / data / docs / idx_by_docid /
> {docid}
> * {NS} / sha256(user_name) / sha256(db_name) / data / docs / {doc_idx} /
> content / {vsn} / body / {json_path} / {page_idx}
>
> Here:
> - {NS} is configurable namespace dedicated to CouchDB on FDB cluster.
> - {vsn} is FDB versionstamp
> - {page_idx} is separate path to represent scalar JSON values which exceed
> FDB limitations on value size
> - {docid} - document id
> - {doc_idx} - arbitrary value to save different revisions of the document. We
> add a level of indirection since we don't want to use {rev}. Because we might
> insert documents during _bulk operations. In this case inserted but not yet
> committed revisions of documents shouldn't be in list of available revisions.
>
> In the above model I couldn't figure out yet how to compress json_path.
>
> I'll send what I have so far into separate thread (when it would be started).
>
> Best regards,
> iilyak
>
>
> On 2019/01/24 12:46:14, Michael Fair <[email protected]> wrote:
>> On Thu, Jan 24, 2019 at 2:11 AM Robert Samuel Newson <[email protected]>
>> wrote:
>>
>>>
>>> We’d expand each document into a series of key-value pairs, where the key
>>> is the full path into the object and the value is the scalar value. E.g,
>>>
>>> {"foo": 12, "bar": {"baz": 13}}
>>>
>>> Would be
>>>
>>> foo => 12
>>> bar.baz => 13
>>
>>
>> I realize this quickly belongs in its own thread for later discussion, but
>> I wanted to point out/ask that by "interning the path strings" or using
>> some kind of deterministic hash algorithm, like SHA256 (or something
>> faster), on the "key path", couldn't you turn all variable-length strings
>> paths into a fixed size, integer type, field id?
>>
>> This eliminates the "length" of the path string concern and keeps every
>> document field a straight three entry path:
>> docid.revisionid.fieldid => [removed?, value]
>>
>> where:
>> * docid is the unique document identifier
>> * revisionid is obvious
>> * fieldid is the id of the path string (if a deterministic hash is used,
>> it's computed; if indexed, it's looked up/retrieved)
>>
>> This idea assumes that the "path.string" <-> fieldid correlation is also
>> managed by interning those strings somewhere.
>>
>> By adding the removed bit flag, a document becomes simply the aggregation
>> of all the latest revisionids for each distinct fieldid lower than the
>> revisionid requested; eliminating all duplicate storage requirements for
>> non-changing fields.
>>
>> When a document update comes in, it breaks the document down into its
>> constituent fields, and only needs to add an entry if the state of a field
>> has somehow changed from its previous revision.
>>
>> It seems like this whole idea might be optimally and transparently handled
>> directly inside FDB if FDB was aware of this revisionid "idea". I'm of
>> course not sure which system is expected to handle the described document
>> deconstruction.
>>
>>
>> ======
>> This "fieldid hash" idea is also related to how the IPLD project creates
>> "pointers" to JSON documents inside its distributed p2p system to
>> hierarchically link portions of different documents together.
>>
>> Since a particular docid.revisionid represents a fixed point/state of a
>> document in the database, they use that reference as the "value" of a
>> special JSON Object that wants to "include"/"point to" the referenced
>> document.
>> The special JSON Object they used to create a "document link" looks like
>> this: {"/": "documenthashid"}
>>
>> The uploading document must explicitly put that reference in its own
>> document where it wants the system to link in the referenced document.
>> This hijacks this form of a JSON Object for this specific purpose and
>> prevents all higher level applications of IPLD from using it for any other
>> purpose.
>>
>> If desirable, the equivalent idea for CouchDB might be: {"_/":
>> "docid.revisionid.fieldid"}
>>
>> ======
>>
>> I'm not saying any of this is a good idea, simply that (1) the string
>> length concerns could be eliminated by using interned strings (which likely
>> would also improve performance); and (2) this field level storage in FDB
>> could enable a basis for adding "document pointers" which I'm sure many
>> people would appreciate.
>>
>>
>> Mike
>>