Hi everyone.

I'm new here and just discovered the ongoing proposal for CouchDB to
rely upon FDB.

With my team, we were considering providing an HTTP API over FDB in the
form of the CouchDB API definition, so I'm very pleased to see there is
already an ongoing effort for this (even if it is still a proposal). I've
tried to catch up with all the good discussions on how you could make this
work by mapping to the K/V model, but apologies if I have missed a point.

I'm curious how you plan to manage multi-tenancy while ensuring good
scalability and avoiding hotspotting.

I've read an idea from Mickael that uses a CryptoHash to map the model this
way:

{bucket_id}/{cryptohash}  : value

We currently use this CryptoHash mechanism to manage some data in a
multi-tenancy context applied to Time Series.

Here is a simple diagram that summarizes it:

{raw_data} -> ingress component -> {hashed_metadata+data} -> HBase
                                -> {crypted_metadata}     -> HBase
                                -> {crypted_metadata}     -> Directory service

Query -> egress component -> HBase

raw_data is in the metric{tags} format, in the Prometheus/OpenTSDB/Warp10
style.
hashed_metadata is a pair of 64-bit or 128-bit hashes: hash(metric) +
hash(tags).
The default is 64 bits, but that can lead to collisions in the keyspace
above 1B unique series, where 128-bit hashes are safer.
egress queries the Directory service to get the list of series to read from
the store.
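
As a rough check of the collision remark above, here is a small sketch
using the standard birthday-bound approximation. It assumes a single,
uniformly distributed hash of the given width, which is a simplification of
the real scheme (two hashes per series); the function name is mine.

```python
import math

def collision_probability(n_series, hash_bits):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_series * (n_series - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is tiny (128-bit case).
    return -math.expm1(-exponent)

for bits in (64, 128):
    p = collision_probability(1_000_000_000, bits)
    print(f"{bits}-bit hash, 1B series: P(collision) ~ {p:.3g}")
```

Under that assumption, 1B series on a 64-bit hash gives a chance of at
least one collision of a few percent (roughly 2.7%), while at 128 bits it
is negligible.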

During authentication, a custom "application" label is added to the labels
that end up in the data model, and it is hashed along with them, which
avoids conflicts between users. Hashed metadata are suffixed with a
timestamp because it's convenient for Time Series data.
What makes it very useful is:
 - it can still use scans per series (metrics+tags)
 - it avoids hotspotting the cluster and ensures a very good distribution
among nodes
 - it provides authentication through a directory service that acts as an
indirection
 - keys have a consistent size even when metrics or tags are very long
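
To make the key layout described above concrete, here is a minimal sketch.
It is not our production code: the hash function (BLAKE2b truncated to 64
bits), the tag canonicalization, and the names h64 and series_key are all
assumptions chosen for illustration.

```python
import hashlib
import struct

def h64(data):
    """64-bit hash; BLAKE2b is an illustrative choice, not the real one."""
    return hashlib.blake2b(data, digest_size=8).digest()

def series_key(application, metric, tags, timestamp):
    """hash(metric) + hash(tags) + big-endian timestamp (24 bytes total)."""
    labels = dict(tags)
    # The tenant label is injected at authentication time, so two users
    # with the same metric and tags still get different keys.
    labels["application"] = application
    # Sort the labels so the hash does not depend on tag order.
    canonical = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    # Assumes a non-negative timestamp that fits in 64 bits.
    return h64(metric.encode()) + h64(canonical.encode()) + struct.pack(">Q", timestamp)

k1 = series_key("app-a", "cpu.load", {"host": "h1", "dc": "eu"}, 1000)
k2 = series_key("app-a", "cpu.load", {"host": "h1", "dc": "eu"}, 2000)
print(k1[:16] == k2[:16], k1 < k2)
```

All points of one series share the same 16-byte prefix and sort by
timestamp, so a range scan on that prefix reads a single series, while
different series and tenants are spread across the keyspace.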

I think this kind of model can perfectly apply to FDB for documents, given
that the Namespace would be a user application/bucket/...:

hash ( {NS} + {...} + {DOC_ID} ) / fields / ...
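
A minimal sketch of how such document keys could be built, using plain
bytes rather than the FDB client. The 128-bit BLAKE2b hash, the length
prefix on the namespace, the "/" separator, and the names doc_prefix and
field_key are my own assumptions, not something settled in the proposal.

```python
import hashlib

def doc_prefix(namespace, doc_id, size=16):
    """Fixed-size hash of namespace + document id."""
    ns = namespace.encode()
    # Length-prefix the namespace so ("ab", "c") and ("a", "bc") differ.
    material = len(ns).to_bytes(4, "big") + ns + doc_id.encode()
    return hashlib.blake2b(material, digest_size=size).digest()

def field_key(namespace, doc_id, *path):
    """Key for one field: hash(NS + DOC_ID) / fields / ..."""
    return doc_prefix(namespace, doc_id) + b"/" + "/".join(path).encode()

k_name = field_key("tenant-a", "doc-42", "fields", "name")
k_age = field_key("tenant-a", "doc-42", "fields", "age")
print(k_name[:16] == k_age[:16])
```

All fields of one document share a 16-byte prefix, so the whole document
can be read with one range read, and the same DOC_ID in two namespaces
lands in unrelated parts of the keyspace.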

The drawback is that it may require a bit more storage for keys, but the
hash size could be adjusted to the use case. Moreover, managing rights at
the document level would require additional fields or a few bytes per
document, together with a directory index (which could live in memory
inside CouchDB, outside CouchDB in something like Elastic, or directly
inside FDB).

I realize that FDB as a backend is already a considerable amount of work,
and that pushing multi-tenancy adds even more, maybe inside CouchDB itself.
For example, tokens could embed rights and bucket ids, which CouchDB would
use to authorize requests and to build the underlying data model with
scalability and optimizations in mind. Also, has anyone considered reaching
out to the FDB folks to try to align the CouchDB document representation
with the Document Layer (
https://foundationdb.github.io/fdb-document-layer/data-modeling.html )?
This would also make CouchDB compatible with the MongoDB API.

I don't know where the discussions stand, but maybe we could help :)
