Hi Sebastien
I would certainly agree that replication without verification would be
problematic, not only from a trust point of view. Far more significant
would be the conflict handling: "which update to the document wins". Do
not underestimate the difficulty in getting this modelling correct.
Another question is whether all the relevant data is in one document only, or in a set of documents.
I do not believe that you will be able to get away without some sort of what you call a "middleman": some server-based process that receives a document (or, most likely, a related set of documents) with sufficient intelligence to know how those documents need to be processed. The way we do it is to allow users direct online access to the core database or databases. From there the user opens the document from the server into the local app. Our apps run locally only, due to the offline-first design, even when you are working online. If the user saves the data, it is saved locally to a personal database. The personal database replicates to the user's personal database on the server, and the change feed on the server picks up that a new packet of documents has arrived. It then determines which database each document belongs to (from an unchangeable metadata tag in each document), checks in the transaction manager that all the documents are ready, available and of the correct expected version, and completes the saving action to the correct database. Your permissions on the database that holds the original document still manage all the access rights. And replication in the online case is for a single user's documents only.
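To make that routing step a bit more concrete, here is a rough sketch in TypeScript; the "targetDb" field name and the readyToCommit() check are placeholders for illustration, not our actual code:

// Rough sketch only: long-poll the change feed of a user's personal database
// on the server and route each document to the core database named in its tag.
const COUCH = 'http://admin:secret@localhost:5984';

async function followPersonalDb(personalDb: string): Promise<void> {
  let since: string = 'now';
  for (;;) {
    const res = await fetch(
      `${COUCH}/${personalDb}/_changes?feed=longpoll&include_docs=true&since=${since}`
    );
    const feed = await res.json();
    since = feed.last_seq;

    for (const change of feed.results) {
      const doc = change.doc;
      if (!doc || doc._deleted) continue;

      // The unchangeable metadata tag tells us which core database owns the document.
      const target = doc.targetDb;
      if (!target || !(await readyToCommit(doc))) continue;

      // Align the revision with whatever the core database currently holds,
      // then complete the save there.
      const head = await fetch(`${COUCH}/${target}/${encodeURIComponent(doc._id)}`, { method: 'HEAD' });
      const rev = head.ok ? JSON.parse(head.headers.get('etag') ?? '""') : undefined;
      await fetch(`${COUCH}/${target}/${encodeURIComponent(doc._id)}`, {
        method: 'PUT',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ ...doc, _rev: rev }),
      });
    }
  }
}

async function readyToCommit(doc: unknown): Promise<boolean> {
  // Placeholder for the transaction manager: are all documents of the packet
  // present, available and at the expected version?
  return true;
}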
The offline case works the same, with the exception that there is now, on the user's device, a local copy of the original database that holds the document. This is a one-way filtered replication, driven by a manifest of document ids to include rather than a pure filtered replication; we found performance much better that way (see the sketch below). The user now opens the document into the app from the local copy of the main database rather than from the server database, but it still saves to their local personal user database, and the save travels the same route as before. Once again, we found that far more efficient than trying to replicate directly into the main database.
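A minimal sketch of that manifest-driven pull, assuming a manifest document that simply lists the ids (illustrative only, not our actual code):

// Pull only the documents listed in the user's manifest into the local copy.
import PouchDB from 'pouchdb';

const remoteMain = new PouchDB('https://couch.example.com/main');
const localMain = new PouchDB('main-copy'); // local copy of the main database

async function pullByManifest(userId: string): Promise<void> {
  // The manifest lists exactly which document ids this user should hold locally.
  const manifest: any = await remoteMain.get(`manifest:${userId}`);

  // One-way replication restricted to those ids; in our experience this performs
  // much better than a pure filtered replication.
  await PouchDB.replicate(remoteMain, localMain, { doc_ids: manifest.doc_ids });
}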
Regards
Willem
On 2019/10/17 11:50, Sebastien wrote:
Thanks for sharing this Willem!
I'll need to digest that idea a bit more, but your approach looks appealing to me, as one of our concerns is how to handle the obvious trust issues with offline database modifications. Simply replicating everything and accepting each modification to the shared databases without verification is dangerous in a multi-user setting, since anyone could modify things that should be read-only for them (for instance) or do other malicious things, potentially leading to data leaks for other users.
My initial idea there (but this is actually a different topic) was to
introduce a middleman between the databases and the clients so as to
perform validation of all incoming changes, before letting them through to
CouchDB. The problem there is that it doesn't seem very straightforward to
create such a "proxy" in Node for instance, which hooks into the data
changes and exposes the normal CouchDB API to the outside.
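To illustrate the rough idea (and part of why it doesn't feel straightforward), a minimal sketch of such an interception point, where the validation rule, the x-user header and the forwarding details are all placeholders:

// Placeholder validation in front of CouchDB; a real proxy would also have to
// deal with _bulk_docs, _changes, attachments, authentication, and so on.
import http from 'http';

const COUCH = 'http://localhost:5984';

function isAllowed(doc: any, user: string | undefined): boolean {
  // Placeholder rule: only let users write documents they own.
  return !!doc && doc.owner === user;
}

http.createServer(async (req, res) => {
  const chunks: Buffer[] = [];
  for await (const chunk of req) chunks.push(chunk as Buffer);
  const body = Buffer.concat(chunks);

  if (req.method === 'PUT' || req.method === 'POST') {
    let doc: any = null;
    try { doc = JSON.parse(body.toString() || 'null'); } catch { /* not a JSON body */ }
    const user = req.headers['x-user'] as string | undefined; // assumed auth header
    if (doc && !isAllowed(doc, user)) {
      res.writeHead(403, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'forbidden', reason: 'rejected by validation' }));
      return;
    }
  }

  // Anything that passes (or isn't a write) is forwarded to CouchDB unchanged.
  const upstream = await fetch(COUCH + (req.url ?? '/'), {
    method: req.method,
    headers: { 'Content-Type': req.headers['content-type'] ?? 'application/json' },
    body: req.method === 'GET' || req.method === 'HEAD' ? undefined : body,
  });
  res.writeHead(upstream.status, { 'Content-Type': upstream.headers.get('content-type') ?? 'application/json' });
  res.end(Buffer.from(await upstream.arrayBuffer()));
}).listen(5985);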
I was also thinking about using an event-driven architecture: pushing normalized events (i.e., one collection of event documents, each with a certain type, version and fields) and thus separating the reads from the writes on the client side as well. Basically, with that the clients would maintain their own private read model and log events whenever an action is taken. Then, synchronization would "simply" mean pushing the event log to the server through a dedicated API, which would check everything before inserting the changes where needed and let everyone get the changes through the (then read-only) shared database.
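At this brainstorming stage, the event documents and the push could look roughly like this (the shape, the endpoint and all names are made up for illustration):

// Clients append normalized event documents locally and push them to a
// dedicated endpoint that validates everything before applying it.
interface EventDoc {
  _id: string;                        // e.g. `${timestamp}:${uuid}` keeps the log ordered
  type: string;                       // e.g. 'task.renamed'
  version: number;                    // event schema version
  payload: Record<string, unknown>;   // event-specific fields
}

async function pushEventLog(events: EventDoc[], token: string): Promise<void> {
  // The server checks every event before folding the changes into the
  // shared (read-only for clients) databases.
  await fetch('https://api.example.com/sync/events', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ events }),
  });
}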
This is for a tad later in our project so it's still brainstorming at this
point.
kr,
Sébastien
On Thu, Oct 17, 2019 at 10:19 AM Willem van der Westhuizen <[email protected]> wrote:
Hi Sebastien
I thought I would give you some quick input on our experience of supporting online and offline. We work in conditions where networks are poor and unreliable, and after quite some pain and trial and error we have reverted to using the replication mechanism to save data even for users working online. In our case, which is a business process and workflow tool, it is absolutely essential that all documents arrive in the correct version. So we built an ACID-style transaction engine, and when the user saves, it triggers a limited replication based on document ids. That has given us orders of magnitude greater stability in poor networks. Each user saves into a per-user database and replicates to the server; the transaction engine then processes the documents into the actual correct database, completing the save.
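Roughly, the save path looks like the following sketch (database names are illustrative and error handling is omitted; this is not our actual code):

// Write to the per-user database, then push only the touched document ids.
import PouchDB from 'pouchdb';

const userDb = new PouchDB('user-willem');
const remoteUserDb = new PouchDB('https://couch.example.com/user-willem');

async function saveAndSync(doc: any): Promise<void> {
  const result = await userDb.put(doc);
  // Replicate only the documents touched by this save; on poor networks this
  // is far more stable than replicating everything.
  await PouchDB.replicate(userDb, remoteUserDb, { doc_ids: [result.id] });
}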
Regards
Willem
On 2019/10/17 10:06, Sebastien wrote:
In the end, we've decided not to rely on filtered replication for our use case.
The issue is that we will not only support an offline-first mode where a filtered copy of the data is retrieved; there will also be an online-only mode (e.g., when accessing the app from an untrusted device, where the users might prefer not to store anything locally). In the online-only mode, the users will need to directly access the database, but it will also need to be filtered, and I'm not sure there's a safe way to do that.
What we've chosen to do now is to keep the information colocated in _users and to go through an API to retrieve the subset of information that is required (e.g., n properties of all members of database X). This way it works fine in the online-only scenario, but also for the offline-first one, since we can persist the information after having retrieved it once. We also keep better control over what happens with the data (to some extent) and can wipe it if/when necessary.
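As a rough sketch, that API endpoint boils down to something like this (the whitelist of exposed fields is just an example; the _security and _users lookups are plain CouchDB endpoints):

// Read the members of database X from its _security object, then expose only
// a whitelisted subset of each member's _users document.
const COUCH = 'http://admin:secret@localhost:5984';
const PUBLIC_FIELDS = ['name', 'displayName', 'avatarUrl'];

async function membersOf(db: string): Promise<Record<string, unknown>[]> {
  const sec: any = await (await fetch(`${COUCH}/${db}/_security`)).json();
  const names: string[] = sec.members?.names ?? [];

  return Promise.all(
    names.map(async (name) => {
      const user: any = await (
        await fetch(`${COUCH}/_users/org.couchdb.user:${encodeURIComponent(name)}`)
      ).json();
      // Never hand out the raw _users document, only the whitelisted fields.
      return Object.fromEntries(PUBLIC_FIELDS.map((f) => [f, user[f]]));
    })
  );
}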
This issue is rather hairy from a privacy protection point of view, but such use cases are critical for multi-user offline-first systems.
Thanks again for the useful feedback!
kr,
Sébastien
On Sun, Oct 13, 2019 at 10:34 AM Stefan Klein <[email protected]> wrote:
Hi Sebastien,
On Sat, 12 Oct 2019 at 15:55, Sebastien <[email protected]> wrote:
Taking that as a starting point, one option could indeed be, as you propose, to copy a subset of that "persons" database into each other database (of course again only a subset of the info, ideally controllable by the end users). One problem that I imagine with that is mainly the amount of incurred data duplication.
With the duplication, it needs to be absolutely clear which of the copies is the authoritative version of the document and which are just copies; then it's manageable.
For instance, imagine that persons contains [A, B, C, D, E, F], then:
- If [A, B, C] have access to database X, then those users should have a copy of [A, B, C] locally
- If [A, D, E] have access to database Y, then those users should have a copy of [A, D, E] locally
Consequently, A should have A, B, C, D, E in his local "persons" database copy.
If at some point E is removed from database Y, then user A should not have E in his local database anymore.
Does that sound like something that can be handled through filtered replication?
I am not aware of any way to delete documents in the target that still exist in the source.
But if you have a copy of E in Y and delete E from Y at a later point, this delete will be replicated to the local DB too (if you don't filter out deleted documents).
Since you probably have some kind of management system to remove E from Y's _security, you could either delete E's profile from Y in the same step or have a cron job or similar to remove the redundant profiles from the databases.
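A minimal sketch of the "same step" variant (names are illustrative; a cron job would loop over all databases instead):

// Remove the user from Y's members list and delete the redundant profile copy.
const COUCH = 'http://admin:secret@localhost:5984';

async function revokeAccess(db: string, userName: string, profileId: string): Promise<void> {
  // 1. Drop the user from the database's _security members.
  const sec: any = await (await fetch(`${COUCH}/${db}/_security`)).json();
  sec.members = sec.members ?? { names: [], roles: [] };
  sec.members.names = (sec.members.names ?? []).filter((n: string) => n !== userName);
  await fetch(`${COUCH}/${db}/_security`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(sec),
  });

  // 2. Delete the copy of the profile from the same database.
  const head = await fetch(`${COUCH}/${db}/${encodeURIComponent(profileId)}`, { method: 'HEAD' });
  if (head.ok) {
    const rev = JSON.parse(head.headers.get('etag') ?? '""');
    await fetch(`${COUCH}/${db}/${encodeURIComponent(profileId)}?rev=${rev}`, { method: 'DELETE' });
  }
}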
One possible issue here though:
If E gains access to Y again while E's profile wasn't changed, the former _deleted revision is still the "current" revision and E's profile stays _deleted in database Y.
You would have to modify E's person document in the persons database so that it gets a new revision.
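For example, a no-op touch of the profile document would be enough to produce that new revision (sketch only; the document id and timestamp field are illustrative):

// Re-save E's profile so a fresh revision exists to win over the old _deleted one.
import PouchDB from 'pouchdb';

const persons = new PouchDB('https://couch.example.com/persons');

async function bumpRevision(personId: string): Promise<void> {
  const doc: any = await persons.get(personId);
  // Re-saving the document (here with a harmless timestamp) creates a new
  // revision, so the next replication into Y replaces the _deleted revision.
  await persons.put({ ...doc, touchedAt: new Date().toISOString() });
}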
I hope that my system will be able to handle hundreds/thousands of databases with 1-100 users in each database, each user having access to ~1-10 databases, thus potentially having access to ~1K user documents locally (this is really just an early guesstimate).
Can't comment on pouchdb.
From my experience, CouchDB doesn't care about how many databases exist; as long as there is no current access to a database, it is just a file in the file system.
The system currently doesn't allow users to manage their own profile, but it's indeed a requirement. I'll probably only allow users to modify their own information while online, through a dedicated API endpoint checking the user's identity instead of letting them directly write to the "persons" database.
With this you do have a clear dataflow:
Users modify their profile via the API; this changes the persons database.
Documents from the persons database are distributed to the destination databases.
So there should be no issue with data duplication.
regards,
Stefan