Hi Sebastien
I would certainly agree that replication without verification would be
problematic, not only from a trust point of view. Far more significant
would be the conflict handling: "which update to the document wins". Do
not underestimate the difficulty in getting this modelling correct.
Another question is whether all the relevant data is in one document only, or in a set of documents.
I do not believe that you will be able to get away without some sort of what you call a "middleman": some server-based process that receives a document (or, most likely, a related set of documents) with sufficient intelligence to know how those documents need to be processed. The way we do it is to allow users direct online access to the core database or databases. From there the user opens the document from the server into the local app. Our apps run locally only, due to the offline-first design, even when you are working online. If the user saves the data, it is saved locally to a personal database. The personal database replicates to the user's personal database on the server, and the change feed on the server picks up that a new packet of documents has arrived. It then determines which database each document belongs to (from an unchangeable metadata tag in each document), checks in the transaction manager that all the documents are ready, available and of the correct expected version, and completes the saving action to the correct database. Your permissions on the database that holds the original document still manage all the access rights. And replication in the online case is for a single user's documents only.
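To make that routing step a bit more concrete, here is a rough sketch in TypeScript; the "targetDb" field name and the readyToCommit() check are placeholders for illustration, not our actual code:

// Rough sketch only: long-poll the change feed of a user's personal database
// on the server and route each document to the core database named in its tag.
const COUCH = 'http://admin:secret@localhost:5984';

async function followPersonalDb(personalDb: string): Promise<void> {
  let since: string = 'now';
  for (;;) {
    const res = await fetch(
      `${COUCH}/${personalDb}/_changes?feed=longpoll&include_docs=true&since=${since}`
    );
    const feed = await res.json();
    since = feed.last_seq;

    for (const change of feed.results) {
      const doc = change.doc;
      if (!doc || doc._deleted) continue;

      // The unchangeable metadata tag tells us which core database owns the document.
      const target = doc.targetDb;
      if (!target || !(await readyToCommit(doc))) continue;

      // Align the revision with whatever the core database currently holds,
      // then complete the save there.
      const head = await fetch(`${COUCH}/${target}/${encodeURIComponent(doc._id)}`, { method: 'HEAD' });
      const rev = head.ok ? JSON.parse(head.headers.get('etag') ?? '""') : undefined;
      await fetch(`${COUCH}/${target}/${encodeURIComponent(doc._id)}`, {
        method: 'PUT',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ ...doc, _rev: rev }),
      });
    }
  }
}

async function readyToCommit(doc: unknown): Promise<boolean> {
  // Placeholder for the transaction manager: are all documents of the packet
  // present, available and at the expected version?
  return true;
}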
The offline case works the same, with the exception that there is now, on the user's device, a local copy of the original database that holds the document. This is a one-way filtered replication, driven by a manifest of document ids to include rather than a pure filtered replication; we found performance much better that way (see the sketch below). The user now opens the document into the app from the local copy of the main database rather than from the server database, but it still saves to their local personal user database, and the save travels the same route as before. Once again, we found that far more efficient than trying to replicate directly into the main database.
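A minimal sketch of that manifest-driven pull, assuming a manifest document that simply lists the ids (illustrative only, not our actual code):

// Pull only the documents listed in the user's manifest into the local copy.
import PouchDB from 'pouchdb';

const remoteMain = new PouchDB('https://couch.example.com/main');
const localMain = new PouchDB('main-copy'); // local copy of the main database

async function pullByManifest(userId: string): Promise<void> {
  // The manifest lists exactly which document ids this user should hold locally.
  const manifest: any = await remoteMain.get(`manifest:${userId}`);

  // One-way replication restricted to those ids; in our experience this performs
  // much better than a pure filtered replication.
  await PouchDB.replicate(remoteMain, localMain, { doc_ids: manifest.doc_ids });
}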
Regards
Willem
On 2019/10/17 11:50, Sebastien wrote:
Thanks for sharing this Willem!
I'll need to digest that idea a bit more, but your approach looks appealing to me, as one of our concerns is how to handle the obvious trust issues with offline database modifications. Simply replicating everything and accepting each modification to the shared databases without verification is dangerous in a multi-user setting, since anyone could modify things that should be read-only for them (for instance) or do other malicious things, potentially leading to data leaks for other users.
My initial idea there (but this is actually a different topic) was to
introduce a middleman between the databases and the clients so as to
perform validation of all incoming changes, before letting them through to
CouchDB. The problem there is that it doesn't seem very straightforward to
create such a "proxy" in Node for instance, which hooks into the data
changes and exposes the normal CouchDB API to the outside.
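To illustrate the rough idea (and part of why it doesn't feel straightforward), a minimal sketch of such an interception point, where the validation rule, the x-user header and the forwarding details are all placeholders:

// Placeholder validation in front of CouchDB; a real proxy would also have to
// deal with _bulk_docs, _changes, attachments, authentication, and so on.
import http from 'http';

const COUCH = 'http://localhost:5984';

function isAllowed(doc: any, user: string | undefined): boolean {
  // Placeholder rule: only let users write documents they own.
  return !!doc && doc.owner === user;
}

http.createServer(async (req, res) => {
  const chunks: Buffer[] = [];
  for await (const chunk of req) chunks.push(chunk as Buffer);
  const body = Buffer.concat(chunks);

  if (req.method === 'PUT' || req.method === 'POST') {
    let doc: any = null;
    try { doc = JSON.parse(body.toString() || 'null'); } catch { /* not a JSON body */ }
    const user = req.headers['x-user'] as string | undefined; // assumed auth header
    if (doc && !isAllowed(doc, user)) {
      res.writeHead(403, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'forbidden', reason: 'rejected by validation' }));
      return;
    }
  }

  // Anything that passes (or isn't a write) is forwarded to CouchDB unchanged.
  const upstream = await fetch(COUCH + (req.url ?? '/'), {
    method: req.method,
    headers: { 'Content-Type': req.headers['content-type'] ?? 'application/json' },
    body: req.method === 'GET' || req.method === 'HEAD' ? undefined : body,
  });
  res.writeHead(upstream.status, { 'Content-Type': upstream.headers.get('content-type') ?? 'application/json' });
  res.end(Buffer.from(await upstream.arrayBuffer()));
}).listen(5985);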
I was also thinking about using an event-driven architecture: pushing normalized events (i.e., one collection of event documents, each with a certain type, version and fields) and thus separating the reads from the writes on the client side as well. Basically, with that the clients would maintain their own private read model and log events whenever an action is taken. Then, synchronization would "simply" mean pushing the event log to the server through a dedicated API, which would check everything before inserting the changes where needed and let everyone get the changes through the (then read-only) shared database.
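At this brainstorming stage, the event documents and the push could look roughly like this (the shape, the endpoint and all names are made up for illustration):

// Clients append normalized event documents locally and push them to a
// dedicated endpoint that validates everything before applying it.
interface EventDoc {
  _id: string;                        // e.g. `${timestamp}:${uuid}` keeps the log ordered
  type: string;                       // e.g. 'task.renamed'
  version: number;                    // event schema version
  payload: Record<string, unknown>;   // event-specific fields
}

async function pushEventLog(events: EventDoc[], token: string): Promise<void> {
  // The server checks every event before folding the changes into the
  // shared (read-only for clients) databases.
  await fetch('https://api.example.com/sync/events', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ events }),
  });
}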
This is for a tad later in our project so it's still brainstorming at this
point.
kr,
Sébastien
On Thu, Oct 17, 2019 at 10:19 AM Willem van der Westhuizen <[email protected]> wrote:
Hi Sebastien
I thought I would give you some quick input on our experience of supporting online and offline. We work in conditions where networks are poor and unreliable, and after quite some pain and trial and error we have reverted to using the replication mechanism to save data even for users working online. In our case, which is a business process and workflow tool, it is absolutely essential that all documents arrive in the correct version. So we built an ACID-style transaction engine, and when the user saves, it triggers a limited replication based on document ids. That has given us orders of magnitude greater stability in poor networks. Each user saves into a per-user database and replicates to the server; the transaction engine then processes the documents into the actual correct database, completing the save.
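Roughly, the save path looks like the following sketch (database names are illustrative and error handling is omitted; this is not our actual code):

// Write to the per-user database, then push only the touched document ids.
import PouchDB from 'pouchdb';

const userDb = new PouchDB('user-willem');
const remoteUserDb = new PouchDB('https://couch.example.com/user-willem');

async function saveAndSync(doc: any): Promise<void> {
  const result = await userDb.put(doc);
  // Replicate only the documents touched by this save; on poor networks this
  // is far more stable than replicating everything.
  await PouchDB.replicate(userDb, remoteUserDb, { doc_ids: [result.id] });
}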
Regards
Willem
On 2019/10/17 10:06, Sebastien wrote:
In the end, we've decided not to rely on filtered replication for our use case.
The issue is that we will not only support an offline-first mode where a filtered copy of the data is retrieved; there will also be an online-only mode (e.g., when accessing the app from an untrusted device, where the users might prefer not to store anything locally). In the online-only mode, the users will need to directly access the database, but it will also need to be filtered, and I'm not sure there's a safe way to do that.
What we've chosen to do now is to keep the information colocated in _users and to go through an API to retrieve the subset of information that is required (e.g., n properties of all members of database X). This way it works fine in the online-only scenario, but also for the offline-first one, since we can persist the information after having retrieved it once. We also keep better control over what happens with the data (to some extent) and can wipe it if/when necessary.
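As a rough sketch, that API endpoint boils down to something like this (the whitelist of exposed fields is just an example; the _security and _users lookups are plain CouchDB endpoints):

// Read the members of database X from its _security object, then expose only
// a whitelisted subset of each member's _users document.
const COUCH = 'http://admin:secret@localhost:5984';
const PUBLIC_FIELDS = ['name', 'displayName', 'avatarUrl'];

async function membersOf(db: string): Promise<Record<string, unknown>[]> {
  const sec: any = await (await fetch(`${COUCH}/${db}/_security`)).json();
  const names: string[] = sec.members?.names ?? [];

  return Promise.all(
    names.map(async (name) => {
      const user: any = await (
        await fetch(`${COUCH}/_users/org.couchdb.user:${encodeURIComponent(name)}`)
      ).json();
      // Never hand out the raw _users document, only the whitelisted fields.
      return Object.fromEntries(PUBLIC_FIELDS.map((f) => [f, user[f]]));
    })
  );
}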
This issue is rather hairy from a privacy protection point of view, but such use cases are critical for multi-user offline-first systems.
Thanks again for the useful feedback!
kr,
Sébastien
On Sun, Oct 13, 2019 at 10:34 AM Stefan Klein <[email protected]> wrote:
Hi Sebastien,
On Sat, 12 Oct 2019 at 15:55, Sebastien <[email protected]> wrote:
Taking that as a starting point, one option could indeed be, as you propose, to copy a subset of that "persons" database into each other database (of course again only a subset of the info, ideally controllable by the end users). One problem that I imagine with that is mainly the amount of incurred data duplication.
With the duplication, it needs to be absolutely clear which of the copies is the authoritative version of the document and which are just copies; then it's manageable.
For instance, imagine that persons contains [A, B, C, D, E, F], then:
- If [A, B, C] have access to database X, then those users should have a copy of [A, B, C] locally
- If [A, D, E] have access to database Y, then those users should have a copy of [A, D, E] locally
Consequently, A should have A, B, C, D, E in his local "persons" database copy.
If at some point E is removed from database Y, then user A should not have E in his local database anymore.
Does that sound like something that can be handled through filtered replication?
I am not aware of any way to delete documents in the target that still exist in the source.
But if you have a copy of E in Y and delete E from Y at a later point, this delete will be replicated to the local DB too (if you don't filter out deleted documents).
Since you probably have some kind of management system to remove E from Y's _security, you could either delete E's profile from Y in the same step or have a cron job or similar to remove the redundant profiles from the databases.
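A minimal sketch of the "same step" variant (names are illustrative; a cron job would loop over all databases instead):

// Remove the user from Y's members list and delete the redundant profile copy.
const COUCH = 'http://admin:secret@localhost:5984';

async function revokeAccess(db: string, userName: string, profileId: string): Promise<void> {
  // 1. Drop the user from the database's _security members.
  const sec: any = await (await fetch(`${COUCH}/${db}/_security`)).json();
  sec.members = sec.members ?? { names: [], roles: [] };
  sec.members.names = (sec.members.names ?? []).filter((n: string) => n !== userName);
  await fetch(`${COUCH}/${db}/_security`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(sec),
  });

  // 2. Delete the copy of the profile from the same database.
  const head = await fetch(`${COUCH}/${db}/${encodeURIComponent(profileId)}`, { method: 'HEAD' });
  if (head.ok) {
    const rev = JSON.parse(head.headers.get('etag') ?? '""');
    await fetch(`${COUCH}/${db}/${encodeURIComponent(profileId)}?rev=${rev}`, { method: 'DELETE' });
  }
}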
One possible issue here though:
If E gains access to Y again while E's profile wasn't changed, the former _deleted revision is still the "current" revision and E's profile stays _deleted in database Y.
You would have to modify E's person document in the persons database so that it gets a new revision.
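For example, a no-op touch of the profile document would be enough to produce that new revision (sketch only; the document id and timestamp field are illustrative):

// Re-save E's profile so a fresh revision exists to win over the old _deleted one.
import PouchDB from 'pouchdb';

const persons = new PouchDB('https://couch.example.com/persons');

async function bumpRevision(personId: string): Promise<void> {
  const doc: any = await persons.get(personId);
  // Re-saving the document (here with a harmless timestamp) creates a new
  // revision, so the next replication into Y replaces the _deleted revision.
  await persons.put({ ...doc, touchedAt: new Date().toISOString() });
}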
I hope that my system will be able to handle hundreds/thousands of databases with 1-100 users in each database, each user having access to ~1-10 databases, thus potentially having access to ~1K user documents locally (this is really just an early guesstimate).
Can't comment on pouchdb.
From my experience, CouchDB doesn't care about how many databases exist; as long as there is no current access to a database, it is just a file in the file system.
The system currently doesn't allow users to manage their own profile, but it's indeed a requirement. I'll probably only allow users to modify their own information while online, through a dedicated API endpoint checking the user's identity instead of letting them directly write to the "persons" database.
With this you do have a clear dataflow:
Users modify their profile via the API; this changes the persons database.
Documents from the persons database are distributed to the destination databases.
So there should be no issue with data duplication.
regards,
Stefan