Many thanks for your reply Adam, this is very interesting. So if I understand you correctly, unfortunately even this approach wouldn't be a silver bullet for the performance issues. I thought they were mainly due to replication; I will need to dig deeper into the archives to understand the situation better.

The big appeal of having everything in the database is consistency (especially considering the replication capabilities of CouchDB); it is too easy to end up with orphaned links/paths. But even having said this, the truth is that what we really care about is the "black box" around the blob, ensuring that the application fully owns the lifecycle of the blob (and that no sysadmin and/or cloud vendor can mess it up by moving or deleting files, changing URL structures, or even because of billing issues). So if talking to an external storage service still enables CouchDB to fetch the blob upon replication (so that the user does not need to mess around manually transferring blobs), then this would work even for those of us who care so much about this aspect. Having to install a storage engine on every instance/server in addition to CouchDB is not a big deal. One can then create credentials for the storage engine that only CouchDB knows.

I would make the distinction between attachments and largeAttachments to convey to the user that a normal attachment is stored in full with the document, while a largeAttachment is chunked GridFS-style. So maybe attachments could be made to respect whatever limit FoundationDB has, while largeAttachments would require a storage engine to be configured.
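To make the proposed distinction concrete, here is a hypothetical client-side sketch in Python using the requests library. Only the first call reflects CouchDB's actual attachment API; the _large_att endpoint in the second call is invented purely for illustration.

import requests

BASE = 'http://localhost:5984/mydb'

# Existing CouchDB API: a normal attachment, stored in full with the document.
with open('photo.jpg', 'rb') as f:
    requests.put(f'{BASE}/doc1/photo.jpg',
                 params={'rev': '1-abc'},
                 headers={'Content-Type': 'image/jpeg'},
                 data=f)

# Hypothetical largeAttachment API: same shape, but the blob would be
# chunked GridFS-style (or offloaded to a configured storage engine).
with open('video.mp4', 'rb') as f:
    requests.put(f'{BASE}/doc1/_large_att/video.mp4',
                 params={'rev': '2-def'},
                 headers={'Content-Type': 'video/mp4'},
                 data=f)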
Reddy

________________________________
From: Adam Kocoloski <kocol...@apache.org>
Sent: Monday, July 29, 2019 4:33 PM
To: dev@couchdb.apache.org <dev@couchdb.apache.org>
Subject: Re: Gridfs for CouchDb

Hi Reddy,

Yes, something like this is possible to build on FoundationDB. The main challenge is that every FoundationDB transaction needs to be under 10 MB, so the CouchDB layer would need to stitch together multiple transactions in order to support larger attachments, and record some metadata at the end to make the result visible to the user.

Personally, I’d like to see a design for attachments that allows CouchDB the option to offload the actual binary storage for attachments to an object store purpose-built for that sort of thing, while still maintaining the CouchDB API including replication capabilities. All the major cloud providers have object storage services, and if you’re not running on cloud infrastructure there are open source projects like Minio and Ceph that are far more efficient at storing large binaries than CouchDB or FoundationDB will ever be.

Of course, I recognize that this integration is extra complexity that many administrators do not need or want, and so we’ll require some native option for attachment storage. The main question I have is whether we write all the extra code to support internal storage of attachments that exceed 10 MB, knowing that we’d still deliver worse performance at higher cost than the “object store offload” approach.

I’m curious why you proposed “attachment” vs. “largeAttachment” as a user-visible distinction? That hadn’t occurred to me personally.

Cheers,
Adam
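A minimal sketch of the transaction-stitching approach described above, using the FoundationDB Python bindings. The key layout, chunk size, batching, and helper names are assumptions for illustration only, not an actual CouchDB design (note FoundationDB also caps individual values at 100 kB, hence chunks well below the 10 MB transaction limit):

import uuid
import fdb

fdb.api_version(630)
db = fdb.open()

CHUNK = 64 * 1024        # stays under FoundationDB's 100 kB value limit
CHUNKS_PER_TXN = 100     # ~6.4 MB per transaction, under the 10 MB limit

@fdb.transactional
def _write_batch(tr, upload_id, first_idx, chunks):
    # One transaction per batch of chunks; each key encodes the chunk index.
    for i, data in enumerate(chunks, start=first_idx):
        tr[fdb.tuple.pack(('att', upload_id, 'chunk', i))] = data

@fdb.transactional
def _finalize(tr, doc_id, name, upload_id, length):
    # Written last: until this metadata key exists, readers cannot see
    # the upload, so a crashed upload leaves only unreferenced chunks.
    meta = fdb.tuple.pack((upload_id, length, CHUNK))
    tr[fdb.tuple.pack(('att_meta', doc_id, name))] = meta

def put_large_attachment(doc_id, name, blob):
    upload_id = uuid.uuid4().bytes
    chunks = [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]
    for start in range(0, len(chunks), CHUNKS_PER_TXN):
        _write_batch(db, upload_id, start,
                     chunks[start:start + CHUNKS_PER_TXN])
    _finalize(db, doc_id, name, upload_id, len(blob))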
> On Jul 29, 2019, at 1:43 AM, Reddy B. <redd...@live.fr> wrote:
>
> Hello,
>
> MongoDB has a driver called GridFS intended to handle large files. Since they
> have a hard limit of 16 MB per document, this driver transparently splits a
> file into 256 kB chunks and then transparently reassembles it upon read.
> Metadata is stored so that features such as range queries are supported (very
> useful in video/audio streaming scenarios; CouchDB supports range queries
> too). More information is available on this page:
>
> https://docs.mongodb.com/manual/core/gridfs/
>
> I was wondering if something similar could be built on top of FoundationDB
> and whether such an approach would solve the current issues with large
> attachments. In particular, it could make replication easier, since only
> small files would need to be replicated and it would be easier to resume
> replication at a particular chunk.
>
> MongoDB stores this data in a dedicated "collection", which is not the
> CouchDB way. My thinking was that this could be opt-in: in addition to a
> document being able to have an attachment, we could introduce a new entity
> called largeAttachment using such a driver behind the scenes, and the user
> would choose how to best store their data based on the performance
> characteristics of each storage method and their needs (field, attachment,
> largeAttachment).
>
> I am just wondering if the idea is broadly feasible in the next FDB-based
> version or if there is an obvious showstopper / challenge that would need to
> be addressed first.
>
> Thank you!
>
> Reddy
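For completeness, a companion sketch for the read side, showing how a GridFS-style byte-range read (the streaming use case mentioned above) could fetch only the chunks that overlap the requested range. It reuses the illustrative key layout from the writer sketch above; a very large range would itself need to be split across transactions:

import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def read_attachment_range(tr, doc_id, name, start, end):
    # Look up the metadata written by _finalize() in the writer sketch.
    meta = tr[fdb.tuple.pack(('att_meta', doc_id, name))]
    upload_id, length, chunk_size = fdb.tuple.unpack(bytes(meta))

    end = min(end, length)
    first = start // chunk_size
    last = (end - 1) // chunk_size

    # Fetch only the chunks covering [start, end).
    begin_key = fdb.tuple.pack(('att', upload_id, 'chunk', first))
    end_key = fdb.tuple.pack(('att', upload_id, 'chunk', last + 1))
    data = b''.join(kv.value for kv in tr.get_range(begin_key, end_key))

    # Trim to the exact byte range within the fetched chunks.
    offset = start - first * chunk_size
    return data[offset:offset + (end - start)]

# Example: read 1 MB of a stored video starting at the 1 MB mark.
# clip = read_attachment_range(db, 'doc1', 'video.mp4', 1_000_000, 2_000_000)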