Hello all,

The Distributed Server currently relies on the BlobStore component to store potentially large binary data: email headers, bodies and attachments. The mailbox, mailRepository and mailQueue core components, as well as the DeletedMessageVault extension, rely on it.
The available implementations include a memory one (for testing), Cassandra, and Object Storage (S3). We furthermore added a cache mechanism in order to speed up reads of recent headers.

As mail traffic is highly duplicated (a single mail can have several recipients, and the same attachment can be sent several times), we currently deduplicate blobs stored in the BlobStore. However, handling deletions in a deduplicated context is a non-trivial matter. The problem is covered in an ADR [1] and this work is currently in (background?) progress.

[1] https://github.com/apache/james-project/blob/master/src/adr/0039-distributed-blob-garbage-collector.md

In order to achieve this work, we decided to separate the actual BlobStore business logic from the Data Access Object layer. See https://issues.apache.org/jira/browse/JAMES-3028 . This work has been carried out for both Memory and Cassandra but is yet to be contributed for ObjectStorage.

While working on this topic, we encountered issues with the current jclouds implementation, which is pretty boilerplate-heavy. jclouds was chosen in order to support both the S3 and Swift APIs. However, jclouds does not allow asynchronous requests, which leads to a poor, underperforming threading model. Furthermore, recent Swift versions also support the S3 API. Thus we tried to significantly simplify the code by dropping Swift support and relying directly on an S3 client. [3] is a move toward this and seems to unlock significant performance enhancements.

[3] https://github.com/linagora/james-project/pull/3430

It is to be noted that deduplication requires a garbage collection to be run, which brings extra operational complexity. Some users might not consider deduplication worth this operational cost. Also, we have to ensure actual blob deletion while waiting for a garbage collection solution to be implemented. In this context, one of Linagora's customers has decided to fund the work of providing an alternative to blob deduplication.
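To illustrate why deletion is non-trivial under deduplication, here is a minimal sketch: deduplicated blob ids are typically derived from a content hash, so two messages carrying the same attachment end up referencing the same blob. The class and method names below are hypothetical and not the actual James API:

```java
import java.security.MessageDigest;
import java.util.HexFormat;

// Hypothetical sketch: content-addressed blob ids, as used by deduplication.
// Two writes of the same bytes yield the same id, so deleting the blob on
// behalf of one message would also destroy it for every other message
// referencing it - hence the need for a garbage collector.
public class DeduplicationSketch {
    static String blobId(byte[] content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
        return HexFormat.of().formatHex(digest);
    }

    public static void main(String[] args) throws Exception {
        byte[] attachment = "same attachment sent twice".getBytes();
        String idForAlice = blobId(attachment);
        String idForBob = blobId(attachment);
        // Same content, same blob: storage is shared between both recipients.
        System.out.println(idForAlice.equals(idForBob)); // prints "true"
    }
}
```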
Most Linagora contributors are likely going to work on this topic in the coming weeks/months. So far, we identified the following strategy:

 - Continue the work on JAMES-3028 and provide an S3-based BlobStore DAO, as it is conclusive.
 - Write a DeDuplicatingBlobStore that replaces all current implementations of the BlobStore. It will rely on the BlobStoreDAO interface (currently DumbBlobStore).
 - Write a PassThroughBlobStore. This BlobStore stores each blob separately and does not deduplicate content. It can effectively delete content right away, without any garbage collection taking place.
 - Expose a configuration option of the Distributed Server for choosing either the PassThroughBlobStore or the DeDuplicatingBlobStore. We of course need to ensure configuration management: if I go from 'deduplication.enable=true' to 'deduplication.enable=false', I can end up deleting some blobs referenced by other entities, making their content no longer available. In other parts of the code base, event sourcing is used to handle such concerns. Given that one cannot "disable deduplication after enabling it" and that garbage collection is currently not implemented, we should disable deduplication by default.
 - The only current usage of `BlobStore::delete` is a special case: DeletedMessageVault::deleteMessage. This operation is intended to immediately delete a blob, and a garbage-collection-based algorithm is not suited for this need. Of course, we should avoid such a 'feature' polluting the API of the BlobStore. DeletedMessageVault::deleteMessage could be handled via a call to the BlobStoreDAO. As a consequence, we will be able to turn `DeDuplicatingBlobStore::delete` into a no-op operation, while waiting for dereferencing/garbage collection to be implemented.
 - Once deduplicated content is no longer aggressively deleted by `BlobStore::delete`, we can ensure mailbox, mailQueue and mailRepository data is effectively deleted when using the PassThroughBlobStore.
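The strategy above could be sketched roughly as follows. This is a toy illustration only: the interface and class names approximate the proposal but are not the actual James interfaces, the in-memory DAO and id schemes are placeholders, and a real DeDuplicatingBlobStore would use a proper content hash rather than `Arrays.hashCode`:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal DAO, standing in for the proposed BlobStoreDAO.
interface BlobStoreDAO {
    void save(String blobId, byte[] data);
    void delete(String blobId);
}

class InMemoryDAO implements BlobStoreDAO {
    final Map<String, byte[]> blobs = new HashMap<>();
    public void save(String blobId, byte[] data) { blobs.put(blobId, data); }
    public void delete(String blobId) { blobs.remove(blobId); }
}

// Deduplicating flavor: the id is derived from the content, so identical
// blobs are stored once. delete is a no-op until garbage collection exists,
// so a blob shared with another entity is never destroyed prematurely.
class DeDuplicatingBlobStore {
    private final BlobStoreDAO dao;
    DeDuplicatingBlobStore(BlobStoreDAO dao) { this.dao = dao; }
    String save(byte[] data) {
        String id = Integer.toHexString(Arrays.hashCode(data)); // placeholder for a real content hash
        dao.save(id, data);
        return id;
    }
    void delete(String id) { /* no-op: wait for garbage collection */ }
}

// Pass-through flavor: each write gets a fresh id, nothing is shared,
// so immediate deletion is safe and needs no garbage collection.
class PassThroughBlobStore {
    private final BlobStoreDAO dao;
    private long counter = 0;
    PassThroughBlobStore(BlobStoreDAO dao) { this.dao = dao; }
    String save(byte[] data) {
        String id = "blob-" + counter++;
        dao.save(id, data);
        return id;
    }
    void delete(String id) { dao.delete(id); } // effective right away
}
```

A configuration flag such as 'deduplication.enable' (the name used above in this mail) would then select which of the two implementations the Distributed Server wires in at startup.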
Note that in order to simplify this work, I propose to drop the already-deprecated HybridBlobStore.

Do you see other solutions or workarounds?

Best regards,

Benoit

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org