Hi Timo,

I am one of the authors of the software Wido announced in his mail. First, I'd 
like to say that Dovecot is a wonderful piece of software and thank you for it. 
I would like to give some explanations regarding the design we chose.

From: Timo Sirainen <t...@iki.fi>
Reply-To: Dovecot Mailing List <dovecot@dovecot.org>
Date: 24 September 2017 at 02:43:44
To: Dovecot Mailing List <dovecot@dovecot.org>
Subject: Re: librmb: Mail storage on RADOS with Dovecot

It would have been nicer if RADOS support was implemented as a lib-fs driver, 
and the fs-API had been used all over the place elsewhere. So 1) 
LibRadosMailBox wouldn't have been relying so much on RADOS specifically and 2) 
fs-rados could have been used for other purposes. There are already fs-dict and 
dict-fs drivers, so the RADOS dict driver may not have been necessary to 
implement if fs-rados was implemented instead (although I didn't check it 
closely enough to verify). (We've had fs-rados on our TODO list for a while 
also.)

Actually, I considered using the fs-api to build a RADOS driver, but I did not 
follow that path:

The dict-fs mapping is quite simplistic. For example, I would not be able to 
use RADOS read/write operations to batch requests or to model the dictionary 
transactions. Also, there is no async support if you hide the RADOS dictionary 
behind an fs-api module, which would make the use of dict-rados in the 
dict-proxy harder. Using dict-rados in the dict-proxy would help to lower the 
price you have to pay for the process model Dovecot uses so heavily.
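To make the batching point more concrete, here is a minimal sketch (not librmb 
code) of what a dictionary-style transaction could look like directly on 
librados: several K/V changes are collected into one ObjectWriteOperation, 
applied atomically on the OSD, and submitted asynchronously. The pool name 
"mail_dictionary", the per-user object id "u-12345" and the keys are made up 
for illustration.

  #include <rados/librados.hpp>
  #include <map>
  #include <string>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");                        // client id: assumption
    cluster.conf_read_file("/etc/ceph/ceph.conf");
    cluster.connect();

    librados::IoCtx io;
    cluster.ioctx_create("mail_dictionary", io);  // pool name: assumption

    // All K/V changes of one "transaction" go into a single write op,
    // so RADOS applies them atomically on the OSD.
    std::map<std::string, librados::bufferlist> kv;
    kv["priv/quota/storage"].append(std::string("1048576"));
    kv["priv/quota/messages"].append(std::string("42"));

    librados::ObjectWriteOperation op;
    op.omap_set(kv);

    // Asynchronous submission; the caller can continue and only wait
    // for the completion when the result is actually needed.
    librados::AioCompletion *c = librados::Rados::aio_create_completion();
    io.aio_operate("u-12345", c, &op);            // per-user object: assumption
    c->wait_for_complete();
    int r = c->get_return_value();
    c->release();
    return (r < 0) ? 1 : 0;
  }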

Using an fs-rados module behind a storage module, let's say sdbox, would IMO not 
fit our goals. We planned to store mails as RADOS objects and their 
(immutable) metadata in RADOS omap K/V. We want to be able to access the 
objects without Dovecot. This is not possible if RADOS is hidden behind an 
fs-rados module: the format of the stored objects would be different and would 
depend on the storage module sitting in front of fs-rados.
Another reason is that at the fs level the operations are too decomposed. We 
would not have any transactional contexts etc., as we do with the dictionaries. 
This context information allows us to use the RADOS operations in an optimized 
way. The storage API is IMO the right level of abstraction, especially if we 
follow our long-term goal to eliminate the fs requirement for index data too. 
I like the internal abstraction of sdbox/mdbox a lot, but for our purpose it 
should have been on the mail and not the file level.
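To illustrate the mail-level layout we are aiming for (a sketch, not the actual 
librmb code): one RADOS write operation stores the RFC 822 content as object 
data and the immutable metadata as omap K/V, atomically. The metadata key names 
used here are invented.

  #include <rados/librados.hpp>
  #include <map>
  #include <string>

  // Store one mail: content as object data, immutable metadata as omap.
  int store_mail(librados::IoCtx &io, const std::string &oid,
                 const std::string &rfc822, const std::string &uid,
                 const std::string &received_date) {
    librados::bufferlist body;
    body.append(rfc822);

    std::map<std::string, librados::bufferlist> meta;
    meta["U"].append(uid);            // e.g. IMAP UID (key name invented)
    meta["R"].append(received_date);  // e.g. received date (key name invented)

    librados::ObjectWriteOperation op;
    op.write_full(body);   // mail content = object data
    op.omap_set(meta);     // immutable metadata = omap K/V
    return io.operate(oid, &op);      // both applied atomically
  }

Because data and metadata are plain RADOS primitives, such an object can be 
inspected without Dovecot, e.g. with the rados CLI 
(rados -p <pool> -N <namespace> listomapvals <oid>).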

But building an fs-rados should not be very hard.

BTW. We've also been planning on open sourcing some of the obox pieces, mainly 
fs-drivers (e.g. fs-s3). The obox format maybe too, but without the "metacache" 
piece. The current obox code is a bit too much married into the metacache 
though to make open sourcing it easy. (The metacache is about storing the 
Dovecot index files in object storage and efficiently caching them on local 
filesystem, which isn't planned to be open sourced in near future. That's 
pretty much the only difficult piece of the obox plugin, with Cassandra 
integration coming as a good second. I wish there had been a better/easier 
geo-distributed key-value database to use - tombstones are annoyingly 
troublesome.)


That would be great.

And using rmb-mailbox format, my main worries would be:
* doesn't store index files (= message flags) - not necessarily a problem, as 
long as you don't want geo-replication

Your index management is awesome, highly optimized and not easily 
reimplemented. Very nice work. Unfortunately it is not using the fs-api and is 
therefore not capable of being located on non-fs storage. We believe that 
CephFS will be a good and stable solution for the near future. Of course it 
would be nicer to have a lib-index that allows us to plug in different backends.

* index corruption means rebuilding them, which means rescanning list of mail 
files, which means rescanning the whole RADOS namespace, which practically 
means rescanning the RADOS pool. That most likely is a very very slow 
operation, which you want to avoid unless it's absolutely necessary. Need to be 
very careful to avoid that happening, and in general to avoid losing mails in 
case of crashes or other bugs.

Yes, disaster is a problem. We are trying to build as many rescue tools as 
possible, but in the end scanning mails is involved. All mails are stored 
within separate RADOS namespaces, each representing a different user. This will 
help us to avoid scanning the whole pool. But this should not be a regular 
operation, you are right.
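As a rough illustration of why the per-user namespaces help (example code, not 
our rescue tooling; the namespace naming is invented): a rescue scan can 
restrict the object listing to one user's namespace instead of walking the pool.

  #include <rados/librados.hpp>
  #include <iostream>
  #include <string>

  // List only the mail objects of one user by restricting the listing
  // to that user's RADOS namespace.
  void list_user_mails(librados::IoCtx &io, const std::string &user_ns) {
    io.set_namespace(user_ns);  // e.g. "u-peter" (naming invented)
    for (librados::NObjectIterator it = io.nobjects_begin();
         it != io.nobjects_end(); ++it) {
      std::cout << it->get_oid() << std::endl;
    }
  }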

* I think copying/moving mails physically copies the full data on disk

We tried to optimize this. Moves within a user's mailboxes are done without 
copying the mails, by just changing the index data. Copies, when really 
necessary, are done by native RADOS commands (OSD to OSD) without transferring 
the data to the client and back. There is potential for even more optimization: 
we could build a mechanism similar to the mdbox reference counters to reduce 
copying. I am sure we will give it a try in a later version.
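For illustration, a hedged sketch of such a server-side copy with librados (not 
the exact librmb code): the copy_from operation lets the OSD holding the 
destination object pull the data from the source object directly, so the mail 
never travels through the Dovecot process.

  #include <rados/librados.hpp>
  #include <string>

  // Copy a mail object server-side: the destination OSD fetches the
  // data from the source OSD; the client only issues the operation.
  int copy_mail(librados::IoCtx &src_io, const std::string &src_oid,
                librados::IoCtx &dst_io, const std::string &dst_oid) {
    uint64_t size = 0;
    time_t mtime = 0;
    int r = src_io.stat(src_oid, &size, &mtime);  // also fetches the version
    if (r < 0)
      return r;

    librados::ObjectWriteOperation op;
    // Guard the copy with the source version we just observed.
    op.copy_from(src_oid, src_io, src_io.get_last_version());
    return dst_io.operate(dst_oid, &op);
  }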

* Each IMAP/POP3/LMTP/etc process connects to RADOS separately from the others 
- some connection pooling would likely help here

Dovecot uses separate processes a lot. You are right that this is a problem 
for protocols/libraries that have a high setup cost. You built some mechanisms 
like login process reuse or the dict-proxy to overcome that problem.

Ceph is a low-latency object store. One reason for the speed of Ceph is the 
fact that the cluster structure is known to the clients: a client has a direct 
connection to the OSD that hosts the object it is looking for. If we place any 
intermediaries between the client process and the OSD (like with the 
dict-proxy), the performance will suffer.

IMO the processes you mentioned should be reused to reduce the setup cost per 
session (or be implemented multi-threaded or asynchronously). I am aware of the 
fact that this might be a potential security risk.
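To make the cost argument concrete, a small sketch (client id "admin" and the 
default ceph.conf path are assumptions): the expensive part is the one-time 
cluster connect (monitor handshake, fetching the maps); a long-lived worker 
process pays it once and reuses the connected handle for every session it 
serves.

  #include <rados/librados.hpp>

  // Holds one connected cluster handle for the lifetime of the process,
  // so individual sessions do not pay the connection setup again.
  class RadosConnection {
   public:
    int ensure_connected() {
      if (connected_)
        return 0;
      int r = cluster_.init("admin");                 // client id: assumption
      if (r < 0)
        return r;
      cluster_.conf_read_file("/etc/ceph/ceph.conf"); // path: assumption
      r = cluster_.connect();                         // the expensive step
      if (r == 0)
        connected_ = true;
      return r;
    }
    librados::Rados &cluster() { return cluster_; }
   private:
    librados::Rados cluster_;
    bool connected_ = false;
  };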

Right now we do not know the price of the connection setup in a real cluster 
in a Dovecot context. We are curious about the results of the tests with 
Danny's cluster and will change the design of the software if necessary to get 
the best results out of it.

Best regards

Peter
