Re: [Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem

Ed W Wed, 12 Aug 2009 09:47:06 -0700

1) Since latency requirements are low, why did performance drop so much previously when you implemented a very simple MySQL storage backend? I glanced at the code a few weeks ago and whilst it's surprisingly complicated right now to implement a backend, I was also surprised that a database storage engine "sucked", as I think you phrased it? Possibly the code also placed the indexes on the DB? Certainly this could very well kill performance? (Note I'm not arguing SQL storage is a good thing, I just want to understand the backend latency requirements)
Yes, it placed indexes also to SQL. That's slow. But even without it, Dovecot code needs to be changed to access more mails in parallel before the performance can be good for high-latency mail storages.

My expectation then is that with a local index and SQL message storage the performance should be very reasonable for a large class of users... (ok, other problems perhaps arise)

2) I would be thinking that with some care, even very high latency storage would be workable, eg S3/Gluster/MogileFS? I would love to see a backend using S3. If nothing else I think it would quickly highlight all the bottlenecks in any design...
Yes, S3 should be possible. With dbox it could even be used to store the old mails and keep new mails in lower latency storage.

Mogile doesn't handle S3, but I always thought it would be terrific to be able to have one copy of your data on fast local storage, but to be able to use slower (sometimes cheaper) storage for backups or less valuable data (eg older messages), ie replicating some data to other storage boxes

CouchDB seems like it would still be more difficult than necessary to scale. I'd really just want something that distributes the load and disk usage evenly across all servers and allows easily plugging in more servers and it automatically rebalances the load. With CouchDB it seems like much of that would have to be done manually (or by building scripts to do it).

Ahh fair enough - I thought it being massively multi-master would allow simply querying different machines for different users. Not a perfect scale-out, but good enough for a whole class of requirements...

For the filesystem backend have you looked at the various log structured filesystems appearing? Whenever I watch the debate between Maildir vs Mailbox I always think that a hybrid is the best solution because you are optimising for a write once, read many situation, where you have an increased probability of having good cache localisation on any given read.
To me this ends up looking like log structured storage... (which feels like a hybrid between maildir/mailbox)
Hmm. I don't really see how it looks like log structured storage. But you do know that multi-dbox is kind of a maildir/mbox hybrid, right?

Well the access is largely append only, with some deletes and noise at the writing end, but largely the older storage stays static with much longer gaps between deletes (and extremely infrequent edits)

So maildir is optimised really for deletes, but improves random access for a subset of operations. Mailbox is optimised for writes and seems like it's generally fast for most operations except deletes (people do worry about having a lot of eggs in one basket, but I think this is really a symptom of other problems at work). Mailbox also has improved packing for small messages and probably has improved cache locality on certain read patterns

So one obvious hybrid would be a mailbox type structure which perhaps splits messages up into variable sized sub mailboxes based on various criteria, perhaps including message age, type of message or message size...? The rapid write/delete would happen at the head, perhaps even as a maildir layout, and gradually the storage would become larger and ever more compressed mailboxes as the age/frequency of access/etc declines.


Perhaps this is exactly dbox?
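
To make the idea concrete, here is a minimal sketch of picking a storage tier by message age. The tier names and the 7/90 day thresholds are purely my assumptions for illustration, not anything dbox actually does:

```python
from datetime import datetime, timedelta

# Hypothetical tiers and age thresholds, chosen only for illustration.
TIERS = [
    (timedelta(days=7), "maildir-head"),    # fresh mail: one file each, cheap deletes
    (timedelta(days=90), "packed-recent"),  # medium sized packed files
]
ARCHIVE_TIER = "packed-archive"             # large, compressed, rarely touched


def pick_tier(received: datetime, now: datetime) -> str:
    """Return the name of the tier a message of this age would live in."""
    age = now - received
    for max_age, name in TIERS:
        if age < max_age:
            return name
    return ARCHIVE_TIER
```

A real layout would presumably also look at message size and access frequency, and would need a background job to migrate messages between tiers as they age.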

It would also be interesting to consider separating message headers from body content. Have heavy localisation of message headers, and slower, higher latency access to the message body. Would this improve access speeds in general? Also the mime structure could be torn apart to store attachments individually - the motivation being single instance storage of large attachments with identical content... Anyway, these seem like very speculative directions...
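
As a rough sketch of what single instance storage of attachments could look like: key each attachment by a hash of its content and keep a reference count. The class and method names are hypothetical, and an in-memory dict stands in for whatever the real backend would be:

```python
import hashlib


class AttachmentStore:
    """Content-addressed store: identical attachments are kept only once."""

    def __init__(self):
        self._blobs = {}  # digest -> attachment bytes
        self._refs = {}   # digest -> number of messages referencing it

    def put(self, data: bytes) -> str:
        """Store an attachment and return the key the message should record."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = data
        self._refs[digest] = self._refs.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def release(self, digest: str) -> None:
        """Drop one reference; the blob goes away with the last one."""
        self._refs[digest] -= 1
        if self._refs[digest] == 0:
            del self._refs[digest]
            del self._blobs[digest]

    def blob_count(self) -> int:
        return len(self._blobs)
```

The message itself would then only store the returned key in place of the attachment, so the same large file mailed to a hundred users costs one copy on disk.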

I haven't really done any explicit benchmarks, but there are a few reasons why I think low-latency for indexes is really important:

I think low latency for indexes is a given. You appear to have architected the system so that all responses are delivered from the index and, barring an increase in index efficiency, the remaining time is spent doing the initial generation and maintenance of those indexes. I would have thought that, bar downloading an entire INBOX, the access time of individual mails was very much secondary?

- If the goal is performance by allowing a scale-out in quantity of servers then I guess you need to measure it carefully to make sure it actually works? I haven't had the fortune to develop something that big, but the general advice is that scaling out is hard to get right, so assume you made a mistake in your design somewhere... Measure, measure
I don't think it's all that much about performance of a single user, but more about distributing the load more evenly in an easier way. That's basically done by outsourcing the problem to the underlying storage (database).

So perhaps something like CouchDB can work then? One user localises per replica and you keep reusing that replica?

Yes, resolving conflicts due to split brain merging back is something I really want to make work as well as it can. The backend database can hopefully again help here (by noticing there was a conflict and allowing the program to resolve it).

In general conflict resolution is thrown back to the application, so likely this is going to become a Dovecot problem. It seems that the general class of problem is too hard to solve at the storage side.

This is also one of its goals :) Even if I make a mistake in choosing a bad database first, it should be somewhat easy to implement another backend again. The backend FS API will be pretty simple. Basically it's going to be:

I wouldn't get too held back by posix semantics. For sure they are memorable, but definitely consider that transactions are the key to any kind of database performance improvement and make sure you can batch together stuff to make good use of the backend. Your "flush" command seems to be the implicit end of transaction, but I guess give it plenty of thought that you might have a super slow system (eg S3) and the backend might want the flexibility to mark something "kind of done", while uploading for 30 seconds in the background, then marking it properly done once S3 actually acks the data saved?
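
Something like the following is what I have in mind, as a sketch only. This is not the proposed Dovecot FS API; the class, the two state names and the `upload` callable are all made up for illustration, and the upload is synchronous here where a real S3 style backend would run it in the background:

```python
from enum import Enum


class State(Enum):
    STAGED = "staged"    # accepted locally, "kind of done"
    DURABLE = "durable"  # the backend has acknowledged the data


class BatchingBackend:
    """Collect writes and push them to a slow backend in one batch on flush."""

    def __init__(self, upload):
        # upload: hypothetical callable taking a dict of key -> bytes and
        # raising an exception if the backend did not acknowledge the batch.
        self._upload = upload
        self._pending = {}
        self.state = {}

    def write(self, key: str, data: bytes) -> None:
        self._pending[key] = data
        self.state[key] = State.STAGED

    def flush(self) -> int:
        """End of transaction: one round trip for everything written so far."""
        if not self._pending:
            return 0
        batch = dict(self._pending)
        self._upload(batch)  # on failure this raises and the batch stays STAGED
        for key in batch:
            self.state[key] = State.DURABLE
        self._pending.clear()
        return len(batch)
```

The point is that the caller can tell "staged" apart from "durable", so it can choose whether to answer the client straight away or wait for the slow backend to ack.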

- Finally I am a bit sad that offline distributed multi-master isn't in the roadmap anymore... :-(
I think dsync can do that. It'll do two-way syncing between Dovecots and resolve all conflicts. Is the syncing itself still done with very high latencies, i.e. something like USB sticks? That's currently not really working, but it probably wouldn't be too difficult.

What is dsync? There is a dsync.org which is some kind of directory synchroniser?

Aha, google suggests that I might have missed an email from you recently... Will read up...

OK, this sounds like a better implementation of the kind of thing we are building here - likely this is the way ahead!


Cheers

Ed W
