Re: What would it take for FastMail to run murder

2015-03-18 Thread Jeroen van Meeuwen (Kolab Systems)

On 2015-03-18 01:51, Bron Gondwana wrote:
On Wed, Mar 18, 2015, at 09:00 AM, Jeroen van Meeuwen (Kolab Systems) 
wrote:

We promote a standby frontend not otherwise used, to become the new
mupdate server. The interruption is a matter of seconds this way,
unless of course you're in the typical stalemate.


Hmm so maybe it's affordable.  It scales up with number-of-servers
as well though.  Making sure it's up to date costs at least O(number of
backends).



I suppose in your specific case, which I'm not at all too familiar with, 
perhaps enhancing murder/mupdate to allow cascading and/or (geo-based) 
replication and/or multi-master would serve your deployment yet even 
better?


I'm suggesting so because I would be concerned with the round-trip times 
between datacenters if there were only one mupdate master across all -- 
and perhaps the replicas are faster in issuing the cmd_set() than the 
mupdate master is(?).



 Interesting.  Does it also handle the case where the same mailbox
 gets accidentally created on two servers which aren't replica pairs
 though? Or do you get a mailbox fork?


The race condition is not addressed with it, like it is not addressed
currently.


I'm not 100% happy living with unaddressed race conditions.  Addressing
this would be an important part of making FastMail happy to run it.



Ghe, neither am I, but c'est la vie.

That said, in ~5 years and dozens of murder deployments, I have yet to 
encounter a situation or even a support case in which one mailbox is -- 
accidentally or otherwise -- created in two locations without the second 
failing / being rejected.



It solely makes the MUPDATE server not reject the reservation
request from a server that uses the same servername if it already
has an entry for the same servername!partition, so that the
replica successfully creates the local copy -- after which
replication is happy.


Yeah, that makes sense.  Of course, the backend should probably not be
reserving so much.  There are two things conflated here:

1) I'm running cmd_create in an IMAPd and I want to see if this folder
   already exists.

2) I'm a replica backend getting a copy of an existing folder (or
   indeed, a backend which already has a folder) and I'm informing
   mupdate of the fact.

Those two should be treated differently.  The first is does this
already exist, which is a legitimate question to ask.  The second
should always succeed. MUPDATE is a representation of facts, and the
backends are the masters of those facts.



With two-way replication safety however (and in your case, channelled as 
well, right?), which end of the replication (just in case things end up 
load-balanced across replicas?) gets to submit the original cmd_set() is 
up in the air, no?



So this would build a scenario in which:

   pair-1-replica-1.example.org and pair-1-replica-2.example.org
   present themselves as pair-1.example.org

   A DNS IN A RR is created for the fail-over address(es) for pair-
   1.example.org and attached to whichever replica in the pair is
   considered the active node.

Both replicas would be configured to replicate to one another, which
works in a PoC scenario but may seem to require lmtpd/AF_INET
delivery.


So they both have the same server name in mupdate.



Yes, and frontends proxy the connections for mailboxes on the backend to 
the same fake server address.



My plan is that they have different server names in mupdate.  There's a
separate channel somehow to say which is the primary out of those
servers, which can be switched however (failover tooling) based on which
servers are up, but the murder has the facts about where the mailbox
really exists.

It may even have statuscache.  Man, how awesome would distributed
statuscache be.

So there are multiple records for the same mailbox, with different server
names, in the murder DB.



Would this not open back up a route to entertaining a variety of race 
conditions (that would need to be addressed somehow) though?


Should then one of the duplicate mailboxes be marked as the primary?

A scenario that comes up often is the geographically close-yet-distant 
secondary site for disaster recovery, where a set of backends on the 
primary site replicate to a set of backends on the secondary site. While 
initially this succeeds perfectly fine, and the backends on the 
secondary site can participate in a local-to-site murder, transferring 
mailboxes from one backend to another on the primary site will fail to 
replicate to the secondary site's backends (because of their 
participation in the murder).


This is in part because it is not the XFER being replicated as such, but 
the target backend's CREATE/cmd_set(), which will fail because the 
mailbox already resides on another backend.


I suppose a scenario in which the mupdate master is in fact able to hold 
multiple records for the same mailbox might also allow us to overcome 
this conundrum?



Would using shared memory address the in-memory problem? 

Re: What would it take for FastMail to run murder

2015-03-18 Thread Bron Gondwana
On Wed, Mar 18, 2015, at 09:49 PM, Jeroen van Meeuwen (Kolab Systems) wrote:
 On 2015-03-18 01:51, Bron Gondwana wrote:
  On Wed, Mar 18, 2015, at 09:00 AM, Jeroen van Meeuwen (Kolab Systems) 
  wrote:
  We promote a standby frontend not otherwise used, to become the new
  mupdate server. The interruption is a matter of seconds this way,
  unless of course you're in the typical stalemate.
 
  Hmm so maybe it's affordable.  It scales up with number-of-
  servers as well though.  Making sure it's up to date costs at least
  O(number of backends).
 

 I suppose in your specific case, which I'm not at all too familiar
 with, perhaps enhancing murder/mupdate to allow cascading and/or (geo-
 based) replication and/or multi-master would serve your deployment yet
 even better?

Hmm, yeah - geo updates and mailboxes.db changes.  I'm not super-
concerned that it's a slightly slow path - they are rare.  Might suck if
you're making a ton of changes all at once - but that should be OK too -
just make all the changes locally and then blat the whole lot in a
single transaction to the murder DB.

Or hell, make it eventually consistent.  All you need is a zookeeper-
style way to anoint one server as the owner of each fragment of
namespace.  So you can only create a new user's mailbox in one place at
once, and then every user can only create mailboxes on their home
server.  Stop clashes from ever forming that way.

There are safe ways to do this that aren't a single mupdate master
(which already sucks when you're geographically distributed, I'm sure -
ask CMU, I'm pretty sure they are running it globally).
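
A rough sketch of that ownership idea, assuming nothing more than a
consistent store with an atomic create-if-absent; all names below are
invented for illustration and none of this is existing Cyrus code:

    # Illustrative only: MemoryStore stands in for a zookeeper/etcd/consul-style
    # consistent store; the one primitive needed is atomic create-if-absent.
    class MemoryStore:
        def __init__(self):
            self.data = {}

        def get(self, key):
            return self.data.get(key)

        def create_if_absent(self, key, value):
            if key in self.data:
                return False
            self.data[key] = value
            return True

    def create_mailbox(store, nameroot, mailbox, local_backend):
        # Anoint exactly one backend as owner of this fragment of namespace
        # (e.g. nameroot = "user.brong"); only the first claim ever succeeds,
        # so two backends can never both create the same user's mailboxes.
        key = "ns-owner/" + nameroot
        owner = store.get(key)
        if owner is None:
            if store.create_if_absent(key, local_backend):
                owner = local_backend
            else:
                owner = store.get(key)      # lost the race to another backend
        if owner != local_backend:
            raise RuntimeError("refer the create of %s to %s" % (mailbox, owner))
        print("creating %s locally on %s" % (mailbox, local_backend))

    store = MemoryStore()
    create_mailbox(store, "user.brong", "user.brong", "imap1.example.com")
    create_mailbox(store, "user.brong", "user.brong.Sent", "imap1.example.com")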

  I'm not 100% happy living with unaddressed race conditions.
  Addressing this would be an important part of making FastMail happy
  to run it.
 

 Ghe, neither am I, but c'est la vie.

 That said, in ~5 years and dozens of murder deployments, I have yet to
 encounter a situation or even a support case in which one mailbox is
 -- accidentally or otherwise -- created in two locations without the
 second failing / being rejected.

Yeah, it's a rare case because normal users can't do it, and at least in
our setup, the user creation itself is brokered through a singleton
database, and the location to create the user is calculated then.
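
As a very rough sketch of that kind of brokered placement (the schema,
the least-loaded fallback, and all names are invented for illustration,
not the real FastMail logic):

    # Sketch of brokered user creation: users in an existing business land on
    # that business's backend; everyone else goes to the least-loaded slot.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE business (id INTEGER PRIMARY KEY, server TEXT);
        CREATE TABLE slot (server TEXT PRIMARY KEY, user_count INTEGER);
    """)
    db.executemany("INSERT INTO slot VALUES (?, ?)",
                   [("imap1.example.com", 900), ("imap2.example.com", 250)])
    db.execute("INSERT INTO business VALUES (1, 'imap1.example.com')")

    def creation_server(business_id=None):
        # Existing business: keep every user of that business on one backend.
        if business_id is not None:
            row = db.execute("SELECT server FROM business WHERE id = ?",
                             (business_id,)).fetchone()
            if row:
                return row[0]
        # Otherwise pick the least-loaded slot.
        return db.execute("SELECT server FROM slot "
                          "ORDER BY user_count LIMIT 1").fetchone()[0]

    print(creation_server(business_id=1))   # imap1.example.com
    print(creation_server())                # imap2.example.com
    # The hourly rebalance job would then look for users whose current server
    # differs from creation_server(their business) and issue the moves.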

  1) I'm running cmd_create in an IMAPd and I want to see if this
 folder already exists.
 
  2) I'm a replica backend getting a copy of an existing folder (or
 indeed, a backend which already has a folder) and I'm informing
 mupdate of the fact.
 
  Those two should be treated differently.  The first is does this
  already exist, which is a legitimate question to ask.  The second
  should always succeed. MUPDATE is a representation of facts, and the
  backends are the masters of those facts.
 

 With two-way replication safety however (and in your case, channelled
 as well, right?), which end of the replication (just in case things
 end up load-balanced across replicas?) gets to submit the original
 cmd_set() is up in the air, no?

Er, not really.  Worst case they both do and you resolve them, RIAK
style, when they discover each other.

  So they both have the same server name in mupdate.
 

 Yes, and frontends proxy the connections for mailboxes on the backend
 to the same fake server address.

Yeah, of course.  We used to do this with FastMail - failover IP - but
it doesn't work across datacentres, so instead we have a source of truth
(a DB backed daemon for now, but consul soon) which says where the
master is right now, and nginx just connects directly to the slot IP for
the master end - so we can proxy to a different datacentre
transparently.
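
For reference, nginx's mail proxy asks an auth_http endpoint where to
proxy each connection and expects Auth-Status/Auth-Server/Auth-Port
headers back. A minimal sketch of such an endpoint, with the source of
truth reduced to a dict (standing in for the DB-backed daemon or consul)
and password checking left out:

    # Toy auth_http responder for nginx's mail proxy.  Real deployments must
    # also validate Auth-User/Auth-Pass; this only shows the routing decision.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # source of truth: which backend currently holds the master for each user
    MASTER_FOR_USER = {"brong": ("10.0.1.21", 1143)}   # invented example data

    class AuthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            user = self.headers.get("Auth-User", "")
            target = MASTER_FOR_USER.get(user)
            self.send_response(200)
            if target:
                ip, port = target
                self.send_header("Auth-Status", "OK")
                self.send_header("Auth-Server", ip)       # nginx proxies here,
                self.send_header("Auth-Port", str(port))  # possibly cross-DC
            else:
                self.send_header("Auth-Status", "Invalid login or password")
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8000), AuthHandler).serve_forever()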

  It may even have statuscache.  Man, how awesome would distributed
  statuscache be.
 
  So there are multiple records for the same mailbox, with different
  server names, in the murder DB.
 

 Would this not open back up a route to entertaining a variety of race
 conditions (that would need to be addressed somehow) though?

Not really - because writes are always sourced from the backend you are
connected to.  What it COULD create, in theory, is stale reads - but
only stale in the way that it would have been if you'd done the same
read a second ago.  IMAP makes no guarantees about parallel connections.

 Should then one of the duplicate mailboxes be marked as the primary?

Of course.  But not in mailboxes.db itself; separately, with either a
per-server scoping or a per-user/nameroot scoping.  There are arguments
for both: per nameroot is a lot more data, particularly to update in a
failover case, but it also allows you to do really amazing stuff like
have per-user replicas configured directly with annotations or
something - such that any one user can be moved to a set of machines
within the murder and there's no need to actually define pairs of
machines at all.

I'd almost certainly have an external process monitor that though:
monitor disk usage, user size, user locations, etc - and rebalance users
by issuing the correct commands 
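
A sketch of that facts-versus-policy split, with the primary scoped per
user/nameroot; all data, names, and the nameroot rule are invented for
illustration:

    # The murder DB records facts: every backend that really holds a copy.
    mailbox_locations = {
        "user.jane": {"imap1a.example.com", "imap1b.example.com"},
        "user.bob":  {"imap2a.example.com"},
    }

    # A separate, much smaller policy mapping says which copy is primary; it
    # is cheap to flip on failover, and per-user scoping means each user can
    # have their own replica set rather than fixed pairs of machines.
    primary_for_nameroot = {
        "user.jane": "imap1a.example.com",
        "user.bob":  "imap2a.example.com",
    }

    def nameroot_of(mailbox):
        # "user.jane.Sent" -> "user.jane" (simplified; ignores domains etc.)
        return ".".join(mailbox.split(".")[:2])

    def backend_for(mailbox):
        nameroot = nameroot_of(mailbox)
        copies = mailbox_locations.get(nameroot, set())
        primary = primary_for_nameroot.get(nameroot)
        if primary in copies:
            return primary
        # failover: any surviving copy will do until the policy is updated
        return next(iter(copies), None)

    print(backend_for("user.jane.Sent"))    # imap1a.example.com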

Re: What would it take for FastMail to run murder

2015-03-17 Thread Jeroen van Meeuwen (Kolab Systems)

On 2015-03-14 22:48, Bron Gondwana wrote:
On Sun, Mar 15, 2015, at 07:18 AM, Jeroen van Meeuwen (Kolab Systems) 
wrote:

How, though, do you ensure that a mailbox for a new user in such
business is created on the same backend as all the other users of said
business?


If the business already exists, the create user code will fetch the
server name from the business database table and make that the creation
server.

There's a cron job which runs every hour and looks for users who aren't
on the right server, so if we import a user to the business, they get
moved.




Right -- so you seem to agree that one business is limited to one 
backend server, which is precisely what the larger businesses that are 
our customers need to work around, when the number of mailboxes is 
typically tens of thousands, and the mechanism you describe stops 
working.



There's one particular "problem" with using NGINX as the IMAP proxy --
it requires an external service that responds with the address to
proxy to.


T108


I say "problem" in quotes to emphasize I use the term "problem" very
loosely -- whether it be a functioning backend+mupdate+frontend or a
functioning backend+mupdate+frontend+nginx+service is a rather futile
distinction, relatively speaking.


Sure, but backend+distributed mailbox service+nginx would be a much
simpler setup.



Yes, T108 here ;-)

I don't understand how this is an established problem already -- or not
as much as I probably should. If 72k users can be happy on a murder
topology, surely 4 times as many could also be happy -- inefficiencies
notwithstanding, they're only a vertical scaling limitation.


"happy" is a relative term. You can get most of the benefit from using
foolstupidclients, but otherwise you're paying O(N) for the number of
users - and taking 4 times as long to do every list command is not ideal.




Sure -- the majority of the processing delays seem to lie on the client
side, taking off the wire what is being dumped on it, however.


You're far better entitled to speak to what is in a mailboxes.db and/or
its in-memory representation by the time you get to scanning the
complete list for items to which a user might have access; I just have
to say we've not found this particular part to be as problematic for
tens of thousands of users (yet).



That said, of course I understand it has its upper limit, but getting
updated lookup tables in-memory pushed there when an update happens
would seem to resolve the problem, no?


Solving the problem is having some kind of index/lookup table indeed.
Whether this is done all in-memory by some sort of LIST service which
scans the mailboxes.db at startup time and then gets updates from 
mupdate.




For frontends specifically (discrete murder), we're able to use tmpfs 
for mailboxes.db (and some other stuff of course) solving a bit of the 
I/O constraints, but it's still a list of folders with parameters 
containing whether the user has access, and what I meant was perhaps the 
list can (in addition) be inverted to be a list of users with folders 
(and rights?).
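
A sketch of that inversion, with mailboxes.db reduced to a toy dict and
the Cyrus ACL encoding simplified to identifier-to-rights mappings (the
rights strings below are just examples):

    # Invert mailbox -> ACL into identifier -> mailboxes, so that LIST for one
    # user touches only that user's (and "anyone"'s) entries, not every row.
    from collections import defaultdict

    mailboxes = {
        "user.jane":        {"jane": "lrswipkxtecda"},
        "user.jane.Sent":   {"jane": "lrswipkxtecda"},
        "shared.announce":  {"anyone": "lrs", "jane": "lrswi"},
    }

    def build_reverse_index(mailboxes):
        index = defaultdict(dict)           # identifier -> {mailbox: rights}
        for mbox, acl in mailboxes.items():
            for ident, rights in acl.items():
                index[ident][mbox] = rights
        return index

    index = build_reverse_index(mailboxes)

    visible = dict(index["anyone"])
    visible.update(index["jane"])
    print(sorted(visible))                  # jane's LIST, without a table scan
    # A hook on mupdate/replication updates would keep both views in step.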


This is not necessarily what a failed mupdate server does though -- new
folders and folder renames (includes deletions!) and folder transfers
won't work, but the cluster remains functional under both the
SMTP-to-backend and LMTP-proxy-via-frontend topology -- autocreate for
Sieve fileinto notwithstanding, and mailbox hierarchies distributed over
multiple backends when also using the SMTP-to-backend topology
notwithstanding.


Yeah, until you start up the mupdate server again or configure a new one.
Again, you get user visible failures (folder create, etc) while the
server is down.  The reason I want to shave off all these edge cases is
that in a big enough system over a long enough time, you will hit every
one of them.




We promote a standby frontend not otherwise used, to become the new 
mupdate server. The interruption is a matter of seconds this way, unless 
of course you're in the typical stalemate.



 Thankfully, the state of the art in distributed databases has moved a
 long way since mupdate was written.

I have also written a one-or-two line patch that enables backends that
replicate, to both be a part of the same murder topology, to prevent the
replica slave from bailing out on the initial creation of a mailbox --
consulting mupdate and finding that it would already exist.


Interesting.  Does it also handle the case where the same mailbox gets
accidentally created on two servers which aren't replica pairs though?
Or do you get a mailbox fork?



The race condition is not addressed with it, like it is not addressed 
currently.


It solely makes the MUPDATE server not reject the reservation request 
from a server that uses the same servername if it already has an entry 
for the same servername!partition, so that the replica successfully 
creates the local copy -- after which replication is happy.


So this would build a scenario in which:

  

Re: What would it take for FastMail to run murder

2015-03-17 Thread Bron Gondwana
On Wed, Mar 18, 2015, at 09:00 AM, Jeroen van Meeuwen (Kolab Systems) wrote:
 On 2015-03-14 22:48, Bron Gondwana wrote:
  On Sun, Mar 15, 2015, at 07:18 AM, Jeroen van Meeuwen (Kolab Systems) 
  wrote:
  How, though, do you ensure that a mailbox for a new user in such
  business is created on the same backend as all the other users of
  said business?
 
  If the business already exists, the create user code will fetch the
  server name from the business database table and make that the
  creation server.
 
  There's a cron job which runs every hour and looks for users who
  aren't on the right server, so if we import a user to the business,
  they get moved.
 

 Right -- so you seem to agree that one business is limited to one
 backend server, which is precisely what the larger businesses that
 are our customers need to work around, when the number of mailboxes is
 typically tens of thousands, and the mechanism you describe stops
 working.

Exactly. It's a limit that we want to avoid, hence looking for a
murder-that-scales.

  "happy" is a relative term. You can get most of the benefit from
  using foolstupidclients, but otherwise you're paying O(N) for the
  number of users - and taking 4 times as long to do every list
  command is not ideal.

 Sure -- the majority of the processing delays seem to lie on the
 client side, taking off the wire what is being dumped on it, however.

With over a million mailboxes in a single mailboxes.db I was seeing
parsing cost go up, particularly with DLIST.  I've written a dlist_sax
interface, which cuts out some of the cost, but it's still not free.

The easiest way to make things more efficient is not do them at all ;)

 You're far better entitled to speak to what is in a mailboxes.db
 and/or its in-memory representation by the time you get to scanning
 the complete list for items to which a user might have access; I just
 have to say we've not found this particular part to be as problematic
 for tens of thousands of users (yet).

It's going to hurt when you get to millions.  That's our issue.  If we
merged all the mailboxes.db across all our servers into one place,
that's a huge database.

 For frontends specifically (discrete murder), we're able to use
 tmpfs for mailboxes.db (and some other stuff of course) solving a
 bit of the I/O constraints, but it's still a list of folders with
 parameters containing whether the user has access, and what I meant was
 perhaps the list can (in addition) be inverted to be a list of users
 with folders (and rights?).

That's pretty much exactly the idea.  That and avoiding the SPOF that's
a murder master right now.  They're kind of separate goals, we could do
one without the other.

 We promote a standby frontend not otherwise used, to become the new
 mupdate server. The interruption is a matter of seconds this way,
 unless of course you're in the typical stalemate.

Hmm so maybe it's affordable.  It scales up with number-of-servers
as well though.  Making sure it's up to date costs at least O(number of
backends).

  Interesting.  Does it also handle the case where the same mailbox
  gets accidentally created on two servers which aren't replica pairs
  though? Or do you get a mailbox fork?
 

 The race condition is not addressed with it, like it is not addressed
 currently.

I'm not 100% happy living with unaddressed race conditions.  Addressing
this would be an important part of making FastMail happy to run it.

 It solely makes the MUPDATE server not reject the reservation
 request from a server that uses the same servername if it already
 has an entry for the same servername!partition, so that the
 replica successfully creates the local copy -- after which
 replication is happy.

Yeah, that makes sense.  Of course, the backend should probably not be
reserving so much.  There are two things conflated here:

1) I'm running cmd_create in an IMAPd and I want to see if this folder
   already exists.

2) I'm a replica backend getting a copy of an existing folder (or
   indeed, a backend which already has a folder) and I'm informing
   mupdate of the fact.

Those two should be treated differently.  The first is does this
already exist, which is a legitimate question to ask.  The second
should always succeed. MUPDATE is a representation of facts, and the
backends are the masters of those facts.
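
A sketch of that distinction, with invented names ("reserve" for the
cmd_create path, "assert_fact" for a backend stating what it already
holds); this is not the actual MUPDATE command handling, just its shape:

    # mailbox -> set of "server!partition" locations known to the murder
    records = {}

    def reserve(mailbox, location):
        # cmd_create path: "does this already exist?" is a legitimate question,
        # so a name held by some other location is rejected.
        existing = records.get(mailbox)
        if existing and location not in existing:
            return False
        records.setdefault(mailbox, set()).add(location)
        return True

    def assert_fact(mailbox, location):
        # A backend (or replica) reporting a mailbox it already has: the
        # backends are the masters of the facts, so this always succeeds.
        records.setdefault(mailbox, set()).add(location)
        return True

    assert_fact("user.jane", "pair-1a!default")     # replica pair: both fine
    assert_fact("user.jane", "pair-1b!default")
    print(reserve("user.jane", "imap9!default"))    # fresh create elsewhere: False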

 So this would build a scenario in which:

pair-1-replica-1.example.org and pair-1-replica-2.example.org
present themselves as pair-1.example.org

A DNS IN A RR is created for the fail-over address(es) for pair-
1.example.org and attached to whichever replica in the pair is
considered the active node.

 Both replicas would be configured to replicate to one another, which
 works in a PoC scenario but may seem to require lmtpd/AF_INET
 delivery.

So they both have the same server name in mupdate.

My plan is that they have different server names in mupdate.  There's a
separate channel somehow to say which is the primary 

Re: What would it take for FastMail to run murder

2015-03-14 Thread Jeroen van Meeuwen (Kolab Systems)

On 2015-03-13 23:50, Bron Gondwana wrote:

So I've been doing a lot of thinking about Cyrus clustering, with the
underlying question being what would it take to make FastMail run a
murder.  We've written a fair bit about our infrastructure - we use
nginx as a frontend proxy to direct traffic to backend servers, and have
no interdependencies between the backends, so that we can scale
indefinitely.  With murder as it exists now, we would be pushing the
limits of the system already - particularly with the globally
distributed datacentres.

Why would FastMail consider running murder, given our existing
nice system?

a) we support folder sharing within businesses, so at the moment we are
   limited by the size of a single slot.  Some businesses already push
   that limit.



How, though, do you ensure that a mailbox for a new user in such 
business is created on the same backend as all the other users of said 
business?



Here are our deal-breaker requirements:

1) unified murder - we don't want to run both a frontend AND a backend
   imapd process  for every single connection.  We already have nginx,
   which is non-blocking, for the initial connection and auth handling.



There's one particular "problem" with using NGINX as the IMAP proxy --
it requires an external service that responds with the address to
proxy to.


I say "problem" in quotes to emphasize I use the term "problem" very
loosely -- whether it be a functioning backend+mupdate+frontend or a
functioning backend+mupdate+frontend+nginx+service is a rather futile
distinction, relatively speaking.



2) no table scans - anything that requires a parse and ACL lookup for
   every single row of mailboxes.db is going to be a non-starter when
   you multiply the existing mailboxes.db size by hundreds.



I don't understand how this is an established problem already -- or not 
as much as I probably should. If 72k users can be happy on a murder 
topology, surely 4 times as many could also be happy -- inefficiencies 
notwithstanding, they're only a vertical scaling limitation.


That said, of course I understand it has its upper limit, but getting 
updated lookup tables in-memory pushed there when an update happens 
would seem to resolve the problem, no?


3) no single-point-of-failure - having one mupdate master which can stop
   the entire cluster working if it's offline, no thanks.



This is not necessarily what a failed mupdate server does though -- new 
folders and folder renames (includes deletions!) and folder transfers 
won't work, but the cluster remains functional under both the 
SMTP-to-backend and LMTP-proxy-via-frontend topology -- autocreate for 
Sieve fileinto notwithstanding, and mailbox hierarchies distributed over 
multiple backends when also using the SMTP-to-backend topology 
notwithstanding.



Thankfully, the state of the art in distributed databases has moved a
long way since mupdate was written.


I have also written a one-or-two line patch that enables backends that 
replicate, to both be a part of the same murder topology, to prevent the 
replica slave from bailing out on the initial creation of a mailbox -- 
consulting mupdate and finding that it would already exist.


Along with this, we need a reverse lookup for ACLs, so that any one user
doesn't ever need to scan the entire mailboxes.db.  This might be hooked
into the distributed DB as well, or calculated locally on each node.



I reckon this may be the "rebuild more efficient lookup trees in-memory"
idea I may have referred to just now, just not in so many words.



And that's pretty much it.  There are some interesting factors around
replication, and I suspect the answer here is to have either multi-
value support or embed the backend name into the mailboxes.db key
(postfix) such that you wind up listing the same mailbox multiple
times.


In a scenario where only one backend is considered active for the 
given (set of) mailbox(es), and the other is passive, this has been 
more of a one-line patch in mupdate plus the proper infrastructure in 
DNS/keepalived type of failover service IP addresses than it has been 
about allowing duplicates and suppressing them.


Kind regards,

Jeroen van Meeuwen

--
Systems Architect, Kolab Systems AG

e: vanmeeuwen at kolabsys.com
m: +41 79 951 9003
w: https://kolabsystems.com

pgp: 9342 BF08


RE: What would it take for FastMail to run murder

2015-03-14 Thread Jeroen van Meeuwen (Kolab Systems)

On 2015-03-13 23:54, Dave McMurtrie wrote:
From my phone, so excuse brevity and top-posting, but Fastmail running 
murder would be a huge bonus.  I not-so-fondly recall the intimate 
relationship I developed with gdb debugging murder issues when we 
upgraded from 2.3 to 2.4 :)




You won't have to for 2.5 (as much) because we're running it at 
supported customer sites, and I'm to blame for the alleged fixes ;-)


Kind regards,

Jeroen van Meeuwen

--
Systems Architect, Kolab Systems AG

e: vanmeeuwen at kolabsys.com
m: +41 79 951 9003
w: https://kolabsystems.com

pgp: 9342 BF08


Re: What would it take for FastMail to run murder

2015-03-14 Thread Vladislav Bogdanov

14.03.2015 01:50, Bron Gondwana wrote:

So I've been doing a lot of thinking about Cyrus clustering, with the
underlying question being what would it take to make FastMail run a
murder.  We've written a fair bit about our infrastructure - we use
nginx as a frontend proxy to direct traffic to backend servers, and have
no interdependencies between the backends, so that we can scale
indefinitely.  With murder as it exists now, we would be pushing the
limits of the system already - particularly with the globally
distributed datacentres.


Btw (as you speak about clusters), I developed a proof-of-concept for a
cyrus-imapd cluster a long time ago using pacemaker as the cluster
resource manager. A lot has happened in Linux clustering since then,
including remote-node support in pacemaker, so that concept could be
reworked to be even more robust and scalable. The only thing I did not
like at the time is that cyrus replication was a bit weak at detecting
changes after a rolling multi-node failure (node1 goes down, node2 takes
over the replica, node2 goes down, node1 comes back up, and changes made
to node2 while node1 was down are lost). Please drop me a note (or just
post here, as I'm a long-time silent reader) if you're interested in
making cyrus-imapd rock-solid from the ha-clustering perspective and need
some guidance there; I'll share more details.


Best,
Vladislav




Re: What would it take for FastMail to run murder

2015-03-13 Thread Bron Gondwana

For sure :)

Just having testing infrastructure that tests murder would go a long way
to avoiding that mess again.

The more I think about it, the more having the SAME mailboxes.db for
both local and remote data doesn't make sense. We should have a separate
central database that the mupdate_activate, etc write to. It can just be
a standalone SQL database, or a cluster database, or who cares... the
main thing is that only a few of the MBOXLIST commands need to care
(because they will return the remote information if needed).

Bron.


On Sat, Mar 14, 2015, at 09:54 AM, Dave McMurtrie wrote:
 From my phone, so excuse brevity and top-posting, but Fastmail running
 murder would be a huge bonus. I not-so-fondly recall the intimate
 relationship I developed with gdb debugging murder issues when we
 upgraded from 2.3 to 2.4 :)


 Sent via the Samsung GALAXY S® 5, an AT&T 4G LTE smartphone




 Original message 
From: Bron Gondwana br...@fastmail.fm
Date: 03/13/2015 6:50 PM (GMT-05:00)
To: Cyrus Devel cyrus-devel@lists.andrew.cmu.edu
Cc:
Subject: What would it take for FastMail to run murder

So I've been doing a lot of thinking about Cyrus clustering, with the
underlying question being what would it take to make FastMail run a
murder.  We've written a fair bit about our infrastructure - we use
nginx as a frontend proxy to direct traffic to backend servers, and have
no interdependencies between the backends, so that we can scale
indefinitely.  With murder as it exists now, we would be pushing the
limits of the system already - particularly with the globally
distributed datacentres.

Why would FastMail consider running murder, given our existing
nice system?

a) we support folder sharing within businesses, so at the moment we are
   limited by the size of a single slot.  Some businesses already push
   that limit.
b) it's good to dogfood the server we put so much work into.

Here are our deal-breaker requirements:

1) unified murder - we don't want to run both a frontend AND a backend
   imapd process for every single connection.  We already have nginx,
   which is non-blocking, for the initial connection and auth handling.
2) no table scans - anything that requires a parse and ACL lookup for
   every single row of mailboxes.db is going to be a non-starter when
   you multiply the existing mailboxes.db size by hundreds.
3) no single-point-of-failure - having one mupdate master which can stop
   the entire cluster working if it's offline, no thanks.

Thankfully, the state of the art in distributed databases has moved a
long way since mupdate was written.  We'd have to at least change the
mupdate protocol anyway to handle newly added fields, so why not just do
away with it and have every server run a local node of a distributed
database protocol for its mailboxes.db.

Along with this, we need a reverse lookup for ACLs, so that any one user
doesn't ever need to scan the entire mailboxes.db.  This might be hooked
into the distributed DB as well, or calculated locally on each node.

And that's pretty much it.  There are some interesting factors around
replication, and I suspect the answer here is to have either multi-value
support or embed the backend name into the mailboxes.db key (postfix)
such that you wind up listing the same mailbox multiple times.  We
already suppress duplicates in the LIST command, so all we need then is
logic for choosing the actual master.  Rob N has done some work with
consul and etcd already at FastMail, and we would use either that or a
flag in the distributed DB to drive master choice for backend connection
purposes.

There are a bunch of nice to haves on top of this, but I think this
would be enough for us to convert our existing standalone servers over
to a murder.

Bron.

--
Bron Gondwana
br...@fastmail.fm

--
Bron Gondwana br...@fastmail.fm




RE: What would it take for FastMail to run murder

2015-03-13 Thread Dave McMurtrie
From my phone, so excuse brevity and top-posting, but Fastmail running murder 
would be a huge bonus.  I not-so-fondly recall the intimate relationship I 
developed with gdb debugging murder issues when we upgraded from 2.3 to 2.4 :)


Sent via the Samsung GALAXY S® 5, an AT&T 4G LTE smartphone


 Original message 
From: Bron Gondwana br...@fastmail.fm
Date:03/13/2015 6:50 PM (GMT-05:00)
To: Cyrus Devel cyrus-devel@lists.andrew.cmu.edu
Cc:
Subject: What would it take for FastMail to run murder

So I've been doing a lot of thinking about Cyrus clustering, with the
underlying question being what would it take to make FastMail run a
murder.  We've written a fair bit about our infrastructure - we use
nginx as a frontend proxy to direct traffic to backend servers, and have
no interdependencies between the backends, so that we can scale
indefinitely.  With murder as it exists now, we would be pushing the
limits of the system already - particularly with the globally
distributed datacentres.

Why would FastMail consider running murder, given our existing
nice system?

a) we support folder sharing within businesses, so at the moment we are
   limited by the size of a single slot.  Some businesses already push
   that limit.
b) it's good to dogfood the server we put so much work into.

Here are our deal-breaker requirements:

1) unified murder - we don't want to run both a frontend AND a backend
   imapd process  for every single connection.  We already have nginx,
   which is non-blocking, for the initial connection and auth handling.
2) no table scans - anything that requires a parse and ACL lookup for
   every single row of mailboxes.db is going to be a non-starter when
   you multiply the existing mailboxes.db size by hundreds.
3) no single-point-of-failure - having one mupdate master which can stop
   the entire cluster working if it's offline, no thanks.

Thankfully, the state of the art in distributed databases has moved a
long way since mupdate was written.  We'd have to at least change the
mupdate protocol anyway to handle newly added fields, so why not just do
away with it and have every server run a local node of a distributed
database protocol for its mailboxes.db.

Along with this, we need a reverse lookup for ACLs, so that any one user
doesn't ever need to scan the entire mailboxes.db.  This might be hooked
into the distributed DB as well, or calculated locally on each node.

And that's pretty much it.  There are some interesting factors around
replication, and I suspect the answer here is to have either multi-
value support or embed the backend name into the mailboxes.db key
(postfix) such that you wind up listing the same mailbox multiple
times. We already suppress duplicates in the LIST command, so all we
need then is logic for choosing the actual master.  Rob N has done some
work with consul and etcd already at FastMail, and we would use either
that or a flag in the distributed DB to drive master choice for backend
connection purposes.
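
A sketch of that multi-record layout, here modelling the backend-name
postfix as a (mailbox, backend) key pair with a master flag per record
(key shape and flag are illustrative only):

    # One record per (mailbox, backend) that holds a copy, plus a master flag.
    mailboxes_db = {
        ("user.jane", "imap1a.example.com"): {"master": True},
        ("user.jane", "imap1b.example.com"): {"master": False},
        ("user.bob",  "imap2a.example.com"): {"master": True},
    }

    def list_mailboxes(db):
        # LIST view: suppress duplicates so each mailbox shows up once.
        return sorted({name for name, _backend in db})

    def master_backend(db, mailbox):
        # Connection routing: pick whichever copy is currently flagged master;
        # the flag is what consul/etcd (or a distributed DB field) would drive.
        for (name, backend), info in db.items():
            if name == mailbox and info["master"]:
                return backend
        return None

    print(list_mailboxes(mailboxes_db))               # ['user.bob', 'user.jane']
    print(master_backend(mailboxes_db, "user.jane"))  # imap1a.example.com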

There are a bunch of nice to haves on top of this, but I think this
would be enough for us to convert our existing standalone servers over
to a murder.

Bron.

--
  Bron Gondwana
  br...@fastmail.fm