On 4/7/2012 9:43 AM, Emmanuel Noobadmin wrote:
On 4/7/12, Stan Hoeppner s...@hardwarefreak.com wrote:
Firstly, thanks for the comprehensive reply. :)
I'll assume networked storage nodes means NFS, not FC/iSCSI SAN, in
which case you'd have said SAN.
I haven't decided on that but it would either be NFS or iSCSI over
Gigabit. I don't exactly get a big budget for this. iSCSI because I
planned to do md/mpath over two separate switches so that if one
switch explodes, the email service would still work.
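For what it's worth, the dual-switch failover idea is more commonly done with dm-multipath than md on top of iSCSI. A minimal sketch of an /etc/multipath.conf for that, assuming a placeholder WWID and alias (both hypothetical, look up the real WWID with scsi_id on your box):

```
# /etc/multipath.conf -- sketch only; WWID and alias below are placeholders
defaults {
    user_friendly_names yes
    path_grouping_policy multibus   # stripe IO across both switch paths
}
multipaths {
    multipath {
        wwid  3600a0b80001234560000abcd00000001   # placeholder, not real
        alias mailstore
    }
}
```

With both paths in one multibus group, losing a switch just drops one path and IO continues over the other, which is the behavior described above without needing md on top.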
So it seems you have two courses of action:
1. Identify individual current choke points and add individual systems
and storage to eliminate those choke points.
2. Analyze your entire workflow and all systems, identifying all choke
points, then design a completely new, well integrated storage
architecture that solves all current problems and addresses future needs.
Adding an NFS server and moving infrequently accessed (old) files to
alternate storage will alleviate your space problems. But it will
probably not fix some of the other problems you mention, such as servers
bogging down and becoming unresponsive, as that's not a space issue.
The cause of that would likely be an IOPS issue, meaning you don't have
enough storage spindles to service requests in a timely manner.
Less complexity and cost is always better. CPU throughput isn't a
factor in mail workloads--it's all about IO latency. A 1U NFS server
with a 12-drive JBOD is faster, cheaper, easier to set up and manage, sucks
less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
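To put rough numbers on the spindle-count point, here's a back-of-envelope calc. The per-drive IOPS figures and the per-user IOPS guess are my assumptions (common rules of thumb), not numbers from this thread:

```python
# Commonly quoted per-spindle random-IOPS rules of thumb (assumptions).
PER_DRIVE_IOPS = {"7200_sata": 75, "15k_sas": 175}

def aggregate_iops(drives, kind="7200_sata"):
    """Best-case aggregate random IOPS, assuming IO spreads evenly."""
    return drives * PER_DRIVE_IOPS[kind]

def users_supported(drives, iops_per_user=0.5, kind="7200_sata"):
    """Crude sizing: 0.5 peak random IOPS per active mail user is a
    guess -- tune it from real iostat data on your own servers."""
    return int(aggregate_iops(drives, kind) / iops_per_user)

print(aggregate_iops(12))    # 900 IOPS from a 12-drive 7.2k SATA JBOD
print(users_supported(12))   # 1800 users at the assumed 0.5 IOPS/user
```

The point being: the 12-drive JBOD and the 4x4-drive farm have comparable aggregate spindle IOPS, so the single node wins on cost, power, and management rather than raw throughput.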
My worry is that if that one server dies, everything is dead. With at
least a pair of servers, I could keep it running, or if necessary,
restore the accounts on the dead servers from backup, make some config
changes and have everything back running while waiting for replacement
hardware.
You are a perfect candidate for VMware ESX. The HA feature will do
exactly what you want. If one physical node in the cluster dies, HA
automatically restarts the dead VMs on other nodes, transparently.
Clients will have to reestablish connections, but everything else
will pretty much be intact. Worst case scenario will possibly be a few
corrupted mailboxes that were being written when the hardware crashed.
A SAN is required for such a setup. I had extensive experience with ESX
and HA about 5 years ago and it works as advertised. After 5 years it
can only have improved. It's not cheap but usually pays for itself
due to being able to consolidate the workload of dozens of physical
servers into just 2 or 3 boxes.
I don't recall seeing your user load or IOPS requirements so I'm making
some educated guesses WRT your required performance and total storage.
I'm embarrassed to admit I don't have hard numbers on the user load
except the rapidly dwindling disk space count and the fact that when the
web-based mail application tries to list and check disk quota, it can
bring the servers to a crawl.
Maybe a description of your current hardware setup and the total
number of users/mailboxes would be a good starting point. How
many servers do you have, what storage is connected to each, percent of
MUA POP/IMAP connections from user PCs versus those from webmail
applications, etc, etc.
Probably the single most important piece of information would be the
hardware specs of your current Dovecot server, CPUs/RAM/disk array, etc,
and what version of Dovecot you're running.
The focus of your email is building a storage server strictly to offload
old mail and free up space on the Dovecot server. From the sound of
things, this may not be sufficient to solve all your problems.
My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.
Baptism by fire. Ouch. What doesn't kill you makes you stronger. ;)
Since CPU throughput is nearly irrelevant for a mail workload, you can
see it's much cheaper to scale capacity and IOPS with a single node
w/ fat storage than with skinny nodes w/ thin storage. Ok, so here's the
baseline config I threw together:
One of my concerns is that heavy IO on the same server slows overall
performance, even though the theoretical IOPS of the total drives is
the same on 1 server as on X servers. Right now, the servers are usually
screeching to a halt, to the point of even locking out SSH access, due
to iowait sending the load in top to triple digits.
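One quick way to quantify that iowait before redesigning anything: diff two /proc/stat samples and see what fraction of CPU time went to iowait. The two sample lines below are hard-coded so the arithmetic is visible; on a live box you'd read /proc/stat twice, a second apart:

```python
# Fraction of CPU time spent in iowait between two /proc/stat "cpu" lines.
# Field order after "cpu" is: user nice system idle iowait irq softirq ...

def cpu_fields(stat_line):
    return [int(x) for x in stat_line.split()[1:]]

def iowait_pct(before, after):
    b, a = cpu_fields(before), cpu_fields(after)
    delta = [x - y for x, y in zip(a, b)]
    return 100.0 * delta[4] / sum(delta)   # index 4 == iowait

# Hypothetical samples one interval apart (made-up numbers for illustration):
t0 = "cpu  1000 0 500 8000 200 0 0 0 0 0"
t1 = "cpu  1050 0 530 8100 1020 0 0 0 0 0"
print(f"{iowait_pct(t0, t1):.1f}% iowait")  # prints "82.0% iowait"
```

If that number stays high while the disks show near-100% utilization, it confirms the bottleneck is spindle IOPS rather than CPU, which is exactly the "not enough spindles" diagnosis above.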
If multiple servers are screeching to a halt due to iowait, either all
of your servers' individual disks are overloaded, or you already have
shared storage. We really need more info on your current architecture.
Right now we don't know if we're talking about 4 servers or 40, 100
users or 10,000.
Some host failure redundancy is about all you'd gain from the farm
setup. Dovecot shouldn't barf due to one NFS node being down, only
hiccup. I.e., only imap processes accessing files on the downed node
would have trouble.
But if I only have one big storage