[Dovecot] Director pop-login and imap-login processes exiting on signal 11

2012-04-07 Thread Andy Dills

We recently upgraded our cluster to 2.1.3, to enable director proxying.

Everything appears to be working fine for the most part; the only odd 
thing is that I'm seeing a lot of entries in the logs like this:

Apr  7 02:18:05 mail-out06 dovecot: pop3-login: Fatal: master: 
service(pop3-login): child 75029 killed with signal 11 (core not dumped - 
set service pop3-login { drop_priv_before_exec=yes })

This is on the proxy side, not the backend side.

When I try to get a core dump out of it by adding drop_priv_before_exec 
and chroot= to the pop3-login service block on the proxy, I keep running 
into permission errors with the various service sockets.
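
For reference, the kind of override I've been adding looks roughly like 
this (a sketch; the exact values are just what I've been experimenting 
with):

  service pop3-login {
    # allow a core dump from the login process, as the error message suggests
    drop_priv_before_exec = yes
    # empty chroot so the login process can actually write the core file
    chroot =
  }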

Any suggestions?

Thanks,
Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-07 Thread Stan Hoeppner
On 4/5/2012 3:02 PM, Emmanuel Noobadmin wrote:

Hi Emmanuel,

 I'm trying to improve the setup of our Dovecot/Exim mail servers to
 handle the increasingly huge accounts (everybody thinks it's like
 infinitely growing storage like gmail and stores everything forever in
 their email accounts) by changing from Maildir to mdbox, and to take
 advantage of offloading older emails to alternative networked storage
 nodes.
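
(For context, the mdbox-with-alternate-storage layout being described is
typically configured along these lines; the paths are only placeholders,
and the doveadm line is just one way to do the offload:)

  # dovecot.conf: fast primary storage plus an ALT path on the storage node
  mail_location = mdbox:~/mdbox:ALT=/altstorage/%u/mdbox

  # periodically move older mail to the ALT path, e.g. from cron
  doveadm altmove -A savedbefore 180d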

I'll assume networked storage nodes means NFS, not FC/iSCSI SAN, in
which case you'd have said SAN.

 The question now is whether a single large server or a number of 1U
 servers with the same total capacity would be better? 

Less complexity and cost is always better.  CPU throughput isn't a
factor in mail workloads--it's all about IO latency.  A 1U NFS server
with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
 I don't recall seeing your user load or IOPS requirements so I'm making
some educated guesses WRT your required performance and total storage.
I came up with the following system that should be close to suitable,
for ~$10k USD.  The 4 node system runs ~$12k USD; a $2k difference
isn't substantially higher.  But when we double the storage of each
architecture we're at ~$19k, vs ~$26k for an 8 node cluster, a
difference of ~$7k.  That's $1k shy of another 12 disk JBOD.  Since CPU
is nearly irrelevant for a mail workload, you can see it's much cheaper
to scale capacity and IOPS with a single node w/fat storage than with
skinny nodes w/thin storage.  Ok, so here's the baseline config I threw
together:

http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/15351-15351-3328412-241644-3328421-4091396-4158470-4158440.html?dnr=1
8 cores is plenty, 2 boot drives mirrored on B110i, 16GB (4x4GB)
http://www.lsi.com/products/storagecomponents/Pages/LSISAS9205-8e.aspx
http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/12169-304616-3930445-3930445-3930445-3954787-4021626-4021628.html?dnr=1
w/ 12 2TB 7.2K SATA drives, configured as md concat+RAID1 pairs with 12
allocation groups, 12TB usable.  Format the md device with the defaults:

$ mkfs.xfs /dev/md0

Mount with inode64.  No XFS stripe alignment to monkey with.  No md
chunk size or anything else to worry about.  XFS' allocation group
design is pure elegance here.
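
A rough sketch of the mdadm side of that layout (the device names here
are assumptions; substitute the real disks):

  # pair the 12 data disks into 6 RAID1 mirrors (md1..md6)
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdh /dev/sdi
  mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sdj /dev/sdk
  mdadm --create /dev/md6 --level=1 --raid-devices=2 /dev/sdl /dev/sdm

  # concatenate the 6 mirrors into one linear (concat) array,
  # then format and mount as above
  mdadm --create /dev/md0 --level=linear --raid-devices=6 \
      /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6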

If 12 TB isn't sufficient, or if you need more space later, you can
daisy chain up to 3 additional D2600 JBODs for ~$8500 USD each, just add
cables.  This quadruples IOPS, throughput, and capacity--96TB total,
48TB net.  For each JBOD you add, simply create 6 more mdraid1 devices
and grow the linear array with them, then do an xfs_growfs to bring the
extra 12TB of free space into the filesystem.
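
A sketch of what adding one more JBOD's worth of disks looks like
(device names again are placeholders):

  # 6 new mirrors from the 12 new drives (md7..md12)
  mdadm --create /dev/md7 --level=1 --raid-devices=2 /dev/sdn /dev/sdo
  # ...and likewise for md8 through md12...

  # append each new mirror to the end of the linear array
  mdadm --grow /dev/md0 --add /dev/md7

  # once all new mirrors are added, expand XFS into the new space
  xfs_growfs /path/to/mountpoint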

If you're budget conscious and/or simply prefer quality inexpensive
whitebox/DIY type gear, as I do, you can get 24 x 2TB drives in one JBOD
chassis for $7400 USD.  That's twice the drives, capacity, and IOPS, for
~$2500 less than the HP JBOD.  And unlike the HP 'enterprise SATA'
drives, the 2TB WD Black series have a 5 year warranty, and work great
with mdraid.  Chassis and drives at Newegg:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047
http://www.newegg.com/Product/Product.aspx?Item=N82E16822136792

You can daisy chain 3 of these off one HBA SFF8088 port, 6 total on our
LSI 9205-8e above, for a total of 144 2TB drives, 72 effective spindles
in our concat+RAID1 setup, 144TB net space.

 Will be
 using RAID 1 pairs, likely XFS based on reading Hoeppner's
 recommendation on this and the mdadm list.

To be clear, the XFS configuration I recommend/promote for mailbox
storage is very specific and layered.  The layers must all be used
together to get the performance.  These layers consist of using multiple
hardware or software RAID1 pairs and concatenating them with an md
linear array.  You then format that md device with the XFS defaults, or
a specific agcount if you know how to precisely tune AG layout based on
disk size and your anticipated concurrency level of writers.
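
As a small illustration (the number below is only an example, not a
recommendation for your disks), an explicit AG count is set at mkfs time:

  # e.g. two AGs per effective spindle on a 6-pair concat
  mkfs.xfs -d agcount=12 /dev/md0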

Putting XFS on a single RAID1 pair, as you seem to be describing above
for the multiple thin node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created.  The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--thus directories, thus parallel
file writes--horizontally across the spindles--and this is the key.  By
spreading AGs 'horizontally' across the disks in a concat, instead of
'vertically' down a striped array, you accomplish two important things:

1.  You dramatically reduce disk head seeking by using the concat array.
 With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern.  Each user mailbox is stored in a different directory.

Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-07 Thread Emmanuel Noobadmin
On 4/7/12, Stan Hoeppner s...@hardwarefreak.com wrote:

Firstly, thanks for the comprehensive reply. :)

 I'll assume networked storage nodes means NFS, not FC/iSCSI SAN, in
 which case you'd have said SAN.

I haven't decided on that but it would either be NFS or iSCSI over
Gigabit. I don't exactly get a big budget for this. iSCSI because I
planned to do md/mpath over two separate switches so that if one
switch explodes, the email service would still work.

 Less complexity and cost is always better.  CPU throughput isn't a
 factor in mail workloads--it's all about IO latency.  A 1U NFS server
 with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
 less juice and dissipates less heat than 4 1U servers each w/ 4 drives.

My worry is that if that one server dies, everything is dead. With at
least a pair of servers, I could keep it running, or if necessary,
restore the accounts on the dead servers from backup, make some config
changes and have everything back running while waiting for replacement
hardware.

  I don't recall seeing your user load or IOPS requirements so I'm making
 some educated guesses WRT your required performance and total storage.

I'm embarrassed to admit I don't have hard numbers on the user load
except the rapidly dwindling disk space count and the fact that when
the web-based mail application tries to list and check disk quota, it can
bring the servers to a crawl. My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.

 is nearly irrelevant for a mail workload, you can see it's much cheaper
 to scale capacity and IOPS with a single node w/fat storage than with
 skinny nodes w/thin storage.  Ok, so here's the baseline config I threw
 together:

One of my concerns is that heavy IO on a single server would slow the
overall performance, even though the theoretical IOPS of the total
drives is the same whether it's 1 server or X servers. Right now, the
servers regularly screech to a halt, to the point of even locking out
SSH access, with iowait sending the load average in top to triple digits.
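
(For what it's worth, a quick way to put hard numbers on that IO load,
assuming the sysstat package is installed:)

  # per-device IOPS, request sizes and wait times, sampled every 5 seconds
  iostat -dxk 5
  # CPU utilization including the %iowait figure, same interval
  iostat -c 5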


 Some host failure redundancy is about all you'd gain from the farm
 setup.  Dovecot shouldn't barf due to one NFS node being down, only
 hiccup.  I.e. only imap processes accessing files on the downed node would
 have trouble.

But if I only have one big storage node and that goes down, Dovecot
would barf, wouldn't it?
Or would the mdbox format mean Dovecot would still use the local
storage, just that users can't access the offloaded messages?

 Also, I could possibly arrange them in a sort
 of network raid 1 to gain redundancy over single machine failure.

 Now you're sounding like Charles Marcus, but worse. ;)  Stay where you
 are, and brush your hair away from your forehead.  I'm coming over with
 my branding iron that says K.I.S.S

Lol, I have no idea who Charles is, but I always feel safer if there
is some kind of backup. Especially since I don't have the time to
dedicate myself to server administration: by the time I notice
something is bad, it might be too late for anything but the backup.

Of course management and clients don't agree with me since
backup/redundancy costs money. :)


Re: [Dovecot] Dovecot LDA/LMTP vs postfix virtual delivery agent and the x-original-to header

2012-04-07 Thread Daniel L. Miller

On 4/6/2012 1:00 PM, Charles Marcus wrote:

On 2012-04-06 2:53 PM, Daniel L. Miller dmil...@amfes.com wrote:

I'm currently using Postfix 2.7, Dovecot 2.1, and the Dovecot LDA. I
have a pure virtual user environment stored in LDAP. My messages include
X-Original-To and Delivered-To headers.


Well that is great news... at least I'll be able to use the LDA, if 
not LMTP...


Thanks! :)


I had difficulty getting the LMTP transport to work previously - I may
revisit that.


If you do, by all means reply back on whether or not the headers are 
still there...


Thanks again,



From the documentation...
http://www.postfix.org/virtual.8.html

   The virtual(8) delivery agent prepends a "From sender time_stamp"
   envelope header to each message, prepends a Delivered-To: message
   header with the envelope recipient address, prepends an
   X-Original-To: header with the recipient address as given to
   Postfix, prepends a Return-Path: message header with the envelope
   sender address, prepends a ">" character to lines beginning with
   "From ", and appends an empty line.

Using the Postfix pipe agent, which is what is used with the Dovecot LDA,
http://www.postfix.org/pipe.8.html

   flags=BDFORXhqu. (optional)
     Optional message processing flags. By default, a message is
     copied unchanged.

     B   Append a blank line at the end of each message. This is
         required by some mail user agents that recognize "From "
         lines only when preceded by a blank line.

     D   Prepend a Delivered-To: recipient message header with the
         envelope recipient address. Note: for this to work, the
         transport_destination_recipient_limit must be 1 (see
         SINGLE-RECIPIENT DELIVERY above for details).

         The D flag also enforces loop detection (Postfix 2.5 and
         later): if a message already contains a Delivered-To:
         header with the same recipient address, then the message
         is returned as undeliverable. The address comparison is
         case insensitive.

         This feature is available as of Postfix 2.0.

     F   Prepend a "From sender time_stamp" envelope header to the
         message content. This is expected by, for example, UUCP
         software.

     O   Prepend an X-Original-To: recipient message header with
         the recipient address as given to Postfix. Note: for this
         to work, the transport_destination_recipient_limit
         (http://www.postfix.org/postconf.5.html#transport_destination_recipient_limit)
         must be 1 (see SINGLE-RECIPIENT DELIVERY above for
         details).
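
For comparison, a typical master.cf/main.cf pairing for delivery to the
Dovecot LDA via pipe(8) would use those D and O flags along these lines
(the paths and the vmail user are placeholders):

   # master.cf
   dovecot   unix  -       n       n       -       -       pipe
     flags=ODRhu user=vmail:vmail argv=/usr/lib/dovecot/dovecot-lda -f ${sender} -d ${recipient}

   # main.cf -- required for the D and O flags to take effect
   dovecot_destination_recipient_limit = 1
   virtual_transport = dovecot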


Unfortunately, the docs for the lmtp agent 
http://www.postfix.org/lmtp.8.html don't say anything about adding these 
headers.  I tried asking on the Postfix list - didn't get much of an 
answer.

--
Daniel


Re: [Dovecot] Dovecot LDA/LMTP vs postfix virtual delivery agent and the x-original-to header

2012-04-07 Thread Jerry
On Sat, 07 Apr 2012 11:06:48 -0700
Daniel L. Miller articulated:

 Unfortunately, the docs for the lmtp agent 
 http://www.postfix.org/lmtp.8.html don't say anything about adding
 these headers.  I tried asking on the Postfix list - didn't get much
 of an answer.

I may be wrong; however, from what I have been able to understand in
regards to the Postfix documentation, if it does not explicitly claim to
have a feature, then that feature is not available. In other words, if
it doesn't state it can do it, it can't.

-- 
Jerry ♔

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the Reply-To header.



Re: [Dovecot] Outlook (2010) - Dovecot (IMAP) 10x slower with high network load and many folders

2012-04-07 Thread Thomas von Eyben
On Sat, Apr 7, 2012 at 3:16 AM, Willie Gillespie
wgillespie+dove...@es2eng.com wrote:
 On 4/6/2012 3:52 AM, Thomas von Eyben wrote:

 Test results:
 CLIENT-1 is having the problems when CLIENT-2 is using all the
 (100Mbps) bandwidth eg. copying files to MAIL-SRV.
 If I move CLIENT-1 to CLIENT-3 then almost all the delay is gone.
 NB.: I have not (yet) tested if the problem also exists when CLIENT-2
 generates traffic to MAIL-SRV as opposed to OTHER-SRV (but I am
 expecting the same problems).


 So the link between your 100 Mbps switch and the 1 Gbps switch is saturated
 by CLIENT-2, so CLIENT-1 is just getting the leftovers?

 Since CLIENT-3 doesn't go through that 100 Mbps switch, it obviously doesn't
 see that issue.

Yes - that's my current workaround (and perhaps also the solution). I'm
wondering if the performance is really expected to be _so_ bad when
other users are utilizing the LAN.
(You seem to indicate that what I am observing is expected and is
just caused by [un-intended] semi-bad behavior from other users…)

BR TvE


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-07 Thread Robin



Putting XFS on a single RAID1 pair, as you seem to be describing above
for the multiple thin node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created.  The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--thus directories, thus parallel
file writes--horizontally across the spindles--and this is the key.  By
spreading AGs 'horizontally' across the disks in a concat, instead of
'vertically' down a striped array, you accomplish two important things:

1.  You dramatically reduce disk head seeking by using the concat array.
  With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern.  Each user mailbox is stored in a different directory.
Each directory was created in a different AG.  So if you have 96 users
writing their dovecot index concurrently, you have at worst case a
minimum 192 head movements occurring back and forth across the entire
platter of each disk, and likely not well optimized by TCQ/NCQ.  Why 192
instead of 96?  The modification time in the directory metadata must be
updated for each index file, among other things.


Does the XFS allocator automatically distribute AGs in this way even 
when disk usage is extremely light, i.e., a freshly formatted system with 
user directories initially created, and then the actual mailbox contents 
copied into them?


If this is indeed the case, then what you describe is a wondrous 
revelation, since you're scaling out the number of simultaneous metadata 
reads+writes/second as you add RAID1 pairs, if my understanding of this 
is correct.  I'm assuming of course, but should look at the code, that 
metadata locks imposed by the filesystem distribute as the number of 
pairs increases - if it's all just one Big Lock, then that wouldn't be 
the case.


Forgive my laziness, as I could just experiment and take a look at the 
on-disk structures myself, but I don't have four empty drives handy to 
experiment.


The bandwidth improvements due to striping (RAID0/5/6 style) are no help 
for metadata-intensive IO loads, and probably of little value for even 
mdbox loads too, I suspect, unless the mdbox max size is set to 
something pretty large, no?


Have you tried other filesystems and seen if they distribute metadata in 
a similarly efficient and scalable manner across concatenated drive sets?


Is there ANY point to using striping at all, a la RAID10 in this?  I'd 
have thought just making as many RAID1 pairs out of your drives as 
possible would be the ideal strategy - is this not the case?


=R=


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-07 Thread Stan Hoeppner
On 4/7/2012 3:45 PM, Robin wrote:
 
 Putting XFS on a single RAID1 pair, as you seem to be describing above
 for the multiple thin node case, and hitting one node with parallel
 writes to multiple user mail dirs, you'll get less performance than
 EXT3/4 on that mirror pair--possibly less than half, depending on the
 size of the disks and thus the number of AGs created.  The 'secret' to
 XFS performance with this workload is concatenation of spindles.
 Without it you can't spread the AGs--thus directories, thus parallel
 file writes--horizontally across the spindles--and this is the key.  By
 spreading AGs 'horizontally' across the disks in a concat, instead of
 'vertically' down a striped array, you accomplish two important things:

 1.  You dramatically reduce disk head seeking by using the concat array.
   With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
 evenly spaced vertically down each disk in the array, following the
 stripe pattern.  Each user mailbox is stored in a different directory.
 Each directory was created in a different AG.  So if you have 96 users
 writing their dovecot index concurrently, you have at worst case a
 minimum 192 head movements occurring back and forth across the entire
 platter of each disk, and likely not well optimized by TCQ/NCQ.  Why 192
 instead of 96?  The modification time in the directory metadata must be
 updated for each index file, among other things.
 
 Does the XFS allocator automatically distribute AGs in this way even
 when disk usage is extremely light, i.e, a freshly formatted system with
 user directories initially created, and then the actual mailbox contents
 copied into them?

It doesn't distribute AGs.  There are a static number created during
mkfs.xfs.  The inode64 allocator round-robins new directory creation
across the AGs, and does the same with files created in those
directories.  Having the directory metadata and file extents in the same
AG decreases head movement and thus seek latency for mixed
metadata/extent high IOPS workloads.
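
A quick way to see this on a test filesystem (paths are placeholders;
xfs_bmap -v prints an AG column for each extent):

  # a few "mailbox" directories with one index-sized file each
  for u in user1 user2 user3 user4; do
      mkdir -p /mnt/test/$u
      dd if=/dev/zero of=/mnt/test/$u/dovecot.index bs=4k count=8 2>/dev/null
  done
  sync

  # report which allocation group each file's extents landed in
  for f in /mnt/test/user*/dovecot.index; do
      echo "$f"
      xfs_bmap -v "$f" | awk 'NR > 2 { print "  AG", $4 }'
  done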

 If this is indeed the case, then what you describe is a wondrous
 revelation, since you're scaling out the number of simultaneous metadata
 reads+writes/second as you add RAID1 pairs, if my understanding of this
 is correct.  

Correct.  And adding more space and IOPS is uncomplicated.  No chunk
calculations, no restriping of the array.  You simply grow the md linear
array adding the new disk device.  Then grow XFS to add the new free
space to the filesystem.  AFAIK this can be done infinitely,
theoretically.  I'm guessing md has a device count limit somewhere.  If
not, your bash line buffer might. ;)

 I'm assuming of course, but should look at the code, that
 metadata locks imposed by the filesystem distribute as the number of
 pairs increase - if it's all just one Big Lock, then that wouldn't be
 the case.

XFS locking is done as minimally as possible and is insanely fast.  I've
not come across any reported performance issues relating to it.  And
yes, any single metadata lock will occur in a single AG on one mirror
pair using the concat setup.

 Forgive my laziness, as I could just experiment and take a look at the
 on-disk structures myself, but I don't have four empty drives handy to
 experiment.

Don't sweat it.  All of this stuff is covered in the XFS Filesystem
Structure Guide, exciting reading if you enjoy a root canal while
watching snails race:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

 The bandwidth improvements due to striping (RAID0/5/6 style) are no help
 for metadata-intensive IO loads, and probably of little value for even
 mdbox loads too, I suspect, unless the mdbox max size is set to
 something pretty large, no?

The problem with striped parity RAID is not allocation, which takes
place in free space and is pretty fast.  The problem is the extra read
seeks and bandwidth of the RMW cycle when you modify an existing stripe.
 Updating a single flag in a Dovecot index causes md or the hardware
RAID controller to read the entire stripe into buffer space or RAID
cache, modify the flag byte, recalculate parity, then write the whole
stripe and parity block back out across all the disks.

With a linear concat of RAID1 pairs we're simply rewriting a single 4KB
filesystem block, maybe only a single 512B sector.  I'm at the edge of
my knowledge here.  I don't know exactly how Timo does the index
updates.  Regardless of the method, the index update is light years
faster with the concat setup as there is no RMW and full stripe
writeback as with the RAID5/6 case.

 Have you tried other filesystems and seen if they distribute metadata in
 a similarly efficient and scalable manner across concatenated drive sets?

EXT, any version, does not.  ReiserFS does not.  Both require disk
striping to achieve any parallelism.  With concat they both simply start
writing at the beginning sectors of the first RAID1 pair and 4 years
later maybe reach the last pair as