[Dovecot] Director pop-login and imap-login processes exiting on signal 11
We recently upgraded our cluster to 2.1.3 to enable director proxying. Everything appears to be working fine for the most part; the only odd thing is that I'm seeing a lot of entries in the logs like this:

Apr 7 02:18:05 mail-out06 dovecot: pop3-login: Fatal: master: service(pop3-login): child 75029 killed with signal 11 (core not dumped - set service pop3-login { drop_priv_before_exec=yes })

This is on the proxy side, not the backend side. When I try to get a dump out of it by adding drop_priv_before_exec and chroot= to the pop3-login service block on the proxy, I keep running into permission errors with the various service sockets.

Any suggestions?

Thanks,
Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
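For reference, the service block the log message hints at would look something like this minimal sketch; the empty chroot= is the part that, per the above, runs into the service socket permission errors:

service pop3-login {
  # Per the log message: required to get a core dump from the login process.
  drop_priv_before_exec = yes
  # Disable the login chroot so the core can actually be written.
  chroot =
}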
Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?
On 4/5/2012 3:02 PM, Emmanuel Noobadmin wrote:
> I'm trying to improve the setup of our Dovecot/Exim mail servers to handle the increasingly huge accounts (everybody thinks it's like infinitely growing storage like gmail and stores everything forever in their email accounts) by changing from Maildir to mdbox, and to take advantage of offloading older emails to alternative networked storage nodes.

Hi Emmanuel,

I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in which case you'd have said SAN.

> The question now is whether a single large server or a number of 1U servers with the same total capacity would be better?

Less complexity and cost is always better. CPU throughput isn't a factor in mail workloads--it's all about IO latency. A 1U NFS server with a 12 drive JBOD is faster, cheaper, easier to set up and manage, sucks less juice and dissipates less heat than 4 1U servers each w/ 4 drives.

I don't recall seeing your user load or IOPS requirements, so I'm making some educated guesses WRT your required performance and total storage. I came up with the following system that should be close to suitable, for ~$10k USD. The 4 node system runs ~$12k USD; at $2k more, that isn't substantially higher. But when we double the storage of each architecture we're at ~$19k vs ~$26k for an 8 node cluster, a difference of ~$7k. That's $1k shy of another 12 disk JBOD. Since CPU is nearly irrelevant for a mail workload, you can see it's much cheaper to scale capacity and IOPS with a single node w/fat storage than with skinny nodes w/thin storage.

Ok, so here's the baseline config I threw together:

http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/15351-15351-3328412-241644-3328421-4091396-4158470-4158440.html?dnr=1
8 cores is plenty, 2 boot drives mirrored on B110i, 16GB (4x4GB)

http://www.lsi.com/products/storagecomponents/Pages/LSISAS9205-8e.aspx

http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/12169-304616-3930445-3930445-3930445-3954787-4021626-4021628.html?dnr=1
w/ 12 2TB 7.2K SATA drives, configured as md concat+RAID1 pairs with 12 allocation groups, 12TB usable. Format the md device with the defaults:

$ mkfs.xfs /dev/md0

Mount with inode64. No XFS stripe alignment to monkey with. No md chunk size or anything else to worry about. XFS' allocation group design is pure elegance here.

If 12TB isn't sufficient, or if you need more space later, you can daisy chain up to 3 additional D2600 JBODs for ~$8500 USD each; just add cables. This quadruples IOPS, throughput, and capacity--96TB total, 48TB net. For each JBOD, simply create 6 more mdraid1 devices and grow the linear array with them, then do an xfs_growfs to bring the extra 12TB of free space into the filesystem.

If you're budget conscious and/or simply prefer quality inexpensive whitebox/DIY type gear, as I do, you can get 24 x 2TB drives in one JBOD chassis for $7400 USD. That's twice the drives, capacity, and IOPS for ~$2500 less than the HP JBOD. And unlike the HP 'enterprise SATA' drives, the 2TB WD Black series have a 5 year warranty, and work great with mdraid. Chassis and drives at Newegg:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047
http://www.newegg.com/Product/Product.aspx?Item=N82E16822136792

You can daisy chain 3 of these off one HBA SFF8088 port, 6 total on our LSI 9205-8e above, for a total of 144 2TB drives, 72 effective spindles in our concat+RAID1 setup, 144TB net space.
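Spelled out as commands, the 12-drive setup above would look something like this minimal sketch (device names sdb..sdm and the /srv/mail mount point are assumptions for illustration):

# Six RAID1 pairs from the 12 data drives.
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdh /dev/sdi
mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sdj /dev/sdk
mdadm --create /dev/md6 --level=1 --raid-devices=2 /dev/sdl /dev/sdm
# Concatenate the pairs into one linear (concat) array.
mdadm --create /dev/md0 --level=linear --raid-devices=6 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6
# Format with the XFS defaults and mount with inode64, as above.
mkfs.xfs /dev/md0
mount -o inode64 /dev/md0 /srv/mail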
> Will be using RAID 1 pairs, likely XFS based on reading Hoeppner's recommendation on this and the mdadm list.

To be clear, the XFS configuration I recommend/promote for mailbox storage is very specific and layered. The layers must all be used together to get the performance. These layers consist of using multiple hardware or software RAID1 pairs and concatenating them with an md linear array. You then format that md device with the XFS defaults, or a specific agcount if you know how to precisely tune AG layout based on disk size and your anticipated concurrency level of writers.

Putting XFS on a single RAID1 pair, as you seem to be describing above for the multiple thin node case, and hitting one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair--possibly less than half, depending on the size of the disks and thus the number of AGs created. The 'secret' to XFS performance with this workload is concatenation of spindles. Without it you can't spread the AGs--thus directories, thus parallel file writes--horizontally across the spindles--and this is the key.

By spreading AGs 'horizontally' across the disks in a concat, instead of 'vertically' down a striped array, you accomplish two important things:

1. You dramatically reduce disk head seeking by using the concat array. With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs evenly spaced vertically down each disk in the array, following the stripe pattern. Each user mailbox is stored in a different directory. Each directory was created in a different AG. So if you have 96 users writing their dovecot index concurrently, you have at worst case a minimum of 192 head movements occurring back and forth across the entire platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192 instead of 96? The modification time in the directory metadata must be updated for each index file, among other things.
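As a minimal sketch of the 'specific agcount' alternative mentioned above (the value 12 is an assumption matching the six-pair, 12-drive example, pinning two AGs per mirror pair):

mkfs.xfs -d agcount=12 /dev/md0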
Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?
On 4/7/12, Stan Hoeppner s...@hardwarefreak.com wrote:

Firstly, thanks for the comprehensive reply. :)

> I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in which case you'd have said SAN.

I haven't decided on that, but it would either be NFS or iSCSI over Gigabit. I don't exactly get a big budget for this. iSCSI because I planned to do md/mpath over two separate switches, so that if one switch explodes, the email service would still work.

> Less complexity and cost is always better. CPU throughput isn't a factor in mail workloads--it's all about IO latency. A 1U NFS server with a 12 drive JBOD is faster, cheaper, easier to set up and manage, sucks less juice and dissipates less heat than 4 1U servers each w/ 4 drives.

My worry is that if that one server dies, everything is dead. With at least a pair of servers, I could keep it running, or if necessary restore the accounts on the dead server from backup, make some config changes, and have everything back running while waiting for replacement hardware.

> I don't recall seeing your user load or IOPS requirements so I'm making some educated guesses WRT your required performance and total storage.

I'm embarrassed to admit I don't have hard numbers on the user load, except the rapidly dwindling disk space count and the fact that when the web-based mail application tries to list and check disk quota, it can bring the servers to a crawl. My lame excuse is that I'm just the web dev who got caught holding the server admin potato.

> Since CPU is nearly irrelevant for a mail workload, you can see it's much cheaper to scale capacity and IOPS with a single node w/fat storage than with skinny nodes w/thin storage. Ok, so here's the baseline config I threw together:

One of my concerns is that heavy IO on the same server slows the overall performance, even though the theoretical IOPS of the total drives are the same on 1 and on X servers. Right now the servers are usually screeching to a halt, to the point of even locking out SSH access due to IOWait sending the load in top to triple digits.

> Some host failure redundancy is about all you'd gain from the farm setup. Dovecot shouldn't barf due to one NFS node being down, only hiccup. I.e. only imap processes accessing files on the downed node would have trouble.

But if I only have one big storage node and that went down, Dovecot would barf, wouldn't it? Or would the mdbox format mean Dovecot would still use the local storage, just that users can't access the offloaded messages?

>> Also, I could possibly arrange them in a sort of network raid 1 to gain redundancy over single machine failure.
>
> Now you're sounding like Charles Marcus, but worse. ;) Stay where you are, and brush your hair away from your forehead. I'm coming over with my branding iron that says K.I.S.S

Lol, I have no idea who Charles is, but I always feel safer if there is some kind of backup. Especially since I don't have the time to dedicate myself to server administration; by the time I notice something is bad, it might be too late for anything but the backup. Of course management and clients don't agree with me, since backup/redundancy costs money. :)
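As a rough sketch of how to get those hard numbers (assumes the sysstat package is installed; the 5 second interval is arbitrary):

iostat -x 5
# r/s + w/s per device approximates IOPS; await is average IO wait in ms.
# %util pinned near 100 on the mail spool disks confirms the bottleneck
# is IO, not CPU.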
Re: [Dovecot] Dovecot LDA/LMTP vs postfix virtual delivery agent and the x-original-to header
On 4/6/2012 1:00 PM, Charles Marcus wrote:
> On 2012-04-06 2:53 PM, Daniel L. Miller dmil...@amfes.com wrote:
>> I'm currently using Postfix 2.7, Dovecot 2.1, and the Dovecot LDA. I have a pure virtual user environment stored in LDAP. My messages include X-Original-To and Delivered-To headers.
>
> Well that is great news... at least I'll be able to use the LDA, if not LMTP... Thanks! :)
>
>> I had difficulty getting the LMTP transport to work previously - I may revisit that.
>
> If you do, by all means reply back on whether or not the headers are still there... Thanks again,

From the documentation (http://www.postfix.org/virtual.8.html):

  The virtual(8) delivery agent prepends a "From sender time_stamp" envelope header to each message, prepends a Delivered-To: message header with the envelope recipient address, prepends an X-Original-To: header with the recipient address as given to Postfix, prepends a Return-Path: message header with the envelope sender address, prepends a ">" character to lines beginning with "From ", and appends an empty line.

Using the Postfix pipe agent, which is what is used with the Dovecot LDA (http://www.postfix.org/pipe.8.html):

  flags=BDFORXhqu. (optional)
    Optional message processing flags. By default, a message is copied unchanged.

    B  Append a blank line at the end of each message. This is required by some mail user agents that recognize "From " lines only when preceded by a blank line.

    D  Prepend a Delivered-To: recipient message header with the envelope recipient address. Note: for this to work, the transport_destination_recipient_limit must be 1 (see SINGLE-RECIPIENT DELIVERY above for details). The D flag also enforces loop detection (Postfix 2.5 and later): if a message already contains a Delivered-To: header with the same recipient address, then the message is returned as undeliverable. The address comparison is case insensitive. This feature is available as of Postfix 2.0.

    F  Prepend a "From sender time_stamp" envelope header to the message content. This is expected by, for example, UUCP software.

    O  Prepend an X-Original-To: recipient message header with the recipient address as given to Postfix. Note: for this to work, the transport_destination_recipient_limit (http://www.postfix.org/postconf.5.html#transport_destination_recipient_limit) must be 1 (see SINGLE-RECIPIENT DELIVERY above for details).

Unfortunately, the docs for the lmtp agent (http://www.postfix.org/lmtp.8.html) don't say anything about adding these headers. I tried asking on the Postfix list - didn't get much of an answer.

--
Daniel
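For reference, a hedged master.cf sketch wiring the Dovecot LDA through pipe(8) with the D and O flags quoted above (the transport name, vmail user, and dovecot-lda path are assumptions, not taken from this thread):

dovecot   unix  -       n       n       -       -       pipe
  flags=ODRhu user=vmail:vmail argv=/usr/libexec/dovecot/dovecot-lda -f ${sender} -d ${recipient}

# And in main.cf, since both the D and O flags require single-recipient delivery:
dovecot_destination_recipient_limit = 1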
Re: [Dovecot] Dovecot LDA/LMTP vs postfix virtual delivery agent and the x-original-to header
On Sat, 07 Apr 2012 11:06:48 -0700, Daniel L. Miller articulated:

> Unfortunately, the docs for the lmtp agent (http://www.postfix.org/lmtp.8.html) don't say anything about adding these headers. I tried asking on the Postfix list - didn't get much of an answer.

I may be wrong; however, from what I have been able to understand of the Postfix documentation, if it does not explicitly claim to have a feature, then that feature is not available. In other words, if it doesn't state it can do it, it can't.

--
Jerry ♔

Disclaimer: off-list followups get on-list replies or get ignored. Please do not ignore the Reply-To header.
Re: [Dovecot] Outlook (2010) - Dovecot (IMAP) 10x slower with high network load and many folders
On Sat, Apr 7, 2012 at 3:16 AM, Willie Gillespie wgillespie+dove...@es2eng.com wrote:
> On 4/6/2012 3:52 AM, Thomas von Eyben wrote:
>> Test results: CLIENT-1 is having the problems when CLIENT-2 is using all the (100Mbps) bandwidth, e.g. copying files to MAIL-SRV. If I move CLIENT-1 to CLIENT-3 then almost all the delay is gone. NB: I have not (yet) tested whether the problem also exists when CLIENT-2 generates traffic to MAIL-SRV as opposed to OTHER-SRV (but I am expecting the same problems).
>
> So the link between your 100 Mbps switch and the 1 Gbps switch is saturated by CLIENT-2, so CLIENT-1 is just getting the leftovers? Since CLIENT-3 doesn't go through that 100 Mbps switch, it obviously doesn't see that issue.

Yes - that's my current workaround (perhaps also a solution). I'm wondering if the performance is really expected to be _so_ bad when other users are utilizing the LAN. (You seem to indicate that what I am observing is expected and is just caused by [un-intended] semi-bad behavior from other users…)

BR TvE
Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?
> Putting XFS on a single RAID1 pair, as you seem to be describing above for the multiple thin node case, and hitting one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair--possibly less than half, depending on the size of the disks and thus the number of AGs created. The 'secret' to XFS performance with this workload is concatenation of spindles. Without it you can't spread the AGs--thus directories, thus parallel file writes--horizontally across the spindles--and this is the key. By spreading AGs 'horizontally' across the disks in a concat, instead of 'vertically' down a striped array, you accomplish two important things:
>
> 1. You dramatically reduce disk head seeking by using the concat array. With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs evenly spaced vertically down each disk in the array, following the stripe pattern. Each user mailbox is stored in a different directory. Each directory was created in a different AG. So if you have 96 users writing their dovecot index concurrently, you have at worst case a minimum of 192 head movements occurring back and forth across the entire platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192 instead of 96? The modification time in the directory metadata must be updated for each index file, among other things.

Does the XFS allocator automatically distribute AGs in this way even when disk usage is extremely light, i.e., a freshly formatted system with user directories initially created, and then the actual mailbox contents copied into them?

If this is indeed the case, then what you describe is a wondrous revelation, since you're scaling out the number of simultaneous metadata reads+writes/second as you add RAID1 pairs, if my understanding of this is correct. I'm assuming, of course, but should look at the code, that metadata locks imposed by the filesystem distribute as the number of pairs increases - if it's all just one Big Lock, then that wouldn't be the case. Forgive my laziness, as I could just experiment and take a look at the on-disk structures myself, but I don't have four empty drives handy to experiment with.

The bandwidth improvements due to striping (RAID0/5/6 style) are no help for metadata-intensive IO loads, and probably of little value even for mdbox loads, I suspect, unless the mdbox max size is set to something pretty large, no?

Have you tried other filesystems and seen if they distribute metadata in a similarly efficient and scalable manner across concatenated drive sets?

Is there ANY point to using striping at all, a la RAID10, in this? I'd have thought just making as many RAID1 pairs out of your drives as possible would be the ideal strategy - is this not the case?

=R=
Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?
On 4/7/2012 3:45 PM, Robin wrote:
>> Putting XFS on a single RAID1 pair, as you seem to be describing above for the multiple thin node case, and hitting one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair [...] Why 192 instead of 96? The modification time in the directory metadata must be updated for each index file, among other things.
>
> Does the XFS allocator automatically distribute AGs in this way even when disk usage is extremely light, i.e., a freshly formatted system with user directories initially created, and then the actual mailbox contents copied into them?

It doesn't distribute AGs. There is a static number of them, created during mkfs.xfs. The inode64 allocator round robins new directory creation across the AGs, and does the same with files created in those directories. Having the directory metadata and file extents in the same AG decreases head movement, and thus seek latency, for mixed metadata/extent high IOPS workloads.

> If this is indeed the case, then what you describe is a wondrous revelation, since you're scaling out the number of simultaneous metadata reads+writes/second as you add RAID1 pairs, if my understanding of this is correct.

Correct. And adding more space and IOPS is uncomplicated. No chunk calculations, no restriping of the array. You simply grow the md linear array with the new disk device, then grow XFS to add the new free space to the filesystem. AFAIK this can be done infinitely, theoretically. I'm guessing md has a device count limit somewhere. If not, your bash line buffer might. ;)

> I'm assuming, of course, but should look at the code, that metadata locks imposed by the filesystem distribute as the number of pairs increases - if it's all just one Big Lock, then that wouldn't be the case.

XFS locking is done as minimally as possible and is insanely fast. I've not come across any reported performance issues relating to it. And yes, any single metadata lock will occur in a single AG on one mirror pair in the concat setup.

> Forgive my laziness, as I could just experiment and take a look at the on-disk structures myself, but I don't have four empty drives handy to experiment with.

Don't sweat it.
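For illustration, a minimal sketch of that grow procedure (the new device names and the mount point are assumptions):

# Build a new RAID1 pair from two added drives.
mdadm --create /dev/md7 --level=1 --raid-devices=2 /dev/sdn /dev/sdo
# Append it to the linear (concat) array.
mdadm --grow /dev/md0 --add /dev/md7
# Grow the mounted filesystem into the new space.
xfs_growfs /srv/mail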
All of this stuff is covered in the XFS Filesystem Structure Guide, exciting reading if you enjoy a root canal while watching snails race:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

> The bandwidth improvements due to striping (RAID0/5/6 style) are no help for metadata-intensive IO loads, and probably of little value even for mdbox loads, I suspect, unless the mdbox max size is set to something pretty large, no?

The problem with striped parity RAID is not allocation, which takes place in free space and is pretty fast. The problem is the extra read seeks and bandwidth of the RMW cycle when you modify an existing stripe. Updating a single flag in a Dovecot index causes md or the hardware RAID controller to read the entire stripe into buffer space or RAID cache, modify the flag byte, recalculate parity, then write the whole stripe and parity block back out across all the disks. With a linear concat of RAID1 pairs we're simply rewriting a single 4KB filesystem block, maybe only a single 512B sector. I'm at the edge of my knowledge here; I don't know exactly how Timo does the index updates. Regardless of the method, the index update is light years faster with the concat setup, as there is no RMW and full stripe writeback as in the RAID5/6 case.

> Have you tried other filesystems and seen if they distribute metadata in a similarly efficient and scalable manner across concatenated drive sets?

EXT, any version, does not. ReiserFS does not. Both require disk striping to achieve any parallelism. With concat they both simply start writing at the beginning sectors of the first RAID1 pair, and 4 years later maybe reach the last pair as the filesystem finally fills.
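To put rough numbers on the RMW cost described above (the 12-disk RAID5 geometry and 64KB chunk are assumptions, not figures from this thread):

chunk_kb=64; disks=12                     # assumed RAID5 geometry
stripe_kb=$(( (disks - 1) * chunk_kb ))   # data in one full stripe: 704KB
echo "RAID5: a 4KB index update reads and rewrites ~${stripe_kb}KB plus parity"
echo "concat RAID1: the same update writes 4KB once to each mirror"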