[ceph-users] IRQ balancing, distribution
Hello,

This is not really specific to Ceph, but since one of the default questions by the Ceph team when people are facing performance problems seems to be "Have you tried turning it off and on again?" ^o^ err, "Are all your interrupts on one CPU?", I'm going to wax on about this for a bit and hope for some feedback from others with different experiences and architectures than mine.

Firstly, the question of whether all your IRQ handling is happening on the same CPU is a valid one, as depending on a bewildering range of factors, from kernel parameters to actual hardware, one often does indeed wind up with that scenario, usually with everything on CPU0. That is certainly the case with all my recent hardware and Debian kernels.

I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and thus feedback from Intel users is very much sought after, as I'm considering Intel based storage nodes in the future. It's vaguely amusing that Ceph storage nodes seem to have higher CPU requirements (individual core performance, not necessarily # of cores) and similar RAM requirements to my VM hosts. ^o^

So the common wisdom is that all IRQs on one CPU is a bad thing, lest it gets overloaded and, for example, drops network packets because of this. And while that is true, I'm hard pressed to generate any load on my clusters where the IRQ ratio on CPU0 goes much beyond 50%. Thus it should come as no surprise that spreading out IRQs with irqbalance, or more accurately by manually setting the /proc/irq/xx/smp_affinity mask, doesn't give me any discernible differences when it comes to benchmark results.

With irqbalance spreading things out willy-nilly, without any regard for or knowledge about the hardware and which IRQ does what, it's definitely something I won't be using out of the box. This goes especially for systems with different NUMA regions without proper policy scripts for irqbalance.

So for my current hardware I'm going to keep IRQs on CPU0 and CPU1, which are in the same Bulldozer module and thus share L2 and L3 cache: in particular the AHCI (journal SSDs) and HBA or RAID controller IRQs on CPU0 and the network (Infiniband) on CPU1. That should give me sufficient reserves in processing power and keep intra-module and NUMA (additional physical CPUs) traffic to a minimum. This will also (within a certain load range) allow these 2 CPUs (one module) to be ramped up to full speed while other cores can remain at a lower frequency.

Now with Intel some PCIe lanes are handled by a specific CPU (that's why you often see the need for adding a 2nd CPU to use all slots), and in that case pinning the IRQ handling for those slots to a specific CPU might actually make a lot of sense, especially if not all the traffic generated by that card will have to be transferred to the other CPU anyway.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
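[Editorial note: for anyone wanting to try the manual smp_affinity approach Christian describes, a minimal sketch might look like the following. The IRQ numbers, driver names and masks are examples only, not taken from his systems; check /proc/interrupts on your own hardware first.]

# find the IRQ numbers of the storage and network devices
grep -E 'ahci|mpt2sas|mlx4|ib' /proc/interrupts

# pin them manually; the value is a hex bitmap of allowed CPUs
# (0x1 = CPU0, 0x2 = CPU1), the IRQ numbers below are made up
echo 1 > /proc/irq/44/smp_affinity   # AHCI (journal SSDs) -> CPU0
echo 1 > /proc/irq/45/smp_affinity   # HBA / RAID controller -> CPU0
echo 2 > /proc/irq/46/smp_affinity   # Infiniband HCA -> CPU1

# verify where interrupts actually land afterwards
cat /proc/interrupts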
Re: [ceph-users] IRQ balancing, distribution
hi christian,

we once were debugging some performance issues, and IRQ balancing was one of the things we looked into, but there was no real benefit there for us. all interrupts on one cpu is only an issue if the hardware itself is not the bottleneck. we were running some default SAS HBAs (Dell H200), and those simply can't generate enough load to cause any IRQ issue, even on older AMD cpus (we did tests on R515 boxes). (there was a ceph presentation somewhere that highlights the impact of using the proper disk controller; we'll have to fix that first in our case. i'll be happy if IRQ balancing actually becomes an issue ;)

but another issue is the OSD processes: do you pin those as well? and how much data do they actually handle? to checksum, the OSD process needs all the data, so that can also cause a lot of NUMA traffic, especially if they are not pinned.

i sort of hope that current CPUs have enough pcie lanes and cores so we can use single socket nodes, to avoid at least the NUMA traffic.

stijn
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] IRQ balancing, distribution
Hello,

On Mon, 22 Sep 2014 09:35:10 +0200 Stijn De Weirdt wrote:

> we once were debugging some performance issues, and IRQ balancing was
> one of the things we looked into, but there was no real benefit there
> for us. all interrupts on one cpu is only an issue if the hardware
> itself is not the bottleneck.

In particular the spinning rust. ^o^
But this crept up in recent discussions about all-SSD OSD storage servers, so there is some (remote) possibility for this to happen.

> we were running some default SAS HBAs (Dell H200), and those simply
> can't generate enough load to cause any IRQ issue, even on older AMD
> cpus (we did tests on R515 boxes). i'll be happy if IRQ balancing
> actually becomes an issue ;)

Yeah, this pretty much matches what I'm seeing and have experienced over the years.

> but another issue is the OSD processes: do you pin those as well? and
> how much data do they actually handle? to checksum, the OSD process
> needs all the data, so that can also cause a lot of NUMA traffic,
> especially if they are not pinned.

That's why all my (production) storage nodes have only a single 6 or 8 core CPU. Unfortunately that also limits the amount of RAM in there; 16GB modules have just recently become an economically viable alternative to 8GB ones.

Thus I don't pin OSD processes, given that on my 8 core nodes with 8 OSDs and 4 journal SSDs I can make Ceph eat babies and nearly all CPU (not IOwait!) resources with the right (or is that wrong?) tests, namely 4K FIOs.

The linux scheduler usually is quite decent at keeping processes where the action is, thus you see for example a clear preference for DRBD or KVM vnet processes to be near or on the CPU(s) where the IRQs are.

> i sort of hope that current CPUs have enough pcie lanes and cores so we
> can use single socket nodes, to avoid at least the NUMA traffic.

Even the lackluster Opterons with just PCIe v2 and fewer lanes than current Intel CPUs are plenty fast enough (sufficient bandwidth) when it comes to the storage node density I'm deploying.

Christian
Re: [ceph-users] IRQ balancing, distribution
On Mon, Sep 22, 2014 at 10:21 AM, Christian Balzer ch...@gol.com wrote: The linux scheduler usually is quite decent in keeping processes where the action is, thus you see for example a clear preference of DRBD or KVM vnet processes to be near or on the CPU(s) where the IRQs are. Since you're just mentioning it: DRBD, for one, needs to *tell* the kernel that its sender, receiver and worker threads should be on the same CPU. It has done that for some time now, but you shouldn't assume that this is some kernel magic that DRBD can just use. Not suggesting that you're unaware of this, but the casual reader might be. :) Cheers, Florian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] IRQ balancing, distribution
>> but another issue is the OSD processes: do you pin those as well? and
>> how much data do they actually handle? to checksum, the OSD process
>> needs all the data, so that can also cause a lot of NUMA traffic,
>> especially if they are not pinned.
>
> That's why all my (production) storage nodes have only a single 6 or 8
> core CPU. [...] Thus I don't pin OSD processes, given that on my 8 core
> nodes with 8 OSDs and 4 journal SSDs I can make Ceph eat babies and
> nearly all CPU (not IOwait!) resources with the right (or is that
> wrong?) tests, namely 4K FIOs.
>
> The linux scheduler usually is quite decent at keeping processes where
> the action is, thus you see for example a clear preference for DRBD or
> KVM vnet processes to be near or on the CPU(s) where the IRQs are.

the scheduler has improved recently, but i don't know since what version (certainly not backported to the RHEL6 kernel).

pinning the OSDs might actually be a bad idea, unless the page cache is flushed before each osd restart. the kernel VM has this nice feature where allocating memory in a NUMA domain does not trigger freeing of cache memory in that domain; it will first try to allocate memory on another NUMA domain instead. although typically the VM cache will be maxed out on OSD boxes, i'm not sure the cache clearing itself is NUMA aware, so who knows where the memory is located when it's allocated.

stijn
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
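[Editorial note: if someone does want to experiment with binding an OSD to a NUMA domain rather than a single core, a rough sketch follows. The OSD id, paths and node number are examples only, and starting the daemon this way bypasses the init script, so treat it as a test rather than a production recipe.]

# start one OSD with both its CPUs and its memory restricted to NUMA node 0
numactl --cpunodebind=0 --membind=0 \
    /usr/bin/ceph-osd -i 12 -c /etc/ceph/ceph.conf --cluster ceph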
Re: [ceph-users] IRQ balancing, distribution
Page reclamation in Linux is NUMA aware, so page reclamation is not an issue.

You can see performance improvements only if all the components of a given IO complete on a single core. This is hard to achieve in Ceph, as a single IO goes through multiple thread switches and the threads are not bound to any core. Starting an OSD with numactl and binding it to one core might aggravate the problem, as all the threads spawned by that OSD will compete for the CPU on a single core; an OSD with the default configuration has 20+ threads. Binding the OSD process to one core using taskset does not help either, as some memory (especially the heap) may already be allocated on the other NUMA node.

The design principle followed seems to be to fan out by spawning multiple threads at each pipelining stage to utilize the available cores in the system. Because the IOs won't complete on the same core they were issued on, a lot of cycles are lost to cache coherency.

Regards,
Anand

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
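[Editorial note: a quick way to see the thread fan-out and placement Anand describes on a running OSD; the pid is an example, substitute one from pidof ceph-osd.]

# how many threads one ceph-osd process has
ps -L -p 12345 | wc -l

# which CPUs it is currently allowed to run on
taskset -pc 12345

# how its memory is spread across NUMA nodes
numastat -p 12345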
Re: [ceph-users] Newbie Ceph Design Questions
Hi Christian,

On 22.09.2014 05:36, Christian Balzer wrote:
> On Sun, 21 Sep 2014 21:00:48 +0200 Udo Lembke wrote:
>> On 21.09.2014 07:18, Christian Balzer wrote:
>>> ...
>>> Personally I found ext4 to be faster than XFS in nearly all use cases
>>> and the lack of full, real kernel integration of ZFS is something
>>> that doesn't appeal to me either.
>> a little bit OT... what kind of ext4-mount options do you use?
>> I have a 5-node cluster with xfs (60 osds), and perhaps the
>> performance with ext4 would be better?!
> Hard to tell w/o testing your particular load, I/O patterns.
> When benchmarking directly with single disks or RAIDs it is fairly
> straightforward to see:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028540.html
> Also note that the actual question has never been answered by the Ceph
> team, which is a shame as I venture that it would make things faster.

do you run your cluster without filestore_xattr_use_omap = true, or with it (to be on the safe side), given the missing answer?

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
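[Editorial note: for reference, the two settings being discussed would sit in ceph.conf roughly like this; the ext4 mount options shown are just an illustrative set, not Christian's actual ones.]

[osd]
filestore xattr use omap = true
osd mkfs type = ext4
osd mount options ext4 = user_xattr,rw,noatime,nodiratime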
Re: [ceph-users] IRQ balancing, distribution
hi,

> Page reclamation in Linux is NUMA aware, so page reclamation is not an
> issue.

except for the first min_free_kbytes? those can come from anywhere, no? or is the reclamation such that it tries to free an equal portion for each NUMA domain? if the OSD allocates memory in chunks smaller than that value, you might be lucky.

> Starting an OSD with numactl and binding it to one core might aggravate
> the problem, as all the threads spawned by that OSD will compete for
> the CPU on a single core; an OSD with the default configuration has 20+
> threads. Binding the OSD process to one core using taskset does not
> help either, as some memory (especially the heap) may already be
> allocated on the other NUMA node.

this is not true if you start the process under numactl, is it? but binding an OSD to a NUMA domain makes sense.

> The design principle followed seems to be to fan out by spawning
> multiple threads at each pipelining stage to utilize the available
> cores in the system. Because the IOs won't complete on the same core
> they were issued on, a lot of cycles are lost to cache coherency.

is intel HT a solution/help for this? turn on HT and start the OSD on the L2 (e.g. with hwloc-bind).

as a more general question: the recommendation for ceph is to have one cpu core for each OSD; can these be HT cores or do they need to be actual physical cores?

stijn
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
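[Editorial note: the HT-or-physical-core topology is easy to inspect, and hwloc can indeed start a process bound to a socket as Stijn suggests. Commands below are illustrative only; the OSD id is an example.]

# threads per core, cores per socket
lscpu | grep -E 'Thread|Core|Socket'

# which logical CPUs are HT siblings of cpu0
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# run a command bound to the first socket (hwloc package)
hwloc-bind socket:0 -- ceph-osd -i 12 -c /etc/ceph/ceph.conf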
[ceph-users] Pgs are in stale+down+peering state
Hi all,

I used the 'ceph osd thrash' command, and after all osds are up and in, 3 pgs are in stale+down+peering state.

sudo ceph -s
    cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
     health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
     monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
     osdmap e17031: 64 osds: 64 up, 64 in
      pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
            12501 GB used, 10975 GB / 23476 GB avail
                2145 active+clean
                   3 stale+down+peering

sudo ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
pg 0.4d is stuck inactive for 341048.948643, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck inactive for 341048.948667, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck inactive for 341048.949362, current state stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck unclean for 341048.948665, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck unclean for 341048.948687, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck unclean for 341048.949382, current state stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck stale for 339823.956929, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck stale for 339823.956930, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck stale for 339823.956925, current state stale+down+peering, last acting [12,25,23]

Please, can anyone explain why the pgs are in this state?

Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
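[Editorial note: a few commands that can help narrow down where such PGs are stuck; the pg ids and OSD id are taken from the output above, everything else is generic.]

# list everything the cluster considers stuck in 'stale'
ceph pg dump_stuck stale

# where does the cluster think pg 0.4d should live right now?
ceph pg map 0.4d

# ask the (believed) primary for its view; this fails if the OSD never
# re-registered the pg after the thrash
ceph pg 0.4d query

# osd.12 is the reported primary for all three pgs; restarting it is
# often enough to get peering moving again
sudo service ceph restart osd.12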
Re: [ceph-users] Timeout on ceph-disk activate
I would run that one command (sudo ceph-disk -v activate --mark-init sysvinit --mount /data/osd ) on the hp10 box and see what is going on when you do so. On Thu, Sep 18, 2014 at 12:09 PM, BG bglac...@nyx.com wrote: I've hit a timeout issue on calls to ceph-disk activate. Initially, I followed the 'Storage Cluster Quick Start' on the CEPH website to get a cluster up and running. I wanted to tweak the configuration however and decided to blow away the initial setup using the purge / purgedata / forgetkeys commands with ceph-deploy. Next time around I'm getting a timeout error when attempting to activate an OSD on two out of the three boxes I'm using: [ceph_deploy.cli][INFO ] Invoked (1.5.15): /usr/bin/ceph-deploy osd activate hp10:/data/osd [ceph_deploy.osd][DEBUG ] Activating cluster ceph disks hp10:/data/osd: [hp10][DEBUG ] connected to host: hp10 [hp10][DEBUG ] detect platform information from remote host [hp10][DEBUG ] detect machine type [ceph_deploy.osd][INFO ] Distro info: CentOS Linux 7.0.1406 Core [ceph_deploy.osd][DEBUG ] activating host hp10 disk /data/osd [ceph_deploy.osd][DEBUG ] will use init type: sysvinit [hp10][INFO ] Running command: sudo ceph-disk -v activate --mark-init sysvinit --mount /data/osd [hp10][WARNIN] No data was received after 300 seconds, disconnecting... [hp10][INFO ] checking OSD status... [hp10][INFO ] Running command: sudo ceph --cluster=ceph osd stat --format=json This is on CentOS 7, ceph-deploy version is 1.5.15. The firewalld service is disabled, network connectivity should be good as the cluster previously worked on these boxes. Any suggestions where I should start looking to track down the root cause of the timeout? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
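[Editorial note: a rough sketch of what that manual run plus a basic reachability check might look like on the affected box; paths are as in the original report, the timeout value is arbitrary.]

# on hp10, run the exact command ceph-deploy timed out on, verbosely
sudo ceph-disk -v activate --mark-init sysvinit --mount /data/osd

# in another shell, check whether the box can reach the monitors at all;
# a hang here points at leftover keys/config from the purged cluster
timeout 30 sudo ceph -s
ls -l /etc/ceph/ /var/lib/ceph/bootstrap-osd/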
Re: [ceph-users] ceph health related message
I had this happen to me as well. Turned out to be a connlimit thing for me. I would check dmesg/kernel log and see if you see any conntrack limit reached connection dropped messages then increase connlimit. Odd as I connected over ssh for this but I can't deny syslog. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
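[Editorial note: if it is the same conntrack/connlimit issue, the kernel log and a couple of sysctls make it easy to confirm; the limit value below is only an example.]

# look for dropped-connection messages, e.g.
# "nf_conntrack: table full, dropping packet"
dmesg | grep -i conntrack

# check and raise the conntrack table size
sysctl net.netfilter.nf_conntrack_max
sysctl -w net.netfilter.nf_conntrack_max=262144

# and check for any connlimit rules in the firewall
iptables -S | grep -i connlimit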
Re: [ceph-users] IRQ balancing, distribution
On 09/22/2014 01:55 AM, Christian Balzer wrote: Hello, not really specific to Ceph, but since one of the default questions by the Ceph team when people are facing performance problems seems to be Have you tried turning it off and on again? ^o^ err, Are all your interrupts on one CPU? I'm going to wax on about this for a bit and hope for some feedback from others with different experiences and architectures than me. This may be a result of me harping about this after a customer's clusters had mysterious performance issues and where irqbalance didn't appear to be working properly. :) Now firstly that question if all your IRQ handling is happening on the same CPU is a valid one, as depending on a bewildering range of factors ranging from kernel parameters to actual hardware one often does indeed wind up with that scenario, usually with all on CPU0. Which certainly is the case with all my recent hardware and Debian kernels. Yes, there are certainly a lot of scenarios where this can happen. I think the hope has been that with MSI-X, interrupts will get evenly distributed by default and that is typically better than throwing them all at core 0, but things are still quite complicated. I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and thus feedback from Intel users is very much sought after, as I'm considering Intel based storage nodes in the future. It's vaguely amusing that Ceph storage nodes seem to have more CPU (individual core performance, not necessarily # of cores) and similar RAM requirements than my VM hosts. ^o^ It might be reasonable to say that Ceph is a pretty intensive piece of software. With lots of OSDs on a system there are hundreds if not thousands of threads. Under heavy load conditions the CPUs, network cards, HBAs, memory, socket interconnects, possibly SAS expanders are all getting worked pretty hard and possibly in unusual ways where both throughput and latency are important. At the cluster scale things like switch bisection bandwidth and network topology become issues too. High performance clustered storage is imho one of the most complicated performance subjects in computing. The good news is that much of this can be avoided by sticking to simple designs with fewer OSDs per node. The more OSDs you try to stick in 1 system, the more you need to worry about all of this if you care about high performance. So the common wisdom is that all IRQs on one CPU is a bad thing, lest it gets overloaded and for example drop network packets because of this. And while that is true, I'm hard pressed to generate any load on my clusters where the IRQ ratio on CPU0 goes much beyond 50%. Thus it should come as no surprise that spreading out IRQs with irqbalance or more accurately by manually setting the /proc/irq/xx/smp_affinity mask doesn't give me any discernible differences when it comes to benchmark results. Ok, that's fine, but this is pretty subjective. Without knowing the load and the hardware setup I don't think we can really draw any conclusions other than that in your test on your hardware this wasn't the bottleneck. With irqbalance spreading things out willy-nilly w/o any regards or knowledge about the hardware and what IRQ does what it's definitely something I won't be using out of the box. This goes especially for systems with different NUMA regions without proper policyscripts for irqbalance. I believe irqbalance takes PCI topology into account when making mapping decisions. 
See: http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html So for my current hardware I'm going to keep IRQs on CPU0 and CPU1 which are the same Bulldozer module and thus sharing L2 and L3 cache. In particular the AHCI (journal SSDs) and HBA or RAID controller IRQs on CPU0 and the network (Infiniband) on CPU1. That should give me sufficient reserves in processing power and keep intra core (module) and NUMA (additional physical CPUs) traffic to a minimum. This also will (within a certain load range) allow these 2 CPUs (module) to be ramped up to full speed while other cores can remain at a lower frequency. So it's been a while since I looked at AMD CPU interconnect topology, but back in the magnycours era I drew up some diagrams: 2 socket: https://docs.google.com/drawings/d/1_egexLqN14k9bhoN2nkv3iTgAbbPcwuwJmhwWAmakwo/edit?usp=sharing 4 socket: https://docs.google.com/drawings/d/1V5sFSInKq3uuKRbETx1LVOURyYQF_9Z4zElPrl1YIrw/edit?usp=sharing I think Interlagos looks somewhat similar from a hypertransport perspective. My gut instinct is that you really want to keep everything you can local to the socket on these kinds of systems. So if your HBA is on the first socket, you want your processing and interrupt handling there too. In the 4-socket configuration this is especially true. It's entirely possible that you may have to go through both an on-die and a inter-socket HT link before you get to a neighbour
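[Editorial note: to see which socket/NUMA node a given HBA or NIC actually hangs off, which is what matters for the locality Mark describes, something like this works on most distros; the PCI address and interface name are examples, and -1 means the platform does not report the information.]

# full topology, including which PCI devices sit under which NUMA node
lstopo-no-graphics

# or per device via sysfs
cat /sys/class/net/ib0/device/numa_node
cat /sys/bus/pci/devices/0000:05:00.0/numa_node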
Re: [ceph-users] Newbie Ceph Design Questions
On Mon, 22 Sep 2014 13:35:26 +0200 Udo Lembke wrote: Hi Christian, On 22.09.2014 05:36, Christian Balzer wrote: Hello, On Sun, 21 Sep 2014 21:00:48 +0200 Udo Lembke wrote: Hi Christian, On 21.09.2014 07:18, Christian Balzer wrote: ... Personally I found ext4 to be faster than XFS in nearly all use cases and the lack of full, real kernel integration of ZFS is something that doesn't appeal to me either. a little bit OT... what kind of ext4-mount options do you use? I have an 5-node cluster with xfs (60 osds), and perhaps the performance with ext4 would be better?! Hard to tell w/o testing your particular load, I/O patterns. When benchmarking directly with single disks or RAIDs it is fairly straightforward to see: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028540.html Also note that the actual question has never been answered by the Ceph team, which is a shame as I venture that it would make things faster. do you run your cluster without filestore_xattr_use_omap = true or with due missing answer (to be on the safe side)?? For the time being at the default, aka filestore_xattr_use_omap = true. Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Adding another radosgw node
Hi We've got a three node ceph cluster, and radosgw on a fourth machine. We would like to add another radosgw machine for high availability. Here are a few questions I have: - We aren't expecting to deploy to multiple regions and zones anywhere soon. So presumably, we do not have to worry about federated deployment. Would it be hard to move to a federated deployment later? - What is a radosgw instance? I was guessing that it was a machine running radosgw. If not, is it a separate gateway with a separate set of user and pools, possibly running on the same machine? - Can I simply deploy another radosgw machine with the same configuration as the first one? If the second interpretation is true, I guess I could. - Am I right that all gateway users go in the same keyring, which is copied to all the gateway nodes and all the monitor nodes? - The gateway nodes obviously need a [client.radosgw.{instance-name}] stanza in /etc/ceph.conf. Do the monitor nodes also need a copy of the stanza? - Do the gateway nodes need all of the monitors' [global] stanza in their /etc/ceph.conf? Presumably, they at least need mon_host to know who to talk to. What else? Regards Jon Jon Kåre Hellan, UNINETT AS, Trondheim, Norway ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
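[Editorial note: not an answer to all of the questions, but for reference this is roughly what a per-gateway stanza looked like in that era of the radosgw documentation. Instance name, host and paths are examples only; each gateway host would carry its own instance name and keyring.]

[client.radosgw.gw1]
host = gw1
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gw1.fastcgi.sock
log file = /var/log/ceph/client.radosgw.gw1.log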
Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Stale means that the primary OSD for the PG went down and the status is stale. They all seem to be from OSD.12... Seems like something is preventing that OSD from reporting to the mon?

sage

On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa sahana.lokesha...@sandisk.com wrote:
[...]

--
Sent from Kaiten Mail. Please excuse my brevity.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Hi Sage, To give more context on this problem, This cluster has two pools rbd and user-created. Osd.12 is a primary for some other PG’s , but the problem happens for these three PG’s. $ sudo ceph osd lspools 0 rbd,2 pool1, $ sudo ceph -s cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked 32 sec monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3 osdmap e17842: 64 osds: 64 up, 64 in pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects 12504 GB used, 10971 GB / 23476 GB avail 2145 active+clean 3 stale+down+peering Snippet from pg dump: 2.a9518 0 0 0 0 2172649472 30013001 active+clean2014-09-22 17:49:35.357586 6826'35762 17842:72706 [12,7,28] 12 [12,7,28] 12 6826'35762 2014-09-22 11:33:55.985449 0'0 2014-09-16 20:11:32.693864 0.590 0 0 0 0 0 0 0 active+clean2014-09-22 17:50:00.751218 0'0 17842:4472 [12,41,2] 12 [12,41,2] 12 0'0 2014-09-22 16:47:09.315499 0'0 2014-09-16 12:20:48.618726 0.4d0 0 0 0 0 0 4 4 stale+down+peering 2014-09-18 17:51:10.038247 186'4 11134:498 [12,56,27] 12 [12,56,27] 12 186'42014-09-18 17:30:32.393188 0'0 2014-09-16 12:20:48.615322 0.490 0 0 0 0 0 0 0 stale+down+peering 2014-09-18 17:44:52.681513 0'0 11134:498 [12,6,25] 12 [12,6,25] 12 0'0 2014-09-18 17:16:12.986658 0'0 2014-09-16 12:20:48.614192 0.1c0 0 0 0 0 0 12 12 stale+down+peering 2014-09-18 17:51:16.735549 186'12 11134:522 [12,25,23] 12 [12,25,23] 12 186'12 2014-09-18 17:16:04.457863 186'10 2014-09-16 14:23:58.731465 2.17510 0 0 0 0 2139095040 30013001 active+clean2014-09-22 17:52:20.364754 6784'30742 17842:72033 [12,27,23] 12 [12,27,23] 12 6784'30742 2014-09-22 00:19:39.905291 0'0 2014-09-16 20:11:17.016299 2.7e8 508 0 0 0 0 2130706432 34333433 active+clean2014-09-22 17:52:20.365083 6702'21132 17842:64769 [12,25,23] 12 [12,25,23] 12 6702'21132 2014-09-22 17:01:20.546126 0'0 2014-09-16 14:42:32.079187 2.6a5 528 0 0 0 0 2214592512 28402840 active+clean2014-09-22 22:50:38.092084 6775'34416 17842:83221 [12,58,0] 12 [12,58,0] 12 6775'34416 2014-09-22 22:50:38.091989 0'0 2014-09-16 20:11:32.703368 And we couldn’t observe and peering events happening on the primary osd. $ sudo ceph pg 0.49 query Error ENOENT: i don't have pgid 0.49 $ sudo ceph pg 0.4d query Error ENOENT: i don't have pgid 0.4d $ sudo ceph pg 0.1c query Error ENOENT: i don't have pgid 0.1c Not able to explain why the peering was stuck. BTW, Rbd pool doesn’t contain any data. Varada From: Ceph-community [mailto:ceph-community-boun...@lists.ceph.com] On Behalf Of Sage Weil Sent: Monday, September 22, 2014 10:44 PM To: Sahana Lokeshappa; ceph-users@lists.ceph.com; ceph-us...@ceph.com; ceph-commun...@lists.ceph.com Subject: Re: [Ceph-community] Pgs are in stale+down+peering state Stale means that the primary OSD for the PG went down and the status is stale. They all seem to be from OSD.12... Seems like something is preventing that OSD from reporting to the mon? 
sage

On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa sahana.lokesha...@sandisk.com wrote:
[...]
[ceph-users] XenServer and Ceph - any updates?
Hello guys, I was wondering if there has been any updates on getting XenServer ready for ceph? I've seen a howto that was written well over a year ago (I think) for a PoC integration of XenServer and Ceph. However, I've not seen any developments lately.It would be cool to see other hypervisors adapting Ceph )) Cheers Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Reassigning admin server
If I have a machine/VM I am using as an Admin node for a ceph cluster, can I relocate that admin to another machine/VM after I've built a cluster? I would expect as the Admin isn't an actual operating part of the cluster itself (other than Calamari, if it happens to be running) the rest of the cluster should be adequately served with a -update-conf. -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2014 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bcache / Enhanceio with osds
We are still in the middle of testing things, but so far we have had more improvement with SSD journals than the OSD cached with bcache (five OSDs fronted by one SSD). We still have yet to test if adding a bcache layer in addition to the SSD journals provides any additional improvements. Robert LeBlanc On Sun, Sep 14, 2014 at 6:13 PM, Mark Nelson mark.nel...@inktank.com wrote: On 09/14/2014 05:11 PM, Andrei Mikhailovsky wrote: Hello guys, Was wondering if anyone uses or done some testing with using bcache or enhanceio caching in front of ceph osds? I've got a small cluster of 2 osd servers, 16 osds in total and 4 ssds for journals. I've recently purchased four additional ssds to be used for ceph cache pool, but i've found performance of guest vms to be slower with the cache pool for many benchmarks. The write performance has slightly improved, but the read performance has suffered a lot (as much as 60% in some tests). Therefore, I am planning to scrap the cache pool (at least until it matures) and use either bcache or enhanceio instead. We're actually looking at dm-cache a bit right now. (and talking some of the developers about the challenges they are facing to help improve our own cache tiering) No meaningful benchmarks of dm-cache yet though. Bcache, enhanceio, and flashcache all look interesting too. Regarding the cache pool: we've got a couple of ideas that should help improve performance, especially for reads. There are definitely advantages to keeping cache local to the node though. I think some form of local node caching could be pretty useful going forward. Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bcache / Enhanceio with osds
Likely it won't since the OSD is already coalescing journal writes. FWIW, I ran through a bunch of tests using seekwatcher and blktrace at 4k, 128k, and 4m IO sizes on a 4 OSD cluster (3x replication) to get a feel for what the IO patterns are like for the dm-cache developers. I included both the raw blktrace data and seekwatcher graphs here: http://nhm.ceph.com/firefly_blktrace/ there are some interesting patterns but they aren't too easy to spot (I don't know why the Chris decided to use blue and green by default!) Mark On 09/22/2014 04:32 PM, Robert LeBlanc wrote: We are still in the middle of testing things, but so far we have had more improvement with SSD journals than the OSD cached with bcache (five OSDs fronted by one SSD). We still have yet to test if adding a bcache layer in addition to the SSD journals provides any additional improvements. Robert LeBlanc On Sun, Sep 14, 2014 at 6:13 PM, Mark Nelson mark.nel...@inktank.com mailto:mark.nel...@inktank.com wrote: On 09/14/2014 05:11 PM, Andrei Mikhailovsky wrote: Hello guys, Was wondering if anyone uses or done some testing with using bcache or enhanceio caching in front of ceph osds? I've got a small cluster of 2 osd servers, 16 osds in total and 4 ssds for journals. I've recently purchased four additional ssds to be used for ceph cache pool, but i've found performance of guest vms to be slower with the cache pool for many benchmarks. The write performance has slightly improved, but the read performance has suffered a lot (as much as 60% in some tests). Therefore, I am planning to scrap the cache pool (at least until it matures) and use either bcache or enhanceio instead. We're actually looking at dm-cache a bit right now. (and talking some of the developers about the challenges they are facing to help improve our own cache tiering) No meaningful benchmarks of dm-cache yet though. Bcache, enhanceio, and flashcache all look interesting too. Regarding the cache pool: we've got a couple of ideas that should help improve performance, especially for reads. There are definitely advantages to keeping cache local to the node though. I think some form of local node caching could be pretty useful going forward. Thanks Andrei _ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com _ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
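[Editorial note: for anyone wanting to reproduce this kind of capture on their own OSDs, the usual pairing is blktrace for the capture and seekwatcher for the plot; device, run time and file names below are examples.]

# trace one OSD data disk for 60 seconds while a benchmark runs
blktrace -d /dev/sdb -o osd0 -w 60

# turn the trace into a seek/throughput graph
seekwatcher -t osd0 -o osd0.png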
[ceph-users] Ceph Day Speaking Slots
Hey cephers, As we finalize the next couple schedules for Ceph Days in NYC and London it looks like there are still a couple of speaking slots open. If you are available in NYC on 08 OCT or in London on 22 OCT and would be interested in speaking about your Ceph experiences (of any kind) please contact me as soon as possible. Thanks. http://ceph.com/cephdays/ Best Regards, Patrick McGarry Director Ceph Community || Red Hat http://ceph.com || http://community.redhat.com @scuttlemonkey || @ceph ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bcache / Enhanceio with osds
I've done a bit of testing with Enhanceio on my cluster and I can see a definite improvement in read performance for cached data. The performance increase is around 3-4 times the cluster speed prior to using enhanceio, based on large block size IO (1M and 4M). I've done a concurrent test of running a single "dd if=/dev/vda of=/dev/null bs=1M/4M iflag=direct" instance over 20 vms which were running on 4 host servers. Prior to enhanceio I was getting around 30-35MB/s per guest vm regardless of how many times I ran the test. With enhanceio (from the second run) I was hitting over 130MB/s per vm. I've not seen any lag in the performance of other vms while using enhanceio, unlike the considerable lag without it. The ssd disk utilisation was not hitting much over 60%.

The small block size (4K) performance hasn't changed with enhanceio, which made me think that the performance of the osds themselves is limited when using small block sizes. I wasn't getting much over 2-3MB/s per guest vm.

On the contrary, when I tried to use the firefly cache pool on the same hardware, my cluster performed significantly slower with the cache pool. The whole cluster seemed under a lot more load and the performance dropped to around 12-15MB/s, and other guest vms were very very slow. The ssd disks were utilised 100% all the time during the test, with a majority of write IO.

I admit that these tests shouldn't be considered a definitive and full performance test of a ceph cluster, as this is a live cluster with disk io activity outside of the test vms. The average load is not much (300-500 IO/s), mainly reads. However, it still indicates that there is room for improvement in ceph's cache pool implementation. Looking at my results, I think ceph is missing a lot of hits on the read cache, which causes the osds to write a lot of data. With enhanceio I was getting well over a 50% read hit ratio, and the main activity on the ssds was read io, unlike with ceph.

Outside of the tests, I've left enhanceio running on the osd servers. It has been a few days now and the hit ratio on the osds is around 8-11%, which seems a bit low. I was wondering if I should change the default block size of enhanceio to 2K instead of the default 4K. Taking into account ceph's object size of 4M, I am not sure if this will help the hit ratio. Does anyone have an idea?

Andrei

- Original Message -
From: Mark Nelson mark.nel...@inktank.com
To: Robert LeBlanc rob...@leblancnet.us, Mark Nelson mark.nel...@inktank.com
Cc: ceph-users@lists.ceph.com
Sent: Monday, 22 September, 2014 10:49:42 PM
Subject: Re: [ceph-users] Bcache / Enhanceio with osds
[...]
Re: [ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left
Hi Christian,

Your problem is probably that your kernel.pid_max (the maximum threads+processes across the entire system) needs to be increased - the default is 32768, which is too low for even a medium density deployment. You can test this easily enough with

$ ps axms | wc -l

If you get a number around the 30,000 mark then you are going to be affected. There's an issue here http://tracker.ceph.com/issues/6142 , although it doesn't seem to have gotten much traction in terms of informing users.

Regards
Nathan

On 15/09/2014 7:13 PM, Christian Eichelmann wrote:
Hi all,

I have no idea why running out of filehandles should produce an out of memory error, but well. I've increased the ulimit as you told me, and nothing changed. I've noticed that the osd init script sets the max open file handles explicitly, so I was setting the corresponding option in my ceph conf. Now the limits of an OSD process look like this:

Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            8388608       unlimited     bytes
Max core file size        unlimited     unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             2067478       2067478       processes
Max open files            65536         65536         files
Max locked memory         65536         65536         bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       2067478       2067478       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

Anyways, the exact same behavior as before. I was also finding a mail on this list from someone who had the exact same problem:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040059.html
Unfortunately, there was also no real solution for this problem. So again: this is *NOT* a ulimit issue. We were running emperor and dumpling on the same hardware without any issues. They first started after our upgrade to firefly.

Regards,
Christian

Am 12.09.2014 18:26, schrieb Christian Balzer:
On Fri, 12 Sep 2014 12:05:06 -0400 Brian Rak wrote:
That's not how ulimit works. Check the `ulimit -a` output.
Indeed. And to forestall the next questions, see man initscript, mine looks like this:
---
ulimit -Hn 131072
ulimit -Sn 65536
# Execute the program.
eval exec $4
---
And also a /etc/security/limits.d/tuning.conf (debian) like this:
---
root  soft  nofile  65536
root  hard  nofile  131072
*     soft  nofile  16384
*     hard  nofile  65536
---
Adjusted to your actual needs. There might be other limits you're hitting, but that is the most likely one.
Also 45 OSDs with 12 (24 with HT, bleah) CPU cores is pretty ballsy. I personally would rather do 4 RAID6 (10 disks, with OSD SSD journals) with that kind of case and enjoy the fact that my OSDs never fail. ^o^
Christian (another one)

On 9/12/2014 10:15 AM, Christian Eichelmann wrote:
Hi,
I am running all commands as root, so there are no limits for the processes.
Regards,
Christian
___
Von: Mariusz Gronczewski [mariusz.gronczew...@efigence.com]
Gesendet: Freitag, 12. September 2014 15:33
An: Christian Eichelmann
Cc: ceph-users@lists.ceph.com
Betreff: Re: [ceph-users] OSDs are crashing with Cannot fork or cannot create thread but plenty of memory is left

do cat /proc/pid/limits
probably you hit max processes limit or max FD limit

Hi Ceph-Users,
I have absolutely no idea what is going on on my systems...
Hardware:
45 x 4TB Harddisks
2 x 6 Core CPUs
256GB Memory

When initializing all disks and joining them to the cluster, after approximately 30 OSDs, other OSDs are crashing. When I try to start them again I see different kinds of errors. For example:

Starting Ceph osd.316 on ceph-osd-bs04...already running
=== osd.317 ===
Traceback (most recent call last):
  File "/usr/bin/ceph", line 830, in <module>
    sys.exit(main())
  File "/usr/bin/ceph", line 773, in main
    sigdict, inbuf, verbose)
  File "/usr/bin/ceph", line 420, in new_style_command
    inbuf=inbuf)
  File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line 1112, in json_command
    raise RuntimeError('{0}: exception {1}'.format(cmd, e))
NameError: global name 'cmd' is not defined
Exception thread.error: error("can't start new thread",) in <bound method Rados.__del__ of <rados.Rados object at 0x29ee410>> ignored

or:

/etc/init.d/ceph: 190: /etc/init.d/ceph: Cannot fork
/etc/init.d/ceph: 191:
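[Editorial note: following Nathan's kernel.pid_max suggestion at the top of this thread, a minimal sketch of checking and raising the limit. The value 4194303 is only an illustrative choice, not a recommendation from this thread; pick a value that fits your deployment.]
---
# How many threads/processes are in use, and what is the current limit?
$ ps axms | wc -l
$ sysctl kernel.pid_max

# Raise the limit at runtime (example value, adjust as needed).
$ sysctl -w kernel.pid_max=4194303

# Make the change persistent across reboots.
$ echo 'kernel.pid_max = 4194303' > /etc/sysctl.d/pid_max.conf
---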
[ceph-users] get amount of space used by snapshots
Hello,

If I have an rbd image and a series of snapshots of that image, is there a fast way to determine how much space the objects composing the original image and all the snapshots are using in the cluster, or even just the space used by the snaps?

The only way I've been able to find so far is to get the block_name_prefix for the image with rbd info and then grep for that prefix in the output of rados ls, e.g.

rados ls | grep rb.0.396de.238e1f29 | wc -l

This is relatively slow, printing ~250 objects/s, which means hours to count through 10s of TB of objects.

Basically, if I'm keeping daily snapshots for a set of images, I'd like to be able to tell how much space those snapshots are using so I can determine how frequently I need to prune old snaps.

Thanks!

-Steve

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
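[Editorial note: to put a rough number on the space behind a prefix, the object count from the rados ls approach described above can be multiplied by the image's object size (4 MB for the default order 22). A minimal sketch, assuming a hypothetical image name; since objects can be sparse or only partially written, this gives an upper bound rather than exact usage.]
---
POOL=rbd
IMG=myimage                                   # hypothetical image name
# Extract the block_name_prefix reported by rbd info.
PREFIX=$(rbd -p $POOL info $IMG | awk '/block_name_prefix/ {print $2}')
# Count objects carrying that prefix (slow on large pools, as noted above).
COUNT=$(rados -p $POOL ls | grep -c "^$PREFIX")
# Assuming 4 MB objects (order 22); adjust if rbd info reports another order.
echo "$IMG: $COUNT objects, at most $((COUNT * 4)) MB"
---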
Re: [ceph-users] IRQ balancing, distribution
Hello, On Mon, 22 Sep 2014 08:55:48 -0500 Mark Nelson wrote: On 09/22/2014 01:55 AM, Christian Balzer wrote: Hello, not really specific to Ceph, but since one of the default questions by the Ceph team when people are facing performance problems seems to be Have you tried turning it off and on again? ^o^ err, Are all your interrupts on one CPU? I'm going to wax on about this for a bit and hope for some feedback from others with different experiences and architectures than me. This may be a result of me harping about this after a customer's clusters had mysterious performance issues and where irqbalance didn't appear to be working properly. :) Now firstly that question if all your IRQ handling is happening on the same CPU is a valid one, as depending on a bewildering range of factors ranging from kernel parameters to actual hardware one often does indeed wind up with that scenario, usually with all on CPU0. Which certainly is the case with all my recent hardware and Debian kernels. Yes, there are certainly a lot of scenarios where this can happen. I think the hope has been that with MSI-X, interrupts will get evenly distributed by default and that is typically better than throwing them all at core 0, but things are still quite complicated. I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and thus feedback from Intel users is very much sought after, as I'm considering Intel based storage nodes in the future. It's vaguely amusing that Ceph storage nodes seem to have more CPU (individual core performance, not necessarily # of cores) and similar RAM requirements than my VM hosts. ^o^ It might be reasonable to say that Ceph is a pretty intensive piece of software. With lots of OSDs on a system there are hundreds if not thousands of threads. Under heavy load conditions the CPUs, network cards, HBAs, memory, socket interconnects, possibly SAS expanders are all getting worked pretty hard and possibly in unusual ways where both throughput and latency are important. At the cluster scale things like switch bisection bandwidth and network topology become issues too. High performance clustered storage is imho one of the most complicated performance subjects in computing. Nobody will argue that. ^.^ The good news is that much of this can be avoided by sticking to simple designs with fewer OSDs per node. The more OSDs you try to stick in 1 system, the more you need to worry about all of this if you care about high performance. I'd say that 8 OSDs isn't exactly dense (my case), but the advantages of less densely populated nodes come with the significant price tag of rack space and hardware costs. So the common wisdom is that all IRQs on one CPU is a bad thing, lest it gets overloaded and for example drop network packets because of this. And while that is true, I'm hard pressed to generate any load on my clusters where the IRQ ratio on CPU0 goes much beyond 50%. Thus it should come as no surprise that spreading out IRQs with irqbalance or more accurately by manually setting the /proc/irq/xx/smp_affinity mask doesn't give me any discernible differences when it comes to benchmark results. Ok, that's fine, but this is pretty subjective. Without knowing the load and the hardware setup I don't think we can really draw any conclusions other than that in your test on your hardware this wasn't the bottleneck. Of course, I can only realistically talk about what I have tested and thus invited feedback from others. 
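[Editorial note: for readers who want to try the manual pinning via /proc/irq/xx/smp_affinity mentioned above rather than irqbalance, a minimal sketch; the IRQ numbers and the grep pattern are illustrative and must be read off /proc/interrupts on the actual machine, and irqbalance should be stopped first or it may rewrite the masks.]
---
# Find the IRQ numbers of the storage controller and the network card.
$ grep -iE 'ahci|mpt|mlx|ib' /proc/interrupts

# Pin IRQ 42 (storage, hypothetical number) to CPU0 and
# IRQ 43 (network, hypothetical number) to CPU1.
$ echo 1 > /proc/irq/42/smp_affinity    # hex bitmask: CPU0
$ echo 2 > /proc/irq/43/smp_affinity    # hex bitmask: CPU1
---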
I can certainly see situations where this could be an issue with Ceph and do have experience with VM hosts that benefited from spreading IRQ handling over more than one CPU. What I'm trying to get across is for people to not fall into a cargo cult trap and think/examine things for themselves, as blindly turning on indiscriminate IRQ balancing might do more harm than good in certain scenarios. With irqbalance spreading things out willy-nilly w/o any regards or knowledge about the hardware and what IRQ does what it's definitely something I won't be using out of the box. This goes especially for systems with different NUMA regions without proper policyscripts for irqbalance. I believe irqbalance takes PCI topology into account when making mapping decisions. See: http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html I'm sure it tries to do the right thing and it gets at least some things right, like what my system (single Opteron 4386) looks like: --- Package 0: numa_node is 0 cpu mask is 00ff (load 0) Cache domain 0: numa_node is 0 cpu mask is 0003 (load 0) CPU number 0 numa_node is 0 (load 0) CPU number 1 numa_node is 0 (load 0) Cache domain 1: numa_node is 0 cpu mask is 000c (load 0) CPU number 2 numa_node is 0 (load 0)