Re: [ceph-users] Improving Performance with more OSD's?
Hi Udo,

Lindsay did this for performance reasons so that the data is spread evenly
over the disks; I believe it has been accepted that the remaining 2TB on the
3TB disks will not be used.

Nick

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Udo Lembke
Sent: 05 January 2015 07:15
To: Lindsay Mathieson
Cc: ceph-us...@ceph.com ceph-users
Subject: Re: [ceph-users] Improving Performance with more OSD's?

Hi Lindsay,

On 05.01.2015 06:52, Lindsay Mathieson wrote:
> ...
> So two OSD Nodes had:
> - Samsung 840 EVO SSD for Op. Sys.
> - Intel 530 SSD for Journals (10GB Per OSD)
> - 3TB WD Red
> - 1 TB WD Blue
> - 1 TB WD Blue
> - Each disk weighted at 1.0
> - Primary affinity of the WD Red (slow) set to 0

The weight should be the size of the filesystem. With weight 1 for all disks
you run into trouble as your cluster fills, because the 1TB disks are full
before the 3TB disk! You should have something like 0.9 for the 1TB and 2.82
for the 3TB disks ( df -k | grep osd | awk '{print $2/(1024^3) }' ).

Udo
Re: [ceph-users] Improving Performance with more OSD's?
I've been having good results with OMD (Check_MK + Nagios). There is a plugin
for Ceph as well, which I made a small modification to so that it works with
a wider range of cluster sizes:

http://www.spinics.net/lists/ceph-users/msg13355.html

Nick

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Lindsay Mathieson
Sent: 05 January 2015 12:35
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Improving Performance with more OSD's?

On Mon, 5 Jan 2015 09:21:16 AM Nick Fisk wrote:
> Lindsay did this for performance reasons so that the data is spread evenly
> over the disks, I believe it has been accepted that the remaining 2TB on
> the 3TB disks will not be used.

Exactly, thanks Nick. I only have a terabyte of data, and it's not going to
grow much, if at all. With 3 OSDs per node the 1TB OSDs are only at 40%
utilisation, but you can bet I'll be keeping a close eye on that.

Next step, get nagios or icinga set up.

-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
On Mon, 5 Jan 2015 01:15:03 PM Nick Fisk wrote:
> I've been having good results with OMD (Check_MK + Nagios). There is a
> plugin for Ceph as well that I made a small modification to, to work with
> a wider range of cluster sizes

Thanks, I'll check it out. Currently trying Zabbix, which seems more
straightforward than Nagios.

-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
On Mon, 5 Jan 2015 09:21:16 AM Nick Fisk wrote:
> Lindsay did this for performance reasons so that the data is spread evenly
> over the disks, I believe it has been accepted that the remaining 2TB on
> the 3TB disks will not be used.

Exactly, thanks Nick. I only have a terabyte of data, and it's not going to
grow much, if at all. With 3 OSDs per node the 1TB OSDs are only at 40%
utilisation, but you can bet I'll be keeping a close eye on that.

Next step, get nagios or icinga set up.

-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
Well, I upgraded my cluster over the weekend :)

To each node I added:
- Intel SSD 530 for journals
- 2 * 1TB WD Blue

So two OSD nodes had:
- Samsung 840 EVO SSD for Op. Sys.
- Intel 530 SSD for journals (10GB per OSD)
- 3TB WD Red
- 1 TB WD Blue
- 1 TB WD Blue
- Each disk weighted at 1.0
- Primary affinity of the WD Red (slow) set to 0

Took about 8 hours for 1TB of data to rebalance over the OSDs.

Very pleased with the results so far. rados benchmark:
- Write bandwidth has increased from 49 MB/s to 140 MB/s
- Reads have stayed roughly the same at 500 MB/s

VM benchmarks:
- Have actually stayed much the same, but have more depth - multiple VMs
  share the bandwidth nicely. Users are finding their VMs *much* less laggy.

Thanks for all the help and suggestions.

Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
Hi Lindsay,

On 05.01.2015 06:52, Lindsay Mathieson wrote:
> ...
> So two OSD Nodes had:
> - Samsung 840 EVO SSD for Op. Sys.
> - Intel 530 SSD for Journals (10GB Per OSD)
> - 3TB WD Red
> - 1 TB WD Blue
> - 1 TB WD Blue
> - Each disk weighted at 1.0
> - Primary affinity of the WD Red (slow) set to 0

The weight should be the size of the filesystem. With weight 1 for all disks
you run into trouble as your cluster fills, because the 1TB disks are full
before the 3TB disk! You should have something like 0.9 for the 1TB and 2.82
for the 3TB disks ( df -k | grep osd | awk '{print $2/(1024^3) }' ).

Udo
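To apply that advice in practice, something like the following should do it.
The osd ids and exact weights here are illustrative only (check the real ones
with ceph osd tree first), and note that every reweight triggers some data
movement, so it is best done one OSD at a time on a quiet cluster:

  # show current CRUSH weights
  ceph osd tree

  # set the CRUSH weight to roughly the disk size in TB
  ceph osd crush reweight osd.0 2.82   # 3TB WD Red
  ceph osd crush reweight osd.1 0.90   # 1TB WD Blue
  ceph osd crush reweight osd.2 0.90   # 1TB WD Blue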
Re: [ceph-users] Improving Performance with more OSD's?
Hi,

On 29/12/14 15:12, Christian Balzer wrote:
>> 3rd Node - Monitor only, for quorum
>> - Intel Nuc
>> - 8GB RAM
>> - CPU: Celeron N2820
>
> Uh oh, a bit weak for a monitor. Where does the OS live (on this and the
> other nodes)? The leveldb (/var/lib/ceph/..) of the monitors likes it
> fast, SSDs preferably.

I have a small setup with such a node (only 4 GB RAM, another 2 good nodes
for OSD and virtualization) - it works like a charm and CPU max is always
under 5% in the graphs. It only peaks when backups are dumped to its 1TB
disk using NFS.

>> I'd prefer to use the existing third node (the Intel Nuc), but its
>> expansion is limited to USB3 devices. Are there USB3 external drives with
>> decent performance stats?
>
> I'd advise against it. That node doing both monitor and OSDs is not going
> to end well.

My experience has led me not to trust USB disks for continuous operation; I
wouldn't do this either.

Just my two cents,
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997 943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Re: [ceph-users] Improving Performance with more OSD's?
Hi,

On 30/12/14 11:55, Lindsay Mathieson wrote:
> On Tue, 30 Dec 2014 11:26:08 AM Eneko Lacunza wrote:
>> I have a small setup with such a node (only 4 GB RAM, another 2 good
>> nodes for OSD and virtualization) - it works like a charm and CPU max is
>> always under 5% in the graphs. It only peaks when backups are dumped to
>> its 1TB disk using NFS.
>
> Yes, CPU has not been a problem for me at all; I even occasionally run a
> Windows VM on the NUC. Sounds like we have very similar setups - 2 good
> nodes that run full OSDs, mon and VMs, and a third smaller node for
> quorum. Do you have OSDs on your third node as well?

No, I have never had a VM running on it; there are only 6 VMs in this
cluster and the other 2 nodes have plenty of RAM/CPU for them. I might try
it if one of the good nodes goes down ;)

>> I'd advise against it. That node doing both monitor and OSDs is not going
>> to end well. My experience has led me not to trust USB disks for
>> continuous operation, I wouldn't do this either.
>
> Yeah, it doesn't sound like a good idea. Pity, the NUCs are so small and
> quiet.

Yes. But I think the CPU would become a problem as soon as we put 1-2 OSDs
on that NUC. Maybe with a Core i3 NUC... :)

Cheers,
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997 943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Re: [ceph-users] Improving Performance with more OSD's?
On Tue, 30 Dec 2014 11:26:08 AM Eneko Lacunza wrote:
> I have a small setup with such a node (only 4 GB RAM, another 2 good nodes
> for OSD and virtualization) - it works like a charm and CPU max is always
> under 5% in the graphs. It only peaks when backups are dumped to its 1TB
> disk using NFS.

Yes, CPU has not been a problem for me at all; I even occasionally run a
Windows VM on the NUC.

Sounds like we have very similar setups - 2 good nodes that run full OSDs,
mon and VMs, and a third smaller node for quorum. Do you have OSDs on your
third node as well?

>> I'd advise against it. That node doing both monitor and OSDs is not going
>> to end well.
>
> My experience has led me not to trust USB disks for continuous operation,
> I wouldn't do this either.

Yeah, it doesn't sound like a good idea. Pity, the NUCs are so small and
quiet.

thanks,
-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
On Sun, Dec 28, 2014 at 02:49:08PM +0900, Christian Balzer wrote:
> You really, really want size 3 and a third node for both performance
> (reads) and redundancy.

How does it benefit read performance? I thought all reads are made only from
the active primary OSD.

-- 
Tomasz Kuzemko
tomasz.kuze...@ovh.net
Re: [ceph-users] Improving Performance with more OSD's?
On Mon, Dec 29, 2014 at 12:47 PM, Tomasz Kuzemko tomasz.kuze...@ovh.net wrote:
> On Sun, Dec 28, 2014 at 02:49:08PM +0900, Christian Balzer wrote:
>> You really, really want size 3 and a third node for both performance
>> (reads) and redundancy.
>
> How does it benefit read performance? I thought all reads are made only
> from the active primary OSD.

You'll have chunks of primary data scattered between three devices instead
of two, as each PG will have a random acting set (until you decide to pin the
primary).
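For completeness, "pinning" the primary away from a slow OSD is done with
primary affinity, which is what Lindsay later did with the WD Red. The osd id
below is only an example, and on firefly-era releases the monitors also need
'mon osd allow primary affinity = true' before the command takes effect:

  # never choose osd.2 (hypothetically the slow 3TB disk) as primary
  ceph osd primary-affinity osd.2 0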
Re: [ceph-users] Improving Performance with more OSD's?
Hello,

On Mon, 29 Dec 2014 00:05:40 +1000 Lindsay Mathieson wrote:
> Appreciate the detailed reply Christian.
>
> On Sun, 28 Dec 2014 02:49:08 PM Christian Balzer wrote:
>> On Sun, 28 Dec 2014 08:59:33 +1000 Lindsay Mathieson wrote:
>>> I'm looking to improve the raw performance on my small setup (2 Compute
>>> Nodes, 2 OSD's). Only used for hosting KVM images.
>>
>> This doesn't really make things clear, do you mean 2 STORAGE nodes with
>> 2 OSDs (HDDs) each?
>
> 2 Nodes, 1 OSD per node
>
> Hardware is identical for all nodes / disks:
> - Mobo: P9X79 WS
> - CPU: Intel Xeon E5-2620

Not particularly fast, but sufficient for about 4 OSDs.

> - RAM: 32 GB ECC

Good enough.

> - 1GB Nic Public Access
> - 2 * 1GB Bond for ceph

Is that a private cluster network just between Ceph storage nodes, or is
this for all Ceph traffic (including clients)? The latter would probably be
better; a private cluster network twice as fast as the client one isn't
particularly helpful 99% of the time.

> - OSD: 3TB WD Red
> - Journal: 10GB on Samsung 840 EVO
>
> 3rd Node - Monitor only, for quorum
> - Intel Nuc
> - 8GB RAM
> - CPU: Celeron N2820

Uh oh, a bit weak for a monitor. Where does the OS live (on this and the
other nodes)? The leveldb (/var/lib/ceph/..) of the monitors likes it fast,
SSDs preferably.

>> In either case that's a very small setup (and with a replication of 2 a
>> risky one, too), so don't expect great performance.
>
> Ok.
>
>> Throughput numbers aren't exactly worthless, but you will find IOPS to
>> be the killer in most cases. Also without describing how you measured
>> these numbers (rados bench, fio, bonnie, on the host, inside a VM) they
>> become even more muddled.
>
> - rados bench on the node to test raw write
> - fio in a VM
> - Crystal DiskMark in a windows VM to test IOPS
>
>> You really, really want size 3 and a third node for both performance
>> (reads) and redundancy.
>
> I can probably scare up a desktop PC to use as a fourth node with another
> 3TB disk.

The closer it is to the current storage nodes, the better. The slowest OSD
in a cluster can impede all (most of) the others.

> I'd prefer to use the existing third node (the Intel Nuc), but its
> expansion is limited to USB3 devices. Are there USB3 external drives with
> decent performance stats?

I'd advise against it. That node doing both monitor and OSDs is not going to
end well.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Improving Performance with more OSD's?
Hello,

On Mon, 29 Dec 2014 13:49:49 +0400 Andrey Korolyov wrote:
> On Mon, Dec 29, 2014 at 12:47 PM, Tomasz Kuzemko tomasz.kuze...@ovh.net wrote:
>> On Sun, Dec 28, 2014 at 02:49:08PM +0900, Christian Balzer wrote:
>>> You really, really want size 3 and a third node for both performance
>>> (reads) and redundancy.
>>
>> How does it benefit read performance? I thought all reads are made only
>> from the active primary OSD.
>
> You'll have chunks of primary data scattered between three devices instead
> of two, as each PG will have a random acting set (until you decide to pin
> the primary).

What Andrey wrote.

Reads will scale up (on a cluster basis, individual clients might not
benefit as much) linearly with each additional device (host/OSD).

Writes will scale up with each additional device divided by replica size.

Fun fact: if you have 1 node with replica 1 and add 2 more identical nodes
and increase the replica to 3, your write performance will be less than 50%
of the single node. Once you add a 4th node, write speed will increase again.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
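A back-of-the-envelope illustration of that write-scaling rule (the ~100 MB/s
per-disk figure is purely an assumption, and journal double-writes plus
network overhead are ignored):

  2 OSDs, replica 2:  2 x 100 MB/s / 2 = ~100 MB/s aggregate client writes
  6 OSDs, replica 2:  6 x 100 MB/s / 2 = ~300 MB/s
  6 OSDs, replica 3:  6 x 100 MB/s / 3 = ~200 MB/s
  3 OSDs, replica 3:  3 x 100 MB/s / 3 = ~100 MB/s on paper, i.e. no better
                      than the single replica-1 node; the real-world overhead
                      is what pushes it below 50% as described above.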
Re: [ceph-users] Improving Performance with more OSD's?
On Mon, 29 Dec 2014 11:12:06 PM Christian Balzer wrote:
> Is that a private cluster network just between Ceph storage nodes, or is
> this for all Ceph traffic (including clients)? The latter would probably
> be better; a private cluster network twice as fast as the client one isn't
> particularly helpful 99% of the time.

The latter - all Ceph traffic, including clients (qemu rbd).

>> 3rd Node - Monitor only, for quorum
>> - Intel Nuc
>> - 8GB RAM
>> - CPU: Celeron N2820
>
> Uh oh, a bit weak for a monitor. Where does the OS live (on this and the
> other nodes)? The leveldb (/var/lib/ceph/..) of the monitors likes it
> fast, SSDs preferably.

On a SSD (all the nodes have OS on SSD).

Looks like I misunderstood the purpose of the monitors; I presumed they were
just for monitoring node health. They do more than that?

> The closer it is to the current storage nodes, the better. The slowest OSD
> in a cluster can impede all (most of) the others.

Closer as in similar hardware specs?

-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
On Sun, 28 Dec 2014 04:08:03 PM Nick Fisk wrote:
> If you can't add another full host, your best bet would be to add another
> 2-3 disks to each server. This should give you a bit more performance.
> It's much better to have lots of small disks rather than large multi-TB
> ones from a performance perspective. So maybe look to see if you can get
> 500GB/1TB drives cheap.

Thanks, will do.

Can you set replica 3 with two nodes and 6-8 OSDs? One would have to tweak
the crush map?

-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
On Mon, 29 Dec 2014 11:29:11 PM Christian Balzer wrote:
> Reads will scale up (on a cluster basis, individual clients might not
> benefit as much) linearly with each additional device (host/OSD).

I'm taking that to mean individual clients as a whole will be limited by the
speed of individual OSDs, but multiple clients will spread their reads
between multiple OSDs, leading to a higher aggregate bandwidth than
individual disks could sustain. I guess the limiting factor there would be
the network.

> Writes will scale up with each additional device divided by replica size.

So adding OSDs will increase write speed from individual clients? Do
sequential writes go out to different OSDs simultaneously?

> Fun fact, if you have 1 node with replica 1 and add 2 more identical nodes
> and increase the replica to 3, your write performance will be less than
> 50% of the single node.

Interesting - this seems to imply that writes go to the replica OSDs one
after another, rather than simultaneously like I expected.

thanks,
-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
You would need to modify the crush map so that it would store two of the same
replicas on the same host; however, I'm not sure how you would go about this
and still make sure that at least 1 other replica is on a different host.

But to be honest, with the number of OSDs you will have, the data-loss
probability with a replica size of 2 is not as bad as when you have much
larger clusters, so you may decide that size 2 is fine. But as always, make
sure you have backups.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Lindsay Mathieson
Sent: 29 December 2014 22:24
To: Nick Fisk
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Improving Performance with more OSD's?

On Sun, 28 Dec 2014 04:08:03 PM Nick Fisk wrote:
> If you can't add another full host, your best bet would be to add another
> 2-3 disks to each server. This should give you a bit more performance.
> It's much better to have lots of small disks rather than large multi-TB
> ones from a performance perspective. So maybe look to see if you can get
> 500GB/1TB drives cheap.

Thanks, will do.

Can you set replica 3 with two nodes and 6-8 OSDs? One would have to tweak
the crush map?

-- 
Lindsay
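For what it's worth, the usual sketch of such a crush map change is a rule
that picks hosts first and then up to two OSDs per host, so size=3 ends up as
two copies on one host and one on the other. This is untested here, the rule
name, ruleset id and pool name are illustrative, and the rule syntax shown is
the firefly/giant-era format, so treat it only as a starting point:

  # dump, decompile, edit, recompile and inject the crush map
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt

  # add something like this rule to crush.txt:
  # rule replicated_2hosts {
  #         ruleset 1
  #         type replicated
  #         min_size 1
  #         max_size 4
  #         step take default
  #         step choose firstn 0 type host
  #         step chooseleaf firstn 2 type osd
  #         step emit
  # }

  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new

  # point the pool at the new rule
  ceph osd pool set rbd crush_ruleset 1

Even then, the two copies sharing a host still disappear together if that
host dies, which is why the size=2-plus-backups suggestion above is often the
more pragmatic answer at this scale.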
Re: [ceph-users] Improving Performance with more OSD's?
On Sun, 28 Dec 2014 04:08:03 PM Nick Fisk wrote:
> This should give you a bit more performance. It's much better to have lots
> of small disks rather than large multi-TB ones from a performance
> perspective. So maybe look to see if you can get 500GB/1TB drives cheap.

Is this from the docs still relevant in this case?

  "A weight is the relative difference between device capacities. We
  recommend using 1.00 as the relative weight for a 1TB storage device. In
  such a scenario, a weight of 0.5 would represent approximately 500GB, and
  a weight of 3.00 would represent approximately 3TB."

So I would have maybe 1 * 3TB and 2 * 1TB.

Kinda regret getting the 3TB drives now - learning experience.

-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
Hello,

On Tue, 30 Dec 2014 08:12:21 +1000 Lindsay Mathieson wrote:
> On Mon, 29 Dec 2014 11:12:06 PM Christian Balzer wrote:
>> Is that a private cluster network just between Ceph storage nodes, or is
>> this for all Ceph traffic (including clients)? The latter would probably
>> be better; a private cluster network twice as fast as the client one
>> isn't particularly helpful 99% of the time.
>
> The latter - all Ceph traffic including clients (qemu rbd).

Very good. ^.^

>>> 3rd Node - Monitor only, for quorum
>>> - Intel Nuc
>>> - 8GB RAM
>>> - CPU: Celeron N2820
>>
>> Uh oh, a bit weak for a monitor. Where does the OS live (on this and the
>> other nodes)? The leveldb (/var/lib/ceph/..) of the monitors likes it
>> fast, SSDs preferably.
>
> On a SSD (all the nodes have OS on SSD).

Good.

> Looks like I misunderstood the purpose of the monitors, I presumed they
> were just for monitoring node health. They do more than that?

They keep the maps, and the pgmap in particular is of course very busy. All
that action is at /var/lib/ceph/mon/monitorname/store.db/ .

In addition, monitors log like no tomorrow, also straining the OS storage.

>> The closer it is to the current storage nodes, the better. The slowest
>> OSD in a cluster can impede all (most of) the others.
>
> Closer as in similar hardware specs?

Ayup. The less variation, the better and the more predictable things become.
Again, having 1 node slow down 2 fast nodes is not what you want.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
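A quick, rough way to check both points on the monitor node - the paths are
the common defaults and /dev/sda is only a placeholder for whatever device
holds the OS/SSD on a given install:

  # how big is the monitor's leveldb store, and how much is it logging?
  du -sh /var/lib/ceph/mon/*/store.db
  du -sh /var/log/ceph/*.log

  # how busy is that device while the cluster is active? (sysstat package)
  iostat -x 2 /dev/sda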
Re: [ceph-users] Improving Performance with more OSD's?
On Tue, 30 Dec 2014 12:48:58 PM Christian Balzer wrote:
>> Looks like I misunderstood the purpose of the monitors, I presumed they
>> were just for monitoring node health. They do more than that?
>
> They keep the maps and the pgmap in particular is of course very busy. All
> that action is at /var/lib/ceph/mon/monitorname/store.db/ . In addition
> monitors log like no tomorrow, also straining the OS storage.

Yikes! Did a quick check - root data storage at under 10% usage. Phew!

Could the third under-spec'd monitor (which only has 1GB Eth) be slowing
things down? Worthwhile removing it as a test?

-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
On Tue, 30 Dec 2014 08:22:01 +1000 Lindsay Mathieson wrote:
> On Mon, 29 Dec 2014 11:29:11 PM Christian Balzer wrote:
>> Reads will scale up (on a cluster basis, individual clients might not
>> benefit as much) linearly with each additional device (host/OSD).
>
> I'm taking that to mean individual clients as a whole will be limited by
> the speed of individual OSDs, but multiple clients will spread their reads
> between multiple OSDs, leading to a higher aggregate bandwidth than
> individual disks could sustain.

A single client like a VM or an application (see rados bench threads) might
of course do things in parallel, too, and thus benefit from accessing
multiple OSDs on multiple nodes at the same time.

However a client doing a single, sequential read won't improve much of
course (the fact that there are more OSDs with less spindle competition may
still help, though).

> I guess the limiting factor there would be network.

For bandwidth/throughput, most likely, and certainly in your case. But
bandwidth really tends to become very quickly the least of your concerns;
IOPS is where bottlenecks tend to appear first. And there, aside from the
obvious limitations of your disks (and SSDs), the next bottleneck - for most
people surprisingly - is the CPU.

>> Writes will scale up with each additional device divided by replica size.
>
> So adding OSDs will increase write speed from individual clients?

OSDs help in and of themselves due to the fact that activity can be
distributed between more spindles (HDDs). So you can certainly increase the
speed of your current storage nodes by adding more OSDs (let's say 4 per
node).

However, increasing the node count and replica size to 3 will not improve
things, rather the opposite. Simply put, in that configuration each node
will have to do the same things as the others, plus overhead and limitations
imposed by things like the network. Once you add a 4th node, things speed up
again.

> seq writes go out to different OSDs simultaneously?

Unless there are multiple threads, no. But given the default object size of
4MB, they go to different OSDs sequentially, and rather quickly so.

>> Fun fact, if you have 1 node with replica 1 and add 2 more identical
>> nodes and increase the replica to 3, your write performance will be less
>> than 50% of the single node.
>
> Interesting - this seems to imply that writes go to the replica OSDs one
> after another, rather than simultaneously like I expected.

There is a graphic in the Ceph documentation:
http://ceph.com/docs/master/architecture/#smart-daemons-enable-hyperscale

The numbering of the requests suggests sequential operation, but even if the
primary OSD sends the data to the secondary one(s) in parallel, your network
bandwidth and LATENCY, as well as the activity on those nodes and OSDs, will
of course delay things when compared to just a single, local write.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Improving Performance with more OSD's?
On Tue, 30 Dec 2014 14:08:32 +1000 Lindsay Mathieson wrote:
> On Tue, 30 Dec 2014 12:48:58 PM Christian Balzer wrote:
>> They keep the maps and the pgmap in particular is of course very busy.
>> All that action is at /var/lib/ceph/mon/monitorname/store.db/ . In
>> addition monitors log like no tomorrow, also straining the OS storage.
>
> Yikes! Did a quick check, root data storage at under 10% usage - Phew!

The DB doesn't (shouldn't) grow out of bounds, and the logs, while chatty,
ought to be rotated. Your issue is IOPS - how busy those SSDs are - more
than anything. But even crappy SSDs should be just fine.

Use a good monitoring tool like atop to watch how busy things are, and do
that while running a normal rados bench like this from a client node:

  rados -p rbd bench 60 write -t 32

And again like this:

  rados -p rbd bench 60 write -t 32 -b 4096

In particular (but not only), compare the CPU usage during those runs.

> Could the third under spec'd monitor (which only has 1GB Eth) be slowing
> things down? worthwhile removing it as a test?

Check with atop, but I doubt it. The network should be fine, storage on SSD
should be fine, and the memory (if not doing anything else) should do for
your cluster size. CPU probably as well, but that is for you to check.

Also, the primary monitor is the one with the lowest IP (unfortunately not
documented anywhere or configurable).

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Improving Performance with more OSD's?
On 30 December 2014 at 14:28, Christian Balzer ch...@gol.com wrote:
> Use a good monitoring tool like atop to watch how busy things are. And do
> that while running a normal rados bench like this from a client node:
>
>   rados -p rbd bench 60 write -t 32
>
> And again like this:
>
>   rados -p rbd bench 60 write -t 32 -b 4096
>
> In particular (but not only), compare the CPU usage during those runs.

Interesting results.

First 14 seconds:
  CPU: 1 core at sys/user 2%/1%, rest idle
  HD : 45% busy
  SSD: 35% busy

After 14 seconds:
  CPU: 1 core at sys/user 20%/7%, rest idle
  HD : 100% busy
  SSD: 30% - 50% busy

Journal size is 10GB, max sync interval = 46.5.
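For context, those two figures would normally come from ceph.conf settings
along these lines - the option names are the stock filestore-era ones, and
the values simply mirror what Lindsay quoted rather than anything confirmed
from his config:

  [osd]
      osd journal size = 10240             # MB, i.e. the 10GB journal partition
      filestore max sync interval = 46.5   # seconds between filestore syncs

The pattern of the HDD going 100% busy after ~14 seconds is consistent with
the journal absorbing the initial burst and the filestore then flushing to
the spinner.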
Re: [ceph-users] Improving Performance with more OSD's?
Appreciate the detailed reply Christian.

On Sun, 28 Dec 2014 02:49:08 PM Christian Balzer wrote:
> On Sun, 28 Dec 2014 08:59:33 +1000 Lindsay Mathieson wrote:
>> I'm looking to improve the raw performance on my small setup (2 Compute
>> Nodes, 2 OSD's). Only used for hosting KVM images.
>
> This doesn't really make things clear, do you mean 2 STORAGE nodes with 2
> OSDs (HDDs) each?

2 nodes, 1 OSD per node.

Hardware is identical for all nodes / disks:
- Mobo: P9X79 WS
- CPU: Intel Xeon E5-2620
- RAM: 32 GB ECC
- 1GB Nic Public Access
- 2 * 1GB Bond for ceph
- OSD: 3TB WD Red
- Journal: 10GB on Samsung 840 EVO

3rd Node - Monitor only, for quorum
- Intel Nuc
- 8GB RAM
- CPU: Celeron N2820

> In either case that's a very small setup (and with a replication of 2 a
> risky one, too), so don't expect great performance.

Ok.

> Throughput numbers aren't exactly worthless, but you will find IOPS to be
> the killer in most cases. Also without describing how you measured these
> numbers (rados bench, fio, bonnie, on the host, inside a VM) they become
> even more muddled.

- rados bench on the node to test raw write
- fio in a VM
- Crystal DiskMark in a windows VM to test IOPS

> You really, really want size 3 and a third node for both performance
> (reads) and redundancy.

I can probably scare up a desktop PC to use as a fourth node with another
3TB disk.

I'd prefer to use the existing third node (the Intel Nuc), but its expansion
is limited to USB3 devices. Are there USB3 external drives with decent
performance stats?

thanks,
-- 
Lindsay
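For reference, a typical fio invocation for the "fio in a VM" style of test
might look like the following; the parameters and test file path are purely
illustrative, not the ones Lindsay actually used:

  # 4KB random writes with direct I/O, queue depth 32, run inside the VM
  fio --name=vm-randwrite --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=32 --size=1G --runtime=60 \
      --filename=/tmp/fio.test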
Re: [ceph-users] Improving Performance with more OSD's?
Hi Lindsay,

Ceph is really designed to scale across large numbers of OSDs, and whilst it
will still function with only 2 OSDs, I wouldn't expect it to perform as
well as a RAID 1 mirror with battery-backed cache.

I wouldn't recommend running the OSDs on USB, although it should work
reasonably well.

If you can't add another full host, your best bet would be to add another
2-3 disks to each server. This should give you a bit more performance. It's
much better to have lots of small disks rather than large multi-TB ones from
a performance perspective. So maybe look to see if you can get 500GB/1TB
drives cheap.

Nick

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Lindsay Mathieson
Sent: 28 December 2014 14:06
To: ceph-us...@ceph.com
Subject: Re: [ceph-users] Improving Performance with more OSD's?

Appreciate the detailed reply Christian.

On Sun, 28 Dec 2014 02:49:08 PM Christian Balzer wrote:
> On Sun, 28 Dec 2014 08:59:33 +1000 Lindsay Mathieson wrote:
>> I'm looking to improve the raw performance on my small setup (2 Compute
>> Nodes, 2 OSD's). Only used for hosting KVM images.
>
> This doesn't really make things clear, do you mean 2 STORAGE nodes with 2
> OSDs (HDDs) each?

2 nodes, 1 OSD per node.

Hardware is identical for all nodes / disks:
- Mobo: P9X79 WS
- CPU: Intel Xeon E5-2620
- RAM: 32 GB ECC
- 1GB Nic Public Access
- 2 * 1GB Bond for ceph
- OSD: 3TB WD Red
- Journal: 10GB on Samsung 840 EVO

3rd Node - Monitor only, for quorum
- Intel Nuc
- 8GB RAM
- CPU: Celeron N2820

> In either case that's a very small setup (and with a replication of 2 a
> risky one, too), so don't expect great performance.

Ok.

> Throughput numbers aren't exactly worthless, but you will find IOPS to be
> the killer in most cases. Also without describing how you measured these
> numbers (rados bench, fio, bonnie, on the host, inside a VM) they become
> even more muddled.

- rados bench on the node to test raw write
- fio in a VM
- Crystal DiskMark in a windows VM to test IOPS

> You really, really want size 3 and a third node for both performance
> (reads) and redundancy.

I can probably scare up a desktop PC to use as a fourth node with another
3TB disk.

I'd prefer to use the existing third node (the Intel Nuc), but its expansion
is limited to USB3 devices. Are there USB3 external drives with decent
performance stats?

thanks,
-- 
Lindsay
[ceph-users] Improving Performance with more OSD's?
I'm looking to improve the raw performance on my small setup (2 Compute
Nodes, 2 OSD's). Only used for hosting KVM images.

Raw read/write is roughly 200/35 MB/s. Starting 4+ VMs simultaneously pushes
iowaits over 30%, though the system keeps chugging along.

Budget is limited ... :( I plan to upgrade my SSD journals to something
better than the Samsung 840 EVOs (Intel 520/530?).

One of the things I see mentioned a lot in blogs etc. is how Ceph's
performance improves as you add more OSDs, and that the quality of the disks
does not matter so much as the quantity. How does this work? Does Ceph
stripe reads and writes across the OSDs to improve performance?

If I add 3 cheap OSDs to each node (500GB - 1TB) with a 10GB SSD journal
partition each, could I expect a big improvement in performance?

What sort of redundancy to set up? Currently it's min_size=1, size=2. Size
is not an issue, we already have 150% more space than we need; redundancy
and performance are more important.

Now I think on it, we can live with the slow write performance, but reducing
iowait would be *really* good.

thanks,
-- 
Lindsay
Re: [ceph-users] Improving Performance with more OSD's?
On Sun, 28 Dec 2014 08:59:33 +1000 Lindsay Mathieson wrote:
> I'm looking to improve the raw performance on my small setup (2 Compute
> Nodes, 2 OSD's). Only used for hosting KVM images.

This doesn't really make things clear, do you mean 2 STORAGE nodes with 2
OSDs (HDDs) each?

In either case that's a very small setup (and with a replication of 2 a
risky one, too), so don't expect great performance.

It would help if you'd tell us what these nodes are made of (CPU, RAM,
disks, network) so we can at least guess what that cluster might be capable
of.

> Raw read/write is roughly 200/35 MB/s. Starting 4+ VM's simultaneously
> pushes iowaits over 30%, though the system keeps chugging along.

Throughput numbers aren't exactly worthless, but you will find IOPS to be
the killer in most cases. Also without describing how you measured these
numbers (rados bench, fio, bonnie, on the host, inside a VM) they become
even more muddled.

> Budget is limited ... :( I plan to upgrade my SSD journals to something
> better than the Samsung 840 EVO's (Intel 520/530?)

Not a big improvement really. Take a look at the 100GB Intel DC S3700s;
while they can write only at 200MB/s, they are priced rather nicely and they
will deliver that performance at ANY time, and for a long time, too.

> One of the things I see mentioned a lot in blogs etc is how ceph's
> performance improves as you add more OSD's and that the quality of the
> disks does not matter so much as the quantity.
>
> How does this work? does ceph stripe reads and writes across the OSD's to
> improve performance?

Yes and no. It stripes by default into 4MB objects, so with enough OSDs and
clients, I/Os will become distributed, scaling up nicely. However a single
client could be hitting the same object on the same OSD all the time (a
small DB file for example), so you won't see much or any improvement in
that case.

There is also the option to stripe things on a much smaller scale, however
that takes some planning and needs to be done at pool creation time. See and
read the Ceph documentation.

> If I add 3 cheap OSD's to each node (500GB - 1TB) with 10GB SSD journal
> partition each could I expect a big improvement in performance?

That depends a lot on the stuff you haven't told us (CPU/RAM/network). Given
that there is sufficient of those, especially CPU, the answer is yes.

A large amount of RAM on the storage nodes will improve reads, as hot
objects become and remain cached.

Of course having decent HDDs will help even with journals on SSDs; for
example the Toshiba DTxx (totally not recommended for ANYTHING) HDDs cost
about the same as their entry-level enterprise MG0x drives, which are nearly
twice as fast in the IOPS department.

> What sort of redundancy to setup? currently its min=1, size=2. Size is
> not an issue, we already have 150% more space than we need, redundancy and
> performance is more important.

You really, really want size 3 and a third node for both performance (reads)
and redundancy.

> Now I think on it, we can live with the slow write performance, but
> reducing iowait would be *really* good.

Decent SSDs (see above) and more (decent) spindles will help with both.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
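One concrete example of the fine-grained striping Christian mentions: for RBD
it is set per image at creation time via the striping options. The image
name, size and striping values below are purely illustrative, and note that
this kind of striping was a librbd (qemu) feature - the kernel RBD client of
that era did not support it:

  # format-2 image, 64KB stripe units spread across 16 objects
  rbd create rbd/striped-test --size 10240 --image-format 2 \
      --stripe-unit 65536 --stripe-count 16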