Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
On 14 January 2015 at 12:08, JM wrote:
> Hi Roland,
>
> You should tune your Ceph Crushmap with a custom rule in order to do that
> (write first on s3 and then to others). This custom rule will then be
> applied to your proxmox pool.
> (What you want to do is only interesting if you run the VMs from host s3.)
>
> Can you give us your crushmap?

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host h1 {
    id -2        # do not change unnecessarily
    # weight 8.140
    alg straw
    hash 0       # rjenkins1
    item osd.1 weight 0.900
    item osd.3 weight 0.900
    item osd.4 weight 0.900
    item osd.5 weight 0.680
    item osd.6 weight 0.680
    item osd.7 weight 0.680
    item osd.8 weight 0.680
    item osd.9 weight 0.680
    item osd.10 weight 0.680
    item osd.11 weight 0.680
    item osd.12 weight 0.680
}
host s3 {
    id -3        # do not change unnecessarily
    # weight 0.450
    alg straw
    hash 0       # rjenkins1
    item osd.2 weight 0.450
}
host s2 {
    id -4        # do not change unnecessarily
    # weight 0.900
    alg straw
    hash 0       # rjenkins1
    item osd.13 weight 0.900
}
host s1 {
    id -5        # do not change unnecessarily
    # weight 1.640
    alg straw
    hash 0       # rjenkins1
    item osd.14 weight 0.290
    item osd.0 weight 0.270
    item osd.15 weight 0.270
    item osd.16 weight 0.270
    item osd.17 weight 0.270
    item osd.18 weight 0.270
}
root default {
    id -1        # do not change unnecessarily
    # weight 11.130
    alg straw
    hash 0       # rjenkins1
    item h1 weight 8.140
    item s3 weight 0.450
    item s2 weight 0.900
    item s1 weight 1.640
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map

thanks so far!

regards
Roland
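A modified version of this map can be compiled and dry-run tested offline
before it goes anywhere near the cluster; a minimal sketch, with example
file names (rule 0 and 2 replicas assumed):

  crushtool -c /tmp/mymap.txt -o /tmp/mymap.bin
  crushtool -i /tmp/mymap.bin --test --rule 0 --num-rep 2 --show-mappings

Using --show-utilization instead of --show-mappings gives a quick view of
how evenly the rule spreads data across the OSDs.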
Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
On 16 January 2015 at 17:15, Gregory Farnum wrote:
> > I have set up 4 machines in a cluster. When I created the Windows 2008
> > server VM on S1 (I corrected my first email: I have three Sunfire X
> > series servers, S1, S2, S3) since S1 has 36GB of RAM and 8 x 300GB SAS
> > drives, it was running normally, pretty close to what I had on the bare
> > metal. About a month later (after being on leave for 2 weeks), I found
> > a machine that is crawling at a snail's pace and I cannot figure out why.
>
> You mean one of the VMs has very slow disk access? Or one of the hosts
> is very slow?

The Windows 2008 VM is very slow. Inside Windows all seems normal: the CPUs
are never more than 20% used, yet even navigating the menus takes a long
time to respond. The host (S1) is not slow.

> In any case, you'd need to look at what about that system is different
> from the others and poke at that difference until it exposes an issue,
> I suppose.

I'll move the machine to one of the smaller hosts (S2 or S3). I'll just
have to lower the spec of the VM, since I've set its RAM at 10GB, which is
much more than S2 or S3 have. Let's see what happens.

> -Greg
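For reference, Proxmox can make both of those changes from the command
line; a sketch, assuming the VM's ID is 100 and the target host is s2
(both example values):

  qm set 100 --memory 6144    # lower the RAM allocation so it fits on the smaller host
  qm migrate 100 s2 --online  # with the disk on Ceph RBD, only RAM state has to move

Note the new memory limit only takes effect after the VM restarts, so an
offline migration (or a stop/start on the target) may be simpler here.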
Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
On Fri, Jan 16, 2015 at 2:52 AM, Roland Giesler wrote:
> On 14 January 2015 at 21:46, Gregory Farnum wrote:
>> In general you can't set up Ceph to write to the local node first. In
>> some specific cases you can if you're willing to do a lot more work
>> around placement, and this *might* be one of those cases.
>>
>> To do this, you'd need to change the CRUSH rules pretty extensively,
>> so that instead of selecting OSDs at random, they have two steps:
>> 1) starting from bucket s3, select a random OSD and put it at the
>> front of the OSD list for the PG.
>> 2) Starting from a bucket which contains all the other OSDs, select
>> N-1 more at random (where N is the number of desired replicas).
>
> I understand in principle what you're saying. Let me go back a step and
> ask the question somewhat differently then:
>
> I have set up 4 machines in a cluster. When I created the Windows 2008
> server VM on S1 (I corrected my first email: I have three Sunfire X
> series servers, S1, S2, S3) since S1 has 36GB of RAM and 8 x 300GB SAS
> drives, it was running normally, pretty close to what I had on the bare
> metal. About a month later (after being on leave for 2 weeks), I found a
> machine that is crawling at a snail's pace and I cannot figure out why.

You mean one of the VMs has very slow disk access? Or one of the hosts
is very slow?

In any case, you'd need to look at what about that system is different
from the others and poke at that difference until it exposes an issue,
I suppose.
-Greg
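Greg's "poke at the difference" could start with a few generic checks; a
sketch, assuming the Proxmox images live in a pool named rbd (adjust to the
real pool name):

  ceph -s          # overall health: degraded/misplaced PGs, recovery traffic, near-full disks
  ceph osd tree    # are all OSDs up and weighted the way you expect?
  ceph osd perf    # per-OSD commit/apply latency; one dying disk can stall a whole pool
  rados bench -p rbd 30 write   # rough write throughput (creates test objects; use with care in production)

A single slow or flapping OSD is a common reason for a VM that was fine
last month to be crawling now.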
Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
On 14 January 2015 at 21:46, Gregory Farnum wrote:
> On Tue, Jan 13, 2015 at 1:03 PM, Roland Giesler wrote:
> > Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
> > mostly from there, but I want to make sure that the writes that happen
> > to the ceph cluster get written to the "local" osd's on s3 first, and
> > that the additional writes/copies are then done over the network.
> >
> > Is this possible with Ceph? The VMs are KVM in Proxmox, in case it's
> > relevant.
>
> In general you can't set up Ceph to write to the local node first. In
> some specific cases you can if you're willing to do a lot more work
> around placement, and this *might* be one of those cases.
>
> To do this, you'd need to change the CRUSH rules pretty extensively,
> so that instead of selecting OSDs at random, they have two steps:
> 1) starting from bucket s3, select a random OSD and put it at the
> front of the OSD list for the PG.
> 2) Starting from a bucket which contains all the other OSDs, select
> N-1 more at random (where N is the number of desired replicas).

I understand in principle what you're saying. Let me go back a step and ask
the question somewhat differently then:

I have set up 4 machines in a cluster. When I created the Windows 2008
server VM on S1 (I corrected my first email: I have three Sunfire X series
servers, S1, S2, S3) since S1 has 36GB of RAM and 8 x 300GB SAS drives, it
was running normally, pretty close to what I had on the bare metal. About a
month later (after being on leave for 2 weeks), I found a machine that is
crawling at a snail's pace and I cannot figure out why.

So instead of suggesting something from my side (without in-depth knowledge
yet), what should I do to get this machine to run at speed again?

Further to my hardware and network:

S1: 2 x Quad Core Xeon, 36GB RAM, 8 x 300GB HDDs
S2: 1 x Opteron Dual Core, 8GB RAM, 2 x 750GB HDDs
S3: 1 x Opteron Dual Core, 8GB RAM, 2 x 750GB HDDs
H1: 1 x Xeon Dual Core, 5GB RAM, 12 x 1TB HDDs

(All these machines are at full drive capacity, that is, all their drive
slots are being utilised.)

All the servers are linked with dual Gigabit Ethernet connections to a
switch with LACP enabled, and the links are bonded on each server. While
this doesn't raise the speed of a single transfer, it does allow more
aggregate bandwidth between the servers.

The H1 machine is only running ceph and thus acts only as storage. The
other machines (S1, S2 & S3) are for web servers (development and
production), the Windows 2008 server and a few other functions, all managed
from Proxmox.

The hardware is what my client has been using, but there were lots of
inefficiencies and little redundancy in the setup before we embarked on
this project. However, the hardware is sufficient for their needs.

I hope that gives you a reasonable picture of the setup, to be able to give
me some advice on how to troubleshoot this.

regards
Roland

> You can look at the documentation on CRUSH or search the list archives
> for more on this subject.
>
> Note that doing this has a bunch of down sides: you'll have balance
> issues because every piece of data will be on the s3 node (that's a
> TERRIBLE name for a host in a project which has API support for Amazon
> S3, btw :p); if you add new VMs on a different node they'll all be going
> to the s3 node for all their writes (unless you set them up on a
> different pool with different CRUSH rules); s3 will be satisfying all
> the read requests, so the other nodes are just backups in case of disk
> failure; etc.
> -Greg
Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
So you can see my server names and their osd's too...

# id    weight  type name       up/down reweight
-1      11.13   root default
-2      8.14            host h1
1       0.9                     osd.1   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.68                    osd.5   up      1
6       0.68                    osd.6   up      1
7       0.68                    osd.7   up      1
8       0.68                    osd.8   up      1
9       0.68                    osd.9   up      1
10      0.68                    osd.10  up      1
11      0.68                    osd.11  up      1
12      0.68                    osd.12  up      1
-3      0.45            host s3
2       0.45                    osd.2   up      1
-4      0.9             host s2
13      0.9                     osd.13  up      1
-5      1.64            host s1
14      0.29                    osd.14  up      1
0       0.27                    osd.0   down    0
15      0.27                    osd.15  up      1
16      0.27                    osd.16  up      1
17      0.27                    osd.17  up      1
18      0.27                    osd.18  up      1

regards
Roland
Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
# Get the compiled crushmap
root@server01:~# ceph osd getcrushmap -o /tmp/myfirstcrushmap

# Decompile the compiled crushmap above
root@server01:~# crushtool -d /tmp/myfirstcrushmap -o /tmp/myfirstcrushmap.txt

then give us your /tmp/myfirstcrushmap.txt file.. :)
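The reverse direction, once the decompiled map has been edited, would be
(same example paths; a sketch):

# Recompile the edited crushmap
root@server01:~# crushtool -c /tmp/myfirstcrushmap.txt -o /tmp/mynewcrushmap

# Inject it back into the cluster
root@server01:~# ceph osd setcrushmap -i /tmp/mynewcrushmap

Be aware that injecting a changed map can start data movement immediately.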
Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
On 14 January 2015 at 12:08, JM wrote:
> Hi Roland,
>
> You should tune your Ceph Crushmap with a custom rule in order to do that
> (write first on s3 and then to others). This custom rule will then be
> applied to your proxmox pool.
> (What you want to do is only interesting if you run the VMs from host s3.)
>
> Can you give us your crushmap?

Please note that I made a mistake in my email. The machine that I want to
write to first is S1, not S3.

For the life of me I cannot find how to extract the crush map. I found:

ceph osd getcrushmap -o crushfilename

Where can I find the crush file? I've never needed this. This is my first
installation, so please bear with me while I learn!

Lionel: I read what you're saying. However, the strange thing is that last
year I had this Windows 2008 VM running on the same cluster without
changes, and coming back from leave in the new year, it has crawled to a
painfully slow state. And I don't quite know where to start to trace this.
The Windows machine itself is not the problem, since the boot process of
the VM is very slow even before Windows starts up.

thanks

Roland
Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
Hi Roland,

You should tune your Ceph Crushmap with a custom rule in order to do that
(write first on s3 and then to others). This custom rule will then be
applied to your proxmox pool.
(What you want to do is only interesting if you run the VMs from host s3.)

Can you give us your crushmap?
Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
On Tue, Jan 13, 2015 at 1:03 PM, Roland Giesler wrote:
> Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
> mostly from there, but I want to make sure that the writes that happen to
> the ceph cluster get written to the "local" osd's on s3 first, and that
> the additional writes/copies are then done over the network.
>
> Is this possible with Ceph? The VMs are KVM in Proxmox, in case it's
> relevant.

In general you can't set up Ceph to write to the local node first. In
some specific cases you can if you're willing to do a lot more work
around placement, and this *might* be one of those cases.

To do this, you'd need to change the CRUSH rules pretty extensively,
so that instead of selecting OSDs at random, they have two steps:
1) starting from bucket s3, select a random OSD and put it at the
front of the OSD list for the PG.
2) Starting from a bucket which contains all the other OSDs, select
N-1 more at random (where N is the number of desired replicas).

You can look at the documentation on CRUSH or search the list archives
for more on this subject.

Note that doing this has a bunch of down sides: you'll have balance
issues because every piece of data will be on the s3 node (that's a
TERRIBLE name for a host in a project which has API support for Amazon
S3, btw :p); if you add new VMs on a different node they'll all be going
to the s3 node for all their writes (unless you set them up on a
different pool with different CRUSH rules); s3 will be satisfying all
the read requests, so the other nodes are just backups in case of disk
failure; etc.
-Greg
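To make Greg's two steps concrete, here is a sketch of such a rule, using
the bucket names from Roland's map (untested, and see the caveat below):

rule s3_first {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    # step 1: pick one OSD from host s3 to be first in the list (the primary)
    step take s3
    step choose firstn 1 type osd
    step emit
    # step 2: pick the remaining N-1 replicas from the rest of the cluster
    step take default
    step chooseleaf firstn -1 type host
    step emit
}

Caveat: "default" here still contains s3, so a replica can occasionally
land on s3 twice; a real version would group the other hosts under their
own bucket (Greg's "bucket which contains all the other OSDs") and "step
take" that instead. A pool is then pointed at the rule with something like
"ceph osd pool set <pool> crush_ruleset 1".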
Re: [ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
On 01/13/15 22:03, Roland Giesler wrote:
> Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
> mostly from there, but I want to make sure that the writes that happen to
> the ceph cluster get written to the "local" osd's on s3 first, and that
> the additional writes/copies are then done over the network.
>
> Is this possible with Ceph? The VMs are KVM in Proxmox, in case it's
> relevant.

I don't think it is possible, because I believe it would break Ceph's
durability guarantees: IIRC each write is only considered done when all
replicas have been written (with default settings, 3 replicas on 3
different servers), so you have to wait for 3 servers to acknowledge the
write for it to complete in the VM.

You could maybe achieve what you want by using tiering, with a cache pool
configured to use only disks on the s3 server in write-back mode, but given
the user experience reports on the list it may actually perform worse than
your current setup.

Best regards,

Lionel Bouton
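For reference, the tiering Lionel describes would look roughly like the
following. This is a sketch only: the backing pool name rbd, the cache pool
name cache-s3, the PG count and ruleset id 2 are all assumptions, and CRUSH
rule 2 (one that selects only OSDs on host s3) would have to exist first.

  ceph osd pool create cache-s3 128 128
  ceph osd pool set cache-s3 crush_ruleset 2    # confine the cache pool to s3's disks
  ceph osd tier add rbd cache-s3                # attach cache-s3 in front of rbd
  ceph osd tier cache-mode cache-s3 writeback
  ceph osd tier set-overlay rbd cache-s3        # clients now hit the cache pool first
  ceph osd pool set cache-s3 hit_set_type bloom # bookkeeping the tiering agent needs
  ceph osd pool set cache-s3 target_max_bytes 200000000000  # flush/evict threshold (example: ~200GB)

As Lionel says, experience reports on this list suggest a cache tier can
perform worse than a plain pool, so benchmark before committing to it.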
[ceph-users] How to tell a VM to write more to local ceph nodes than to the network.
I have a 4 node ceph cluster, but the disks are not equally distributed
across all machines (they are substantially different from each other).

One machine has 12 x 1TB SAS drives (h1), another has 8 x 300GB SAS (s3),
and two machines have only two 1TB drives each (s2 & s1).

Now machine s3 has by far the most CPUs and RAM, so I'm running my VMs
mostly from there, but I want to make sure that the writes that happen to
the ceph cluster get written to the "local" osd's on s3 first, and that the
additional writes/copies are then done over the network.

Is this possible with Ceph? The VMs are KVM in Proxmox, in case it's
relevant.

regards

*Roland *