Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
On 16/03/2015, at 12.23, Alexandre DERUMIER <aderum...@odiso.com> wrote:

>> We use Proxmox, so I think it uses librbd ?
> As it's me that made the Proxmox rbd plugin, I can confirm that yes, it's librbd ;)
>
> Is the ceph cluster on dedicated nodes? Or are the VMs running on the same nodes as the OSD daemons?

My cluster has the Ceph OSDs+MONs on separate PVE nodes, no VMs.

>> And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not.
> Is the VM crashed, as in no more qemu process? Or is it the guest OS that crashed?

Hmm, it's a long time ago now; I remember the VM status was "stopped" and resume didn't work, so they were started again ASAP :)

> (Do you use virtio, virtio-scsi or ide for your guests?)

virtio

/Steffen
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
> That the whole system slows down, OK, but a brutal stop...

This is strange. That could be:
- a qemu crash, maybe a bug in the rbd block storage (if you use librbd)
- the oom-killer on your host (any logs?)

What is your qemu version?

----- Original message -----
From: Florent Bautista flor...@coppint.com
To: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 16 March 2015 10:11:43
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

Of course, but it does not explain why the VMs stopped... That the whole system slows down, OK, but a brutal stop...

On 03/14/2015 07:00 PM, Andrija Panic wrote:
> changing PG number causes a LOOOT of data rebalancing (in my case it was 80%), which I learned the hard way...
>
> On 14 March 2015 at 18:49, Gabri Mate mailingl...@modernbiztonsag.org wrote:
>> I had the same issue a few days ago. I was increasing the pg_num of one pool from 512 to 1024 and all the VMs in that pool stopped. [...]
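For anyone checking the oom-killer hypothesis from the archive, here is a minimal way to look for it on the host. It assumes a Debian-based Proxmox node; the log paths and the "kvm" wrapper name are assumptions, so adjust for your setup:

    # look for oom-killer activity around the time the VMs stopped
    dmesg | grep -i -E 'oom|out of memory'
    grep -i -E 'oom-killer|killed process' /var/log/syslog /var/log/kern.log

    # and the qemu version Alexandre asks about (Proxmox ships qemu as "kvm")
    kvm --version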
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
> We use Proxmox, so I think it uses librbd ?

As it's me that made the Proxmox rbd plugin, I can confirm that yes, it's librbd ;)

Is the ceph cluster on dedicated nodes? Or are the VMs running on the same nodes as the OSD daemons?

> And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not.

Is the VM crashed, as in no more qemu process? Or is it the guest OS that crashed?
(Do you use virtio, virtio-scsi or ide for your guests?)

----- Original message -----
From: Florent Bautista flor...@coppint.com
To: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 16 March 2015 11:14:45
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote:
> This is strange. That could be:
> - a qemu crash, maybe a bug in the rbd block storage (if you use librbd)
> - the oom-killer on your host (any logs?)
> What is your qemu version?

Now we have version 2.1.3. Some of the VMs that stopped had been running for a long time, but some others had only 4 days of uptime.

And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not.

We use Proxmox, so I think it uses librbd ?
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
On 16/03/2015, at 11.14, Florent B <flor...@coppint.com> wrote:
> On 03/16/2015 11:03 AM, Alexandre DERUMIER wrote:
>> This is strange. That could be:
>> - a qemu crash, maybe a bug in the rbd block storage (if you use librbd)
>> - the oom-killer on your host (any logs?)
>> What is your qemu version?
>
> Now we have version 2.1.3. Some of the VMs that stopped had been running for a long time, but some others had only 4 days of uptime.
>
> And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not.
>
> We use Proxmox, so I think it uses librbd ?

I had the same issue once as well when bumping up pg_num: the majority of my Proxmox VMs stopped. I believe this might be due to heavy rebalancing causing timeouts when the VMs try to do I/O operations, thus generating kernel panics. Next time around I want to go in smaller increments of pg_num and hopefully avoid this.

I follow the need for more PGs when having more OSDs, but how come PGs become too few when adding more objects/data to a pool?

/Steffen
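A minimal sketch of the "small increments" approach described above, for anyone reading the archive. The pool name "rbd", the step values and the 60-second poll are placeholders rather than anything stated in this thread; pg_num is raised before pgp_num because pgp_num may not exceed pg_num:

    for pg in 576 640 704 768; do
        ceph osd pool set rbd pg_num  $pg
        ceph osd pool set rbd pgp_num $pg
        # wait for the cluster to settle (all PGs active+clean) before the next step
        while ! ceph health | grep -q HEALTH_OK; do
            sleep 60
        done
    done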
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
May I know your ceph version? The latest version of firefly, 0.80.9, has patches to avoid excessive data migration when reweighting OSDs. You may need to set a tunable in order to make this patch active.

This is a bugfix release for firefly. It fixes a performance regression in librbd, an important CRUSH misbehavior (see below), and several RGW bugs. We have also backported support for flock/fcntl locks to ceph-fuse and libcephfs. We recommend that all Firefly users upgrade. For more detailed information, see http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt

Adjusting CRUSH maps
--------------------

* This point release fixes several issues with CRUSH that trigger excessive data migration when adjusting OSD weights. These are most obvious when a very small weight change (e.g., a change from 0 to .01) triggers a large amount of movement, but the same set of bugs can also lead to excessive (though less noticeable) movement in other cases.

  However, because the bug may already have affected your cluster, fixing it may trigger movement *back* to the more correct location. For this reason, you must manually opt-in to the fixed behavior.

  In order to set the new tunable to correct the behavior::

      ceph osd crush set-tunable straw_calc_version 1

  Note that this change will have no immediate effect. However, from this point forward, any 'straw' bucket in your CRUSH map that is adjusted will get non-buggy internal weights, and that transition may trigger some rebalancing.

  You can estimate how much rebalancing will eventually be necessary on your cluster with::

      ceph osd getcrushmap -o /tmp/cm
      crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
      crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
      crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
      crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b 2>&1
      wc -l /tmp/a                          # num total mappings
      diff -u /tmp/a /tmp/b | grep -c ^+    # num changed mappings

  Divide the number of changed mappings by the total number of mappings in /tmp/a; we've found that most clusters are under 10%.

  You can force all of this rebalancing to happen at once with::

      ceph osd crush reweight-all

  Otherwise, it will happen at some unknown point in the future when CRUSH weights are next adjusted.

Notable Changes
---------------

* ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
* crush: fix straw bucket weight calculation, add straw_calc_version tunable (#10095 Sage Weil)
* crush: fix tree bucket (Rongzu Zhu)
* crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
* crushtool: add --reweight (Sage Weil)
* librbd: complete pending operations before losing image (#10299 Jason Dillaman)
* librbd: fix read caching performance regression (#9854 Jason Dillaman)
* librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
* mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
* osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
* osd: handle no-op write with snapshot (#10262 Sage Weil)
* radosgw-admi

> On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
>>> VMs are running on the same nodes as the OSDs
>> Are you sure that you didn't hit some kind of out of memory? PG rebalancing can be memory hungry (depending on how many OSDs you have).
>
> 2 OSDs per host, and 5 hosts in this cluster.
>
> hosts h
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
I always keep my pg number a power of 2. So I'd go from 2048 to 4096. I'm not sure if this is the safest way, but it's worked for me.

Michael Kuriger
Sr. Unix Systems Engineer
mk7...@yp.com | 818-649-7235

From: Chu Duc Minh chu.ducm...@gmail.com
Date: Monday, March 16, 2015 at 7:49 AM
To: Florent B flor...@coppint.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

> I'm using the latest Giant and have the same issue. When I increase the pg_num of a pool from 2048 to 2148, my VMs are still OK. When I increase it from 2148 to 2400, some VMs die (the qemu-kvm process dies).
>
> My physical servers (hosting the VMs) run kernel 3.13 and use librbd.
>
> I think it's a bug in librbd with the crushmap. (I set crush_tunables3 on my ceph cluster, does it make sense?)
>
> Do you know a way to safely increase pg_num? (I don't think increasing pg_num by 100 each time is a safe, good way.)
>
> Regards,
>
> On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:
>> We are on Giant.
>>
>> On 03/16/2015 02:03 PM, Azad Aliyar wrote:
>>> May I know your ceph version? The latest version of firefly, 0.80.9, has patches to avoid excessive data migration when reweighting OSDs. You may need to set a tunable in order to make this patch active. For more detailed information, see http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
>>> [...]
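As a side note on picking the power-of-two target: the rule of thumb from the Ceph docs of that era (an assumption here, not something stated in this thread) is roughly 100 PGs per OSD, divided by the replica count and rounded up to the next power of two. A quick sketch with placeholder numbers:

    osds=10    # e.g. 5 hosts x 2 OSDs, as in Florent's cluster
    size=3     # pool replica count (placeholder)
    target=$(( osds * 100 / size ))
    pgs=1; while [ "$pgs" -lt "$target" ]; do pgs=$(( pgs * 2 )); done
    echo "suggested pg_num: $pgs"    # -> 512 for 10 OSDs at size 3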
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
@Michael Kuriger: when ceph/librbd operate normally, I know that doubling the pg_num is the safe way. But when it has a problem, I think doubling it can make many, many VMs die (maybe >= 50%?).

On Mon, Mar 16, 2015 at 9:53 PM, Michael Kuriger mk7...@yp.com wrote:
> I always keep my pg number a power of 2. So I'd go from 2048 to 4096. I'm not sure if this is the safest way, but it's worked for me.
>
> Michael Kuriger
> Sr. Unix Systems Engineer
> mk7...@yp.com | 818-649-7235
>
> [...]
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
I'm using the latest Giant and have the same issue. When I increase the pg_num of a pool from 2048 to 2148, my VMs are still OK. When I increase it from 2148 to 2400, some VMs die (the qemu-kvm process dies).

My physical servers (hosting the VMs) run kernel 3.13 and use librbd.

I think it's a bug in librbd with the crushmap. (I set crush_tunables3 on my ceph cluster, does it make sense?)

Do you know a way to safely increase pg_num? (I don't think increasing pg_num by 100 each time is a safe, good way.)

Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B flor...@coppint.com wrote:
> We are on Giant.
>
> On 03/16/2015 02:03 PM, Azad Aliyar wrote:
>> May I know your ceph version? The latest version of firefly, 0.80.9, has patches to avoid excessive data migration when reweighting OSDs. You may need to set a tunable in order to make this patch active. For more detailed information, see http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
>> [...]
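On the crush_tunables3 question: one way to see what the cluster is actually running, and to change profiles deliberately, is sketched below. The commands are as I recall them for Firefly/Giant, so verify against your version's docs, and note that changing the profile can itself trigger heavy data movement:

    # dump the CRUSH tunables currently in effect
    ceph osd crush show-tunables

    # switch to a named profile (e.g. firefly) -- treat this like a pg_num
    # change, it can move a lot of data
    ceph osd crush tunables firefly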
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
> VMs are running on the same nodes as the OSDs

Are you sure that you didn't hit some kind of out of memory? PG rebalancing can be memory hungry (depending on how many OSDs you have).

Do you see the oom-killer in your host logs?

----- Original message -----
From: Florent Bautista flor...@coppint.com
To: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 16 March 2015 12:35:11
Subject: Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !

On 03/16/2015 12:23 PM, Alexandre DERUMIER wrote:
>> We use Proxmox, so I think it uses librbd ?
> As it's me that made the Proxmox rbd plugin, I can confirm that yes, it's librbd ;)
>
> Is the ceph cluster on dedicated nodes? Or are the VMs running on the same nodes as the OSD daemons?

VMs are running on the same nodes as the OSDs.

>> And I should point out that not all VMs on that pool crashed, only some of them (a large majority), and on the same host some crashed and others did not.
> Is the VM crashed, as in no more qemu process? Or is it the guest OS that crashed?
> (Do you use virtio, virtio-scsi or ide for your guests?)

I don't really know what crashed; I think the qemu process, but I'm not sure. We use virtio.
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
I had the same issue a few days ago. I was increasing the pg_num of one pool from 512 to 1024 and all the VMs in that pool stopped. I came to the conclusion that doubling the pg_num caused such a high load in ceph that the VMs were blocked. The next time I will test with small increments.

On 12:38 Sat 14 Mar, Florent B wrote:
> Hi all,
>
> I have a Giant cluster in production. Today one of my RBD pools had the "too few pgs" warning. So I changed pg_num & pgp_num.
>
> And at that moment, some of the VMs stored on this pool were stopped (on some hosts, not all; it depends, no logic).
>
> All was running fine for months...
>
> Have you ever seen this? What could have caused this?
>
> Thank you.
Re: [ceph-users] [SPAM] Changing pg_num = RBD VM down !
changing PG number causes a LOOOT of data rebalancing (in my case it was 80%), which I learned the hard way...

On 14 March 2015 at 18:49, Gabri Mate mailingl...@modernbiztonsag.org wrote:
> I had the same issue a few days ago. I was increasing the pg_num of one pool from 512 to 1024 and all the VMs in that pool stopped. I came to the conclusion that doubling the pg_num caused such a high load in ceph that the VMs were blocked. The next time I will test with small increments.
>
> On 12:38 Sat 14 Mar, Florent B wrote:
>> Hi all,
>>
>> I have a Giant cluster in production. Today one of my RBD pools had the "too few pgs" warning. So I changed pg_num & pgp_num.
>> [...]

--
Andrija Panić
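For readers wanting to see how much movement a pg_num change has actually triggered, the plain ceph CLI shows it while it happens; a minimal sketch using only standard commands (nothing Proxmox-specific):

    # one-shot summary: degraded/misplaced objects and recovery throughput
    ceph -s

    # follow the cluster log live while the rebalance runs
    ceph -w

    # per-PG detail if you need to see what is still backfilling
    ceph health detail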