Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice
Hello,

On Mon, 31 Aug 2015 22:44:05 + Stillwell, Bryan wrote:

> We have the following in our ceph.conf to bring in new OSDs with a
> weight of 0:
>
> [osd]
> osd_crush_initial_weight = 0
>
> We then set 'nobackfill' and bring in each OSD at full weight one at a
> time (letting things settle down before bringing in the next OSD).
> Once all the OSDs are brought in we unset 'nobackfill' and let Ceph
> take care of the rest. This seems to work pretty well for us.
>
That looks interesting, I will give it a spin on my test cluster.

One thing the "letting things settle down" reminded me of: adding OSDs,
and especially a new node, will cause (potentially significant) data
movement resulting from the CRUSH map changes - something to keep in
mind when scheduling even those "harmless" first steps.

Christian

> Bryan

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice
On Mon, 31 Aug 2015 08:57:23 +0200 Udo Lembke wrote:

> Hi Christian,
> for my setup "b" takes too long - too much data movement and stress on
> all nodes. I have simply (with replica 3) "set noout", reinstalled one
> node (with a new filesystem on the OSDs, but leaving them in the CRUSH
> map) and started all OSDs (on Friday night) - the rebuild takes a
> little less than one day (11*4TB, 1*8TB). This does also stress the
> other nodes, but less than weighting to zero would.
>
I will hopefully have a good idea of what times and impacts I'm looking
at after a). But I think that doing a massive push-pull like that would
take too long for the maintenance windows we have; also, the number of
OSDs will change in the old nodes.

Christian

> Udo

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice
We have the following in our ceph.conf to bring in new OSDs with a weight
of 0:

[osd]
osd_crush_initial_weight = 0

We then set 'nobackfill' and bring in each OSD at full weight one at a
time (letting things settle down before bringing in the next OSD). Once
all the OSDs are brought in we unset 'nobackfill' and let Ceph take care
of the rest. This seems to work pretty well for us.

Bryan

On 8/31/15, 4:08 PM, "ceph-users on behalf of Wang, Warren" wrote:

> When we know we need to remove a node, we weight it down over time.
> Depending on your cluster, you may need to do this over days or hours.
>
> In theory, you could do the same when putting OSDs in, by setting noin,
> and then setting the weight to something very low and going up over
> time. I haven't tried this though.
>
> --
> Warren Wang
> Comcast Cloud (OpenStack)

This E-mail and any of its attachments may contain Time Warner Cable
proprietary information, which is privileged, confidential, or subject to
copyright belonging to Time Warner Cable. This E-mail is intended solely
for the use of the individual or entity to which it is addressed. If you
are not the intended recipient of this E-mail, you are hereby notified
that any dissemination, distribution, copying, or action taken in
relation to the contents of and attachments to this E-mail is strictly
prohibited and may be unlawful. If you have received this E-mail in
error, please notify the sender immediately and permanently delete the
original and any copy of this E-mail and any printout.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
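Bryan's procedure can be sketched as a short shell sequence. This is a
dry-run sketch, not his actual script: the OSD IDs and the target weight
are made up, and `$CEPH` is set to `echo` so the commands are printed for
review rather than executed.

```shell
# ceph.conf fragment (per the message above):
#   [osd]
#   osd_crush_initial_weight = 0

# Dry-run: "echo ceph" prints each command; set CEPH="ceph" to run for
# real (try it on a test cluster first).
CEPH="echo ceph"
NEW_OSDS="12 13 14"   # hypothetical IDs of the freshly added OSDs
FULL_WEIGHT="3.64"    # illustrative CRUSH weight, e.g. for a 4 TB disk

$CEPH osd set nobackfill            # hold off backfill while weighting up
for id in $NEW_OSDS; do
    $CEPH osd crush reweight "osd.$id" "$FULL_WEIGHT"
    # on a real cluster: let peering settle before the next OSD
done
$CEPH osd unset nobackfill          # now backfill runs for all OSDs at once
```

The point of `nobackfill` here is that all the CRUSH map changes are in
place before any backfill starts, so data moves once, to its final
locations, instead of being reshuffled after each reweight.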
Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice
When we know we need to remove a node, we weight it down over time.
Depending on your cluster, you may need to do this over days or hours.

In theory, you could do the same when putting OSDs in, by setting noin,
and then setting the weight to something very low and going up over time.
I haven't tried this though.

--
Warren Wang
Comcast Cloud (OpenStack)

On 8/31/15, 2:57 AM, "ceph-users on behalf of Udo Lembke" wrote:

> Hi Christian,
> for my setup "b" takes too long - too much data movement and stress on
> all nodes. I have simply (with replica 3) "set noout", reinstalled one
> node (with a new filesystem on the OSDs, but leaving them in the CRUSH
> map) and started all OSDs (on Friday night) - the rebuild takes a
> little less than one day (11*4TB, 1*8TB). This does also stress the
> other nodes, but less than weighting to zero would.
>
> Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
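Warren's gradual weighting amounts to stepping `ceph osd crush reweight`
down over time. A rough dry-run sketch (the OSD ID and the step schedule
are invented; `$CEPH` echoes the commands instead of running them):

```shell
# Drain a single OSD gradually by lowering its CRUSH weight in steps.
CEPH="echo ceph"        # set CEPH="ceph" to actually run the commands
OSD="osd.7"             # hypothetical OSD being weighted down
STEPS="2.7 1.8 0.9 0"   # e.g. from a full weight of ~3.6 down to 0

for w in $STEPS; do
    $CEPH osd crush reweight "$OSD" "$w"
    # on a real cluster: wait hours or days between steps, until the
    # cluster is back to HEALTH_OK, depending on how disruptive each
    # step turns out to be
done
```

For the weighting-in direction Warren mentions, the same loop would run
with an ascending schedule after `ceph osd set noin`, so new OSDs are not
marked in at full weight automatically.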
Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice
On Mon, Aug 31, 2015 at 5:07 AM, Christian Balzer wrote:
>
> Hello,
>
> I'm about to add another storage node to a small firefly cluster here
> and refurbish 2 existing nodes (more RAM, different OSD disks).
>
> Insert rant about not going to start using ceph-deploy, as I would
> have to set the cluster to noin since "prepare" also activates things
> due to the udev magic...
>
> This cluster is quite at the limits of its IOPS capacity (the HW was
> requested ages ago, but the mills here grind slowly and not
> particularly fine either), so the plan is to:
>
> a) phase in the new node (let's call it C), one OSD at a time (in the
> dead of night)
> b) empty out old node A (weight 0), one OSD at a time. When done,
> refurbish it and bring it back in, like above.
> c) repeat with the 2nd old node B.
>
> Looking at this, it's obvious where the big optimization in this
> procedure would be: having the ability to "freeze" the OSDs on node B,
> that is, making them ineligible for any new PGs while preserving their
> current status.
> That way data would move from A to C (which is significantly faster
> than A or B) and then back to A when it is refurbished, avoiding any
> heavy lifting by B.
>
> Does that sound like something other people might find useful as well,
> and is it feasible w/o upsetting the CRUSH applecart?

That's the rub, isn't it. Freezing an OSD is implicitly switching from
calculating locations to enumerating them. I can think of the start of a
few hacks around that (mostly around our existing temp pg mappings), but
I don't think it's possible to scale them. :/
-Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice
Hi Christian,
for my setup "b" takes too long - too much data movement and stress on
all nodes.
I have simply (with replica 3) "set noout", reinstalled one node (with a
new filesystem on the OSDs, but leaving them in the CRUSH map) and
started all OSDs (on Friday night) - the rebuild takes a little less
than one day (11*4TB, 1*8TB).
This does also stress the other nodes, but less than weighting to zero
would.

Udo

On 31.08.2015 06:07, Christian Balzer wrote:
>
> Hello,
>
> I'm about to add another storage node to a small firefly cluster here
> and refurbish 2 existing nodes (more RAM, different OSD disks).
>
> Insert rant about not going to start using ceph-deploy, as I would
> have to set the cluster to noin since "prepare" also activates things
> due to the udev magic...
>
> This cluster is quite at the limits of its IOPS capacity (the HW was
> requested ages ago, but the mills here grind slowly and not
> particularly fine either), so the plan is to:
>
> a) phase in the new node (let's call it C), one OSD at a time (in the
> dead of night)
> b) empty out old node A (weight 0), one OSD at a time. When done,
> refurbish it and bring it back in, like above.
> c) repeat with the 2nd old node B.
>
> Looking at this, it's obvious where the big optimization in this
> procedure would be: having the ability to "freeze" the OSDs on node B,
> that is, making them ineligible for any new PGs while preserving their
> current status.
> That way data would move from A to C (which is significantly faster
> than A or B) and then back to A when it is refurbished, avoiding any
> heavy lifting by B.
>
> Does that sound like something other people might find useful as well,
> and is it feasible w/o upsetting the CRUSH applecart?
>
> Christian
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
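Udo's in-place reinstall hinges on `noout`: the stopped OSDs stay "in"
the CRUSH map, so nothing is re-replicated away while the node is down,
and only the emptied OSDs are backfilled afterwards. A dry-run sketch
(commands are echoed, not executed; the reinstall itself is only
indicated as comments):

```shell
CEPH="echo ceph"        # set CEPH="ceph" to actually run the commands

$CEPH osd set noout     # down OSDs stay "in": no re-replication begins
# ... stop the OSDs, reinstall the node, recreate the filesystems on the
# ... OSD disks (the OSDs remain in the CRUSH map), start the OSDs again
$CEPH osd unset noout   # recovery now refills the emptied OSDs from peers
```

With replica 3, as Udo notes, the cluster keeps serving I/O from the two
remaining copies while the node is rebuilt.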
[ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice
Hello,

I'm about to add another storage node to a small firefly cluster here and
refurbish 2 existing nodes (more RAM, different OSD disks).

Insert rant about not going to start using ceph-deploy, as I would have
to set the cluster to noin since "prepare" also activates things due to
the udev magic...

This cluster is quite at the limits of its IOPS capacity (the HW was
requested ages ago, but the mills here grind slowly and not particularly
fine either), so the plan is to:

a) phase in the new node (let's call it C), one OSD at a time (in the
dead of night)
b) empty out old node A (weight 0), one OSD at a time. When done,
refurbish it and bring it back in, like above.
c) repeat with the 2nd old node B.

Looking at this, it's obvious where the big optimization in this
procedure would be: having the ability to "freeze" the OSDs on node B,
that is, making them ineligible for any new PGs while preserving their
current status.
That way data would move from A to C (which is significantly faster than
A or B) and then back to A when it is refurbished, avoiding any heavy
lifting by B.

Does that sound like something other people might find useful as well,
and is it feasible w/o upsetting the CRUSH applecart?

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
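Step b) of Christian's plan boils down to draining node A one OSD at a
time by setting the CRUSH weight to 0. A dry-run sketch (the OSD IDs of
node A are invented; `$CEPH` echoes the commands rather than running
them):

```shell
CEPH="echo ceph"          # set CEPH="ceph" to actually run the commands
NODE_A_OSDS="0 1 2 3"     # hypothetical OSD IDs living on node A

for id in $NODE_A_OSDS; do
    $CEPH osd crush reweight "osd.$id" 0
    # on a real cluster: wait until backfill has fully drained this OSD
    # (and the cluster is healthy) before weighting out the next one
done
```

Step a) is the mirror image: bring the new node's OSDs in at a low (or
zero) weight and step them up one at a time.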