Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-31 Thread Udo Lembke
Hi Christian,
for my setup "b" takes too long - too much data movement and stress to all 
nodes.
I have simply (with replica 3) "set noout", reinstall one node (with new 
filesystem on the OSDs, but leave them in the
crushmap) and start all OSDs (at friday night) - takes app. less than one day 
for rebuild (11*4TB 1*8TB).
Do also stress the other nodes, but less than with weigting to zero.
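
Roughly, the sequence was (just a sketch from memory - recreating the OSDs
with their old IDs depends on your tooling, so adapt as needed):

ceph osd set noout      # prevent the OSDs from being marked "out" while they are down
# stop the OSDs on the node, reinstall it and recreate the OSD filesystems
# (keeping the same OSD IDs), then start the OSD daemons again;
# recovery then repopulates them from the other two replicas
ceph -s                 # wait until all PGs are active+clean again
ceph osd unset noout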

Udo

On 31.08.2015 06:07, Christian Balzer wrote:
> 
> Hello,
> 
> I'm about to add another storage node to a small firefly cluster here and
> refurbish 2 existing nodes (more RAM, different OSD disks).
> 
> Insert rant about not going to start using ceph-deploy as I would have to
> set the cluster to noin since "prepare" also activates things due to the
> udev magic...
> 
> This cluster is quite at the limits of its IOPS capacity (the HW was
> requested ages ago, but the mills here grind slowly and not particularly
> fine either), so the plan is to:
> 
> a) phase in the new node (let's call it C), one OSD at a time (in the dead
> of night)
> b) empty out old node A (weight 0), one OSD at a time. When
> done, refurbish and bring it back in, like above.
> c) repeat with 2nd old node B.
> 
> Looking at this it's obvious where the big optimization in this procedure
> would be, having the ability to "freeze" the OSDs on node B.
> That is making them ineligible for any new PGs while preserving their
> current status. 
> So that data moves from A to C (which is significantly faster than A or B)
> and then back to A when it is refurbished, avoiding any heavy lifting by B.
> 
> Does that sound like something other people might find useful as well and
> is it feasible w/o upsetting the CRUSH applecart?
> 
> Christian
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-31 Thread Wang, Warren
When we know we need to take a node out, we weight it down over time. Depending
on your cluster, you may need to do this over days or hours.

In theory, you could do the same when putting OSDs in, by setting noin,
then setting the weight to something very low and raising it over time. I
haven't tried this though.
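
Untested here as well, but roughly what I have in mind (osd.7 and the weight
steps are just placeholders - pick values and step sizes your cluster can
absorb):

# taking a node out: lower the CRUSH weight of each of its OSDs in steps,
# waiting for backfill to finish between steps
ceph osd crush reweight osd.7 1.0
ceph osd crush reweight osd.7 0.5
ceph osd crush reweight osd.7 0

# adding OSDs: keep them from being auto-marked "in", start with a tiny
# CRUSH weight, mark the OSD in, then ramp the weight up the same way
ceph osd set noin
ceph osd crush reweight osd.7 0.1
ceph osd in 7
ceph osd crush reweight osd.7 0.5    # ...and so on up to the full weight
ceph osd unset noin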

-- 
Warren Wang
Comcast Cloud (OpenStack)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-31 Thread Stillwell, Bryan
We have the following in our ceph.conf to bring in new OSDs with a weight
of 0:

[osd]
osd_crush_initial_weight = 0


We then set 'nobackfill' and bring in each OSD at full weight one at a
time (letting things settle down before bringing in the next OSD).  Once all
the OSDs are brought in, we unset 'nobackfill' and let Ceph take care of
the rest.  This seems to work pretty well for us.
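
Roughly, the whole sequence looks something like this (osd.42 and the 3.64
weight are just placeholders for whatever your drives work out to):

ceph osd set nobackfill               # PGs get remapped, but no data moves yet
ceph osd crush reweight osd.42 3.64   # bring the new OSD from 0 up to its full weight
ceph -s                               # let peering settle before the next OSD
# ...repeat the reweight for each new OSD, then:
ceph osd unset nobackfill             # let backfill run for all of them at once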

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-31 Thread Christian Balzer
On Mon, 31 Aug 2015 08:57:23 +0200 Udo Lembke wrote:

> Hi Christian,
> for my setup, "b" takes too long - too much data movement and stress on
> all nodes. I simply (with replica 3) "set noout", reinstall one
> node (with a new filesystem on the OSDs, but leave them in the crushmap)
> and start all OSDs again (on a Friday night) - the rebuild takes less than
> one day (11 * 4 TB + 1 * 8 TB). This also stresses the other nodes, but
> less than weighting them down to zero.
> 
I will hopefully have a good idea of what times and impacts I'm looking at
after a).
But I think that a massive push-pull like that would take too long for the
maintenance windows we have; also, the number of OSDs will change in the old
nodes.

Christian



-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-31 Thread Christian Balzer

Hello,

On Mon, 31 Aug 2015 22:44:05 + Stillwell, Bryan wrote:

> We have the following in our ceph.conf to bring in new OSDs with a weight
> of 0:
> 
> [osd]
> osd_crush_initial_weight = 0
> 
> 
> We then set 'nobackfill' and bring in each OSD at full weight one at a
> time (letting things settle down before bringing in the next OSD).  Once all
> the OSDs are brought in, we unset 'nobackfill' and let Ceph take care of
> the rest.  This seems to work pretty well for us.
> 
That looks interesting; I will give it a spin on my test cluster.

One thing the "letting things settle down" part reminded me of: adding
OSDs, and especially a whole new node, will cause (potentially significant)
data movement resulting from CRUSH map changes - something to keep in mind
when scheduling even those "harmless" first steps.
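
A quick way to gauge and follow that while letting things settle is just the
usual status output, e.g.:

ceph osd tree    # confirm the weights and host buckets ended up where expected
ceph -s          # summary of degraded objects and recovery progress
ceph -w          # follow the recovery/backfill activity live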

Christian



-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-31 Thread Gregory Farnum
On Mon, Aug 31, 2015 at 5:07 AM, Christian Balzer  wrote:
>
> Hello,
>
> I'm about to add another storage node to a small firefly cluster here and
> refurbish 2 existing nodes (more RAM, different OSD disks).
>
> Insert rant about not going to start using ceph-deploy as I would have to
> set the cluster to noin since "prepare" also activates things due to the
> udev magic...
>
> This cluster is quite at the limits of its IOPS capacity (the HW was
> requested ages ago, but the mills here grind slowly and not particularly
> fine either), so the plan is to:
>
> a) phase in the new node (let's call it C), one OSD at a time (in the dead
> of night)
> b) empty out old node A (weight 0), one OSD at a time. When
> done, refurbish and bring it back in, like above.
> c) repeat with 2nd old node B.
>
> Looking at this it's obvious where the big optimization in this procedure
> would be, having the ability to "freeze" the OSDs on node B.
> That is making them ineligible for any new PGs while preserving their
> current status.
> So that data moves from A to C (which is significantly faster than A or B)
> and then back to A when it is refurbished, avoiding any heavy lifting by B.
>
> Does that sound like something other people might find useful as well and
> is it feasible w/o upsetting the CRUSH applecart?

That's the rub, isn't it? Freezing an OSD implicitly means switching from
calculating locations to enumerating them. I can think of the beginnings of
a few hacks around that (mostly around our existing temp PG mappings),
but I don't think it's possible to scale them. :/
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-30 Thread Christian Balzer

Hello,

I'm about to add another storage node to a small firefly cluster here and
refurbish 2 existing nodes (more RAM, different OSD disks).

Insert rant about not going to start using ceph-deploy, as I would have to
set the cluster to noin since "prepare" also activates things due to the
udev magic...

This cluster is quite at the limits of its IOPS capacity (the HW was
requested ages ago, but the mills here grind slowly and not particularly
fine either), so the plan is to:

a) phase in the new node (let's call it C), one OSD at a time (in the dead
of night)
b) empty out old node A (weight 0, as sketched below), one OSD at a time. When
done, refurbish and bring it back in, like above.
c) repeat with 2nd old node B.
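
For b), the per-OSD step would presumably be just (osd.N standing in for each
OSD on node A, done one at a time):

ceph osd crush reweight osd.N 0   # drain this OSD; wait for backfill before the next one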

Looking at this, it's obvious where the big optimization in this procedure
would be: having the ability to "freeze" the OSDs on node B.
That is, making them ineligible for any new PGs while preserving their
current status.
So that data moves from A to C (which is significantly faster than A or B)
and then back to A when it is refurbished, avoiding any heavy lifting by B.

Does that sound like something other people might find useful as well and
is it feasible w/o upsetting the CRUSH applecart?

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com