Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
I would recommend stopping the OSD daemon, waiting for the cluster to notice
and react, and then starting it again.  The cluster has settings for
automatically marking a stopped OSD down and subsequently out on its own.
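
Something along these lines should work for scripted failure testing (an
untested sketch; it assumes systemd-managed OSDs, so adjust the unit name to
your init system):

    ID=6                              # example: the OSD to fail (osd.6 from earlier in the thread)
    systemctl stop ceph-osd@"$ID"     # simulate the failure
    ceph osd down "$ID"               # mark it down right away instead of waiting for the timeout
    ceph osd out "$ID"                # mark it out so rebalancing starts immediately
    # ... run the measurements here ...
    systemctl start ceph-osd@"$ID"    # bring the OSD back
    ceph osd in "$ID"                 # and let it take data again
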
On Wed, Sep 20, 2017 at 1:18 PM Jonas Jaszkowic <
jonasjaszkowic.w...@gmail.com> wrote:

> Thank you for this detailed information. In fact I am seeing spikes in my
> objects/second recovery. Unfortunately
> I had to shut down my cluster. But thanks for the help!
>
> In order to improve my scenario (marking an osd out to see how the cluster
> recovers),
> what is the best way to simulate an OSD failure without actually shutting
> down or powering off the whole machine?
> I want to automate the process with bash scripts.
>
> Currently I am just doing *ceph osd out <id>*. Should I additionally
> do *ceph osd down <id>* in combination with
> killing the OSD daemon on the specific host? What I am trying to do is
> basically the following:
>
> - Killing one OSD
> - Measuring recovery time, monitoring, etc.
> - Bringing the OSD back in
> - Again measuring recovery time, monitoring, etc
> - Enjoy a healthy cluster
>
> This is (quite) working as described with *ceph osd out <id>* and *ceph
> osd in <id>*, but I am wondering
> if this produces a realistic behavior.
>
>
> On 20.09.2017 at 18:06, David Turner wrote:
>
> When you posted your ceph status, you only had 56 PGs degraded.  Any value
> of osd_max_backfills or osd_recovery_max_active over 56 would not do
> anything.  What these settings do is dictate to each OSD the maximum amount
> of PGs that it can be involved in a recovery process at once.  If you had
> 56 PGs degraded and all of them were on a single OSD, then a value of 56
> would tell all of them to be able to run at the same time.  If they were
> more spread out across your cluster, then a lower setting would still allow
> all of the PGs to recover at the same time.
>
> Now you're talking about how many objects are recovering a second.  Note
> that your PGs are recovering and not backfilling.  Backfilling is moving
> all of the data for a PG from 1 OSD to another.  All objects need to be
> recovered and you'll see a much higher number of objects/second.  Recovery
> is just catching up after an OSD has been down for a bit, but never marked
> out.  It only needs to catch up on the objects that have been altered,
> created, or deleted since it was last caught up for the PG.  When the PG
> finishes its recovery state and is in a healthy state again, all of the
> objects that were in it but that didn't need to catch up are all at once
> marked recovered and you'll see spikes in your objects/second recovery.
>
> Your scenario (marking an OSD out to see how the cluster rebounds)
> shouldn't have a lot of PGs in recovery; they should all be in backfill
> because the data needs to shift between OSDs.  I'm guessing that had
> something to do with the OSD still being up while it was marked down or
> that you had some other OSDs in your cluster be marked down due to not
> responding or possibly being restarted due to an OOM killer from the
> kernel.  What is your current `ceph status`?
>
> On Wed, Sep 20, 2017 at 11:52 AM Jonas Jaszkowic <
> jonasjaszkowic.w...@gmail.com> wrote:
>
>> Thank you for the admin socket information and the hint to Luminous, I
>> will try it out when I have the time.
>>
>> What I noticed when looking at ceph -w is that the number of objects per
>> second recovering is still very low.
>> Meanwhile I set the options osd_recovery_max_active and osd_max_backfills
>> to very high numbers (4096, just to be sure).
>> Most of the time it is something like ‚0 objects/s recovering‘ or less
>> than ‚10 objects/s recovering‘, for example:
>>
>> 2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
>> (0.887%); 42131 kB/s, 3 objects/s recovering
>> 2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
>> (0.887%); 9655 kB/s, 2 objects/s recovering
>> 2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
>> (0.887%); 2034 kB/s, 0 objects/s recovering
>> 2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30553/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
>> (0.887%); 255 MB/s, 0 ob

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
Thank you for this detailed information. In fact I am seeing spikes in my 
objects/second recovery. Unfortunately
I had to shut down my cluster. But thanks for the help! 

In order to improve my scenario (marking an OSD out to see how the cluster
recovers),
what is the best way to simulate an OSD failure without actually shutting down
or powering off the whole machine?
I want to automate the process with bash scripts.

Currently I am just doing ceph osd out <id>. Should I additionally do ceph
osd down <id> in combination with
killing the OSD daemon on the specific host? What I am trying to do is 
basically the following:

- Killing one OSD
- Measuring recovery time, monitoring, etc.
- Bringing the OSD back in
- Again measuring recovery time, monitoring, etc
- Enjoy a healthy cluster

This is (quite) working as described with ceph osd out <id> and ceph osd in
<id>, but I am wondering
if this produces a realistic behavior.
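
A rough sketch of how I would time the recovery in bash (untested; it simply
polls ceph health until the cluster reports HEALTH_OK again):

    ID=6                             # example OSD to take out
    start=$(date +%s)
    ceph osd out "$ID"
    until ceph health | grep -q HEALTH_OK; do
        sleep 5
    done
    echo "recovery took $(( $(date +%s) - start )) seconds"
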


> On 20.09.2017 at 18:06, David Turner wrote:
> 
> When you posted your ceph status, you only had 56 PGs degraded.  Any value of 
> osd_max_backfills or osd_recovery_max_active over 56 would not do anything.  
> What these settings do is dictate to each OSD the maximum amount of PGs that 
> it can be involved in a recovery process at once.  If you had 56 PGs degraded 
> and all of them were on a single OSD, then a value of 56 would tell all of 
> them to be able to run at the same time.  If they were more spread out across 
> your cluster, then a lower setting would still allow all of the PGs to 
> recover at the same time.
> 
> Now you're talking about how many objects are recovering a second.  Note that 
> your PGs are recovering and not backfilling.  Backfilling is moving all of 
> the data for a PG from 1 OSD to another.  All objects need to be recovered 
> and you'll see a much higher number of objects/second.  Recovery is just 
> catching up after an OSD has been down for a bit, but never marked out.  It 
> only needs to catch up on the objects that have been altered, created, or 
> deleted since it was last caught up for the PG.  When the PG finishes its
> recovery state and is in a healthy state again, all of the objects that were 
> in it but that didn't need to catch up are all at once marked recovered and 
> you'll see spikes in your objects/second recovery.
> 
> Your scenario (marking an OSD out to see how the cluster rebounds) shouldn't 
> have a lot of PGs in recovery; they should all be in backfill because the data
> needs to shift between OSDs.  I'm guessing that had something to do with the 
> OSD still being up while it was marked down or that you had some other OSDs 
> in your cluster be marked down due to not responding or possibly being 
> restarted due to an OOM killer from the kernel.  What is your current `ceph 
> status`?
> 
> On Wed, Sep 20, 2017 at 11:52 AM Jonas Jaszkowic <
> jonasjaszkowic.w...@gmail.com> wrote:
> Thank you for the admin socket information and the hint to Luminous, I will 
> try it out when I have the time.
> 
> What I noticed when looking at ceph -w is that the number of objects per 
> second recovering is still very low.
> Meanwhile I set the options osd_recovery_max_active and osd_max_backfills to 
> very high numbers (4096, just to be sure).
> Most of the time it is something like ‚0 objects/s recovering‘ or less than 
> ‚10 objects/s recovering‘, for example:
> 
> 2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
> degraded (2.220%); 12205/1376215 objects misplaced (0.887%); 42131 kB/s, 3 
> objects/s recovering
> 2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
> degraded (2.220%); 12205/1376215 objects misplaced (0.887%); 9655 kB/s, 2 
> objects/s recovering
> 2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
> degraded (2.220%); 12204/1376215 objects misplaced (0.887%); 2034 kB/s, 0 
> objects/s recovering
> 2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30553/1376215 objects 
> degraded (2.220%); 12204/1376215 objects misplaced (0.887%); 255 MB/s, 0 
> objects/s recovering
> 2017-09-20 15:41:16.367734 mon.0 [INF] pgmap v16033: 256 pgs: 68 
> active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
> 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail; 30553/1376215 objects 
> degraded (2.220%); 12203/1376215 objects misplaced (0.887%); 2

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
Correction, if the OSD had been marked down and been marked out, some of
its PGs would be in a backfill state while others would be in a recovery
state depending on how long the OSD was marked down and how much
backfilling had completed in the cluster.
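
If you want to see that split at a glance, counting PGs by state works well
enough (a sketch only; the pgs_brief column order can differ between releases):

    # Count PGs per state; column 2 of `pgs_brief` is the state on this release.
    ceph pg dump pgs_brief 2>/dev/null | awk 'NR > 1 {print $2}' | sort | uniq -c | sort -rn
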

On Wed, Sep 20, 2017 at 12:06 PM David Turner  wrote:

> When you posted your ceph status, you only had 56 PGs degraded.  Any value
> of osd_max_backfills or osd_recovery_max_active over 56 would not do
> anything.  What these settings do is dictate to each OSD the maximum amount
> of PGs that it can be involved in a recovery process at once.  If you had
> 56 PGs degraded and all of them were on a single OSD, then a value of 56
> would tell all of them to be able to run at the same time.  If they were
> more spread out across your cluster, then a lower setting would still allow
> all of the PGs to recover at the same time.
>
> Now you're talking about how many objects are recovering a second.  Note
> that your PGs are recovering and not backfilling.  Backfilling is moving
> all of the data for a PG from 1 OSD to another.  All objects need to be
> recovered and you'll see a much higher number of objects/second.  Recovery
> is just catching up after an OSD has been down for a bit, but never marked
> out.  It only needs to catch up on the objects that have been altered,
> created, or deleted since it was last caught up for the PG.  When the PG
> finishes its recovery state and is in a healthy state again, all of the
> objects that were in it but that didn't need to catch up are all at once
> marked recovered and you'll see spikes in your objects/second recovery.
>
> Your scenario (marking an OSD out to see how the cluster rebounds)
> shouldn't have a lot of PGs in recovery; they should all be in backfill
> because the data needs to shift between OSDs.  I'm guessing that had
> something to do with the OSD still being up while it was marked down or
> that you had some other OSDs in your cluster be marked down due to not
> responding or possibly being restarted due to an OOM killer from the
> kernel.  What is your current `ceph status`?
>
> On Wed, Sep 20, 2017 at 11:52 AM Jonas Jaszkowic <
> jonasjaszkowic.w...@gmail.com> wrote:
>
>> Thank you for the admin socket information and the hint to Luminous, I
>> will try it out when I have the time.
>>
>> What I noticed when looking at ceph -w is that the number of objects per
>> second recovering is still very low.
>> Meanwhile I set the options osd_recovery_max_active and osd_max_backfills
>> to very high numbers (4096, just to be sure).
>> Most of the time it is something like ‚0 objects/s recovering‘ or less
>> than ‚10 objects/s recovering‘, for example:
>>
>> 2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
>> (0.887%); 42131 kB/s, 3 objects/s recovering
>> 2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
>> (0.887%); 9655 kB/s, 2 objects/s recovering
>> 2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30554/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
>> (0.887%); 2034 kB/s, 0 objects/s recovering
>> 2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
>> 30553/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
>> (0.887%); 255 MB/s, 0 objects/s recovering
>> 2017-09-20 15:41:16.367734 mon.0 [INF] pgmap v16033: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail;
>> 30553/1376215 objects degraded (2.220%); 12203/1376215 objects misplaced
>> (0.887%); 254 MB/s, 0 objects/s recovering
>> 2017-09-20 15:41:17.379183 mon.0 [INF] pgmap v16034: 256 pgs: 68
>> active+recovering+degraded, 15 active+remapped+backfilling, 173
>> active+clean; 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail;
>> 30549/1376215 objects degraded (2.220%); 12201/1376215 objects misplaced
>> (0.887%); 21868 kB/s, 3 objects/s recovering
>>
>> Is this an acceptable recovery rate? Unfortunately I have no point of
>> reference. My internal OSD network throughput is 500MBit/s (in a
>> virtualized Amazon EC2 environment).
>>
>> On 20.09.2017 at 17:45, David Turner wrote:
>>
>> You can always check what settings your daemons are running by querying
>> the admin

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
When you posted your ceph status, you only had 56 PGs degraded.  Any value
of osd_max_backfills or osd_recovery_max_active over 56 would not do
anything.  What these settings do is dictate to each OSD the maximum amount
of PGs that it can be involved in a recovery process at once.  If you had
56 PGs degraded and all of them were on a single OSD, then a value of 56
would tell all of them to be able to run at the same time.  If they were
more spread out across your cluster, then a lower setting would still allow
all of the PGs to recover at the same time.

Now you're talking about how many objects are recovering a second.  Note
that your PGs are recovering and not backfilling.  Backfilling is moving
all of the data for a PG from 1 OSD to another.  All objects need to be
recovered and you'll see a much higher number of objects/second.  Recovery
is just catching up after an OSD has been down for a bit, but never marked
out.  It only needs to catch up on the objects that have been altered,
created, or deleted since it was last caught up for the PG.  When the PG
finishes its recovery state and is in a healthy state again, all of the
objects that were in it but that didn't need to catch up are all at once
marked recovered and you'll see spikes in your objects/second recovery.

Your scenario (marking an OSD out to see how the cluster rebounds)
shouldn't have a lot of PGs in recovery; they should all be in backfill
because the data needs to shift between OSDs.  I'm guessing that had
something to do with the OSD still being up while it was marked down or
that you had some other OSDs in your cluster be marked down due to not
responding or possibly being restarted due to an OOM killer from the
kernel.  What is your current `ceph status`?
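
As a rough back-of-the-envelope (an illustration only, not an exact formula):
with 56 degraded PGs, 31 OSDs in, and k+m = 12 shards per PG, each OSD
participates in at most about 22 of those PGs, so per-OSD limits far above
that buy nothing.

    # Back-of-the-envelope upper bound on useful per-OSD concurrency.
    D=56; N=31; K=8; M=4
    echo $(( (D * (K + M) + N - 1) / N ))   # -> 22 PGs per OSD at most
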

On Wed, Sep 20, 2017 at 11:52 AM Jonas Jaszkowic <
jonasjaszkowic.w...@gmail.com> wrote:

> Thank you for the admin socket information and the hint to Luminous, I
> will try it out when I have the time.
>
> What I noticed when looking at ceph -w is that the number of objects per
> second recovering is still very low.
> Meanwhile I set the options osd_recovery_max_active and osd_max_backfills
> to very high numbers (4096, just to be sure).
> Most of the time it is something like ‚0 objects/s recovering‘ or less
> than ‚10 objects/s recovering‘, for example:
>
> 2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
> (0.887%); 42131 kB/s, 3 objects/s recovering
> 2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
> 30554/1376215 objects degraded (2.220%); 12205/1376215 objects misplaced
> (0.887%); 9655 kB/s, 2 objects/s recovering
> 2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
> 30554/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
> (0.887%); 2034 kB/s, 0 objects/s recovering
> 2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail;
> 30553/1376215 objects degraded (2.220%); 12204/1376215 objects misplaced
> (0.887%); 255 MB/s, 0 objects/s recovering
> 2017-09-20 15:41:16.367734 mon.0 [INF] pgmap v16033: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail;
> 30553/1376215 objects degraded (2.220%); 12203/1376215 objects misplaced
> (0.887%); 254 MB/s, 0 objects/s recovering
> 2017-09-20 15:41:17.379183 mon.0 [INF] pgmap v16034: 256 pgs: 68
> active+recovering+degraded, 15 active+remapped+backfilling, 173
> active+clean; 1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail;
> 30549/1376215 objects degraded (2.220%); 12201/1376215 objects misplaced
> (0.887%); 21868 kB/s, 3 objects/s recovering
>
> Is this an acceptable recovery rate? Unfortunately I have no point of
> reference. My internal OSD network throughput is 500MBit/s (in a
> virtualized Amazon EC2 environment).
>
> On 20.09.2017 at 17:45, David Turner wrote:
>
> You can always check what settings your daemons are running by querying
> the admin socket.  I'm linking you to the kraken version of the docs.
> AFAIK, the "unchangeable" is wrong, especially for these settings.  I don't
> know why it's there, but you can always query the admin socket to see your
> currently running settings to make sure that they took effect.
>
>
> http://docs.ceph.com/docs/kraken/rados/operations/monitoring/#using-the-admin-socket
>
> On Wed, Sep 20, 2017 at 11:42 AM David T

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
Thank you for the admin socket information and the hint to Luminous, I will try 
it out when I have the time.

What I noticed when looking at ceph -w is that the number of objects per second 
recovering is still very low.
Meanwhile I set the options osd_recovery_max_active and osd_max_backfills to 
very high numbers (4096, just to be sure).
Most of the time it is something like ‚0 objects/s recovering‘ or less than ‚10 
objects/s recovering‘, for example:

2017-09-20 15:41:12.341364 mon.0 [INF] pgmap v16029: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
degraded (2.220%); 12205/1376215 objects misplaced (0.887%); 42131 kB/s, 3 
objects/s recovering
2017-09-20 15:41:13.344684 mon.0 [INF] pgmap v16030: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
degraded (2.220%); 12205/1376215 objects misplaced (0.887%); 9655 kB/s, 2 
objects/s recovering
2017-09-20 15:41:14.352699 mon.0 [INF] pgmap v16031: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30554/1376215 objects 
degraded (2.220%); 12204/1376215 objects misplaced (0.887%); 2034 kB/s, 0 
objects/s recovering
2017-09-20 15:41:15.363921 mon.0 [INF] pgmap v16032: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7064 GB / 10075 GB avail; 30553/1376215 objects 
degraded (2.220%); 12204/1376215 objects misplaced (0.887%); 255 MB/s, 0 
objects/s recovering
2017-09-20 15:41:16.367734 mon.0 [INF] pgmap v16033: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail; 30553/1376215 objects 
degraded (2.220%); 12203/1376215 objects misplaced (0.887%); 254 MB/s, 0 
objects/s recovering
2017-09-20 15:41:17.379183 mon.0 [INF] pgmap v16034: 256 pgs: 68 
active+recovering+degraded, 15 active+remapped+backfilling, 173 active+clean; 
1975 GB data, 3011 GB used, 7063 GB / 10075 GB avail; 30549/1376215 objects 
degraded (2.220%); 12201/1376215 objects misplaced (0.887%); 21868 kB/s, 3 
objects/s recovering

Is this an acceptable recovery rate? Unfortunately I have no point of 
reference. My internal OSD network throughput is 500MBit/s (in a virtualized 
Amazon EC2 environment).
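
In case it helps, this is roughly how I sample the rate for later averaging
(an untested sketch; the wording of the status line differs between releases,
so the grep pattern may need adjusting):

    # Log the recovery rate reported by `ceph -s` once per second.
    while true; do
        rate=$(ceph -s | grep -oE '[0-9]+ objects/s' | head -1)
        echo "$(date +%T) ${rate:-0 objects/s}"
        sleep 1
    done
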

> On 20.09.2017 at 17:45, David Turner wrote:
> 
> You can always check what settings your daemons are running by querying the 
> admin socket.  I'm linking you to the kraken version of the docs.  AFAIK, the 
> "unchangeable" is wrong, especially for these settings.  I don't know why 
> it's there, but you can always query the admin socket to see your currently 
> running settings to make sure that they took effect.
> 
> http://docs.ceph.com/docs/kraken/rados/operations/monitoring/#using-the-admin-socket
>  
> 
> On Wed, Sep 20, 2017 at 11:42 AM David Turner wrote:
> You are currently on Kraken, but if you upgrade to Luminous you'll gain 
> access to the new setting `osd_recovery_sleep` which you can tweak.
> 
> The best way to deal with recovery speed vs client IO is to be aware of what 
> your cluster does.  If you have a time of day that you don't have much client 
> IO, then you can increase your recovery during that time.  Otherwise your 
> best bet is to do testing with these settings while watching `iostat -x 1` on 
> your OSDs to see what settings you need to maintain something around 80%
> disk utilization while client IO and recovery are happening.  That will ensure
> that your clients have some overhead to not notice the recovery.  If client
> IO isn't so critical that a minor speed decrease during recovery would be
> noticed, then you can aim for closer to 100% disk utilization with both
> client IO and recovery happening.
> 
> On Wed, Sep 20, 2017 at 11:30 AM Jean-Charles Lopez wrote:
> Hi,
> 
> you can play with the following 2 parameters:
> osd_recovery_max_active
> osd_max_backfills
> 
> The higher the values, the more PGs are processed at the same time.
> 
> Regards
> Jean-Charles LOPEZ
> jeanchlo...@mac.com 
> 
> 
> 
> JC Lopez
> Senior Technical Instructor, Global Storage Consulting Practice
> Red Hat, Inc.
> jelo...@redhat.com 
> +1 408-680-6959 
> 
>> On Sep 20, 2017, at 08:26, Jonas Jaszkowic wrote:
>> 
>> Thank you, that is very helpful. I didn’t know about the osd_max_backfills 
>> option. Recovery is now working faster. 
>> 
>> What is the best way to make recovery as fast as possible assuming that I do 
>> not 

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
You can always check what settings your daemons are running by querying the
admin socket.  I'm linking you to the kraken version of the docs.  AFAIK,
the "unchangeable" is wrong, especially for these settings.  I don't know
why it's there, but you can always query the admin socket to see your
currently running settings to make sure that they took effect.

http://docs.ceph.com/docs/kraken/rados/operations/monitoring/#using-the-admin-socket
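
For example, on the host where the OSD runs (the socket path shown is the
default and may differ on your hosts):

    # Ask the running daemon for its live value.
    ceph daemon osd.0 config get osd_max_backfills
    # Equivalent, going through the socket path directly:
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_recovery_max_active
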

On Wed, Sep 20, 2017 at 11:42 AM David Turner  wrote:

> You are currently on Kraken, but if you upgrade to Luminous you'll gain
> access to the new setting `osd_recovery_sleep` which you can tweak.
>
> The best way to deal with recovery speed vs client IO is to be aware of
> what your cluster does.  If you have a time of day that you don't have much
> client IO, then you can increase your recovery during that time.  Otherwise
> your best bet is to do testing with these settings while watching `iostat
> -x 1` on your OSDs to see what settings you need to maintain something
> around 80% disk utilization while client IO and recovery are happening.
> That will ensure that your clients have some overhead to not notice the
> recovery.  If client IO isn't so critical that a minor speed decrease
> during recovery would be noticed, then you can aim for closer to 100%
> disk utilization with both client IO and recovery happening.
>
> On Wed, Sep 20, 2017 at 11:30 AM Jean-Charles Lopez 
> wrote:
>
>> Hi,
>>
>> you can play with the following 2 parameters:
>> osd_recovery_max_active
>> osd_max_backfills
>>
>> The higher the values, the more PGs are processed at the same time.
>>
>> Regards
>> Jean-Charles LOPEZ
>> jeanchlo...@mac.com
>>
>>
>>
>> JC Lopez
>> Senior Technical Instructor, Global Storage Consulting Practice
>> Red Hat, Inc.
>> jelo...@redhat.com
>> +1 408-680-6959
>>
>> On Sep 20, 2017, at 08:26, Jonas Jaszkowic 
>> wrote:
>>
>> Thank you, that is very helpful. I didn’t know about the *osd_max_backfills
>> *option. Recovery is now working faster.
>>
>> What is the best way to make recovery as fast as possible assuming that I
>> do not care about read/write speed? (Besides
>> setting *osd_max_backfills *as high as possible). Are there any
>> important options that I have to know?
>>
>> What is the best practice to deal with the issue recovery speed vs.
>> read/write speed during a recovery situation? Do you
>> have any suggestions/references/hints how to deal with such situations?
>>
>>
>> On 20.09.2017 at 16:45, David Turner wrote:
>>
>> To help things look a little better, I would also stop the daemon for
>> osd.6 and mark it down `ceph osd down 6`.  Note that if the OSD is still
>> running it will likely mark itself back up and in on its own.  I don't
>> think that the OSD still running and being up in the cluster is causing the
>> issue, but it might.  After that, I would increase how many PGs can recover
>> at the same time by increasing osd_max_backfills `ceph tell osd.*
>> injectargs '--osd_max_backfills=5'`.  Note that for production you'll want
>> to set this number to something that doesn't negatively impact your client
>> IO, but high enough to help recover your cluster faster.  You can figure
>> out that number by increasing it 1 at a time and watching the OSD
>> performance with `iostat -x 1` or something to see how heavily used the
>> OSDs are during your normal usage and again during recovery while testing
>> the settings.  For testing, you can set it as high as you'd like (probably
>> no need to go above 20 as that will likely saturate your disks'
>> performance) to get the PGs out of the wait status and into active recovery
>> and backfilling.
>>
>> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic <
>> jonasjaszkowic.w...@gmail.com> wrote:
>>
>>> Output of *ceph status*:
>>>
>>> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>>>  health HEALTH_WARN
>>> 1 pgs backfill_wait
>>> 56 pgs degraded
>>> 1 pgs recovering
>>> 55 pgs recovery_wait
>>> 56 pgs stuck degraded
>>> 57 pgs stuck unclean
>>> recovery 50570/1369003 objects degraded (3.694%)
>>> recovery 854/1369003 objects misplaced (0.062%)
>>>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
>>> election epoch 4, quorum 0 ip-172-31-16-102
>>> mgr active: ip-172-31-16-102
>>>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
>>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>>>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
>>> 2923 GB used, 6836 GB / 9760 GB avail
>>> 50570/1369003 objects degraded (3.694%)
>>> 854/1369003 objects misplaced (0.062%)
>>>  199 active+clean
>>>   55 active+recovery_wait+degraded
>>>1 active+remapped+backfill_wait
>>>1 active+recovering+de

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
You are currently on Kraken, but if you upgrade to Luminous you'll gain
access to the new setting `osd_recovery_sleep` which you can tweak.

The best way to deal with recovery speed vs client IO is to be aware of
what your cluster does.  If you have a time of day that you don't have much
client IO, then you can increase your recovery during that time.  Otherwise
your best bet is to do testing with these settings while watching `iostat
-x 1` on your OSDs to see what settings you need to maintain something
around 80% disk utilization while client IO and recovery are happening.
That will ensure that your clients have some overhead to not notice the
recovery.  If client IO isn't so critical that a minor speed decrease
during recovery would be noticed, then you can aim for closer to 100%
disk utilization with both client IO and recovery happening.
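
Something like this keeps an eye on the %util column while you test (a sketch
that assumes the OSD data disks are named sdb through sdz; adjust the pattern
to your devices):

    # Print device name and %util (last column of `iostat -x`) once per second.
    iostat -x 1 | awk '/^sd[b-z]/ {printf "%s %s%%\n", $1, $NF}'
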

On Wed, Sep 20, 2017 at 11:30 AM Jean-Charles Lopez 
wrote:

> Hi,
>
> you can play with the following 2 parameters:
> osd_recovery_max_active
> osd_max_backfills
>
> The higher the values, the more PGs are processed at the same time.
>
> Regards
> Jean-Charles LOPEZ
> jeanchlo...@mac.com
>
>
>
> JC Lopez
> Senior Technical Instructor, Global Storage Consulting Practice
> Red Hat, Inc.
> jelo...@redhat.com
> +1 408-680-6959
>
> On Sep 20, 2017, at 08:26, Jonas Jaszkowic 
> wrote:
>
> Thank you, that is very helpful. I didn’t know about the *osd_max_backfills
> *option. Recovery is now working faster.
>
> What is the best way to make recovery as fast as possible assuming that I
> do not care about read/write speed? (Besides
> setting *osd_max_backfills *as high as possible). Are there any important
> options that I have to know?
>
> What is the best practice to deal with the issue recovery speed vs.
> read/write speed during a recovery situation? Do you
> have any suggestions/references/hints how to deal with such situations?
>
>
> On 20.09.2017 at 16:45, David Turner wrote:
>
> To help things look a little better, I would also stop the daemon for
> osd.6 and mark it down `ceph osd down 6`.  Note that if the OSD is still
> running it will likely mark itself back up and in on its own.  I don't
> think that the OSD still running and being up in the cluster is causing the
> issue, but it might.  After that, I would increase how many PGs can recover
> at the same time by increasing osd_max_backfills `ceph tell osd.*
> injectargs '--osd_max_backfills=5'`.  Note that for production you'll want
> to set this number to something that doesn't negatively impact your client
> IO, but high enough to help recover your cluster faster.  You can figure
> out that number by increasing it 1 at a time and watching the OSD
> performance with `iostat -x 1` or something to see how heavily used the
> OSDs are during your normal usage and again during recovery while testing
> the settings.  For testing, you can set it as high as you'd like (probably
> no need to go above 20 as that will likely saturate your disks'
> performance) to get the PGs out of the wait status and into active recovery
> and backfilling.
>
> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic <
> jonasjaszkowic.w...@gmail.com> wrote:
>
>> Output of *ceph status*:
>>
>> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>>  health HEALTH_WARN
>> 1 pgs backfill_wait
>> 56 pgs degraded
>> 1 pgs recovering
>> 55 pgs recovery_wait
>> 56 pgs stuck degraded
>> 57 pgs stuck unclean
>> recovery 50570/1369003 objects degraded (3.694%)
>> recovery 854/1369003 objects misplaced (0.062%)
>>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
>> election epoch 4, quorum 0 ip-172-31-16-102
>> mgr active: ip-172-31-16-102
>>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
>> 2923 GB used, 6836 GB / 9760 GB avail
>> 50570/1369003 objects degraded (3.694%)
>> 854/1369003 objects misplaced (0.062%)
>>  199 active+clean
>>   55 active+recovery_wait+degraded
>>1 active+remapped+backfill_wait
>>1 active+recovering+degraded
>>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
>>
>> Output of* ceph osd tree*:
>>
>> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>  -1 9.83984 root default
>>  -2 0.30750 host ip-172-31-24-96
>>   0 0.30750 osd.0  up  1.0  1.0
>>  -3 0.30750 host ip-172-31-30-32
>>   1 0.30750 osd.1  up  1.0  1.0
>>  -4 0.30750 host ip-172-31-28-36
>>   2 0.30750 osd.2  up  1.0  1.0
>>  -5 0.30750 host ip-172-31-18-100
>>   3 0.30750 osd.3   

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
So to speed things up I would basically do the following two things:

ceph tell osd.* injectargs '--osd_max_backfills=<value>'
ceph tell osd.* injectargs '--osd_recovery_max_active=<value>'

The second command returns the following output:

osd.0: osd_recovery_max_active = '5' (unchangeable)

The error code is 0, but what does the ‚unchangeable‘ mean? Am I inserting the 
options correctly?


> On 20.09.2017 at 17:30, Jean-Charles Lopez wrote:
> 
> Hi,
> 
> you can play with the following 2 parameters:
> osd_recovery_max_active
> osd_max_backfills
> 
> The higher the values, the more PGs are processed at the same time.
> 
> Regards
> Jean-Charles LOPEZ
> jeanchlo...@mac.com 
> 
> 
> 
> JC Lopez
> Senior Technical Instructor, Global Storage Consulting Practice
> Red Hat, Inc.
> jelo...@redhat.com 
> +1 408-680-6959
> 
>> On Sep 20, 2017, at 08:26, Jonas Jaszkowic wrote:
>> 
>> Thank you, that is very helpful. I didn’t know about the osd_max_backfills 
>> option. Recovery is now working faster. 
>> 
>> What is the best way to make recovery as fast as possible assuming that I do 
>> not care about read/write speed? (Besides
>> setting osd_max_backfills as high as possible). Are there any important 
>> options that I have to know?
>> 
>> What is the best practice to deal with the issue recovery speed vs. 
>> read/write speed during a recovery situation? Do you
>> have any suggestions/references/hints how to deal with such situations?
>> 
>> 
>>> On 20.09.2017 at 16:45, David Turner wrote:
>>> 
>>> To help things look a little better, I would also stop the daemon for osd.6 
>>> and mark it down `ceph osd down 6`.  Note that if the OSD is still running 
>>> it will likely mark itself back up and in on its own.  I don't think that 
>>> the OSD still running and being up in the cluster is causing the issue, but 
>>> it might.  After that, I would increase how many PGs can recover at the 
>>> same time by increasing osd_max_backfills `ceph tell osd.* injectargs 
>>> '--osd_max_backfills=5'`.  Note that for production you'll want to set this 
>>> number to something that doesn't negatively impact your client IO, but high 
>>> enough to help recover your cluster faster.  You can figure out that number 
>>> by increasing it 1 at a time and watching the OSD performance with `iostat 
>>> -x 1` or something to see how heavily used the OSDs are during your normal 
>>> usage and again during recover while testing the settings.  For testing, 
>>> you can set it as high as you'd like (probably no need to go above 20 as 
>>> that will likely saturate your disks' performance) to get the PGs out of 
>>> the wait status and into active recovery and backfilling.
>>> 
>>> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic <
>>> jonasjaszkowic.w...@gmail.com> wrote:
>>> Output of ceph status:
>>> 
>>> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>>>  health HEALTH_WARN
>>> 1 pgs backfill_wait
>>> 56 pgs degraded
>>> 1 pgs recovering
>>> 55 pgs recovery_wait
>>> 56 pgs stuck degraded
>>> 57 pgs stuck unclean
>>> recovery 50570/1369003 objects degraded (3.694%)
>>> recovery 854/1369003 objects misplaced (0.062%)
>>>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
>>> election epoch 4, quorum 0 ip-172-31-16-102
>>> mgr active: ip-172-31-16-102
>>>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
>>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>>>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
>>> 2923 GB used, 6836 GB / 9760 GB avail
>>> 50570/1369003 objects degraded (3.694%)
>>> 854/1369003 objects misplaced (0.062%)
>>>  199 active+clean
>>>   55 active+recovery_wait+degraded
>>>1 active+remapped+backfill_wait
>>>1 active+recovering+degraded
>>>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
>>> 
>>> Output of ceph osd tree:
>>> 
>>> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>  -1 9.83984 root default
>>>  -2 0.30750 host ip-172-31-24-96
>>>   0 0.30750 osd.0  up  1.0  1.0
>>>  -3 0.30750 host ip-172-31-30-32
>>>   1 0.30750 osd.1  up  1.0  1.0
>>>  -4 0.30750 host ip-172-31-28-36
>>>   2 0.30750 osd.2  up  1.0  1.0
>>>  -5 0.30750 host ip-172-31-18-100
>>>   3 0.30750 osd.3  up  1.0  1.0
>>>  -6 0.30750 host ip-172-31-25-240
>>>   4 0.30750 osd.4  up  1.0  1.0
>>>  -7 0.30750 host ip-172-31-24

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jean-Charles Lopez
Hi,

you can play with the following 2 parameters:
osd_recovery_max_active
osd_max_backfills

The higher the values, the more PGs are processed at the same time.
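
For example, a runtime change could look roughly like this (the values are
only examples and do not persist across daemon restarts; add them under [osd]
in ceph.conf to make them permanent):

    # Raise both limits at runtime on all OSDs.
    ceph tell osd.* injectargs '--osd_max_backfills=5 --osd_recovery_max_active=5'
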

Regards
Jean-Charles LOPEZ
jeanchlo...@mac.com



JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com 
+1 408-680-6959

> On Sep 20, 2017, at 08:26, Jonas Jaszkowic  
> wrote:
> 
> Thank you, that is very helpful. I didn’t know about the osd_max_backfills 
> option. Recovery is now working faster. 
> 
> What is the best way to make recovery as fast as possible assuming that I do 
> not care about read/write speed? (Besides
> setting osd_max_backfills as high as possible). Are there any important 
> options that I have to know?
> 
> What is the best practice to deal with the issue recovery speed vs. 
> read/write speed during a recovery situation? Do you
> have any suggestions/references/hints how to deal with such situations?
> 
> 
> On 20.09.2017 at 16:45, David Turner wrote:
>> 
>> To help things look a little better, I would also stop the daemon for osd.6 
>> and mark it down `ceph osd down 6`.  Note that if the OSD is still running 
>> it will likely mark itself back up and in on its own.  I don't think that 
>> the OSD still running and being up in the cluster is causing the issue, but 
>> it might.  After that, I would increase how many PGs can recover at the same 
>> time by increasing osd_max_backfills `ceph tell osd.* injectargs 
>> '--osd_max_backfills=5'`.  Note that for production you'll want to set this 
>> number to something that doesn't negatively impact your client IO, but high 
>> enough to help recover your cluster faster.  You can figure out that number 
>> by increasing it 1 at a time and watching the OSD performance with `iostat 
>> -x 1` or something to see how heavily used the OSDs are during your normal 
>> usage and again during recovery while testing the settings.  For testing, you
>> can set it as high as you'd like (probably no need to go above 20 as that 
>> will likely saturate your disks' performance) to get the PGs out of the wait 
>> status and into active recovery and backfilling.
>> 
>> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic <
>> jonasjaszkowic.w...@gmail.com> wrote:
>> Output of ceph status:
>> 
>> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>>  health HEALTH_WARN
>> 1 pgs backfill_wait
>> 56 pgs degraded
>> 1 pgs recovering
>> 55 pgs recovery_wait
>> 56 pgs stuck degraded
>> 57 pgs stuck unclean
>> recovery 50570/1369003 objects degraded (3.694%)
>> recovery 854/1369003 objects misplaced (0.062%)
>>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
>> election epoch 4, quorum 0 ip-172-31-16-102
>> mgr active: ip-172-31-16-102
>>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
>> flags sortbitwise,require_jewel_osds,require_kraken_osds
>>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
>> 2923 GB used, 6836 GB / 9760 GB avail
>> 50570/1369003 objects degraded (3.694%)
>> 854/1369003 objects misplaced (0.062%)
>>  199 active+clean
>>   55 active+recovery_wait+degraded
>>1 active+remapped+backfill_wait
>>1 active+recovering+degraded
>>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
>> 
>> Output of ceph osd tree:
>> 
>> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>  -1 9.83984 root default
>>  -2 0.30750 host ip-172-31-24-96
>>   0 0.30750 osd.0  up  1.0  1.0
>>  -3 0.30750 host ip-172-31-30-32
>>   1 0.30750 osd.1  up  1.0  1.0
>>  -4 0.30750 host ip-172-31-28-36
>>   2 0.30750 osd.2  up  1.0  1.0
>>  -5 0.30750 host ip-172-31-18-100
>>   3 0.30750 osd.3  up  1.0  1.0
>>  -6 0.30750 host ip-172-31-25-240
>>   4 0.30750 osd.4  up  1.0  1.0
>>  -7 0.30750 host ip-172-31-24-110
>>   5 0.30750 osd.5  up  1.0  1.0
>>  -8 0.30750 host ip-172-31-20-245
>>   6 0.30750 osd.6  up0  1.0
>>  -9 0.30750 host ip-172-31-17-241
>>   7 0.30750 osd.7  up  1.0  1.0
>> -10 0.30750 host ip-172-31-18-107
>>   8 0.30750 osd.8  up  1.0  1.0
>> -11 0.30750 host ip-172-31-21-170
>>   9 0.30750 osd.9  up  1.0  1.0
>> -12 0.30750 host ip-172-31-21-29
>>  10 0.30750 osd.10 up  1.0

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
Thank you, that is very helpful. I didn’t know about the osd_max_backfills 
option. Recovery is now working faster. 

What is the best way to make recovery as fast as possible assuming that I do 
not care about read/write speed? (Besides
setting osd_max_backfills as high as possible). Are there any important options 
that I have to know?

What is the best practice to deal with the issue recovery speed vs. read/write 
speed during a recovery situation? Do you
have any suggestions/references/hints how to deal with such situations?


> On 20.09.2017 at 16:45, David Turner wrote:
> 
> To help things look a little better, I would also stop the daemon for osd.6 
> and mark it down `ceph osd down 6`.  Note that if the OSD is still running it 
> will likely mark itself back up and in on its own.  I don't think that the 
> OSD still running and being up in the cluster is causing the issue, but it 
> might.  After that, I would increase how many PGs can recover at the same 
> time by increasing osd_max_backfills `ceph tell osd.* injectargs 
> '--osd_max_backfills=5'`.  Note that for production you'll want to set this 
> number to something that doesn't negatively impact your client IO, but high 
> enough to help recover your cluster faster.  You can figure out that number 
> by increasing it 1 at a time and watching the OSD performance with `iostat -x 
> 1` or something to see how heavily used the OSDs are during your normal usage 
> and again during recovery while testing the settings.  For testing, you can
> set it as high as you'd like (probably no need to go above 20 as that will 
> likely saturate your disks' performance) to get the PGs out of the wait 
> status and into active recovery and backfilling.
> 
> On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic <
> jonasjaszkowic.w...@gmail.com> wrote:
> Output of ceph status:
> 
> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>  health HEALTH_WARN
> 1 pgs backfill_wait
> 56 pgs degraded
> 1 pgs recovering
> 55 pgs recovery_wait
> 56 pgs stuck degraded
> 57 pgs stuck unclean
> recovery 50570/1369003 objects degraded (3.694%)
> recovery 854/1369003 objects misplaced (0.062%)
>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
> election epoch 4, quorum 0 ip-172-31-16-102
> mgr active: ip-172-31-16-102
>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
> flags sortbitwise,require_jewel_osds,require_kraken_osds
>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
> 2923 GB used, 6836 GB / 9760 GB avail
> 50570/1369003 objects degraded (3.694%)
> 854/1369003 objects misplaced (0.062%)
>  199 active+clean
>   55 active+recovery_wait+degraded
>1 active+remapped+backfill_wait
>1 active+recovering+degraded
>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
> 
> Output of ceph osd tree:
> 
> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 9.83984 root default
>  -2 0.30750 host ip-172-31-24-96
>   0 0.30750 osd.0  up  1.0  1.0
>  -3 0.30750 host ip-172-31-30-32
>   1 0.30750 osd.1  up  1.0  1.0
>  -4 0.30750 host ip-172-31-28-36
>   2 0.30750 osd.2  up  1.0  1.0
>  -5 0.30750 host ip-172-31-18-100
>   3 0.30750 osd.3  up  1.0  1.0
>  -6 0.30750 host ip-172-31-25-240
>   4 0.30750 osd.4  up  1.0  1.0
>  -7 0.30750 host ip-172-31-24-110
>   5 0.30750 osd.5  up  1.0  1.0
>  -8 0.30750 host ip-172-31-20-245
>   6 0.30750 osd.6  up0  1.0
>  -9 0.30750 host ip-172-31-17-241
>   7 0.30750 osd.7  up  1.0  1.0
> -10 0.30750 host ip-172-31-18-107
>   8 0.30750 osd.8  up  1.0  1.0
> -11 0.30750 host ip-172-31-21-170
>   9 0.30750 osd.9  up  1.0  1.0
> -12 0.30750 host ip-172-31-21-29
>  10 0.30750 osd.10 up  1.0  1.0
> -13 0.30750 host ip-172-31-23-220
>  11 0.30750 osd.11 up  1.0  1.0
> -14 0.30750 host ip-172-31-24-154
>  12 0.30750 osd.12 up  1.0  1.0
> -15 0.30750 host ip-172-31-26-25
>  13 0.30750 osd.13 up  1.0  1.0
> -16 0.30750 host ip-172-31-20-28
>  14 0.30750 osd.14 up  1.0  1.0
> -17 0.30750 host ip-172-31-23-90
>  15 0.30750 osd.15 up  1.0   

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread David Turner
To help things look a little better, I would also stop the daemon for osd.6
and mark it down `ceph osd down 6`.  Note that if the OSD is still running
it will likely mark itself back up and in on its own.  I don't think that
the OSD still running and being up in the cluster is causing the issue, but
it might.  After that, I would increase how many PGs can recover at the
same time by increasing osd_max_backfills `ceph tell osd.* injectargs
'--osd_max_backfills=5'`.  Note that for production you'll want to set this
number to something that doesn't negatively impact your client IO, but high
enough to help recover your cluster faster.  You can figure out that number
by increasing it 1 at a time and watching the OSD performance with `iostat
-x 1` or something to see how heavily used the OSDs are during your normal
usage and again during recovery while testing the settings.  For testing,
you can set it as high as you'd like (probably no need to go above 20 as
that will likely saturate your disks' performance) to get the PGs out of
the wait status and into active recovery and backfilling.
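
A rough way to script that stepwise increase might look like this (a sketch
only; watch iostat and client IO between steps and stop as soon as the disks
get too busy):

    # Raise osd_max_backfills one step at a time, pausing between steps.
    for n in 2 3 4 5; do
        echo "setting osd_max_backfills=$n"
        ceph tell osd.* injectargs "--osd_max_backfills=$n"
        sleep 300    # observe `iostat -x 1` and client latency before the next step
    done
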

On Wed, Sep 20, 2017 at 10:03 AM Jonas Jaszkowic <
jonasjaszkowic.w...@gmail.com> wrote:

> Output of *ceph status*:
>
> cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
>  health HEALTH_WARN
> 1 pgs backfill_wait
> 56 pgs degraded
> 1 pgs recovering
> 55 pgs recovery_wait
> 56 pgs stuck degraded
> 57 pgs stuck unclean
> recovery 50570/1369003 objects degraded (3.694%)
> recovery 854/1369003 objects misplaced (0.062%)
>  monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
> election epoch 4, quorum 0 ip-172-31-16-102
> mgr active: ip-172-31-16-102
>  osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
> flags sortbitwise,require_jewel_osds,require_kraken_osds
>   pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
> 2923 GB used, 6836 GB / 9760 GB avail
> 50570/1369003 objects degraded (3.694%)
> 854/1369003 objects misplaced (0.062%)
>  199 active+clean
>   55 active+recovery_wait+degraded
>1 active+remapped+backfill_wait
>1 active+recovering+degraded
>   client io 513 MB/s rd, 131 op/s rd, 0 op/s wr
>
> Output of* ceph osd tree*:
>
> ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 9.83984 root default
>  -2 0.30750 host ip-172-31-24-96
>   0 0.30750 osd.0  up  1.0  1.0
>  -3 0.30750 host ip-172-31-30-32
>   1 0.30750 osd.1  up  1.0  1.0
>  -4 0.30750 host ip-172-31-28-36
>   2 0.30750 osd.2  up  1.0  1.0
>  -5 0.30750 host ip-172-31-18-100
>   3 0.30750 osd.3  up  1.0  1.0
>  -6 0.30750 host ip-172-31-25-240
>   4 0.30750 osd.4  up  1.0  1.0
>  -7 0.30750 host ip-172-31-24-110
>   5 0.30750 osd.5  up  1.0  1.0
>  -8 0.30750 host ip-172-31-20-245
>   6 0.30750 osd.6  up0  1.0
>  -9 0.30750 host ip-172-31-17-241
>   7 0.30750 osd.7  up  1.0  1.0
> -10 0.30750 host ip-172-31-18-107
>   8 0.30750 osd.8  up  1.0  1.0
> -11 0.30750 host ip-172-31-21-170
>   9 0.30750 osd.9  up  1.0  1.0
> -12 0.30750 host ip-172-31-21-29
>  10 0.30750 osd.10 up  1.0  1.0
> -13 0.30750 host ip-172-31-23-220
>  11 0.30750 osd.11 up  1.0  1.0
> -14 0.30750 host ip-172-31-24-154
>  12 0.30750 osd.12 up  1.0  1.0
> -15 0.30750 host ip-172-31-26-25
>  13 0.30750 osd.13 up  1.0  1.0
> -16 0.30750 host ip-172-31-20-28
>  14 0.30750 osd.14 up  1.0  1.0
> -17 0.30750 host ip-172-31-23-90
>  15 0.30750 osd.15 up  1.0  1.0
> -18 0.30750 host ip-172-31-31-197
>  16 0.30750 osd.16 up  1.0  1.0
> -19 0.30750 host ip-172-31-29-195
>  17 0.30750 osd.17 up  1.0  1.0
> -20 0.30750 host ip-172-31-28-9
>  18 0.30750 osd.18 up  1.0  1.0
> -21 0.30750 host ip-172-31-25-199
>  19 0.30750 osd.19 up  1.0  1.0
> -22 0.30750 host ip-172-31-25-187
>  20 0.30750 osd.20 up  1.0  1.0
> -23 0.30750 host ip-172-31-31-57
>  21 0.30750 osd.21 up  1.0  

Re: [ceph-users] Ceph fails to recover

2017-09-20 Thread Jonas Jaszkowic
Output of ceph status:

cluster 18e87fd8-17c1-4045-a1a2-07aac106f200
 health HEALTH_WARN
1 pgs backfill_wait
56 pgs degraded
1 pgs recovering
55 pgs recovery_wait
56 pgs stuck degraded
57 pgs stuck unclean
recovery 50570/1369003 objects degraded (3.694%)
recovery 854/1369003 objects misplaced (0.062%)
 monmap e2: 1 mons at {ip-172-31-16-102=172.31.16.102:6789/0}
election epoch 4, quorum 0 ip-172-31-16-102
mgr active: ip-172-31-16-102
 osdmap e247: 32 osds: 32 up, 31 in; 1 remapped pgs
flags sortbitwise,require_jewel_osds,require_kraken_osds
  pgmap v10860: 256 pgs, 1 pools, 1975 GB data, 111 kobjects
2923 GB used, 6836 GB / 9760 GB avail
50570/1369003 objects degraded (3.694%)
854/1369003 objects misplaced (0.062%)
 199 active+clean
  55 active+recovery_wait+degraded
   1 active+remapped+backfill_wait
   1 active+recovering+degraded
  client io 513 MB/s rd, 131 op/s rd, 0 op/s wr

Output of ceph osd tree:

ID  WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 9.83984 root default
 -2 0.30750 host ip-172-31-24-96
  0 0.30750 osd.0  up  1.0  1.0
 -3 0.30750 host ip-172-31-30-32
  1 0.30750 osd.1  up  1.0  1.0
 -4 0.30750 host ip-172-31-28-36
  2 0.30750 osd.2  up  1.0  1.0
 -5 0.30750 host ip-172-31-18-100
  3 0.30750 osd.3  up  1.0  1.0
 -6 0.30750 host ip-172-31-25-240
  4 0.30750 osd.4  up  1.0  1.0
 -7 0.30750 host ip-172-31-24-110
  5 0.30750 osd.5  up  1.0  1.0
 -8 0.30750 host ip-172-31-20-245
  6 0.30750 osd.6  up0  1.0
 -9 0.30750 host ip-172-31-17-241
  7 0.30750 osd.7  up  1.0  1.0
-10 0.30750 host ip-172-31-18-107
  8 0.30750 osd.8  up  1.0  1.0
-11 0.30750 host ip-172-31-21-170
  9 0.30750 osd.9  up  1.0  1.0
-12 0.30750 host ip-172-31-21-29
 10 0.30750 osd.10 up  1.0  1.0
-13 0.30750 host ip-172-31-23-220
 11 0.30750 osd.11 up  1.0  1.0
-14 0.30750 host ip-172-31-24-154
 12 0.30750 osd.12 up  1.0  1.0
-15 0.30750 host ip-172-31-26-25
 13 0.30750 osd.13 up  1.0  1.0
-16 0.30750 host ip-172-31-20-28
 14 0.30750 osd.14 up  1.0  1.0
-17 0.30750 host ip-172-31-23-90
 15 0.30750 osd.15 up  1.0  1.0
-18 0.30750 host ip-172-31-31-197
 16 0.30750 osd.16 up  1.0  1.0
-19 0.30750 host ip-172-31-29-195
 17 0.30750 osd.17 up  1.0  1.0
-20 0.30750 host ip-172-31-28-9
 18 0.30750 osd.18 up  1.0  1.0
-21 0.30750 host ip-172-31-25-199
 19 0.30750 osd.19 up  1.0  1.0
-22 0.30750 host ip-172-31-25-187
 20 0.30750 osd.20 up  1.0  1.0
-23 0.30750 host ip-172-31-31-57
 21 0.30750 osd.21 up  1.0  1.0
-24 0.30750 host ip-172-31-20-64
 22 0.30750 osd.22 up  1.0  1.0
-25 0.30750 host ip-172-31-26-255
 23 0.30750 osd.23 up  1.0  1.0
-26 0.30750 host ip-172-31-18-146
 24 0.30750 osd.24 up  1.0  1.0
-27 0.30750 host ip-172-31-22-16
 25 0.30750 osd.25 up  1.0  1.0
-28 0.30750 host ip-172-31-26-152
 26 0.30750 osd.26 up  1.0  1.0
-29 0.30750 host ip-172-31-24-215
 27 0.30750 osd.27 up  1.0  1.0
-30 0.30750 host ip-172-31-24-138
 28 0.30750 osd.28 up  1.0  1.0
-31 0.30750 host ip-172-31-24-10
 29 0.30750 osd.29 up  1.0  1.0
-32 0.30750 host ip-172-31-20-79
 30 0.30750 osd.30 up  1.0  1.0
-33 0.30750 host ip-172-31-23-140
 31 0.30750 osd.31 up  1.0  1.0

Output of ceph health detail:

HEALTH_WARN 1 pgs backfill_wait; 55 pgs degraded; 1 pgs recovering; 54 pgs 
recovery_wait; 55 pgs stuck degraded; 56 pgs stuck unclean; recovery 
49688/1369003 objects degraded (3.630%); recovery 

Re: [ceph-users] Ceph fails to recover

2017-09-19 Thread David Turner
Can you please provide the output of `ceph status`, `ceph osd tree`, and
`ceph health detail`?  Thank you.

On Tue, Sep 19, 2017 at 2:59 PM Jonas Jaszkowic <
jonasjaszkowic.w...@gmail.com> wrote:

> Hi all,
>
> I have setup a Ceph cluster consisting of one monitor, 32 OSD hosts (1 OSD
> of size 320GB per host) and 16 clients which are reading
> and writing to the cluster. I have one erasure coded pool (shec plugin)
> with k=8, m=4, c=3 and pg_num=256. Failure domain is host.
> I am able to reach a HEALTH_OK state and everything is working as
> expected. The pool was populated with
> 114048 files of different sizes ranging from 1kB to 4GB. Total amount of
> data in the pool was around 3TB. The capacity of the
> pool was around 10TB.
>
> I want to evaluate how Ceph is rebalancing data in case of an OSD loss
> while clients are still reading. To do so, I am killing one OSD on purpose
> via *ceph osd out <id>* without adding a new one, i.e. I have 31 OSDs
> left. Ceph seems to notice this failure and starts to rebalance data
> which I can observe with the *ceph -w *command.
>
> However, Ceph failed to rebalance the data. The recovery process seemed
> to be stuck at a random point. I waited more than 12h but the
> number of degraded objects did not reduce and some PGs were stuck. Why is
> this happening? Based on the number of OSDs and the k,m,c values
> there should be enough hosts and OSDs to be able to recover from a single
> OSD failure?
>
> Thank you in advance!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph fails to recover

2017-09-19 Thread Jonas Jaszkowic
Hi all, 

I have setup a Ceph cluster consisting of one monitor, 32 OSD hosts (1 OSD of 
size 320GB per host) and 16 clients which are reading
and writing to the cluster. I have one erasure coded pool (shec plugin) with 
k=8, m=4, c=3 and pg_num=256. Failure domain is host.
I am able to reach a HEALTH_OK state and everything is working as expected. The 
pool was populated with
114048 files of different sizes ranging from 1kB to 4GB. Total amount of data 
in the pool was around 3TB. The capacity of the
pool was around 10TB.
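
For reference, the pool was created with commands roughly like the following
(a reconstruction from the parameters above; the profile name is illustrative,
and pre-Luminous releases use ruleset-failure-domain instead of
crush-failure-domain):

    # Sketch: erasure-coded pool with shec, k=8, m=4, c=3, 256 PGs, host failure domain.
    ceph osd erasure-code-profile set shec-8-4-3 plugin=shec k=8 m=4 c=3 \
        crush-failure-domain=host
    ceph osd pool create ecpool 256 256 erasure shec-8-4-3
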

I want to evaluate how Ceph is rebalancing data in case of an OSD loss while 
clients are still reading. To do so, I am killing one OSD on purpose
via ceph osd out <id> without adding a new one, i.e. I have 31 OSDs left.
Ceph seems to notice this failure and starts to rebalance data
which I can observe with the ceph -w command.

However, Ceph failed to rebalance the data. The recovery process seemed to be
stuck at a random point. I waited more than 12h but the
number of degraded objects did not reduce and some PGs were stuck. Why is this 
happening? Based on the number of OSDs and the k,m,c values 
there should be enough hosts and OSDs to be able to recover from a single OSD 
failure?

Thank you in advance!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com