Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-04-07 Thread Damian Dabrowski
OK, now I understand. Thanks for all these helpful answers!

On Sat, Apr 7, 2018, 15:26 David Turner  wrote:

> I'm seconding what Greg is saying. There is no reason to set nobackfill
> and norecover just for restarting OSDs. That will only cause the problems
> you're seeing without giving you any benefit. There are reasons to use
> norecover and nobackfill but unless you're manually editing the crush map,
> having osds consistently segfault, or for any other reason you really just
> need to stop the io from recovery, then they aren't the flags for you. Even
> at that, nobackfill is most likely what you need and norecover is still
> probably not helpful.
>
> On Wed, Apr 4, 2018, 6:59 PM Gregory Farnum  wrote:
>
>> On Thu, Mar 29, 2018 at 3:17 PM Damian Dabrowski 
>> wrote:
>>
>>> Greg, thanks for your reply!
>>>
>>> I think your idea makes sense. I did some tests, and the results are
>>> quite hard for me to understand, so I'll try to explain my situation
>>> in a few steps below.
>>> I think Ceph shows recovery progress, but it can only handle objects
>>> that haven't really changed; it won't try to repair objects that are
>>> truly degraded, because of the norecover flag. Am I right?
>>> After a while I see blocked requests (as you can see below).
>>>
>>
>> Yeah, so the implementation of this is a bit funky. Basically, when the
>> OSD gets a map specifying norecovery, it will prevent any new recovery ops
>> from starting once it processes that map. But it doesn't change the state
>> of the PGs out of recovery; they just won't queue up more work.
>>
>> So probably the existing recovery IO was from OSDs that weren't
>> up-to-date yet. Or maybe there's a bug in the norecover implementation; it
>> definitely looks a bit fragile.
>>
>> But really I just wouldn't use that command. It's an expert flag that you
>> shouldn't use except in some extreme wonky cluster situations (and even
>> those may no longer exist in modern Ceph). For the use case you shared in
>> your first email, I'd just stick with noout.
>> -Greg
>>
>>
>>>
>>> - FEW SEC AFTER START OSD -
>>> # ceph status
>>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>  health HEALTH_WARN
>>> 140 pgs degraded
>>> 1 pgs recovering
>>> 92 pgs recovery_wait
>>> 140 pgs stuck unclean
>>> recovery 942/5772119 objects degraded (0.016%)
>>> noout,nobackfill,norecover flag(s) set
>>>  monmap e10: 3 mons at
>>> {node-19=
>>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>>  osdmap e18727: 36 osds: 36 up, 30 in
>>> flags noout,nobackfill,norecover
>>>   pgmap v20851644: 1472 pgs, 7 pools, 8510 GB data, 1880 kobjects
>>> 25204 GB used, 17124 GB / 42329 GB avail
>>> 942/5772119 objects degraded (0.016%)
>>> 1332 active+clean
>>>   92 active+recovery_wait+degraded
>>>   47 active+degraded
>>>1 active+recovering+degraded
>>> recovery io 31608 kB/s, 4 objects/s
>>>   client io 73399 kB/s rd, 80233 kB/s wr, 1218 op/s
>>>
>>> - 1 MIN AFTER OSD START, RECOVERY STUCK, BLOCKED REQUESTS -
>>> # ceph status
>>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>  health HEALTH_WARN
>>> 140 pgs degraded
>>> 1 pgs recovering
>>> 109 pgs recovery_wait
>>> 140 pgs stuck unclean
>>> 80 requests are blocked > 32 sec
>>> recovery 847/5775929 objects degraded (0.015%)
>>> noout,nobackfill,norecover flag(s) set
>>>  monmap e10: 3 mons at
>>> {node-19=
>>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>>  osdmap e18727: 36 osds: 36 up, 30 in
>>> flags noout,nobackfill,norecover
>>>   pgmap v20851812: 1472 pgs, 7 pools, 8520 GB data, 1881 kobjects
>>> 25234 GB used, 17094 GB / 42329 GB avail
>>> 847/5775929 objects degraded (0.015%)
>>> 1332 active+clean
>>>  109 active+recovery_wait+degraded
>>>   30 active+degraded < degraded objects count got
>>> stuck
>>>1 active+recovering+degraded
>>> recovery io 3743 kB/s, 0 objects/s < depending on when the command
>>> is run, this line shows 0 objects/s or is missing entirely
>>>   client io 26521 kB/s rd, 64211 kB/s wr, 1212 op/s
>>>
>>> - FEW SECONDS AFTER UNSETTING FLAGS NOOUT, NORECOVERY, NOBACKFILL
>>> -
>>> # ceph status
>>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>  health HEALTH_WARN
>>> 134 pgs degraded
>>> 134 pgs recovery_wait
>>> 134 pgs stuck degraded
>>> 134 pgs stuck unclean

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-04-07 Thread David Turner
I'm seconding what Greg is saying. There is no reason to set nobackfill and
norecover just for restarting OSDs. That will only cause the problems
you're seeing without giving you any benefit. There are reasons to use
norecover and nobackfill but unless you're manually editing the crush map,
having osds consistently segfault, or for any other reason you really just
need to stop the io from recovery, then they aren't the flags for you. Even
at that, nobackfill is most likely what you need and norecover is still
probably not helpful.
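
For reference, the flag is actually called norecover (not norecovery), and
the currently set flags show up both in "ceph -s" and in the osdmap. A rough
sketch of checking and clearing them:

# ceph osd dump | grep ^flags
# ceph osd unset nobackfill
# ceph osd unset norecover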

On Wed, Apr 4, 2018, 6:59 PM Gregory Farnum  wrote:

> On Thu, Mar 29, 2018 at 3:17 PM Damian Dabrowski 
> wrote:
>
>> Greg, thanks for your reply!
>>
>> I think your idea makes sense. I did some tests, and the results are
>> quite hard for me to understand, so I'll try to explain my situation
>> in a few steps below.
>> I think Ceph shows recovery progress, but it can only handle objects
>> that haven't really changed; it won't try to repair objects that are
>> truly degraded, because of the norecover flag. Am I right?
>> After a while I see blocked requests (as you can see below).
>>
>
> Yeah, so the implementation of this is a bit funky. Basically, when the
> OSD gets a map specifying norecovery, it will prevent any new recovery ops
> from starting once it processes that map. But it doesn't change the state
> of the PGs out of recovery; they just won't queue up more work.
>
> So probably the existing recovery IO was from OSDs that weren't up-to-date
> yet. Or maybe there's a bug in the norecover implementation; it definitely
> looks a bit fragile.
>
> But really I just wouldn't use that command. It's an expert flag that you
> shouldn't use except in some extreme wonky cluster situations (and even
> those may no longer exist in modern Ceph). For the use case you shared in
> your first email, I'd just stick with noout.
> -Greg
>
>
>>
>> - FEW SEC AFTER START OSD -
>> # ceph status
>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>  health HEALTH_WARN
>> 140 pgs degraded
>> 1 pgs recovering
>> 92 pgs recovery_wait
>> 140 pgs stuck unclean
>> recovery 942/5772119 objects degraded (0.016%)
>> noout,nobackfill,norecover flag(s) set
>>  monmap e10: 3 mons at
>> {node-19=
>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>  osdmap e18727: 36 osds: 36 up, 30 in
>> flags noout,nobackfill,norecover
>>   pgmap v20851644: 1472 pgs, 7 pools, 8510 GB data, 1880 kobjects
>> 25204 GB used, 17124 GB / 42329 GB avail
>> 942/5772119 objects degraded (0.016%)
>> 1332 active+clean
>>   92 active+recovery_wait+degraded
>>   47 active+degraded
>>1 active+recovering+degraded
>> recovery io 31608 kB/s, 4 objects/s
>>   client io 73399 kB/s rd, 80233 kB/s wr, 1218 op/s
>>
>> - 1 MIN AFTER OSD START, RECOVERY STUCK, BLOCKED REQUESTS -
>> # ceph status
>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>  health HEALTH_WARN
>> 140 pgs degraded
>> 1 pgs recovering
>> 109 pgs recovery_wait
>> 140 pgs stuck unclean
>> 80 requests are blocked > 32 sec
>> recovery 847/5775929 objects degraded (0.015%)
>> noout,nobackfill,norecover flag(s) set
>>  monmap e10: 3 mons at
>> {node-19=
>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>  osdmap e18727: 36 osds: 36 up, 30 in
>> flags noout,nobackfill,norecover
>>   pgmap v20851812: 1472 pgs, 7 pools, 8520 GB data, 1881 kobjects
>> 25234 GB used, 17094 GB / 42329 GB avail
>> 847/5775929 objects degraded (0.015%)
>> 1332 active+clean
>>  109 active+recovery_wait+degraded
>>   30 active+degraded < degraded objects count got
>> stuck
>>1 active+recovering+degraded
>> recovery io 3743 kB/s, 0 objects/s < depending on when the command
>> is run, this line shows 0 objects/s or is missing entirely
>>   client io 26521 kB/s rd, 64211 kB/s wr, 1212 op/s
>>
>> - FEW SECONDS AFTER UNSETTING FLAGS NOOUT, NORECOVERY, NOBACKFILL
>> -
>> # ceph status
>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>  health HEALTH_WARN
>> 134 pgs degraded
>> 134 pgs recovery_wait
>> 134 pgs stuck degraded
>> 134 pgs stuck unclean
>> recovery 591/5778179 objects degraded (0.010%)
>>  monmap e10: 3 mons at
>> {node-19=
>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>> election epoch 724, quorum 0,1,2 node-19,node-21,node-20

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-04-04 Thread Gregory Farnum
On Thu, Mar 29, 2018 at 3:17 PM Damian Dabrowski  wrote:

> Greg, thanks for your reply!
>
> I think your idea makes sense. I did some tests, and the results are
> quite hard for me to understand, so I'll try to explain my situation
> in a few steps below.
> I think Ceph shows recovery progress, but it can only handle objects
> that haven't really changed; it won't try to repair objects that are
> truly degraded, because of the norecover flag. Am I right?
> After a while I see blocked requests (as you can see below).
>

Yeah, so the implementation of this is a bit funky. Basically, when the OSD
gets a map specifying norecovery, it will prevent any new recovery ops from
starting once it processes that map. But it doesn't change the state of the
PGs out of recovery; they just won't queue up more work.

So probably the existing recovery IO was from OSDs that weren't up-to-date
yet. Or maybe there's a bug in the norecover implementation; it definitely
looks a bit fragile.

But really I just wouldn't use that command. It's an expert flag that you
shouldn't use except in some extreme wonky cluster situations (and even
those may no longer exist in modern Ceph). For the use case you shared in
your first email, I'd just stick with noout.
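
For that kind of planned restart the whole procedure is roughly the following
(a sketch; osd.12 is just a placeholder id, and the stop/start commands depend
on your init system, e.g. "service ceph stop osd.12" on sysvinit or
"systemctl stop ceph-osd@12" on systemd):

# ceph osd set noout
# service ceph stop osd.12
(do the maintenance work)
# service ceph start osd.12
# ceph osd unset noout
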
-Greg


>
> - FEW SEC AFTER START OSD -
> # ceph status
> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>  health HEALTH_WARN
> 140 pgs degraded
> 1 pgs recovering
> 92 pgs recovery_wait
> 140 pgs stuck unclean
> recovery 942/5772119 objects degraded (0.016%)
> noout,nobackfill,norecover flag(s) set
>  monmap e10: 3 mons at
> {node-19=
> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>  osdmap e18727: 36 osds: 36 up, 30 in
> flags noout,nobackfill,norecover
>   pgmap v20851644: 1472 pgs, 7 pools, 8510 GB data, 1880 kobjects
> 25204 GB used, 17124 GB / 42329 GB avail
> 942/5772119 objects degraded (0.016%)
> 1332 active+clean
>   92 active+recovery_wait+degraded
>   47 active+degraded
>1 active+recovering+degraded
> recovery io 31608 kB/s, 4 objects/s
>   client io 73399 kB/s rd, 80233 kB/s wr, 1218 op/s
>
> - 1 MIN AFTER OSD START, RECOVERY STUCK, BLOCKED REQUESTS -
> # ceph status
> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>  health HEALTH_WARN
> 140 pgs degraded
> 1 pgs recovering
> 109 pgs recovery_wait
> 140 pgs stuck unclean
> 80 requests are blocked > 32 sec
> recovery 847/5775929 objects degraded (0.015%)
> noout,nobackfill,norecover flag(s) set
>  monmap e10: 3 mons at
> {node-19=
> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>  osdmap e18727: 36 osds: 36 up, 30 in
> flags noout,nobackfill,norecover
>   pgmap v20851812: 1472 pgs, 7 pools, 8520 GB data, 1881 kobjects
> 25234 GB used, 17094 GB / 42329 GB avail
> 847/5775929 objects degraded (0.015%)
> 1332 active+clean
>  109 active+recovery_wait+degraded
>   30 active+degraded < degraded objects count got stuck
>1 active+recovering+degraded
> recovery io 3743 kB/s, 0 objects/s < depending on when the command
> is run, this line shows 0 objects/s or is missing entirely
>   client io 26521 kB/s rd, 64211 kB/s wr, 1212 op/s
>
> - FEW SECONDS AFTER UNSETTING FLAGS NOOUT, NORECOVERY, NOBACKFILL -
> # ceph status
> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>  health HEALTH_WARN
> 134 pgs degraded
> 134 pgs recovery_wait
> 134 pgs stuck degraded
> 134 pgs stuck unclean
> recovery 591/5778179 objects degraded (0.010%)
>  monmap e10: 3 mons at
> {node-19=
> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>  osdmap e18730: 36 osds: 36 up, 30 in
>   pgmap v20851909: 1472 pgs, 7 pools, 8526 GB data, 1881 kobjects
> 25252 GB used, 17076 GB / 42329 GB avail
> 591/5778179 objects degraded (0.010%)
> 1338 active+clean
>  134 active+recovery_wait+degraded
> recovery io 191 MB/s, 26 objects/s
>   client io 100654 kB/s rd, 184 MB/s wr, 6303 op/s
>
>
>
> 2018-03-29 18:22 GMT+02:00 Gregory Farnum :
> >
> > On Thu, Mar 29, 2018 at 7:27 AM Damian Dabrowski 
> wrote:
> >>
> >> Hello,
> >>
> >> A few days ago I ran into a very strange situation.
> >>
> >> I had to turn off a few OSDs for a while, so I set the flags noout,
> >> nobackfill, and norecover and then turned off the selected OSDs.

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-03-29 Thread Damian Dabrowski
Greg, thanks for your reply!

I think your idea makes sense. I did some tests, and the results are
quite hard for me to understand, so I'll try to explain my situation
in a few steps below.
I think Ceph shows recovery progress, but it can only handle objects
that haven't really changed; it won't try to repair objects that are
truly degraded, because of the norecover flag. Am I right?
After a while I see blocked requests (as you can see below).

- FEW SEC AFTER START OSD -
# ceph status
cluster 848b340a-be27-45cb-ab66-3151d877a5a0
 health HEALTH_WARN
140 pgs degraded
1 pgs recovering
92 pgs recovery_wait
140 pgs stuck unclean
recovery 942/5772119 objects degraded (0.016%)
noout,nobackfill,norecover flag(s) set
 monmap e10: 3 mons at
{node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
election epoch 724, quorum 0,1,2 node-19,node-21,node-20
 osdmap e18727: 36 osds: 36 up, 30 in
flags noout,nobackfill,norecover
  pgmap v20851644: 1472 pgs, 7 pools, 8510 GB data, 1880 kobjects
25204 GB used, 17124 GB / 42329 GB avail
942/5772119 objects degraded (0.016%)
1332 active+clean
  92 active+recovery_wait+degraded
  47 active+degraded
   1 active+recovering+degraded
recovery io 31608 kB/s, 4 objects/s
  client io 73399 kB/s rd, 80233 kB/s wr, 1218 op/s

- 1 MIN AFTER OSD START, RECOVERY STUCK, BLOCKED REQUESTS -
# ceph status
cluster 848b340a-be27-45cb-ab66-3151d877a5a0
 health HEALTH_WARN
140 pgs degraded
1 pgs recovering
109 pgs recovery_wait
140 pgs stuck unclean
80 requests are blocked > 32 sec
recovery 847/5775929 objects degraded (0.015%)
noout,nobackfill,norecover flag(s) set
 monmap e10: 3 mons at
{node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
election epoch 724, quorum 0,1,2 node-19,node-21,node-20
 osdmap e18727: 36 osds: 36 up, 30 in
flags noout,nobackfill,norecover
  pgmap v20851812: 1472 pgs, 7 pools, 8520 GB data, 1881 kobjects
25234 GB used, 17094 GB / 42329 GB avail
847/5775929 objects degraded (0.015%)
1332 active+clean
 109 active+recovery_wait+degraded
  30 active+degraded < degraded objects count got stuck
   1 active+recovering+degraded
recovery io 3743 kB/s, 0 objects/s < depending on when the command
is run, this line shows 0 objects/s or is missing entirely
  client io 26521 kB/s rd, 64211 kB/s wr, 1212 op/s

- FEW SECONDS AFTER UNSETTING FLAGS NOOUT, NORECOVERY, NOBACKFILL -
# ceph status
cluster 848b340a-be27-45cb-ab66-3151d877a5a0
 health HEALTH_WARN
134 pgs degraded
134 pgs recovery_wait
134 pgs stuck degraded
134 pgs stuck unclean
recovery 591/5778179 objects degraded (0.010%)
 monmap e10: 3 mons at
{node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
election epoch 724, quorum 0,1,2 node-19,node-21,node-20
 osdmap e18730: 36 osds: 36 up, 30 in
  pgmap v20851909: 1472 pgs, 7 pools, 8526 GB data, 1881 kobjects
25252 GB used, 17076 GB / 42329 GB avail
591/5778179 objects degraded (0.010%)
1338 active+clean
 134 active+recovery_wait+degraded
recovery io 191 MB/s, 26 objects/s
  client io 100654 kB/s rd, 184 MB/s wr, 6303 op/s
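
(By the way, when "requests are blocked > 32 sec" shows up as in the second
snapshot above, something like the following should tell you which OSDs the
blocked requests are sitting on; the exact wording of the output differs
between releases:)

# ceph health detail | grep -i blocked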



2018-03-29 18:22 GMT+02:00 Gregory Farnum :
>
> On Thu, Mar 29, 2018 at 7:27 AM Damian Dabrowski  wrote:
>>
>> Hello,
>>
>> A few days ago I ran into a very strange situation.
>>
>> I had to turn off a few OSDs for a while, so I set the flags noout,
>> nobackfill, and norecover and then turned off the selected OSDs.
>> Everything was fine, but when I started these OSDs again, all VMs went
>> down due to the recovery process (even though recovery priority was very low).
>
>
> So you forbade the OSDs from doing any recovery work, but then you turned on
> old ones that required recovery work to function properly?
>
> And your cluster stopped functioning?
>
>
>>
>>
>> Here are the more important config values:
>> "osd_recovery_threads": "1",
>> "osd_recovery_thread_timeout": "30",
>> "osd_recovery_thread_suicide_timeout": "300",
>> "osd_recovery_delay_start": "0",
>> "osd_recovery_max_active": "1",
>> "osd_recovery_max_single_start": "5",
>> "osd_recovery_max_chunk": "8388608",
>> "osd_client_op_priority": "63",
>> "osd_recovery_op_priority": "1",
>> "osd_recovery_op_warn_multiple": "16",
>> "osd_backfill_full_ratio": "0.85",
>> "osd_backfill_retry_interval": "10",
>> "osd_backfill_scan_min": "64",
>> "osd_backfill_scan_max": "512",
>> "osd_kill_backfill_at": "0",

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-03-29 Thread Gregory Farnum
On Thu, Mar 29, 2018 at 7:27 AM Damian Dabrowski  wrote:

> Hello,
>
> A few days ago I ran into a very strange situation.
>
> I had to turn off a few OSDs for a while, so I set the flags noout,
> nobackfill, and norecover and then turned off the selected OSDs.
> Everything was fine, but when I started these OSDs again, all VMs went
> down due to the recovery process (even though recovery priority was very low).


So you forbade the OSDs from doing any recovery work, but then you turned
on old ones that required recovery work to function properly?

And your cluster stopped functioning?



>
> Here are the more important config values:
> "osd_recovery_threads": "1",
> "osd_recovery_thread_timeout": "30",
> "osd_recovery_thread_suicide_timeout": "300",
> "osd_recovery_delay_start": "0",
> "osd_recovery_max_active": "1",
> "osd_recovery_max_single_start": "5",
> "osd_recovery_max_chunk": "8388608",
> "osd_client_op_priority": "63",
> "osd_recovery_op_priority": "1",
> "osd_recovery_op_warn_multiple": "16",
> "osd_backfill_full_ratio": "0.85",
> "osd_backfill_retry_interval": "10",
> "osd_backfill_scan_min": "64",
> "osd_backfill_scan_max": "512",
> "osd_kill_backfill_at": "0",
> "osd_max_backfills": "1",
>
>
>
> I don't know why Ceph started the recovery process while the
> norecover flag was enabled, but the fact is that it killed all
> the VMs.


Did it actually start recovering? Or you just saw client IO pause?
I confess I don’t know what the behavior will be like with that combined
set of flags, but I rather suspect it did what you told it to, and some PGs
went down as a result.
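
One way to tell those two situations apart is to check whether any PGs are
actually stuck rather than merely slow; a quick check would be something
like:

# ceph pg dump_stuck inactive
# ceph pg dump_stuck unclean
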
-Greg



>
> Next, I unset the noout, nobackfill, and norecover flags and things
> started to look better. VMs went back online and the recovery process
> was still going. I didn't see a performance impact on the SSD disks,
> but there was a huge impact on the spinners.
> Normally %util is about 25%, but during recovery it was nearly 100%.
> CPU load on the HDD-based VMs increased by ~400%.
>
> iostat fragment (during recovery):
> Device:  rrqm/s  wrqm/s     r/s    w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdh        0.30    1.00  150.90  36.00  13665.60  954.60   156.45    10.63  56.88   25.60  188.02   5.34  99.80
>
>
> Now I'm a little lost; I don't know the answers to a few questions.
> 1. Why did Ceph start recovery even though the nobackfill flag was set?
> 2. Why did recovery cause a much bigger performance impact while the
> norecover flag was enabled?
> 3. Why, once norecover was unset, did the cluster start to look better
> while %util on the HDD disks was still so high (even with
> recovery_op_priority=1 and client_op_priority=63)? 25% is normal, but it
> rose to 100% during recovery.
>
>
> Cluster information:
> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
> 3x nodes (CPU E5-2630, 32 GB RAM, 6x 2 TB HDD with SSD journal, 3x 1 TB
> SSD with NVMe journal), triple replication
>
>
> I would be very grateful if somebody could help me.
> Sorry if I've done something the wrong way - this is my first time
> writing to a mailing list.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-03-29 Thread Damian Dabrowski
Hello,

A few days ago I ran into a very strange situation.

I had to turn off a few OSDs for a while, so I set the flags noout,
nobackfill, and norecover and then turned off the selected OSDs.
Everything was fine, but when I started these OSDs again, all VMs went
down due to the recovery process (even though recovery priority was very low).
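
(In other words, roughly the following before stopping the OSDs:)

# ceph osd set noout
# ceph osd set nobackfill
# ceph osd set norecover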

Here are the more important config values:
"osd_recovery_threads": "1",
"osd_recovery_thread_timeout": "30",
"osd_recovery_thread_suicide_timeout": "300",
"osd_recovery_delay_start": "0",
"osd_recovery_max_active": "1",
"osd_recovery_max_single_start": "5",
"osd_recovery_max_chunk": "8388608",
"osd_client_op_priority": "63",
"osd_recovery_op_priority": "1",
"osd_recovery_op_warn_multiple": "16",
"osd_backfill_full_ratio": "0.85",
"osd_backfill_retry_interval": "10",
"osd_backfill_scan_min": "64",
"osd_backfill_scan_max": "512",
"osd_kill_backfill_at": "0",
"osd_max_backfills": "1",



I don't know why Ceph started the recovery process while the
norecover flag was enabled, but the fact is that it killed all
the VMs.

Next, I unset the noout, nobackfill, and norecover flags and things
started to look better. VMs went back online and the recovery process
was still going. I didn't see a performance impact on the SSD disks,
but there was a huge impact on the spinners.
Normally %util is about 25%, but during recovery it was nearly 100%.
CPU load on the HDD-based VMs increased by ~400%.

iostat fragment (during recovery):
Device:  rrqm/s  wrqm/s     r/s    w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdh        0.30    1.00  150.90  36.00  13665.60  954.60   156.45    10.63  56.88   25.60  188.02   5.34  99.80
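
(The columns above are iostat's extended device statistics, i.e. output along
the lines of the following, sampled while recovery was running:)

# iostat -x 1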


Now I'm a little lost; I don't know the answers to a few questions.
1. Why did Ceph start recovery even though the nobackfill flag was set?
2. Why did recovery cause a much bigger performance impact while the
norecover flag was enabled?
3. Why, once norecover was unset, did the cluster start to look better
while %util on the HDD disks was still so high (even with
recovery_op_priority=1 and client_op_priority=63)? 25% is normal, but it
rose to 100% during recovery.


Cluster information:
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
3x nodes (CPU E5-2630, 32 GB RAM, 6x 2 TB HDD with SSD journal, 3x 1 TB
SSD with NVMe journal), triple replication


I would be very grateful if somebody could help me.
Sorry if I've done something the wrong way - this is my first time
writing to a mailing list.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com