On Thu, Mar 29, 2018 at 7:27 AM Damian Dabrowski <scoot...@gmail.com> wrote:

> Hello,
>
> A few days ago I ran into a very strange situation.
>
> I had to turn off a few OSDs for a while, so I set the noout,
> nobackfill and norecover flags and then shut down the selected OSDs.
> Everything was fine, but when I started these OSDs again, all the VMs
> went down due to the recovery process (even though the recovery
> priority was very low).
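>
> For reference, I set the flags with the standard commands, roughly
> like this (a sketch):
>
>     # keep OSDs from being marked out, and block recovery/backfill
>     ceph osd set noout
>     ceph osd set nobackfill
>     ceph osd set norecover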


So you forbade the OSDs from doing any recovery work, but then you turned
the old ones back on, which required recovery work before they could
function properly?

And your cluster stopped functioning?



>
> Here are the most relevant config values:
>     "osd_recovery_threads": "1",
>     "osd_recovery_thread_timeout": "30",
>     "osd_recovery_thread_suicide_timeout": "300",
>     "osd_recovery_delay_start": "0",
>     "osd_recovery_max_active": "1",
>     "osd_recovery_max_single_start": "5",
>     "osd_recovery_max_chunk": "8388608",
>     "osd_client_op_priority": "63",
>     "osd_recovery_op_priority": "1",
>     "osd_recovery_op_warn_multiple": "16",
>     "osd_backfill_full_ratio": "0.85",
>     "osd_backfill_retry_interval": "10",
>     "osd_backfill_scan_min": "64",
>     "osd_backfill_scan_max": "512",
>     "osd_kill_backfill_at": "0",
>     "osd_max_backfills": "1",
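>
> (For reference, these were dumped from a running OSD with something
> like the following; osd.0 is just a placeholder:
>
>     # show recovery/backfill settings via the admin socket
>     ceph daemon osd.0 config show | grep -E 'recovery|backfill'
> )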
>
>
>
> I don't know why Ceph started the recovery process while the norecover
> and nobackfill flags were enabled, but the fact is that it killed all
> the VMs.
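>
> (For what it's worth, the active cluster flags show up in "ceph -s"
> and can also be checked with:
>
>     # the osdmap lists all currently set flags
>     ceph osd dump | grep flags
> )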


Did it actually start recovering? Or did you just see client IO pause?
I confess I don't know what the behavior is with that combined set of
flags, but I rather suspect it did what you told it to, and some PGs
went down as a result.
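One way to tell the difference (just a sketch) is to look for inactive
or down PGs rather than recovery traffic while the flags are set:

    # down/peering PGs mean blocked client IO, not recovery load
    ceph health detail
    ceph pg dump_stuck inactive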
-Greg



>
> Next, I cleared the noout, nobackfill and norecover flags and things
> started to look better. The VMs went back online while the recovery
> process was still running. I didn't see a performance impact on the
> SSD disks, but there was a huge impact on the spinners.
> Normally %util is around 25%, but during recovery it was nearly 100%,
> and CPU load on the HDD-based VMs increased by ~400%.
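>
> (I cleared the flags the same way they were set, roughly:
>
>     ceph osd unset noout
>     ceph osd unset nobackfill
>     ceph osd unset norecover
> )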
>
> iostat fragment (during recovery):
>
>     Device:  rrqm/s  wrqm/s     r/s    w/s     rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>     sdh        0.30    1.00  150.90  36.00  13665.60  954.60    156.45     10.63  56.88    25.60   188.02   5.34  99.80
>
>
> Now I'm a little lost and can't answer a few questions:
> 1. Why did Ceph start recovery at all while the nobackfill and
> norecover flags were enabled?
> 2. Why did recovery cause a much bigger performance impact while the
> norecover and nobackfill flags were enabled?
> 3. Why, once norecover and nobackfill were cleared, did the cluster
> start to look better while %util on the HDD disks stayed so high (even
> with recovery_op_priority=1 and client_op_priority=63)? 25% is normal,
> but it rose to nearly 100% during recovery.
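>
> (For reference, these throttles can also be changed at runtime with
> injectargs, e.g. roughly:
>
>     # throttle recovery/backfill further (values illustrative)
>     ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
> )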
>
>
> Cluster information:
> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
> 3x nodes (CPU E5-2630, 32GB RAM, 6x 2TB HDD with SSD journal, 3x 1TB
> SSD with NVMe journal), triple replication
>
>
> I would be very grateful if somebody could help me.
> Sorry if I've done something the wrong way; this is my first time
> writing to a mailing list.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
