David,

You are probably already aware of the filestore random split for existing OSD 
PGs, `filestore split rand factor`; maybe you could try that too.
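For reference, it's an [osd]-level option, e.g. something like this in ceph.conf 
(the value below is just an example; if I remember right, a nonzero factor adds a 
random offset to the split threshold so the PG directories don't all hit the 
split point at the same time):

[osd]
filestore split rand factor = 20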

Thanks,
-Pavan.

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David Turner 
<drakonst...@gmail.com>
Date: Monday, March 19, 2018 at 1:36 PM
To: Caspar Smit <caspars...@supernas.eu>
Cc: ceph-users <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Backfilling on Luminous

Sorry for being away. I set all of my backfilling to VERY slow settings over 
the weekend and things have been stable, but incredibly slow (only 1% recovered 
all weekend, from 3% misplaced down to 2%).  I'm back on it now and well rested.

@Caspar, SWAP isn't being used on these nodes and all of the affected OSDs have 
been filestore.

@Dan, I think you hit the nail on the head.  I didn't know that logging was 
added for subfolder splitting in Luminous!!! That's AMAZING!!!!  We are seeing 
consistent subfolder splitting all across the cluster.  The majority of the 
crashed OSDs show a split starting before the crash and then a mention of it 
in the crash dump.  Looks like I just need to write a daemon to watch for 
splitting to start and throttle recovery until it's done.

I had injected the following timeout settings, but they didn't seem to affect 
anything.  I may have needed to place them in ceph.conf and let the OSDs pick up 
the new settings as they crashed and restarted, but I didn't really want 
different settings on some OSDs in the cluster.

osd_op_thread_suicide_timeout=1200  (from 180)
osd_recovery_thread_timeout=300  (from 30)
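
(For anyone wanting to do the same: I believe the runtime form is 
`ceph tell osd.* injectargs '--osd_op_thread_suicide_timeout 1200 --osd_recovery_thread_timeout 300'`, 
and persisting them would just be the equivalent lines under [osd] in ceph.conf, e.g.

[osd]
osd op thread suicide timeout = 1200
osd recovery thread timeout = 300

which only take effect on each OSD as it restarts.)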

My game plan for now is to watch for splitting in the log, increase recovery 
sleep, decrease osd_recovery_max_active, and wait for splitting to finish 
before setting them back to more aggressive values.  After this cluster is 
done backfilling I'm going to do my best to reproduce this scenario in a test 
environment and open a ticket so that whatever makes this so detrimental can 
hopefully be fixed.
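
Here's a rough sketch of the watcher I have in mind, in case anyone wants to poke 
holes in it.  Completely untested; the log path is just an example, the "split 
started/finished" regexes are placeholders until I pin down the exact wording 
Luminous logs, and the "back to normal" values are just the defaults as far as I 
know (0.1 sleep, 3 max active):

#!/usr/bin/env python
# Watch one OSD log for subfolder splitting and throttle recovery while it runs.
import re
import subprocess
import time

LOG = '/var/log/ceph/ceph-osd.12.log'         # example path; run one watcher per OSD log
SPLIT_START = re.compile(r'splitting')        # placeholder: match the real "split started" line
SPLIT_DONE = re.compile(r'split.*complete')   # placeholder: match the real "split finished" line

def inject(args):
    # Apply the settings at runtime to every OSD.
    subprocess.check_call(['ceph', 'tell', 'osd.*', 'injectargs', args])

def follow(path):
    # Poor man's tail -f.
    with open(path) as f:
        f.seek(0, 2)                          # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(1)

throttled = False
for line in follow(LOG):
    if SPLIT_START.search(line) and not throttled:
        inject('--osd_recovery_sleep_hdd 1.0 --osd_recovery_max_active 1')
        throttled = True
    elif SPLIT_DONE.search(line) and throttled:
        inject('--osd_recovery_sleep_hdd 0.1 --osd_recovery_max_active 3')
        throttled = False

Obviously naive (one log, one split at a time), but it should be enough to keep 
recovery from piling onto an OSD mid-split.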


On Fri, Mar 16, 2018 at 4:00 AM Caspar Smit 
<caspars...@supernas.eu> wrote:
Hi David,

What about memory usage?

[1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on Intel DC 
P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB RAM.

If you upgrade to bluestore, memory usage will likely increase: 15 x 10TB ≈ 150GB 
of RAM needed (roughly 1GB per TB), especially in recovery/backfilling scenarios 
like these.

Kind regards,
Caspar


2018-03-15 21:53 GMT+01:00 Dan van der Ster 
<d...@vanderster.com>:
Did you use perf top or iotop to try to identify where the osd is stuck?
Did you try increasing the op thread suicide timeout from 180s?

Splitting should log at the beginning and end of an op, so it should be clear 
if it's taking longer than the timeout.

.. Dan



On Mar 15, 2018 9:23 PM, "David Turner" 
<drakonst...@gmail.com> wrote:
I am aware of the filestore splitting happening.  I manually split all of the 
subfolders a couple of weeks ago on this cluster, but every time we backfill, 
the newly moved PGs have a chance to split before the backfilling is done.  When 
that has happened in the past it has caused some blocked requests and flapping 
OSDs if we didn't increase osd_heartbeat_grace, but it has never consistently 
killed the OSDs during the task.  Maybe that's new in Luminous due to some of 
the priority and timeout settings.
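
(For anyone wanting to do the same offline pre-split: I believe it's done per OSD 
with the OSD stopped, roughly by bumping filestore merge threshold / filestore 
split multiple in ceph.conf and then running something like 
`ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID --op apply-layout-settings --pool <poolname>` 
before starting the OSD again.)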

This problem in general seems unrelated to the subfolder splitting, though, 
since it started happening very early in the backfilling process, definitely 
before many of the recently moved PGs would have reached that point.  I've also 
confirmed that the OSDs that are dying are not just stuck on an operation (as it 
looks with filestore splitting), but actually segfaulting and restarting.

On Thu, Mar 15, 2018 at 4:08 PM Dan van der Ster 
<d...@vanderster.com> wrote:
Hi,

Do you see any split or merge messages in the osd logs?
I recall some surprise filestore splitting on a few osds after the luminous 
upgrade.

.. Dan


On Mar 15, 2018 6:04 PM, "David Turner" 
<drakonst...@gmail.com> wrote:
I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last week I 
added 2 nodes to the cluster.  The backfilling has been ATROCIOUS.  I have OSDs 
consistently [2] segfaulting during recovery.  There's no pattern of which OSDs 
are segfaulting, which hosts have segfaulting OSDs, etc... It's all over the 
cluster.  I have been trying variants on the following settings with different 
levels of success, but I cannot eliminate the blocked requests and segfaulting 
OSDs: osd_heartbeat_grace, osd_max_backfills, osd_op_thread_suicide_timeout, 
osd_recovery_max_active, osd_recovery_sleep_hdd, osd_recovery_sleep_hybrid, 
osd_recovery_thread_timeout, and osd_scrub_during_recovery.  Short of setting 
nobackfill on the cluster, I can't stop OSDs from segfaulting during recovery.

Does anyone have any ideas for this?  I've been struggling with this for over a 
week now.  For the first couple days I rebalanced the cluster and had this 
exact same issue prior to adding new storage.  Even setting osd_max_backfills 
to 1 and recovery_sleep to 1.0, with everything else on defaults, doesn't help.
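(Concretely, that was along the lines of 
`ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_sleep_hdd 1.0'`.)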

Backfilling caused things to slow down on Jewel, but I wasn't having OSDs 
segfault multiple times per hour like I am on Luminous.  So many OSDs are going 
down that I had to set nodown to prevent potential data instability from OSDs on 
multiple hosts going up and down all the time.  That blocks IO for every OSD 
that dies, either until it comes back up or until I manually mark it down.  I 
hope someone has some ideas for me here.  Our plan moving forward is to only use 
half of the capacity of the drives, pretending they're 5TB instead of 10TB, to 
increase the spindle speed per TB.  Migrating to bluestore will hopefully help 
as well.
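(The nodown bit above is just `ceph osd set nodown`, plus a manual 
`ceph osd down <id>` whenever a crashed OSD stays marked up and blocks IO.)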


[1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on Intel DC 
P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB RAM.

[2]    -19> 2018-03-15 16:42:17.998074 7fe661601700  5 -- 10.130.115.25:6811/2942118 >> 10.130.115.48:0/372681 conn(0x55e3ea087000 :6811 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1920 cs=1 l=1). rx osd.254 seq 74507 0x55e3eb8e2e00 osd_ping(ping e93182 stamp 2018-03-15 16:42:17.990698) v4
   -18> 2018-03-15 16:42:17.998091 7fe661601700  1 -- 10.130.115.25:6811/2942118 <== osd.254 10.130.115.48:0/372681 74507 ==== osd_ping(ping e93182 stamp 2018-03-15 16:42:17.990698) v4 ==== 2004+0+0 (492539280 0 0) 0x55e3eb8e2e00 con 0x55e3ea087000
   -17> 2018-03-15 16:42:17.998109 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe639772700' had timed out after 60
   -16> 2018-03-15 16:42:17.998111 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe639f73700' had timed out after 60
   -15> 2018-03-15 16:42:17.998120 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63a774700' had timed out after 60
   -14> 2018-03-15 16:42:17.998123 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63af75700' had timed out after 60
   -13> 2018-03-15 16:42:17.998126 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63b776700' had timed out after 60
   -12> 2018-03-15 16:42:17.998129 7fe661601700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe654854700' had timed out after 60
   -11> 2018-03-15 16:42:18.004203 7fe661601700  5 -- 10.130.115.25:6811/2942118 >> 10.130.115.33:0/3348055 conn(0x55e3eb5f0000 :6811 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1894 cs=1 l=1). rx osd.169 seq 74633 0x55e3eb8e2e00 osd_ping(ping e93182 stamp 2018-03-15 16:42:17.998828) v4
   -10> 2018-03-15 16:42:18.004230 7fe661601700  1 -- 10.130.115.25:6811/2942118 <== osd.169 10.130.115.33:0/3348055 74633 ==== osd_ping(ping e93182 stamp 2018-03-15 16:42:17.998828) v4 ==== 2004+0+0 (2306332339 0 0) 0x55e3eb8e2e00 con 0x55e3eb5f0000
    -9> 2018-03-15 16:42:18.004241 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe639772700' had timed out after 60
    -8> 2018-03-15 16:42:18.004244 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe639f73700' had timed out after 60
    -7> 2018-03-15 16:42:18.004246 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63a774700' had timed out after 60
    -6> 2018-03-15 16:42:18.004248 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63af75700' had timed out after 60
    -5> 2018-03-15 16:42:18.004249 7fe661601700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63b776700' had timed out after 60
    -4> 2018-03-15 16:42:18.004251 7fe661601700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe654854700' had timed out after 60
    -3> 2018-03-15 16:42:18.004256 7fe661601700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe654854700' had suicide timed out after 180
    -2> 2018-03-15 16:42:18.004462 7fe6605ff700  5 -- 10.130.113.25:6811/2942118 >> 10.130.113.33:0/3348055 conn(0x55e3eb599800 :6811 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1937 cs=1 l=1). rx osd.169 seq 74633 0x55e3eef6d200 osd_ping(ping e93182 stamp 2018-03-15 16:42:17.998828) v4
    -1> 2018-03-15 16:42:18.004502 7fe6605ff700  1 -- 10.130.113.25:6811/2942118 <== osd.169 10.130.113.33:0/3348055 74633 ==== osd_ping(ping e93182 stamp 2018-03-15 16:42:17.998828) v4 ==== 2004+0+0 (2306332339 0 0) 0x55e3eef6d200 con 0x55e3eb599800
     0> 2018-03-15 16:42:18.015185 7fe654854700 -1 *** Caught signal (Aborted) **
 in thread 7fe654854700 thread_name:tp_fstore_op

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
