You may not want to set your heartbeat grace that high; it will make I/O block for a long time in the case of a real failure. You may want to look at increasing the number of down reporters instead.
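
As a rough sketch, that could look something like the following in ceph.conf. The exact option name and its default vary by release, so treat this as an assumption to verify against the docs for your version:

    [mon]
    # Require reports from more distinct OSDs before marking a peer down,
    # rather than stretching the heartbeat grace period (example value).
    mon osd min down reporters = 3

The monitors need a restart (or runtime injection) for a change like this to take effect.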
Robert LeBlanc

Sent from a mobile device, please excuse any typos.

On Jul 2, 2015 9:39 PM, "Tuomas Juntunen" <tuomas.juntu...@databasement.fi> wrote:

> Just reporting back on my findings.
>
> After making these changes the flapping occurred just once during the
> night. To fix it further I changed the heartbeat grace to 120 secs. I also
> matched osd_op_threads and filestore_op_threads to the core count.
>
> Br, T
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas Juntunen
> Sent: 2 July 2015 16:23
> To: 'Somnath Roy'; 'ceph-users'
> Subject: Re: [ceph-users] One of our nodes has logs saying: wrongly marked me down
>
> Thanks
>
> I'll test these values, and also raise the osd heartbeat grace from 20 to
> 60 seconds; hopefully that will help with the latency during deep scrub.
>
> I changed shards to 6 and shard threads to 2, so it matches the physical
> cores on the server, not counting hyperthreading.
>
> Br, T
>
>
> From: Somnath Roy [mailto:somnath....@sandisk.com]
> Sent: 2 July 2015 6:29
> To: Tuomas Juntunen; 'ceph-users'
> Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked me down
>
> Yeah, this can happen during deep scrub and also during rebalancing; I
> forgot to mention that. Generally, it is a good idea to throttle those.
> For deep scrub, you can try the following (got it from an old post, I
> never used it):
>
> osd_scrub_chunk_min = 1
> osd_scrub_chunk_max = 1
> osd_scrub_sleep = 0.1
>
> For rebalancing I think you are already using proper values.
>
> I don't think this will eliminate the scenario altogether, but it should
> alleviate it a bit.
>
> Also, why are you using so many shards? How many OSDs are you running in
> a box? 25 shards should be fine if you are running a single OSD; if you
> have a lot of OSDs in a box, try reducing it to ~5 or so.
>
> Thanks & Regards
> Somnath
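
Those scrub throttles can usually be injected at runtime as well, so they can be tried without restarting every OSD. A sketch, assuming an admin node with a working client keyring; injectargs behaviour differs somewhat between releases:

    # Apply the scrub throttles to all running OSDs without a restart
    ceph tell osd.* injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 1 --osd_scrub_sleep 0.1'

Injected values do not survive a daemon restart, so anything that helps should be persisted in ceph.conf as well.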
> From: Tuomas Juntunen [mailto:tuomas.juntu...@databasement.fi]
> Sent: Wednesday, July 01, 2015 8:18 PM
> To: Somnath Roy; 'ceph-users'
> Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked me down
>
> I've checked the network. We use IPoIB and all nodes are connected to the
> same switch; there are no breaks in connectivity while this happens. A
> constant ping shows 0.03-0.1 ms, which I would say is fine.
>
> This happens almost every time deep scrubbing runs. Load on this
> particular server goes to 300+ and its OSDs are marked down.
>
> Any suggestions on settings? I now have the following settings that might
> affect this:
>
> [global]
> osd_op_threads = 6
> osd_op_num_threads_per_shard = 1
> osd_op_num_shards = 25
> #osd_op_num_sharded_pool_threads = 25
> filestore_op_threads = 6
> ms_nocrc = true
> filestore_fd_cache_size = 64
> filestore_fd_cache_shards = 32
> ms_dispatch_throttle_bytes = 0
> throttler_perf_counter = false
>
> [osd]
> osd scrub load threshold = 0.1
> osd max backfills = 1
> osd recovery max active = 1
> osd scrub sleep = .1
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 7
> osd scrub chunk max = 5
> osd deep scrub stride = 1048576
> filestore queue max ops = 10000
> filestore max sync interval = 30
> filestore min sync interval = 29
> osd_client_message_size_cap = 0
> osd_client_message_cap = 0
> osd_enable_op_tracker = false
>
> Br, T
>
>
> From: Somnath Roy [mailto:somnath....@sandisk.com]
> Sent: 2 July 2015 0:30
> To: Tuomas Juntunen; 'ceph-users'
> Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked me down
>
> This can happen if your OSDs are flapping. Hope your network is stable.
>
> Thanks & Regards
> Somnath
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas Juntunen
> Sent: Wednesday, July 01, 2015 2:24 PM
> To: 'ceph-users'
> Subject: [ceph-users] One of our nodes has logs saying: wrongly marked me down
>
> Hi
>
> One of our nodes has OSD logs that say "wrongly marked me down" for every
> OSD at some point. What could be the reason for this? Does anyone have
> similar experiences?
>
> The other nodes work totally fine and they are all identical.
>
> Br, T
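
When chasing this kind of flapping, it helps to confirm what each OSD is actually running with, since injected and configured values can drift apart. A sketch via the admin socket, assuming osd.0 runs on the local host with default socket paths:

    # Show the effective values on a running OSD (run on that OSD's host)
    ceph daemon osd.0 config get osd_heartbeat_grace
    ceph daemon osd.0 config get osd_scrub_sleep

Watching the cluster log while a deep scrub runs can also confirm the correlation:

    ceph -w | grep -Ei 'scrub|wrongly marked'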