Re: [ceph-users] cluster is not stable
On Thu, 14 Mar 2019 at 17:00, Zhenshi Zhou wrote:

> I think I've found the root cause which makes the monmap contain no
> features. As I moved the servers from one place to another, I modified
> the monmap once.

If this was the empty cluster that you refused to redo from scratch, then
I feel it might be right to quote myself from the discussion before the
move:

--
If the cluster is clean I see no reason for doing brain surgery on
monmaps just to "save" a few minutes of redoing correctly from scratch.
*What if you miss some part, some command gives you an error you really
aren't comfortable with, something doesn't really feel right after doing
it, then the whole lifetime of that cluster will be followed by a small
nagging feeling* that it might have been that time you followed a guide
that tries to talk you out of doing it that way, for a cluster with no
data.
--

I think the part in bold is *exactly* what happened to you now. You did
something quite far out of the ordinary which was doable, but recommended
against, and somehow some part not anticipated or covered in the "blindly
type these commands into your ceph" guide occurred.

From this point on, you _will_ know that your cluster is not 100% like
everyone else's, and any future errors and crashes just _might_ come from
it being different in a way no one has ever tested before. Some bit
unset, some string left uninitialised, some value left untouched that
never could be like that if done right.

If you have little data in it now, I would still recommend moving the
data elsewhere and setting it up correctly.

--
May the most significant bit of your life be positive.
Re: [ceph-users] cluster is not stable
Hi huang,

I think I've found the root cause which makes the monmap contain no
features. When I moved the servers from one place to another, I modified
the monmap once. However, the monmap was not the same on all mons: I
modified the monmap on one of the mons, and created it from scratch on
the other two mons for convenience (ssh is disabled among the servers,
and I didn't want to do the modification 3 times). As a result, the
epoch number was not the same across the 3 mons. I think this is the
root cause.

I have transferred the monmap dumped from the leader mon to the other
two mons with the nc command, and injected it into each mon (a sketch of
the equivalent commands is appended after this message). The mon
features have recovered now. After watching the cluster status for a
period of time, there are no more "mark-down" operations on the osds.

# ceph mon feature ls
all features
    supported: [kraken,luminous,mimic,osdmap-prune]
    persistent: [kraken,luminous,mimic,osdmap-prune]
on current monmap (epoch 3)
    persistent: [kraken,luminous,mimic,osdmap-prune]
    required: [kraken,luminous,mimic,osdmap-prune]

Thanks for all your help, guys :)

Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 3:20 PM:
> Hi,
>
> I'll try that command soon.
>
> It's a new cluster installed with mimic. I'm not sure of the exact
> reason, but as far as I can think of, 2 things may have caused this
> issue. One is that I moved these servers from a datacenter to this
> one, following the steps in [1]. Another is that I created a bridge on
> the interface that the ceph connection uses.
>
> [1] http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
>
> Thanks
>
> huang jun wrote on Thu, Mar 14, 2019 at 3:04 PM:
>> You can try those commands, but maybe you need to find the root cause
>> of why the current monmap contains no features at all. Did you
>> upgrade the cluster from luminous to mimic, or is it a new cluster
>> installed with mimic?
>>
>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 2:37 PM:
>>> Hi huang,
>>>
>>> It's a pre-production environment. If everything is fine, I'll use
>>> it for production.
>>>
>>> My cluster is version mimic; should I set all the features you
>>> listed in the command?
>>>
>>> Thanks
>>>
>>> huang jun wrote on Thu, Mar 14, 2019 at 2:11 PM:
>>>> sorry, the script should be
>>>> for f in kraken luminous mimic osdmap-prune; do
>>>>     ceph mon feature set $f --yes-i-really-mean-it
>>>> done
>>>>
>>>> huang jun wrote on Thu, Mar 14, 2019 at 2:04 PM:
>>>>> ok, if this is a **test environment**, you can try
>>>>> for f in 'kraken,luminous,mimic,osdmap-prune'; do
>>>>>     ceph mon feature set $f --yes-i-really-mean-it
>>>>> done
>>>>>
>>>>> If it is a production environment, you should evaluate the risk
>>>>> first, and maybe set up a test cluster to test on first.
>>>>>
>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 1:56 PM:
>>>>>> # ceph mon feature ls
>>>>>> all features
>>>>>>     supported: [kraken,luminous,mimic,osdmap-prune]
>>>>>>     persistent: [kraken,luminous,mimic,osdmap-prune]
>>>>>> on current monmap (epoch 2)
>>>>>>     persistent: [none]
>>>>>>     required: [none]
>>>>>>
>>>>>> huang jun wrote on Thu, Mar 14, 2019 at 1:50 PM:
>>>>>>> what's the output of 'ceph mon feature ls'?
>>>>>>>
>>>>>>> from the code, maybe the mon features don't contain luminous:
>>>>>>> [...]
>>>>>>>
>>>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 12:43 PM:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> One of the logs says the beacon is not sending, as below:
>>>>>>>> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
>>>>>>>> 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 can_inc_scr [...]
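For anyone hitting the same symptom, the recovery described above maps
onto the standard monmap tooling. A minimal sketch, assuming
systemd-managed mons; <id> and <receiver-host> are placeholders, and
each target mon must be stopped before injecting:

    # On the leader mon: dump the current monmap and sanity-check it.
    ceph mon getmap -o /tmp/monmap
    monmaptool --print /tmp/monmap    # check the epoch and the mon list

    # Transfer it without ssh, e.g. with nc:
    #   on each receiving mon host:  nc -l 9999 > /tmp/monmap
    #   on the leader mon host:      nc <receiver-host> 9999 < /tmp/monmap

    # On each of the other two mons, inject the map with the daemon stopped:
    systemctl stop ceph-mon@<id>
    ceph-mon -i <id> --inject-monmap /tmp/monmap
    systemctl start ceph-mon@<id>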
Re: [ceph-users] cluster is not stable
Hi,

I'll try that command soon.

It's a new cluster installed with mimic. I'm not sure of the exact
reason, but as far as I can think of, 2 things may have caused this
issue. One is that I moved these servers from a datacenter to this one,
following the steps in [1] (see the sketch of those steps appended after
this message). Another is that I created a bridge on the interface that
the ceph connection uses.

[1] http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way

Thanks

huang jun wrote on Thu, Mar 14, 2019 at 3:04 PM:
> You can try those commands, but maybe you need to find the root cause
> of why the current monmap contains no features at all. Did you upgrade
> the cluster from luminous to mimic, or is it a new cluster installed
> with mimic?
>
> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 2:37 PM:
>> Hi huang,
>>
>> It's a pre-production environment. If everything is fine, I'll use it
>> for production.
>>
>> My cluster is version mimic; should I set all the features you listed
>> in the command?
>>
>> Thanks
>>
>> huang jun wrote on Thu, Mar 14, 2019 at 2:11 PM:
>>> sorry, the script should be
>>> for f in kraken luminous mimic osdmap-prune; do
>>>     ceph mon feature set $f --yes-i-really-mean-it
>>> done
>>>
>>> huang jun wrote on Thu, Mar 14, 2019 at 2:04 PM:
>>>> ok, if this is a **test environment**, you can try
>>>> for f in 'kraken,luminous,mimic,osdmap-prune'; do
>>>>     ceph mon feature set $f --yes-i-really-mean-it
>>>> done
>>>>
>>>> If it is a production environment, you should evaluate the risk
>>>> first, and maybe set up a test cluster to test on first.
>>>>
>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 1:56 PM:
>>>>> # ceph mon feature ls
>>>>> all features
>>>>>     supported: [kraken,luminous,mimic,osdmap-prune]
>>>>>     persistent: [kraken,luminous,mimic,osdmap-prune]
>>>>> on current monmap (epoch 2)
>>>>>     persistent: [none]
>>>>>     required: [none]
>>>>>
>>>>> huang jun wrote on Thu, Mar 14, 2019 at 1:50 PM:
>>>>>> what's the output of 'ceph mon feature ls'?
>>>>>>
>>>>>> from the code, maybe the mon features don't contain luminous:
>>>>>> [...]
>>>>>>
>>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 12:43 PM:
>>>>>>> Hi,
>>>>>>>
>>>>>>> One of the logs says the beacon is not sending, as below:
>>>>>>> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
>>>>>>> [...]
>>>>>>>
>>>>>>> huang jun wrote on Thu, Mar 14, 2019 at 12:30 PM:
>>>>>>>> osd will [...]
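Regarding the "messy way" linked as [1] above: per that document, it
amounts to editing the monmap by hand and injecting the same edited map
into every mon. A rough sketch (the mon name and address below are
placeholders); note that the divergence found later in this thread came
precisely from not applying one identical map to all three mons:

    # With the mons stopped, pull the map from one of them and edit it:
    ceph-mon -i mon-a --extract-monmap /tmp/monmap
    monmaptool --rm mon-a /tmp/monmap
    monmaptool --add mon-a 10.39.0.34:6789 /tmp/monmap

    # Inject the SAME edited map into every mon, then restart them all:
    ceph-mon -i mon-a --inject-monmap /tmp/monmap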
Re: [ceph-users] cluster is not stable
You can try those commands, but maybe you need to find the root cause of
why the current monmap contains no features at all. Did you upgrade the
cluster from luminous to mimic, or is it a new cluster installed with
mimic?

Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 2:37 PM:
> Hi huang,
>
> It's a pre-production environment. If everything is fine, I'll use it
> for production.
>
> My cluster is version mimic; should I set all the features you listed
> in the command?
>
> Thanks
>
> huang jun wrote on Thu, Mar 14, 2019 at 2:11 PM:
>> sorry, the script should be
>> for f in kraken luminous mimic osdmap-prune; do
>>     ceph mon feature set $f --yes-i-really-mean-it
>> done
>>
>> huang jun wrote on Thu, Mar 14, 2019 at 2:04 PM:
>>> ok, if this is a **test environment**, you can try
>>> for f in 'kraken,luminous,mimic,osdmap-prune'; do
>>>     ceph mon feature set $f --yes-i-really-mean-it
>>> done
>>>
>>> If it is a production environment, you should evaluate the risk
>>> first, and maybe set up a test cluster to test on first.
>>>
>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 1:56 PM:
>>>> # ceph mon feature ls
>>>> all features
>>>>     supported: [kraken,luminous,mimic,osdmap-prune]
>>>>     persistent: [kraken,luminous,mimic,osdmap-prune]
>>>> on current monmap (epoch 2)
>>>>     persistent: [none]
>>>>     required: [none]
>>>>
>>>> huang jun wrote on Thu, Mar 14, 2019 at 1:50 PM:
>>>>> what's the output of 'ceph mon feature ls'?
>>>>>
>>>>> from the code, maybe the mon features don't contain luminous:
>>>>> [...]
>>>>>
>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 12:43 PM:
>>>>>> Hi,
>>>>>>
>>>>>> One of the logs says the beacon is not sending, as below:
>>>>>> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
>>>>>> [...]
>>>>>>
>>>>>> huang jun wrote on Thu, Mar 14, 2019 at 12:30 PM:
>>>>>>> osd will not send beacons to mon if it's not in ACTIVE state, so
>>>>>>> you may turn on one osd's debug_osd=20 to see what is going on
>>>>>>>
>>>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 11:07 AM:
>>>>>>>> What's more, I find that the osds don't send beacons all the
>>>>>>>> time; some osds send beacons for a period of time and then stop
>>>>>>>> sending beacons.
>>>>>>>>
>>>>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 10:57 AM:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I set the config on every osd and checked whether all osds send [...]
Re: [ceph-users] cluster is not stable
Hi huang,

It's a pre-production environment. If everything is fine, I'll use it
for production.

My cluster is version mimic; should I set all the features you listed in
the command?

Thanks

huang jun wrote on Thu, Mar 14, 2019 at 2:11 PM:
> sorry, the script should be
> for f in kraken luminous mimic osdmap-prune; do
>     ceph mon feature set $f --yes-i-really-mean-it
> done
>
> huang jun wrote on Thu, Mar 14, 2019 at 2:04 PM:
>> ok, if this is a **test environment**, you can try
>> for f in 'kraken,luminous,mimic,osdmap-prune'; do
>>     ceph mon feature set $f --yes-i-really-mean-it
>> done
>>
>> If it is a production environment, you should evaluate the risk
>> first, and maybe set up a test cluster to test on first.
>>
>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 1:56 PM:
>>> # ceph mon feature ls
>>> all features
>>>     supported: [kraken,luminous,mimic,osdmap-prune]
>>>     persistent: [kraken,luminous,mimic,osdmap-prune]
>>> on current monmap (epoch 2)
>>>     persistent: [none]
>>>     required: [none]
>>>
>>> huang jun wrote on Thu, Mar 14, 2019 at 1:50 PM:
>>>> what's the output of 'ceph mon feature ls'?
>>>>
>>>> from the code, maybe the mon features don't contain luminous:
>>>> [...]
>>>>
>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 12:43 PM:
>>>>> Hi,
>>>>>
>>>>> One of the logs says the beacon is not sending, as below:
>>>>> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
>>>>> [...]
>>>>>
>>>>> huang jun wrote on Thu, Mar 14, 2019 at 12:30 PM:
>>>>>> osd will not send beacons to mon if it's not in ACTIVE state, so
>>>>>> you may turn on one osd's debug_osd=20 to see what is going on
>>>>>>
>>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 11:07 AM:
>>>>>>> What's more, I find that the osds don't send beacons all the
>>>>>>> time; some osds send beacons for a period of time and then stop
>>>>>>> sending beacons.
>>>>>>>
>>>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 10:57 AM:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I set the config on every osd and checked whether all osds send
>>>>>>>> beacons to the monitors.
>>>>>>>>
>>>>>>>> The result shows that only part of the osds send beacons, and
>>>>>>>> the monitor receives all the beacons that the osds send out.
>>>>>>>>
>>>>>>>> But why don't some osds send beacons?
>>>>>>>>
>>>>>>>> huang jun wrote on Wed, Mar 13, 2019 at 11:02 PM:
>>>>>>>>> sorry for not making it clear: you may need to set one of your
>>>>>>>>> osd's osd_beacon_report_interval = 5 and debug_ms = 1 and then
>>>>>>>>> restart the [...]
Re: [ceph-users] cluster is not stable
sorry, the script should be
for f in kraken luminous mimic osdmap-prune; do
    ceph mon feature set $f --yes-i-really-mean-it
done

(A snippet to verify the result is appended after this message.)

huang jun wrote on Thu, Mar 14, 2019 at 2:04 PM:
> ok, if this is a **test environment**, you can try
> for f in 'kraken,luminous,mimic,osdmap-prune'; do
>     ceph mon feature set $f --yes-i-really-mean-it
> done
>
> If it is a production environment, you should evaluate the risk first,
> and maybe set up a test cluster to test on first.
>
> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 1:56 PM:
>> # ceph mon feature ls
>> all features
>>     supported: [kraken,luminous,mimic,osdmap-prune]
>>     persistent: [kraken,luminous,mimic,osdmap-prune]
>> on current monmap (epoch 2)
>>     persistent: [none]
>>     required: [none]
>>
>> huang jun wrote on Thu, Mar 14, 2019 at 1:50 PM:
>>> what's the output of 'ceph mon feature ls'?
>>>
>>> from the code, maybe the mon features don't contain luminous:
>>> [...]
>>>
>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 12:43 PM:
>>>> Hi,
>>>>
>>>> One of the logs says the beacon is not sending, as below:
>>>> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
>>>> [...]
>>>>
>>>> huang jun wrote on Thu, Mar 14, 2019 at 12:30 PM:
>>>>> osd will not send beacons to mon if it's not in ACTIVE state, so
>>>>> you may turn on one osd's debug_osd=20 to see what is going on
>>>>>
>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 11:07 AM:
>>>>>> What's more, I find that the osds don't send beacons all the
>>>>>> time; some osds send beacons for a period of time and then stop
>>>>>> sending beacons.
>>>>>>
>>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 10:57 AM:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I set the config on every osd and checked whether all osds send
>>>>>>> beacons to the monitors.
>>>>>>>
>>>>>>> The result shows that only part of the osds send beacons, and
>>>>>>> the monitor receives all the beacons that the osds send out.
>>>>>>>
>>>>>>> But why don't some osds send beacons?
>>>>>>>
>>>>>>> huang jun wrote on Wed, Mar 13, 2019 at 11:02 PM:
>>>>>>>> sorry for not making it clear: you may need to set one of your
>>>>>>>> osd's osd_beacon_report_interval = 5 and debug_ms = 1 and then
>>>>>>>> restart the osd process, then check the osd log with
>>>>>>>> 'grep beacon /var/log/ceph/ceph-osd.$id.log' to make sure the
>>>>>>>> osd sends beacons to the mon. If the osd does send beacons to
>>>>>>>> the mon, you should also turn on debug_ms = 1 on the leader mon
>>>>>>>> and restart the mon process, then check the mon log to make
>>>>>>>> sure the mon received the osd beacons.
>>>>>>>>
>>>>>>>> Zhenshi Zhou wrote on [...]
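After running the corrected loop, the result can be verified with the
same command used earlier in the thread; the [none] entries on the
current monmap should be gone:

    ceph mon feature ls
    # "on current monmap" should now show
    #     persistent: [kraken,luminous,mimic,osdmap-prune]
    #     required:   [kraken,luminous,mimic,osdmap-prune]
    # rather than persistent: [none] / required: [none]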
Re: [ceph-users] cluster is not stable
ok, if this is a **test environment**, you can try
for f in 'kraken,luminous,mimic,osdmap-prune'; do
    ceph mon feature set $f --yes-i-really-mean-it
done

If it is a production environment, you should evaluate the risk first,
and maybe set up a test cluster to test on first.

Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 1:56 PM:
> # ceph mon feature ls
> all features
>     supported: [kraken,luminous,mimic,osdmap-prune]
>     persistent: [kraken,luminous,mimic,osdmap-prune]
> on current monmap (epoch 2)
>     persistent: [none]
>     required: [none]
>
> huang jun wrote on Thu, Mar 14, 2019 at 1:50 PM:
>> what's the output of 'ceph mon feature ls'?
>>
>> from the code, maybe the mon features don't contain luminous:
>> [...]
>>
>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 12:43 PM:
>>> Hi,
>>>
>>> One of the logs says the beacon is not sending, as below:
>>> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
>>> [...]
>>>
>>> huang jun wrote on Thu, Mar 14, 2019 at 12:30 PM:
>>>> osd will not send beacons to mon if it's not in ACTIVE state, so
>>>> you may turn on one osd's debug_osd=20 to see what is going on
>>>>
>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 11:07 AM:
>>>>> What's more, I find that the osds don't send beacons all the time;
>>>>> some osds send beacons for a period of time and then stop sending
>>>>> beacons.
>>>>>
>>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 10:57 AM:
>>>>>> Hi,
>>>>>>
>>>>>> I set the config on every osd and checked whether all osds send
>>>>>> beacons to the monitors.
>>>>>>
>>>>>> The result shows that only part of the osds send beacons, and the
>>>>>> monitor receives all the beacons that the osds send out.
>>>>>>
>>>>>> But why don't some osds send beacons?
>>>>>>
>>>>>> huang jun wrote on Wed, Mar 13, 2019 at 11:02 PM:
>>>>>>> sorry for not making it clear: you may need to set one of your
>>>>>>> osd's osd_beacon_report_interval = 5 and debug_ms = 1 and then
>>>>>>> restart the osd process, then check the osd log with
>>>>>>> 'grep beacon /var/log/ceph/ceph-osd.$id.log' to make sure the
>>>>>>> osd sends beacons to the mon. If the osd does send beacons to
>>>>>>> the mon, you should also turn on debug_ms = 1 on the leader mon
>>>>>>> and restart the mon process, then check the mon log to make sure
>>>>>>> the mon received the osd beacons.
>>>>>>>
>>>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 8:20 PM:
>>>>>>>> And now, new errors are appearing..
>>>>>>>>
>>>>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 2:58 PM:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I hadn't set osd_beacon_report_interval, so it must have been
>>>>>>>>> the default value. I have set osd_beacon_report_interval to 60
>>>>>>>>> and debug_mon to 10.
>>>>>>>>>
>>>>>>>>> The attachment is the leader monitor log, [...]
Re: [ceph-users] cluster is not stable
# ceph mon feature ls
all features
    supported: [kraken,luminous,mimic,osdmap-prune]
    persistent: [kraken,luminous,mimic,osdmap-prune]
on current monmap (epoch 2)
    persistent: [none]
    required: [none]

huang jun wrote on Thu, Mar 14, 2019 at 1:50 PM:
> what's the output of 'ceph mon feature ls'?
>
> from the code, maybe the mon features don't contain luminous:
> [...]
>
> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 12:43 PM:
>> Hi,
>>
>> One of the logs says the beacon is not sending, as below:
>> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
>> [...]
>>
>> huang jun wrote on Thu, Mar 14, 2019 at 12:30 PM:
>>> osd will not send beacons to mon if it's not in ACTIVE state, so you
>>> may turn on one osd's debug_osd=20 to see what is going on
>>>
>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 11:07 AM:
>>>> What's more, I find that the osds don't send beacons all the time;
>>>> some osds send beacons for a period of time and then stop sending
>>>> beacons.
>>>>
>>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 10:57 AM:
>>>>> Hi,
>>>>>
>>>>> I set the config on every osd and checked whether all osds send
>>>>> beacons to the monitors.
>>>>>
>>>>> The result shows that only part of the osds send beacons, and the
>>>>> monitor receives all the beacons that the osds send out.
>>>>>
>>>>> But why don't some osds send beacons?
>>>>>
>>>>> huang jun wrote on Wed, Mar 13, 2019 at 11:02 PM:
>>>>>> sorry for not making it clear: you may need to set one of your
>>>>>> osd's osd_beacon_report_interval = 5 and debug_ms = 1 and then
>>>>>> restart the osd process, then check the osd log with
>>>>>> 'grep beacon /var/log/ceph/ceph-osd.$id.log' to make sure the osd
>>>>>> sends beacons to the mon. If the osd does send beacons to the
>>>>>> mon, you should also turn on debug_ms = 1 on the leader mon and
>>>>>> restart the mon process, then check the mon log to make sure the
>>>>>> mon received the osd beacons.
>>>>>>
>>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 8:20 PM:
>>>>>>> And now, new errors are appearing..
>>>>>>>
>>>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 2:58 PM:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I hadn't set osd_beacon_report_interval, so it must have been
>>>>>>>> the default value. I have set osd_beacon_report_interval to 60
>>>>>>>> and debug_mon to 10.
>>>>>>>>
>>>>>>>> The attachment is the leader monitor log; the "mark-down"
>>>>>>>> operations are at 14:22.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> huang jun wrote on Wed, Mar 13, 2019 at 2:07 PM:
>>>>>>>>> can you get the value of the osd_beacon_report_interval item?
>>>>>>>>> the default is 300; you can set it to 60, or maybe turn on
>>>>>>>>> debug_ms=1 debug_mon=10 to get more info.
>>>>>>>>>
>>>>>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 1:20 PM:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> The servers are connected [...]
Re: [ceph-users] cluster is not stable
what's the output of 'ceph mon feature ls'?

from the code, maybe the mon features don't contain luminous:

6263 void OSD::send_beacon(const ceph::coarse_mono_clock::time_point& now)
6264 {
6265   const auto& monmap = monc->monmap;
6266   // send beacon to mon even if we are just connected, and the monmap is not
6267   // initialized yet by then.
6268   if (monmap.epoch > 0 &&
6269       monmap.get_required_features().contains_all(
6270         ceph::features::mon::FEATURE_LUMINOUS)) {
6271     dout(20) << __func__ << " sending" << dendl;
6272     MOSDBeacon* beacon = nullptr;
6273     {
6274       std::lock_guard l{min_last_epoch_clean_lock};
6275       beacon = new MOSDBeacon(osdmap->get_epoch(), min_last_epoch_clean);
6276       std::swap(beacon->pgs, min_last_epoch_clean_pgs);
6277       last_sent_beacon = now;
6278     }
6279     monc->send_mon_message(beacon);
6280   } else {
6281     dout(20) << __func__ << " not sending" << dendl;
6282   }
6283 }

Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 12:43 PM:
> Hi,
>
> One of the logs says the beacon is not sending, as below:
> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
> [...]
>
> huang jun wrote on Thu, Mar 14, 2019 at 12:30 PM:
>> osd will not send beacons to mon if it's not in ACTIVE state, so you
>> may turn on one osd's debug_osd=20 to see what is going on
>>
>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 11:07 AM:
>>> What's more, I find that the osds don't send beacons all the time;
>>> some osds send beacons for a period of time and then stop sending
>>> beacons.
>>>
>>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 10:57 AM:
>>>> Hi,
>>>>
>>>> I set the config on every osd and checked whether all osds send
>>>> beacons to the monitors.
>>>>
>>>> The result shows that only part of the osds send beacons, and the
>>>> monitor receives all the beacons that the osds send out.
>>>>
>>>> But why don't some osds send beacons?
>>>>
>>>> huang jun wrote on Wed, Mar 13, 2019 at 11:02 PM:
>>>>> sorry for not making it clear: you may need to set one of your
>>>>> osd's osd_beacon_report_interval = 5 and debug_ms = 1 and then
>>>>> restart the osd process, then check the osd log with
>>>>> 'grep beacon /var/log/ceph/ceph-osd.$id.log' to make sure the osd
>>>>> sends beacons to the mon. If the osd does send beacons to the mon,
>>>>> you should also turn on debug_ms = 1 on the leader mon and restart
>>>>> the mon process, then check the mon log to make sure the mon
>>>>> received the osd beacons.
>>>>>
>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 8:20 PM:
>>>>>> And now, new errors are appearing..
>>>>>>
>>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 2:58 PM:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I hadn't set osd_beacon_report_interval, so it must have been
>>>>>>> the default value. I have set osd_beacon_report_interval to 60
>>>>>>> and debug_mon to 10.
>>>>>>>
>>>>>>> The attachment is the leader monitor log; the "mark-down"
>>>>>>> operations are at 14:22.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> huang jun wrote on Wed, Mar 13, 2019 at 2:07 PM:
>>>>>>>> can you get the value of the osd_beacon_report_interval item?
>>>>>>>> the default is 300; you can set it to 60, or maybe turn on
>>>>>>>> debug_ms=1 debug_mon=10 to get more info.
>>>>>>>>
>>>>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 1:20 PM:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> The servers are connected to the same switch. I can ping from
>>>>>>>>> any one of the servers to the other servers without a packet
>>>>>>>>> lost, and the average round-trip time is under 0.1 ms.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Ashley Merrick wrote on Wed, Mar 13, 2019 at 12:06 PM:
>>>>>>>>>> Can you ping all your OSD servers from all your mons, and
>>>>>>>>>> ping your mons from all your OSD servers?
>>>>>>>>>>
>>>>>>>>>> I've seen this where a [...]
Re: [ceph-users] cluster is not stable
Hi,

One of the logs says the beacon is not sending, as below:
2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 scrub_time_permit should run between 0 - 24 now 12 = yes
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 scrub_load_below_threshold loadavg per cpu 0 < max 0.5 = yes
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub load_is_low=1
2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 sched_scrub 1.79 scheduled at 2019-03-14 13:17:51.290050 > 2019-03-14 12:41:15.723848
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub done
2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target 25 obj/sec or 5 MiB/sec
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 promote_throttle_recalibrate new_prob 1000
2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon not sending

huang jun wrote on Thu, Mar 14, 2019 at 12:30 PM:
> osd will not send beacons to mon if it's not in ACTIVE state, so you
> may turn on one osd's debug_osd=20 to see what is going on
>
> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 11:07 AM:
>> What's more, I find that the osds don't send beacons all the time;
>> some osds send beacons for a period of time and then stop sending
>> beacons.
>>
>> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 10:57 AM:
>>> Hi,
>>>
>>> I set the config on every osd and checked whether all osds send
>>> beacons to the monitors.
>>>
>>> The result shows that only part of the osds send beacons, and the
>>> monitor receives all the beacons that the osds send out.
>>>
>>> But why don't some osds send beacons?
>>>
>>> huang jun wrote on Wed, Mar 13, 2019 at 11:02 PM:
>>>> sorry for not making it clear: you may need to set one of your
>>>> osd's osd_beacon_report_interval = 5 and debug_ms = 1 and then
>>>> restart the osd process, then check the osd log with
>>>> 'grep beacon /var/log/ceph/ceph-osd.$id.log' to make sure the osd
>>>> sends beacons to the mon. If the osd does send beacons to the mon,
>>>> you should also turn on debug_ms = 1 on the leader mon and restart
>>>> the mon process, then check the mon log to make sure the mon
>>>> received the osd beacons.
>>>>
>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 8:20 PM:
>>>>> And now, new errors are appearing..
>>>>>
>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 2:58 PM:
>>>>>> Hi,
>>>>>>
>>>>>> I hadn't set osd_beacon_report_interval, so it must have been the
>>>>>> default value. I have set osd_beacon_report_interval to 60 and
>>>>>> debug_mon to 10.
>>>>>>
>>>>>> The attachment is the leader monitor log; the "mark-down"
>>>>>> operations are at 14:22.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> huang jun wrote on Wed, Mar 13, 2019 at 2:07 PM:
>>>>>>> can you get the value of the osd_beacon_report_interval item?
>>>>>>> the default is 300; you can set it to 60, or maybe turn on
>>>>>>> debug_ms=1 debug_mon=10 to get more info.
>>>>>>>
>>>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 1:20 PM:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> The servers are connected to the same switch. I can ping from
>>>>>>>> any one of the servers to the other servers without a packet
>>>>>>>> lost, and the average round-trip time is under 0.1 ms.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Ashley Merrick wrote on Wed, Mar 13, 2019 at 12:06 PM:
>>>>>>>>> Can you ping all your OSD servers from all your mons, and ping
>>>>>>>>> your mons from all your OSD servers?
>>>>>>>>>
>>>>>>>>> I've seen this where a route wasn't working in one direction,
>>>>>>>>> so it made OSDs flap when it used that mon to check
>>>>>>>>> availability:
>>>>>>>>>
>>>>>>>>> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou <deader...@gmail.com> wrote:
>>>>>>>>>> After checking the network and syslog/dmesg, I think it's not
>>>>>>>>>> a network or hardware issue. Now there are some osds being
>>>>>>>>>> marked down every 15 minutes.
>>>>>>>>>>
>>>>>>>>>> here is ceph.log:
>>>>>>>>>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 : cluster [INF] Cluster is now healthy
>>>>>>>>>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 : cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
>>>>>>>>>> [...]
Re: [ceph-users] cluster is not stable
osd will not send beacons to mon if it's not in ACTIVE state, so you may
turn on one osd's debug_osd=20 to see what is going on (see the snippet
appended after this message).

Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 11:07 AM:
> What's more, I find that the osds don't send beacons all the time;
> some osds send beacons for a period of time and then stop sending
> beacons.
>
> Zhenshi Zhou wrote on Thu, Mar 14, 2019 at 10:57 AM:
>> Hi,
>>
>> I set the config on every osd and checked whether all osds send
>> beacons to the monitors.
>>
>> The result shows that only part of the osds send beacons, and the
>> monitor receives all the beacons that the osds send out.
>>
>> But why don't some osds send beacons?
>>
>> huang jun wrote on Wed, Mar 13, 2019 at 11:02 PM:
>>> sorry for not making it clear: you may need to set one of your osd's
>>> osd_beacon_report_interval = 5 and debug_ms = 1 and then restart the
>>> osd process, then check the osd log with
>>> 'grep beacon /var/log/ceph/ceph-osd.$id.log' to make sure the osd
>>> sends beacons to the mon. If the osd does send beacons to the mon,
>>> you should also turn on debug_ms = 1 on the leader mon and restart
>>> the mon process, then check the mon log to make sure the mon
>>> received the osd beacons.
>>>
>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 8:20 PM:
>>>> And now, new errors are appearing..
>>>>
>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 2:58 PM:
>>>>> Hi,
>>>>>
>>>>> I hadn't set osd_beacon_report_interval, so it must have been the
>>>>> default value. I have set osd_beacon_report_interval to 60 and
>>>>> debug_mon to 10.
>>>>>
>>>>> The attachment is the leader monitor log; the "mark-down"
>>>>> operations are at 14:22.
>>>>>
>>>>> Thanks
>>>>>
>>>>> huang jun wrote on Wed, Mar 13, 2019 at 2:07 PM:
>>>>>> can you get the value of the osd_beacon_report_interval item? the
>>>>>> default is 300; you can set it to 60, or maybe turn on debug_ms=1
>>>>>> debug_mon=10 to get more info.
>>>>>>
>>>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 1:20 PM:
>>>>>>> Hi,
>>>>>>>
>>>>>>> The servers are connected to the same switch. I can ping from
>>>>>>> any one of the servers to the other servers without a packet
>>>>>>> lost, and the average round-trip time is under 0.1 ms.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Ashley Merrick wrote on Wed, Mar 13, 2019 at 12:06 PM:
>>>>>>>> Can you ping all your OSD servers from all your mons, and ping
>>>>>>>> your mons from all your OSD servers?
>>>>>>>>
>>>>>>>> I've seen this where a route wasn't working in one direction,
>>>>>>>> so it made OSDs flap when it used that mon to check
>>>>>>>> availability:
>>>>>>>>
>>>>>>>> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou <deader...@gmail.com> wrote:
>>>>>>>>> After checking the network and syslog/dmesg, I think it's not
>>>>>>>>> a network or hardware issue. Now there are some osds being
>>>>>>>>> marked down every 15 minutes.
>>>>>>>>>
>>>>>>>>> here is ceph.log:
>>>>>>>>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 : cluster [INF] Cluster is now healthy
>>>>>>>>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 : cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
>>>>>>>>> [...]
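To apply that on one OSD without touching config files, the log level
can also be raised at runtime; a sketch assuming osd.5 (the OSD whose
log appears earlier in the thread) and default log paths:

    # From a node with an admin keyring:
    ceph tell osd.5 injectargs '--debug_osd 20'
    # or locally on the OSD host, through the admin socket:
    ceph daemon osd.5 config set debug_osd 20

    # Then follow the beacon decision in its log:
    grep send_beacon /var/log/ceph/ceph-osd.5.log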
Re: [ceph-users] cluster is not stable
Hi,

I set the config on every osd and checked whether all osds send beacons
to the monitors.

The result shows that only part of the osds send beacons, and the
monitor receives all the beacons that the osds send out.

But why don't some osds send beacons?

huang jun wrote on Wed, Mar 13, 2019 at 11:02 PM:
> sorry for not making it clear: you may need to set one of your osd's
> osd_beacon_report_interval = 5 and debug_ms = 1 and then restart the
> osd process, then check the osd log with
> 'grep beacon /var/log/ceph/ceph-osd.$id.log' to make sure the osd
> sends beacons to the mon. If the osd does send beacons to the mon, you
> should also turn on debug_ms = 1 on the leader mon and restart the mon
> process, then check the mon log to make sure the mon received the osd
> beacons.
>
> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 8:20 PM:
>> And now, new errors are appearing..
>>
>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 2:58 PM:
>>> Hi,
>>>
>>> I hadn't set osd_beacon_report_interval, so it must have been the
>>> default value. I have set osd_beacon_report_interval to 60 and
>>> debug_mon to 10.
>>>
>>> The attachment is the leader monitor log; the "mark-down" operations
>>> are at 14:22.
>>>
>>> Thanks
>>>
>>> huang jun wrote on Wed, Mar 13, 2019 at 2:07 PM:
>>>> can you get the value of the osd_beacon_report_interval item? the
>>>> default is 300; you can set it to 60, or maybe turn on debug_ms=1
>>>> debug_mon=10 to get more info.
>>>>
>>>> Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 1:20 PM:
>>>>> Hi,
>>>>>
>>>>> The servers are connected to the same switch. I can ping from any
>>>>> one of the servers to the other servers without a packet lost, and
>>>>> the average round-trip time is under 0.1 ms.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Ashley Merrick wrote on Wed, Mar 13, 2019 at 12:06 PM:
>>>>>> Can you ping all your OSD servers from all your mons, and ping
>>>>>> your mons from all your OSD servers?
>>>>>>
>>>>>> I've seen this where a route wasn't working in one direction, so
>>>>>> it made OSDs flap when it used that mon to check availability:
>>>>>>
>>>>>> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou <deader...@gmail.com> wrote:
>>>>>>> After checking the network and syslog/dmesg, I think it's not a
>>>>>>> network or hardware issue. Now there are some osds being marked
>>>>>>> down every 15 minutes.
>>>>>>>
>>>>>>> here is ceph.log:
>>>>>>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 : cluster [INF] Cluster is now healthy
>>>>>>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 : cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
>>>>>>> [...]
Re: [ceph-users] cluster is not stable
sorry for not make it clearly, you may need to set one of your osd's osd_beacon_report_interval = 5 and debug_ms=1 and then restart the osd process, then check the osd log by 'grep beacon /var/log/ceph/ceph-osd.$id.log' to make sure osd send beacons to mon, if osd send beacon to mon, you should also turn on debug_ms=1 on leader mon, and restart mon process, then check the mon log to make sure mon received osd beacon; Zhenshi Zhou 于2019年3月13日周三 下午8:20写道: > > And now, new errors are cliaming.. > > > Zhenshi Zhou 于2019年3月13日周三 下午2:58写道: >> >> Hi, >> >> I didn't set osd_beacon_report_interval as it must be the default value. >> I have set osd_beacon_report_interval to 60 and debug_mon to 10. >> >> Attachment is the leader monitor log, the "mark-down" operations is at 14:22 >> >> Thanks >> >> huang jun 于2019年3月13日周三 下午2:07写道: >>> >>> can you get the value of osd_beacon_report_interval item? the default >>> is 300, you can set to 60, or maybe turn on debug_ms=1 debug_mon=10 >>> can get more infos. >>> >>> >>> Zhenshi Zhou 于2019年3月13日周三 下午1:20写道: >>> > >>> > Hi, >>> > >>> > The servers are cennected to the same switch. >>> > I can ping from anyone of the servers to other servers >>> > without a packet lost and the average round trip time >>> > is under 0.1 ms. >>> > >>> > Thanks >>> > >>> > Ashley Merrick 于2019年3月13日周三 下午12:06写道: >>> >> >>> >> Can you ping all your OSD servers from all your mons, and ping your mons >>> >> from all your OSD servers? >>> >> >>> >> I’ve seen this where a route wasn’t working one direction, so it made >>> >> OSDs flap when it used that mon to check availability: >>> >> >>> >> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou >>> >> wrote: >>> >>> >>> >>> After checking the network and syslog/dmsg, I think it's not the >>> >>> network or hardware issue. Now there're some >>> >>> osds being marked down every 15 minutes. 
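For reference, the above procedure as concrete commands (osd.0 and mon.ceph-mon1 are example ids; adjust to your own deployment). On the OSD host, add the two options to that OSD's section of /etc/ceph/ceph.conf:

[osd.0]
osd_beacon_report_interval = 5
debug_ms = 1

# systemctl restart ceph-osd@0
# grep beacon /var/log/ceph/ceph-osd.0.log

Then, on the host with the leader mon, add debug_ms = 1 under [mon] in /etc/ceph/ceph.conf and:

# systemctl restart ceph-mon@ceph-mon1
# grep beacon /var/log/ceph/ceph-mon.ceph-mon1.log

With debug_ms = 1, each osd_beacon message the mon receives should show up in its log.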
Re: [ceph-users] cluster is not stable
And now new errors are appearing..

[image: image.png]

Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 2:58 PM:

> Hi,
>
> I didn't set osd_beacon_report_interval, so it must be the default value.
> I have set osd_beacon_report_interval to 60 and debug_mon to 10.
>
> The attachment is the leader monitor log; the "mark-down" operations are at
> 14:22.
>
> Thanks
>
> [earlier quoted messages and the duplicated ceph.log trimmed]
Re: [ceph-users] cluster is not stable
Can you get the value of the osd_beacon_report_interval option? The default is 300; you can set it to 60, or turn on debug_ms=1 and debug_mon=10 to get more information (an example of checking and changing it is below).

Zhenshi Zhou wrote on Wed, Mar 13, 2019 at 1:20 PM:
>
> Hi,
>
> The servers are connected to the same switch.
> I can ping from any one of the servers to the others
> without packet loss, and the average round-trip time
> is under 0.1 ms.
>
> Thanks
>
> [earlier quoted messages and the duplicated ceph.log trimmed]
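For reference, the current value can be read, and changed at runtime, through the OSD's admin socket on its host (osd.0 is just an example id):

# ceph daemon osd.0 config get osd_beacon_report_interval
# ceph daemon osd.0 config set osd_beacon_report_interval 60
# ceph daemon osd.0 config set debug_ms 1

Restarting the OSD after editing ceph.conf, as suggested earlier in the thread, is the sure way to make the new values stick.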
Re: [ceph-users] cluster is not stable
Hi,

The servers are connected to the same switch.
I can ping from any one of the servers to the others
without packet loss, and the average round-trip time
is under 0.1 ms.

Thanks

Ashley Merrick wrote on Wed, Mar 13, 2019 at 12:06 PM:

> Can you ping all your OSD servers from all your mons, and ping your mons
> from all your OSD servers?
>
> I've seen this where a route wasn't working in one direction, so it made
> OSDs flap when it used that mon to check availability.
>
> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou wrote:
>
>> After checking the network and syslog/dmsg, I think it's not the network
>> or hardware issue. Now there're some osds being marked down every 15 minutes.
>>
>> [duplicated ceph.log trimmed]
Re: [ceph-users] cluster is not stable
Can you ping all your OSD servers from all your mons, and ping your mons from all your OSD servers?

I've seen this where a route wasn't working in one direction, so it made OSDs flap when it used that mon to check availability (a quick way to sweep both directions is sketched below).

On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou wrote:

> After checking the network and syslog/dmsg, I think it's not the network
> or hardware issue. Now there're some osds being marked down every 15 minutes.
>
> [duplicated ceph.log trimmed; the full log is in the original message below]
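For reference, a minimal sweep along those lines, to be run from every mon and every OSD host in turn (the hostnames are made-up examples; substitute your own):

#!/bin/bash
# ping each peer three times and flag any host with loss
for h in ceph-mon1 ceph-mon2 ceph-mon3 ceph-osd1 ceph-osd2; do
  if ping -c 3 -W 1 "$h" > /dev/null 2>&1; then
    echo "$h: ok"
  else
    echo "$h: UNREACHABLE"
  fi
done

Running it from both sides is the point: a one-way routing problem shows up in only one direction.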
Re: [ceph-users] cluster is not stable
After checking the network and syslog/dmsg, I think it's not the network or hardware issue. Now there're some osds being marked down every 15 minutes. here is ceph.log: 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 : cluster [INF] Cluster is now healthy 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 : cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6758 : cluster [INF] osd.2 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6759 : cluster [INF] osd.4 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6760 : cluster [INF] osd.6 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.705999 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6761 : cluster [INF] osd.7 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706040 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6762 : cluster [INF] osd.10 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706079 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6763 : cluster [INF] osd.11 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706118 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6764 : cluster [INF] osd.12 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706155 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6765 : cluster [INF] osd.13 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706195 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6766 : cluster [INF] osd.14 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706233 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6767 : cluster [INF] osd.15 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706273 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6768 : cluster [INF] osd.16 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706312 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6769 : cluster [INF] osd.17 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706351 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6770 : cluster [INF] osd.18 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706385 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6771 : cluster [INF] osd.19 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706423 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6772 : cluster [INF] osd.20 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706503 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6773 : cluster [INF] osd.22 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706549 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6774 : cluster [INF] osd.23 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706587 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6775 : cluster [INF] osd.25 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706625 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6776 : cluster [INF] osd.26 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706665 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6777 : cluster [INF] osd.27 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706703 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6778 : cluster [INF] osd.28 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706741 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6779 : cluster [INF] osd.30 marked down after no 
beacon for 900.067020 seconds 2019-03-13 11:21:21.706779 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6780 : cluster [INF] osd.31 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706817 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6781 : cluster [INF] osd.33 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706856 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6782 : cluster [INF] osd.34 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706894 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6783 : cluster [INF] osd.36 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706930 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6784 : cluster [INF] osd.38 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.706974 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6785 : cluster [INF] osd.40 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.707013 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6786 : cluster [INF] osd.41 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.707051 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6787 : cluster [INF] osd.42 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.707090 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6788 : cluster [INF] osd.44 marked down after no beacon for 900.067020 seconds 2019-03-13 11:21:21.707128 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6789 : cluster [INF] osd.45 marked down after no bea
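A note on the 900-second figure in these lines: it matches mon_osd_report_timeout, which defaults to 900 seconds and is how long the mon waits without an OSD beacon before marking that OSD down; osd_beacon_report_interval controls how often the OSDs send those beacons. The timeout can be read from the leader mon's admin socket (mon.ceph-mon1 here is taken from the log above):

# ceph daemon mon.ceph-mon1 config get mon_osd_report_timeout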
Re: [ceph-users] cluster is not stable
Hi Kevin,

I'm sure firewalld is disabled on each host. Well, the network is not a problem: the servers are connected to the same switch, and the connection is good while the osds are marked down; there was no interruption or delay.

I restarted the leader monitor daemon and the cluster seems to have returned to a normal state (how I located the leader is shown below).

Thanks.

Kevin Olbrich wrote on Tue, Mar 12, 2019 at 5:44 PM:

> Are you sure that firewalld is stopped and disabled?
> It looks exactly like that when I missed one host in a test cluster.
>
> Kevin
>
> On Tue, Mar 12, 2019 at 09:31, Zhenshi Zhou <deader...@gmail.com> wrote:
>
>> Hi,
>>
>> I deployed a ceph cluster with good performance. But the logs
>> indicate that the cluster is not as stable as I think it should be.
>>
>> The log shows the monitors mark some osds as down periodically:
>> [image: image.png]
>>
>> I didn't find any useful information in the osd logs.
>>
>> ceph version 13.2.4 mimic (stable)
>> OS version CentOS 7.6.1810
>> kernel version 5.0.0-2.el7
>>
>> Thanks.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
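For reference, one way to identify the leader before restarting it (ceph-mon1 is just an example id; run the restart on whichever host holds the leader):

# ceph quorum_status -f json-pretty | grep quorum_leader_name
# systemctl restart ceph-mon@ceph-mon1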
Re: [ceph-users] cluster is not stable
Are you sure that firewalld is stopped and disabled?
It looks exactly like that when I missed one host in a test cluster.

Kevin

On Tue, Mar 12, 2019 at 09:31, Zhenshi Zhou wrote:

> Hi,
>
> I deployed a ceph cluster with good performance. But the logs
> indicate that the cluster is not as stable as I think it should be.
>
> The log shows the monitors mark some osds as down periodically:
> [image: image.png]
>
> I didn't find any useful information in the osd logs.
>
> ceph version 13.2.4 mimic (stable)
> OS version CentOS 7.6.1810
> kernel version 5.0.0-2.el7
>
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
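For reference, a quick per-host check using standard systemd and firewalld tooling:

# systemctl is-active firewalld
# systemctl is-enabled firewalld
# firewall-cmd --state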
Re: [ceph-users] cluster is not stable
Yep, I think it may be a network issue as well. I'll check the connections.

Thanks Eugen :)

Eugen Block wrote on Tue, Mar 12, 2019 at 4:35 PM:

> Hi,
>
> my first guess would be a network issue. Double-check your connections
> and make sure the network setup works as expected. Check syslogs,
> dmesg, switches etc. for hints that a network interruption may have
> occurred.
>
> Regards,
> Eugen
>
> Quoting Zhenshi Zhou:
>
> > Hi,
> >
> > I deployed a ceph cluster with good performance. But the logs
> > indicate that the cluster is not as stable as I think it should be.
> >
> > The log shows the monitors mark some osds as down periodically:
> > [image: image.png]
> >
> > I didn't find any useful information in the osd logs.
> >
> > ceph version 13.2.4 mimic (stable)
> > OS version CentOS 7.6.1810
> > kernel version 5.0.0-2.el7
> >
> > Thanks.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cluster is not stable
Hi,

my first guess would be a network issue. Double-check your connections and make sure the network setup works as expected. Check syslogs, dmesg, switches etc. for hints that a network interruption may have occurred (a few starting points are below).

Regards,
Eugen

Quoting Zhenshi Zhou:

Hi,

I deployed a ceph cluster with good performance. But the logs indicate that the cluster is not as stable as I think it should be.

The log shows the monitors mark some osds as down periodically:
[image: image.png]

I didn't find any useful information in the osd logs.

ceph version 13.2.4 mimic (stable)
OS version CentOS 7.6.1810
kernel version 5.0.0-2.el7

Thanks.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
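For reference, a few generic starting points on each node (eth0 is just an example interface name):

# dmesg -T | grep -i 'link is'
# journalctl --since yesterday | grep -i -e carrier -e 'link down'
# ethtool eth0 | grep -e Speed -e 'Link detected'
# ip -s link show eth0

The last command shows per-interface error and drop counters, which is a cheap way to spot a flaky cable or port.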
[ceph-users] cluster is not stable
Hi,

I deployed a ceph cluster with good performance. But the logs indicate that the cluster is not as stable as I think it should be.

The log shows the monitors mark some osds as down periodically:
[image: image.png]

I didn't find any useful information in the osd logs.

ceph version 13.2.4 mimic (stable)
OS version CentOS 7.6.1810
kernel version 5.0.0-2.el7

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com