Re: [ceph-users] cluster is not stable

2019-03-15 Thread Janne Johansson
Den tors 14 mars 2019 kl 17:00 skrev Zhenshi Zhou :
> I think I've found the root cause of why the monmap contains no
> features. When I moved the servers from one place to another, I modified
> the monmap once.

If this was the empty cluster that you refused to redo from scratch, then I
feel it might be right to quote myself from the discussion before the move:
--
 If the cluster is clean I see no
reason for doing brain surgery on monmaps
just to "save" a few minutes of redoing correctly from scratch.



*What if you miss some part, some command gives you an error you really
aren't comfortable with, something doesn't really feel right after doing it,
then the whole lifetime of that cluster will be followed by a small nagging
feeling* that it might have been that time you followed a guide that tries
to talk you out of doing it that way, for a cluster with no data.
--

I think the part in bold is *exactly* what happened to you now: you did
something quite far out of the ordinary which was doable, but recommended
against, and some part not anticipated or covered by the "blindly
type these commands into your ceph" guide occurred.

From this point on, you _will_ know that your cluster is not 100% like
everyone else's, and any future errors and crashes just _might_ be from it
being different in a way no one has ever tested before. Some bit unset, some
string left uninitialised, some value left untouched that never could be
like that if done right.

If you have little data in it now, I would still recommend moving data
elsewhere and setting it up correctly.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster is not stable

2019-03-14 Thread Zhenshi Zhou
Hi huang,

I think I've found the root cause of why the monmap contains no
features. When I moved the servers from one place to another, I modified
the monmap once.

However, the monmap was not the same on all mons. I modified the monmap
on one of the mons, and created it from scratch on the other two mons for
convenience. (ssh is disabled among the servers, and I didn't want to do
the modify operation 3 times.)

As a result, the epoch number was not the same across the 3 mons. I think
this is the root cause.

I have transferred the monmap dumped from the leader mon, via the nc
command, to the other two mons and injected it into each mon. The mon
features have recovered now. After watching the cluster status for a
period of time, there are no "mark-down" operations on the osds.

# ceph mon feature ls
all features
supported: [kraken,luminous,mimic,osdmap-prune]
persistent: [kraken,luminous,mimic,osdmap-prune]
on current monmap (epoch 3)
persistent: [kraken,luminous,mimic,osdmap-prune]
required: [kraken,luminous,mimic,osdmap-prune]

Thanks for all your help, guys :)


Zhenshi Zhou  于2019年3月14日周四 下午3:20写道:

> Hi,
>
> I'll try that command soon.
>
> It's a new cluster installed mimic. Not sure what the exact reason, but as
> far as I can think of, 2 things may cause this issue. One is that I moved
> these servers from a datacenter to this one, followed by steps [1].
> Another
> is that I create a bridge using the interface by which ceph connection
> used.
>
> [1]
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
>
>
> Thanks
>
> huang jun  于2019年3月14日周四 下午3:04写道:
>
>> You can try that commands, but maybe you need to find the root cause
>> why the current monmap contains no features at all, do you upgrade
>> cluster from luminous to mimic,
>> or it's a new cluster installed mimic?
>>
>>
>> Zhenshi Zhou  于2019年3月14日周四 下午2:37写道:
>> >
>> > Hi huang,
>> >
>> > It's a pre-production environment. If everything is fine, I'll use it
>> for production.
>> >
>> > My cluster is version mimic, should I set all features you listed in
>> the command?
>> >
>> > Thanks
>> >
>> > huang jun  于2019年3月14日周四 下午2:11写道:
>> >>
>> >> sorry, the script should be
>> >> for f in kraken luminous mimic osdmap-prune; do
>> >>   ceph mon feature set $f --yes-i-really-mean-it
>> >> done
>> >>
>> >> huang jun  于2019年3月14日周四 下午2:04写道:
>> >> >
>> >> > ok, if this is a **test environment**, you can try
>> >> > for f in 'kraken,luminous,mimic,osdmap-prune'; do
>> >> >   ceph mon feature set $f --yes-i-really-mean-it
>> >> > done
>> >> >
>> >> > If it is a production environment, you should eval the risk first,
>> and
>> >> > maybe setup a test cluster to testing first.
>> >> >
>> >> > Zhenshi Zhou  于2019年3月14日周四 下午1:56写道:
>> >> > >
>> >> > > # ceph mon feature ls
>> >> > > all features
>> >> > > supported: [kraken,luminous,mimic,osdmap-prune]
>> >> > > persistent: [kraken,luminous,mimic,osdmap-prune]
>> >> > > on current monmap (epoch 2)
>> >> > > persistent: [none]
>> >> > > required: [none]
>> >> > >
>> >> > > huang jun  于2019年3月14日周四 下午1:50写道:
>> >> > >>
>> >> > >> what's the output of 'ceph mon feature ls'?
>> >> > >>
>> >> > >> from the code, maybe mon features not contain luminous
>> >> > >> 6263 void OSD::send_beacon(const
>> ceph::coarse_mono_clock::time_point& now)
>> >> > >>
>> >> > >>  6264 {
>> >> > >>
>> >> > >>  6265   const auto& monmap = monc->monmap;
>> >> > >>
>> >> > >>  6266   // send beacon to mon even if we are just connected, and
>> the
>> >> > >> monmap is not
>> >> > >>
>> >> > >>  6267   // initialized yet by then.
>> >> > >>
>> >> > >>  6268   if (monmap.epoch > 0 &&
>> >> > >>
>> >> > >>  6269   monmap.get_required_features().contains_all(
>> >> > >>
>> >> > >>  6270 ceph::features::mon::FEATURE_LUMINOUS)) {
>> >> > >>
>> >> > >>  6271 dout(20) << __func__ << " sending" << dendl;
>> >> > >>
>> >> > >>  6272 MOSDBeacon* beacon = nullptr;
>> >> > >>
>> >> > >>  6273 {
>> >> > >>
>> >> > >>  6274   std::lock_guard l{min_last_epoch_clean_lock};
>> >> > >>
>> >> > >>  6275   beacon = new MOSDBeacon(osdmap->get_epoch(),
>> min_last_epoch_clean);
>> >> > >>
>> >> > >>  6276   std::swap(beacon->pgs, min_last_epoch_clean_pgs);
>> >> > >>
>> >> > >>  6277   last_sent_beacon = now;
>> >> > >>
>> >> > >>  6278 }
>> >> > >>
>> >> > >>  6279 monc->send_mon_message(beacon);
>> >> > >>
>> >> > >>  6280   } else {
>> >> > >>
>> >> > >>  6281 dout(20) << __func__ << " not sending" << dendl;
>> >> > >>
>> >> > >>  6282   }
>> >> > >>
>> >> > >>  6283 }
>> >> > >>
>> >> > >>
>> >> > >> Zhenshi Zhou  于2019年3月14日周四 下午12:43写道:
>> >> > >> >
>> >> > >> > Hi,
>> >> > >> >
>> >> > >> > One of the log says the beacon not sending as below:
>> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
>> tick_without_osd_lock
>> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
>> can_inc_scr

Re: [ceph-users] cluster is not stable

2019-03-14 Thread Zhenshi Zhou
Hi,

I'll try that command soon.

It's a new cluster installed with mimic. I'm not sure what the exact reason
is, but as far as I can think of, 2 things may have caused this issue. One is
that I moved these servers from a datacenter to this one, following the steps
in [1]. Another is that I created a bridge on the interface that the ceph
connection uses.

[1]
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
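For context, the "messy way" in [1] boils down to dumping the monmap, rewriting a mon's address with monmaptool, and injecting the result back into the stopped mon. A rough sketch, not the exact steps followed: the mon ID (a), the new address, and the file path are hypothetical placeholders.

```shell
# Grab the current monmap while the cluster is still reachable (with the
# mon down, 'ceph-mon -i <id> --extract-monmap <file>' works instead).
ceph mon getmap -o /tmp/monmap

# Remove the stale entry for mon.a and re-add it with its new address.
monmaptool /tmp/monmap --rm a
monmaptool /tmp/monmap --add a 10.0.0.1:6789

# Stop the mon, inject the edited map, and start it again.
systemctl stop ceph-mon@a
ceph-mon -i a --inject-monmap /tmp/monmap
systemctl start ceph-mon@a
```

The docs stress doing this on every mon consistently, which is the part that went differently here.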


Thanks

huang jun  于2019年3月14日周四 下午3:04写道:

> You can try that commands, but maybe you need to find the root cause
> why the current monmap contains no features at all, do you upgrade
> cluster from luminous to mimic,
> or it's a new cluster installed mimic?
>
>
> Zhenshi Zhou  于2019年3月14日周四 下午2:37写道:
> >
> > Hi huang,
> >
> > It's a pre-production environment. If everything is fine, I'll use it
> for production.
> >
> > My cluster is version mimic, should I set all features you listed in the
> command?
> >
> > Thanks
> >
> > huang jun  于2019年3月14日周四 下午2:11写道:
> >>
> >> sorry, the script should be
> >> for f in kraken luminous mimic osdmap-prune; do
> >>   ceph mon feature set $f --yes-i-really-mean-it
> >> done
> >>
> >> huang jun  于2019年3月14日周四 下午2:04写道:
> >> >
> >> > ok, if this is a **test environment**, you can try
> >> > for f in 'kraken,luminous,mimic,osdmap-prune'; do
> >> >   ceph mon feature set $f --yes-i-really-mean-it
> >> > done
> >> >
> >> > If it is a production environment, you should eval the risk first, and
> >> > maybe setup a test cluster to testing first.
> >> >
> >> > Zhenshi Zhou  于2019年3月14日周四 下午1:56写道:
> >> > >
> >> > > # ceph mon feature ls
> >> > > all features
> >> > > supported: [kraken,luminous,mimic,osdmap-prune]
> >> > > persistent: [kraken,luminous,mimic,osdmap-prune]
> >> > > on current monmap (epoch 2)
> >> > > persistent: [none]
> >> > > required: [none]
> >> > >
> >> > > huang jun  于2019年3月14日周四 下午1:50写道:
> >> > >>
> >> > >> what's the output of 'ceph mon feature ls'?
> >> > >>
> >> > >> from the code, maybe mon features not contain luminous
> >> > >> 6263 void OSD::send_beacon(const
> ceph::coarse_mono_clock::time_point& now)
> >> > >>
> >> > >>  6264 {
> >> > >>
> >> > >>  6265   const auto& monmap = monc->monmap;
> >> > >>
> >> > >>  6266   // send beacon to mon even if we are just connected, and
> the
> >> > >> monmap is not
> >> > >>
> >> > >>  6267   // initialized yet by then.
> >> > >>
> >> > >>  6268   if (monmap.epoch > 0 &&
> >> > >>
> >> > >>  6269   monmap.get_required_features().contains_all(
> >> > >>
> >> > >>  6270 ceph::features::mon::FEATURE_LUMINOUS)) {
> >> > >>
> >> > >>  6271 dout(20) << __func__ << " sending" << dendl;
> >> > >>
> >> > >>  6272 MOSDBeacon* beacon = nullptr;
> >> > >>
> >> > >>  6273 {
> >> > >>
> >> > >>  6274   std::lock_guard l{min_last_epoch_clean_lock};
> >> > >>
> >> > >>  6275   beacon = new MOSDBeacon(osdmap->get_epoch(),
> min_last_epoch_clean);
> >> > >>
> >> > >>  6276   std::swap(beacon->pgs, min_last_epoch_clean_pgs);
> >> > >>
> >> > >>  6277   last_sent_beacon = now;
> >> > >>
> >> > >>  6278 }
> >> > >>
> >> > >>  6279 monc->send_mon_message(beacon);
> >> > >>
> >> > >>  6280   } else {
> >> > >>
> >> > >>  6281 dout(20) << __func__ << " not sending" << dendl;
> >> > >>
> >> > >>  6282   }
> >> > >>
> >> > >>  6283 }
> >> > >>
> >> > >>
> >> > >> Zhenshi Zhou  于2019年3月14日周四 下午12:43写道:
> >> > >> >
> >> > >> > Hi,
> >> > >> >
> >> > >> > One of the log says the beacon not sending as below:
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
> tick_without_osd_lock
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> scrub_time_permit should run between 0 - 24 now 12 = yes
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> scrub_load_below_threshold loadavg per cpu 0 < max 0.5 = yes
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub
> load_is_low=1
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 sched_scrub
> 1.79 scheduled at 2019-03-14 13:17:51.290050 > 2019-03-14 12:41:15.723848
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub
> done
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
> promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target
> 25 obj/sec or 5 MiB/sec
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> promote_throttle_recalibrate  new_prob 1000
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
> promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted
> new_prob 1000, prob 1000 -> 1000
> >> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon
> not sending
> >> > >> >
> >> > >> >
> >> > >> > huang jun  于2019年3月14日周四 下午12:30写道:
> >> > >> >>
> >> > >> >> osd will 

Re: [ceph-users] cluster is not stable

2019-03-14 Thread huang jun
You can try those commands, but maybe you need to find the root cause of
why the current monmap contains no features at all. Did you upgrade the
cluster from luminous to mimic, or is it a new cluster installed with mimic?
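One quick way to compare what each mon actually holds (assuming all mons are up, and with hypothetical mon IDs a/b/c) is to check the monmap features cluster-wide and then ask each mon daemon for its own monmap epoch via the admin socket:

```shell
# Features required by the current monmap -- "[none]" here is the red flag.
ceph mon feature ls

# Ask each mon daemon directly, so a stale or divergent monmap on one mon
# becomes visible (run on each mon host, or adjust IDs as appropriate).
for m in a b c; do
  ceph daemon mon.$m mon_status | grep -E '"epoch"|"name"'
done
```

If the reported epochs differ between mons, the monmaps are not in sync, which matches the root cause found later in this thread.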


Zhenshi Zhou  于2019年3月14日周四 下午2:37写道:
>
> Hi huang,
>
> It's a pre-production environment. If everything is fine, I'll use it for 
> production.
>
> My cluster is version mimic, should I set all features you listed in the 
> command?
>
> Thanks
>
> huang jun  于2019年3月14日周四 下午2:11写道:
>>
>> sorry, the script should be
>> for f in kraken luminous mimic osdmap-prune; do
>>   ceph mon feature set $f --yes-i-really-mean-it
>> done
>>
>> huang jun  于2019年3月14日周四 下午2:04写道:
>> >
>> > ok, if this is a **test environment**, you can try
>> > for f in 'kraken,luminous,mimic,osdmap-prune'; do
>> >   ceph mon feature set $f --yes-i-really-mean-it
>> > done
>> >
>> > If it is a production environment, you should eval the risk first, and
>> > maybe setup a test cluster to testing first.
>> >
>> > Zhenshi Zhou  于2019年3月14日周四 下午1:56写道:
>> > >
>> > > # ceph mon feature ls
>> > > all features
>> > > supported: [kraken,luminous,mimic,osdmap-prune]
>> > > persistent: [kraken,luminous,mimic,osdmap-prune]
>> > > on current monmap (epoch 2)
>> > > persistent: [none]
>> > > required: [none]
>> > >
>> > > huang jun  于2019年3月14日周四 下午1:50写道:
>> > >>
>> > >> what's the output of 'ceph mon feature ls'?
>> > >>
>> > >> from the code, maybe mon features not contain luminous
>> > >> 6263 void OSD::send_beacon(const ceph::coarse_mono_clock::time_point& 
>> > >> now)
>> > >>
>> > >>  6264 {
>> > >>
>> > >>  6265   const auto& monmap = monc->monmap;
>> > >>
>> > >>  6266   // send beacon to mon even if we are just connected, and the
>> > >> monmap is not
>> > >>
>> > >>  6267   // initialized yet by then.
>> > >>
>> > >>  6268   if (monmap.epoch > 0 &&
>> > >>
>> > >>  6269   monmap.get_required_features().contains_all(
>> > >>
>> > >>  6270 ceph::features::mon::FEATURE_LUMINOUS)) {
>> > >>
>> > >>  6271 dout(20) << __func__ << " sending" << dendl;
>> > >>
>> > >>  6272 MOSDBeacon* beacon = nullptr;
>> > >>
>> > >>  6273 {
>> > >>
>> > >>  6274   std::lock_guard l{min_last_epoch_clean_lock};
>> > >>
>> > >>  6275   beacon = new MOSDBeacon(osdmap->get_epoch(), 
>> > >> min_last_epoch_clean);
>> > >>
>> > >>  6276   std::swap(beacon->pgs, min_last_epoch_clean_pgs);
>> > >>
>> > >>  6277   last_sent_beacon = now;
>> > >>
>> > >>  6278 }
>> > >>
>> > >>  6279 monc->send_mon_message(beacon);
>> > >>
>> > >>  6280   } else {
>> > >>
>> > >>  6281 dout(20) << __func__ << " not sending" << dendl;
>> > >>
>> > >>  6282   }
>> > >>
>> > >>  6283 }
>> > >>
>> > >>
>> > >> Zhenshi Zhou  于2019年3月14日周四 下午12:43写道:
>> > >> >
>> > >> > Hi,
>> > >> >
>> > >> > One of the log says the beacon not sending as below:
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 
>> > >> > tick_without_osd_lock
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
>> > >> > can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 scrub_time_permit 
>> > >> > should run between 0 - 24 now 12 = yes
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
>> > >> > scrub_load_below_threshold loadavg per cpu 0 < max 0.5 = yes
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub 
>> > >> > load_is_low=1
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 sched_scrub 1.79 
>> > >> > scheduled at 2019-03-14 13:17:51.290050 > 2019-03-14 12:41:15.723848
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub done
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 
>> > >> > promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; 
>> > >> > target 25 obj/sec or 5 MiB/sec
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
>> > >> > promote_throttle_recalibrate  new_prob 1000
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 
>> > >> > promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted 
>> > >> > new_prob 1000, prob 1000 -> 1000
>> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon not 
>> > >> > sending
>> > >> >
>> > >> >
>> > >> > huang jun  于2019年3月14日周四 下午12:30写道:
>> > >> >>
>> > >> >> osd will not send beacons to mon if its not in ACTIVE state,
>> > >> >> so you maybe turn on one osd's debug_osd=20 to see what is going on
>> > >> >>
>> > >> >> Zhenshi Zhou  于2019年3月14日周四 上午11:07写道:
>> > >> >> >
>> > >> >> > What's more, I find that the osds don't send beacons all the time, 
>> > >> >> > some osds send beacons
>> > >> >> > for a period of time and then stop sending beacons.
>> > >> >> >
>> > >> >> >
>> > >> >> >
>> > >> >> > Zhenshi Zhou  于2019年3月14日周四 上午10:57写道:
>> > >> >> >>
>> > >> >> >> Hi
>> > >> >> >>
>> > >> >> >> I set the config on every osd and check whether all osds send 
>> 

Re: [ceph-users] cluster is not stable

2019-03-13 Thread Zhenshi Zhou
Hi huang,

It's a pre-production environment. If everything is fine, I'll use it for
production.

My cluster is on mimic; should I set all the features you listed in the
command?

Thanks

huang jun  于2019年3月14日周四 下午2:11写道:

> sorry, the script should be
> for f in kraken luminous mimic osdmap-prune; do
>   ceph mon feature set $f --yes-i-really-mean-it
> done
>
> huang jun  于2019年3月14日周四 下午2:04写道:
> >
> > ok, if this is a **test environment**, you can try
> > for f in 'kraken,luminous,mimic,osdmap-prune'; do
> >   ceph mon feature set $f --yes-i-really-mean-it
> > done
> >
> > If it is a production environment, you should eval the risk first, and
> > maybe setup a test cluster to testing first.
> >
> > Zhenshi Zhou  于2019年3月14日周四 下午1:56写道:
> > >
> > > # ceph mon feature ls
> > > all features
> > > supported: [kraken,luminous,mimic,osdmap-prune]
> > > persistent: [kraken,luminous,mimic,osdmap-prune]
> > > on current monmap (epoch 2)
> > > persistent: [none]
> > > required: [none]
> > >
> > > huang jun  于2019年3月14日周四 下午1:50写道:
> > >>
> > >> what's the output of 'ceph mon feature ls'?
> > >>
> > >> from the code, maybe mon features not contain luminous
> > >> 6263 void OSD::send_beacon(const ceph::coarse_mono_clock::time_point&
> now)
> > >>
> > >>  6264 {
> > >>
> > >>  6265   const auto& monmap = monc->monmap;
> > >>
> > >>  6266   // send beacon to mon even if we are just connected, and the
> > >> monmap is not
> > >>
> > >>  6267   // initialized yet by then.
> > >>
> > >>  6268   if (monmap.epoch > 0 &&
> > >>
> > >>  6269   monmap.get_required_features().contains_all(
> > >>
> > >>  6270 ceph::features::mon::FEATURE_LUMINOUS)) {
> > >>
> > >>  6271 dout(20) << __func__ << " sending" << dendl;
> > >>
> > >>  6272 MOSDBeacon* beacon = nullptr;
> > >>
> > >>  6273 {
> > >>
> > >>  6274   std::lock_guard l{min_last_epoch_clean_lock};
> > >>
> > >>  6275   beacon = new MOSDBeacon(osdmap->get_epoch(),
> min_last_epoch_clean);
> > >>
> > >>  6276   std::swap(beacon->pgs, min_last_epoch_clean_pgs);
> > >>
> > >>  6277   last_sent_beacon = now;
> > >>
> > >>  6278 }
> > >>
> > >>  6279 monc->send_mon_message(beacon);
> > >>
> > >>  6280   } else {
> > >>
> > >>  6281 dout(20) << __func__ << " not sending" << dendl;
> > >>
> > >>  6282   }
> > >>
> > >>  6283 }
> > >>
> > >>
> > >> Zhenshi Zhou  于2019年3月14日周四 下午12:43写道:
> > >> >
> > >> > Hi,
> > >> >
> > >> > One of the log says the beacon not sending as below:
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
> tick_without_osd_lock
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> scrub_time_permit should run between 0 - 24 now 12 = yes
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> scrub_load_below_threshold loadavg per cpu 0 < max 0.5 = yes
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub
> load_is_low=1
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 sched_scrub
> 1.79 scheduled at 2019-03-14 13:17:51.290050 > 2019-03-14 12:41:15.723848
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub done
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
> promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target
> 25 obj/sec or 5 MiB/sec
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> promote_throttle_recalibrate  new_prob 1000
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
> promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted
> new_prob 1000, prob 1000 -> 1000
> > >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon not
> sending
> > >> >
> > >> >
> > >> > huang jun  于2019年3月14日周四 下午12:30写道:
> > >> >>
> > >> >> osd will not send beacons to mon if its not in ACTIVE state,
> > >> >> so you maybe turn on one osd's debug_osd=20 to see what is going on
> > >> >>
> > >> >> Zhenshi Zhou  于2019年3月14日周四 上午11:07写道:
> > >> >> >
> > >> >> > What's more, I find that the osds don't send beacons all the
> time, some osds send beacons
> > >> >> > for a period of time and then stop sending beacons.
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > Zhenshi Zhou  于2019年3月14日周四 上午10:57写道:
> > >> >> >>
> > >> >> >> Hi
> > >> >> >>
> > >> >> >> I set the config on every osd and check whether all osds send
> beacons
> > >> >> >> to monitors.
> > >> >> >>
> > >> >> >> The result shows that only part of the osds send beacons and
> the monitor
> > >> >> >> receives all beacons from which the osd send out.
> > >> >> >>
> > >> >> >> But why some osds don't send beacon?
> > >> >> >>
> > >> >> >> huang jun  于2019年3月13日周三 下午11:02写道:
> > >> >> >>>
> > >> >> >>> sorry for not make it clearly, you may need to set one of your
> osd's
> > >> >> >>> osd_beacon_report_interval = 5
> > >> >> >>> and debug_ms=1 and then restart the

Re: [ceph-users] cluster is not stable

2019-03-13 Thread huang jun
sorry, the script should be
for f in kraken luminous mimic osdmap-prune; do
  ceph mon feature set $f --yes-i-really-mean-it
done
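The difference from the earlier version is shell word splitting: quoting the whole comma-separated string makes the loop body run exactly once, with `$f` set to an invalid combined feature name, while unquoted space-separated words iterate one feature at a time, which is what `ceph mon feature set` expects. A quick demonstration with `echo` standing in for the ceph command:

```shell
# Quoted comma-separated string: the loop body runs exactly once,
# with $f set to the entire string.
for f in 'kraken,luminous,mimic,osdmap-prune'; do
  echo "quoted: $f"
done

# Unquoted, space-separated words: the loop body runs once per feature.
for f in kraken luminous mimic osdmap-prune; do
  echo "unquoted: $f"
done
```

The first loop prints one line, the second prints four, one per feature.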

huang jun  于2019年3月14日周四 下午2:04写道:
>
> ok, if this is a **test environment**, you can try
> for f in 'kraken,luminous,mimic,osdmap-prune'; do
>   ceph mon feature set $f --yes-i-really-mean-it
> done
>
> If it is a production environment, you should eval the risk first, and
> maybe setup a test cluster to testing first.
>
> Zhenshi Zhou  于2019年3月14日周四 下午1:56写道:
> >
> > # ceph mon feature ls
> > all features
> > supported: [kraken,luminous,mimic,osdmap-prune]
> > persistent: [kraken,luminous,mimic,osdmap-prune]
> > on current monmap (epoch 2)
> > persistent: [none]
> > required: [none]
> >
> > huang jun  于2019年3月14日周四 下午1:50写道:
> >>
> >> what's the output of 'ceph mon feature ls'?
> >>
> >> from the code, maybe mon features not contain luminous
> >> 6263 void OSD::send_beacon(const ceph::coarse_mono_clock::time_point& now)
> >>
> >>  6264 {
> >>
> >>  6265   const auto& monmap = monc->monmap;
> >>
> >>  6266   // send beacon to mon even if we are just connected, and the
> >> monmap is not
> >>
> >>  6267   // initialized yet by then.
> >>
> >>  6268   if (monmap.epoch > 0 &&
> >>
> >>  6269   monmap.get_required_features().contains_all(
> >>
> >>  6270 ceph::features::mon::FEATURE_LUMINOUS)) {
> >>
> >>  6271 dout(20) << __func__ << " sending" << dendl;
> >>
> >>  6272 MOSDBeacon* beacon = nullptr;
> >>
> >>  6273 {
> >>
> >>  6274   std::lock_guard l{min_last_epoch_clean_lock};
> >>
> >>  6275   beacon = new MOSDBeacon(osdmap->get_epoch(), 
> >> min_last_epoch_clean);
> >>
> >>  6276   std::swap(beacon->pgs, min_last_epoch_clean_pgs);
> >>
> >>  6277   last_sent_beacon = now;
> >>
> >>  6278 }
> >>
> >>  6279 monc->send_mon_message(beacon);
> >>
> >>  6280   } else {
> >>
> >>  6281 dout(20) << __func__ << " not sending" << dendl;
> >>
> >>  6282   }
> >>
> >>  6283 }
> >>
> >>
> >> Zhenshi Zhou  于2019年3月14日周四 下午12:43写道:
> >> >
> >> > Hi,
> >> >
> >> > One of the log says the beacon not sending as below:
> >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
> >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
> >> > can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
> >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 scrub_time_permit 
> >> > should run between 0 - 24 now 12 = yes
> >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
> >> > scrub_load_below_threshold loadavg per cpu 0 < max 0.5 = yes
> >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub 
> >> > load_is_low=1
> >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 sched_scrub 1.79 
> >> > scheduled at 2019-03-14 13:17:51.290050 > 2019-03-14 12:41:15.723848
> >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub done
> >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 
> >> > promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; 
> >> > target 25 obj/sec or 5 MiB/sec
> >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
> >> > promote_throttle_recalibrate  new_prob 1000
> >> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 
> >> > promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted 
> >> > new_prob 1000, prob 1000 -> 1000
> >> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon not 
> >> > sending
> >> >
> >> >
> >> > huang jun  于2019年3月14日周四 下午12:30写道:
> >> >>
> >> >> osd will not send beacons to mon if its not in ACTIVE state,
> >> >> so you maybe turn on one osd's debug_osd=20 to see what is going on
> >> >>
> >> >> Zhenshi Zhou  于2019年3月14日周四 上午11:07写道:
> >> >> >
> >> >> > What's more, I find that the osds don't send beacons all the time, 
> >> >> > some osds send beacons
> >> >> > for a period of time and then stop sending beacons.
> >> >> >
> >> >> >
> >> >> >
> >> >> > Zhenshi Zhou  于2019年3月14日周四 上午10:57写道:
> >> >> >>
> >> >> >> Hi
> >> >> >>
> >> >> >> I set the config on every osd and check whether all osds send beacons
> >> >> >> to monitors.
> >> >> >>
> >> >> >> The result shows that only part of the osds send beacons and the 
> >> >> >> monitor
> >> >> >> receives all beacons from which the osd send out.
> >> >> >>
> >> >> >> But why some osds don't send beacon?
> >> >> >>
> >> >> >> huang jun  于2019年3月13日周三 下午11:02写道:
> >> >> >>>
> >> >> >>> sorry for not make it clearly, you may need to set one of your osd's
> >> >> >>> osd_beacon_report_interval = 5
> >> >> >>> and debug_ms=1 and then restart the osd process, then check the osd
> >> >> >>> log by 'grep beacon /var/log/ceph/ceph-osd.$id.log'
> >> >> >>> to make sure osd send beacons to mon, if osd send beacon to mon, you
> >> >> >>> should also turn on debug_ms=1 on leader mon,
> >> >> >>> and restart mon process, then check the mon log to make sure mon
> >> >> >>> received osd beacon;
> >> >> >>>
> >> >> >>> Zhenshi Zhou  于

Re: [ceph-users] cluster is not stable

2019-03-13 Thread huang jun
ok, if this is a **test environment**, you can try
for f in 'kraken,luminous,mimic,osdmap-prune'; do
  ceph mon feature set $f --yes-i-really-mean-it
done

If it is a production environment, you should evaluate the risk first, and
maybe set up a test cluster for testing first.

Zhenshi Zhou  于2019年3月14日周四 下午1:56写道:
>
> # ceph mon feature ls
> all features
> supported: [kraken,luminous,mimic,osdmap-prune]
> persistent: [kraken,luminous,mimic,osdmap-prune]
> on current monmap (epoch 2)
> persistent: [none]
> required: [none]
>
> huang jun  于2019年3月14日周四 下午1:50写道:
>>
>> what's the output of 'ceph mon feature ls'?
>>
>> from the code, maybe mon features not contain luminous
>> 6263 void OSD::send_beacon(const ceph::coarse_mono_clock::time_point& now)
>>
>>  6264 {
>>
>>  6265   const auto& monmap = monc->monmap;
>>
>>  6266   // send beacon to mon even if we are just connected, and the
>> monmap is not
>>
>>  6267   // initialized yet by then.
>>
>>  6268   if (monmap.epoch > 0 &&
>>
>>  6269   monmap.get_required_features().contains_all(
>>
>>  6270 ceph::features::mon::FEATURE_LUMINOUS)) {
>>
>>  6271 dout(20) << __func__ << " sending" << dendl;
>>
>>  6272 MOSDBeacon* beacon = nullptr;
>>
>>  6273 {
>>
>>  6274   std::lock_guard l{min_last_epoch_clean_lock};
>>
>>  6275   beacon = new MOSDBeacon(osdmap->get_epoch(), 
>> min_last_epoch_clean);
>>
>>  6276   std::swap(beacon->pgs, min_last_epoch_clean_pgs);
>>
>>  6277   last_sent_beacon = now;
>>
>>  6278 }
>>
>>  6279 monc->send_mon_message(beacon);
>>
>>  6280   } else {
>>
>>  6281 dout(20) << __func__ << " not sending" << dendl;
>>
>>  6282   }
>>
>>  6283 }
>>
>>
>> Zhenshi Zhou  于2019年3月14日周四 下午12:43写道:
>> >
>> > Hi,
>> >
>> > One of the log says the beacon not sending as below:
>> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
>> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 can_inc_scrubs_pending 
>> > 0 -> 1 (max 1, active 0)
>> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 scrub_time_permit 
>> > should run between 0 - 24 now 12 = yes
>> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
>> > scrub_load_below_threshold loadavg per cpu 0 < max 0.5 = yes
>> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub 
>> > load_is_low=1
>> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 sched_scrub 1.79 
>> > scheduled at 2019-03-14 13:17:51.290050 > 2019-03-14 12:41:15.723848
>> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub done
>> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 
>> > promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; 
>> > target 25 obj/sec or 5 MiB/sec
>> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
>> > promote_throttle_recalibrate  new_prob 1000
>> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 
>> > promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted 
>> > new_prob 1000, prob 1000 -> 1000
>> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon not sending
>> >
>> >
>> > huang jun  于2019年3月14日周四 下午12:30写道:
>> >>
>> >> osd will not send beacons to mon if its not in ACTIVE state,
>> >> so you maybe turn on one osd's debug_osd=20 to see what is going on
>> >>
>> >> Zhenshi Zhou  于2019年3月14日周四 上午11:07写道:
>> >> >
>> >> > What's more, I find that the osds don't send beacons all the time, some 
>> >> > osds send beacons
>> >> > for a period of time and then stop sending beacons.
>> >> >
>> >> >
>> >> >
>> >> > Zhenshi Zhou  于2019年3月14日周四 上午10:57写道:
>> >> >>
>> >> >> Hi
>> >> >>
>> >> >> I set the config on every osd and check whether all osds send beacons
>> >> >> to monitors.
>> >> >>
>> >> >> The result shows that only part of the osds send beacons and the 
>> >> >> monitor
>> >> >> receives all beacons from which the osd send out.
>> >> >>
>> >> >> But why some osds don't send beacon?
>> >> >>
>> >> >> huang jun  于2019年3月13日周三 下午11:02写道:
>> >> >>>
>> >> >>> sorry for not make it clearly, you may need to set one of your osd's
>> >> >>> osd_beacon_report_interval = 5
>> >> >>> and debug_ms=1 and then restart the osd process, then check the osd
>> >> >>> log by 'grep beacon /var/log/ceph/ceph-osd.$id.log'
>> >> >>> to make sure osd send beacons to mon, if osd send beacon to mon, you
>> >> >>> should also turn on debug_ms=1 on leader mon,
>> >> >>> and restart mon process, then check the mon log to make sure mon
>> >> >>> received osd beacon;
>> >> >>>
>> >> >>> Zhenshi Zhou  于2019年3月13日周三 下午8:20写道:
>> >> >>> >
>> >> >>> > And now, new errors are cliaming..
>> >> >>> >
>> >> >>> >
>> >> >>> > Zhenshi Zhou  于2019年3月13日周三 下午2:58写道:
>> >> >>> >>
>> >> >>> >> Hi,
>> >> >>> >>
>> >> >>> >> I didn't set  osd_beacon_report_interval as it must be the default 
>> >> >>> >> value.
>> >> >>> >> I have set osd_beacon_report_interval to 60 and debug_mon to 10.
>> >> >>> >>
>> >> >>> >> Attachment is the leader monitor log, 

Re: [ceph-users] cluster is not stable

2019-03-13 Thread Zhenshi Zhou
# ceph mon feature ls
all features
supported: [kraken,luminous,mimic,osdmap-prune]
persistent: [kraken,luminous,mimic,osdmap-prune]
on current monmap (epoch 2)
persistent: [none]
required: [none]

huang jun  于2019年3月14日周四 下午1:50写道:

> what's the output of 'ceph mon feature ls'?
>
> From the code, maybe the mon features do not contain luminous:
>
> void OSD::send_beacon(const ceph::coarse_mono_clock::time_point& now)
> {
>   const auto& monmap = monc->monmap;
>   // send beacon to mon even if we are just connected, and the
>   // monmap is not initialized yet by then.
>   if (monmap.epoch > 0 &&
>       monmap.get_required_features().contains_all(
>         ceph::features::mon::FEATURE_LUMINOUS)) {
>     dout(20) << __func__ << " sending" << dendl;
>     MOSDBeacon* beacon = nullptr;
>     {
>       std::lock_guard l{min_last_epoch_clean_lock};
>       beacon = new MOSDBeacon(osdmap->get_epoch(), min_last_epoch_clean);
>       std::swap(beacon->pgs, min_last_epoch_clean_pgs);
>       last_sent_beacon = now;
>     }
>     monc->send_mon_message(beacon);
>   } else {
>     dout(20) << __func__ << " not sending" << dendl;
>   }
> }
>
>
> Zhenshi Zhou  于2019年3月14日周四 下午12:43写道:
> >
> > Hi,
> >
> > One of the logs shows the beacon not being sent, as below:
> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 scrub_time_permit
> should run between 0 - 24 now 12 = yes
> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> scrub_load_below_threshold loadavg per cpu 0 < max 0.5 = yes
> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub
> load_is_low=1
> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 sched_scrub 1.79
> scheduled at 2019-03-14 13:17:51.290050 > 2019-03-14 12:41:15.723848
> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub done
> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
> promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target
> 25 obj/sec or 5 MiB/sec
> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
> promote_throttle_recalibrate  new_prob 1000
> > 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
> promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted
> new_prob 1000, prob 1000 -> 1000
> > 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon not
> sending
> >
> >
> > huang jun  于2019年3月14日周四 下午12:30写道:
> >>
> >> An osd will not send beacons to the mon if it's not in the ACTIVE state,
> >> so you may want to turn on debug_osd=20 on one osd to see what is going on.
> >>
> >> Zhenshi Zhou  于2019年3月14日周四 上午11:07写道:
> >> >
> >> > What's more, I find that the osds don't send beacons all the time; some
> >> > osds send beacons for a period of time and then stop.
> >> >
> >> >
> >> >
> >> > Zhenshi Zhou  于2019年3月14日周四 上午10:57写道:
> >> >>
> >> >> Hi
> >> >>
> >> >> I set the config on every osd and checked whether all osds send beacons
> >> >> to the monitors.
> >> >>
> >> >> The result shows that only part of the osds send beacons, and the
> >> >> monitor receives all the beacons that those osds send out.
> >> >>
> >> >> But why don't some osds send beacons?
> >> >>
> >> >> huang jun  于2019年3月13日周三 下午11:02写道:
> >> >>>
> >> >>> Sorry for not making it clear: on one of your osds, set
> >> >>> osd_beacon_report_interval = 5
> >> >>> and debug_ms = 1, then restart the osd process and check the osd
> >> >>> log with 'grep beacon /var/log/ceph/ceph-osd.$id.log'
> >> >>> to make sure the osd sends beacons to the mon. If it does, also
> >> >>> turn on debug_ms = 1 on the leader mon,
> >> >>> restart the mon process, and check the mon log to make sure the mon
> >> >>> received the osd beacons.
> >> >>>
> >> >>> Zhenshi Zhou  于2019年3月13日周三 下午8:20写道:
> >> >>> >
> >> >>> > And now, new errors are appearing...
> >> >>> >
> >> >>> >
> >> >>> > Zhenshi Zhou  于2019年3月13日周三 下午2:58写道:
> >> >>> >>
> >> >>> >> Hi,
> >> >>> >>
> >> >>> >> I hadn't set osd_beacon_report_interval, so it must be the
> >> >>> >> default value.
> >> >>> >> I have set osd_beacon_report_interval to 60 and debug_mon to 10.
> >> >>> >>
> >> >>> >> The attachment is the leader monitor log; the "mark-down" operations
> >> >>> >> are at 14:22.
> >> >>> >>
> >> >>> >> Thanks
> >> >>> >>
> >> >>> >> huang jun  于2019年3月13日周三 下午2:07写道:
> >> >>> >>>
> >> >>> >>> Can you get the value of the osd_beacon_report_interval item? The
> >> >>> >>> default is 300; you can set it to 60, or turn on debug_ms=1 and
> >> >>> >>> debug_mon=10 to get more info.
> >> >>> >>>
> >> >>> >>>
> >> >>> >>> Zhenshi Zhou  于2019年3月13日周三 下午1:20写道:
> >> >>> >>> >
> >> >>> >>> > Hi,
> >> >>> >>> >
>> >>> >>> > The servers are connected

Re: [ceph-users] cluster is not stable

2019-03-13 Thread huang jun
what's the output of 'ceph mon feature ls'?

From the code, maybe the mon features do not contain luminous:

void OSD::send_beacon(const ceph::coarse_mono_clock::time_point& now)
{
  const auto& monmap = monc->monmap;
  // send beacon to mon even if we are just connected, and the
  // monmap is not initialized yet by then.
  if (monmap.epoch > 0 &&
      monmap.get_required_features().contains_all(
        ceph::features::mon::FEATURE_LUMINOUS)) {
    dout(20) << __func__ << " sending" << dendl;
    MOSDBeacon* beacon = nullptr;
    {
      std::lock_guard l{min_last_epoch_clean_lock};
      beacon = new MOSDBeacon(osdmap->get_epoch(), min_last_epoch_clean);
      std::swap(beacon->pgs, min_last_epoch_clean_pgs);
      last_sent_beacon = now;
    }
    monc->send_mon_message(beacon);
  } else {
    dout(20) << __func__ << " not sending" << dendl;
  }
}


Zhenshi Zhou  于2019年3月14日周四 下午12:43写道:
>
> Hi,
>
> One of the logs shows the beacon not being sent, as below:
> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
> 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 can_inc_scrubs_pending 0 
> -> 1 (max 1, active 0)
> 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 scrub_time_permit should 
> run between 0 - 24 now 12 = yes
> 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
> scrub_load_below_threshold loadavg per cpu 0 < max 0.5 = yes
> 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub load_is_low=1
> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 sched_scrub 1.79 
> scheduled at 2019-03-14 13:17:51.290050 > 2019-03-14 12:41:15.723848
> 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub done
> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 
> promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target 
> 25 obj/sec or 5 MiB/sec
> 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 
> promote_throttle_recalibrate  new_prob 1000
> 2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 
> promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted 
> new_prob 1000, prob 1000 -> 1000
> 2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon not sending
>
>
> huang jun  于2019年3月14日周四 下午12:30写道:
>>
>> An osd will not send beacons to the mon if it's not in the ACTIVE state,
>> so you may want to turn on debug_osd=20 on one osd to see what is going on.
>>
>> Zhenshi Zhou  于2019年3月14日周四 上午11:07写道:
>> >
>> > What's more, I find that the osds don't send beacons all the time; some
>> > osds send beacons for a period of time and then stop.
>> >
>> >
>> >
>> > Zhenshi Zhou  于2019年3月14日周四 上午10:57写道:
>> >>
>> >> Hi
>> >>
>> >> I set the config on every osd and checked whether all osds send beacons
>> >> to the monitors.
>> >>
>> >> The result shows that only part of the osds send beacons, and the
>> >> monitor receives all the beacons that those osds send out.
>> >>
>> >> But why don't some osds send beacons?
>> >>
>> >> huang jun  于2019年3月13日周三 下午11:02写道:
>> >>>
>> >>> Sorry for not making it clear: on one of your osds, set
>> >>> osd_beacon_report_interval = 5
>> >>> and debug_ms = 1, then restart the osd process and check the osd
>> >>> log with 'grep beacon /var/log/ceph/ceph-osd.$id.log'
>> >>> to make sure the osd sends beacons to the mon. If it does, also
>> >>> turn on debug_ms = 1 on the leader mon,
>> >>> restart the mon process, and check the mon log to make sure the mon
>> >>> received the osd beacons.
>> >>>
>> >>> Zhenshi Zhou  于2019年3月13日周三 下午8:20写道:
>> >>> >
>> >>> > And now, new errors are appearing...
>> >>> >
>> >>> >
>> >>> > Zhenshi Zhou  于2019年3月13日周三 下午2:58写道:
>> >>> >>
>> >>> >> Hi,
>> >>> >>
>> >>> >> I hadn't set osd_beacon_report_interval, so it must be the default
>> >>> >> value.
>> >>> >> I have set osd_beacon_report_interval to 60 and debug_mon to 10.
>> >>> >>
>> >>> >> The attachment is the leader monitor log; the "mark-down" operations
>> >>> >> are at 14:22.
>> >>> >>
>> >>> >> Thanks
>> >>> >>
>> >>> >> huang jun  于2019年3月13日周三 下午2:07写道:
>> >>> >>>
>> >>> >>> Can you get the value of the osd_beacon_report_interval item? The
>> >>> >>> default is 300; you can set it to 60, or turn on debug_ms=1 and
>> >>> >>> debug_mon=10 to get more info.
>> >>> >>>
>> >>> >>>
>> >>> >>> Zhenshi Zhou  于2019年3月13日周三 下午1:20写道:
>> >>> >>> >
>> >>> >>> > Hi,
>> >>> >>> >
>> >>> >>> > The servers are connected to the same switch.
>> >>> >>> > I can ping from any one of the servers to the others
>> >>> >>> > without packet loss, and the average round-trip time
>> >>> >>> > is under 0.1 ms.
>> >>> >>> >
>> >>> >>> > Thanks
>> >>> >>> >
>> >>> >>> > Ashley Merrick  于2019年3月13日周三 下午12:06写道:
>> >>> >>> >>
>> >>> >>> >> Can you ping all your OSD servers from all your mons, and ping 
>> >>> >>> >> your mons from all your OSD servers?
>> >>> >>> >>
>> >>> >>> >> I’ve seen this where a

Re: [ceph-users] cluster is not stable

2019-03-13 Thread Zhenshi Zhou
Hi,

One of the logs shows the beacon not being sent, as below:
2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 can_inc_scrubs_pending
0 -> 1 (max 1, active 0)
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 scrub_time_permit
should run between 0 - 24 now 12 = yes
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
scrub_load_below_threshold loadavg per cpu 0 < max 0.5 = yes
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub
load_is_low=1
2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 sched_scrub 1.79
scheduled at 2019-03-14 13:17:51.290050 > 2019-03-14 12:41:15.723848
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 sched_scrub done
2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target
25 obj/sec or 5 MiB/sec
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032
promote_throttle_recalibrate  new_prob 1000
2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032
promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted
new_prob 1000, prob 1000 -> 1000
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon not sending
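With no beacons arriving at all, the mark-downs seen earlier in ceph.log ("marked down after no beacon for 900.067020 seconds") are just a timeout sweep: any osd whose last beacon is older than mon_osd_report_timeout (default 900 s) is marked down, which is why the cluster flaps roughly every 15 minutes. A simplified sketch of that bookkeeping (illustrative logic, not the monitor's actual code):

```python
MON_OSD_REPORT_TIMEOUT = 900.0  # seconds; Ceph's mon_osd_report_timeout default

def osds_to_mark_down(last_beacon, now):
    """Return osd ids whose last received beacon is older than the timeout.

    last_beacon maps osd id -> timestamp of the last beacon the mon received.
    """
    return sorted(
        osd for osd, t in last_beacon.items()
        if now - t >= MON_OSD_REPORT_TIMEOUT
    )

# osd.1 stopped sending over 900 s ago; osd.3 beaconed recently:
last = {1: 0.0, 3: 850.0}
print(osds_to_mark_down(last, 901.0))  # [1]
```

Since the osds here never send beacons at all, every osd eventually crosses the threshold, matching the batch mark-downs in the log.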


huang jun  于2019年3月14日周四 下午12:30写道:

> An osd will not send beacons to the mon if it's not in the ACTIVE state,
> so you may want to turn on debug_osd=20 on one osd to see what is going on.
>
> Zhenshi Zhou  于2019年3月14日周四 上午11:07写道:
> >
> > What's more, I find that the osds don't send beacons all the time; some
> > osds send beacons for a period of time and then stop.
> >
> >
> >
> > Zhenshi Zhou  于2019年3月14日周四 上午10:57写道:
> >>
> >> Hi
> >>
> >> I set the config on every osd and checked whether all osds send beacons
> >> to the monitors.
> >>
> >> The result shows that only part of the osds send beacons, and the
> >> monitor receives all the beacons that those osds send out.
> >>
> >> But why don't some osds send beacons?
> >>
> >> huang jun  于2019年3月13日周三 下午11:02写道:
> >>>
> >>> Sorry for not making it clear: on one of your osds, set
> >>> osd_beacon_report_interval = 5
> >>> and debug_ms = 1, then restart the osd process and check the osd
> >>> log with 'grep beacon /var/log/ceph/ceph-osd.$id.log'
> >>> to make sure the osd sends beacons to the mon. If it does, also
> >>> turn on debug_ms = 1 on the leader mon,
> >>> restart the mon process, and check the mon log to make sure the mon
> >>> received the osd beacons.
> >>>
> >>> Zhenshi Zhou  于2019年3月13日周三 下午8:20写道:
> >>> >
> >>> > And now, new errors are appearing...
> >>> >
> >>> >
> >>> > Zhenshi Zhou  于2019年3月13日周三 下午2:58写道:
> >>> >>
> >>> >> Hi,
> >>> >>
> >>> >> I hadn't set osd_beacon_report_interval, so it must be the default
> >>> >> value.
> >>> >> I have set osd_beacon_report_interval to 60 and debug_mon to 10.
> >>> >>
> >>> >> The attachment is the leader monitor log; the "mark-down" operations
> >>> >> are at 14:22.
> >>> >>
> >>> >> Thanks
> >>> >>
> >>> >> huang jun  于2019年3月13日周三 下午2:07写道:
> >>> >>>
> >>> >>> Can you get the value of the osd_beacon_report_interval item? The
> >>> >>> default is 300; you can set it to 60, or turn on debug_ms=1 and
> >>> >>> debug_mon=10 to get more info.
> >>> >>>
> >>> >>>
> >>> >>> Zhenshi Zhou  于2019年3月13日周三 下午1:20写道:
> >>> >>> >
> >>> >>> > Hi,
> >>> >>> >
> >>> >>> > The servers are connected to the same switch.
> >>> >>> > I can ping from any one of the servers to the others
> >>> >>> > without packet loss, and the average round-trip time
> >>> >>> > is under 0.1 ms.
> >>> >>> >
> >>> >>> > Thanks
> >>> >>> >
> >>> >>> > Ashley Merrick  于2019年3月13日周三
> 下午12:06写道:
> >>> >>> >>
> >>> >>> >> Can you ping all your OSD servers from all your mons, and ping
> your mons from all your OSD servers?
> >>> >>> >>
> >>> >>> >> I’ve seen this where a route wasn’t working one direction, so
> it made OSDs flap when it used that mon to check availability:
> >>> >>> >>
> >>> >>> >> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou <
> deader...@gmail.com> wrote:
> >>> >>> >>>
> >>> >>> >>> After checking the network and syslog/dmesg, I think it's not
> the network or hardware issue. Now there're some
> >>> >>> >>> osds being marked down every 15 minutes.
> >>> >>> >>>
> >>> >>> >>> here is ceph.log:
> >>> >>> >>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0
> 10.39.0.34:6789/0 6756 : cluster [INF] Cluster is now healthy
> >>> >>> >>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0
> 10.39.0.34:6789/0 6757 : cluster [INF] osd.1 marked down after no beacon
> for 900.067020 seconds
> >>> >>> >>> 2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0
> 10.39.0.34:6789/0 6758 : cluster [INF] osd.2 marked down after no beacon
> for 900.067020 seconds
> >>> >>> >>> 2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0
> 10.39.0.34:6789/0 6759 : cluster [INF] osd.4 marked down after no beacon
> for 900.067020 seconds
> >>> >>> >>> 2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0
> 10.39.0.34:6789/0 6760 : cluster [INF] osd.6 marked down after no beacon
> for 900.0

Re: [ceph-users] cluster is not stable

2019-03-13 Thread huang jun
An osd will not send beacons to the mon if it's not in the ACTIVE state,
so you may want to turn on debug_osd=20 on one osd to see what is going on.

Zhenshi Zhou  于2019年3月14日周四 上午11:07写道:
>
> What's more, I find that the osds don't send beacons all the time; some osds
> send beacons for a period of time and then stop.
>
>
>
> Zhenshi Zhou  于2019年3月14日周四 上午10:57写道:
>>
>> Hi
>>
>> I set the config on every osd and checked whether all osds send beacons
>> to the monitors.
>>
>> The result shows that only part of the osds send beacons, and the
>> monitor receives all the beacons that those osds send out.
>>
>> But why don't some osds send beacons?
>>
>> huang jun  于2019年3月13日周三 下午11:02写道:
>>>
>>> Sorry for not making it clear: on one of your osds, set
>>> osd_beacon_report_interval = 5
>>> and debug_ms = 1, then restart the osd process and check the osd
>>> log with 'grep beacon /var/log/ceph/ceph-osd.$id.log'
>>> to make sure the osd sends beacons to the mon. If it does, also
>>> turn on debug_ms = 1 on the leader mon,
>>> restart the mon process, and check the mon log to make sure the mon
>>> received the osd beacons.
>>>
>>> Zhenshi Zhou  于2019年3月13日周三 下午8:20写道:
>>> >
>>> > And now, new errors are appearing...
>>> >
>>> >
>>> > Zhenshi Zhou  于2019年3月13日周三 下午2:58写道:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I hadn't set osd_beacon_report_interval, so it must be the default value.
>>> >> I have set osd_beacon_report_interval to 60 and debug_mon to 10.
>>> >>
>>> >> The attachment is the leader monitor log; the "mark-down" operations are
>>> >> at 14:22.
>>> >>
>>> >> Thanks
>>> >>
>>> >> huang jun  于2019年3月13日周三 下午2:07写道:
>>> >>>
>>> >>> Can you get the value of the osd_beacon_report_interval item? The
>>> >>> default is 300; you can set it to 60, or turn on debug_ms=1 and
>>> >>> debug_mon=10 to get more info.
>>> >>>
>>> >>>
>>> >>> Zhenshi Zhou  于2019年3月13日周三 下午1:20写道:
>>> >>> >
>>> >>> > Hi,
>>> >>> >
>>> >>> > The servers are connected to the same switch.
>>> >>> > I can ping from any one of the servers to the others
>>> >>> > without packet loss, and the average round-trip time
>>> >>> > is under 0.1 ms.
>>> >>> >
>>> >>> > Thanks
>>> >>> >
>>> >>> > Ashley Merrick  于2019年3月13日周三 下午12:06写道:
>>> >>> >>
>>> >>> >> Can you ping all your OSD servers from all your mons, and ping your 
>>> >>> >> mons from all your OSD servers?
>>> >>> >>
>>> >>> >> I’ve seen this where a route wasn’t working one direction, so it 
>>> >>> >> made OSDs flap when it used that mon to check availability:
>>> >>> >>
>>> >>> >> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou  
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> After checking the network and syslog/dmesg, I think it's not the
>>> >>> >>> network or hardware issue. Now there're some
>>> >>> >>> osds being marked down every 15 minutes.
>>> >>> >>>
>>> >>> >>> here is ceph.log:
>>> >>> >>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6756 : cluster [INF] Cluster is now healthy
>>> >>> >>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6757 : cluster [INF] osd.1 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6758 : cluster [INF] osd.2 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6759 : cluster [INF] osd.4 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6760 : cluster [INF] osd.6 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.705999 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6761 : cluster [INF] osd.7 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.706040 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6762 : cluster [INF] osd.10 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.706079 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6763 : cluster [INF] osd.11 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.706118 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6764 : cluster [INF] osd.12 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.706155 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6765 : cluster [INF] osd.13 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.706195 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6766 : cluster [INF] osd.14 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-13 11:21:21.706233 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 
>>> >>> >>> 6767 : cluster [INF] osd.15 marked down after no beacon for 
>>> >>> >>> 900.067020 seconds
>>> >>> >>> 2019-03-

Re: [ceph-users] cluster is not stable

2019-03-13 Thread Zhenshi Zhou
Hi

I set the config on every osd and checked whether all osds send beacons
to the monitors.

The result shows that only part of the osds send beacons, and the monitor
receives all the beacons that those osds send out.

But why don't some osds send beacons?
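One way to check this across the fleet, following the grep suggestion below, is to pull the send_beacon lines out of each osd log and see which osds last logged "sending" versus "not sending". A small illustrative parser (the log format follows the debug_osd=20 output quoted in this thread; the helper itself is hypothetical, not a Ceph tool):

```python
import re

# Matches the debug lines quoted in this thread, e.g.
# "... 20 osd.5 17032 send_beacon not sending"
BEACON_RE = re.compile(r"osd\.(\d+) \d+ send_beacon (not sending|sending)")

def beacon_status(log_text):
    """Return {osd_id: last send_beacon verdict} from concatenated osd logs."""
    status = {}
    for m in BEACON_RE.finditer(log_text):
        status[int(m.group(1))] = m.group(2)  # later lines overwrite earlier
    return status

log = """\
2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 send_beacon not sending
2019-03-14 12:41:20.101 7f3c27684700 20 osd.3 17032 send_beacon sending
"""
print(beacon_status(log))  # {5: 'not sending', 3: 'sending'}
```

Any osd whose last verdict is "not sending" is the one the monitor will eventually mark down for missing beacons.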

huang jun  于2019年3月13日周三 下午11:02写道:

> Sorry for not making it clear: on one of your osds, set
> osd_beacon_report_interval = 5
> and debug_ms = 1, then restart the osd process and check the osd
> log with 'grep beacon /var/log/ceph/ceph-osd.$id.log'
> to make sure the osd sends beacons to the mon. If it does, also
> turn on debug_ms = 1 on the leader mon,
> restart the mon process, and check the mon log to make sure the mon
> received the osd beacons.
>
> Zhenshi Zhou  于2019年3月13日周三 下午8:20写道:
> >
> > And now, new errors are appearing...
> >
> >
> > Zhenshi Zhou  于2019年3月13日周三 下午2:58写道:
> >>
> >> Hi,
> >>
> >> I hadn't set osd_beacon_report_interval, so it must be the default
> >> value.
> >> I have set osd_beacon_report_interval to 60 and debug_mon to 10.
> >>
> >> The attachment is the leader monitor log; the "mark-down" operations are
> >> at 14:22.
> >>
> >> Thanks
> >>
> >> huang jun  于2019年3月13日周三 下午2:07写道:
> >>>
> >>> Can you get the value of the osd_beacon_report_interval item? The
> >>> default is 300; you can set it to 60, or turn on debug_ms=1 and
> >>> debug_mon=10 to get more info.
> >>>
> >>>
> >>> Zhenshi Zhou  于2019年3月13日周三 下午1:20写道:
> >>> >
> >>> > Hi,
> >>> >
> >>> > The servers are connected to the same switch.
> >>> > I can ping from any one of the servers to the others
> >>> > without packet loss, and the average round-trip time
> >>> > is under 0.1 ms.
> >>> >
> >>> > Thanks
> >>> >
> >>> > Ashley Merrick  于2019年3月13日周三 下午12:06写道:
> >>> >>
> >>> >> Can you ping all your OSD servers from all your mons, and ping your
> mons from all your OSD servers?
> >>> >>
> >>> >> I’ve seen this where a route wasn’t working one direction, so it
> made OSDs flap when it used that mon to check availability:
> >>> >>
> >>> >> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou 
> wrote:
> >>> >>>
> >>> >>> After checking the network and syslog/dmesg, I think it's not the
> network or hardware issue. Now there're some
> >>> >>> osds being marked down every 15 minutes.
> >>> >>>
> >>> >>> here is ceph.log:
> >>> >>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6756 : cluster [INF] Cluster is now healthy
> >>> >>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6757 : cluster [INF] osd.1 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6758 : cluster [INF] osd.2 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6759 : cluster [INF] osd.4 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6760 : cluster [INF] osd.6 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.705999 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6761 : cluster [INF] osd.7 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706040 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6762 : cluster [INF] osd.10 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706079 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6763 : cluster [INF] osd.11 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706118 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6764 : cluster [INF] osd.12 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706155 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6765 : cluster [INF] osd.13 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706195 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6766 : cluster [INF] osd.14 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706233 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6767 : cluster [INF] osd.15 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706273 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6768 : cluster [INF] osd.16 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706312 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6769 : cluster [INF] osd.17 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706351 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6770 : cluster [INF] osd.18 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706385 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6771 : cluster [INF] osd.19 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706423 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
> 6772 : cluster [INF] osd.20 marked down after no beacon for 900.067020
> seconds
> >>> >>> 2019-03-13 11:21:21.706503 mon.ceph-mon1 mon.0 10.39.0

Re: [ceph-users] cluster is not stable

2019-03-13 Thread huang jun
Sorry for not making it clear: on one of your osds, set
osd_beacon_report_interval = 5
and debug_ms = 1, then restart the osd process and check the osd
log with 'grep beacon /var/log/ceph/ceph-osd.$id.log'
to make sure the osd sends beacons to the mon. If it does, also
turn on debug_ms = 1 on the leader mon,
restart the mon process, and check the mon log to make sure the mon
received the osd beacons.

Zhenshi Zhou  于2019年3月13日周三 下午8:20写道:
>
> And now, new errors are appearing...
>
>
> Zhenshi Zhou  于2019年3月13日周三 下午2:58写道:
>>
>> Hi,
>>
>> I hadn't set osd_beacon_report_interval, so it must be the default value.
>> I have set osd_beacon_report_interval to 60 and debug_mon to 10.
>>
>> The attachment is the leader monitor log; the "mark-down" operations are at 14:22.
>>
>> Thanks
>>
>> huang jun  于2019年3月13日周三 下午2:07写道:
>>>
>>> Can you get the value of the osd_beacon_report_interval item? The
>>> default is 300; you can set it to 60, or turn on debug_ms=1 and
>>> debug_mon=10 to get more info.
>>>
>>>
>>> Zhenshi Zhou  于2019年3月13日周三 下午1:20写道:
>>> >
>>> > Hi,
>>> >
>>> > The servers are connected to the same switch.
>>> > I can ping from any one of the servers to the others
>>> > without packet loss, and the average round-trip time
>>> > is under 0.1 ms.
>>> >
>>> > Thanks
>>> >
>>> > Ashley Merrick  于2019年3月13日周三 下午12:06写道:
>>> >>
>>> >> Can you ping all your OSD servers from all your mons, and ping your mons 
>>> >> from all your OSD servers?
>>> >>
>>> >> I’ve seen this where a route wasn’t working one direction, so it made 
>>> >> OSDs flap when it used that mon to check availability:
>>> >>
>>> >> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou  
>>> >> wrote:
>>> >>>
>>> >>> After checking the network and syslog/dmesg, I think it's not the
>>> >>> network or hardware issue. Now there're some
>>> >>> osds being marked down every 15 minutes.
>>> >>>
>>> >>> here is ceph.log:
>>> >>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 : 
>>> >>> cluster [INF] Cluster is now healthy
>>> >>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 : 
>>> >>> cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6758 : 
>>> >>> cluster [INF] osd.2 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6759 : 
>>> >>> cluster [INF] osd.4 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6760 : 
>>> >>> cluster [INF] osd.6 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.705999 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6761 : 
>>> >>> cluster [INF] osd.7 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706040 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6762 : 
>>> >>> cluster [INF] osd.10 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706079 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6763 : 
>>> >>> cluster [INF] osd.11 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706118 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6764 : 
>>> >>> cluster [INF] osd.12 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706155 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6765 : 
>>> >>> cluster [INF] osd.13 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706195 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6766 : 
>>> >>> cluster [INF] osd.14 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706233 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6767 : 
>>> >>> cluster [INF] osd.15 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706273 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6768 : 
>>> >>> cluster [INF] osd.16 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706312 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6769 : 
>>> >>> cluster [INF] osd.17 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706351 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6770 : 
>>> >>> cluster [INF] osd.18 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706385 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6771 : 
>>> >>> cluster [INF] osd.19 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706423 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6772 : 
>>> >>> cluster [INF] osd.20 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706503 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6773 : 
>>> >>> cluster [INF] osd.22 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706549 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6774 : 
>>> >>> cluster [INF] osd.23 marked down after no beacon for 900.067020 seconds
>>> >>> 2019-03-13 11:21:21.706587 mon.ceph-mon1 mon.0 10.39.0.3

Re: [ceph-users] cluster is not stable

2019-03-13 Thread Zhenshi Zhou
And now, new errors are appearing...
[image: image.png]

Zhenshi Zhou  于2019年3月13日周三 下午2:58写道:

> Hi,
>
> I hadn't set osd_beacon_report_interval, so it must be the default value.
> I have set osd_beacon_report_interval to 60 and debug_mon to 10.
>
> The attachment is the leader monitor log; the "mark-down" operations are at
> 14:22.
>
> Thanks
>
> huang jun  于2019年3月13日周三 下午2:07写道:
>
>> Can you get the value of the osd_beacon_report_interval item? The
>> default is 300; you can set it to 60, or turn on debug_ms=1 and
>> debug_mon=10 to get more info.
>>
>>
>> Zhenshi Zhou  于2019年3月13日周三 下午1:20写道:
>> >
>> > Hi,
>> >
>> > The servers are connected to the same switch.
>> > I can ping from any one of the servers to the others
>> > without packet loss, and the average round-trip time
>> > is under 0.1 ms.
>> >
>> > Thanks
>> >
>> > Ashley Merrick  于2019年3月13日周三 下午12:06写道:
>> >>
>> >> Can you ping all your OSD servers from all your mons, and ping your
>> mons from all your OSD servers?
>> >>
>> >> I’ve seen this where a route wasn’t working one direction, so it made
>> OSDs flap when it used that mon to check availability:
>> >>
>> >> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou 
>> wrote:
>> >>>
>> >>> After checking the network and syslog/dmesg, I think it's not the
>> network or hardware issue. Now there're some
>> >>> osds being marked down every 15 minutes.
>> >>>
>> >>> here is ceph.log:
>> >>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
>> 6756 : cluster [INF] Cluster is now healthy
>> >>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0
>> 6757 : cluster [INF] osd.1 marked down after no beacon for 900.067020
>> seconds

Re: [ceph-users] cluster is not stable

2019-03-12 Thread huang jun
Can you get the value of the osd_beacon_report_interval item? The default
is 300; you can set it to 60. Or turn on debug_ms=1 debug_mon=10
to get more info.
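For reference, that suggestion expressed as a ceph.conf fragment (a sketch only — the section placement is an assumption, and 300 is the default cited above; values can also be injected at runtime without a restart):

```ini
# Sketch of the settings suggested above -- verify against the Mimic docs.
[osd]
# How often each OSD sends a beacon to the monitors (default: 300 seconds).
osd_beacon_report_interval = 60

[mon]
# Verbose monitor logging, to trace why OSDs are being marked down.
debug_mon = 10

[global]
# Messenger-level logging; noisy, so revert after debugging.
debug_ms = 1
```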


Zhenshi Zhou wrote on Wed, 13 Mar 2019 at 13:20:
>
> Hi,
>
> The servers are connected to the same switch.
> I can ping from any one of the servers to the others
> without packet loss, and the average round-trip time
> is under 0.1 ms.
>
> Thanks
>
> Ashley Merrick wrote on Wed, 13 Mar 2019 at 12:06:
>>
>> Can you ping all your OSD servers from all your mons, and ping your mons 
>> from all your OSD servers?
>>
>> I’ve seen this where a route wasn’t working one direction, so it made OSDs 
>> flap when it used that mon to check availability:
>>
>> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou  wrote:
>>>
>>> After checking the network and syslog/dmesg, I don't think it's a network or
>>> hardware issue. Now some
>>> OSDs are being marked down every 15 minutes.
>>>
>>> here is ceph.log:
>>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 : 
>>> cluster [INF] Cluster is now healthy
>>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 : 
>>> cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Hi,

The servers are connected to the same switch.
I can ping from any one of the servers to the others
without packet loss, and the average round-trip time
is under 0.1 ms.

Thanks

Ashley Merrick wrote on Wed, 13 Mar 2019 at 12:06:

> Can you ping all your OSD servers from all your mons, and ping your mons
> from all your OSD servers?
>
> I’ve seen this where a route wasn’t working one direction, so it made OSDs
> flap when it used that mon to check availability:
>
> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou  wrote:
>
>> After checking the network and syslog/dmesg, I don't think it's a network
>> or hardware issue. Now some
>> OSDs are being marked down every 15 minutes.
>>
>> here is ceph.log:
>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 :
>> cluster [INF] Cluster is now healthy
>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 :
>> cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Ashley Merrick
Can you ping all your OSD servers from all your mons, and ping your mons
from all your OSD servers?

I’ve seen this where a route wasn’t working one direction, so it made OSDs
flap when it used that mon to check availability:
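That check can be scripted — a minimal sketch (the host argument is hypothetical; in practice run it from every mon against all OSD hosts, and from every OSD host against all mons, so both directions are covered):

```shell
#!/bin/sh
# Print one line per host: "<host> reachable" or "<host> UNREACHABLE".
check_hosts() {
  for h in "$@"; do
    if ping -c 1 -W 1 "$h" > /dev/null 2>&1; then
      echo "$h reachable"
    else
      echo "$h UNREACHABLE"
    fi
  done
}

# Hypothetical host list -- substitute your mon and OSD addresses.
check_hosts 127.0.0.1
```

A one-way routing problem like the one described above shows up as UNREACHABLE from one side only, which is exactly why the check has to run in both directions.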

On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou  wrote:

> After checking the network and syslog/dmesg, I don't think it's a network
> or hardware issue. Now some
> OSDs are being marked down every 15 minutes.
>
> here is ceph.log:
> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 :
> cluster [INF] Cluster is now healthy
> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 :
> cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
After checking the network and syslog/dmesg, I don't think it's a network or
hardware issue. Now some
OSDs are being marked down every 15 minutes.

here is ceph.log:
2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 :
cluster [INF] Cluster is now healthy
2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 :
cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6758 :
cluster [INF] osd.2 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6759 :
cluster [INF] osd.4 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6760 :
cluster [INF] osd.6 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.705999 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6761 :
cluster [INF] osd.7 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706040 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6762 :
cluster [INF] osd.10 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706079 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6763 :
cluster [INF] osd.11 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706118 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6764 :
cluster [INF] osd.12 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706155 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6765 :
cluster [INF] osd.13 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706195 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6766 :
cluster [INF] osd.14 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706233 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6767 :
cluster [INF] osd.15 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706273 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6768 :
cluster [INF] osd.16 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706312 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6769 :
cluster [INF] osd.17 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706351 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6770 :
cluster [INF] osd.18 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706385 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6771 :
cluster [INF] osd.19 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706423 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6772 :
cluster [INF] osd.20 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706503 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6773 :
cluster [INF] osd.22 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706549 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6774 :
cluster [INF] osd.23 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706587 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6775 :
cluster [INF] osd.25 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706625 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6776 :
cluster [INF] osd.26 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706665 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6777 :
cluster [INF] osd.27 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706703 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6778 :
cluster [INF] osd.28 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706741 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6779 :
cluster [INF] osd.30 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706779 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6780 :
cluster [INF] osd.31 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706817 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6781 :
cluster [INF] osd.33 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706856 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6782 :
cluster [INF] osd.34 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706894 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6783 :
cluster [INF] osd.36 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706930 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6784 :
cluster [INF] osd.38 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706974 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6785 :
cluster [INF] osd.40 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.707013 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6786 :
cluster [INF] osd.41 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.707051 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6787 :
cluster [INF] osd.42 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.707090 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6788 :
cluster [INF] osd.44 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.707128 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6789 :
cluster [INF] osd.45 marked down after no bea
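With hundreds of near-identical lines, it helps to condense a log like the one above before reading it. A small sketch that counts mark-down events per OSD — shown here against a two-line stand-in for the real /var/log/ceph/ceph.log:

```shell
#!/bin/sh
# Two sample lines standing in for the full ceph.log.
sample='2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 : cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6758 : cluster [INF] osd.2 marked down after no beacon for 900.067020 seconds'

# Pull the osd.N token out of every "marked down after no beacon" line,
# then count how many times each OSD was marked down.
printf '%s\n' "$sample" |
  awk '/marked down after no beacon/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^osd\.[0-9]+$/) print $i
       }' |
  sort | uniq -c
```

Against the real log, feed `cat /var/log/ceph/ceph.log` into the same pipeline; OSDs with unusually high counts are the ones to inspect first.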

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Hi Kevin,

I'm sure firewalld is disabled on each host.

Well, the network is not a problem. The servers are connected
to the same switch, and the connection is good when the OSDs
are marked as down. There was no interruption or delay.

I restarted the leader monitor daemon and it seems to have returned to the
normal state.

Thanks.

Kevin Olbrich wrote on Tue, 12 Mar 2019 at 17:44:

> Are you sure that firewalld is stopped and disabled?
> Looks exactly like that when I missed one host in a test cluster.
>
> Kevin
>
>
> On Tue, 12 Mar 2019 at 09:31, Zhenshi Zhou <
> deader...@gmail.com> wrote:
>
>> Hi,
>>
>> I deployed a ceph cluster with good performance. But the logs
>> indicate that the cluster is not as stable as I think it should be.
>>
>> The log shows the monitors marking some OSDs as down periodically:
>> [image: image.png]
>>
>> I didn't find any useful information in osd logs.
>>
>> ceph version 13.2.4 mimic (stable)
>> OS version CentOS 7.6.1810
>> kernel version 5.0.0-2.el7
>>
>> Thanks.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


Re: [ceph-users] cluster is not stable

2019-03-12 Thread Kevin Olbrich
Are you sure that firewalld is stopped and disabled?
Looks exactly like that when I missed one host in a test cluster.
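A quick way to script that verification — a sketch assuming systemd hosts (in practice, loop over every mon and OSD node via ssh):

```shell
#!/bin/sh
# Report firewalld's state, tolerating hosts without systemctl.
check_firewalld() {
  if command -v systemctl > /dev/null 2>&1; then
    state="$(systemctl is-active firewalld 2> /dev/null || true)"
    echo "firewalld: ${state:-unknown}"
  else
    echo "firewalld: systemctl not available"
  fi
}

check_firewalld
```

Anything other than `inactive` on a Ceph host deserves a closer look; if a firewall must stay on, the OSDs need their default port range (6800-7300) opened, along with 6789 for the mons.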

Kevin


On Tue, 12 Mar 2019 at 09:31, Zhenshi Zhou wrote:

> Hi,
>
> I deployed a ceph cluster with good performance. But the logs
> indicate that the cluster is not as stable as I think it should be.
>
> The log shows the monitors marking some OSDs as down periodically:
> [image: image.png]
>
> I didn't find any useful information in osd logs.
>
> ceph version 13.2.4 mimic (stable)
> OS version CentOS 7.6.1810
> kernel version 5.0.0-2.el7
>
> Thanks.
>


Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Yep, I think it may be a network issue as well. I'll check the connections.

Thanks Eugen:)

Eugen Block wrote on Tue, 12 Mar 2019 at 16:35:

> Hi,
>
> my first guess would be a network issue. Double-check your connections
> and make sure the network setup works as expected. Check syslogs,
> dmesg, switches etc. for hints that a network interruption may have
> occurred.
>
> Regards,
> Eugen
>
>
Quoting Zhenshi Zhou:
>
> > Hi,
> >
> > I deployed a ceph cluster with good performance. But the logs
> > indicate that the cluster is not as stable as I think it should be.
> >
> > The log shows the monitors marking some OSDs as down periodically:
> > [image: image.png]
> >
> > I didn't find any useful information in osd logs.
> >
> > ceph version 13.2.4 mimic (stable)
> > OS version CentOS 7.6.1810
> > kernel version 5.0.0-2.el7
> >
> > Thanks.
>
>
>
>


Re: [ceph-users] cluster is not stable

2019-03-12 Thread Eugen Block

Hi,

my first guess would be a network issue. Double-check your connections  
and make sure the network setup works as expected. Check syslogs,  
dmesg, switches etc. for hints that a network interruption may have  
occurred.
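As a concrete form of that check, a sketch that filters a kernel log for link flaps — the two sample lines are hypothetical; on a live host you would pipe `dmesg` (or journalctl/syslog) through the same grep:

```shell
#!/bin/sh
# Hypothetical dmesg excerpt showing a NIC link flap.
log='Mar 13 11:20:58 ceph-osd1 kernel: eth0: Link is Down
Mar 13 11:21:02 ceph-osd1 kernel: eth0: Link is Up - 10Gbps/Full'

# Surface only the link up/down transitions.
printf '%s\n' "$log" | grep -Ei 'link is (down|up)'
```

A flap that lines up with the mark-down timestamps in ceph.log would point squarely at the network rather than at Ceph itself.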


Regards,
Eugen


Quoting Zhenshi Zhou:


Hi,

I deployed a ceph cluster with good performance. But the logs
indicate that the cluster is not as stable as I think it should be.

The log shows the monitors marking some OSDs as down periodically:
[image: image.png]

I didn't find any useful information in osd logs.

ceph version 13.2.4 mimic (stable)
OS version CentOS 7.6.1810
kernel version 5.0.0-2.el7

Thanks.






[ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Hi,

I deployed a ceph cluster with good performance. But the logs
indicate that the cluster is not as stable as I think it should be.

The log shows the monitors marking some OSDs as down periodically:
[image: image.png]

I didn't find any useful information in osd logs.

ceph version 13.2.4 mimic (stable)
OS version CentOS 7.6.1810
kernel version 5.0.0-2.el7

Thanks.