Hi Sage,

Yeah, that is what I mean, and the output makes more sense than what I was
thinking of before (only show a message if I cannot reach the whole CRUSH level).

I will try to do it then.

Thanks.

-Xiaoxi



> -----Original Message-----
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Friday, November 20, 2015 7:28 PM
> To: Chen, Xiaoxi
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Aggregate failure report in ceph -s
> 
> On Fri, 20 Nov 2015, Chen, Xiaoxi wrote:
> >
> > Hi Sage,
> >
> >        As we are looking at the failure detection part of
> > Ceph (basically around the OSD flapping issue), we got a suggestion
> > from a customer to show an aggregated failure report in "ceph -s".
> > The idea is:
> >
> >       When an OSD finds it cannot hear heartbeats from some of its
> > peers, it will try to aggregate by failure domain, say "I cannot
> > reach all my peers in Rack C, something wrong?", and this kind of
> > log will be shown in "ceph -s". So if we look at "ceph -s" and notice a
> > lot of complaints about not reaching Rack C, we can easily diagnose that
> > Rack C has some network issue.
> >
> >
> >
> >       Does that make sense?
> 
> Yeah, sounds reasonable to me!  It's a bit more awkward to do this at the
> mon level since rack C may talk to the mon, but doing it at the OSD makes
> sense.  There will be a lot of heuristics involved, though.  I expect the
> messages might include
> 
> - cannot reach _% of peers outside of my $crushlevel $foo [on front|back]
> - cannot reach _% of hosts in $crushlevel $foo [on front|back]
> 
> ?
> 
> Also note that it would be easiest to log these in the cluster log (ceph -w,
> not ceph -s). I'm guessing that's what you mean?
> 
> Thanks!
> sage
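
For illustration only, here is a rough Python sketch of the aggregation discussed above. It is not Ceph code: the function name, the location dictionaries, and the rule that a host counts as unreachable when none of its peer OSDs answer are all assumptions made for this example. The front/back network distinction is left out. The threshold parameter covers the "only show if I cannot reach the whole CRUSH level" case when set to 100.

```python
from collections import defaultdict


def aggregate_failures(my_loc, peers, unreachable, level="rack", threshold=100.0):
    """Summarise missed heartbeats per CRUSH bucket at `level`.

    my_loc      -- location of the reporting OSD, e.g. {"rack": "A", "host": "a1"}
    peers       -- dict mapping peer OSD id -> location dict (hypothetical layout)
    unreachable -- set of peer OSD ids whose heartbeats were missed
    threshold   -- minimum percentage required before a message is emitted
    """
    messages = []

    # Form 1: share of unreachable peers outside my own bucket.
    outside = [osd for osd, loc in peers.items() if loc[level] != my_loc[level]]
    if outside:
        pct = 100.0 * sum(osd in unreachable for osd in outside) / len(outside)
        if pct >= threshold:
            messages.append(
                f"cannot reach {pct:.0f}% of peers outside of my {level} {my_loc[level]}"
            )

    # Form 2: per bucket, share of hosts on which no peer OSD is reachable.
    hosts = defaultdict(lambda: defaultdict(list))  # bucket -> host -> [osd ids]
    for osd, loc in peers.items():
        hosts[loc[level]][loc["host"]].append(osd)
    for bucket in sorted(hosts):
        by_host = hosts[bucket]
        down = sum(all(o in unreachable for o in osds) for osds in by_host.values())
        pct = 100.0 * down / len(by_host)
        if pct >= threshold:
            messages.append(f"cannot reach {pct:.0f}% of hosts in {level} {bucket}")

    return messages


# Example: an OSD in rack A loses every peer in rack C.
my_loc = {"rack": "A", "host": "a1"}
peers = {
    1: {"rack": "A", "host": "a2"},
    2: {"rack": "B", "host": "b1"},
    3: {"rack": "C", "host": "c1"},
    4: {"rack": "C", "host": "c1"},
    5: {"rack": "C", "host": "c2"},
}
for line in aggregate_failures(my_loc, peers, unreachable={3, 4, 5}):
    print(line)
```

With the default threshold of 100 only whole-bucket failures are reported; lowering it (say to 50) would also surface partial failures, at the cost of more noise in the cluster log.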
