OK, let me try to explain this better; this back and forth is not going
anywhere. I'll just be as genuine as I can and explain the issue.

What we are testing is a critical failure scenario, and really a
real-world one: what happens when it is 1 AM and the shit hits the fan,
half of your servers are down, and only 1 of the 3 MDS boxes is still
alive.
There is one very important guarantee we have seen with CephFS when the
single active MDS fails: 100% of IO is blocked. No split-brain, no
corrupted data; that has held 100% of the time ever since we started
using CephFS.

Now with multi_mds, I understand this changes the logic, and I understand
how difficult and how hard this problem is; trust me, I would not be able
to tackle it. Basically I need to answer the question: what happens when
1 of 2 active MDS daemons fails with no standbys ready to come save them?
What I have tested is not the same as a single active MDS; this absolutely
changes the logic of what happens and how we troubleshoot. The CephFS is
still alive: it still allows operations and still lets some IO go
through. How, why, and what is affected are very relevant questions if
this is what the failure looks like, since it is not 100% blocking.
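For reference, this is roughly how I have been probing the cluster state
during these tests (a sketch; the filesystem name "cephfs" is an
assumption, substitute your own):

```shell
# Hypothetical probe commands, assuming a filesystem named "cephfs".

# Show each rank's state (active/replay/failed) and available standbys:
ceph fs status cephfs

# Health detail surfaces warnings such as insufficient standby daemons
# or a failed rank:
ceph health detail

# Dump the full MDS map for the filesystem, including per-rank state:
ceph fs get cephfs
```

With one of two ranks down, `ceph fs status` is where you can see which
rank is failed while the other stays active.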

This is the problem: I have programs writing a massive amount of data, and
I don't want it corrupted or lost. I need to know what happens, and I need
to have guarantees.

Best


On Thu, Apr 26, 2018 at 5:03 PM Patrick Donnelly <pdonn...@redhat.com>
wrote:

> On Thu, Apr 26, 2018 at 4:40 PM, Scottix <scot...@gmail.com> wrote:
> >> Of course -- the mons can't tell the difference!
> > That is really unfortunate, it would be nice to know if the filesystem
> has
> > been degraded and to what degree.
>
> If a rank is laggy/crashed, the file system as a whole is generally
> unavailable. The span between partial outage and full is small and not
> worth quantifying.
>
> >> You must have standbys for high availability. This is the docs.
> > Ok but what if you have your standby go down and a master go down. This
> > could happen in the real world and is a valid error scenario.
> > Also there is
> > a period between when the standby becomes active; what happens in-between
> > that time?
>
> The standby MDS goes through a series of states where it recovers the
> lost state and connections with clients. Finally, it goes active.
>
> >> It depends(tm) on how the metadata is distributed and what locks are
> > held by each MDS.
> > You're saying that depending on which MDS had a lock on a resource, it
> > will block that particular POSIX operation? Can you clarify a little bit?
> >
> >> Standbys are not optional in any production cluster.
> > Of course in production I would hope people have standbys, but in theory
> > there is no enforcement in Ceph for this other than a warning. So when
> you
> > say not optional, that is not exactly true; it will still run.
>
> It's self-defeating to expect CephFS to enforce having standbys --
> presumably by throwing an error or becoming unavailable -- when the
> standbys exist to make the system available.
>
> There's nothing to enforce. A warning is sufficient for the operator
> that (a) they didn't configure any standbys or (b) MDS daemon
> processes/boxes are going away and not coming back as standbys (i.e.
> the pool of MDS daemons is decreasing with each failover).
>
> --
> Patrick Donnelly
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com