Hey Gregory,

On Wed, 2013-09-11 at 11:36 -0700, Gregory Farnum wrote:
> On Wed, Sep 11, 2013 at 7:48 AM, Yan, Zheng <uker...@gmail.com> wrote:
> > On Wed, Sep 11, 2013 at 10:06 PM, Oliver Daudey <oli...@xs4all.nl> wrote:
> >> Hey Yan,
> >>
> >> On 11-09-13 15:12, Yan, Zheng wrote:
> >>> On Wed, Sep 11, 2013 at 7:51 PM, Oliver Daudey <oli...@xs4all.nl> wrote:
> >>>> Hey Gregory,
> >>>>
> >>>> I wiped and re-created the MDS-cluster I just mailed about, starting out
> >>>> by making sure CephFS is not mounted anywhere, stopping all MDSs,
> >>>> completely cleaning the "data" and "metadata"-pools using "rados
> >>>> --pool=<pool> cleanup <prefix>", then creating a new cluster using `ceph
> >>>> mds newfs 1 0 --yes-i-really-mean-it' and starting all MDSs again.
> >>>> Directly afterwards, I saw this:
> >>>> # rados --pool=metadata ls
> >>>> 1.00000000
> >>>> 2.00000000
> >>>> 200.00000000
> >>>> 200.00000001
> >>>> 600.00000000
> >>>> 601.00000000
> >>>> 602.00000000
> >>>> 603.00000000
> >>>> 605.00000000
> >>>> 606.00000000
> >>>> 608.00000000
> >>>> 609.00000000
> >>>> mds0_inotable
> >>>> mds0_sessionmap
> >>>>
> >>>> Note the missing objects, right from the start.  I was able to mount the
> >>>> CephFS at this point, but after unmounting it and restarting the
> >>>> MDS-cluster, it failed to come up, with the same symptoms as before.  I
> >>>> didn't place any files on CephFS at any point between newfs and failure.
> >>>> Naturally, I tried initializing it again, but now, even after more than
> >>>> 5 tries, the "mds*"-objects simply no longer show up in the
> >>>> "metadata"-pool at all.  In fact, it remains empty.  I can mount CephFS
> >>>> after the first start of the MDS-cluster after a newfs, but on restart,
> >>>> it fails because of the missing objects.  Am I doing anything wrong
> >>>> while initializing the cluster, maybe?  Is cleaning the pools and doing
> >>>> the newfs enough?  I did the same on the other cluster yesterday and it
> >>>> seems to have all objects.
> >>>>
> >>>
> >>> Thank you for your detailed information.
> >>>
> >>> The cause of the missing objects is that the MDS IDs for the old FS and
> >>> the new FS are the same (the incarnations are the same). When an OSD
> >>> receives MDS requests for the newly created FS, it silently drops them,
> >>> because it thinks they are duplicates.  You can get around the bug by
> >>> creating new pools for the newfs.
> >>
> >> Thanks for this very useful info, I think this solves the mystery!
> >> Could I get around it any other way?  I'd rather not have to re-create
> >> the pools and switch to new pool-ID's every time I have to do this.
> >> Does the OSD store this info in its meta-data, or might restarting the
> >> OSDs be enough?  I'm quite sure that I re-created MDS-clusters on the
> >> same pools many times, without all the objects going missing.  This was
> >> usually as part of tests, where I also restarted other
> >> cluster-components, like OSDs.  This could explain why only some files
> >> went missing.  If some OSDs were restarted and processed the requests,
> >> while others dropped the requests, it would appear as if some, but not
> >> all objects are missing.  The problem then persists until the active MDS
> >> in the MDS-cluster is restarted, after which the missing objects get
> >> noticed, because things fail to restart.  IMHO, this is a bug.  Why
> >
> > Yes, it's a bug. Fixing it should be easy.
> >
> >> would the OSD ignore these requests, if the objects the MDS tries to
> >> write don't even exist at that time?
> >>
> >
> > The OSD uses information in the PG log to check for duplicate requests,
> > so restarting the OSD does not help. Another way to get around the bug
> > is to generate lots of writes to the data/metadata pools, making sure
> > each PG trims the old entries from its log.
> >
> > Regards
> > Yan, Zheng
> 
> This definitely explains the symptoms seen here on a
> not-very-busy/long-lived cluster; I wish I had the notes to figure out
> if it could have caused the problem for other users as well. I'm not
> sure of the best way to work around the problem in the code, though. We
> could add an "fs generation" number to every object or every mds
> incarnation, but that seems a bit icky. Did you have other ideas,
> Zheng?

Another symptom I've experienced with CephFS, which might easily be
explained by this bug, is that whole directories went missing from
CephFS over time.  You might only notice the missing directory after a
while and by then, it might no longer be in any backups you took, so
this one is even more sneaky.
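
To check that I understand the mechanism Zheng describes, here is a toy
model in Python.  It is only a sketch of the idea, not Ceph code: the
class ToyPG, the helper mds_reqid, the shape of the request ID and the
log size of 4 are all assumptions made up for illustration.  The PG
remembers recent request IDs in a bounded log and drops any write whose
ID is still in that log, even if the object itself was removed in the
meantime.

```python
from collections import deque


class ToyPG:
    """Toy placement group: drops writes whose request ID is still in its log."""

    def __init__(self, max_log_entries):
        # Bounded log of recent request IDs; the oldest entry is trimmed first.
        self.log = deque(maxlen=max_log_entries)
        self.objects = {}

    def write(self, reqid, name, data):
        if reqid in self.log:
            # Treated as a resend of an already applied request: nothing is stored.
            # The boolean return value exists only so the demo can observe this.
            return False
        self.log.append(reqid)
        self.objects[name] = data
        return True

    def cleanup(self):
        # Stand-in for cleaning the pool: objects are removed, the log is kept.
        self.objects.clear()


def mds_reqid(incarnation, tid):
    # Hypothetical request ID: (sender, incarnation, per-sender sequence number).
    return ("mds.0", incarnation, tid)


def run_demo():
    pg = ToyPG(max_log_entries=4)

    # Old FS: the MDS writes its table, then the pool is cleaned.
    pg.write(mds_reqid(1, 1), "mds0_inotable", b"old fs")
    pg.cleanup()

    # After newfs the MDS reuses the same incarnation and sequence number,
    # so the write looks like a duplicate and the object never appears.
    dropped = not pg.write(mds_reqid(1, 1), "mds0_inotable", b"new fs")

    # Workaround from the thread: enough other writes to trim the old entry.
    for tid in range(4):
        pg.write(("client.filler", 1, tid), "filler_%d" % tid, b"x")
    pg.cleanup()
    applied = pg.write(mds_reqid(1, 1), "mds0_inotable", b"new fs")

    return dropped, applied
```

In this model, run_demo() reports the first rewrite as dropped and the
one after the filler writes as applied, which matches both the symptom
and the "generate lots of writes" workaround.  A fresh ToyPG, like a
fresh pool, would have an empty log and accept the write right away.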

As many others have said here, CephFS desperately needs a
scrubbing-mechanism/fsck-tool, which at least compares the metadata and
the data actually in the data-stores in both directions: checking that
everything referenced in the metadata exists in the data and, the other
way around, that there are no orphans in the data which are not
referenced in the metadata.  Meanwhile, it should check basic consistency
of the actual data-structures, if possible.
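
The two-way comparison I have in mind boils down to something like the
sketch below.  It assumes the tool has already collected two plain lists
of object names, one of everything the metadata references and one of
everything actually present in the data pool.  How to collect those
lists is the hard part and is not shown; cross_check and the object
names are made up for illustration.

```python
def cross_check(metadata_refs, data_objects):
    """Compare what the metadata references with what the data store holds."""
    referenced = set(metadata_refs)
    present = set(data_objects)
    return {
        # Referenced by metadata but absent from the data store.
        "missing": sorted(referenced - present),
        # Present in the data store but referenced by nothing.
        "orphans": sorted(present - referenced),
    }


# Made-up object names, purely for illustration.
report = cross_check(
    metadata_refs=["obj-a", "obj-b", "obj-c"],
    data_objects=["obj-b", "obj-c", "obj-d"],
)
print(report)
```

Here "obj-a" ends up under "missing" and "obj-d" under "orphans".  A real
tool would also have to cope with objects changing while the scan runs,
which this sketch ignores.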


   Regards,

      Oliver

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com