I created a test user named 'ice' and then used it to create a bucket named
ice.  The bucket ice can be found in the second datacenter, but the user
cannot.  `mdlog list` showed an entry for the bucket ice, but not for the
user.  I performed the same test in the internal realm, and there
`mdlog list` showed entries for both the user and the bucket.
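
For reference, the test looked roughly like this (a sketch; the bucket was
created with the new user's S3 keys through a normal S3 client, and the grep
is just to pick out the relevant mdlog entries):

    # on the master zone of the public realm
    radosgw-admin user create --uid=ice --display-name="ice" \
        --rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1

    # after creating the bucket 'ice' with that user's keys, check what the
    # metadata log recorded
    radosgw-admin mdlog list --rgw-realm=public --rgw-zonegroup=public-zg \
        --rgw-zone=public-dc1 | grep -i ice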



On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com>
wrote:

> On Thu, Sep 7, 2017 at 10:04 PM, David Turner <drakonst...@gmail.com>
> wrote:
> > One realm is called public, with a zonegroup called public-zg and a zone
> > for each datacenter.  The second realm is called internal, with a zonegroup
> > called internal-zg and a zone for each datacenter.  They each have their
> > own RGWs and load balancers.  The needs of our public-facing RGWs and load
> > balancers vs the internal-use ones were different enough that we split them
> > up completely.  We also have a local realm that does not use multisite, and
> > a 4th realm called QA that mimics the public realm as closely as possible
> > for staging configuration changes for the rgw daemons.  All 4 realms have
> > their own buckets, users, etc., and that is all working fine.  For all of
> > the radosgw-admin commands I am using the proper identifiers to make sure
> > that each datacenter and realm are running commands on exactly what I
> > expect them to (--rgw-realm=public --rgw-zonegroup=public-zg
> > --rgw-zone=public-dc1 --source-zone=public-dc2).
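> >
> > For example, a scoped invocation and a quick check that it is hitting the
> > zone I expect look like this (just a sketch of the pattern):
> >
> >     radosgw-admin zone get --rgw-realm=public --rgw-zonegroup=public-zg \
> >         --rgw-zone=public-dc1 | grep -E '"id"|"name"'
> >     radosgw-admin sync status --rgw-realm=public --rgw-zone=public-dc1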
> >
> > The data sync issue was in the internal realm, but running a
> > `data sync init` and kickstarting the rgw daemons in each datacenter fixed
> > the data discrepancies (I'm thinking it had something to do with a power
> > failure a few months back that I just noticed recently).  The metadata sync
> > issue is in the public realm.  I have no idea what is causing this to not
> > sync properly, since running a `metadata sync init` catches it back up to
> > the primary zone, but then it doesn't receive any new users created after
> > that.
> >
>
> Sounds like an issue with the metadata log in the primary master zone.
> Not sure what could go wrong there, but maybe the master zone doesn't
> know that it is a master zone, or it's set to not log metadata. Or
> maybe there's a problem when the secondary is trying to fetch the
> metadata log. Maybe some kind of shard-count mismatch (though that's
> not likely).
> Try to see whether the master logs any changes; you can use the
> 'radosgw-admin mdlog list' command.
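>
> For example (a rough sketch; assuming Jewel-style JSON output, the fields to
> check are master_zone and each zone's log_meta flag, and the zone below
> should be whichever zone is the metadata master in your setup):
>
>     radosgw-admin zonegroup get --rgw-realm=public | grep -E 'master_zone|log_meta'
>     radosgw-admin mdlog list --rgw-realm=public --rgw-zone=public-dc1 | head -40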
>
> Yehuda
>
> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com>
> > wrote:
> >>
> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner <drakonst...@gmail.com>
> >> wrote:
> >> > Ok, I've been testing, investigating, researching, etc. for the last week
> >> > and I don't have any problems with data syncing.  The clients on one side
> >> > are creating multipart objects while the multisite sync is creating them
> >> > as whole objects, and one of the datacenters is slower at cleaning up the
> >> > shadow files.  That accounts for the big discrepancy in object counts
> >> > between the datacenters' pools.  I created a tool that goes through each
> >> > bucket in a realm, does a recursive listing of all objects in it in both
> >> > datacenters, and compares the 2 lists for any differences.  The data is
> >> > definitely in sync between the 2 datacenters, down to the modified time
> >> > and byte count of each file in s3.
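> >> >
> >> > For reference, the comparison tool boils down to a loop like this
> >> > (simplified sketch; it assumes jq is installed, <realm> is a placeholder,
> >> > and each datacenter's cluster is reached through its own ceph.conf, whose
> >> > paths here are made up):
> >> >
> >> >     for b in $(radosgw-admin bucket list --rgw-realm=<realm> | jq -r '.[]'); do
> >> >         # list every object in the bucket on each side, then compare
> >> >         radosgw-admin bucket list --bucket="$b" --rgw-realm=<realm> \
> >> >             -c /etc/ceph/dc1.conf | jq -r '.[].name' | sort > /tmp/dc1.list
> >> >         radosgw-admin bucket list --bucket="$b" --rgw-realm=<realm> \
> >> >             -c /etc/ceph/dc2.conf | jq -r '.[].name' | sort > /tmp/dc2.list
> >> >         diff -q /tmp/dc1.list /tmp/dc2.list >/dev/null || echo "mismatch: $b"
> >> >     done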
> >> >
> >> > The metadata is still not syncing for the other realm, though.  If I run
> >> > `metadata sync init`, the second datacenter catches up with all of the new
> >> > users, but until I do that, newly created users on the primary side don't
> >> > exist on the secondary side.  `metadata sync status`, `sync status`,
> >> > `metadata sync run` (only left running for 30 minutes before I ctrl+c'd
> >> > it), etc. don't show any problems... but the new users just don't exist on
> >> > the secondary side until I run `metadata sync init`.  I created a new
> >> > bucket with the new user and the bucket shows up in the second datacenter,
> >> > but with no objects, because the objects don't have a valid owner.
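> >> >
> >> > (A quick way to confirm the user really is absent from the secondary's
> >> > local metadata, rather than just not visible through S3, is something
> >> > like the following; <realm>, <uid>, and <secondary-zone> are
> >> > placeholders:)
> >> >
> >> >     radosgw-admin metadata list user --rgw-realm=<realm> --rgw-zone=<secondary-zone>
> >> >     radosgw-admin metadata get user:<uid> --rgw-realm=<realm> --rgw-zone=<secondary-zone>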
> >> >
> >> > Thank you all for the help with the data sync issue.  You pushed me in
> >> > good directions.  Does anyone have any insight as to what is preventing
> >> > the metadata from syncing in the other realm?  I have 2 realms being
> >> > synced using multisite and it's only 1 of them that isn't getting the
> >> > metadata across.  As far as I can tell it is configured identically.
> >>
> >> What do you mean you have two realms? Zones and zonegroups need to
> >> exist in the same realm in order for meta and data sync to happen
> >> correctly. Maybe I'm misunderstanding.
> >>
> >> Yehuda
> >>
> >> >
> >> > On Thu, Aug 31, 2017 at 12:46 PM David Turner <drakonst...@gmail.com>
> >> > wrote:
> >> >>
> >> >> All of the messages from sync error list are listed below.  The number on
> >> >> the left is how many times the error message is found.
> >> >>
> >> >>    1811   "message": "failed to sync bucket instance: (16) Device or resource busy"
> >> >>       7   "message": "failed to sync bucket instance: (5) Input\/output error"
> >> >>      65   "message": "failed to sync object"
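> >> >>
> >> >> (Tallied with something along these lines; a sketch that just counts the
> >> >> message strings in the JSON output:)
> >> >>
> >> >>     radosgw-admin sync error list | grep '"message"' | sort | uniq -c | sort -rn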
> >> >>
> >> >> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman <owass...@redhat.com>
> >> >> wrote:
> >> >>>
> >> >>>
> >> >>> Hi David,
> >> >>>
> >> >>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner <drakonst...@gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> The vast majority of the sync error list is "failed to sync bucket
> >> >>>> instance: (16) Device or resource busy".  I can't find anything on Google
> >> >>>> about this error message in relation to Ceph.  Does anyone have any idea
> >> >>>> what this means and/or how to fix it?
> >> >>>
> >> >>>
> >> >>> Those are transient errors resulting from several radosgw daemons trying
> >> >>> to acquire the same sync log shard lease.  They don't affect the sync
> >> >>> progress.
> >> >>> Are there any other errors?
> >> >>>
> >> >>> Orit
> >> >>>>
> >> >>>>
> >> >>>> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley <cbod...@redhat.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Hi David,
> >> >>>>>
> >> >>>>> The 'data sync init' command won't touch any actual object data, no.
> >> >>>>> Resetting the data sync status will just cause a zone to restart a full
> >> >>>>> sync of the --source-zone's data changes log.  This log only lists which
> >> >>>>> buckets/shards have changes in them, which causes radosgw to consider
> >> >>>>> them for bucket sync.  So while the command may silence the warnings
> >> >>>>> about data shards being behind, it's unlikely to resolve the issue with
> >> >>>>> missing objects in those buckets.
> >> >>>>>
> >> >>>>> When data sync is behind for an extended period of time, it's usually
> >> >>>>> because it's stuck retrying previous bucket sync failures.  The 'sync
> >> >>>>> error list' may help narrow down where those failures are.
> >> >>>>>
> >> >>>>> There is also a 'bucket sync init' command to clear the bucket sync
> >> >>>>> status.  Following that with a 'bucket sync run' should restart a full
> >> >>>>> sync on the bucket, pulling in any new objects that are present on the
> >> >>>>> source zone.  I'm afraid that those commands haven't seen a lot of
> >> >>>>> polish or testing, however.
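> >> >>>>>
> >> >>>>> Roughly something like this (a sketch; the bucket and source zone are
> >> >>>>> placeholders):
> >> >>>>>
> >> >>>>>     radosgw-admin bucket sync init --bucket=<bucket> --source-zone=<source-zone>
> >> >>>>>     radosgw-admin bucket sync run --bucket=<bucket> --source-zone=<source-zone>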
> >> >>>>>
> >> >>>>> Casey
> >> >>>>>
> >> >>>>>
> >> >>>>> On 08/24/2017 04:15 PM, David Turner wrote:
> >> >>>>>
> >> >>>>> Apparently the data shards that are behind go in both directions, but
> >> >>>>> only one zone is aware of the problem.  Each cluster has objects in its
> >> >>>>> data pool that the other doesn't have.  I'm thinking about initiating a
> >> >>>>> `data sync init` on both sides (one at a time) to get them back on the
> >> >>>>> same page.  Does anyone know if that command will overwrite any local
> >> >>>>> data that the zone has that the other doesn't if you run
> >> >>>>> `data sync init` on it?
> >> >>>>>
> >> >>>>> On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonst...@gmail.com>
> >> >>>>> wrote:
> >> >>>>>>
> >> >>>>>> After restarting the 2 RGW daemons on the second site again, everything
> >> >>>>>> caught up on the metadata sync.  Is there something about having 2 RGW
> >> >>>>>> daemons on each side of the multisite that might be causing an issue
> >> >>>>>> with the sync getting stale?  I have another realm set up the same way
> >> >>>>>> that is having a hard time with its data shards being behind.  I haven't
> >> >>>>>> told them to resync, but yesterday I noticed 90 shards were behind.
> >> >>>>>> It's caught back up to only 17 shards behind, but the oldest change not
> >> >>>>>> applied is 2 months old, and no order of restarting RGW daemons is
> >> >>>>>> helping to resolve this.
> >> >>>>>>
> >> >>>>>> On Thu, Aug 24, 2017 at 10:59 AM David Turner
> >> >>>>>> <drakonst...@gmail.com>
> >> >>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>> I have an RGW multisite 10.2.7 setup for bi-directional syncing.
> >> >>>>>>> This has been operational for 5 months and working fine.  I recently
> >> >>>>>>> created a new user on the master zone, used that user to create a
> >> >>>>>>> bucket, and put a public-acl object in there.  The bucket was created
> >> >>>>>>> on the second site, but the user was not, and the object errors out
> >> >>>>>>> complaining that the access_key does not exist.
> >> >>>>>>>
> >> >>>>>>> That led me to think that the metadata isn't syncing, while bucket
> >> >>>>>>> and data both are.  I've also confirmed that data is syncing for other
> >> >>>>>>> buckets as well in both directions.  The sync status from the second
> >> >>>>>>> site was this:
> >> >>>>>>>
> >> >>>>>>>   metadata sync syncing
> >> >>>>>>>                 full sync: 0/64 shards
> >> >>>>>>>                 incremental sync: 64/64 shards
> >> >>>>>>>                 metadata is caught up with master
> >> >>>>>>>       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
> >> >>>>>>>                         syncing
> >> >>>>>>>                         full sync: 0/128 shards
> >> >>>>>>>                         incremental sync: 128/128 shards
> >> >>>>>>>                         data is caught up with source
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Sync status leads me to think that the second site believes it is up
> >> >>>>>>> to date, even though it is missing a freshly created user.  I restarted
> >> >>>>>>> all of the rgw daemons for the zonegroup, but it didn't trigger anything
> >> >>>>>>> to fix the missing user in the second site.  I did some googling and
> >> >>>>>>> found the sync init commands mentioned in a few ML posts, used
> >> >>>>>>> metadata sync init, and now have this as the sync status:
> >> >>>>>>>
> >> >>>>>>>   metadata sync preparing for full sync
> >> >>>>>>>                 full sync: 64/64 shards
> >> >>>>>>>                 full sync: 0 entries to sync
> >> >>>>>>>                 incremental sync: 0/64 shards
> >> >>>>>>>                 metadata is behind on 70 shards
> >> >>>>>>>                 oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
> >> >>>>>>>       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
> >> >>>>>>>                         syncing
> >> >>>>>>>                         full sync: 0/128 shards
> >> >>>>>>>                         incremental sync: 128/128 shards
> >> >>>>>>>                         data is caught up with source
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> It definitely triggered a fresh sync and told it to forget about what
> >> >>>>>>> it had previously applied, as the date of the oldest change not applied
> >> >>>>>>> is the day we initially set up multisite for this zone.  The problem is
> >> >>>>>>> that was over 12 hours ago and the sync status hasn't caught up on any
> >> >>>>>>> shards yet.
> >> >>>>>>>
> >> >>>>>>> Does anyone have any suggestions other than blasting the second site
> >> >>>>>>> away and setting it back up with a fresh start (the only option I can
> >> >>>>>>> think of at this point)?
> >> >>>>>>>
> >> >>>>>>> Thank you,
> >> >>>>>>> David Turner
> >> >>>>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
