I'm pretty sure I'm using the cluster admin user/keyring. Is there any output that would be helpful? `period get`, `zonegroup get`, etc.?
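In case it's useful for ruling out the overlapping-config theory, here's a rough sketch of how I'd diff the `period get` / `zonegroup get` JSON dumps from the two sites. The dumps and field values below are made up for illustration; the real output has many more fields.

```python
import json

def diff_config(a, b, path=""):
    """Recursively walk two parsed JSON dumps and yield a line for
    every leaf path whose value differs between the two sites."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            yield from diff_config(a.get(key), b.get(key), path + "/" + key)
    elif a != b:
        yield "%s: %r != %r" % (path, a, b)

# Hypothetical dumps, e.g. saved on each site with something like:
#   radosgw-admin zonegroup get --rgw-zonegroup=public-zg > zonegroup-dcN.json
dc1 = json.loads('{"name": "public-zg", "master_zone": "public-dc1", "log_meta": "true"}')
dc2 = json.loads('{"name": "public-zg", "master_zone": "public-dc1", "log_meta": "false"}')

print(list(diff_config(dc1, dc2)))  # -> ["/log_meta: 'true' != 'false'"]
```

Anything this prints (a zone ID, an endpoint, a log flag) would point at the two commands not talking to the same configuration.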
On Thu, Sep 7, 2017 at 4:27 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:
> On Thu, Sep 7, 2017 at 11:02 PM, David Turner <drakonst...@gmail.com> wrote:
> > I created a test user named 'ice' and then used it to create a bucket
> > named ice. The bucket ice can be found in the second dc, but not the
> > user. `mdlog list` showed ice for the bucket, but not for the user. I
> > performed the same test in the internal realm and it showed the user
> > and bucket both in `mdlog list`.
>
> Maybe your radosgw-admin command is running with a ceph user that
> doesn't have permissions to write to the log pool? (Probably not,
> because you are able to run the sync init commands.)
> Another very slim explanation would be if you had, for some reason, an
> overlapping zones configuration that shared some of the config but not
> all of it, with radosgw running against the correct one and
> radosgw-admin against the bad one. I don't think it's the second
> option.
>
> Yehuda
>
> > On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com>
> > wrote:
> >>
> >> On Thu, Sep 7, 2017 at 10:04 PM, David Turner <drakonst...@gmail.com>
> >> wrote:
> >> > One realm is called public with a zonegroup called public-zg with a
> >> > zone for each datacenter. The second realm is called internal with
> >> > a zonegroup called internal-zg with a zone for each datacenter.
> >> > They each have their own rgw's and load balancers. The needs of our
> >> > public-facing rgw's and load balancers vs our internal-use ones
> >> > were different enough that we split them up completely. We also
> >> > have a local realm that does not use multisite, and a 4th realm
> >> > called QA that mimics the public realm as much as possible for
> >> > staging configuration changes for the rgw daemons. All 4 realms
> >> > have their own buckets, users, etc., and that is all working fine.
> >> > For all of the radosgw-admin commands I am using the proper
> >> > identifiers to make sure that each datacenter and realm are running
> >> > commands on exactly what I expect them to (--rgw-realm=public
> >> > --rgw-zonegroup=public-zg --rgw-zone=public-dc1
> >> > --source-zone=public-dc2).
> >> >
> >> > The data sync issue was in the internal realm, but running a data
> >> > sync init and kickstarting the rgw daemons in each datacenter fixed
> >> > the data discrepancies (I'm thinking it had something to do with a
> >> > power failure a few months back that I only noticed recently). The
> >> > metadata sync issue is in the public realm. I have no idea what is
> >> > causing it to not sync properly, since running a `metadata sync
> >> > init` catches it back up to the primary zone, but then it doesn't
> >> > receive any new users created after that.
> >>
> >> Sounds like an issue with the metadata log in the primary master
> >> zone. Not sure what could go wrong there, but maybe the master zone
> >> doesn't know that it is a master zone, or it's set to not log
> >> metadata. Or maybe there's a problem when the secondary is trying to
> >> fetch the metadata log. Maybe some kind of # of shards mismatch
> >> (though not likely).
> >> Try to see if the master logs any changes; you can use the
> >> 'radosgw-admin mdlog list' command.
> >>
> >> Yehuda
> >>
> >> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com>
> >> > wrote:
> >> >>
> >> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner <drakonst...@gmail.com>
> >> >> wrote:
> >> >> > Ok, I've been testing, investigating, researching, etc. for the
> >> >> > last week, and I don't have any problems with data syncing.
> >> >> > The clients on one side are creating multipart objects while
> >> >> > the multisite sync is creating them as whole objects, and one of
> >> >> > the datacenters is slower at cleaning up the shadow files.
> >> >> > That's the big discrepancy between object counts in the pools
> >> >> > between datacenters. I created a tool that goes through each
> >> >> > bucket in a realm, does a recursive listing of all objects in it
> >> >> > for both datacenters, and compares the 2 lists for any
> >> >> > differences. The data is definitely in sync between the 2
> >> >> > datacenters, down to the modified time and byte of each file in
> >> >> > s3.
> >> >> >
> >> >> > The metadata is still not syncing for the other realm, though.
> >> >> > If I run `metadata sync init` then the second datacenter will
> >> >> > catch up with all of the new users, but until I do that, newly
> >> >> > created users on the primary side don't exist on the secondary
> >> >> > side. `metadata sync status`, `sync status`, `metadata sync run`
> >> >> > (only left running for 30 minutes before I ctrl+c it), etc.
> >> >> > don't show any problems... but the new users just don't exist on
> >> >> > the secondary side until I run `metadata sync init`. I created a
> >> >> > new bucket with the new user and the bucket shows up in the
> >> >> > second datacenter, but with no objects, because the objects
> >> >> > don't have a valid owner.
> >> >> >
> >> >> > Thank you all for the help with the data sync issue. You pushed
> >> >> > me in good directions. Does anyone have any insight as to what
> >> >> > is preventing the metadata from syncing in the other realm? I
> >> >> > have 2 realms being synced using multisite and it's only 1 of
> >> >> > them that isn't getting the metadata across.
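For reference, the heart of the comparison tool mentioned above is just a diff of the two listings. A minimal sketch, assuming each side's recursive listing has already been reduced to an {object_key: (mtime, size)} map (the listing data shown is illustrative; collecting it via s3cmd, boto, etc. is left out):

```python
def compare_listings(dc1, dc2):
    """Report objects present on only one side, or present on both
    sides but differing in modified time or size."""
    report = []
    for key in sorted(set(dc1) | set(dc2)):
        if key not in dc2:
            report.append("%s: only in dc1" % key)
        elif key not in dc1:
            report.append("%s: only in dc2" % key)
        elif dc1[key] != dc2[key]:
            report.append("%s: mismatch %r vs %r" % (key, dc1[key], dc2[key]))
    return report

# Illustrative listings for one bucket in each datacenter
dc1 = {"a.jpg": ("2017-09-01T12:00:00Z", 1024),
       "b.jpg": ("2017-09-02T08:30:00Z", 2048)}
dc2 = {"a.jpg": ("2017-09-01T12:00:00Z", 1024)}

print(compare_listings(dc1, dc2))  # -> ['b.jpg: only in dc1']
```

An empty report for every bucket is what "the data is definitely in sync" looks like, even when raw pool object counts disagree because of multipart vs. whole-object layout and leftover shadow files.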
> >> >> > As far as I can tell it is configured identically.
> >> >>
> >> >> What do you mean you have two realms? Zones and zonegroups need
> >> >> to exist in the same realm in order for meta and data sync to
> >> >> happen correctly. Maybe I'm misunderstanding.
> >> >>
> >> >> Yehuda
> >> >>
> >> >> > On Thu, Aug 31, 2017 at 12:46 PM David Turner <drakonst...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> All of the messages from sync error list are listed below. The
> >> >> >> number on the left is how many times the error message is
> >> >> >> found.
> >> >> >>
> >> >> >>   1811  "message": "failed to sync bucket instance: (16) Device or resource busy"
> >> >> >>      7  "message": "failed to sync bucket instance: (5) Input\/output error"
> >> >> >>     65  "message": "failed to sync object"
> >> >> >>
> >> >> >> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman <owass...@redhat.com>
> >> >> >> wrote:
> >> >> >>>
> >> >> >>> Hi David,
> >> >> >>>
> >> >> >>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner <drakonst...@gmail.com>
> >> >> >>> wrote:
> >> >> >>>>
> >> >> >>>> The vast majority of the sync error list is "failed to sync
> >> >> >>>> bucket instance: (16) Device or resource busy". I can't find
> >> >> >>>> anything on Google about this error message in relation to
> >> >> >>>> Ceph. Does anyone have any idea what this means and/or how
> >> >> >>>> to fix it?
> >> >> >>>
> >> >> >>> Those are intermediate errors resulting from several radosgw
> >> >> >>> processes trying to acquire the same sync log shard lease.
> >> >> >>> They don't affect the sync progress.
> >> >> >>> Are there any other errors?
> >> >> >>>
> >> >> >>> Orit
> >> >> >>>>
> >> >> >>>> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley <cbod...@redhat.com>
> >> >> >>>> wrote:
> >> >> >>>>>
> >> >> >>>>> Hi David,
> >> >> >>>>>
> >> >> >>>>> The 'data sync init' command won't touch any actual object
> >> >> >>>>> data, no. Resetting the data sync status will just cause a
> >> >> >>>>> zone to restart a full sync of the --source-zone's data
> >> >> >>>>> changes log. This log only lists which buckets/shards have
> >> >> >>>>> changes in them, which causes radosgw to consider them for
> >> >> >>>>> bucket sync. So while the command may silence the warnings
> >> >> >>>>> about data shards being behind, it's unlikely to resolve
> >> >> >>>>> the issue with missing objects in those buckets.
> >> >> >>>>>
> >> >> >>>>> When data sync is behind for an extended period of time,
> >> >> >>>>> it's usually because it's stuck retrying previous bucket
> >> >> >>>>> sync failures. The 'sync error list' may help narrow down
> >> >> >>>>> where those failures are.
> >> >> >>>>>
> >> >> >>>>> There is also a 'bucket sync init' command to clear the
> >> >> >>>>> bucket sync status. Following that with a 'bucket sync run'
> >> >> >>>>> should restart a full sync on the bucket, pulling in any
> >> >> >>>>> new objects that are present on the source zone. I'm afraid
> >> >> >>>>> that those commands haven't seen a lot of polish or
> >> >> >>>>> testing, however.
> >> >> >>>>>
> >> >> >>>>> Casey
> >> >> >>>>>
> >> >> >>>>> On 08/24/2017 04:15 PM, David Turner wrote:
> >> >> >>>>>
> >> >> >>>>> Apparently the data shards that are behind go in both
> >> >> >>>>> directions, but only one zone is aware of the problem.
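The per-message tallies quoted earlier in the thread can be produced from `radosgw-admin sync error list` with a short script. A rough sketch; the exact JSON nesting (a list of shards, each carrying an "entries" list whose "info" object holds the "message" field) is an assumption based on the jewel-era output, so adjust the paths to match what your version actually emits:

```python
import json
from collections import Counter

def tally_errors(raw_json):
    """Count how many times each error message appears in the
    `radosgw-admin sync error list` JSON output."""
    counts = Counter()
    for shard in json.loads(raw_json):
        for entry in shard.get("entries", []):
            counts[entry["info"]["message"]] += 1
    return counts

# Illustrative stand-in for the real command output
sample = json.dumps([
    {"shard_id": 0, "entries": [
        {"info": {"message": "failed to sync bucket instance: (16) Device or resource busy"}},
        {"info": {"message": "failed to sync object"}},
    ]},
    {"shard_id": 1, "entries": [
        {"info": {"message": "failed to sync bucket instance: (16) Device or resource busy"}},
    ]},
])

for msg, n in tally_errors(sample).most_common():
    print(n, msg)
```

Sorting by count makes it easy to separate the harmless lease-contention noise from the handful of genuine failures worth chasing.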
> >> >> >>>>> Each cluster has objects in its data pool that the other
> >> >> >>>>> doesn't have. I'm thinking about initiating a `data sync
> >> >> >>>>> init` on both sides (one at a time) to get them back on
> >> >> >>>>> the same page. Does anyone know if that command will
> >> >> >>>>> overwrite any local data that the zone has that the other
> >> >> >>>>> doesn't if you run `data sync init` on it?
> >> >> >>>>>
> >> >> >>>>> On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonst...@gmail.com>
> >> >> >>>>> wrote:
> >> >> >>>>>>
> >> >> >>>>>> After restarting the 2 RGW daemons on the second site
> >> >> >>>>>> again, everything caught up on the metadata sync. Is there
> >> >> >>>>>> something about having 2 RGW daemons on each side of the
> >> >> >>>>>> multisite that might be causing an issue with the sync
> >> >> >>>>>> getting stale? I have another realm set up the same way
> >> >> >>>>>> that is having a hard time with its data shards being
> >> >> >>>>>> behind. I haven't told them to resync, but yesterday I
> >> >> >>>>>> noticed 90 shards were behind. It's caught back up to only
> >> >> >>>>>> 17 shards behind, but the oldest change not applied is 2
> >> >> >>>>>> months old, and no order of restarting RGW daemons is
> >> >> >>>>>> helping to resolve this.
> >> >> >>>>>>
> >> >> >>>>>> On Thu, Aug 24, 2017 at 10:59 AM David Turner <drakonst...@gmail.com>
> >> >> >>>>>> wrote:
> >> >> >>>>>>>
> >> >> >>>>>>> I have an RGW multisite 10.2.7 setup for bi-directional
> >> >> >>>>>>> syncing. This has been operational for 5 months and
> >> >> >>>>>>> working fine.
> >> >> >>>>>>> I recently created a new user on the master zone, used
> >> >> >>>>>>> that user to create a bucket, and put a public-acl object
> >> >> >>>>>>> in there. The bucket was created on the second site, but
> >> >> >>>>>>> the user was not, and the object errors out complaining
> >> >> >>>>>>> about the access_key not existing.
> >> >> >>>>>>>
> >> >> >>>>>>> That led me to think that the metadata isn't syncing,
> >> >> >>>>>>> while bucket and data both are. I've also confirmed that
> >> >> >>>>>>> data is syncing for other buckets as well in both
> >> >> >>>>>>> directions. The sync status from the second site was
> >> >> >>>>>>> this:
> >> >> >>>>>>>
> >> >> >>>>>>>   metadata sync syncing
> >> >> >>>>>>>                 full sync: 0/64 shards
> >> >> >>>>>>>                 incremental sync: 64/64 shards
> >> >> >>>>>>>                 metadata is caught up with master
> >> >> >>>>>>>       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
> >> >> >>>>>>>                         syncing
> >> >> >>>>>>>                         full sync: 0/128 shards
> >> >> >>>>>>>                         incremental sync: 128/128 shards
> >> >> >>>>>>>                         data is caught up with source
> >> >> >>>>>>>
> >> >> >>>>>>> Sync status leads me to think that the second site
> >> >> >>>>>>> believes it is up to date, even though it is missing a
> >> >> >>>>>>> freshly created user. I restarted all of the rgw daemons
> >> >> >>>>>>> for the zonegroup, but it didn't trigger anything to fix
> >> >> >>>>>>> the missing user in the second site.
> >> >> >>>>>>> I did some googling, found the sync init commands
> >> >> >>>>>>> mentioned in a few ML posts, and used metadata sync init.
> >> >> >>>>>>> I now have this as the sync status:
> >> >> >>>>>>>
> >> >> >>>>>>>   metadata sync preparing for full sync
> >> >> >>>>>>>                 full sync: 64/64 shards
> >> >> >>>>>>>                 full sync: 0 entries to sync
> >> >> >>>>>>>                 incremental sync: 0/64 shards
> >> >> >>>>>>>                 metadata is behind on 70 shards
> >> >> >>>>>>>                 oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
> >> >> >>>>>>>       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
> >> >> >>>>>>>                         syncing
> >> >> >>>>>>>                         full sync: 0/128 shards
> >> >> >>>>>>>                         incremental sync: 128/128 shards
> >> >> >>>>>>>                         data is caught up with source
> >> >> >>>>>>>
> >> >> >>>>>>> It definitely triggered a fresh sync and told it to
> >> >> >>>>>>> forget about what it had previously applied, as the date
> >> >> >>>>>>> of the oldest change not applied is the day we initially
> >> >> >>>>>>> set up multisite for this zone. The problem is that was
> >> >> >>>>>>> over 12 hours ago and the sync status hasn't caught up
> >> >> >>>>>>> on any shards yet.
> >> >> >>>>>>>
> >> >> >>>>>>> Does anyone have any suggestions other than blasting the
> >> >> >>>>>>> second site and setting it back up with a fresh start
> >> >> >>>>>>> (the only option I can think of at this point)?
> >> >> >>>>>>>
> >> >> >>>>>>> Thank you,
> >> >> >>>>>>> David Turner
> >> >> >>>>>
> >> >> >>>>> _______________________________________________
> >> >> >>>>> ceph-users mailing list
> >> >> >>>>> ceph-users@lists.ceph.com
> >> >> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com