I sent the output of all of the files including the logs to you. Thank you for your help so far.
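
In case it's useful to anyone else following along, gathering that output looked roughly like this (a sketch; the realm/zone/zonegroup names are from our public realm and the test uid is made up):

    radosgw-admin period get --rgw-realm=public > period.json
    radosgw-admin zone list --rgw-realm=public > zone-list.json
    radosgw-admin zonegroup list --rgw-realm=public > zonegroup-list.json

    # one 'zone get' / 'zonegroup get' per entry in the lists above
    for zone in public-dc1 public-dc2; do
        radosgw-admin zone get --rgw-realm=public --rgw-zone=$zone > zone-$zone.json
    done
    radosgw-admin zonegroup get --rgw-realm=public --rgw-zonegroup=public-zg > zonegroup-public-zg.json

    rados lspools > pools.txt

    # create a user with verbose logging so the metadata write path shows up in the log
    radosgw-admin user create --uid=sync-test --display-name="Sync Test" \
        --rgw-realm=public --debug-rgw=20 --debug-ms=1 2> user-create.log
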
On Thu, Sep 7, 2017 at 4:48 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:
> On Thu, Sep 7, 2017 at 11:37 PM, David Turner <drakonst...@gmail.com> wrote:
> > I'm pretty sure I'm using the cluster admin user/keyring. Is there any output that would be helpful? Period, zonegroup get, etc?
>
> - radosgw-admin period get
> - radosgw-admin zone list
> - radosgw-admin zonegroup list
>
> For each zone and zonegroup in the result:
> - radosgw-admin zone get --rgw-zone=<zone>
> - radosgw-admin zonegroup get --rgw-zonegroup=<zonegroup>
>
> - rados lspools
>
> Also, create a user with --debug-rgw=20 --debug-ms=1; we need to look at the log.
>
> Yehuda
>
> > On Thu, Sep 7, 2017 at 4:27 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:
> >> On Thu, Sep 7, 2017 at 11:02 PM, David Turner <drakonst...@gmail.com> wrote:
> >> > I created a test user named 'ice' and then used it to create a bucket named ice. The bucket ice can be found in the second dc, but not the user. `mdlog list` showed ice for the bucket, but not for the user. I performed the same test in the internal realm and it showed the user and bucket both in `mdlog list`.
> >>
> >> Maybe your radosgw-admin command is running with a ceph user that doesn't have permissions to write to the log pool? (Probably not, because you are able to run the sync init commands.)
> >> Another very slim explanation would be if you had, for some reason, overlapping zone configurations that shared some of the config but not all of it, with radosgw running against the correct one and radosgw-admin against the bad one. I don't think it's the second option.
> >>
> >> Yehuda
> >>
> >> > On Thu, Sep 7, 2017 at 3:27 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:
> >> >> On Thu, Sep 7, 2017 at 10:04 PM, David Turner <drakonst...@gmail.com> wrote:
> >> >> > One realm is called public with a zonegroup called public-zg with a zone for each datacenter. The second realm is called internal with a zonegroup called internal-zg with a zone for each datacenter. They each have their own rgw's and load balancers. The needs of our public-facing rgw's and load balancers vs the internal-use ones were different enough that we split them up completely. We also have a local realm that does not use multisite and a 4th realm called QA that mimics the public realm as much as possible for staging configuration changes for the rgw daemons. All 4 realms have their own buckets, users, etc and that is all working fine. For all of the radosgw-admin commands I am using the proper identifiers to make sure that each datacenter and realm are running commands on exactly what I expect them to (--rgw-realm=public --rgw-zonegroup=public-zg --rgw-zone=public-dc1 --source-zone=public-dc2).
> >> >> >
> >> >> > The data sync issue was in the internal realm, but running a data sync init and kickstarting the rgw daemons in each datacenter fixed the data discrepancies (I'm thinking it had something to do with a power failure a few months back that I just noticed recently).
> >> >> > The metadata sync issue is in the public realm. I have no idea what is causing this to not sync properly, since running a `metadata sync init` catches it back up to the primary zone, but then it doesn't receive any new users created after that.
> >> >>
> >> >> Sounds like an issue with the metadata log in the primary master zone. Not sure what could go wrong there, but maybe the master zone doesn't know that it is a master zone, or it's set to not log metadata. Or maybe there's a problem when the secondary is trying to fetch the metadata log. Maybe some kind of # of shards mismatch (though not likely).
> >> >> Try to see if the master logs any changes; you should use the 'radosgw-admin mdlog list' command.
> >> >>
> >> >> Yehuda
> >> >>
> >> >> > On Thu, Sep 7, 2017 at 2:52 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:
> >> >> >> On Thu, Sep 7, 2017 at 7:44 PM, David Turner <drakonst...@gmail.com> wrote:
> >> >> >> > Ok, I've been testing, investigating, researching, etc for the last week and I don't have any problems with data syncing. The clients on one side are creating multipart objects while the multisite sync is creating them as whole objects, and one of the datacenters is slower at cleaning up the shadow files. That's the big discrepancy between object counts in the pools between datacenters. I created a tool that goes through each bucket in a realm, does a recursive listing of all objects in it for both datacenters, and compares the 2 lists for any differences. The data is definitely in sync between the 2 datacenters, down to the modified time and byte of each file in s3.
> >> >> >> >
> >> >> >> > The metadata is still not syncing for the other realm, though. If I run `metadata sync init` then the second datacenter will catch up with all of the new users, but until I do that, newly created users on the primary side don't exist on the secondary side. `metadata sync status`, `sync status`, `metadata sync run` (only left running for 30 minutes before I ctrl+c it), etc don't show any problems... but the new users just don't exist on the secondary side until I run `metadata sync init`. I created a new bucket with the new user and the bucket shows up in the second datacenter, but no objects, because the objects don't have a valid owner.
> >> >> >> >
> >> >> >> > Thank you all for the help with the data sync issue. You pushed me in good directions. Does anyone have any insight as to what is preventing the metadata from syncing in the other realm?
> >> >> >> > I have 2 realms being synced using multisite and it's only 1 of them that isn't getting the metadata across. As far as I can tell it is configured identically.
> >> >> >>
> >> >> >> What do you mean you have two realms? Zones and zonegroups need to exist in the same realm in order for meta and data sync to happen correctly. Maybe I'm misunderstanding.
> >> >> >>
> >> >> >> Yehuda
> >> >> >>
> >> >> >> > On Thu, Aug 31, 2017 at 12:46 PM David Turner <drakonst...@gmail.com> wrote:
> >> >> >> >> All of the messages from sync error list are listed below. The number on the left is how many times the error message is found.
> >> >> >> >>
> >> >> >> >> 1811    "message": "failed to sync bucket instance: (16) Device or resource busy"
> >> >> >> >>    7    "message": "failed to sync bucket instance: (5) Input\/output error"
> >> >> >> >>   65    "message": "failed to sync object"
> >> >> >> >>
> >> >> >> >> On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman <owass...@redhat.com> wrote:
> >> >> >> >>> Hi David,
> >> >> >> >>>
> >> >> >> >>> On Mon, Aug 28, 2017 at 8:33 PM, David Turner <drakonst...@gmail.com> wrote:
> >> >> >> >>>> The vast majority of the sync error list is "failed to sync bucket instance: (16) Device or resource busy". I can't find anything on Google about this error message in relation to Ceph. Does anyone have any idea what this means and/or how to fix it?
> >> >> >> >>>
> >> >> >> >>> Those are intermediate errors resulting from several radosgw instances trying to acquire the same sync log shard lease. It doesn't affect the sync progress. Are there any other errors?
> >> >> >> >>>
> >> >> >> >>> Orit
> >> >> >> >>>
> >> >> >> >>>> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley <cbod...@redhat.com> wrote:
> >> >> >> >>>>> Hi David,
> >> >> >> >>>>>
> >> >> >> >>>>> The 'data sync init' command won't touch any actual object data, no. Resetting the data sync status will just cause a zone to restart a full sync of the --source-zone's data changes log. This log only lists which buckets/shards have changes in them, which causes radosgw to consider them for bucket sync. So while the command may silence the warnings about data shards being behind, it's unlikely to resolve the issue with missing objects in those buckets.
> >> >> >> >>>>>
> >> >> >> >>>>> When data sync is behind for an extended period of time, it's usually because it's stuck retrying previous bucket sync failures. The 'sync error list' may help narrow down where those failures are.
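
(The error counts I pasted earlier in the thread came from summarizing the 'sync error list' output with something roughly like this; just a sketch, and the realm name is ours:)

    radosgw-admin sync error list --rgw-realm=public \
        | grep '"message"' | sort | uniq -c | sort -rn
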
> >> >> >> >>>>> There is also a 'bucket sync init' command to clear the bucket sync status. Following that with a 'bucket sync run' should restart a full sync on the bucket, pulling in any new objects that are present on the source-zone. I'm afraid that those commands haven't seen a lot of polish or testing, however.
> >> >> >> >>>>>
> >> >> >> >>>>> Casey
> >> >> >> >>>>>
> >> >> >> >>>>> On 08/24/2017 04:15 PM, David Turner wrote:
> >> >> >> >>>>>
> >> >> >> >>>>> Apparently the data shards that are behind go in both directions, but only one zone is aware of the problem. Each cluster has objects in its data pool that the other doesn't have. I'm thinking about initiating a `data sync init` on both sides (one at a time) to get them back on the same page. Does anyone know if that command will overwrite any local data that the zone has that the other doesn't if you run `data sync init` on it?
> >> >> >> >>>>>
> >> >> >> >>>>> On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonst...@gmail.com> wrote:
> >> >> >> >>>>>> After restarting the 2 RGW daemons on the second site again, everything caught up on the metadata sync. Is there something about having 2 RGW daemons on each side of the multisite that might be causing an issue with the sync getting stale? I have another realm set up the same way that is having a hard time with its data shards being behind. I haven't told them to resync, but yesterday I noticed 90 shards were behind. It's caught back up to only 17 shards behind, but the oldest change not applied is 2 months old and no order of restarting RGW daemons is helping to resolve this.
> >> >> >> >>>>>>
> >> >> >> >>>>>> On Thu, Aug 24, 2017 at 10:59 AM David Turner <drakonst...@gmail.com> wrote:
> >> >> >> >>>>>>> I have an RGW Multisite 10.2.7 set up for bi-directional syncing. This has been operational for 5 months and working fine. I recently created a new user on the master zone, used that user to create a bucket, and put a public-acl object in there. The bucket was created on the second site, but the user was not, and the object errors out complaining about the access_key not existing.
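
(A quick way to confirm a user really is missing on the secondary, for anyone hitting the same thing; a sketch, and the uid is just an example:)

    # run against the secondary zone's cluster
    radosgw-admin metadata list user --rgw-realm=public | grep test-user
    radosgw-admin user info --uid=test-user --rgw-realm=public
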
> >> >> >> >>>>>>> That led me to think that the metadata isn't syncing, while bucket and data both are. I've also confirmed that data is syncing for other buckets as well in both directions. The sync status from the second site was this.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>   metadata sync syncing
> >> >> >> >>>>>>>                 full sync: 0/64 shards
> >> >> >> >>>>>>>                 incremental sync: 64/64 shards
> >> >> >> >>>>>>>                 metadata is caught up with master
> >> >> >> >>>>>>>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
> >> >> >> >>>>>>>                 syncing
> >> >> >> >>>>>>>                 full sync: 0/128 shards
> >> >> >> >>>>>>>                 incremental sync: 128/128 shards
> >> >> >> >>>>>>>                 data is caught up with source
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Sync status leads me to think that the second site believes it is up to date, even though it is missing a freshly created user. I restarted all of the rgw daemons for the zonegroup, but it didn't trigger anything to fix the missing user in the second site. I did some googling and found the sync init commands mentioned in a few ML posts, so I used metadata sync init and now have this as the sync status.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>   metadata sync preparing for full sync
> >> >> >> >>>>>>>                 full sync: 64/64 shards
> >> >> >> >>>>>>>                 full sync: 0 entries to sync
> >> >> >> >>>>>>>                 incremental sync: 0/64 shards
> >> >> >> >>>>>>>                 metadata is behind on 70 shards
> >> >> >> >>>>>>>                 oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
> >> >> >> >>>>>>>   data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
> >> >> >> >>>>>>>                 syncing
> >> >> >> >>>>>>>                 full sync: 0/128 shards
> >> >> >> >>>>>>>                 incremental sync: 128/128 shards
> >> >> >> >>>>>>>                 data is caught up with source
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> It definitely triggered a fresh sync and told it to forget about what it had previously applied, as the date of the oldest change not applied is the day we initially set up multisite for this zone. The problem is that was over 12 hours ago and the sync status hasn't caught up on any shards yet.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Does anyone have any suggestions other than blasting the second site and setting it back up with a fresh start (the only option I can think of at this point)?
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Thank you,
> >> >> >> >>>>>>> David Turner
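
For reference, the check Yehuda suggested on the master plus the resync cycle I keep falling back to on the secondary look roughly like this (a sketch; the realm name and uid are just examples, and 'metadata sync init' kicks off a full metadata resync):

    # on the master zone: is the new user recorded in the metadata log at all?
    radosgw-admin mdlog list --rgw-realm=public | grep test-user

    # on the secondary zone: reset metadata sync, restart its rgw daemons, then watch it
    radosgw-admin metadata sync init --rgw-realm=public
    radosgw-admin metadata sync status --rgw-realm=public
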
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com