All of the messages from the sync error list are summarized below. The number on the left is how many times each error message appears.

  1811  "message": "failed to sync bucket instance: (16) Device or resource busy"
     7  "message": "failed to sync bucket instance: (5) Input/output error"
    65  "message": "failed to sync object"
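For reference, a tally like this can be produced by piping the error log through standard shell tools (a sketch, not necessarily the exact command used here; the JSON layout of the log may differ between versions):

    # Count how many times each distinct error message appears in the sync error log
    radosgw-admin sync error list | grep '"message"' | sort | uniq -c | sort -rn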
1811 "message": "failed to sync bucket instance: (16) Device or resource busy" 7 "message": "failed to sync bucket instance: (5) Input\/output error" 65 "message": "failed to sync object" On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman <owass...@redhat.com> wrote: > > Hi David, > > On Mon, Aug 28, 2017 at 8:33 PM, David Turner <drakonst...@gmail.com> > wrote: > >> The vast majority of the sync error list is "failed to sync bucket >> instance: (16) Device or resource busy". I can't find anything on Google >> about this error message in relation to Ceph. Does anyone have any idea >> what this means? and/or how to fix it? >> > > Those are intermediate errors resulting from several radosgw trying to > acquire the same sync log shard lease. It doesn't effect the sync progress. > Are there any other errors? > > Orit > >> >> On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley <cbod...@redhat.com> wrote: >> >>> Hi David, >>> >>> The 'data sync init' command won't touch any actual object data, no. >>> Resetting the data sync status will just cause a zone to restart a full >>> sync of the --source-zone's data changes log. This log only lists which >>> buckets/shards have changes in them, which causes radosgw to consider them >>> for bucket sync. So while the command may silence the warnings about data >>> shards being behind, it's unlikely to resolve the issue with missing >>> objects in those buckets. >>> >>> When data sync is behind for an extended period of time, it's usually >>> because it's stuck retrying previous bucket sync failures. The 'sync error >>> list' may help narrow down where those failures are. >>> >>> There is also a 'bucket sync init' command to clear the bucket sync >>> status. Following that with a 'bucket sync run' should restart a full sync >>> on the bucket, pulling in any new objects that are present on the >>> source-zone. I'm afraid that those commands haven't seen a lot of polish or >>> testing, however. >>> >>> Casey >>> >>> On 08/24/2017 04:15 PM, David Turner wrote: >>> >>> Apparently the data shards that are behind go in both directions, but >>> only one zone is aware of the problem. Each cluster has objects in their >>> data pool that the other doesn't have. I'm thinking about initiating a >>> `data sync init` on both sides (one at a time) to get them back on the same >>> page. Does anyone know if that command will overwrite any local data that >>> the zone has that the other doesn't if you run `data sync init` on it? >>> >>> On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonst...@gmail.com> >>> wrote: >>> >>>> After restarting the 2 RGW daemons on the second site again, everything >>>> caught up on the metadata sync. Is there something about having 2 RGW >>>> daemons on each side of the multisite that might be causing an issue with >>>> the sync getting stale? I have another realm set up the same way that is >>>> having a hard time with its data shards being behind. I haven't told them >>>> to resync, but yesterday I noticed 90 shards were behind. It's caught back >>>> up to only 17 shards behind, but the oldest change not applied is 2 months >>>> old and no order of restarting RGW daemons is helping to resolve this. >>>> >>>> On Thu, Aug 24, 2017 at 10:59 AM David Turner <drakonst...@gmail.com> >>>> wrote: >>>> >>>>> I have a RGW Multisite 10.2.7 set up for bi-directional syncing. This >>>>> has been operational for 5 months and working fine. 
>>> On 08/24/2017 04:15 PM, David Turner wrote:
>>>
>>> Apparently the data shards that are behind go in both directions, but only one zone is aware of the problem. Each cluster has objects in its data pool that the other doesn't have. I'm thinking about initiating a `data sync init` on both sides (one at a time) to get them back on the same page. Does anyone know whether that command will overwrite any local data that the zone has and the other doesn't if you run `data sync init` on it?
>>>
>>> On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonst...@gmail.com> wrote:
>>>
>>>> After restarting the 2 RGW daemons on the second site again, everything caught up on the metadata sync. Is there something about having 2 RGW daemons on each side of the multisite that might be causing an issue with the sync getting stale? I have another realm set up the same way that is having a hard time with its data shards being behind. I haven't told them to resync, but yesterday I noticed 90 shards were behind. It's caught back up to only 17 shards behind, but the oldest change not applied is 2 months old, and no order of restarting RGW daemons is helping to resolve this.
>>>>
>>>> On Thu, Aug 24, 2017 at 10:59 AM David Turner <drakonst...@gmail.com> wrote:
>>>>
>>>>> I have an RGW multisite 10.2.7 setup for bi-directional syncing. It has been operational for 5 months and working fine. I recently created a new user on the master zone, used that user to create a bucket, and put a public-acl object in there. The bucket was created on the second site, but the user was not, and the object errors out complaining that the access_key doesn't exist.
>>>>>
>>>>> That led me to think that the metadata isn't syncing, while bucket and data both are. I've also confirmed that data is syncing for other buckets as well in both directions. The sync status from the second site was this:
>>>>>
>>>>>   metadata sync syncing
>>>>>                 full sync: 0/64 shards
>>>>>                 incremental sync: 64/64 shards
>>>>>                 metadata is caught up with master
>>>>>       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>>>>>                         syncing
>>>>>                         full sync: 0/128 shards
>>>>>                         incremental sync: 128/128 shards
>>>>>                         data is caught up with source
>>>>>
>>>>> Sync status leads me to think that the second site believes it is up to date, even though it is missing a freshly created user. I restarted all of the rgw daemons for the zonegroup, but it didn't trigger anything to fix the missing user on the second site. I did some googling, found the sync init commands mentioned in a few ML posts, used `metadata sync init`, and now have this as the sync status:
>>>>>
>>>>>   metadata sync preparing for full sync
>>>>>                 full sync: 64/64 shards
>>>>>                 full sync: 0 entries to sync
>>>>>                 incremental sync: 0/64 shards
>>>>>                 metadata is behind on 70 shards
>>>>>                 oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
>>>>>       data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
>>>>>                         syncing
>>>>>                         full sync: 0/128 shards
>>>>>                         incremental sync: 128/128 shards
>>>>>                         data is caught up with source
>>>>>
>>>>> It definitely triggered a fresh sync and told it to forget about what it had previously applied, as the date of the oldest change not applied is the day we initially set up multisite for this zone. The problem is that that was over 12 hours ago, and the sync status hasn't caught up on any shards yet.
>>>>>
>>>>> Does anyone have any suggestions other than blasting the second site and setting it back up with a fresh start (the only option I can think of at this point)?
>>>>>
>>>>> Thank you,
>>>>> David Turner
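For completeness, the metadata resync discussed above boils down to something like the following, run against the zone that is behind (a sketch; as the thread suggests, the rgw daemons may also need a restart before the new sync state takes effect):

    # Reset the metadata sync status so the zone restarts a full metadata sync
    radosgw-admin metadata sync init

    # Then watch progress until the metadata shards catch up with the master
    radosgw-admin sync status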