Re: [ceph-users] radosgw sync falling behind regularly

2019-03-11 Thread Trey Palmer
HI Casey,

We're still trying to figure this sync problem out. If you could possibly
tell us anything further, we would be deeply grateful!

Our errors are coming from 'data sync'.   In `sync status` we pretty
constantly show one shard behind, but a different one each time we run it.

Here's a paste -- these commands were run in rapid succession.

root@sv3-ceph-rgw1:~# radosgw-admin sync status
  realm b3e2afe7-2254-494a-9a34-ce50358779fd (savagebucket)
  zonegroup de6af748-1a2f-44a1-9d44-30799cf1313e (us)
   zone 331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8 (sv3-prod)
  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: 107d29a0-b732-4bf1-a26e-1f64f820e839 (dc11-prod)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
source: 1e27bf9c-3a2f-4845-85b6-33a24bbe1c04 (sv5-corp)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
root@sv3-ceph-rgw1:~# radosgw-admin sync status
  realm b3e2afe7-2254-494a-9a34-ce50358779fd (savagebucket)
  zonegroup de6af748-1a2f-44a1-9d44-30799cf1313e (us)
   zone 331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8 (sv3-prod)
  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: 107d29a0-b732-4bf1-a26e-1f64f820e839 (dc11-prod)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 1 shards
behind shards: [30]
oldest incremental change not applied: 2019-01-19
22:53:23.0.16109s
source: 1e27bf9c-3a2f-4845-85b6-33a24bbe1c04 (sv5-corp)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
root@sv3-ceph-rgw1:~#
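
For what it's worth, here is roughly how we have been watching which shard falls
behind (just a rough sketch; the grep pattern simply matches the "behind" and
"oldest incremental" lines in the status output above):

while true; do
    date
    # a lagging shard shows up as "data is behind on N shards" plus a shard list
    radosgw-admin sync status 2>/dev/null | grep -E 'behind|oldest incremental'
    sleep 60
done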


Below I'm pasting a small section of log.  Thanks so much for looking!

Trey Palmer


root@sv3-ceph-rgw1:/var/log/ceph# tail -f ceph-rgw-sv3-ceph-rgw1.log | grep
-i error
2019-03-08 11:43:07.208572 7fa080cc7700  0 data sync: ERROR: failed to read
remote data log info: ret=-2
2019-03-08 11:43:07.211348 7fa080cc7700  0 meta sync: ERROR:
RGWBackoffControlCR called coroutine returned -2
2019-03-08 11:43:07.267117 7fa080cc7700  0 data sync: ERROR: failed to read
remote data log info: ret=-2
2019-03-08 11:43:07.269631 7fa080cc7700  0 meta sync: ERROR:
RGWBackoffControlCR called coroutine returned -2
2019-03-08 11:43:07.895192 7fa080cc7700  0 data sync: ERROR: init sync on
dmv/dmv:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.134 failed, retcode=-2
2019-03-08 11:43:08.046685 7fa080cc7700  0 data sync: ERROR: init sync on
dmv/dmv:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.134 failed, retcode=-2
2019-03-08 11:43:08.171277 7fa0870eb700  0 ERROR: failed to get bucket
instance info for
.bucket.meta.phowe_superset:phowe_superset:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.233
2019-03-08 11:43:08.171748 7fa0850e7700  0 ERROR: failed to get bucket
instance info for
.bucket.meta.gdfp_dev:gdfp_dev:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.158
2019-03-08 11:43:08.175867 7fa08a0f1700  0 meta sync: ERROR: can't remove
key:
bucket.instance:phowe_superset/phowe_superset:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.233
ret=-2
2019-03-08 11:43:08.176755 7fa0820e1700  0 data sync: ERROR: init sync on
whoiswho/whoiswho:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.293 failed,
retcode=-2
2019-03-08 11:43:08.176872 7fa0820e1700  0 data sync: ERROR: init sync on
dmv/dmv:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.134 failed, retcode=-2
2019-03-08 11:43:08.176885 7fa093103700  0 ERROR: failed to get bucket
instance info for
.bucket.meta.phowe_superset:phowe_superset:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.233
2019-03-08 11:43:08.176925 7fa0820e1700  0 data sync: ERROR: failed to
retrieve bucket info for
bucket=phowe_superset/phowe_superset:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.233
2019-03-08 11:43:08.177916 7fa0910ff700  0 meta sync: ERROR: can't remove
key:
bucket.instance:gdfp_dev/gdfp_dev:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.158
ret=-2
2019-03-08 11:43:08.178815 7fa08b0f3700  0 ERROR: failed to get bucket
instance info for
.bucket.meta.gdfp_dev:gdfp_dev:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.158
2019-03-08 11:43:08.178847 7fa0820e1700  0 data sync: ERROR: failed to
retrieve bucket info for
bucket=gdfp_dev/gdfp_dev:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.158
2019-03-08 11:43:08.179492 7fa0820e1700  0 data sync: 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-08 Thread Casey Bodley

(cc ceph-users)

Can you tell whether these sync errors are coming from metadata sync or 
data sync? Are they blocking sync from making progress according to your 
'sync status'?
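
Something like the following, run on the zone that is reporting errors, should
show each side separately (a sketch; take the source zone name from your own
'sync status' output):

radosgw-admin metadata sync status
radosgw-admin data sync status --source-zone=<source zone name>
radosgw-admin sync error list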


On 3/8/19 10:23 AM, Trey Palmer wrote:

Casey,

Having done the 'reshard stale-instances delete' earlier on the advice 
of another list member, we have tons of sync errors on deleted 
buckets, as you mention.


After 'data sync init' we're still seeing all of these errors on 
deleted buckets.


Since buckets are metadata, it occurred to me this morning that a plain 'data
sync init' wouldn't refresh that info.  But a 'metadata sync init' might get
rid of the stale bucket sync info and stop the sync errors.  Would that be the
way to go?
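
Concretely, what I have in mind is something like this on each non-master zone,
then restarting the gateways (just a sketch; the service unit name assumes the
ceph-radosgw@rgw.<hostname> pattern we use elsewhere in this thread):

radosgw-admin metadata sync init
sudo systemctl restart ceph-radosgw@rgw.$(hostname -s).service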


Thanks,

Trey



On Wed, Mar 6, 2019 at 11:47 AM Casey Bodley wrote:


Hi Trey,

I think it's more likely that these stale metadata entries are from
deleted buckets, rather than accidental bucket reshards. When a
bucket
is deleted in a multisite configuration, we don't delete its bucket
instance because other zones may still need to sync the object
deletes -
and they can't make progress on sync if the bucket metadata
disappears.
These leftover bucket instances look the same to the 'reshard
stale-instances' commands, but I'd be cautious about using that to
remove them in multisite, as it may cause more sync errors and
potentially leak storage if they still contain objects.

Regarding 'datalog trim', that alone isn't safe because it could trim
entries that hadn't been applied on other zones yet, causing them to
miss some updates. What you can do is run 'data sync init' on each
zone,
and restart gateways. This will restart with a data full sync (which
will scan all buckets for changes), and skip past any datalog entries
from before the full sync. I was concerned that the bug in error
handling (ie "ERROR: init sync on...") would also affect full
sync, but
that doesn't appear to be the case - so I do think that's worth
trying.

On 3/5/19 6:24 PM, Trey Palmer wrote:
> Casey,
>
> Thanks very much for the reply!
>
> We definitely have lots of errors on sync-disabled buckets and the
> workaround for that is obvious (most of them are empty anyway).
>
> Our second form of error is stale buckets.  We had dynamic
resharding
> enabled but have now disabled it (having discovered it was on by
> default, and not supported in multisite).
>
> We removed several hundred stale buckets via 'radosgw-admin
sharding
> stale-instances rm', but they are still giving us sync errors.
>
> I have found that these buckets do have entries in 'radosgw-admin
> datalog list', and my guess is this could be fixed by doing a
> 'radosgw-admin datalog trim' for each entry on the master zone.
>
> Does that sound right?  :-)
>
> Thanks again for the detailed explanation,
>
> Trey Palmer
>
> On Tue, Mar 5, 2019 at 5:55 PM Casey Bodley <cbod...@redhat.com> wrote:
>
>     Hi Christian,
>
>     I think you've correctly intuited that the issues are related to
>     the use
>     of 'bucket sync disable'. There was a bug fix for that
feature in
> http://tracker.ceph.com/issues/26895, and I recently found that a
>     block
>     of code was missing from its luminous backport. That missing
code is
>     what handled those "ERROR: init sync on 
failed,
>     retcode=-2" errors.
>
>     I included a fix for that in a later backport
>     (https://github.com/ceph/ceph/pull/26549), which I'm still
working to
>     get through qa. I'm afraid I can't really recommend a workaround
>     for the
>     issue in the meantime.
>
>     Looking forward though, we do plan to support something like
s3's
>     cross
>     region replication so you can enable replication on a
specific bucket
>     without having to enable it globally.
>
>     Casey
>
>
>     On 3/5/19 2:32 PM, Christian Rice wrote:
>     >
>     > Much appreciated.  We’ll continue to poke around and
certainly will
>     > disable the dynamic resharding.
>     >
>     > We started with 12.2.8 in production.  We definitely did not
>     have it
>     > enabled in ceph.conf
>     >
>     > *From: *Matthew H <matthew.he...@hotmail.com>
>     > *Date: *Tuesday, March 5, 2019 at 11:22 AM
>     > *To: *Christian Rice <cr...@pandora.com>, ceph-users <ceph-users@lists.ceph.com>


Re: [ceph-users] radosgw sync falling behind regularly

2019-03-06 Thread Trey Palmer
It appears we eventually got 'data sync init' working.

At least, it's worked on 5 of the 6 sync directions in our 3-node cluster.
The sixth has not yet run without returning an error, although 'sync status'
does say "preparing for full sync".
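
We have been checking each direction with something along these lines (a rough
sketch; the two source zones here are the ones sv3-ceph-rgw1 pulls from, per the
'sync status' output earlier in the thread, and they differ per node):

for z in sv5-corp dc11-prod; do
    echo "=== source zone: $z"
    radosgw-admin data sync status --source-zone=$z
done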

Thanks,

Trey

On Wed, Mar 6, 2019 at 1:22 PM Trey Palmer  wrote:

> Casey,
>
> This was the result of trying 'data sync init':
>
> root@c2-rgw1:~# radosgw-admin data sync init
> ERROR: source zone not specified
> root@c2-rgw1:~# radosgw-admin data sync init --source-zone= uuid>
> WARNING: cannot find source zone id for name=
> ERROR: sync.init_sync_status() returned ret=-2
> root@c2-rgw1:~# radosgw-admin data sync init --source-zone=c1-zone
> ERROR: sync.init() returned ret=-5
> 2019-03-06 10:14:59.815735 7fecb214fe40  0 data sync: ERROR: failed to
> fetch datalog info
> root@c2-rgw1:~#
>
> Do you have any further advice or info?
>
> Thanks again,
>
> Trey
>
>
> On Wed, Mar 6, 2019 at 11:47 AM Casey Bodley  wrote:
>
>> Hi Trey,
>>
>> I think it's more likely that these stale metadata entries are from
>> deleted buckets, rather than accidental bucket reshards. When a bucket
>> is deleted in a multisite configuration, we don't delete its bucket
>> instance because other zones may still need to sync the object deletes -
>> and they can't make progress on sync if the bucket metadata disappears.
>> These leftover bucket instances look the same to the 'reshard
>> stale-instances' commands, but I'd be cautious about using that to
>> remove them in multisite, as it may cause more sync errors and
>> potentially leak storage if they still contain objects.
>>
>> Regarding 'datalog trim', that alone isn't safe because it could trim
>> entries that hadn't been applied on other zones yet, causing them to
>> miss some updates. What you can do is run 'data sync init' on each zone,
>> and restart gateways. This will restart with a data full sync (which
>> will scan all buckets for changes), and skip past any datalog entries
>> from before the full sync. I was concerned that the bug in error
>> handling (ie "ERROR: init sync on...") would also affect full sync, but
>> that doesn't appear to be the case - so I do think that's worth trying.
>>
>> On 3/5/19 6:24 PM, Trey Palmer wrote:
>> > Casey,
>> >
>> > Thanks very much for the reply!
>> >
>> > We definitely have lots of errors on sync-disabled buckets and the
>> > workaround for that is obvious (most of them are empty anyway).
>> >
>> > Our second form of error is stale buckets.  We had dynamic resharding
>> > enabled but have now disabled it (having discovered it was on by
>> > default, and not supported in multisite).
>> >
>> > We removed several hundred stale buckets via 'radosgw-admin sharding
>> > stale-instances rm', but they are still giving us sync errors.
>> >
>> > I have found that these buckets do have entries in 'radosgw-admin
>> > datalog list', and my guess is this could be fixed by doing a
>> > 'radosgw-admin datalog trim' for each entry on the master zone.
>> >
>> > Does that sound right?  :-)
>> >
>> > Thanks again for the detailed explanation,
>> >
>> > Trey Palmer
>> >
>> > On Tue, Mar 5, 2019 at 5:55 PM Casey Bodley wrote:
>> >
>> > Hi Christian,
>> >
>> > I think you've correctly intuited that the issues are related to
>> > the use
>> > of 'bucket sync disable'. There was a bug fix for that feature in
>> > http://tracker.ceph.com/issues/26895, and I recently found that a
>> > block
>> > of code was missing from its luminous backport. That missing code is
>> > what handled those "ERROR: init sync on  failed,
>> > retcode=-2" errors.
>> >
>> > I included a fix for that in a later backport
>> > (https://github.com/ceph/ceph/pull/26549), which I'm still working
>> to
>> > get through qa. I'm afraid I can't really recommend a workaround
>> > for the
>> > issue in the meantime.
>> >
>> > Looking forward though, we do plan to support something like s3's
>> > cross
>> > region replication so you can enable replication on a specific
>> bucket
>> > without having to enable it globally.
>> >
>> > Casey
>> >
>> >
>> > On 3/5/19 2:32 PM, Christian Rice wrote:
>> > >
>> > > Much appreciated.  We’ll continue to poke around and certainly
>> will
>> > > disable the dynamic resharding.
>> > >
>> > > We started with 12.2.8 in production.  We definitely did not
>> > have it
>> > > enabled in ceph.conf
>> > >
>> > > *From: *Matthew H <matthew.he...@hotmail.com>
>> > > *Date: *Tuesday, March 5, 2019 at 11:22 AM
>> > > *To: *Christian Rice <cr...@pandora.com>, ceph-users <ceph-users@lists.ceph.com>
>> > > *Cc: *Trey Palmer <nerdmagic...@gmail.com>
>> > > *Subject: *Re: radosgw sync falling behind regularly
>> > >
>> > > Hi Christian,
>> > >
>> > > To be on the safe side and future proof 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-06 Thread Trey Palmer
Casey,

This was the result of trying 'data sync init':

root@c2-rgw1:~# radosgw-admin data sync init
ERROR: source zone not specified
root@c2-rgw1:~# radosgw-admin data sync init --source-zone=
WARNING: cannot find source zone id for name=
ERROR: sync.init_sync_status() returned ret=-2
root@c2-rgw1:~# radosgw-admin data sync init --source-zone=c1-zone
ERROR: sync.init() returned ret=-5
2019-03-06 10:14:59.815735 7fecb214fe40  0 data sync: ERROR: failed to
fetch datalog info
root@c2-rgw1:~#

Do you have any further advice or info?

Thanks again,

Trey


On Wed, Mar 6, 2019 at 11:47 AM Casey Bodley  wrote:

> Hi Trey,
>
> I think it's more likely that these stale metadata entries are from
> deleted buckets, rather than accidental bucket reshards. When a bucket
> is deleted in a multisite configuration, we don't delete its bucket
> instance because other zones may still need to sync the object deletes -
> and they can't make progress on sync if the bucket metadata disappears.
> These leftover bucket instances look the same to the 'reshard
> stale-instances' commands, but I'd be cautious about using that to
> remove them in multisite, as it may cause more sync errors and
> potentially leak storage if they still contain objects.
>
> Regarding 'datalog trim', that alone isn't safe because it could trim
> entries that hadn't been applied on other zones yet, causing them to
> miss some updates. What you can do is run 'data sync init' on each zone,
> and restart gateways. This will restart with a data full sync (which
> will scan all buckets for changes), and skip past any datalog entries
> from before the full sync. I was concerned that the bug in error
> handling (ie "ERROR: init sync on...") would also affect full sync, but
> that doesn't appear to be the case - so I do think that's worth trying.
>
> On 3/5/19 6:24 PM, Trey Palmer wrote:
> > Casey,
> >
> > Thanks very much for the reply!
> >
> > We definitely have lots of errors on sync-disabled buckets and the
> > workaround for that is obvious (most of them are empty anyway).
> >
> > Our second form of error is stale buckets.  We had dynamic resharding
> > enabled but have now disabled it (having discovered it was on by
> > default, and not supported in multisite).
> >
> > We removed several hundred stale buckets via 'radosgw-admin sharding
> > stale-instances rm', but they are still giving us sync errors.
> >
> > I have found that these buckets do have entries in 'radosgw-admin
> > datalog list', and my guess is this could be fixed by doing a
> > 'radosgw-admin datalog trim' for each entry on the master zone.
> >
> > Does that sound right?  :-)
> >
> > Thanks again for the detailed explanation,
> >
> > Trey Palmer
> >
> > On Tue, Mar 5, 2019 at 5:55 PM Casey Bodley wrote:
> >
> > Hi Christian,
> >
> > I think you've correctly intuited that the issues are related to
> > the use
> > of 'bucket sync disable'. There was a bug fix for that feature in
> > http://tracker.ceph.com/issues/26895, and I recently found that a
> > block
> > of code was missing from its luminous backport. That missing code is
> > what handled those "ERROR: init sync on  failed,
> > retcode=-2" errors.
> >
> > I included a fix for that in a later backport
> > (https://github.com/ceph/ceph/pull/26549), which I'm still working
> to
> > get through qa. I'm afraid I can't really recommend a workaround
> > for the
> > issue in the meantime.
> >
> > Looking forward though, we do plan to support something like s3's
> > cross
> > region replication so you can enable replication on a specific bucket
> > without having to enable it globally.
> >
> > Casey
> >
> >
> > On 3/5/19 2:32 PM, Christian Rice wrote:
> > >
> > > Much appreciated.  We’ll continue to poke around and certainly will
> > > disable the dynamic resharding.
> > >
> > > We started with 12.2.8 in production.  We definitely did not
> > have it
> > > enabled in ceph.conf
> > >
> > > *From: *Matthew H <matthew.he...@hotmail.com>
> > > *Date: *Tuesday, March 5, 2019 at 11:22 AM
> > > *To: *Christian Rice <cr...@pandora.com>, ceph-users <ceph-users@lists.ceph.com>
> > > *Cc: *Trey Palmer <nerdmagic...@gmail.com>
> > > *Subject: *Re: radosgw sync falling behind regularly
> > >
> > > Hi Christian,
> > >
> > > To be on the safe side and future proof yourself will want to go
> > ahead
> > > and set the following in your ceph.conf file, and then issue a
> > restart
> > > to your RGW instances.
> > >
> > > rgw_dynamic_resharding = false
> > >
> > > There are a number of issues with dynamic resharding, multisite rgw
> > > problems being just one of them. However I thought it was disabled
> > > automatically when multisite rgw is used (but I will have to double
> > > 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-06 Thread Trey Palmer
Casey,

You are spot on that almost all of these are deleted buckets.   At some
point in the last few months we deleted and replaced buckets with
underscores in their names,  and those are responsible for most of these
errors.

Thanks very much for the reply and explanation.  We’ll give ‘data sync
init’ a try.

— Trey


On Wed, Mar 6, 2019 at 11:47 AM Casey Bodley  wrote:

> Hi Trey,
>
> I think it's more likely that these stale metadata entries are from
> deleted buckets, rather than accidental bucket reshards. When a bucket
> is deleted in a multisite configuration, we don't delete its bucket
> instance because other zones may still need to sync the object deletes -
> and they can't make progress on sync if the bucket metadata disappears.
> These leftover bucket instances look the same to the 'reshard
> stale-instances' commands, but I'd be cautious about using that to
> remove them in multisite, as it may cause more sync errors and
> potentially leak storage if they still contain objects.
>
> Regarding 'datalog trim', that alone isn't safe because it could trim
> entries that hadn't been applied on other zones yet, causing them to
> miss some updates. What you can do is run 'data sync init' on each zone,
> and restart gateways. This will restart with a data full sync (which
> will scan all buckets for changes), and skip past any datalog entries
> from before the full sync. I was concerned that the bug in error
> handling (ie "ERROR: init sync on...") would also affect full sync, but
> that doesn't appear to be the case - so I do think that's worth trying.
>
> On 3/5/19 6:24 PM, Trey Palmer wrote:
> > Casey,
> >
> > Thanks very much for the reply!
> >
> > We definitely have lots of errors on sync-disabled buckets and the
> > workaround for that is obvious (most of them are empty anyway).
> >
> > Our second form of error is stale buckets.  We had dynamic resharding
> > enabled but have now disabled it (having discovered it was on by
> > default, and not supported in multisite).
> >
> > We removed several hundred stale buckets via 'radosgw-admin sharding
> > stale-instances rm', but they are still giving us sync errors.
> >
> > I have found that these buckets do have entries in 'radosgw-admin
> > datalog list', and my guess is this could be fixed by doing a
> > 'radosgw-admin datalog trim' for each entry on the master zone.
> >
> > Does that sound right?  :-)
> >
> > Thanks again for the detailed explanation,
> >
> > Trey Palmer
> >
> > On Tue, Mar 5, 2019 at 5:55 PM Casey Bodley wrote:
> >
> > Hi Christian,
> >
> > I think you've correctly intuited that the issues are related to
> > the use
> > of 'bucket sync disable'. There was a bug fix for that feature in
> > http://tracker.ceph.com/issues/26895, and I recently found that a
> > block
> > of code was missing from its luminous backport. That missing code is
> > what handled those "ERROR: init sync on  failed,
> > retcode=-2" errors.
> >
> > I included a fix for that in a later backport
> > (https://github.com/ceph/ceph/pull/26549), which I'm still working
> to
> > get through qa. I'm afraid I can't really recommend a workaround
> > for the
> > issue in the meantime.
> >
> > Looking forward though, we do plan to support something like s3's
> > cross
> > region replication so you can enable replication on a specific bucket
> > without having to enable it globally.
> >
> > Casey
> >
> >
> > On 3/5/19 2:32 PM, Christian Rice wrote:
> > >
> > > Much appreciated.  We’ll continue to poke around and certainly will
> > > disable the dynamic resharding.
> > >
> > > We started with 12.2.8 in production.  We definitely did not
> > have it
> > > enabled in ceph.conf
> > >
> > > *From: *Matthew H <matthew.he...@hotmail.com>
> > > *Date: *Tuesday, March 5, 2019 at 11:22 AM
> > > *To: *Christian Rice <cr...@pandora.com>, ceph-users <ceph-users@lists.ceph.com>
> > > *Cc: *Trey Palmer <nerdmagic...@gmail.com>
> > > *Subject: *Re: radosgw sync falling behind regularly
> > >
> > > Hi Christian,
> > >
> > > To be on the safe side and future proof yourself will want to go
> > ahead
> > > and set the following in your ceph.conf file, and then issue a
> > restart
> > > to your RGW instances.
> > >
> > > rgw_dynamic_resharding = false
> > >
> > > There are a number of issues with dynamic resharding, multisite rgw
> > > problems being just one of them. However I thought it was disabled
> > > automatically when multisite rgw is used (but I will have to double
> > > check the code on that). What version of Ceph did you initially
> > > install the cluster with? Prior to v12.2.2 this feature was
> > enabled by
> > > default for all rgw use cases.
> > >
> > > Thanks,
> > >
> 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-06 Thread Casey Bodley

Hi Trey,

I think it's more likely that these stale metadata entries are from 
deleted buckets, rather than accidental bucket reshards. When a bucket 
is deleted in a multisite configuration, we don't delete its bucket 
instance because other zones may still need to sync the object deletes - 
and they can't make progress on sync if the bucket metadata disappears. 
These leftover bucket instances look the same to the 'reshard 
stale-instances' commands, but I'd be cautious about using that to 
remove them in multisite, as it may cause more sync errors and 
potentially leak storage if they still contain objects.


Regarding 'datalog trim', that alone isn't safe because it could trim 
entries that hadn't been applied on other zones yet, causing them to 
miss some updates. What you can do is run 'data sync init' on each zone, 
and restart gateways. This will restart with a data full sync (which 
will scan all buckets for changes), and skip past any datalog entries 
from before the full sync. I was concerned that the bug in error 
handling (ie "ERROR: init sync on...") would also affect full sync, but 
that doesn't appear to be the case - so I do think that's worth trying.
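
Roughly, on each zone's gateway host (a sketch only; substitute the peer zone
names from your own 'sync status' output, and adjust the unit name if your rgw
services are named differently):

radosgw-admin data sync init --source-zone=<peer zone 1>
radosgw-admin data sync init --source-zone=<peer zone 2>
sudo systemctl restart ceph-radosgw@rgw.<hostname>.service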


On 3/5/19 6:24 PM, Trey Palmer wrote:

Casey,

Thanks very much for the reply!

We definitely have lots of errors on sync-disabled buckets and the 
workaround for that is obvious (most of them are empty anyway).


Our second form of error is stale buckets.  We had dynamic resharding 
enabled but have now disabled it (having discovered it was on by 
default, and not supported in multisite).


We removed several hundred stale buckets via 'radosgw-admin sharding 
stale-instances rm', but they are still giving us sync errors.


I have found that these buckets do have entries in 'radosgw-admin 
datalog list', and my guess is this could be fixed by doing a 
'radosgw-admin datalog trim' for each entry on the master zone.


Does that sound right?  :-)

Thanks again for the detailed explanation,

Trey Palmer

On Tue, Mar 5, 2019 at 5:55 PM Casey Bodley wrote:


Hi Christian,

I think you've correctly intuited that the issues are related to
the use
of 'bucket sync disable'. There was a bug fix for that feature in
http://tracker.ceph.com/issues/26895, and I recently found that a
block
of code was missing from its luminous backport. That missing code is
what handled those "ERROR: init sync on  failed,
retcode=-2" errors.

I included a fix for that in a later backport
(https://github.com/ceph/ceph/pull/26549), which I'm still working to
get through qa. I'm afraid I can't really recommend a workaround
for the
issue in the meantime.

Looking forward though, we do plan to support something like s3's
cross
region replication so you can enable replication on a specific bucket
without having to enable it globally.

Casey


On 3/5/19 2:32 PM, Christian Rice wrote:
>
> Much appreciated.  We’ll continue to poke around and certainly will
> disable the dynamic resharding.
>
> We started with 12.2.8 in production.  We definitely did not
have it
> enabled in ceph.conf
>
> *From: *Matthew H <matthew.he...@hotmail.com>
> *Date: *Tuesday, March 5, 2019 at 11:22 AM
> *To: *Christian Rice <cr...@pandora.com>, ceph-users <ceph-users@lists.ceph.com>
> *Cc: *Trey Palmer <nerdmagic...@gmail.com>
> *Subject: *Re: radosgw sync falling behind regularly
>
> Hi Christian,
>
> To be on the safe side and future proof yourself will want to go
ahead
> and set the following in your ceph.conf file, and then issue a
restart
> to your RGW instances.
>
> rgw_dynamic_resharding = false
>
> There are a number of issues with dynamic resharding, multisite rgw
> problems being just one of them. However I thought it was disabled
> automatically when multisite rgw is used (but I will have to double
> check the code on that). What version of Ceph did you initially
> install the cluster with? Prior to v12.2.2 this feature was
enabled by
> default for all rgw use cases.
>
> Thanks,
>
>

>
> *From:*Christian Rice mailto:cr...@pandora.com>>
> *Sent:* Tuesday, March 5, 2019 2:07 PM
> *To:* Matthew H; ceph-users
> *Subject:* Re: radosgw sync falling behind regularly
>
> Matthew, first of all, let me say we very much appreciate your help!
>
> So I don’t think we turned dynamic resharding on, nor did we
manually
> reshard buckets. Seems like it defaults to on for luminous but the
> mimic docs say it’s not supported in multisite.  So do we need to
> disable it manually via tell and ceph.conf?
>
> Also, after running the command you suggested, all the stale
instances
> are gone…these 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-05 Thread Trey Palmer
Casey,

Thanks very much for the reply!

We definitely have lots of errors on sync-disabled buckets and the
workaround for that is obvious (most of them are empty anyway).

Our second form of error is stale buckets.  We had dynamic resharding
enabled but have now disabled it (having discovered it was on by default,
and not supported in multisite).

We removed several hundred stale buckets via 'radosgw-admin sharding
stale-instances rm', but they are still giving us sync errors.

I have found that these buckets do have entries in 'radosgw-admin datalog
list', and my guess is this could be fixed by doing a 'radosgw-admin
datalog trim' for each entry on the master zone.

Does that sound right?  :-)

Thanks again for the detailed explanation,

Trey Palmer

On Tue, Mar 5, 2019 at 5:55 PM Casey Bodley  wrote:

> Hi Christian,
>
> I think you've correctly intuited that the issues are related to the use
> of 'bucket sync disable'. There was a bug fix for that feature in
> http://tracker.ceph.com/issues/26895, and I recently found that a block
> of code was missing from its luminous backport. That missing code is
> what handled those "ERROR: init sync on  failed,
> retcode=-2" errors.
>
> I included a fix for that in a later backport
> (https://github.com/ceph/ceph/pull/26549), which I'm still working to
> get through qa. I'm afraid I can't really recommend a workaround for the
> issue in the meantime.
>
> Looking forward though, we do plan to support something like s3's cross
> region replication so you can enable replication on a specific bucket
> without having to enable it globally.
>
> Casey
>
>
> On 3/5/19 2:32 PM, Christian Rice wrote:
> >
> > Much appreciated.  We’ll continue to poke around and certainly will
> > disable the dynamic resharding.
> >
> > We started with 12.2.8 in production.  We definitely did not have it
> > enabled in ceph.conf
> >
> > *From: *Matthew H 
> > *Date: *Tuesday, March 5, 2019 at 11:22 AM
> > *To: *Christian Rice , ceph-users
> > 
> > *Cc: *Trey Palmer 
> > *Subject: *Re: radosgw sync falling behind regularly
> >
> > Hi Christian,
> >
> > To be on the safe side and future proof yourself will want to go ahead
> > and set the following in your ceph.conf file, and then issue a restart
> > to your RGW instances.
> >
> > rgw_dynamic_resharding = false
> >
> > There are a number of issues with dynamic resharding, multisite rgw
> > problems being just one of them. However I thought it was disabled
> > automatically when multisite rgw is used (but I will have to double
> > check the code on that). What version of Ceph did you initially
> > install the cluster with? Prior to v12.2.2 this feature was enabled by
> > default for all rgw use cases.
> >
> > Thanks,
> >
> > 
> >
> > *From:*Christian Rice 
> > *Sent:* Tuesday, March 5, 2019 2:07 PM
> > *To:* Matthew H; ceph-users
> > *Subject:* Re: radosgw sync falling behind regularly
> >
> > Matthew, first of all, let me say we very much appreciate your help!
> >
> > So I don’t think we turned dynamic resharding on, nor did we manually
> > reshard buckets. Seems like it defaults to on for luminous but the
> > mimic docs say it’s not supported in multisite.  So do we need to
> > disable it manually via tell and ceph.conf?
> >
> > Also, after running the command you suggested, all the stale instances
> > are gone…these from my examples were in output:
> >
> > "bucket_instance":
> > "sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.303",
> >
> > "bucket_instance":
> > "sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299",
> >
> > "bucket_instance":
> > "sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.301",
> >
> > Though we still get lots of log messages like so in rgw:
> >
> > 2019-03-05 11:01:09.526120 7f64120ae700  0 ERROR: failed to get bucket
> > instance info for
> >
> .bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
> >
> > 2019-03-05 11:01:09.528664 7f63e5016700  1 civetweb: 0x55976f1c2000:
> > 172.17.136.17 - - [05/Mar/2019:10:54:06 -0800] "GET
> >
> /admin/metadata/bucket.instance/sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299?key=sysad_task%2Fsysad-task%3A1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299=de6af748-1a2f-44a1-9d44-30799cf1313e
>
> > HTTP/1.1" 404 0 - -
> >
> > 2019-03-05 11:01:09.529648 7f64130b0700  0 meta sync: ERROR: can't
> > remove key:
> >
> bucket.instance:sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
>
> > ret=-2
> >
> > 2019-03-05 11:01:09.530324 7f64138b1700  0 ERROR: failed to get bucket
> > instance info for
> >
> .bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
> >
> > 2019-03-05 11:01:09.530345 7f6405094700  0 data sync: ERROR: failed to
> > retrieve bucket info for
> >
> bucket=sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
> >
> > 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-05 Thread Casey Bodley

Hi Christian,

I think you've correctly intuited that the issues are related to the use 
of 'bucket sync disable'. There was a bug fix for that feature in 
http://tracker.ceph.com/issues/26895, and I recently found that a block 
of code was missing from its luminous backport. That missing code is 
what handled those "ERROR: init sync on  failed, 
retcode=-2" errors.


I included a fix for that in a later backport 
(https://github.com/ceph/ceph/pull/26549), which I'm still working to 
get through qa. I'm afraid I can't really recommend a workaround for the 
issue in the meantime.


Looking forward though, we do plan to support something like s3's cross 
region replication so you can enable replication on a specific bucket 
without having to enable it globally.


Casey


On 3/5/19 2:32 PM, Christian Rice wrote:


Much appreciated.  We’ll continue to poke around and certainly will 
disable the dynamic resharding.


We started with 12.2.8 in production.  We definitely did not have it 
enabled in ceph.conf


*From: *Matthew H 
*Date: *Tuesday, March 5, 2019 at 11:22 AM
*To: *Christian Rice , ceph-users 


*Cc: *Trey Palmer 
*Subject: *Re: radosgw sync falling behind regularly

Hi Christian,

To be on the safe side and future proof yourself will want to go ahead 
and set the following in your ceph.conf file, and then issue a restart 
to your RGW instances.


rgw_dynamic_resharding = false

There are a number of issues with dynamic resharding, multisite rgw 
problems being just one of them. However I thought it was disabled 
automatically when multisite rgw is used (but I will have to double 
check the code on that). What version of Ceph did you initially 
install the cluster with? Prior to v12.2.2 this feature was enabled by 
default for all rgw use cases.


Thanks,



*From:*Christian Rice 
*Sent:* Tuesday, March 5, 2019 2:07 PM
*To:* Matthew H; ceph-users
*Subject:* Re: radosgw sync falling behind regularly

Matthew, first of all, let me say we very much appreciate your help!

So I don’t think we turned dynamic resharding on, nor did we manually 
reshard buckets. Seems like it defaults to on for luminous but the 
mimic docs say it’s not supported in multisite.  So do we need to 
disable it manually via tell and ceph.conf?


Also, after running the command you suggested, all the stale instances 
are gone…these from my examples were in output:


    "bucket_instance": 
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.303",


    "bucket_instance": 
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299",


    "bucket_instance": 
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.301",


Though we still get lots of log messages like so in rgw:

2019-03-05 11:01:09.526120 7f64120ae700  0 ERROR: failed to get bucket 
instance info for 
.bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299


2019-03-05 11:01:09.528664 7f63e5016700  1 civetweb: 0x55976f1c2000: 
172.17.136.17 - - [05/Mar/2019:10:54:06 -0800] "GET 
/admin/metadata/bucket.instance/sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299?key=sysad_task%2Fsysad-task%3A1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299=de6af748-1a2f-44a1-9d44-30799cf1313e 
HTTP/1.1" 404 0 - -


2019-03-05 11:01:09.529648 7f64130b0700  0 meta sync: ERROR: can't 
remove key: 
bucket.instance:sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299 
ret=-2


2019-03-05 11:01:09.530324 7f64138b1700  0 ERROR: failed to get bucket 
instance info for 
.bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299


2019-03-05 11:01:09.530345 7f6405094700  0 data sync: ERROR: failed to 
retrieve bucket info for 
bucket=sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299


2019-03-05 11:01:09.531774 7f6405094700  0 data sync: WARNING: 
skipping data log entry for missing bucket 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299


2019-03-05 11:01:09.571680 7f6405094700  0 data sync: ERROR: init sync 
on 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302 
failed, retcode=-2


2019-03-05 11:01:09.573179 7f6405094700  0 data sync: WARNING: 
skipping data log entry for missing bucket 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302


2019-03-05 11:01:13.504308 7f63f903e700  1 civetweb: 0x55976f0f2000: 
10.105.18.20 - - [05/Mar/2019:11:00:57 -0800] "GET 
/admin/metadata/bucket.instance/sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299?key=sysad_task%2Fsysad-task%3A1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299=de6af748-1a2f-44a1-9d44-30799cf1313e 
HTTP/1.1" 404 0 - -


*From: *Matthew H 
*Date: *Tuesday, March 5, 2019 at 10:03 AM
*To: *Christian Rice , ceph-users 


*Subject: *Re: radosgw sync falling behind regularly

Hi Christian,

You have stale bucket instances that need to be clean up, which 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-05 Thread Matthew H
Hi Christian,

To be on the safe side and to future proof yourself, you will want to go ahead
and set the following in your ceph.conf file, and then issue a restart to your
RGW instances.

rgw_dynamic_resharding = false

There are a number of issues with dynamic resharding, multisite rgw problems 
being just one of them. However I thought it was disabled automatically when 
multisite rgw is used (but I will have to double check the code on that). What 
version of Ceph did you initially install the cluster with? Prior to v12.2.2 
this feature was enabled by default for all rgw use cases.
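
A sketch of what that looks like (the section name has to match your own rgw
instance names; dc11-ceph-rgw1 is just the example gateway from this thread):

[client.rgw.dc11-ceph-rgw1]
rgw_dynamic_resharding = false

and then restart that gateway, e.g.:

sudo systemctl restart ceph-radosgw@rgw.dc11-ceph-rgw1.service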

Thanks,


From: Christian Rice 
Sent: Tuesday, March 5, 2019 2:07 PM
To: Matthew H; ceph-users
Subject: Re: radosgw sync falling behind regularly


Matthew, first of all, let me say we very much appreciate your help!



So I don’t think we turned dynamic resharding on, nor did we manually reshard 
buckets.  Seems like it defaults to on for luminous but the mimic docs say it’s 
not supported in multisite.  So do we need to disable it manually via tell and 
ceph.conf?



Also, after running the command you suggested, all the stale instances are 
gone…these from my examples were in output:

"bucket_instance": 
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.303",

"bucket_instance": 
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299",

"bucket_instance": 
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.301",



Though we still get lots of log messages like so in rgw:



2019-03-05 11:01:09.526120 7f64120ae700  0 ERROR: failed to get bucket instance 
info for 
.bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299

2019-03-05 11:01:09.528664 7f63e5016700  1 civetweb: 0x55976f1c2000: 
172.17.136.17 - - [05/Mar/2019:10:54:06 -0800] "GET 
/admin/metadata/bucket.instance/sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299?key=sysad_task%2Fsysad-task%3A1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299=de6af748-1a2f-44a1-9d44-30799cf1313e
 HTTP/1.1" 404 0 - -

2019-03-05 11:01:09.529648 7f64130b0700  0 meta sync: ERROR: can't remove key: 
bucket.instance:sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
 ret=-2

2019-03-05 11:01:09.530324 7f64138b1700  0 ERROR: failed to get bucket instance 
info for 
.bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299

2019-03-05 11:01:09.530345 7f6405094700  0 data sync: ERROR: failed to retrieve 
bucket info for 
bucket=sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299

2019-03-05 11:01:09.531774 7f6405094700  0 data sync: WARNING: skipping data 
log entry for missing bucket 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299

2019-03-05 11:01:09.571680 7f6405094700  0 data sync: ERROR: init sync on 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302 failed, 
retcode=-2

2019-03-05 11:01:09.573179 7f6405094700  0 data sync: WARNING: skipping data 
log entry for missing bucket 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302

2019-03-05 11:01:13.504308 7f63f903e700  1 civetweb: 0x55976f0f2000: 
10.105.18.20 - - [05/Mar/2019:11:00:57 -0800] "GET 
/admin/metadata/bucket.instance/sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299?key=sysad_task%2Fsysad-task%3A1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299=de6af748-1a2f-44a1-9d44-30799cf1313e
 HTTP/1.1" 404 0 - -



From: Matthew H 
Date: Tuesday, March 5, 2019 at 10:03 AM
To: Christian Rice , ceph-users 
Subject: Re: radosgw sync falling behind regularly



Hi Christian,



You have stale bucket instances that need to be cleaned up, which is what 
'radosgw-admin reshard stale-instances list' is showing you. Have you or were 
you manually resharding your buckets? The errors you are seeing in the logs are 
related to these stale instances being kept around.



In v12.2.11 this command, along with 'radosgw-admin reshard stale-instances rm', 
was introduced [1].



Hopefully this helps.



[1]

https://ceph.com/releases/v12-2-11-luminous-released/



"There have been fixes to RGW dynamic and manual resharding, which no longer
leaves behind stale bucket instances to be removed manually. For finding and
cleaning up older instances from a reshard a radosgw-admin command reshard
stale-instances list and reshard stale-instances rm should do the necessary
cleanup."





From: Christian Rice 
Sent: Tuesday, March 5, 2019 11:34 AM
To: Matthew H; ceph-users
Subject: Re: radosgw sync falling behind regularly



The output of “radosgw-admin reshard stale-instances 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-05 Thread Christian Rice
Matthew, first of all, let me say we very much appreciate your help!

So I don’t think we turned dynamic resharding on, nor did we manually reshard 
buckets.  Seems like it defaults to on for luminous but the mimic docs say it’s 
not supported in multisite.  So do we need to disable it manually via tell and 
ceph.conf?

Also, after running the command you suggested, all the stale instances are 
gone…these from my examples were in output:
"bucket_instance": 
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.303",
"bucket_instance": 
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299",
"bucket_instance": 
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.301",

Though we still get lots of log messages like so in rgw:

2019-03-05 11:01:09.526120 7f64120ae700  0 ERROR: failed to get bucket instance 
info for 
.bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
2019-03-05 11:01:09.528664 7f63e5016700  1 civetweb: 0x55976f1c2000: 
172.17.136.17 - - [05/Mar/2019:10:54:06 -0800] "GET 
/admin/metadata/bucket.instance/sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299?key=sysad_task%2Fsysad-task%3A1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299=de6af748-1a2f-44a1-9d44-30799cf1313e
 HTTP/1.1" 404 0 - -
2019-03-05 11:01:09.529648 7f64130b0700  0 meta sync: ERROR: can't remove key: 
bucket.instance:sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
 ret=-2
2019-03-05 11:01:09.530324 7f64138b1700  0 ERROR: failed to get bucket instance 
info for 
.bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
2019-03-05 11:01:09.530345 7f6405094700  0 data sync: ERROR: failed to retrieve 
bucket info for 
bucket=sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
2019-03-05 11:01:09.531774 7f6405094700  0 data sync: WARNING: skipping data 
log entry for missing bucket 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
2019-03-05 11:01:09.571680 7f6405094700  0 data sync: ERROR: init sync on 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302 failed, 
retcode=-2
2019-03-05 11:01:09.573179 7f6405094700  0 data sync: WARNING: skipping data 
log entry for missing bucket 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302
2019-03-05 11:01:13.504308 7f63f903e700  1 civetweb: 0x55976f0f2000: 
10.105.18.20 - - [05/Mar/2019:11:00:57 -0800] "GET 
/admin/metadata/bucket.instance/sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299?key=sysad_task%2Fsysad-task%3A1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299=de6af748-1a2f-44a1-9d44-30799cf1313e
 HTTP/1.1" 404 0 - -

From: Matthew H 
Date: Tuesday, March 5, 2019 at 10:03 AM
To: Christian Rice , ceph-users 
Subject: Re: radosgw sync falling behind regularly

Hi Christian,

You have stale bucket instances that need to be cleaned up, which is what 
'radosgw-admin reshard stale-instances list' is showing you. Have you or were 
you manually resharding your buckets? The errors you are seeing in the logs are 
related to these stale instances being kept around.

In v12.2.11 this command, along with 'radosgw-admin reshard stale-instances rm', 
was introduced [1].

Hopefully this helps.

[1]
https://ceph.com/releases/v12-2-11-luminous-released/

"There have been fixes to RGW dynamic and manual resharding, which no longer
leaves behind stale bucket instances to be removed manually. For finding and
cleaning up older instances from a reshard a radosgw-admin command reshard
stale-instances list and reshard stale-instances rm should do the necessary
cleanup."
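
In practice that boils down to something like the following, run where the stale
instances live (list first and review before removing anything):

radosgw-admin reshard stale-instances list
radosgw-admin reshard stale-instances rm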


From: Christian Rice 
Sent: Tuesday, March 5, 2019 11:34 AM
To: Matthew H; ceph-users
Subject: Re: radosgw sync falling behind regularly


The output of “radosgw-admin reshard stale-instances list” shows 242 entries, 
which might embed too much proprietary info for me to list, but here’s a tiny 
sample:

"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.303",

"sysad_task/sysad_task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.281",

"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299",

"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.301",



Some of these appear repeatedly in the radosgw error logs like so:

2019-03-05 08:13:08.929206 7f6405094700  0 data sync: ERROR: init sync on 
sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302 failed, 
retcode=-2

2019-03-05 08:13:08.930581 7f6405094700  0 data sync: WARNING: skipping data 
log entry for missing bucket 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-05 Thread Trey Palmer
Hi Matthew,

I work with Christian.  Thanks so much for looking at this.

We have a huge stale-instances list from that command.

Our periods are all the same; I redirected them to a file on each node and
checksummed them.  Here's the period:

{
"id": "3d0d40ef-90de-40ea-8c44-caa20ea8dc53",
"epoch": 16,
"predecessor_uuid": "926c74c7-c1a7-46b1-9f25-eb5c392a7fbb",
"sync_status": [],
"period_map": {
"id": "3d0d40ef-90de-40ea-8c44-caa20ea8dc53",
"zonegroups": [
{
"id": "de6af748-1a2f-44a1-9d44-30799cf1313e",
"name": "us",
"api_name": "us",
"is_master": "true",
"endpoints": [
"http://sv5-ceph-rgw1.savagebeast.com:8080;
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
"zones": [
{
"id": "107d29a0-b732-4bf1-a26e-1f64f820e839",
"name": "dc11-prod",
"endpoints": [
"http://dc11-ceph-rgw1:8080;
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
},
{
"id": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
"name": "sv5-corp",
"endpoints": [
"http://sv5-ceph-rgw1.savagebeast.com:8080;
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
},
{
"id": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
"name": "sv3-prod",
"endpoints": [
"http://sv3-ceph-rgw1:8080;
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd"
}
],
"short_zone_ids": [
{
"key": "107d29a0-b732-4bf1-a26e-1f64f820e839",
"val": 1720993486
},
{
"key": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
"val": 2301637458
},
{
"key": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
"val": 1449486239
}
]
},
"master_zonegroup": "de6af748-1a2f-44a1-9d44-30799cf1313e",
"master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
"period_config": {
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
},
"realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd",
"realm_name": "savagebucket",
"realm_epoch": 2
}
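
(The checksum comparison mentioned above was nothing fancier than this on each
node, just a sketch:)

radosgw-admin period get > /tmp/period.json
md5sum /tmp/period.json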




On Tue, Mar 5, 2019 at 7:31 AM Matthew H  wrote:

> Hi Christian,
>
> You haven't resharded any of your buckets have you?  You can run the
> command below in v12.2.11 to list stale bucket instances.
>
> radosgw-admin reshard stale-instances list
>
> Can you also send the output from the following command on each rgw?
>
> radosgw-admin period get
>
>
>
> --
> *From:* Christian Rice 
> *Sent:* Tuesday, March 5, 2019 1:46 AM
> *To:* Matthew H; ceph-users
> *Subject:* Re: radosgw sync falling behind regularly
>
>
> sure thing.
>
>
>
> sv5-ceph-rgw1
>
> zonegroup get
>
> {
>
> "id": "de6af748-1a2f-44a1-9d44-30799cf1313e",
>
> "name": "us",
>
> "api_name": "us",
>
> "is_master": "true",
>
> "endpoints": [
>
> 

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-04 Thread Christian Rice
sure thing.

sv5-ceph-rgw1
zonegroup get
{
"id": "de6af748-1a2f-44a1-9d44-30799cf1313e",
"name": "us",
"api_name": "us",
"is_master": "true",
"endpoints": [
"http://sv5-ceph-rgw1.savagebeast.com:8080;
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
"zones": [
{
"id": "107d29a0-b732-4bf1-a26e-1f64f820e839",
"name": "dc11-prod",
"endpoints": [
"http://dc11-ceph-rgw1:8080;
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
},
{
"id": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
"name": "sv5-corp",
"endpoints": [
"http://sv5-ceph-rgw1.savagebeast.com:8080;
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
},
{
"id": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
"name": "sv3-prod",
"endpoints": [
"http://sv3-ceph-rgw1:8080;
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd"
}

zone get
{
"id": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
"name": "sv5-corp",
"domain_root": "sv5-corp.rgw.meta:root",
"control_pool": "sv5-corp.rgw.control",
"gc_pool": "sv5-corp.rgw.log:gc",
"lc_pool": "sv5-corp.rgw.log:lc",
"log_pool": "sv5-corp.rgw.log",
"intent_log_pool": "sv5-corp.rgw.log:intent",
"usage_log_pool": "sv5-corp.rgw.log:usage",
"reshard_pool": "sv5-corp.rgw.log:reshard",
"user_keys_pool": "sv5-corp.rgw.meta:users.keys",
"user_email_pool": "sv5-corp.rgw.meta:users.email",
"user_swift_pool": "sv5-corp.rgw.meta:users.swift",
"user_uid_pool": "sv5-corp.rgw.meta:users.uid",
"system_key": {
"access_key": "access_key_redacted",
"secret_key": "secret_key_redacted"
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "sv5-corp.rgw.buckets.index",
"data_pool": "sv5-corp.rgw.buckets.data",
"data_extra_pool": "sv5-corp.rgw.buckets.non-ec",
"index_type": 0,
"compression": ""
}
}
],
"metadata_heap": "",
"tier_config": [],
"realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd"
}
sv3-ceph-rgw1
zonegroup get
{
"id": "de6af748-1a2f-44a1-9d44-30799cf1313e",
"name": "us",
"api_name": "us",
"is_master": "true",
"endpoints": [
"http://sv5-ceph-rgw1.savagebeast.com:8080;
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
"zones": [
{
"id": "107d29a0-b732-4bf1-a26e-1f64f820e839",
"name": "dc11-prod",
"endpoints": [
"http://dc11-ceph-rgw1:8080;
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
},
{
"id": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
"name": "sv5-corp",
"endpoints": [
"http://sv5-ceph-rgw1.savagebeast.com:8080;
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
   "tier_type": "",
"sync_from_all": "true",
"sync_from": []
},
{
"id": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
"name": "sv3-prod",
"endpoints": [
"http://sv3-ceph-rgw1:8080;
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",

Re: [ceph-users] radosgw sync falling behind regularly

2019-03-04 Thread Matthew H
Christian,

Can you provide your zonegroup and zone configurations for all 3 rgw sites? 
(run the commands for each site please)
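
i.e., on each site, something like:

radosgw-admin zonegroup get
radosgw-admin zone get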

Thanks,


Re: [ceph-users] radosgw sync falling behind regularly

2019-03-04 Thread Christian Rice
So we upgraded everything from 12.2.8 to 12.2.11, and things have gone to hell. 
 Lots of sync errors, like so:

sudo radosgw-admin sync error list
[
    {
        "shard_id": 0,
        "entries": [
            {
                "id": "1_1549348245.870945_5163821.1",
                "section": "data",
                "name": "dora/catalogmaker-redis:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.470/56fbc9685d609b4c8cdbd11dd60bf03bedcb613b438c663c9899d930b25f0405",
                "timestamp": "2019-02-05 06:30:45.870945Z",
                "info": {
                    "source_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                    "error_code": 5,
                    "message": "failed to sync object(5) Input/output error"
                }
            },
            …
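
As an aside, the bucket instance named in an entry like that can be checked against
the zone's metadata (a sketch; the key below is simply the tenant/bucket:instance
string pasted together from the error entry above):

# does this zone know about the bucket instance the error refers to?
radosgw-admin metadata get bucket.instance:dora/catalogmaker-redis:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.470

# list the bucket instances this zone has, for comparison
radosgw-admin metadata list bucket.instance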

radosgw logs are full of:
2019-03-04 14:32:58.039467 7f90e81eb700  0 data sync: ERROR: failed to read 
remote data log info: ret=-2
2019-03-04 14:32:58.041296 7f90e81eb700  0 data sync: ERROR: init sync on 
escarpment/escarpment:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.146 failed, 
retcode=-2
2019-03-04 14:32:58.041662 7f90e81eb700  0 meta sync: ERROR: 
RGWBackoffControlCR called coroutine returned -2
2019-03-04 14:32:58.042949 7f90e81eb700  0 data sync: WARNING: skipping data 
log entry for missing bucket 
escarpment/escarpment:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.146
2019-03-04 14:32:58.823501 7f90e81eb700  0 data sync: ERROR: failed to read 
remote data log info: ret=-2
2019-03-04 14:32:58.825243 7f90e81eb700  0 meta sync: ERROR: 
RGWBackoffControlCR called coroutine returned -2
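
The ret=-2 in those messages is just ENOENT: the sync coroutine asked the remote
zone for a data log (or bucket) it couldn't find. A sketch of read-only checks that
can help narrow down which side is missing what (the source-zone names are the ones
from this thread):

radosgw-admin data sync status --source-zone=sv5-corp   # per-source data sync state
radosgw-admin data sync status --source-zone=sv3-prod
radosgw-admin datalog status                            # local data log shards
radosgw-admin mdlog status                              # local metadata log shards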

dc11-ceph-rgw2:~$ sudo radosgw-admin sync status
  realm b3e2afe7-2254-494a-9a34-ce50358779fd (savagebucket)
  zonegroup de6af748-1a2f-44a1-9d44-30799cf1313e (us)
   zone 107d29a0-b732-4bf1-a26e-1f64f820e839 (dc11-prod)
2019-03-04 14:26:21.351372 7ff7ae042e40  0 meta sync: ERROR: failed to fetch 
mdlog info
  metadata sync syncing
full sync: 0/64 shards
failed to fetch local sync status: (5) Input/output error
^C

Any advice?  All three clusters on 12.2.11, Debian stretch.
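
The "(5) Input/output error" when fetching the local sync status can also point at
the gateway failing to read its own sync-status objects, so it may be worth ruling
out plain RADOS problems on that cluster first (a sketch; the pool name assumes the
default naming for a zone called dc11-prod):

ceph -s
ceph health detail
rados -p dc11-prod.rgw.log ls | head   # sync status and markers live in the zone's log pool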


Re: [ceph-users] radosgw sync falling behind regularly

2019-02-28 Thread Christian Rice
Yeah my bad on the typo, not running 12.8.8 ☺  It’s 12.2.8.  We can upgrade and 
will attempt to do so asap.  Thanks for that, I need to read my release notes 
more carefully, I guess!


Re: [ceph-users] radosgw sync falling behind regularly

2019-02-27 Thread Matthew H
Hey Christian,

I'm making a wild guess, but assuming this is 12.2.8, is it possible for you to
upgrade to 12.2.11? There have been rgw multisite bug fixes for metadata syncing
and data syncing (both separate issues) that you could be hitting.
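
A quick sketch for confirming what every daemon is actually running before and after
such an upgrade (works on luminous and later; the dpkg line is Debian-specific):

ceph versions                      # daemon versions as the cluster reports them
dpkg -l | grep -E 'ceph|radosgw'   # installed packages on each gateway host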

Thanks,

From: ceph-users  on behalf of Christian 
Rice 
Sent: Wednesday, February 27, 2019 7:05 PM
To: ceph-users
Subject: [ceph-users] radosgw sync falling behind regularly


Debian 9; ceph 12.8.8-bpo90+1; no rbd or cephfs, just radosgw; three clusters in
one zonegroup.

Often we find either metadata or data sync behind, and it doesn't ever seem to
recover until we restart the endpoint radosgw target service.

e.g. at 15:45:40:

dc11-ceph-rgw1:/var/log/ceph# radosgw-admin sync status
          realm b3e2afe7-2254-494a-9a34-ce50358779fd (savagebucket)
      zonegroup de6af748-1a2f-44a1-9d44-30799cf1313e (us)
           zone 107d29a0-b732-4bf1-a26e-1f64f820e839 (dc11-prod)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is behind on 2 shards
                behind shards: [19,41]
                oldest incremental change not applied: 2019-02-27 14:42:24.0.408263s
      data sync source: 1e27bf9c-3a2f-4845-85b6-33a24bbe1c04 (sv5-corp)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
                source: 331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8 (sv3-prod)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

so at 15:46:07:

dc11-ceph-rgw1:/var/log/ceph# sudo systemctl restart ceph-radosgw@rgw.dc11-ceph-rgw1.service

and by the time I checked at 15:48:08:

dc11-ceph-rgw1:/var/log/ceph# radosgw-admin sync status
          realm b3e2afe7-2254-494a-9a34-ce50358779fd (savagebucket)
      zonegroup de6af748-1a2f-44a1-9d44-30799cf1313e (us)
           zone 107d29a0-b732-4bf1-a26e-1f64f820e839 (dc11-prod)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 1e27bf9c-3a2f-4845-85b6-33a24bbe1c04 (sv5-corp)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
                source: 331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8 (sv3-prod)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

There’s no way this is “lag.”  It’s stuck, and happens frequently, though perhaps
not daily.  Any suggestions?  Our cluster isn’t heavily used yet, but it’s
production.
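
For what it's worth, a small sketch of the restart-and-watch sequence described
above (host and service names are the ones from this post); polling the status a
few times before restarting would also show whether the behind shards ever drain
on their own:

# on the affected gateway
sudo systemctl restart ceph-radosgw@rgw.dc11-ceph-rgw1.service

# then poll sync status and watch whether the behind shards clear
while true; do
    date
    radosgw-admin sync status | grep -E 'behind|recovering|caught up'
    sleep 60
done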
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com