Re: [ceph-users] radosgw sync falling behind regularly

Casey Bodley Tue, 05 Mar 2019 14:16:40 -0800

Hi Christian,

I think you've correctly intuited that the issues are related to the use of 'bucket sync disable'. There was a bug fix for that feature in http://tracker.ceph.com/issues/26895, and I recently found that a block of code was missing from its luminous backport. That missing code is what handled those "ERROR: init sync on <bucket instance> failed, retcode=-2" errors.

I included a fix for that in a later backport (https://github.com/ceph/ceph/pull/26549), which I'm still working to get through qa. I'm afraid I can't really recommend a workaround for the issue in the meantime.

Looking forward though, we do plan to support something like s3's cross-region replication so you can enable replication on a specific bucket without having to enable it globally.


Casey


On 3/5/19 2:32 PM, Christian Rice wrote:

Much appreciated. We’ll continue to poke around and certainly will disable the dynamic resharding.
We started with 12.2.8 in production. We definitely did not have it enabled in ceph.conf.
*From: *Matthew H <matthew.he...@hotmail.com>
*Date: *Tuesday, March 5, 2019 at 11:22 AM
*To: *Christian Rice <cr...@pandora.com>, ceph-users <ceph-users@lists.ceph.com>
*Cc: *Trey Palmer <nerdmagic...@gmail.com>
*Subject: *Re: radosgw sync falling behind regularly

Hi Christian,
To be on the safe side and future-proof yourself, you will want to go ahead and set the following in your ceph.conf file, and then restart your RGW instances.
rgw_dynamic_resharding = false
There are a number of issues with dynamic resharding, multisite rgw problems being just one of them. However, I thought it was disabled automatically when multisite rgw is used (but I will have to double-check the code on that). What version of Ceph did you initially install the cluster with? Prior to v12.2.2 this feature was enabled by default for all rgw use cases.
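As a rough way to confirm what a given ceph.conf actually says about this option, here is a minimal Python sketch. It only reads the file text; it does not query a running daemon. The section name in the usage example is hypothetical, and treating spaces, dashes and underscores in option names as equivalent is an assumption based on how ceph.conf keys are commonly written.

```python
import configparser


def dynamic_resharding_settings(conf_text):
    """Return {section: value} for every section that sets rgw_dynamic_resharding.

    Option names are normalized so that "rgw dynamic resharding" and
    "rgw_dynamic_resharding" are treated as the same key (an assumption
    about ceph.conf conventions, not something this thread states).
    """
    parser = configparser.ConfigParser(
        strict=False,                      # tolerate repeated sections/options
        interpolation=None,                # do not treat '%' specially
        inline_comment_prefixes=("#", ";"),
    )
    # Normalize option names before they are stored or looked up.
    parser.optionxform = lambda key: (
        key.strip().lower().replace(" ", "_").replace("-", "_")
    )
    parser.read_string(conf_text)

    found = {}
    for section in parser.sections():
        if parser.has_option(section, "rgw_dynamic_resharding"):
            found[section] = parser.get(section, "rgw_dynamic_resharding").strip().lower()
    return found


if __name__ == "__main__":
    # Hypothetical ceph.conf fragment for illustration only.
    sample = (
        "[global]\n"
        "fsid = 00000000-0000-0000-0000-000000000000\n"
        "[client.rgw.example]\n"
        "rgw dynamic resharding = false\n"
    )
    print(dynamic_resharding_settings(sample))
```

An empty result means the option is not set in the file at all, in which case the running version's default applies.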
Thanks,

------------------------------------------------------------------------

*From:* Christian Rice <cr...@pandora.com>
*Sent:* Tuesday, March 5, 2019 2:07 PM
*To:* Matthew H; ceph-users
*Subject:* Re: radosgw sync falling behind regularly

Matthew, first of all, let me say we very much appreciate your help!
So I don’t think we turned dynamic resharding on, nor did we manually reshard buckets. Seems like it defaults to on for luminous but the mimic docs say it’s not supported in multisite. So do we need to disable it manually via tell and ceph.conf?
Also, after running the command you suggested, all the stale instances are gone… these from my examples were in the output:
"bucket_instance":"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.303",
"bucket_instance":"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299",
"bucket_instance":"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.301",
Though we still get lots of log messages like so in rgw:
2019-03-05 11:01:09.526120 7f64120ae700 0 ERROR: failed to get bucket instance info for .bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
2019-03-05 11:01:09.528664 7f63e5016700 1 civetweb: 0x55976f1c2000: 172.17.136.17 - - [05/Mar/2019:10:54:06 -0800] "GET /admin/metadata/bucket.instance/sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299?key=sysad_task%2Fsysad-task%3A1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299&rgwx-zonegroup=de6af748-1a2f-44a1-9d44-30799cf1313e HTTP/1.1" 404 0 - -
2019-03-05 11:01:09.529648 7f64130b0700 0 meta sync: ERROR: can't remove key: bucket.instance:sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299 ret=-2
2019-03-05 11:01:09.530324 7f64138b1700 0 ERROR: failed to get bucket instance info for .bucket.meta.sysad_task:sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
2019-03-05 11:01:09.530345 7f6405094700 0 data sync: ERROR: failed to retrieve bucket info for bucket=sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
2019-03-05 11:01:09.531774 7f6405094700 0 data sync: WARNING: skipping data log entry for missing bucket sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
2019-03-05 11:01:09.571680 7f6405094700 0 data sync: ERROR: init sync on sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302 failed, retcode=-2
2019-03-05 11:01:09.573179 7f6405094700 0 data sync: WARNING: skipping data log entry for missing bucket sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302
2019-03-05 11:01:13.504308 7f63f903e700 1 civetweb: 0x55976f0f2000: 10.105.18.20 - - [05/Mar/2019:11:00:57 -0800] "GET /admin/metadata/bucket.instance/sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299?key=sysad_task%2Fsysad-task%3A1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299&rgwx-zonegroup=de6af748-1a2f-44a1-9d44-30799cf1313e HTTP/1.1" 404 0 - -
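To see which bucket instances account for the "init sync on ... failed" errors in an rgw log like the one above, a small Python sketch along these lines could tally them. The regular expression is an assumption derived only from the message shape quoted in this thread; it tolerates missing whitespace around the instance name, which can happen when log lines are re-wrapped by mail clients.

```python
import re
from collections import Counter

# Pattern inferred from the messages quoted in this thread, not from the rgw source.
INIT_SYNC_RE = re.compile(
    r"init sync\s*on\s*(?P<instance>\S+?)\s*failed,\s*retcode=(?P<rc>-?\d+)"
)


def count_init_sync_failures(lines):
    """Count 'init sync ... failed' errors per (bucket instance, retcode)."""
    counts = Counter()
    for line in lines:
        match = INIT_SYNC_RE.search(line)
        if match:
            counts[(match.group("instance"), int(match.group("rc")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_failures.py < /path/to/rgw.log
    for (instance, rc), n in count_init_sync_failures(sys.stdin).most_common():
        print(f"{n:6d}  retcode={rc}  {instance}")
```

The instances printed at the top of the output are the ones worth comparing against the stale-instances list.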
*From: *Matthew H <matthew.he...@hotmail.com>
*Date: *Tuesday, March 5, 2019 at 10:03 AM
*To: *Christian Rice <cr...@pandora.com>, ceph-users <ceph-users@lists.ceph.com>
*Subject: *Re: radosgw sync falling behind regularly

Hi Christian,
You have stale bucket instances that need to be cleaned up, which is what 'radosgw-admin reshard stale-instances list' is showing you. Have you been, or were you, manually resharding your buckets? The errors you are seeing in the logs are related to these stale instances being kept around.
In v12.2.11 this command, along with 'radosgw-admin reshard stale-instances rm', was introduced [1].
Hopefully this helps.

[1]
https://ceph.com/releases/v12-2-11-luminous-released/
/"There have been fixes to RGW dynamic and manual resharding, which no longer
leaves behind stale bucket instances to be removed manually. For finding and
cleaning up older instances from a reshard a radosgw-admin command reshard
stale-instances list and reshard stale-instances rm should do the necessary
cleanup."/

------------------------------------------------------------------------

*From:* Christian Rice <cr...@pandora.com>
*Sent:* Tuesday, March 5, 2019 11:34 AM
*To:* Matthew H; ceph-users
*Subject:* Re: radosgw sync falling behind regularly
The output of “radosgw-admin reshard stale-instances list” shows 242 entries, which might embed too much proprietary info for me to list, but here’s a tiny sample:
"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.303",

"sysad_task/sysad_task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.281",

"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299",

"sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.301",
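With 242 entries, it may help to group the stale instances by bucket before deciding what to do with them. The following Python sketch assumes each entry has the form tenant/bucket:instance-id, as in the sample above (the tenant part is treated as optional). The type and function names are made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class BucketInstance(NamedTuple):
    tenant: str        # empty string when the entry has no "tenant/" prefix
    bucket: str
    instance_id: str


def parse_bucket_instance(entry):
    """Split 'tenant/bucket:instance-id' into its parts (layout assumed from the sample)."""
    tenant, slash, rest = entry.partition("/")
    if not slash:                      # no tenant prefix present
        tenant, rest = "", entry
    bucket, colon, instance_id = rest.partition(":")
    if not colon:
        raise ValueError(f"not a bucket instance entry: {entry!r}")
    return BucketInstance(tenant, bucket, instance_id)


def group_by_bucket(entries):
    """Map (tenant, bucket) to the list of instance ids seen for that bucket."""
    grouped = defaultdict(list)
    for entry in entries:
        parsed = parse_bucket_instance(entry)
        grouped[(parsed.tenant, parsed.bucket)].append(parsed.instance_id)
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        "sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18467.303",
        "sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299",
    ]
    for key, ids in group_by_bucket(sample).items():
        print(key, len(ids))
```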

Some of these appear repeatedly in the radosgw error logs like so:
2019-03-05 08:13:08.929206 7f6405094700 0 data sync: ERROR: init sync on sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302 failed, retcode=-2
2019-03-05 08:13:08.930581 7f6405094700 0 data sync: WARNING: skipping data log entry for missing bucket sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.302
2019-03-05 08:13:08.972053 7f6405094700 0 data sync: ERROR: init sync on sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299 failed, retcode=-2
2019-03-05 08:13:08.973442 7f6405094700 0 data sync: WARNING: skipping data log entry for missing bucket sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299
2019-03-05 08:13:19.528295 7f6406897700 0 data sync: ERROR: init sync on sysad_task/sysad-task:1e27bf9c-3a2f-4845-85b6-33a24bbe1c04.18330.299 failed, retcode=-2
Notably, “Sync is disabled for bucket sysad-task.” We use “bucket sync disable” A LOT. It’s the only way we’ve been able to use multisite and a single namespace and not replicate things that are unneeded to every zone. Perhaps there’s a bug in the implementation that’s tripping us up now, with the new multisite sync code from 12.2.9 onward?
What might we do with stale bucket instances, then?
Of note, our master zone endpoint, which was timing out health checks most of the day after the upgrade (was running but overworked by cluster confusion, so we couldn’t create new buckets or do user ops), has returned to availability late last night. There’s a lot of data to look at, but in my estimation, due to lack of user complaints (or their unawareness of specific issues), it seems the zones are nominally available, even with all the errors and warnings being logged. We’ve tested simple zone replication by creating a few files in one zone and seeing them show up elsewhere…
Here’s “period get” output:

sv5-ceph-rgw1

{
    "id": "3d0d40ef-90de-40ea-8c44-caa20ea8dc53",
    "epoch": 16,
    "predecessor_uuid": "926c74c7-c1a7-46b1-9f25-eb5c392a7fbb",
    "sync_status": [],
    "period_map": {
        "id": "3d0d40ef-90de-40ea-8c44-caa20ea8dc53",
        "zonegroups": [
            {
                "id": "de6af748-1a2f-44a1-9d44-30799cf1313e",
                "name": "us",
                "api_name": "us",
                "is_master": "true",
                "endpoints": [
                    "http://sv5-ceph-rgw1.savagebeast.com:8080"
                ],
                "hostnames": [],
                "hostnames_s3website": [],
                "master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                "zones": [
                    {
                        "id": "107d29a0-b732-4bf1-a26e-1f64f820e839",
                        "name": "dc11-prod",
                        "endpoints": [
                            "http://dc11-ceph-rgw1:8080"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": []
                    },
                    {
                        "id": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                        "name": "sv5-corp",
                        "endpoints": [
                            "http://sv5-ceph-rgw1.savagebeast.com:8080"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": []
                    },
                    {
                        "id": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
                        "name": "sv3-prod",
                        "endpoints": [
                            "http://sv3-ceph-rgw1:8080"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": []
                    }
                ],
                "placement_targets": [
                    {
                        "name": "default-placement",
                        "tags": []
                    }
                ],
                "default_placement": "default-placement",
                "realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd"
            }
        ],
        "short_zone_ids": [
            {
                "key": "107d29a0-b732-4bf1-a26e-1f64f820e839",
                "val": 1720993486
            },
            {
                "key": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                "val": 2301637458
            },
            {
                "key": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
                "val": 1449486239
            }
        ]
    },
    "master_zonegroup": "de6af748-1a2f-44a1-9d44-30799cf1313e",
    "master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
    "period_config": {
        "bucket_quota": {
            "enabled": false,
            "check_on_raw": false,
            "max_size": -1,
            "max_size_kb": 0,
            "max_objects": -1
        },
        "user_quota": {
            "enabled": false,
            "check_on_raw": false,
            "max_size": -1,
            "max_size_kb": 0,
            "max_objects": -1
        }
    },
    "realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd",
    "realm_name": "savagebucket",
    "realm_epoch": 2
}

sv3-ceph-rgw1

{
    "id": "3d0d40ef-90de-40ea-8c44-caa20ea8dc53",
    "epoch": 16,
    "predecessor_uuid": "926c74c7-c1a7-46b1-9f25-eb5c392a7fbb",
    "sync_status": [],
    "period_map": {
        "id": "3d0d40ef-90de-40ea-8c44-caa20ea8dc53",
        "zonegroups": [
            {
                "id": "de6af748-1a2f-44a1-9d44-30799cf1313e",
                "name": "us",
                "api_name": "us",
                "is_master": "true",
                "endpoints": [
                    "http://sv5-ceph-rgw1.savagebeast.com:8080"
                ],
                "hostnames": [],
                "hostnames_s3website": [],
                "master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                "zones": [
                    {
                        "id": "107d29a0-b732-4bf1-a26e-1f64f820e839",
                        "name": "dc11-prod",
                        "endpoints": [
                            "http://dc11-ceph-rgw1:8080"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": []
                    },
                    {
                        "id": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                        "name": "sv5-corp",
                        "endpoints": [
                            "http://sv5-ceph-rgw1.savagebeast.com:8080"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": []
                    },
                    {
                        "id": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
                        "name": "sv3-prod",
                        "endpoints": [
                            "http://sv3-ceph-rgw1:8080"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": []
                    }
                ],
                "placement_targets": [
                    {
                        "name": "default-placement",
                        "tags": []
                    }
                ],
                "default_placement": "default-placement",
                "realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd"
            }
        ],
        "short_zone_ids": [
            {
                "key": "107d29a0-b732-4bf1-a26e-1f64f820e839",
                "val": 1720993486
            },
            {
                "key": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                "val": 2301637458
            },
            {
                "key": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
                "val": 1449486239
            }
        ]
    },
    "master_zonegroup": "de6af748-1a2f-44a1-9d44-30799cf1313e",
    "master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
    "period_config": {
        "bucket_quota": {
            "enabled": false,
            "check_on_raw": false,
            "max_size": -1,
            "max_size_kb": 0,
            "max_objects": -1
        },
        "user_quota": {
            "enabled": false,
            "check_on_raw": false,
            "max_size": -1,
            "max_size_kb": 0,
            "max_objects": -1
        }
    },
    "realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd",
    "realm_name": "savagebucket",
    "realm_epoch": 2
}

dc11-ceph-rgw1

{
    "id": "3d0d40ef-90de-40ea-8c44-caa20ea8dc53",
    "epoch": 16,
    "predecessor_uuid": "926c74c7-c1a7-46b1-9f25-eb5c392a7fbb",
    "sync_status": [],
    "period_map": {
        "id": "3d0d40ef-90de-40ea-8c44-caa20ea8dc53",
        "zonegroups": [
            {
                "id": "de6af748-1a2f-44a1-9d44-30799cf1313e",
                "name": "us",
                "api_name": "us",
                "is_master": "true",
                "endpoints": [
                    "http://sv5-ceph-rgw1.savagebeast.com:8080"
                ],
                "hostnames": [],
                "hostnames_s3website": [],
                "master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                "zones": [
                    {
                        "id": "107d29a0-b732-4bf1-a26e-1f64f820e839",
                        "name": "dc11-prod",
                        "endpoints": [
                            "http://dc11-ceph-rgw1:8080"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": []
                    },
                    {
                        "id": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                        "name": "sv5-corp",
                        "endpoints": [
                            "http://sv5-ceph-rgw1.savagebeast.com:8080"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": []
                    },
                    {
                        "id": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
                        "name": "sv3-prod",
                        "endpoints": [
                            "http://sv3-ceph-rgw1:8080"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": []
                    }
                ],
                "placement_targets": [
                    {
                        "name": "default-placement",
                        "tags": []
                    }
                ],
                "default_placement": "default-placement",
                "realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd"
            }
        ],
        "short_zone_ids": [
            {
                "key": "107d29a0-b732-4bf1-a26e-1f64f820e839",
                "val": 1720993486
            },
            {
                "key": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
                "val": 2301637458
            },
            {
                "key": "331d3f1e-1b72-4c56-bb5a-d1d0fcf6d0b8",
                "val": 1449486239
            }
        ]
    },
    "master_zonegroup": "de6af748-1a2f-44a1-9d44-30799cf1313e",
    "master_zone": "1e27bf9c-3a2f-4845-85b6-33a24bbe1c04",
    "period_config": {
        "bucket_quota": {
            "enabled": false,
            "check_on_raw": false,
            "max_size": -1,
            "max_size_kb": 0,
            "max_objects": -1
        },
        "user_quota": {
            "enabled": false,
            "check_on_raw": false,
            "max_size": -1,
            "max_size_kb": 0,
            "max_objects": -1
        }
    },
    "realm_id": "b3e2afe7-2254-494a-9a34-ce50358779fd",
    "realm_name": "savagebucket",
    "realm_epoch": 2
}
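Comparing three long "period get" dumps by eye is error-prone. A short Python sketch like the one below could reduce each output to a small fingerprint and report any gateway that disagrees with the majority. The choice of fields (period id, epoch, realm epoch, master zone, and the zone name/id pairs) is an assumption about what matters here, and the function names are hypothetical.

```python
import json
from collections import Counter


def period_fingerprint(period):
    """Reduce a parsed 'period get' document to the fields compared across hosts."""
    zones = sorted(
        (zone["name"], zone["id"])
        for zonegroup in period["period_map"]["zonegroups"]
        for zone in zonegroup["zones"]
    )
    return (
        period["id"],
        period["epoch"],
        period["realm_epoch"],
        period["master_zone"],
        tuple(zones),
    )


def find_period_mismatches(outputs):
    """outputs maps host name to raw JSON text; returns hosts that differ from the majority."""
    fingerprints = {
        host: period_fingerprint(json.loads(text)) for host, text in outputs.items()
    }
    majority, _count = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(host for host, fp in fingerprints.items() if fp != majority)
```

An empty list means every gateway reports the same period, which is what the three outputs above show (same id, epoch 16, realm epoch 2).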

*From: *Matthew H <matthew.he...@hotmail.com>
*Date: *Tuesday, March 5, 2019 at 4:31 AM
*To: *Christian Rice <cr...@pandora.com>, ceph-users <ceph-users@lists.ceph.com>
*Subject: *Re: radosgw sync falling behind regularly

Hi Christian,
You haven't resharded any of your buckets, have you? You can run the command below in v12.2.11 to list stale bucket instances.
radosgw-admin reshard stale-instances list

Can you also send the output from the following command on each rgw?

radosgw-admin period get
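If the two commands above need to be collected from several gateways, a thin Python wrapper is one option. This is only a sketch: it assumes both commands print JSON on stdout, and the runner parameter is a hypothetical hook that lets the command execution be replaced (for example by an SSH call, or by a stub when testing).

```python
import json
import subprocess


def run_admin_json(args, runner=None):
    """Run 'radosgw-admin <args>' and parse its stdout as JSON.

    runner is an optional callable taking the full command list and returning
    stdout text; by default the command is executed locally.
    """
    command = ["radosgw-admin", *args]
    if runner is None:
        def runner(cmd):
            return subprocess.run(
                cmd, check=True, capture_output=True, text=True
            ).stdout
    return json.loads(runner(command))


def collect_diagnostics(runner=None):
    """Gather the two outputs requested in this message."""
    return {
        "stale_instances": run_admin_json(["reshard", "stale-instances", "list"], runner),
        "period": run_admin_json(["period", "get"], runner),
    }
```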


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
