Hello,

We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The
upgrade went mostly fine, but several of our RGWs now will not start. One
RGW is working fine; the rest fail to initialize and are stuck in a crash loop.
This cluster is part of a multisite configuration and is currently not the
master zone. The current master zone is running 14.2.22. These are the only two
zones in the zonegroup. After turning debug up to 20, this is the log output
between crashes:
```
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.52
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.54
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got realms_names. <redacted>
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got <redacted>
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=114
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=686
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup init ret 0
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup name <redacted>
2023-07-20 14:29:56.374 7fd8dec40900 20 using current period zonegroup <redacted>
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 10 Cannot find current period zone using local zone
2023-07-20 14:29:56.375 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 20 zone <redacted>
2023-07-20 14:29:56.375 7fd8dec40900 20 generating connection object for zone <redacted> id f10b465f-bf18-47d0-a51c-ca4f17118ee1
2023-07-20 14:34:56.198 7fd8cafe8700 -1 Initialization timeout, failed to initialize
```
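
For reference, nothing is logged between the "generating connection object" line and the timeout line. A minimal sketch for measuring that gap, using only the two timestamps copied from the log above (the helper name `elapsed_seconds` is just for illustration):

```python
from datetime import datetime

# Timestamp layout used at the start of each log line above.
FMT = "%Y-%m-%d %H:%M:%S.%f"

def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()

# Last line logged before the stall, and the timeout line itself.
last_progress = "2023-07-20 14:29:56.375"
timeout_line = "2023-07-20 14:34:56.198"

gap = elapsed_seconds(last_progress, timeout_line)
print(f"{gap:.3f} s with no log output before the timeout")
```

That works out to roughly 299.8 seconds, so the process appears to sit silent for almost five minutes after the connection object message before the timeout fires.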

I’ve checked all file permissions and filesystem free space, disabled SELinux
and firewalld, raised the initialization timeout to 600 seconds, and removed
all non-essential config from ceph.conf. Each attempt produces the same result.
I would greatly appreciate any other ideas or insight.

Thanks,
Ben
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
