Hi,

I recently upgraded our test cluster to Nautilus 14.2.3 and am no longer able to
start radosgw.  The cluster itself (mon, osd, mgr) appears fine.

I'm not much of an expert at reading these, but from the errors being thrown it
looks like the object expirer is choking on handling resharded buckets.  There
have been no recent reshard operations on this cluster, and dynamic resharding
is disabled.  I thought this could be related to
https://github.com/ceph/ceph/pull/27817, but that fix had already landed by v14.2.3...
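
For what it's worth, this is how I'd double-check that nothing is queued for
resharding (assuming radosgw-admin still works against this cluster):

[root@cephproxy01 ~]# radosgw-admin reshard list
[root@cephproxy01 ~]# radosgw-admin reshard status --bucket=<bucket>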

Logs from starting up radosgw:

  -26> 2019-09-17 16:18:45.719 7f2d93da2780  0 starting handler: civetweb
  -25> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: allow_unicode_in_urls: yes
  -24> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: canonicalize_url_path: no
  -23> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: decode_url: no
  -22> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: enable_auth_domain_check: no
  -21> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: enable_keep_alive: yes
  -20> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: listening_ports: 7480,7481s
  -19> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: num_threads: 512
  -18> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: run_as_user: ceph
  -17> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: ssl_certificate: '/etc/ceph/rgw.pem'
  -16> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: validate_http_method: no
  -15> 2019-09-17 16:18:45.720 7f2d93da2780  0 civetweb: 0x55d622628600: ssl_use_pem_file: cannot open certificate file '/etc/ceph/rgw.pem': error:02001002:system library:fopen:No such file or directory
  -14> 2019-09-17 16:18:45.720 7f2d93da2780 -1 ERROR: failed run
  -13> 2019-09-17 16:18:45.721 7f2d5c97b700  5 lifecycle: schedule life cycle next start time: Wed Sep 18 04:00:00 2019
  -12> 2019-09-17 16:18:45.721 7f2d5f180700 20 reqs_thread_entry: start
  -11> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c94360:op=0x55d625bcd800:20MetaMasterTrimPollCR: operate()
  -10> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c94360 is io blocked
   -9> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c94480:op=0x55d625a68c00:17DataLogTrimPollCR: operate()
   -8> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c94480 is io blocked
   -7> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c945a0:op=0x55d625a69200:16BucketTrimPollCR: operate()
   -6> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c945a0 is io blocked
   -5> 2019-09-17 16:18:45.721 7f2d5c17a700 20 BucketsSyncThread: start
   -4> 2019-09-17 16:18:45.721 7f2d5b979700 20 UserSyncThread: start
   -3> 2019-09-17 16:18:45.721 7f2d5b178700 20 process_all_logshards Resharding is disabled
   -2> 2019-09-17 16:18:45.721 7f2d5d97d700 20 reqs_thread_entry: start
   -1> 2019-09-17 16:18:45.724 7f2d731a8700 20 processing shard = obj_delete_at_hint.0000000001
    0> 2019-09-17 16:18:45.726 7f2d731a8700 -1 *** Caught signal (Aborted) **
 in thread 7f2d731a8700 thread_name:rgw_obj_expirer

 ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable)
 1: (()+0xf630) [0x7f2d86ff7630]
 2: (gsignal()+0x37) [0x7f2d86431377]
 3: (abort()+0x148) [0x7f2d86432a68]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2d86d417d5]
 5: (()+0x5e746) [0x7f2d86d3f746]
 6: (()+0x5e773) [0x7f2d86d3f773]
 7: (()+0x5e993) [0x7f2d86d3f993]
 8: (()+0x1772b) [0x7f2d92efb72b]
 9: (tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0xf3) [0x7f2d92f19a03]
 10: (()+0x70a8a2) [0x55d6222508a2]
 11: (()+0x70a8e8) [0x55d6222508e8]
 12: (RGWObjectExpirer::process_single_shard(std::string const&, utime_t const&, utime_t const&)+0x115) [0x55d622253155]
 13: (RGWObjectExpirer::inspect_all_shards(utime_t const&, utime_t const&)+0xab) [0x55d62225382b]
 14: (RGWObjectExpirer::OEWorker::entry()+0x273) [0x55d622253c43]
 15: (()+0x7ea5) [0x7f2d86fefea5]
 16: (clone()+0x6d) [0x7f2d864f98cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
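
Since the abort happens while the expirer is walking the obj_delete_at_hint
shards, I can dump those objects directly if that would help; something like
this should work (assuming the default zone's log pool name, default.rgw.log):

[root@cephproxy01 ~]# rados -p default.rgw.log ls | grep obj_delete_at_hint
[root@cephproxy01 ~]# rados -p default.rgw.log listomapkeys obj_delete_at_hint.0000000001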


As another data point, getting bucket stats now fails for two buckets in the
cluster:

[root@cephproxy01 ~]# radosgw-admin bucket stats --bucket=wHjAk0t
failure: (2) No such file or directory:
2019-09-19 14:21:59.483 7f54ed3fd6c0 -1 ERROR: get_bucket_instance_from_oid failed: -2

[root@cephproxy01 ~]# radosgw-admin bucket stats --bucket=bzUi3MT
failure: (2) No such file or directory:
2019-09-19 14:22:16.324 7fbd172666c0 -1 ERROR: get_bucket_instance_from_oid failed: -2
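
In case it's useful, I can also pull the raw metadata for these buckets;
something like the following should show whether the bucket entrypoint and
bucket instance objects still exist (assuming the standard metadata section
names):

[root@cephproxy01 ~]# radosgw-admin metadata get bucket:wHjAk0t
[root@cephproxy01 ~]# radosgw-admin metadata list bucket.instance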


Has anyone seen this before?  Googling didn't turn up much on this.  Let me
know if I can provide any more useful debugging information, e.g.
higher-verbosity logs as sketched below.
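
For example, I could bump the rgw log levels in ceph.conf and restart (the
client section name here is a guess based on our gateway host):

[client.rgw.cephproxy01]
    debug rgw = 20
    debug ms = 1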

Thanks,
Liam
---
University of Maryland
Institute for Advanced Computer Studies