Re: [ceph-users] OSD up takes 15 minutes after machine restarts
Hi Igor, could this be the cause of the problem?

huxia...@horebdata.cn

From: Igor Fedotov
Date: 2020-01-19 11:41
To: huxia...@horebdata.cn; ceph-users
Subject: Re: [ceph-users] OSD up takes 15 minutes after machine restarts

Hi Samuel,

wondering if you have the bluestore_fsck_on_mount option set to true? Can you see a high read load on the OSD device(s) during startup? If so it might be fsck running, which takes that long.

Thanks,
Igor

On 1/19/2020 11:53 AM, huxia...@horebdata.cn wrote:
> Dear folks,
>
> I had a strange situation with a 3-node Ceph cluster on Luminous 12.2.12 with BlueStore. Each machine has 5 OSDs on HDD, and each OSD uses a 30GB DB/WAL partition on SSD. At the beginning, without much data, the OSDs came up quickly when a node restarted. Then I ran a 4-day stress test with vdbench, restarted one node, and to my surprise the OSDs on that node took ca. 15 minutes to reach the "up" state.
>
> How can I speed up the OSD "up" process?
>
> thanks,
> Samuel
>
> huxia...@horebdata.cn

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
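If fsck-on-mount turns out to be the cause, the option can be checked and then disabled persistently. A minimal sketch, assuming a Luminous-era admin socket is available on the OSD host; the OSD id (osd.0) is illustrative:

```shell
# Check whether BlueStore runs an fsck at every mount (via the OSD admin socket)
ceph daemon osd.0 config get bluestore_fsck_on_mount
ceph daemon osd.0 config get bluestore_fsck_on_mount_deep

# To disable it persistently, add to /etc/ceph/ceph.conf on the OSD nodes:
#   [osd]
#   bluestore_fsck_on_mount = false
# and restart the OSDs, e.g.:
#   systemctl restart ceph-osd@0
```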
[ceph-users] OSD up takes 15 minutes after machine restarts
Dear folks,

I had a strange situation with a 3-node Ceph cluster on Luminous 12.2.12 with BlueStore. Each machine has 5 OSDs on HDD, and each OSD uses a 30GB DB/WAL partition on SSD. At the beginning, without much data, the OSDs came up quickly when a node restarted. Then I ran a 4-day stress test with vdbench, restarted one node, and to my surprise the OSDs on that node took ca. 15 minutes to reach the "up" state.

How can I speed up the OSD "up" process?

thanks,
Samuel

huxia...@horebdata.cn
Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery
hello Robert,

thanks for the quick reply. I did test with

osd op queue = wpq
osd op queue cut off = high

and

osd_recovery_op_priority = 1
osd recovery delay start = 20
osd recovery max active = 1
osd recovery max chunk = 1048576
osd recovery sleep = 1
osd recovery sleep hdd = 1
osd recovery sleep ssd = 1
osd recovery sleep hybrid = 1
osd recovery priority = 1
osd max backfills = 1
osd backfill scan max = 16
osd backfill scan min = 4
osd_op_thread_suicide_timeout = 300

But the cluster still showed extremely heavy recovery activity at the beginning of the recovery, and only after ca. 5-10 minutes did the recovery gradually come under control. I guess this is quite similar to what you encountered in Nov. 2015. It is really annoying. What else can I do to mitigate this weird initial-recovery issue? Any suggestions are much appreciated.

thanks again,
samuel

huxia...@horebdata.cn

From: Robert LeBlanc
Date: 2019-10-17 21:23
To: huxia...@horebdata.cn
CC: ceph-users
Subject: Re: Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery

On Thu, Oct 17, 2019 at 12:08 PM huxia...@horebdata.cn wrote:
>
> I happened to find a note that you wrote in Nov 2015:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-November/006173.html
> and I believe I just hit exactly the same behavior: a host going down takes client performance down to 1/10 (with a 200MB/s recovery workload), and it then took ten minutes to get good control of OSD recovery.
>
> Could you please share how you eventually solved that issue? By setting a fairly large OSD recovery delay start, or some other parameter?

Wow! Dusting off the cobwebs here. I think this is what led me to dig into the code and write the WPQ scheduler. I can't remember doing anything specific. I'm sorry I'm not much help in this regard.
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery
Hello Robert,

I happened to find a note that you wrote in Nov 2015:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-November/006173.html
and I believe I just hit exactly the same behavior: a host going down takes client performance down to 1/10 (with a 200MB/s recovery workload), and it then took ten minutes to get good control of OSD recovery.

Could you please share how you eventually solved that issue? By setting a fairly large OSD recovery delay start, or some other parameter?

best regards,
samuel

huxia...@horebdata.cn

From: Robert LeBlanc
Date: 2019-10-16 21:46
To: huxia...@horebdata.cn
CC: ceph-users
Subject: Re: Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery

On Wed, Oct 16, 2019 at 11:53 AM huxia...@horebdata.cn wrote:
>
> My Ceph version is Luminous 12.2.12. Do you think I should upgrade to Nautilus, or will Nautilus have better control of recovery/backfilling?

We have a Jewel cluster and a Luminous cluster that we have changed these settings on, and it really helped both of them.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery
My Ceph version is Luminous 12.2.12. Do you think I should upgrade to Nautilus, or will Nautilus have better control of recovery/backfilling?

best regards,
Samuel

huxia...@horebdata.cn

From: Robert LeBlanc
Date: 2019-10-14 16:27
To: huxia...@horebdata.cn
CC: ceph-users
Subject: Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery

On Thu, Oct 10, 2019 at 2:23 PM huxia...@horebdata.cn wrote:
>
> Hi, folks,
>
> I have a middle-size Ceph cluster serving as cinder backup for OpenStack (Queens). During testing, one Ceph node went down unexpectedly and powered up again ca. 10 minutes later, and the Ceph cluster started PG recovery. To my surprise, VM IOPS dropped dramatically during recovery, from ca. 13K IOPS to about 400, a factor of 1/30, even though I had put stringent throttling on backfill and recovery with the following ceph parameters:
>
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_client_op_priority = 63
> osd_recovery_op_priority = 1
> osd_recovery_sleep = 0.5
>
> The weirdest thing is:
> 1) When there is no IO activity from any VM (all VMs are quiet except the recovery IO), the recovery bandwidth is ca. 10MiB/s, 2 objects/s. It seems the recovery throttle settings work properly.
> 2) When running FIO inside a VM, the recovery bandwidth quickly climbs above 200MiB/s, 60 objects/s; FIO IOPS inside the VM, however, is only about 400 IOPS (8KiB block size), around 3MiB/s. Obviously recovery throttling does NOT work properly.
> 3) If I stop the FIO test in the VM, the recovery bandwidth goes back down to 10MiB/s, 2 objects/s. Strange enough.
>
> How can this weird behavior happen? I wonder, is there a way to configure recovery bandwidth to a specific value, or the number of recovery objects per second? That would give better control of backfilling/recovery, instead of the faulty logic of relative osd_client_op_priority vs osd_recovery_op_priority.
>
> Any ideas or suggestions to bring recovery under control?
>
> best regards,
>
> Samuel

Not sure which version of Ceph you are on, but add these to your /etc/ceph/ceph.conf on all your OSDs and restart them:

osd op queue = wpq
osd op queue cut off = high

That should really help and make backfills and recovery non-impactful. This will be the default in Octopus.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
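The change Robert describes can be applied as follows. A sketch, assuming the OSDs read /etc/ceph/ceph.conf and are managed by systemd; adjust unit names to your deployment:

```shell
# Append the WPQ queue settings to ceph.conf on every OSD node
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
osd op queue = wpq
osd op queue cut off = high
EOF

# Restart the OSDs on this node so the new queue options take effect
systemctl restart ceph-osd.target
```

Restarting OSDs one node at a time (with noout set) avoids triggering additional recovery while rolling the change out.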
[ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery
Hi, folks,

I have a middle-size Ceph cluster serving as cinder backup for OpenStack (Queens). During testing, one Ceph node went down unexpectedly and powered up again ca. 10 minutes later, and the Ceph cluster started PG recovery. To my surprise, VM IOPS dropped dramatically during recovery, from ca. 13K IOPS to about 400, a factor of 1/30, even though I had put stringent throttling on backfill and recovery with the following ceph parameters:

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_client_op_priority = 63
osd_recovery_op_priority = 1
osd_recovery_sleep = 0.5

The weirdest thing is:
1) When there is no IO activity from any VM (all VMs are quiet except the recovery IO), the recovery bandwidth is ca. 10MiB/s, 2 objects/s. It seems the recovery throttle settings work properly.
2) When running FIO inside a VM, the recovery bandwidth quickly climbs above 200MiB/s, 60 objects/s; FIO IOPS inside the VM, however, is only about 400 IOPS (8KiB block size), around 3MiB/s. Obviously recovery throttling does NOT work properly.
3) If I stop the FIO test in the VM, the recovery bandwidth goes back down to 10MiB/s, 2 objects/s. Strange enough.

How can this weird behavior happen? I wonder, is there a way to configure recovery bandwidth to a specific value, or the number of recovery objects per second? That would give better control of backfilling/recovery, instead of the faulty logic of relative osd_client_op_priority vs osd_recovery_op_priority.

Any ideas or suggestions to bring recovery under control?

best regards,
Samuel

huxia...@horebdata.cn
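The idle-cluster numbers above are consistent with sleep-based throttling: with osd_recovery_sleep = 0.5 and osd_recovery_max_active = 1, an OSD issues at most 1/0.5 = 2 recovery ops per second. A back-of-envelope sketch; the 5 MiB effective object size is inferred from the reported 10 MiB/s at 2 objects/s, not a Ceph default:

```python
def recovery_rate(sleep_s: float, max_active: int, object_mib: float):
    """Rough upper bound on per-OSD recovery throughput under sleep throttling.

    Each active recovery slot performs one op, then sleeps for sleep_s.
    Returns (objects per second, MiB per second).
    """
    ops_per_s = max_active / sleep_s
    return ops_per_s, ops_per_s * object_mib

ops, mib = recovery_rate(sleep_s=0.5, max_active=1, object_mib=5.0)
print(ops, mib)  # 2.0 10.0 -- matching the observed idle-cluster recovery rate
```

This also hints at why the throttle seemed to vanish under client load: the sleep is per-OSD, so once many OSDs participate in recovery simultaneously the aggregate rate can climb well past the single-OSD bound.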
[ceph-users] pgs inconsistent
Dear folks,

I had a Ceph cluster with replication 2, 3 nodes, each node with 3 OSDs, on Luminous 12.2.12. Some days ago one OSD went down (the disk itself is still fine) due to a rocksdb crash. I tried to restart that OSD but failed. So I tried to rebalance, but then encountered inconsistent PGs. What can I do to make the cluster work again? Thanks a lot for helping me out.

Samuel

**
# ceph -s
  cluster:
    id:     289e3afa-f188-49b0-9bea-1ab57cc2beb8
    health: HEALTH_ERR
            pauserd,pausewr,noout flag(s) set
            191444 scrub errors
            Possible data damage: 376 pgs inconsistent

  services:
    mon: 3 daemons, quorum horeb71,horeb72,horeb73
    mgr: horeb73(active), standbys: horeb71, horeb72
    osd: 9 osds: 8 up, 8 in
         flags pauserd,pausewr,noout

  data:
    pools:   1 pools, 1024 pgs
    objects: 524.29k objects, 1.99TiB
    usage:   3.67TiB used, 2.58TiB / 6.25TiB avail
    pgs:     645 active+clean
             376 active+clean+inconsistent
             3   active+clean+scrubbing+deep
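With replication 2 there is no third copy to arbitrate between replicas, so repairs should be approached cautiously. A sketch of the usual inspection/repair loop (these commands exist in Luminous; the pool name and PG id are illustrative, and the pauserd/pausewr flags must be cleared before client IO can resume):

```shell
# List which PGs in a pool are flagged inconsistent
rados list-inconsistent-pg <pool-name>

# Inspect one inconsistent PG in detail (requires a recent deep-scrub)
rados list-inconsistent-obj 1.2f --format=json-pretty

# Ask Ceph to repair the PG from the copy it deems authoritative
ceph pg repair 1.2f

# Once the cluster is healthy again, clear the flags
ceph osd unset pauserd
ceph osd unset pausewr
ceph osd unset noout
```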
[ceph-users] Possibly a bug on rocksdb
:11:02.300110 7ff4822f1d80 0 set rocksdb option compaction_threads = 32
-12> 2019-08-11 17:11:02.300121 7ff4822f1d80 0 set rocksdb option compression = kNoCompression
-11> 2019-08-11 17:11:02.300129 7ff4822f1d80 0 set rocksdb option flusher_threads = 8
-10> 2019-08-11 17:11:02.300135 7ff4822f1d80 0 set rocksdb option level0_file_num_compaction_trigger = 64
-9> 2019-08-11 17:11:02.300142 7ff4822f1d80 0 set rocksdb option level0_slowdown_writes_trigger = 128
-8> 2019-08-11 17:11:02.300146 7ff4822f1d80 0 set rocksdb option level0_stop_writes_trigger = 256
-7> 2019-08-11 17:11:02.300150 7ff4822f1d80 0 set rocksdb option max_background_compactions = 64
-6> 2019-08-11 17:11:02.300155 7ff4822f1d80 0 set rocksdb option max_bytes_for_level_base = 2GB
-5> 2019-08-11 17:11:02.300159 7ff4822f1d80 0 set rocksdb option max_write_buffer_number = 64
-4> 2019-08-11 17:11:02.300166 7ff4822f1d80 0 set rocksdb option min_write_buffer_number_to_merge = 32
-3> 2019-08-11 17:11:02.300176 7ff4822f1d80 0 set rocksdb option recycle_log_file_num = 64
-2> 2019-08-11 17:11:02.300185 7ff4822f1d80 0 set rocksdb option target_file_size_base = 4MB
-1> 2019-08-11 17:11:02.300193 7ff4822f1d80 0 set rocksdb option write_buffer_size = 4MB
0> 2019-08-11 17:11:02.819067 7ff4822f1d80 -1 *** Caught signal (Aborted) ** in thread 7ff4822f1d80 thread_name:ceph-osd

huxia...@horebdata.cn
[ceph-users] Luminous OSD can not be up
Hi, Folks,

I just encountered an OSD going down that cannot come up again. Attached below are the log messages. Can anyone tell what is wrong with this OSD, and what I should do?

thanks in advance,
Samuel

***
# tail -500 /var/log/ceph/ceph-osd.0.log.522
2019-05-22 09:23:35.139017 7f9d18f71d80 0 set rocksdb option min_write_buffer_number_to_merge = 32
2019-05-22 09:23:35.139019 7f9d18f71d80 0 set rocksdb option recycle_log_file_num = 64
2019-05-22 09:23:35.139023 7f9d18f71d80 0 set rocksdb option target_file_size_base = 4MB
2019-05-22 09:23:35.139025 7f9d18f71d80 0 set rocksdb option write_buffer_size = 4MB
2019-05-22 09:23:35.139058 7f9d18f71d80 0 set rocksdb option compaction_readahead_size = 2MB
2019-05-22 09:23:35.139063 7f9d18f71d80 0 set rocksdb option compaction_style = kCompactionStyleLevel
2019-05-22 09:23:35.139067 7f9d18f71d80 0 set rocksdb option compaction_threads = 32
2019-05-22 09:23:35.139069 7f9d18f71d80 0 set rocksdb option compression = kNoCompression
2019-05-22 09:23:35.139072 7f9d18f71d80 0 set rocksdb option flusher_threads = 8
2019-05-22 09:23:35.139074 7f9d18f71d80 0 set rocksdb option level0_file_num_compaction_trigger = 64
2019-05-22 09:23:35.139077 7f9d18f71d80 0 set rocksdb option level0_slowdown_writes_trigger = 128
2019-05-22 09:23:35.139079 7f9d18f71d80 0 set rocksdb option level0_stop_writes_trigger = 256
2019-05-22 09:23:35.139081 7f9d18f71d80 0 set rocksdb option max_background_compactions = 64
2019-05-22 09:23:35.139084 7f9d18f71d80 0 set rocksdb option max_bytes_for_level_base = 6GB
2019-05-22 09:23:35.139086 7f9d18f71d80 0 set rocksdb option max_write_buffer_number = 64
2019-05-22 09:23:35.139088 7f9d18f71d80 0 set rocksdb option min_write_buffer_number_to_merge = 32
2019-05-22 09:23:35.139090 7f9d18f71d80 0 set rocksdb option recycle_log_file_num = 64
2019-05-22 09:23:35.139093 7f9d18f71d80 0 set rocksdb option target_file_size_base = 4MB
2019-05-22 09:23:35.139095 7f9d18f71d80 0 set rocksdb option write_buffer_size = 4MB
2019-05-22 09:23:35.171770 7f9d18f71d80 0 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.12/rpm/el7/BUILD/ceph-12.2.12/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
2019-05-22 09:23:35.172018 7f9d18f71d80 0 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.12/rpm/el7/BUILD/ceph-12.2.12/src/cls/hello/cls_hello.cc:296: loading cls_hello
2019-05-22 09:23:35.172865 7f9d18f71d80 0 _get_class not permitted to load kvs
2019-05-22 09:23:35.173387 7f9d18f71d80 0 _get_class not permitted to load lua
2019-05-22 09:23:35.178435 7f9d18f71d80 0 _get_class not permitted to load sdk
2019-05-22 09:23:35.179798 7f9d18f71d80 0 osd.0 723 crush map has features 288514051259236352, adjusting msgr requires for clients
2019-05-22 09:23:35.179806 7f9d18f71d80 0 osd.0 723 crush map has features 288514051259236352 was 8705, adjusting msgr requires for mons
2019-05-22 09:23:35.179809 7f9d18f71d80 0 osd.0 723 crush map has features 1009089991638532096, adjusting msgr requires for osds
2019-05-22 09:23:35.245033 7f9d18f71d80 0 osd.0 723 load_pgs
2019-05-22 09:25:01.427911 7f9d18f71d80 -1 *** Caught signal (Segmentation fault) ** in thread 7f9d18f71d80 thread_name:ceph-osd

ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
1: (()+0xa64ee1) [0x55fc16d31ee1]
2: (()+0xf6d0) [0x7f9d1621e6d0]
3: (std::__detail::_List_node_base::_M_hook(std::__detail::_List_node_base*)+0x4) [0x7f9d15b5ce54]
4: (pg_log_t::pg_log_t(pg_log_t const&)+0x109) [0x55fc168d1e79]
5: (PGLog::IndexedLog::operator=(PGLog::IndexedLog const&)+0x29) [0x55fc168dcb29]
6: (void PGLog::read_log_and_missing >(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set&, bool, std::basic_ostringstream, std::allocator >&, bool, bool*, DoutPrefixProvider const*, std::set, std::allocator >*, bool)+0x793) [0x55fc168ddde3]
7: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b) [0x55fc1687fc2b]
8: (OSD::load_pgs()+0x9b4) [0x55fc167cdd64]
9: (OSD::init()+0x2169) [0x55fc167ec8a9]
10: (main()+0x2d07) [0x55fc166edef7]
11: (__libc_start_main()+0xf5) [0x7f9d1522b445]
12: (()+0x4c0dc3) [0x55fc1678ddc3]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
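The segfault occurs in PGLog::read_log_and_missing while the OSD loads its PGs, which points at a corrupt on-disk PG log rather than a hardware fault. One common salvage path is to export and then remove the offending PG copy with ceph-objectstore-tool so the OSD can start and the PG backfills from its peers. A hedged sketch only; the PG id and data path are illustrative, the OSD must be stopped first, and identifying which PG is at fault may require running with debug logging:

```shell
# With the OSD stopped, export the suspect PG before touching anything
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op export --pgid 1.2f --file /root/pg1.2f.export

# Remove the corrupt local copy; the PG will then recover from its peers
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op remove --pgid 1.2f --force
```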
Re: [ceph-users] How to tune Ceph RBD mirroring parameters to speed up replication
thanks a lot, Jason. How much performance loss should I expect from enabling RBD mirroring? I really need to minimize any performance impact while using this disaster-recovery feature. Would a dedicated journal on Intel Optane NVMe help? If so, how big should it be?

cheers,
Samuel

huxia...@horebdata.cn

From: Jason Dillaman
Date: 2019-04-03 23:03
To: huxia...@horebdata.cn
CC: ceph-users
Subject: Re: [ceph-users] How to tune Ceph RBD mirroring parameters to speed up replication

For better or worse, out of the box, librbd and rbd-mirror are configured to conserve memory at the expense of performance, to support the potential case of thousands of images being mirrored with only a single "rbd-mirror" daemon attempting to handle the load.

You can optimize writes by adding "rbd_journal_max_payload_bytes = 8388608" to the "[client]" section on the librbd client nodes. Normally, writes larger than 16KiB are broken into multiple journal entries to allow the remote "rbd-mirror" daemon to make forward progress w/o using too much memory, so this ensures large IOs only require a single journal entry.

You can also add "rbd_mirror_journal_max_fetch_bytes = 33554432" to the "[client]" section on the "rbd-mirror" daemon nodes and restart the daemon for the change to take effect. Normally, the daemon tries to nibble the per-image journal events to prevent excessive memory use in the case where potentially thousands of images are being mirrored.

On Wed, Apr 3, 2019 at 4:34 PM huxia...@horebdata.cn wrote:
>
> Hello, folks,
>
> I am setting up two ceph clusters to test async replication via RBD mirroring. The two clusters are very close, just in two buildings about 20m apart, and the networking is very good as well, a 10Gb fiber connection. In this case, how should I tune the relevant RBD mirroring parameters to accelerate replication?
>
> thanks in advance,
>
> Samuel

--
Jason
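Jason's numbers imply a simple relationship: a write is split into roughly ceil(size / max_payload) journal entries, so raising rbd_journal_max_payload_bytes from the ~16KiB split threshold to 8MiB collapses a large IO into a single entry. A sketch of that arithmetic; the splitting rule is simplified here and actual librbd behavior has more nuance:

```python
import math

def journal_entries(write_bytes: int, max_payload_bytes: int) -> int:
    """Number of journal entries one write is split into (simplified model)."""
    return math.ceil(write_bytes / max_payload_bytes)

# A 4 MiB write under the default ~16 KiB payload limit vs. the tuned 8 MiB limit
print(journal_entries(4 << 20, 16 << 10))  # 256 entries
print(journal_entries(4 << 20, 8 << 20))   # 1 entry
```

Fewer, larger entries reduce per-entry overhead on the librbd side and let the remote rbd-mirror daemon fetch replicated data in bigger batches.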
[ceph-users] How to tune Ceph RBD mirroring parameters to speed up replication
Hello, folks,

I am setting up two ceph clusters to test async replication via RBD mirroring. The two clusters are very close, just in two buildings about 20m apart, and the networking is very good as well, a 10Gb fiber connection. In this case, how should I tune the relevant RBD mirroring parameters to accelerate replication?

thanks in advance,
Samuel