[ceph-users] CRUSH rebalance all at once or host-by-host?
We’re adding a CRUSH hierarchy retrospectively in preparation for a big expansion. Previously we only had host and osd buckets; now we’ve added rack buckets. I’ve set what have worked in the past as sensible settings to limit rebalancing impact:

osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_priority = 5
osd_client_op_priority = 63
osd_recovery_max_active = 3

I thought it would save a lot of unnecessary data movement if I moved the existing host buckets into the new rack buckets all at once, rather than host-by-host. As long as recovery is throttled correctly, the thinking goes, it shouldn’t matter how many objects are misplaced.

1) Is doing it all at once advisable, or am I putting myself at much greater risk if I do have failures during the rebalance (which could take quite a while)?

2) My failure domain is currently set at the host level. If I want to change the failure domain to ‘rack’, when is the best time to change it (e.g. after the rebalancing from moving the hosts into the racks finishes)?

v12.2.2 if it makes a difference.

Cheers,
Sean M

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
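The batch move described above can be sketched roughly as follows. This is a hedged illustration, not from the original post: the rack and host names (rack1, rack2, host-a, host-b) are placeholders, and the script only prints the commands (dry-run) so it can be reviewed before anything is run against a real cluster.

```shell
#!/bin/sh
# Sketch: create rack buckets and re-parent the existing host buckets in
# one batch, with rebalancing paused so only a single data movement starts
# once all moves are in the CRUSH map. All names are hypothetical.
run() { echo "+ $*"; }   # dry-run wrapper: prints instead of executing

run ceph osd set norebalance                # pause rebalancing during map edits
run ceph osd crush add-bucket rack1 rack    # new rack buckets
run ceph osd crush add-bucket rack2 rack
run ceph osd crush move rack1 root=default  # place racks under the root
run ceph osd crush move rack2 root=default
run ceph osd crush move host-a rack=rack1   # re-parent hosts in one batch
run ceph osd crush move host-b rack=rack2
run ceph osd unset norebalance              # one rebalance for all moves
```

Note the failure-domain change (host to rack) is a separate step: the CRUSH rule's chooseleaf type would also need editing, which triggers its own data movement.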
Re: [ceph-users] Infiniband backend OSD communication
Ok, so IPoIB is required...

><> nathan stratton

On Mon, Jan 6, 2020 at 4:45 AM Wei Zhao wrote:
> From my understanding, the basic idea is that Ceph exchanges RDMA
> information (QP, GID and so on) over IP addresses on the RDMA device, and
> then the daemons communicate with each other through RDMA. But in my tests
> there seemed to be some issues in that code.
>
> On Fri, Jan 3, 2020 at 2:24 AM Nathan Stratton wrote:
> >
> > I am working on upgrading my current ethernet-only Ceph cluster to a
> > combined ethernet frontend and infiniband backend. From my research I
> > understand that I set:
> >
> > ms_cluster_type = async+rdma
> > ms_async_rdma_device_name = mlx4_0
> >
> > What I don't understand is how Ceph knows how to reach each OSD over
> > RDMA. Do I have to run IPoIB on top of infiniband and use that for OSD
> > addresses?
> >
> > Is there a way to use infiniband on the backend without IPoIB and just
> > use RDMA verbs?
> >
> > ><>
> > nathan stratton
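For reference, a minimal sketch of what a ceph.conf fragment for an RDMA cluster backend over IPoIB might look like. Hedged: only ms_cluster_type and ms_async_rdma_device_name appear in the thread; the subnet is a made-up placeholder, and other ms_async_rdma_* tunables vary by release.

```ini
# Hypothetical sketch. The cluster_network addresses live on the IPoIB
# interface (e.g. ib0); Ceph exchanges QP/GID information over those IP
# addresses, then carries the actual OSD traffic over RDMA verbs.
[global]
cluster_network = 10.10.10.0/24        ; placeholder subnet on the ib0 interface

ms_cluster_type = async+rdma           ; from the thread
ms_async_rdma_device_name = mlx4_0     ; from the thread
; further ms_async_rdma_* options exist (GID/port selection); check the
; documentation for your Ceph release before relying on them
```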
[ceph-users] ceph (jewel) unable to recover after node failure
here is the output of ceph health detail:

HEALTH_ERR 16 pgs are stuck inactive for more than 300 seconds; 134 pgs backfill_wait; 11 pgs backfilling; 69 pgs degraded; 14 pgs down; 2 pgs incomplete; 14 pgs peering; 6 pgs recovery_wait; 69 pgs stuck degraded; 16 pgs stuck inactive; 167 pgs stuck unclean; 63 pgs stuck undersized; 63 pgs undersized; 29 requests are blocked > 32 sec; 6 osds have slow requests; recovery 667605/71152293 objects degraded (0.938%); recovery 1564114/71152293 objects misplaced (2.198%); too many PGs per OSD (314 > max 300)
pg 8.3ec is stuck inactive for 17320.100016, current state down+peering, last acting [22,40,49]
pg 9.3ac is stuck inactive since forever, current state down+remapped+peering, last acting [36]
pg 9.243 is stuck inactive for 17602.030517, current state incomplete, last acting [34,47,26]
pg 9.23e is stuck inactive since forever, current state down+remapped+peering, last acting [18]
pg 11.7a is stuck inactive since forever, current state down+remapped+peering, last acting [13,25]
pg 9.66 is stuck inactive since forever, current state down+remapped+peering, last acting [20]
pg 8.6c is stuck inactive for 17196.609471, current state down+peering, last acting [34,17,48]
pg 8.143 is stuck inactive for 17201.229429, current state down+remapped+peering, last acting [39,19]
pg 10.103 is stuck inactive for 17544.862477, current state down+peering, last acting [30,19,53]
pg 8.ae is stuck inactive for 17518.839339, current state down+peering, last acting [39,21,52]
pg 8.37 is stuck inactive for 17520.793755, current state down+peering, last acting [15,40,52]
pg 7.399 is stuck inactive since forever, current state down+remapped+peering, last acting [21]
pg 7.210 is stuck inactive for 17535.412721, current state incomplete, last acting [22,49,15]
pg 7.136 is stuck inactive for 40796.009480, current state down+remapped+peering, last acting [46]
pg 9.38 is stuck inactive since forever, current state down+remapped+peering, last acting [46]
pg 7.36 is stuck inactive since forever, current state down+remapped+peering, last acting [20]
pg 9.3ff is stuck unclean for 59505.890789, current state active+remapped+wait_backfill, last acting [48,53,33]
pg 9.3e8 is stuck unclean for 21312.446345, current state active+remapped+wait_backfill, last acting [28,53,27]
pg 9.3df is stuck unclean for 17346.719500, current state active+undersized+degraded+remapped+wait_backfill, last acting [28,46]
pg 7.3c8 is stuck unclean for 86528.672542, current state active+remapped+wait_backfill, last acting [30,35,40]
pg 9.3b1 is stuck unclean for 17859.207821, current state active+remapped+wait_backfill, last acting [35,40,14]
pg 7.3b8 is stuck unclean for 88517.511151, current state active+undersized+degraded+remapped+wait_backfill, last acting [42,14]
pg 9.398 is stuck unclean for 41016.001863, current state active+undersized+degraded+remapped+wait_backfill, last acting [32,12]
pg 7.38b is stuck unclean for 41003.853238, current state active+remapped+wait_backfill, last acting [13,34,42]
pg 7.36d is stuck unclean for 18780.388726, current state active+undersized+degraded+remapped+wait_backfill, last acting [32,29]
pg 9.363 is stuck unclean for 59589.647646, current state active+remapped+wait_backfill, last acting [40,16,32]
pg 7.369 is stuck unclean for 17601.998787, current state active+undersized+degraded+remapped+wait_backfill, last acting [31,15]
pg 9.368 is stuck unclean for 41558.892612, current state active+remapped+wait_backfill, last acting [21,25,19]
pg 7.34d is stuck unclean for 41015.946070, current state active+remapped+wait_backfill, last acting [48,14,22]
pg 9.3db is stuck unclean for 50487.572088, current state active+remapped+wait_backfill, last acting [40,33,52]
pg 7.30c is stuck unclean for 98943.868376, current state active+remapped+wait_backfill, last acting [12,39,16]
pg 7.3a5 is stuck unclean for 26487.349029, current state active+remapped+wait_backfill, last acting [36,28,33]
pg 8.2d3 is stuck unclean for 98535.669203, current state active+recovery_wait+degraded, last acting [30,33,52]
pg 7.2d6 is stuck unclean for 17769.739311, current state active+remapped+wait_backfill, last acting [16,15,36]
pg 9.2b2 is stuck unclean for 67277.008904, current state active+undersized+degraded+remapped+wait_backfill, last acting [40,19]
pg 9.2b5 is stuck unclean for 17510.383905, current state active+remapped+wait_backfill, last acting [32,29,33]
pg 9.2b8 is stuck unclean for 17601.978526, current state active+remapped+backfilling, last acting [18,21,50]
pg 9.2a1 is stuck unclean for 41018.243699, current state active+undersized+degraded+remapped+wait_backfill, last acting [28,49]
pg 9.2a8 is stuck unclean for 59129.277638, current state active+remapped+wait_backfill, last acting [15,17,44]
pg 7.295 is stuck unclean for 17859.207323, current state active+undersized+degraded+remapped+wait_backfill, last acting [38,21]
pg 7.28b is stuck unclean for
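The down and incomplete PGs in a listing like the one above are usually inspected per-pg before deciding on a repair strategy. A hedged sketch (dry-run wrapper, so nothing is executed; the pg ids are taken from the listing):

```shell
#!/bin/sh
# Sketch: inspect stuck/down pgs from a 'ceph health detail' listing.
# Dry-run: only prints the commands for review.
run() { echo "+ $*"; }

run ceph pg dump_stuck inactive   # list all stuck-inactive pgs in one go
run ceph pg 8.3ec query           # a down+peering pg from the listing above
run ceph pg 9.243 query           # an 'incomplete' pg from the listing above
run ceph osd tree                 # cross-check which OSDs are down or out
```

The query output shows why peering is blocked (e.g. which OSDs the pg is waiting for), which is what determines whether the pg can recover once those OSDs return.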
[ceph-users] ceph (jewel) unable to recover after node failure
Hi,

after a node failure ceph is unable to recover, i.e. unable to reintegrate the failed node back into the cluster.

What happened?

1. A node with 11 OSDs crashed. The remaining 4 nodes (also with 11 OSDs each) re-balanced, although reporting the following error condition: too many PGs per OSD (314 > max 300).

2. After we put the failed node back online, automatic recovery started, but very soon (after a few minutes) we saw OSDs randomly going down and up on ALL the OSD nodes (not only on the one that had failed). The CPU load on the nodes was very high (load average 120).

3. The situation seemed to get worse over time (more and more OSDs going down, fewer coming back up), so we switched the node that had failed off again.

4. After that the cluster "calmed down" and CPU load became normal (load average ~4-5). We manually restarted the daemons of the OSDs that were still down, and one after the other these OSDs came back up.

Recovery processes are still running now, but it seems to me that 14 PGs are not recoverable. Output of ceph -s:

health HEALTH_ERR
16 pgs are stuck inactive for more than 300 seconds
255 pgs backfill_wait
16 pgs backfilling
205 pgs degraded
14 pgs down
2 pgs incomplete
14 pgs peering
48 pgs recovery_wait
205 pgs stuck degraded
16 pgs stuck inactive
335 pgs stuck unclean
156 pgs stuck undersized
156 pgs undersized
25 requests are blocked > 32 sec
recovery 1788571/71151951 objects degraded (2.514%)
recovery 2342374/71151951 objects misplaced (3.292%)
too many PGs per OSD (314 > max 300)

I have a few questions now:

A. Will ceph be able to recover over time? I am afraid that the 14 PGs that are down will not recover.

B. What caused the OSDs to go down and up during recovery after the failed OSD node came back online (step 2 above)? I suspect that the high CPU load we saw on all the nodes caused timeouts in the OSD daemons. Is this a reasonable assumption?

C. If indeed all this was caused by such an overload, is there a way to make the recovery process less CPU intensive?

D. What would you advise me to do/try to recover to a healthy state?

In what follows I try to give some more background information (configuration, log messages).

ceph version: 10.2.11
OS version: debian jessie [yes, I know this is old]
cluster: 5 OSD nodes (12 cores, 64G RAM), 11 OSDs per node, each OSD daemon controls a 2 TB harddrive. The journals are written to an SSD.

ceph.conf:
-
[global]
fsid = [censored]
mon_initial_members = salomon, simon, ramon
mon_host = 10.65.16.44, 10.65.16.45, 10.65.16.46
public_network = 10.65.16.0/24
cluster_network = 10.65.18.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon osd down out interval = 7200
--

Log messages (examples): we see a lot of:

Jan 7 18:52:22 bruce ceph-osd[9184]: 2020-01-07 18:52:22.411377 7f0ebd93b700 -1 osd.29 15636 heartbeat_check: no reply from 10.65.16.43:6822 osd.48 since back 2020-01-07 18:51:20.119784 front 2020-01-07 18:52:21.575852 (cutoff 2020-01-07 18:52:02.411330)

however, all the networks were up (the machines could ping each other).
I guess these are the log messages of OSDs going down (on one of the nodes):

Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729691 7fbe5ee73700 -1 osd.25 15017 *** Got signal Interrupt ***
Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729701 7fbe5ee73700 -1 osd.25 15017 shutdown
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940577 7fb47fda5700 -1 osd.27 15023 *** Got signal Interrupt ***
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940598 7fb47fda5700 -1 osd.27 15023 shutdown
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037075 7f4aa0a00700 -1 osd.24 15023 *** Got signal Interrupt ***
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037087 7f4aa0a00700 -1 osd.24 15023 shutdown
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511811 7fd6c26a8700 -1 osd.22 15042 *** Got signal Interrupt ***
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511869 7fd6c26a8700 -1 osd.22 15042 shutdown

Best regards,
Hp

--
Hanspeter Kunz, Systems Administrator
University of Zurich, Department of Informatics
Binzmühlestrasse 14, CH-8050 Zurich, Switzerland
Email: hk...@ifi.uzh.ch  Tel: +41.(0)44.63-56714  Office 2.E.07
http://www.ifi.uzh.ch
Spamtraps: hkunz.bo...@ailab.ch hkunz.bo...@ifi.uzh.ch
---
Rome wasn't burnt in a day.
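For the flapping described in step 2, a common mitigation is to stop the monitors from marking overloaded OSDs down/out while recovery runs, and to throttle recovery itself. A hedged, dry-run sketch (nothing is executed; the heartbeat-grace value of 60 is an assumed example, not from the post):

```shell
#!/bin/sh
# Sketch: reduce OSD flapping during heavy recovery on a jewel cluster.
# Dry-run: only prints the commands for review.
run() { echo "+ $*"; }

run ceph osd set noout     # don't start extra rebalancing when OSDs drop briefly
run ceph osd set nodown    # ignore transient heartbeat failures
run ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
run ceph tell 'osd.*' injectargs '--osd-heartbeat-grace 60'   # assumed value

# once recovery settles, the flags must be removed again:
run ceph osd unset nodown
run ceph osd unset noout
```

With nodown set, overloaded OSDs that miss heartbeats stay "up" in the map, which avoids the cascading peering storms that drive the load even higher.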
Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging
Quoting Paul Emmerich (paul.emmer...@croit.io):
> We've also seen some problems with FileStore on newer kernels; 4.9 is the
> last kernel that worked reliably with FileStore in my experience.
>
> But I haven't seen problems with BlueStore related to the kernel version
> (well, except for that scrub bug, but my work-around for that is in all
> release versions).

What scrub bug are you talking about?

Gr. Stefan

--
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / i...@bit.nl
Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging
Quoting Jelle de Jong (jelledej...@powercraft.nl):
> question 2: what systemd target can I use to run a service after all
> ceph-osds are loaded? I tried ceph.target and ceph-osd.target; both do not
> work reliably.

ceph-osd.target works for us (every time). Have you enabled all the individual OSD services, i.e. ceph-osd@0.service?

> question 3: should I still try to upgrade to bluestore or pray to the
> system gods that my performance is back after many many hours of
> troubleshooting?

I would suggest the first; the second is optional ;-). Especially because you have a separate NVMe device you can use for WAL / DB. It has advantages over filestore ...

> I made a few changes I am going to just list for other people who are
> suffering from slow performance after upgrading their Ceph and/or OS.
>
> Disk utilization is back around 10%, no more 80-100% ... and rados bench
> is stable again.
>
> apt-get install irqbalance nftables

^^ Are these some of those changes? Do you need those packages in order to unload / blacklist them? I don't get what your fixes are, or what the problem was. Firewall issues? What Ceph version did you upgrade to?

Gr. Stefan
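The advice about enabling the individual OSD services can be sketched like this. Hedged: the OSD ids (0..2) and the custom unit body are hypothetical placeholders, and the script is dry-run (it only prints what would be done):

```shell
#!/bin/sh
# Sketch: enable each per-OSD instance unit so ceph-osd.target actually
# pulls them in, then order a custom service after the target.
# Dry-run: prints the systemctl commands and a sample unit body.
run() { echo "+ $*"; }

for id in 0 1 2; do                     # hypothetical OSD ids on this host
    run systemctl enable ceph-osd@$id.service
done
run systemctl list-dependencies ceph-osd.target   # verify the instances are wanted

# A custom service that must start after the OSDs would declare
# (hypothetical unit body, printed only):
cat <<'EOF'
[Unit]
Description=service that runs after all enabled OSDs have started
After=ceph-osd.target
Wants=ceph-osd.target

[Install]
WantedBy=multi-user.target
EOF
```

ceph-osd.target only orders against OSD instances that are enabled; a unit that was started manually once but never enabled will not be waited for, which may explain the unreliable behaviour.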
Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging
Hello everybody,

I think I fixed the issues after weeks of looking.

question 1: does anyone know how to prevent iptables, nftables or conntrack from being loaded in the first place? Adding them to /etc/modprobe.d/blacklist.local.conf does not seem to work. What is recommended?

question 2: what systemd target can I use to run a service after all ceph-osds are loaded? I tried ceph.target and ceph-osd.target; both do not work reliably.

question 3: should I still try to upgrade to bluestore or pray to the system gods that my performance is back after many many hours of troubleshooting?

I made a few changes; I am going to just list them for other people who are suffering from slow performance after upgrading their Ceph and/or OS.

Disk utilization is back around 10%, no more 80-100% ... and rados bench is stable again.

apt-get install irqbalance nftables

# cat /etc/ceph/ceph.conf
[global]
fsid = 5f8d3724-1a51-4895-9b3e-5eb90ea49782
mon_initial_members = ceph01, ceph02, ceph03
mon_host = 192.168.35.11,192.168.35.12,192.168.35.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default size = 3
public network = 192.168.35.0/28
cluster network = 192.168.35.0/28
osd pool default min size = 2
osd scrub begin hour = 23
osd scrub end hour = 6
# default osd recovery max active = 3
osd recovery max active = 1
#setuser match path = /var/lib/ceph/$type/$cluster-$id
debug_default = 0
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
filestore_op_threads = 8
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_max_ops = 500
filestore_queue_committing_max_ops = 5000
filestore_merge_threshold = 40
filestore_split_multiple = 10
journal_max_write_entries = 1000
journal_queue_max_ops = 3000
journal_max_write_bytes = 1048576000
osd_mkfs_options_xfs = -f -I size=2048
osd_mount_options_xfs = noatime,largeio,nobarrier,inode64,allocsize=8M
ods_op_threads = 32
osd_journal_size = 1
filestore_queue_max_bytes = 1048576000
filestore_queue_committing_max_bytes = 1048576000
journal_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_journal_parallel = true

[client]
rbd cache = true
#rbd cache max dirty = 0

# cat /etc/sysctl.d/30-nic-10gbit.conf
net.ipv4.tcp_rmem = 1000 1000 1000
net.ipv4.tcp_wmem = 1000 1000 1000
net.ipv4.tcp_mem = 1000 1000 1000
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.netdev_max_backlog = 30

Unload all forms of filtering. Blacklisting does not work; the modules keep getting loaded, I guess auto-loaded by the kernel.

echo "blacklist ip_tables" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist iptable_filter" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist ip6_tables" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist ip6table_filter" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist nf_tables" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist nf6_tables" | tee --append /etc/modprobe.d/blacklist.local.conf
depmod -a
update-initramfs -u -k all -v

root@ceph02:~# cat /etc/rc.local
#!/bin/bash -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.
for i in {a..e}; do echo 512 > /sys/block/sd$i/queue/read_ahead_kb; done
for i in {a..d}; do hdparm -q -B 255 -q -W0 /dev/sd$i; done
echo 'on' > '/sys/bus/pci/devices/:00:01.0/power/control'
echo 'on' > '/sys/bus/pci/devices/:00:03.0/power/control'
echo 'on' > '/sys/bus/pci/devices/:00:01.0/power/control'
cpupower frequency-set --governor performance
modprobe -r iptable_filter ip_tables ip6table_filter ip6_tables nf_tables_ipv6 nf_tables_ipv4 nf_tables_bridge nf_tables
array=($(pidof ceph-osd))
taskset -cp 0-5 $(echo ${array[0]})
taskset -cp 12-17 $(echo ${array[1]})
taskset -cp 6-11 $(echo ${array[2]})
taskset -cp 18-23 $(echo ${array[3]})
exit 0

Please also save the pastebin from my OP; there are a lot of benchmark and test notes in there.

root@ceph02:~# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ceph02_396172
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
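On question 1 above: `blacklist` in modprobe.d only stops alias-based autoloading; tools like iptables-save or netfilter userspace still load the modules explicitly, which is consistent with them "keep getting loaded". An `install <module> /bin/false` line blocks that path too. A hedged sketch that only prints the proposed file contents (the file name disable-netfilter.conf is a placeholder):

```shell
#!/bin/sh
# Sketch: modprobe.d contents that block explicit module loads as well as
# alias autoloading. 'install' replaces the insertion command entirely,
# unlike 'blacklist'. Module list taken from the post above.
conf='install ip_tables /bin/false
install iptable_filter /bin/false
install ip6_tables /bin/false
install ip6table_filter /bin/false
install nf_tables /bin/false'

# Print what would be written to /etc/modprobe.d/disable-netfilter.conf
# (placeholder name). Afterwards: depmod -a && update-initramfs -u -k all
echo "$conf"
```

Any service that actually needs netfilter (including Docker or libvirt NAT) will break with these lines in place, so this only makes sense on dedicated storage nodes.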