[ceph-users] CRUSH rebalance all at once or host-by-host?

2020-01-07 Thread Sean Matheny
We’re adding in a CRUSH hierarchy retrospectively in preparation for a big 
expansion. Previously we only had host and osd buckets, and now we’ve added in 
rack buckets.

I’ve set what I think are sensible settings to limit rebalancing, at least
settings that have worked in the past:
osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_priority = 5
osd_client_op_priority = 63
osd_recovery_max_active = 3
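
For what it's worth, I plan to apply these at runtime with injectargs before
touching the map (a rough sketch, same values as above):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3'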

I thought it would save a lot of unnecessary data movement if I move the 
existing host buckets to the new rack buckets all at once, rather than 
host-by-host. As long as recovery is throttled correctly, it shouldn’t matter 
how many objects are misplaced, the thinking goes.
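
Concretely, the plan is something along these lines (the bucket and host names
below are just placeholders, and the norebalance wrapper is optional):

ceph osd set norebalance
ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
ceph osd crush move host01 rack=rack1
ceph osd crush move host02 rack=rack1
# ... repeat for the remaining racks and hosts ...
ceph osd unset norebalance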

1) Is doing it all at once advisable, or am I putting myself at much greater
risk if I do have failures during the rebalance (which could take quite a
while)?
2) My failure domain is currently set at the host level. If I want to change
the failure domain to ‘rack’, when is the best time to do so (e.g. after the
rebalancing from moving the hosts into the racks finishes)?
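
For reference, when I do change the failure domain I would expect to use the
usual decompile/edit/recompile cycle, roughly:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit the rule: change "step chooseleaf firstn 0 type host" to "... type rack"
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new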

v12.2.2 if it makes a difference.

Cheers,
Sean M







Re: [ceph-users] Infiniband backend OSD communication

2020-01-07 Thread Nathan Stratton
Ok, so ipoib is required...
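
So I assume the cluster side ends up looking roughly like this, with the OSD
cluster addresses coming from the IPoIB interface (the subnet below is just a
placeholder):

[global]
ms_cluster_type = async+rdma
ms_async_rdma_device_name = mlx4_0
# cluster addresses are still plain IP, assigned on the IPoIB interface
cluster_network = 10.10.10.0/24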

><>
nathan stratton


On Mon, Jan 6, 2020 at 4:45 AM Wei Zhao  wrote:

> From my understanding, the basic idea is that Ceph exchanges RDMA
> information (QP, GID, and so on) over the IP address configured on the RDMA
> device, and the daemons then communicate with each other through RDMA. But in
> my tests there seemed to be some issues in that code.
>
> On Fri, Jan 3, 2020 at 2:24 AM Nathan Stratton 
> wrote:
> >
> > I am working on upgrading my current Ethernet-only Ceph cluster to a
> > combined Ethernet front end and InfiniBand back end. From my research I
> > understand that I set:
> >
> > ms_cluster_type = async+rdma
> > ms_async_rdma_device_name = mlx4_0
> >
> > What I don't understand is: how does Ceph know how to reach each OSD over
> > RDMA? Do I have to run IPoIB on top of InfiniBand and use that for the OSD
> > addresses?
> >
> > Is there a way to use InfiniBand on the back end without IPoIB and just use
> > RDMA verbs?
> >
> > ><>
> > nathan stratton


[ceph-users] ceph (jewel) unable to recover after node failure

2020-01-07 Thread Hanspeter Kunz
here is the output of ceph health detail: 

HEALTH_ERR 16 pgs are stuck inactive for more than 300 seconds; 134 pgs 
backfill_wait; 11 pgs backfilling; 69 pgs degraded; 14 pgs down; 2 pgs 
incomplete; 14 pgs peering; 6 pgs recovery_wait; 69 pgs stuck degraded; 16 pgs 
stuck inactive; 167 pgs stuck unclean; 63 pgs stuck undersized; 63 pgs 
undersized; 29 requests are blocked > 32 sec; 6 osds have slow requests; 
recovery 667605/71152293 objects degraded (0.938%); recovery 1564114/71152293 
objects misplaced (2.198%); too many PGs per OSD (314 > max 300)
pg 8.3ec is stuck inactive for 17320.100016, current state down+peering, last 
acting [22,40,49]
pg 9.3ac is stuck inactive since forever, current state down+remapped+peering, 
last acting [36]
pg 9.243 is stuck inactive for 17602.030517, current state incomplete, last 
acting [34,47,26]
pg 9.23e is stuck inactive since forever, current state down+remapped+peering, 
last acting [18]
pg 11.7a is stuck inactive since forever, current state down+remapped+peering, 
last acting [13,25]
pg 9.66 is stuck inactive since forever, current state down+remapped+peering, 
last acting [20]
pg 8.6c is stuck inactive for 17196.609471, current state down+peering, last 
acting [34,17,48]
pg 8.143 is stuck inactive for 17201.229429, current state 
down+remapped+peering, last acting [39,19]
pg 10.103 is stuck inactive for 17544.862477, current state down+peering, last 
acting [30,19,53]
pg 8.ae is stuck inactive for 17518.839339, current state down+peering, last 
acting [39,21,52]
pg 8.37 is stuck inactive for 17520.793755, current state down+peering, last 
acting [15,40,52]
pg 7.399 is stuck inactive since forever, current state down+remapped+peering, 
last acting [21]
pg 7.210 is stuck inactive for 17535.412721, current state incomplete, last 
acting [22,49,15]
pg 7.136 is stuck inactive for 40796.009480, current state 
down+remapped+peering, last acting [46]
pg 9.38 is stuck inactive since forever, current state down+remapped+peering, 
last acting [46]
pg 7.36 is stuck inactive since forever, current state down+remapped+peering, 
last acting [20]
pg 9.3ff is stuck unclean for 59505.890789, current state 
active+remapped+wait_backfill, last acting [48,53,33]
pg 9.3e8 is stuck unclean for 21312.446345, current state 
active+remapped+wait_backfill, last acting [28,53,27]
pg 9.3df is stuck unclean for 17346.719500, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [28,46]
pg 7.3c8 is stuck unclean for 86528.672542, current state 
active+remapped+wait_backfill, last acting [30,35,40]
pg 9.3b1 is stuck unclean for 17859.207821, current state 
active+remapped+wait_backfill, last acting [35,40,14]
pg 7.3b8 is stuck unclean for 88517.511151, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [42,14]
pg 9.398 is stuck unclean for 41016.001863, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [32,12]
pg 7.38b is stuck unclean for 41003.853238, current state 
active+remapped+wait_backfill, last acting [13,34,42]
pg 7.36d is stuck unclean for 18780.388726, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [32,29]
pg 9.363 is stuck unclean for 59589.647646, current state 
active+remapped+wait_backfill, last acting [40,16,32]
pg 7.369 is stuck unclean for 17601.998787, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [31,15]
pg 9.368 is stuck unclean for 41558.892612, current state 
active+remapped+wait_backfill, last acting [21,25,19]
pg 7.34d is stuck unclean for 41015.946070, current state 
active+remapped+wait_backfill, last acting [48,14,22]
pg 9.3db is stuck unclean for 50487.572088, current state 
active+remapped+wait_backfill, last acting [40,33,52]
pg 7.30c is stuck unclean for 98943.868376, current state 
active+remapped+wait_backfill, last acting [12,39,16]
pg 7.3a5 is stuck unclean for 26487.349029, current state 
active+remapped+wait_backfill, last acting [36,28,33]
pg 8.2d3 is stuck unclean for 98535.669203, current state 
active+recovery_wait+degraded, last acting [30,33,52]
pg 7.2d6 is stuck unclean for 17769.739311, current state 
active+remapped+wait_backfill, last acting [16,15,36]
pg 9.2b2 is stuck unclean for 67277.008904, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [40,19]
pg 9.2b5 is stuck unclean for 17510.383905, current state 
active+remapped+wait_backfill, last acting [32,29,33]
pg 9.2b8 is stuck unclean for 17601.978526, current state 
active+remapped+backfilling, last acting [18,21,50]
pg 9.2a1 is stuck unclean for 41018.243699, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [28,49]
pg 9.2a8 is stuck unclean for 59129.277638, current state 
active+remapped+wait_backfill, last acting [15,17,44]
pg 7.295 is stuck unclean for 17859.207323, current state 
active+undersized+degraded+remapped+wait_backfill, last acting [38,21]
pg 7.28b is stuck unclean for 

[ceph-users] ceph (jewel) unable to recover after node failure

2020-01-07 Thread Hanspeter Kunz
Hi,

after a node failure ceph is unable to recover, i.e. unable to
reintegrate the failed node back into the cluster.

what happened?
1. a node with 11 osds crashed, the remaining 4 nodes (also with 11
osds each) re-balanced, although reporting the following error
condition:

too many PGs per OSD (314 > max 300)

2. after we put the failed node back online, automatic recovery
started, but very soon (after a few minutes) we saw OSDs randomly going
down and up on ALL the osd nodes (not only on the one that had failed).
we saw that the load (CPU) on the nodes was very high (load average ~120).

3. the situation seemed to get worse over time (more and more OSDs
going down, fewer coming back up), so we switched the node that had
failed off again.

4. after that, the cluster "calmed down" and the CPU load became normal
(load average ~4-5). we manually restarted the daemons of the OSDs
that were still down, and one after the other these OSDs came back up.
Recovery processes are still running now, but it seems to me that 14
PGs are not recoverable:

output of ceph -s:

 health HEALTH_ERR
16 pgs are stuck inactive for more than 300 seconds
255 pgs backfill_wait
16 pgs backfilling
205 pgs degraded
14 pgs down
2 pgs incomplete
14 pgs peering
48 pgs recovery_wait
205 pgs stuck degraded
16 pgs stuck inactive
335 pgs stuck unclean
156 pgs stuck undersized
156 pgs undersized
25 requests are blocked > 32 sec
recovery 1788571/71151951 objects degraded (2.514%)
recovery 2342374/71151951 objects misplaced (3.292%)
too many PGs per OSD (314 > max 300)
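
(As an aside, I assume the PG-per-OSD warning itself could be silenced by
raising the threshold, e.g. with something like the line below, but that of
course does not change the actual PG count.)

ceph tell mon.* injectargs '--mon_pg_warn_max_per_osd 400'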

I have a few questions now:

A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.

B. what caused the OSDs to go down and up during recovery after the
failed OSD node came back online (step 2 above)? I suspect that the
high CPU load we saw on all the nodes caused heartbeat timeouts in the
OSD daemons. Is this a reasonable assumption?

C. If indeed all this was caused by such an overload, is there a way to
make the recovery process less CPU-intensive? (a rough sketch of what I
have in mind follows after question D)

D. What would you advise me to do/try to recover to a healthy state?
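
Regarding C, the knobs I would expect to try first are the usual recovery
throttles, e.g. (the values are only a guess):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'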

In what follows I try to give some more background information
(configuration, log messages). 

ceph version: 10.2.11
OS version: debian jessie
[yes I know this is old]

cluster: 5 OSD nodes (12 cores, 64G RAM), 11 OSDs per node; each OSD
daemon controls a 2 TB hard drive. The journals are written to an SSD.

ceph.conf:
-
[global]
fsid = [censored]
mon_initial_members = salomon, simon, ramon
mon_host = 10.65.16.44, 10.65.16.45, 10.65.16.46
public_network = 10.65.16.0/24
cluster_network = 10.65.18.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon osd down out interval = 7200
--

Log Messages (examples):

we see a lot of:

Jan  7 18:52:22 bruce ceph-osd[9184]: 2020-01-07 18:52:22.411377 7f0ebd93b700 -1 osd.29 15636 heartbeat_check: no reply from 10.65.16.43:6822 osd.48 since back 2020-01-07 18:51:20.119784 front 2020-01-07 18:52:21.575852 (cutoff 2020-01-07 18:52:02.411330)

however, all the networks were up (the machines could ping each other).

I guess these are the log messages of OSDs going down (on one of the
nodes):
Jan  7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729691 7fbe5ee73700 
-1 osd.25 15017 *** Got signal Interrupt ***
Jan  7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729701 7fbe5ee73700 
-1 osd.25 15017 shutdown
Jan  7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940577 7fb47fda5700 
-1 osd.27 15023 *** Got signal Interrupt ***
Jan  7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940598 7fb47fda5700 
-1 osd.27 15023 shutdown
Jan  7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037075 7f4aa0a00700 
-1 osd.24 15023 *** Got signal Interrupt ***
Jan  7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037087 7f4aa0a00700 
-1 osd.24 15023 shutdown
Jan  7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511811 7fd6c26a8700 
-1 osd.22 15042 *** Got signal Interrupt ***
Jan  7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511869 7fd6c26a8700 
-1 osd.22 15042 shutdown

Best regards,
Hp
-- 
Hanspeter Kunz  University of Zurich
Systems Administrator   Department of Informatics
Email: hk...@ifi.uzh.ch Binzmühlestrasse 14
Tel: +41.(0)44.63-56714 Office 2.E.07
http://www.ifi.uzh.ch   CH-8050 Zurich, Switzerland

Spamtraps: hkunz.bo...@ailab.ch hkunz.bo...@ifi.uzh.ch
---
Rome wasn't burnt in a day.




Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Stefan Kooman
Quoting Paul Emmerich (paul.emmer...@croit.io):
> We've also seen some problems with FileStore on newer kernels; 4.9 is the
> last kernel that worked reliably with FileStore in my experience.
> 
> But I haven't seen problems with BlueStore related to the kernel version
> (well, except for that scrub bug, but my work-around for that is in all
> release versions).

What scrub bug are you talking about?

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Stefan Kooman
Quoting Jelle de Jong (jelledej...@powercraft.nl):

> question 2: which systemd target can I use to run a service after all
> ceph-osds are loaded? I tried ceph.target and ceph-osd.target; both do not
> work reliably.

ceph-osd.target works for us (every time). Have you enabled all the
individual OSD services, i.e. ceph-osd@0.service?
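
If you need something ordered strictly after the OSD services, a minimal unit
of your own could look like this (just a sketch; the unit and script names are
made up):

# /etc/systemd/system/post-osd-tuning.service (hypothetical)
[Unit]
Description=Run local tuning after the Ceph OSD services have started
After=ceph-osd.target
Wants=ceph-osd.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/post-osd-tuning.sh

[Install]
WantedBy=multi-user.target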

> question 3: should I still try to upgrade to bluestore, or pray to the system
> gods that my performance is back after many, many hours of troubleshooting?

I would suggest the first; the second is optional ;-). Especially because
you have a separate NVMe device you can use for WAL / DB. It has
advantages over filestore ...
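
A rough sketch of what creating such an OSD could look like (assuming a
Luminous-or-later ceph-volume; the device names are placeholders):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1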

> I made a few changes; I am going to just list them for other people that are
> suffering from slow performance after upgrading their Ceph and/or OS.
> 
> Disk utilization is back around 10%, no more 80-100%, and rados bench is
> stable again.
> 
> apt-get install irqbalance nftables

^^ Are these some of those changes? Do you need those packages in order
to unload / blacklist them?

I don't get what your fixes are, or what the problem was. Firewall
issues?

What Ceph version did you upgrade to?

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Jelle de Jong

Hello everybody,

I think I fixed the issues after weeks of looking.

question 1: does anyone know how to prevent iptables, nftables or conntrack 
from being loaded in the first place? Adding them to 
/etc/modprobe.d/blacklist.local.conf does not seem to work. What is 
recommended?


question 2: which systemd target can I use to run a service after all 
ceph-osds are loaded? I tried ceph.target and ceph-osd.target; both do not 
work reliably.


question 3: should I still try to upgrade to bluestore, or pray to the 
system gods that my performance is back after many, many hours of 
troubleshooting?


I made a few changes; I am going to just list them for other people who 
are suffering from slow performance after upgrading their Ceph and/or OS.


Disk utilization is back around 10%, no more 80-100%, and rados bench 
is stable again.


apt-get install irqbalance nftables

# cat /etc/ceph/ceph.conf
[global]
fsid = 5f8d3724-1a51-4895-9b3e-5eb90ea49782
mon_initial_members = ceph01, ceph02, ceph03
mon_host = 192.168.35.11,192.168.35.12,192.168.35.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

osd pool default size = 3
public network = 192.168.35.0/28
cluster network = 192.168.35.0/28
osd pool default min size = 2

osd scrub begin hour = 23
osd scrub end hour = 6

# default osd recovery max active = 3
osd recovery max active = 1

#setuser match path = /var/lib/ceph/$type/$cluster-$id

debug_default = 0
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
filestore_op_threads = 8
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_max_ops = 500
filestore_queue_committing_max_ops = 5000
filestore_merge_threshold = 40
filestore_split_multiple = 10
journal_max_write_entries = 1000
journal_queue_max_ops = 3000
journal_max_write_bytes = 1048576000
osd_mkfs_options_xfs = -f -I size=2048
osd_mount_options_xfs = noatime,largeio,nobarrier,inode64,allocsize=8M
ods_op_threads = 32
osd_journal_size = 1
filestore_queue_max_bytes = 1048576000
filestore_queue_committing_max_bytes = 1048576000
journal_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_journal_parallel = true

[client]
rbd cache = true
#rbd cache max dirty = 0

# cat /etc/sysctl.d/30-nic-10gbit.conf
net.ipv4.tcp_rmem = 1000 1000 1000
net.ipv4.tcp_wmem = 1000 1000 1000
net.ipv4.tcp_mem = 1000 1000 1000
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.netdev_max_backlog = 30

Unload all forms of filtering; blacklisting does not work, the modules 
keep getting loaded! I guess they are auto-loaded by the kernel.


echo "blacklist ip_tables" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist iptable_filter" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist ip6_tables" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist ip6table_filter" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist nf_tables" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "blacklist nf6_tables" | tee --append /etc/modprobe.d/blacklist.local.conf
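
The next thing I might try is an "install" override instead of a plain
blacklist, so modprobe refuses to load the modules at all (untested sketch):

echo "install ip_tables /bin/true" | tee --append /etc/modprobe.d/blacklist.local.conf
echo "install nf_tables /bin/true" | tee --append /etc/modprobe.d/blacklist.local.conf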

depmod -a
update-initramfs -u -k all -v

root@ceph02:~# cat /etc/rc.local
#!/bin/bash -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.

for i in {a..e}; do echo 512 > /sys/block/sd$i/queue/read_ahead_kb; done
for i in {a..d}; do hdparm -q -B 255 -q -W0 /dev/sd$i; done

echo 'on' > '/sys/bus/pci/devices/:00:01.0/power/control'
echo 'on' > '/sys/bus/pci/devices/:00:03.0/power/control'
echo 'on' > '/sys/bus/pci/devices/:00:01.0/power/control'

cpupower frequency-set --governor performance

modprobe -r iptable_filter ip_tables ip6table_filter ip6_tables nf_tables_ipv6 nf_tables_ipv4 nf_tables_bridge nf_tables


# pin each ceph-osd process to its own block of CPU cores
array=($(pidof ceph-osd))
taskset -cp 0-5 ${array[0]}
taskset -cp 12-17 ${array[1]}
taskset -cp 6-11 ${array[2]}
taskset -cp 18-23 ${array[3]}

exit 0


Please also save the pastebin from my OP; there are a lot of benchmark and 
test notes in there.


root@ceph02:~# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 
4194304 for up to 10 seconds or 0 objects

Object prefix: benchmark_data_ceph02_396172
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg