[ceph-users] Re: c-states and OSD performance

2024-01-26 Thread Anthony D'Atri
I’ve seen C-states impact mons by dropping a bunch of packets — on nodes that 
were lightly utilized so they transitioned a lot.  Curiously both CPU and NIC 
generation seemed to be factors, as it only happened on one cluster out of a 
dozen or so.

If by SSD you mean SAS/SATA SSDs, then the question is kinda broad, but 
probably.  With HDD OSDs I suspect not.
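
If you want to experiment with keeping the CPUs out of deep C-states, the usual
knobs are a tuned profile or the idle-state kernel parameters; a rough sketch
(exact parameters depend on CPU generation and distro, so treat these as
illustrative):

  # option 1: tuned profile that keeps cores out of deep C-states
  tuned-adm profile latency-performance

  # option 2: kernel command line (Intel), then reboot
  #   intel_idle.max_cstate=1 processor.max_cstate=1

  # verify which C-states the cores are still allowed to enter
  cpupower idle-info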

> On Jan 26, 2024, at 7:35 PM, Christopher Durham  wrote:
> 
> Hi,
> The following article:
> https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
> 
> suggests disabling C-states on your CPUs (on the OSD nodes) as one method 
> to improve performance. The article seems to indicate that the scenario being 
> addressed in the article was with NVMe OSDs.
> 
> Questions:
> Will disabling C-states and keeping the processors at max power state help 
> performance for the following:
> 1. NVMe OSDs (yes)
> 2. SSD OSDs
> 3. Spinning disk OSDs
> 
> -Chris
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
I decided to tune cephfs client's kernels and increase network buffers to
increase speed.

This time my client has 1x 10Gbit DAC cable.
Client version is 1 step ahead: ceph-common/stable,now 17.2.7-1focal amd64
[installed]

The kernel tunings:

root@maradona:~# cat /etc/sysctl.conf
net.ipv4.tcp_syncookies = 0 # Disable syncookies (syncookies are not RFC
compliant and can use too many resources)
net.ipv4.tcp_keepalive_time = 600   # Keepalive time for TCP connections
(seconds)
net.ipv4.tcp_synack_retries = 3 # Number of SYNACK retries before
giving up
net.ipv4.tcp_syn_retries = 3# Number of SYN retries before giving up
net.ipv4.tcp_rfc1337 = 1 # Set to 1 to enable RFC 1337 protection
net.ipv4.conf.all.log_martians = 1 # Log packets with impossible addresses
to kernel log
net.ipv4.inet_peer_gc_mintime = 5 # Minimum interval between garbage
collection passes
net.ipv4.tcp_ecn = 0 # Disable Explicit Congestion Notification in TCP
net.ipv4.tcp_window_scaling = 1 # Enable window scaling as defined in
RFC1323
net.ipv4.tcp_timestamps = 1 # Enable timestamps (RFC1323)
net.ipv4.tcp_sack = 1 # Enable select acknowledgments
net.ipv4.tcp_fack = 1 # Enable FACK congestion avoidance and fast
retransmission
net.ipv4.tcp_dsack = 1 # Allows TCP to send "duplicate" SACKs
net.ipv4.ip_forward = 0 # Controls IP packet forwarding
net.ipv4.conf.default.rp_filter = 0 # Do not perform source route verification
(RFC1812)
net.ipv4.tcp_tw_recycle = 1 # Enable fast recycling TIME-WAIT sockets
net.ipv4.tcp_max_syn_backlog = 2 # to keep
TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog
net.ipv4.tcp_max_orphans = 412520 # tells the kernel how many TCP sockets
that are not attached to any user file handle to maintain
net.ipv4.tcp_orphan_retries = 1 # How may times to retry before killing TCP
connection, closed by our side
net.ipv4.tcp_fin_timeout = 20 # how long to keep sockets in the state
FIN-WAIT-2 if we were the one closing the socket
net.ipv4.tcp_max_tw_buckets = 33001472 # maximum number of sockets in
TIME-WAIT to be held simultaneously
net.ipv4.tcp_no_metrics_save = 1 # don't cache ssthresh from previous
connection
net.ipv4.tcp_moderate_rcvbuf = 1 # enable automatic tuning of the receive
buffer size
net.ipv4.tcp_rmem = 4096 87380 16777216 # increase Linux autotuning TCP
buffer limits
net.ipv4.tcp_wmem = 4096 65536 16777216 # increase Linux autotuning TCP
buffer limits
# increase TCP max buffer size
# net.core.rmem_max = 16777216 #try this if you get problems
# net.core.wmem_max = 16777216 #try this if you get problems
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 262144
net.core.wmem_default = 262144
#net.core.netdev_max_backlog = 2500 #try this if you get problems
net.core.netdev_max_backlog = 3
net.core.somaxconn = 65000
net.ipv6.conf.all.disable_ipv6 = 1 # Disable ipv6
# You can monitor the kernel behavior with regard to the dirty
# pages by using grep -A 1 dirty /proc/vmstat
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15

fs.file-max = 16500736 # system open file limit

# Core dump
kernel.core_pattern = /var/core_dumps/core.%e.%p.%h.%t
fs.suid_dumpable = 2

# Kernel related tunings
kernel.printk = 4 4 1 7
kernel.core_uses_pid = 1
kernel.sysrq = 0
kernel.msgmax = 65536
kernel.msgmnb = 65536
kernel.shmmax = 243314299699 # Maximum shared segment size in bytes
kernel.shmall = 66003228 # Maximum number of shared memory segments in pages
vm.nr_hugepages = 4096 # Number of persistent huge pages to reserve
vm.swappiness = 0 # Set vm.swappiness to 0 to minimize swapping
vm.min_free_kbytes = 2640129 # required free memory (set to 1% of physical
ram)
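
To load such a file without a reboot and spot-check a value, the usual commands
are:

  sysctl -p /etc/sysctl.conf        # apply the settings
  sysctl net.ipv4.tcp_rmem          # verify one of them took effect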

iobenchmark result:

Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    write: IOPS=, BW=MiB/s (1165MB/s)(3072MiB/2764msec); 0
zone resets
 BS=128K  write: IOPS=3812, BW=477MiB/s (500MB/s)(3072MiB/6446msec); 0 zone
resets
 BS=64K   write: IOPS=5116, BW=320MiB/s (335MB/s)(3072MiB/9607msec); 0 zone
resets
 BS=32K   write: IOPS=6545, BW=205MiB/s (214MB/s)(3072MiB/15018msec); 0
zone resets
 BS=16K   write: IOPS=8004, BW=125MiB/s (131MB/s)(3072MiB/24561msec); 0
zone resets
 BS=4K    write: IOPS=8661, BW=33.8MiB/s (35.5MB/s)(3072MiB/90801msec); 0
zone resets
Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    read: IOPS=1117, BW=1117MiB/s (1171MB/s)(3072MiB/2750msec)
 BS=128K  read: IOPS=8353, BW=1044MiB/s (1095MB/s)(3072MiB/2942msec)
 BS=64K   read: IOPS=11.8k, BW=739MiB/s (775MB/s)(3072MiB/4155msec)
 BS=32K   read: IOPS=16.3k, BW=508MiB/s (533MB/s)(3072MiB/6049msec)
 BS=16K   read: IOPS=23.0k, BW=375MiB/s (393MB/s)(3072MiB/8195msec)
 BS=4K    read: IOPS=27.4k, BW=107MiB/s (112MB/s)(3072MiB/28740msec)
Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1Mwrite: IOPS=1102, BW=1103MiB/s (1156MB/s)(3072MiB/2786msec); 0
zone resets
 BS=128K  write: IOPS=8581, BW=1073MiB/s (1125MB/s)(3072MiB/2864msec); 0
zone resets
 BS=64K   write: IOPS=10.9k, BW=681MiB/s 

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
Wow I noticed something!

To prevent RAM overflow with GPU training allocations, I'm using a 2TB
Samsung 870 EVO for swap.

As you can see below, swap usage is 18Gi even though the server was idle; that
means the ceph client may be hitting latency because of the swap usage.

root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577#
free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        34Gi        27Gi       0.0Ki       639Mi        27Gi
Swap:          1.8Ti        18Gi       1.8Ti

I decided to play around with kernel parameters to prevent ceph swap usage.

kernel.shmmax = 60654764851   # Maximum shared segment size in bytes
> kernel.shmall = 16453658   # Maximum number of shared memory segments in
> pages
> vm.nr_hugepages = 4096   # Increase Transparent Huge Pages (THP) Defrag:
> vm.swappiness = 0 # Set vm.swappiness to 0 to minimize swapping
> vm.min_free_kbytes = 1048576 # required free memory (set to 1% of physical
> ram)
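
A quick way to see which processes are actually sitting in swap (purely for
checking, before rebooting) is something like:

  grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -n | tail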


I rebooted the server and after the reboot swap usage is 0 as expected.

To give a try I started the iobench.sh (
https://github.com/ozkangoksu/benchmark/blob/main/iobench.sh)
This client has a 1G NIC only. As you can see below, other than at 4K block
size, the ceph client can saturate the NIC.

root@bmw-m4:~# nicstat -MUz 1
Time  Int   rMbps   wMbps   rPk/s   wPk/srAvswAvs %rUtil
%wUtil
01:04:48   ens1f0   936.9   92.90 91196.8 60126.3  1346.6   202.5   98.2
9.74

root@bmw-m4:/mounts/ud-data/benchuser1/96f13211-c37f-42db-8d05-f3255a05129e/testdir#
bash iobench.sh
Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27395msec); 0 zone
resets
 BS=128K  write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27462msec); 0 zone
resets
 BS=64K   write: IOPS=1758, BW=110MiB/s (115MB/s)(3072MiB/27948msec); 0
zone resets
 BS=32K   write: IOPS=3542, BW=111MiB/s (116MB/s)(3072MiB/27748msec); 0
zone resets
 BS=16K   write: IOPS=6839, BW=107MiB/s (112MB/s)(3072MiB/28747msec); 0
zone resets
 BS=4K    write: IOPS=8473, BW=33.1MiB/s (34.7MB/s)(3072MiB/92813msec); 0
zone resets
Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27386msec)
 BS=128K  read: IOPS=895, BW=112MiB/s (117MB/s)(3072MiB/27431msec)
 BS=64K   read: IOPS=1788, BW=112MiB/s (117MB/s)(3072MiB/27486msec)
 BS=32K   read: IOPS=3561, BW=111MiB/s (117MB/s)(3072MiB/27603msec)
 BS=16K   read: IOPS=6924, BW=108MiB/s (113MB/s)(3072MiB/28392msec)
 BS=4K    read: IOPS=21.3k, BW=83.3MiB/s (87.3MB/s)(3072MiB/36894msec)
Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27406msec); 0 zone
resets
 BS=128K  write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27466msec); 0 zone
resets
 BS=64K   write: IOPS=1781, BW=111MiB/s (117MB/s)(3072MiB/27591msec); 0
zone resets
 BS=32K   write: IOPS=3545, BW=111MiB/s (116MB/s)(3072MiB/27729msec); 0
zone resets
 BS=16K   write: IOPS=6823, BW=107MiB/s (112MB/s)(3072MiB/28814msec); 0
zone resets
 BS=4K    write: IOPS=12.7k, BW=49.8MiB/s (52.2MB/s)(3072MiB/61694msec); 0
zone resets
Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27388msec)
 BS=128K  read: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27479msec)
 BS=64K   read: IOPS=1784, BW=112MiB/s (117MB/s)(3072MiB/27547msec)
 BS=32K   read: IOPS=3559, BW=111MiB/s (117MB/s)(3072MiB/27614msec)
 BS=16K   read: IOPS=7047, BW=110MiB/s (115MB/s)(3072MiB/27897msec)
 BS=4K    read: IOPS=26.9k, BW=105MiB/s (110MB/s)(3072MiB/29199msec)



root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1818702#
cat metrics
item   total
--
opened files  / total inodes   0 / 109
pinned i_caps / total inodes   109 / 109
opened inodes / total inodes   0 / 109

item  total   avg_lat(us) min_lat(us) max_lat(us)
stdev(us)
---
read  2316289 13904   221 8827984
760
write 2317824 21152   29759243821
2365
metadata  170 5944225 202505
 24314

item  total   avg_sz(bytes)   min_sz(bytes)   max_sz(bytes)
 total_sz(bytes)

read  2316289 16688   40961048576
38654712361
write 2317824 19457   40964194304
45097156608

item  total   misshit
-
d_lease   112 3   858
caps  109 58  6963547

root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1818702#
free -h
   total

[ceph-users] c-states and OSD performance

2024-01-26 Thread Christopher Durham
Hi,
The following article:
https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/

suggests disabling C-states on your CPUs (on the OSD nodes) as one method 
to improve performance. The article seems to indicate that the scenario being 
addressed in the article was with NVMe OSDs.

Questions:
Will disabling C-states and keeping the processors at max power state help 
performance for the following:
1. NVMe OSDs (yes)
2. SSD OSDs
3. Spinning disk OSDs

-Chris

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
I started to investigate my clients.

for example:

root@ud-01:~# ceph health detail
HEALTH_WARN 1 clients failing to respond to cache pressure
[WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
mds.ud-data.ud-02.xcoojt(mds.0): Client bmw-m4 failing to respond to
cache pressure client_id: 1275577

root@ud-01:~# ceph fs status
ud-data - 86 clients
===
RANK  STATE   MDS  ACTIVITY DNSINOS   DIRS
CAPS
 0active  ud-data.ud-02.xcoojt  Reqs:   34 /s  2926k  2827k   155k
 1157k


ceph tell mds.ud-data.ud-02.xcoojt session ls | jq -r '.[] | "clientid:
\(.id)= num_caps: \(.num_caps), num_leases: \(.num_leases),
request_load_avg: \(.request_load_avg), num_completed_requests:
\(.num_completed_requests), num_completed_flushes:
\(.num_completed_flushes)"' | sort -n -t: -k3

clientid: *1275577*= num_caps: 12312, num_leases: 0, request_load_avg: 0,
num_completed_requests: 0, num_completed_flushes: 1
clientid: 1275571= num_caps: 16307, num_leases: 1, request_load_avg: 2101,
num_completed_requests: 0, num_completed_flushes: 3
clientid: 1282130= num_caps: 26337, num_leases: 3, request_load_avg: 116,
num_completed_requests: 0, num_completed_flushes: 1
clientid: 1191789= num_caps: 32784, num_leases: 0, request_load_avg: 1846,
num_completed_requests: 0, num_completed_flushes: 0
clientid: 1275535= num_caps: 79825, num_leases: 2, request_load_avg: 133,
num_completed_requests: 8, num_completed_flushes: 8
clientid: 1282142= num_caps: 80581, num_leases: 6, request_load_avg: 125,
num_completed_requests: 2, num_completed_flushes: 6
clientid: 1275532= num_caps: 87836, num_leases: 3, request_load_avg: 190,
num_completed_requests: 2, num_completed_flushes: 6
clientid: 1275547= num_caps: 94129, num_leases: 4, request_load_avg: 149,
num_completed_requests: 2, num_completed_flushes: 4
clientid: 1275553= num_caps: 96460, num_leases: 4, request_load_avg: 155,
num_completed_requests: 2, num_completed_flushes: 8
clientid: 1282139= num_caps: 108882, num_leases: 25, request_load_avg: 99,
num_completed_requests: 2, num_completed_flushes: 4
clientid: 1275538= num_caps: 437162, num_leases: 0, request_load_avg: 101,
num_completed_requests: 2, num_completed_flushes: 0
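
To look at just the flagged client from the warning above, the same output can
be filtered by id, e.g.:

  ceph tell mds.ud-data.ud-02.xcoojt session ls | jq '.[] | select(.id == 1275577)'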

--

*MY CLIENT:*

The client is actually idle and there is no reason for it to fail at all.

root@bmw-m4:~# apt list --installed |grep ceph
ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed]
libcephfs2/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64
[installed,automatic]
python3-ceph-argparse/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64
[installed,automatic]
python3-ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 all
[installed,automatic]
python3-cephfs/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64
[installed,automatic]

Let's check metrics and stats:

root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577#
cat metrics
item   total
--
opened files  / total inodes   2 / 12312
pinned i_caps / total inodes   12312 / 12312
opened inodes / total inodes   1 / 12312

item  total   avg_lat(us) min_lat(us) max_lat(us)
stdev(us)
---
read  22283   44409   430 1804853
15619
write 112702  419725  36588879541
6008
metadata  353322  5712154 917903
 5357

item  total   avg_sz(bytes)   min_sz(bytes)   max_sz(bytes)
 total_sz(bytes)

read  22283   1701940 1   4194304
37924318602
write 112702  246211  1   4194304
27748469309

item  total   misshit
-
d_lease   62  63627   28564698
caps  12312   36658   44568261


root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577#
cat bdi/stats
BdiWriteback:0 kB
BdiReclaimable:800 kB
BdiDirtyThresh:  0 kB
DirtyThresh:   5795340 kB
BackgroundThresh:  2894132 kB
BdiDirtied:   27316320 kB
BdiWritten:   27316320 kB
BdiWriteBandwidth:1472 kBps
b_dirty: 0
b_io:0
b_more_io:   0
b_dirty_time:0
bdi_list:1
state:   1


Last 3 days dmesg output:

[Wed Jan 24 16:45:13 2024] xfsettingsd[653036]: segfault at 18 ip
7fbd12f5d337 sp 7ffd254332a0 error 4 in
libxklavier.so.16.4.0[7fbd12f4d000+19000]
[Wed Jan 24 16:45:13 2024] Code: 4c 89 e7 e8 0b 56 ff ff 48 89 03 48 8b 5c
24 30 e9 d1 fd ff ff e8 b9 5b ff ff 66 0f 1f 84 00 00 00 00 00 41 54 55 48
89 

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
Hello Frank.

I have 84 clients (high-end servers) with: Ubuntu 20.04.5 LTS - Kernel:
Linux 5.4.0-125-generic

My cluster is 17.2.6 quincy.
I have some client nodes with "ceph-common/stable,now 17.2.7-1focal"; I
wonder whether using newer-version clients is the main problem?
Maybe I have a communication error. For example, I hit this problem and I
can not collect client stats: https://github.com/ceph/ceph/pull/52127/files

Best regards.



Frank Schilder wrote on Fri, 26 Jan 2024 at 14:53:

> Hi, this message is one of those that are often spurious. I don't recall
> in which thread/PR/tracker I read it, but the story was something like that:
>
> If an MDS gets under memory pressure it will request dentry items back
> from *all* clients, not just the active ones or the ones holding many of
> them. If you have a client that's below the min-threshold for dentries (its
> one of the client/mds tuning options), it will not respond. This client
> will be flagged as not responding, which is a false positive.
>
> I believe the devs are working on a fix to get rid of these spurious
> warnings. There is a "bug/feature" in the MDS that does not clear this
> warning flag for inactive clients. Hence, the message hangs and never
> disappears. I usually clear it with a "echo 3 > /proc/sys/vm/drop_caches"
> on the client. However, except for being annoying in the dashboard, it has
> no performance or otherwise negative impact.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Eugen Block 
> Sent: Friday, January 26, 2024 10:05 AM
> To: Özkan Göksu
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: 1 clients failing to respond to cache pressure
> (quincy:17.2.6)
>
> Performance for small files is more about IOPS rather than throughput,
> and the IOPS in your fio tests look okay to me. What you could try is
> to split the PGs to get around 150 or 200 PGs per OSD. You're
> currently at around 60 according to the ceph osd df output. Before you
> do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data |
> head'? I don't need the whole output, just to see how many objects
> each PG has. We had a case once where that helped, but it was an older
> cluster and the pool was backed by HDDs and separate rocksDB on SSDs.
> So this might not be the solution here, but it could improve things as
> well.
>
>
> Zitat von Özkan Göksu :
>
> > Every user has a 1x subvolume and I only have 1 pool.
> > At the beginning we were using each subvolume for ldap home directory +
> > user data.
> > When a user logins any docker on any host, it was using the cluster for
> > home and the for user related data, we was have second directory in the
> > same subvolume.
> > Time to time users were feeling a very slow home environment and after a
> > month it became almost impossible to use home. VNC sessions became
> > unresponsive and slow etc.
> >
> > 2 weeks ago, I had to migrate home to a ZFS storage and now the overall
> > performance is better for only user_data without home.
> > But still the performance is not good enough as I expected because of the
> > problems related to MDS.
> > The usage is low but allocation is high and Cpu usage is high. You saw
> the
> > IO Op/s, it's nothing but allocation is high.
> >
> > I develop a fio benchmark script and I run the script on 4x test server
> at
> > the same time, the results are below:
> > Script:
> >
> https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh
> >
> >
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
> >
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
> >
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
> >
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt
> >
> > While running benchmark, I take sample values for each type of iobench
> run.
> >
> > Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> > client:   70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr
> > client:   60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr
> > client:   13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr
> >
> > Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> > client:   1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr
> > client:   370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr
> >
> > Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> > client:   63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr
> > client:   14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr
> > client:   6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr
> >
> > Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> > client:   317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr
> > client:   

[ceph-users] Re: OSD read latency grows over time

2024-01-26 Thread Mark Nelson

On 1/26/24 11:26, Roman Pashin wrote:


Unfortunately they cannot. You'll want to set them in centralized conf
and then restart OSDs for them to take effect.


Got it. Thank you Josh! WIll put it to config of affected OSDs and restart
them.

Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 >
4096 hurt performance of HDD OSDs in any way? I have no growing latency on
HDD OSD, where data is stored, but it would be easier to set it to [osd]
section without cherry picking only SSD/NVME OSDs, but for all at once.



Potentially if you set the trigger too low, you could force constant 
compactions.  Say if you set it to trigger compaction every time a 
tombstone is encountered.  You really want to find the sweet spot where 
iterating over tombstones (possibly multiple times) is more expensive 
than doing a compaction.  The defaults are basically just tuned to avoid 
the worst case scenario where OSDs become laggy or even go into 
heartbeat timeout (and we're not 100% sure we got those right).  I 
believe we've got a couple of big users that tune it more aggressively, 
though I'll let them speak up if they are able.
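
If you just want to see whether compaction itself helps before touching the
trigger, a one-off manual compaction on a single affected OSD is a cheap test
(then watch the latency graphs):

  ceph tell osd.<id> compact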



Mark



--
Thank you,
Roman

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 17.2.7: Backfilling deadlock / stall / stuck / standstill

2024-01-26 Thread Mark Nelson
For what it's worth, we saw this last week at Clyso on two separate 
customer clusters on 17.2.7 and also solved it by moving back to wpq.  
We've been traveling this week so haven't created an upstream tracker 
for it yet, but we're back to recommending wpq to our customers for all 
production cluster deployments until we figure out what's going on.



Mark


On 1/26/24 15:08, Wesley Dillingham wrote:

I faced a similar issue. The PG just would never finish recovery. Changing
all OSDs in the PG to "osd_op_queue wpq" and then restarting them serially
ultimately allowed the PG to recover. Seemed to be some issue with mclock.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 26, 2024 at 7:57 AM Kai Stian Olstad 
wrote:


Hi,

This is a cluster running 17.2.7 upgraded from 16.2.6 on the 15 January
2024.

On Monday 22 January we had 4 HDD all on different server with I/O-error
because of some damage sectors, the OSD is hybrid so the DB is on SSD, 5
HDD share 1 SSD.
I set the OSD out, ceph osd out 223 269 290 318 and all hell broke
loose.

I took only minutes before the users complained about Ceph not working.
Ceph status reportet slow OPS on the OSDs that was set to out, and “ceph
tell osd. dump_ops_in_flight” against the out OSDs it just hang,
after 30 minutes I stopped the dump command.
Long story short I ended up running “ceph osd set nobackfill” to slow
ops was gone and then unset it when the slow ops message disappeared.
I needed to run that all the time so the cluster didn’t come to a holt
so this oneliner loop was used
“while true; do ceph -s | grep -qE "oldest one blocked for [0-9]{2,}" &&
(date; ceph osd set nobackfill; sleep 15; ceph osd unset nobackfill);
sleep 10; done”


But now 4 days later the backfilling has stopped progressing completely
and the number of misplaced object is increasing.
Some PG has 0 misplaced object but sill have backfilling state, and been
in this state for over 24 hours now.

I have a hunch that it’s because of PG 404.6e7 is in state
“active+recovering+degraded+remapped” it’s been in this state for over
48 hours.
It’s has possible 2 missing object, but since they are not unfound I
can’t delete them with “ceph pg 404.6e7 mark_unfound_lost delete”

Could someone please help to solve this?
Down below is some output of ceph commands, I’ll also attache them.


ceph status (only removed information about no running scrub and
deep_scrub)
---
cluster:
  id: b321e76e-da3a-11eb-b75c-4f948441dcd0
  health: HEALTH_WARN
  Degraded data redundancy: 2/6294904971 objects degraded
(0.000%), 1 pg degraded

services:
  mon: 3 daemons, quorum ceph-mon-1,ceph-mon-2,ceph-mon-3 (age 11d)
  mgr: ceph-mon-1.ptrsea(active, since 11d), standbys:
ceph-mon-2.mfdanx
  mds: 1/1 daemons up, 1 standby
  osd: 355 osds: 355 up (since 22h), 351 in (since 4d); 18 remapped
pgs
  rgw: 7 daemons active (7 hosts, 1 zones)

data:
  volumes: 1/1 healthy
  pools:   14 pools, 3945 pgs
  objects: 1.14G objects, 1.1 PiB
  usage:   1.8 PiB used, 1.2 PiB / 3.0 PiB avail
  pgs: 2/6294904971 objects degraded (0.000%)
   2980455/6294904971 objects misplaced (0.047%)
   3901 active+clean
   22   active+clean+scrubbing+deep
   17   active+remapped+backfilling
   4active+clean+scrubbing
   1active+recovering+degraded+remapped

io:
  client:   167 MiB/s rd, 13 MiB/s wr, 6.02k op/s rd, 2.35k op/s wr


ceph health detail (only removed information about no running scrub and
deep_scrub)
---
HEALTH_WARN Degraded data redundancy: 2/6294902067 objects degraded
(0.000%), 1 pg degraded
[WRN] PG_DEGRADED: Degraded data redundancy: 2/6294902067 objects
degraded (0.000%), 1 pg degraded
  pg 404.6e7 is active+recovering+degraded+remapped, acting
[223,274,243,290,286,283]


ceph pg 202.6e7 list_unfound
---
{
  "num_missing": 2,
  "num_unfound": 0,
  "objects": [],
  "state": "Active",
  "available_might_have_unfound": true,
  "might_have_unfound": [],
  "more": false
}

ceph pg 404.6e7 query | jq .recovery_state
---
[
{
  "name": "Started/Primary/Active",
  "enter_time": "2024-01-26T09:08:41.918637+",
  "might_have_unfound": [
{
  "osd": "243(2)",
  "status": "already probed"
},
{
  "osd": "274(1)",
  "status": "already probed"
},
{
  "osd": "275(0)",
  "status": "already probed"
},
{
  "osd": "283(5)",
  "status": "already probed"
},
{
  "osd": "286(4)",
  "status": "already probed"
},
{
  "osd": "290(3)",
  "status": "already probed"
},
{
  "osd": "335(3)",
  "status": "already probed"
}
  ],
  "recovery_progress": {
  

[ceph-users] Re: 17.2.7: Backfilling deadlock / stall / stuck / standstill

2024-01-26 Thread Wesley Dillingham
I faced a similar issue. The PG just would never finish recovery. Changing
all OSDs in the PG to "osd_op_queue wpq" and then restarting them serially
ultimately allowed the PG to recover. Seemed to be some issue with mclock.
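
Roughly, the change amounts to the following (sketch only; the OSD id is
illustrative and the restart command assumes a cephadm deployment):

  ceph config set osd.123 osd_op_queue wpq      # repeat for each OSD in the PG
  ceph orch daemon restart osd.123              # restart them one at a time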

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 26, 2024 at 7:57 AM Kai Stian Olstad 
wrote:

> Hi,
>
> This is a cluster running 17.2.7 upgraded from 16.2.6 on the 15 January
> 2024.
>
> On Monday 22 January we had 4 HDD all on different server with I/O-error
> because of some damage sectors, the OSD is hybrid so the DB is on SSD, 5
> HDD share 1 SSD.
> I set the OSD out, ceph osd out 223 269 290 318 and all hell broke
> loose.
>
> I took only minutes before the users complained about Ceph not working.
> Ceph status reportet slow OPS on the OSDs that was set to out, and “ceph
> tell osd. dump_ops_in_flight” against the out OSDs it just hang,
> after 30 minutes I stopped the dump command.
> Long story short I ended up running “ceph osd set nobackfill” to slow
> ops was gone and then unset it when the slow ops message disappeared.
> I needed to run that all the time so the cluster didn’t come to a holt
> so this oneliner loop was used
> “while true; do ceph -s | grep -qE "oldest one blocked for [0-9]{2,}" &&
> (date; ceph osd set nobackfill; sleep 15; ceph osd unset nobackfill);
> sleep 10; done”
>
>
> But now 4 days later the backfilling has stopped progressing completely
> and the number of misplaced object is increasing.
> Some PG has 0 misplaced object but sill have backfilling state, and been
> in this state for over 24 hours now.
>
> I have a hunch that it’s because of PG 404.6e7 is in state
> “active+recovering+degraded+remapped” it’s been in this state for over
> 48 hours.
> It’s has possible 2 missing object, but since they are not unfound I
> can’t delete them with “ceph pg 404.6e7 mark_unfound_lost delete”
>
> Could someone please help to solve this?
> Down below is some output of ceph commands, I’ll also attache them.
>
>
> ceph status (only removed information about no running scrub and
> deep_scrub)
> ---
>cluster:
>  id: b321e76e-da3a-11eb-b75c-4f948441dcd0
>  health: HEALTH_WARN
>  Degraded data redundancy: 2/6294904971 objects degraded
> (0.000%), 1 pg degraded
>
>services:
>  mon: 3 daemons, quorum ceph-mon-1,ceph-mon-2,ceph-mon-3 (age 11d)
>  mgr: ceph-mon-1.ptrsea(active, since 11d), standbys:
> ceph-mon-2.mfdanx
>  mds: 1/1 daemons up, 1 standby
>  osd: 355 osds: 355 up (since 22h), 351 in (since 4d); 18 remapped
> pgs
>  rgw: 7 daemons active (7 hosts, 1 zones)
>
>data:
>  volumes: 1/1 healthy
>  pools:   14 pools, 3945 pgs
>  objects: 1.14G objects, 1.1 PiB
>  usage:   1.8 PiB used, 1.2 PiB / 3.0 PiB avail
>  pgs: 2/6294904971 objects degraded (0.000%)
>   2980455/6294904971 objects misplaced (0.047%)
>   3901 active+clean
>   22   active+clean+scrubbing+deep
>   17   active+remapped+backfilling
>   4active+clean+scrubbing
>   1active+recovering+degraded+remapped
>
>io:
>  client:   167 MiB/s rd, 13 MiB/s wr, 6.02k op/s rd, 2.35k op/s wr
>
>
> ceph health detail (only removed information about no running scrub and
> deep_scrub)
> ---
> HEALTH_WARN Degraded data redundancy: 2/6294902067 objects degraded
> (0.000%), 1 pg degraded
> [WRN] PG_DEGRADED: Degraded data redundancy: 2/6294902067 objects
> degraded (0.000%), 1 pg degraded
>  pg 404.6e7 is active+recovering+degraded+remapped, acting
> [223,274,243,290,286,283]
>
>
> ceph pg 202.6e7 list_unfound
> ---
> {
>  "num_missing": 2,
>  "num_unfound": 0,
>  "objects": [],
>  "state": "Active",
>  "available_might_have_unfound": true,
>  "might_have_unfound": [],
>  "more": false
> }
>
> ceph pg 404.6e7 query | jq .recovery_state
> ---
> [
>{
>  "name": "Started/Primary/Active",
>  "enter_time": "2024-01-26T09:08:41.918637+",
>  "might_have_unfound": [
>{
>  "osd": "243(2)",
>  "status": "already probed"
>},
>{
>  "osd": "274(1)",
>  "status": "already probed"
>},
>{
>  "osd": "275(0)",
>  "status": "already probed"
>},
>{
>  "osd": "283(5)",
>  "status": "already probed"
>},
>{
>  "osd": "286(4)",
>  "status": "already probed"
>},
>{
>  "osd": "290(3)",
>  "status": "already probed"
>},
>{
>  "osd": "335(3)",
>  "status": "already probed"
>}
>  ],
>  "recovery_progress": {
>"backfill_targets": [
>  "275(0)",
>  "335(3)"
>],
>"waiting_on_backfill": [],
>"last_backfill_started":
>
> 

[ceph-users] Re: OSD read latency grows over time

2024-01-26 Thread Josh Baergen
> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 >
> 4096 hurt performance of HDD OSDs in any way? I have no growing latency on
> HDD OSD, where data is stored, but it would be easier to set it to [osd]
> section without cherry picking only SSD/NVME OSDs, but for all at once.

I think that depends on your workload, but I'm not certain.

If you don't override the OSD classes, you should be able to do
something like "ceph config set osd/class:ssd
rocksdb_cf_compact_on_deletion_trigger 4096".
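
You can check what a given daemon ends up with afterwards, e.g. (osd.435 just
as an example from earlier in the thread):

  ceph config show osd.435 rocksdb_cf_compact_on_deletion_trigger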

Josh

On Fri, Jan 26, 2024 at 10:27 AM Roman Pashin  wrote:
>
> > Unfortunately they cannot. You'll want to set them in centralized conf
> > and then restart OSDs for them to take effect.
> >
>
> Got it. Thank you Josh! WIll put it to config of affected OSDs and restart
> them.
>
> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger 16384 >
> 4096 hurt performance of HDD OSDs in any way? I have no growing latency on
> HDD OSD, where data is stored, but it would be easier to set it to [osd]
> section without cherry picking only SSD/NVME OSDs, but for all at once.
>
> --
> Thank you,
> Roman
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-01-26 Thread Roman Pashin
> Unfortunately they cannot. You'll want to set them in centralized conf
> and then restart OSDs for them to take effect.
>

Got it. Thank you Josh! Will put it into the config of the affected OSDs and
restart them.

Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger from 16384
to 4096 hurt performance of HDD OSDs in any way? I have no growing latency on
the HDD OSDs, where data is stored, but it would be easier to set it in the
[osd] section for all OSDs at once rather than cherry-picking only the
SSD/NVMe OSDs.

--
Thank you,
Roman
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-01-26 Thread Josh Baergen
> Do you know if it rocksdb_cf_compact_on_deletion_trigger and
> rocksdb_cf_compact_on_deletion_sliding_window can be changed in runtime
> without OSD restart?

Unfortunately they cannot. You'll want to set them in centralized conf
and then restart OSDs for them to take effect.
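
For example (the restart step depends on your deployment; shown here for
cephadm, restarting one OSD at a time):

  ceph config set osd rocksdb_cf_compact_on_deletion_trigger 4096
  ceph orch daemon restart osd.0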

Josh

On Fri, Jan 26, 2024 at 2:54 AM Roman Pashin  wrote:
>
> Hi Mark,
>
> In v17.2.7 we enabled a feature that automatically performs a compaction
> >> if too many tombstones are present during iteration in RocksDB.  It
> >> might be worth upgrading to see if it helps (you might have to try
> >> tweaking the settings if the defaults aren't helping enough).  The PR is
> >> here:
> >>
> >> https://github.com/ceph/ceph/pull/50893
> >
> >
> we've upgraded Ceph to v17.2.7 yesterday. Unfortunately I still see growing
> latency on OSDs hosting index pool. Will try to tune
> rocksdb_cf_compact_on_deletion options as you suggested.
>
> I've started with decreasing deletion_trigger from 16384 to 512 with:
>
> # ceph tell 'osd.*' injectargs '--rocksdb_cf_compact_on_deletion_trigger
> 512'
>
> At first glance - nothing has changed per OSD latency graphs. I've tried to
> decrease it to 32 deletions per window on a single OSD where I see
> increasing latency to force compactions, but per graphs nothing has changed
> after approx 40 minutes.
>
> # ceph tell 'osd.435' injectargs '--rocksdb_cf_compact_on_deletion_trigger
> 32'
>
> Didn't touch rocksdb_cf_compact_on_deletion_sliding_window yet, it is set
> with default 32768 entries.
>
> Do you know if it rocksdb_cf_compact_on_deletion_trigger and
> rocksdb_cf_compact_on_deletion_sliding_window can be changed in runtime
> without OSD restart?
>
> --
> Thank you,
> Roman
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW crashes when rgw_enable_ops_log is enabled

2024-01-26 Thread Matt Benjamin
Hi Marc,

1. if you can, yes, create a tracker issue on tracker.ceph.com?
2. you might be able to get more throughput with (some number) of
additional threads;  the first thing I would try is prioritization (nice)
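
As a point of reference, keeping the socket drained can be as simple as one
dedicated reader; a sketch (socket path and output file are illustrative, not
your actual config):

  # run the reader at a slightly higher priority so it never falls behind RGW
  nice -n -5 socat -u UNIX-CONNECT:/var/run/ceph/rgw-ops-log.sock STDOUT >> /var/log/rgw/ops-log.json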

regards,

Matt


On Fri, Jan 26, 2024 at 6:08 AM Marc Singer  wrote:

> Hi Matt
>
> Thanks for your answer.
>
> Should I open a bug report then?
>
> How would I be able to read more from it? Have multiple threads access
> it and read from it simultaneously?
>
> Marc
>
> On 1/25/24 20:25, Matt Benjamin wrote:
> > Hi Marc,
> >
> > No, the only thing you need to do with the Unix socket is to keep
> > reading from it.  So it probably is getting backlogged.  And while you
> > could arrange things to make that less likely, you likely can't make
> > it impossible, so there's a bug here.
> >
> > Matt
> >
> > On Thu, Jan 25, 2024 at 10:52 AM Marc Singer 
> wrote:
> >
> > Hi
> >
> > I am using a unix socket client to connect with it and read the data
> > from it.
> > Do I need to do anything like signal the socket that this data has
> > been
> > read? Or am I not reading fast enough and data is backing up?
> >
> > What I am also noticing that at some point (probably after something
> > with the ops socket happens), the log level seems to increase for
> > some
> > reason? I did not find anything in the logs yet why this would be
> > the case.
> >
> > *Normal:*
> >
> > 2024-01-25T15:47:58.444+ 7fe98a5c0b00  1 == starting new
> > request
> > req=0x7fe98712c720 =
> > 2024-01-25T15:47:58.548+ 7fe98b700b00  1 == req done
> > req=0x7fe98712c720 op status=0 http_status=200
> > latency=0.104001537s ==
> > 2024-01-25T15:47:58.548+ 7fe98b700b00  1 beast: 0x7fe98712c720:
> > redacted - redacted [25/Jan/2024:15:47:58.444 +] "PUT
> > /redacted/redacted/chunks/27/27242/27242514_10_4194304 HTTP/1.1" 200
> > 4194304 - "redacted" - latency=0.104001537s
> >
> > *Close before crashing:
> > *
> >
> >-509> 2024-01-25T14:54:31.588+ 7f5186648b00  1 == starting
> > new request req=0x7f517ffca720 =
> >-508> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s initializing for trans_id =
> > tx023a42eb7515dcdc0-0065b27627-823feaa-central
> >-507> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s getting op 1
> >-506> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s s3:put_obj verifying requester
> >-505> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s s3:put_obj normalizing buckets
> > and tenants
> >-504> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s s3:put_obj init permissions
> >-503> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s s3:put_obj recalculating target
> >-502> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s s3:put_obj reading permissions
> >-501> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s s3:put_obj init op
> >-500> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s s3:put_obj verifying op mask
> >-499> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
> > 2568229052387020224 0.0s s3:put_obj verifying op permissions
> >-498> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
> > 2568229052387020224 0.0s s3:put_obj Searching permissions for
> > identity=rgw::auth::SysReqApplier ->
> > rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted,
> > subuser=, perm_mask=15, is_admin=0) mask=50
> >-497> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
> > 2568229052387020224 0.0s s3:put_obj Searching permissions for
> > uid=redacted
> >-496> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
> > 2568229052387020224 0.0s s3:put_obj Found permission: 15
> >-495> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
> > 2568229052387020224 0.0s s3:put_obj Searching permissions for
> > group=1 mask=50
> >-494> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
> > 2568229052387020224 0.0s s3:put_obj Permissions for group
> > not found
> >-493> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
> > 2568229052387020224 0.0s s3:put_obj Searching permissions for
> > group=2 mask=50
> >-492> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
> > 2568229052387020224 0.0s s3:put_obj Permissions for group
> > not found
> >-491> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
> > 2568229052387020224 0.0s s3:put_obj -- Getting permissions

[ceph-users] Re: Odd auto-scaler warnings about too few/many PGs

2024-01-26 Thread Rich Freeman
On Fri, Jan 26, 2024 at 3:35 AM Torkil Svensgaard  wrote:
>
> The most weird one:
>
> Pool rbd_ec_data stores 683TB in 4096 pgs -> warn should be 1024
> Pool rbd_internal stores 86TB in 1024 pgs-> warn should be 2048
>
> That makes no sense to me based on the amount of data stored. Is this a
> bug or what am I missing? Ceph version is 17.2.7.

I'm guessing these pools are in different storage classes or have
different crush rules.  They would be on a different set of OSDs, and
so the autoscaler is going to pro-rate them only against pools
sharing the same OSDs, to maintain a similar number of PGs per OSD
across the entire cluster.

The solid state pools are smaller, so they would get more PGs per TB
of capacity.  On my cluster a 2TB SSD has a similar number of PGs as a
12TB HDD, and that is because the goal is the per-OSD ratio and not
the per-TB ratio.

If you follow the manual balancing guides you'd probably end up with a
similar result.  Just remember that different storage classes need to
be looked at separately - otherwise you'll probably have very few PGs
on your solid state OSDs.

Oh, I personally set the autoscaler to on for solid state, and warn
for HDD, since HDD rebalancing takes so much longer.
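
That is the per-pool pg_autoscale_mode setting, roughly:

  ceph osd pool set <ssd_pool> pg_autoscale_mode on
  ceph osd pool set <hdd_pool> pg_autoscale_mode warn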

--
Rich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 17.2.7: Backfilling deadlock / stall / stuck / standstill

2024-01-26 Thread Kai Stian Olstad

Hi,

This is a cluster running 17.2.7 upgraded from 16.2.6 on the 15 January 
2024.


On Monday 22 January we had 4 HDDs, all on different servers, with I/O errors 
because of some damaged sectors. The OSDs are hybrid, so the DB is on SSD; 5 
HDDs share 1 SSD.
I set the OSDs out (ceph osd out 223 269 290 318) and all hell broke 
loose.


It took only minutes before the users complained about Ceph not working.
Ceph status reported slow OPS on the OSDs that were set to out, and “ceph 
tell osd. dump_ops_in_flight” against the out OSDs just hung; 
after 30 minutes I stopped the dump command.
Long story short, I ended up running “ceph osd set nobackfill” until the slow 
ops were gone and then unset it when the slow ops message disappeared.
I needed to run that all the time so the cluster didn’t come to a halt, 
so this oneliner loop was used:
while true; do ceph -s | grep -qE "oldest one blocked for [0-9]{2,}" && (date; ceph osd set nobackfill; sleep 15; ceph osd unset nobackfill); sleep 10; done



But now, 4 days later, the backfilling has stopped progressing completely 
and the number of misplaced objects is increasing.
Some PGs have 0 misplaced objects but still have the backfilling state, and 
have been in this state for over 24 hours now.


I have a hunch that it’s because PG 404.6e7 is in state 
“active+recovering+degraded+remapped”; it’s been in this state for over 
48 hours.
It has possibly 2 missing objects, but since they are not unfound I 
can’t delete them with “ceph pg 404.6e7 mark_unfound_lost delete”.


Could someone please help to solve this?
Down below is some output of ceph commands; I’ll also attach them.


ceph status (only removed information about no running scrub and 
deep_scrub)

---
  cluster:
id: b321e76e-da3a-11eb-b75c-4f948441dcd0
health: HEALTH_WARN
Degraded data redundancy: 2/6294904971 objects degraded 
(0.000%), 1 pg degraded


  services:
mon: 3 daemons, quorum ceph-mon-1,ceph-mon-2,ceph-mon-3 (age 11d)
mgr: ceph-mon-1.ptrsea(active, since 11d), standbys: 
ceph-mon-2.mfdanx

mds: 1/1 daemons up, 1 standby
osd: 355 osds: 355 up (since 22h), 351 in (since 4d); 18 remapped 
pgs

rgw: 7 daemons active (7 hosts, 1 zones)

  data:
volumes: 1/1 healthy
pools:   14 pools, 3945 pgs
objects: 1.14G objects, 1.1 PiB
usage:   1.8 PiB used, 1.2 PiB / 3.0 PiB avail
pgs: 2/6294904971 objects degraded (0.000%)
 2980455/6294904971 objects misplaced (0.047%)
 3901 active+clean
 22   active+clean+scrubbing+deep
 17   active+remapped+backfilling
 4active+clean+scrubbing
 1active+recovering+degraded+remapped

  io:
client:   167 MiB/s rd, 13 MiB/s wr, 6.02k op/s rd, 2.35k op/s wr


ceph health detail (only removed information about no running scrub and 
deep_scrub)

---
HEALTH_WARN Degraded data redundancy: 2/6294902067 objects degraded 
(0.000%), 1 pg degraded
[WRN] PG_DEGRADED: Degraded data redundancy: 2/6294902067 objects 
degraded (0.000%), 1 pg degraded
pg 404.6e7 is active+recovering+degraded+remapped, acting 
[223,274,243,290,286,283]



ceph pg 202.6e7 list_unfound
---
{
"num_missing": 2,
"num_unfound": 0,
"objects": [],
"state": "Active",
"available_might_have_unfound": true,
"might_have_unfound": [],
"more": false
}

ceph pg 404.6e7 query | jq .recovery_state
---
[
  {
"name": "Started/Primary/Active",
"enter_time": "2024-01-26T09:08:41.918637+",
"might_have_unfound": [
  {
"osd": "243(2)",
"status": "already probed"
  },
  {
"osd": "274(1)",
"status": "already probed"
  },
  {
"osd": "275(0)",
"status": "already probed"
  },
  {
"osd": "283(5)",
"status": "already probed"
  },
  {
"osd": "286(4)",
"status": "already probed"
  },
  {
"osd": "290(3)",
"status": "already probed"
  },
  {
"osd": "335(3)",
"status": "already probed"
  }
],
"recovery_progress": {
  "backfill_targets": [
"275(0)",
"335(3)"
  ],
  "waiting_on_backfill": [],
  "last_backfill_started": 
"404:e76011a9:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.18_56463c71-286c-4399-8d5d-0c278b7c97fd:head",

  "backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
  },
  "peer_backfill_info": [],
  "backfills_in_flight": [],
  "recovering": [],
  "pg_backend": {
"recovery_ops": [],
"read_ops": []
  }
}
  },
  {
"name": "Started",
"enter_time": "2024-01-26T09:08:40.909151+"
  }
]


ceph pg ls recovering backfilling
---
PG   OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES 
OMAP_BYTES*  OMAP_KEYS*  LOGLOG_DUPS  STATE  
  SINCE  VERSION  REPORTED  UP   
  ACTING
404.bc287986 0

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Frank Schilder
Hi, this message is one of those that are often spurious. I don't recall in 
which thread/PR/tracker I read it, but the story was something like this:

If an MDS gets under memory pressure it will request dentry items back from 
*all* clients, not just the active ones or the ones holding many of them. If 
you have a client that's below the min-threshold for dentries (it's one of the 
client/MDS tuning options), it will not respond. This client will be flagged as 
not responding, which is a false positive.

I believe the devs are working on a fix to get rid of these spurious warnings. 
There is a "bug/feature" in the MDS that does not clear this warning flag for 
inactive clients. Hence, the message hangs and never disappears. I usually 
clear it with a "echo 3 > /proc/sys/vm/drop_caches" on the client. However, 
except for being annoying in the dashboard, it has no performance or otherwise 
negative impact.
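
For completeness, on the client that is:

  sync
  echo 3 > /proc/sys/vm/drop_caches   # drops page cache, dentries and inodes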

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Friday, January 26, 2024 10:05 AM
To: Özkan Göksu
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: 1 clients failing to respond to cache pressure 
(quincy:17.2.6)

Performance for small files is more about IOPS rather than throughput,
and the IOPS in your fio tests look okay to me. What you could try is
to split the PGs to get around 150 or 200 PGs per OSD. You're
currently at around 60 according to the ceph osd df output. Before you
do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data |
head'? I don't need the whole output, just to see how many objects
each PG has. We had a case once where that helped, but it was an older
cluster and the pool was backed by HDDs and separate rocksDB on SSDs.
So this might not be the solution here, but it could improve things as
well.


Zitat von Özkan Göksu :

> Every user has a 1x subvolume and I only have 1 pool.
> At the beginning we were using each subvolume for ldap home directory +
> user data.
> When a user logins any docker on any host, it was using the cluster for
> home and the for user related data, we was have second directory in the
> same subvolume.
> Time to time users were feeling a very slow home environment and after a
> month it became almost impossible to use home. VNC sessions became
> unresponsive and slow etc.
>
> 2 weeks ago, I had to migrate home to a ZFS storage and now the overall
> performance is better for only user_data without home.
> But still the performance is not good enough as I expected because of the
> problems related to MDS.
> The usage is low but allocation is high and Cpu usage is high. You saw the
> IO Op/s, it's nothing but allocation is high.
>
> I develop a fio benchmark script and I run the script on 4x test server at
> the same time, the results are below:
> Script:
> https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh
>
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt
>
> While running benchmark, I take sample values for each type of iobench run.
>
> Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr
> client:   60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr
> client:   13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr
>
> Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr
> client:   370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr
>
> Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr
> client:   14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr
> client:   6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr
>
> Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr
> client:   2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr
> client:   4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr
> client:   2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr
>
> It seems I only have problems with the 4K,8K,16K other sector sizes.
>
>
>
>
> Eugen Block wrote on Thu, 25 Jan 2024 at 19:06:
>
>> I understand that your MDS shows a high CPU usage, but other than that
>> what is your performance issue? Do users complain? Do some operations
>> take longer than expected? Are OSDs saturated during those phases?
>> Because the cache pressure messages don’t necessarily mean that users
>> will notice.
>> MDS daemons are 

[ceph-users] Re: RGW crashes when rgw_enable_ops_log is enabled

2024-01-26 Thread Marc Singer

Hi Matt

Thanks for your answer.

Should I open a bug report then?

How would I be able to read more from it? Have multiple threads access 
it and read from it simultaneously?


Marc

On 1/25/24 20:25, Matt Benjamin wrote:

Hi Marc,

No, the only thing you need to do with the Unix socket is to keep 
reading from it.  So it probably is getting backlogged.  And while you 
could arrange things to make that less likely, you likely can't make 
it impossible, so there's a bug here.


Matt

On Thu, Jan 25, 2024 at 10:52 AM Marc Singer  wrote:

Hi

I am using a unix socket client to connect with it and read the data
from it.
Do I need to do anything like signal the socket that this data has
been
read? Or am I not reading fast enough and data is backing up?

What I am also noticing that at some point (probably after something
with the ops socket happens), the log level seems to increase for
some
reason? I did not find anything in the logs yet why this would be
the case.

*Normal:*

2024-01-25T15:47:58.444+ 7fe98a5c0b00  1 == starting new
request
req=0x7fe98712c720 =
2024-01-25T15:47:58.548+ 7fe98b700b00  1 == req done
req=0x7fe98712c720 op status=0 http_status=200
latency=0.104001537s ==
2024-01-25T15:47:58.548+ 7fe98b700b00  1 beast: 0x7fe98712c720:
redacted - redacted [25/Jan/2024:15:47:58.444 +] "PUT
/redacted/redacted/chunks/27/27242/27242514_10_4194304 HTTP/1.1" 200
4194304 - "redacted" - latency=0.104001537s

*Close before crashing:
*

   -509> 2024-01-25T14:54:31.588+ 7f5186648b00  1 == starting
new request req=0x7f517ffca720 =
   -508> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s initializing for trans_id =
tx023a42eb7515dcdc0-0065b27627-823feaa-central
   -507> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s getting op 1
   -506> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj verifying requester
   -505> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj normalizing buckets
and tenants
   -504> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj init permissions
   -503> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj recalculating target
   -502> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj reading permissions
   -501> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj init op
   -500> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj verifying op mask
   -499> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj verifying op permissions
   -498> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
2568229052387020224 0.0s s3:put_obj Searching permissions for
identity=rgw::auth::SysReqApplier ->
rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted,
subuser=, perm_mask=15, is_admin=0) mask=50
   -497> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
2568229052387020224 0.0s s3:put_obj Searching permissions for
uid=redacted
   -496> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
2568229052387020224 0.0s s3:put_obj Found permission: 15
   -495> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
2568229052387020224 0.0s s3:put_obj Searching permissions for
group=1 mask=50
   -494> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
2568229052387020224 0.0s s3:put_obj Permissions for group
not found
   -493> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
2568229052387020224 0.0s s3:put_obj Searching permissions for
group=2 mask=50
   -492> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
2568229052387020224 0.0s s3:put_obj Permissions for group
not found
   -491> 2024-01-25T14:54:31.588+ 7f5186648b00  5 req
2568229052387020224 0.0s s3:put_obj -- Getting permissions
done
for identity=rgw::auth::SysReqApplier ->
rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted,
subuser=, perm_mask=15, is_admin=0), owner=redacted, perm=2
   -490> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj verifying op params
   -489> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj pre-executing
   -488> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req
2568229052387020224 0.0s s3:put_obj check rate limiting
   -487> 2024-01-25T14:54:31.588+ 7f5186648b00  2 req

[ceph-users] Re: OSD read latency grows over time

2024-01-26 Thread Roman Pashin
Hi Mark,

In v17.2.7 we enabled a feature that automatically performs a compaction
>> if too many tombstones are present during iteration in RocksDB.  It
>> might be worth upgrading to see if it helps (you might have to try
>> tweaking the settings if the defaults aren't helping enough).  The PR is
>> here:
>>
>> https://github.com/ceph/ceph/pull/50893
>
>
We upgraded Ceph to v17.2.7 yesterday. Unfortunately I still see growing
latency on the OSDs hosting the index pool. Will try to tune the
rocksdb_cf_compact_on_deletion options as you suggested.

I've started with decreasing deletion_trigger from 16384 to 512 with:

# ceph tell 'osd.*' injectargs '--rocksdb_cf_compact_on_deletion_trigger
512'

At first glance, nothing has changed in the per-OSD latency graphs. I've tried
decreasing it to 32 deletions per window on a single OSD where I see
increasing latency, to force compactions, but per the graphs nothing has changed
after approx. 40 minutes.

# ceph tell 'osd.435' injectargs '--rocksdb_cf_compact_on_deletion_trigger
32'

Didn't touch rocksdb_cf_compact_on_deletion_sliding_window yet, it is set
with default 32768 entries.

Do you know if rocksdb_cf_compact_on_deletion_trigger and
rocksdb_cf_compact_on_deletion_sliding_window can be changed at runtime
without an OSD restart?

--
Thank you,
Roman
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Throughput metrics missing iwhen updating Ceph Quincy to Reef

2024-01-26 Thread Eugen Block

Yes, my dashboard looks good here as well. :-)

Zitat von Martin :


Hi Eugen,

Yes, you are right.

After the upgrade from v18.2.0 to v18.2.1 it is necessary to create the
ceph-exporter service manually and deploy it to all hosts.

The dashboard is fine as well.

Thanks for help.
Martin

On 26/01/2024 00:17, Eugen Block wrote:

Ah, there they are (different port):

reef01:~ # curl http://localhost:9926/metrics | grep ceph_osd_op | head
  % Total    % Received % Xferd  Average Speed   Time    Time Time  Current
 Dload  Upload   Total   Spent Left  Speed
100  124k  100  124k    0 0   111M  0 --:--:-- --:--:--  
--:--:--  121M

# HELP ceph_osd_op Client operations
# TYPE ceph_osd_op counter
ceph_osd_op{ceph_daemon="osd.1"} 25
ceph_osd_op{ceph_daemon="osd.4"} 543
ceph_osd_op{ceph_daemon="osd.5"} 12192
# HELP ceph_osd_op_delayed_degraded Count of ops delayed due to  
target object being degraded

# TYPE ceph_osd_op_delayed_degraded counter
ceph_osd_op_delayed_degraded{ceph_daemon="osd.1"} 0
ceph_osd_op_delayed_degraded{ceph_daemon="osd.4"} 0
ceph_osd_op_delayed_degraded{ceph_daemon="osd.5"} 0

I can't check the dashboard right now, that I will definitely do tomorrow.
Good night!

Zitat von Eugen Block :


Yeah, it's mentioned in the upgrade docs [2]:


Monitoring & Alerting
  Ceph-exporter: Now the performance metrics for Ceph daemons  
are exported by ceph-exporter, which deploys on each daemon  
rather than using prometheus exporter. This will reduce  
performance bottlenecks.



[2]  
https://docs.ceph.com/en/latest/releases/reef/#major-changes-from-quincy


Zitat von Eugen Block :


Hi,

I got those metrics back after setting:

reef01:~ # ceph config set mgr mgr/prometheus/exclude_perf_counters false

reef01:~ # curl http://localhost:9283/metrics | grep ceph_osd_op | head
 % Total    % Received % Xferd  Average Speed   Time Time  
Time  Current

    Dload  Upload   Total Spent    Left  Speed
100  324k  100  324k    0 0  72.5M  0 --:--:-- --:--:--  
--:--:-- 79.1M

# HELP ceph_osd_op Client operations
# TYPE ceph_osd_op counter
ceph_osd_op{ceph_daemon="osd.0"} 139650.0
ceph_osd_op{ceph_daemon="osd.11"} 9711090.0
ceph_osd_op{ceph_daemon="osd.2"} 3864.0
ceph_osd_op{ceph_daemon="osd.1"} 25.0
ceph_osd_op{ceph_daemon="osd.4"} 543.0
ceph_osd_op{ceph_daemon="osd.5"} 12192.0
ceph_osd_op{ceph_daemon="osd.3"} 3661521.0
ceph_osd_op{ceph_daemon="osd.6"} 2030.0


I found the option in the docs [1], and the same section is in the quincy
docs as well, although there's no such option in my quincy cluster; maybe
that's why the quincy cluster still exports those performance counters:


quincy-1:~ # ceph config get mgr mgr/prometheus/exclude_perf_counters
Error ENOENT: unrecognized key 'mgr/prometheus/exclude_perf_counters'

Anyway, this should bring back the metrics the "legacy" way (I  
guess). Apparently, the ceph-exporter daemon is now required on  
your hosts to collect those metrics.
After adding the ceph-exporter service (ceph orch apply  
ceph-exporter) and setting mgr/prometheus/exclude_perf_counters  
back to "true" I see that there are "ceph_osd_op" metrics defined  
but no values yet. Apparently, I'm still missing something, I'll  
check tomorrow. But this could/should be in the upgrade docs IMO.
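
(To recap the two options discussed in this thread as plain commands, both
taken from the outputs above; the ports are the ones shown in the curl
examples:)

# ceph config set mgr mgr/prometheus/exclude_perf_counters false   # keep the mgr module exporting per-daemon counters on port 9283
# ceph orch apply ceph-exporter                                     # or deploy the new per-host exporters, scraped on port 9926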


Regards,
Eugen

[1]  
https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics


Zitat von Martin :


Hi,

Confirmed that this happens to me as well.
After upgrading from 18.2.0 to 18.2.1 OSD metrics  
like: ceph_osd_op_* are missing from ceph-mgr.


The Grafana dashboard also doesn't display all graphs correctly.

ceph-dashboard/Ceph - Cluster : Capacity used, Cluster I/O, OSD  
Capacity Utilization, PGs per OSD


curl http://localhost:9283/metrics | grep -i ceph_osd_op
  % Total    % Received % Xferd  Average Speed   Time Time  
Time  Current
 Dload  Upload   Total Spent     
Left  Speed
100 38317  100 38317    0 0   9.8M  0 --:--:-- --:--:--  
--:--:-- 12.1M


Before upgrading to reef 18.2.1 I could get all the metrics.

Martin

On 18/01/2024 12:32, Jose Vicente wrote:

Hi,
After upgrading from Quincy to Reef the ceph-mgr daemon is no longer
exposing some throughput OSD metrics like ceph_osd_op_*

curl http://localhost:9283/metrics | grep -i ceph_osd_op
  % Total    % Received % Xferd  Average Speed   Time  Time      
Time  Current
                                 Dload  Upload   Total Spent    
 Left  Speed
100  295k  100  295k    0     0   144M      0 --:--:-- --:--:--  
--:--:--  144M

However I can get other metrics like:
# curl http://localhost:9283/metrics | grep -i ceph_osd_apply
# HELP ceph_osd_apply_latency_ms OSD stat apply_latency_ms
# TYPE ceph_osd_apply_latency_ms gauge
ceph_osd_apply_latency_ms{ceph_daemon="osd.275"} 152.0
ceph_osd_apply_latency_ms{ceph_daemon="osd.274"} 102.0
...
Before upgrading to reef (from quincy) I could

[ceph-users] Re: Odd auto-scaler warnings about too few/many PGs

2024-01-26 Thread Eugen Block
If you ask me or Joachim, we'll tell you to disable the autoscaler. ;-) It
doesn't seem mature enough yet, especially with many pools. There have
been multiple threads in the past discussing this topic; I'd suggest
leaving it disabled. Or you could help improve it, maybe create a
tracker issue if there isn't already an open one.
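
(For reference, a rough sketch of what switching it off looks like, per pool
and as a default for new pools; the pool name is a placeholder:)

# ceph osd pool set <pool> pg_autoscale_mode off
# ceph config set global osd_pool_default_pg_autoscale_mode off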


Zitat von Torkil Svensgaard :


Hi

A few years ago we were really strapped for space, so we tweaked
pg_num for some pools to ensure all PGs were as close to the same
size as possible while still observing the power-of-2 rule, in order
to get the most mileage space-wise. We set the auto-scaler to off
for the tweaked pools to get rid of the warnings.


We now have a lot more free space so I flipped the auto-scaler to  
warn for all pools and set the bulk flag for the pools expected to  
be data pools, leading to this:


"
[WRN] POOL_TOO_FEW_PGS: 4 pools have too few placement groups
Pool rbd has 512 placement groups, should have 2048
Pool rbd_internal has 1024 placement groups, should have 2048
Pool cephfs.nvme.data has 32 placement groups, should have 4096
Pool cephfs.ssd.data has 32 placement groups, should have 1024
[WRN] POOL_TOO_MANY_PGS: 4 pools have too many placement groups
Pool libvirt has 256 placement groups, should have 32
Pool cephfs.cephfs.data has 512 placement groups, should have 32
Pool rbd_ec_data has 4096 placement groups, should have 1024
Pool cephfs.hdd.data has 2048 placement groups, should have 1024
"

That's a lot of warnings *ponder*

"
# ceph osd pool autoscale-status
POOL                  SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
libvirt              2567G                3.0         3031T  0.0025                                  1.0     256              warn       False
.mgr                807.5M                2.0         6520G  0.0002                                  1.0       1              warn       False
rbd_ec               9168k                3.0         6520G  0.0000                                  1.0      32              warn       False
nvme                31708G                2.0        209.5T  0.2955                                  1.0    2048              warn       False
.nfs                 36864                3.0         6520G  0.0000                                  1.0      32              warn       False
cephfs.cephfs.meta  24914M                3.0         6520G  0.0112                                  4.0      32              warn       False
cephfs.cephfs.data   16384                3.0         6520G  0.0000                                  1.0     512              warn       False
rbd.ssd.data        798.1G               2.25         6520G  0.2754                                  1.0      64              warn       False
rbd_ec_data         609.2T                1.5         3031T  0.3014                                  1.0    4096              warn       True
rbd                 68170G                3.0         3031T  0.0659                                  1.0     512              warn       True
rbd_internal        69553G                3.0         3031T  0.0672                                  1.0    1024              warn       True
cephfs.nvme.data         0                2.0        209.5T  0.0000                                  1.0      32              warn       True
cephfs.ssd.data     68609M                2.0         6520G  0.0206                                  1.0      32              warn       True
cephfs.hdd.data     111.0T               2.25         3031T  0.0824                                  1.0    2048              warn       True

"

"
# ceph df
--- RAW STORAGE ---
CLASS SIZEAVAIL USED  RAW USED  %RAW USED
hdd3.0 PiB  1.3 PiB  1.6 PiB   1.6 PiB  54.69
nvme   210 TiB  146 TiB   63 TiB63 TiB  30.21
ssd6.4 TiB  4.0 TiB  2.4 TiB   2.4 TiB  37.69
TOTAL  3.2 PiB  1.5 PiB  1.7 PiB   1.7 PiB  53.07

--- POOLS ---
POOLID   PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
rbd  4   512   80 TiB   21.35M  200 TiB  19.31278 TiB
libvirt  5   256  3.0 TiB  810.89k  7.5 TiB   0.89278 TiB
rbd_internal 6  1024   86 TiB   28.22M  204 TiB  19.62278 TiB
.mgr 8 1  4.3 GiB1.06k  1.6 GiB   0.071.0 TiB
rbd_ec  1032   55 MiB   25   27 MiB  0708 GiB
rbd_ec_data 11  4096  683 TiB  180.52M  914 TiB  52.26556 TiB
nvme23  2048   46 TiB   25.18M   62 TiB  31.62 67 TiB
.nfs2532  4.6 KiB   10  108 KiB  0708 GiB
cephfs.cephfs.meta  3132   25 GiB1.66M   73 GiB   3.32708 GiB
cephfs.cephfs.data  32   679489 B   40.41M   48 KiB  0708 GiB
cephfs.nvme.data3432  0 B0  0 B  0 67 TiB
cephfs.ssd.data 3532   77 GiB  425.03k  134 GiB   5.941.0 TiB
cephfs.hdd.data 37 

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Eugen Block
Performance for small files is more about IOPS than throughput, and the
IOPS in your fio tests look okay to me. What you could try is to split
the PGs to get around 150 or 200 PGs per OSD; you're currently at around
60 according to the ceph osd df output. Before you do that, can you share
'ceph pg ls-by-pool cephfs.ud-data.data | head'? I don't need the whole
output, just to see how many objects each PG has. We had a case once where
that helped, but it was an older cluster and the pool was backed by HDDs
with separate RocksDB on SSDs. So this might not be the solution here, but
it could improve things as well.
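
(If you go down that route, the split itself is just a pg_num bump on the
data pool. A rough sketch only; the target value is an example and this
assumes the autoscaler won't immediately scale it back:)

# ceph osd pool set cephfs.ud-data.data pg_num 2048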



Zitat von Özkan Göksu :


Every user has one subvolume and I only have one pool.
At the beginning we were using each subvolume for the LDAP home directory
plus user data.
When a user logged in to any Docker container on any host, it used the
cluster for home, and for user-related data we had a second directory in
the same subvolume.
From time to time users experienced a very slow home environment, and after
a month it became almost impossible to use home. VNC sessions became
unresponsive and slow, etc.

Two weeks ago I had to migrate home to ZFS storage, and now the overall
performance is better with only user_data and without home.
But the performance is still not as good as I expected because of the
MDS-related problems.
The usage is low but allocation is high, and CPU usage is high. You saw the
IO op/s; it's nothing, yet allocation is high.

I developed a fio benchmark script and ran it on 4 test servers at the same
time; the results are below:
Script:
https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh

https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt

While running the benchmark, I took sample values for each type of iobench run.

Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
client:   70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr
client:   60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr
client:   13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr

Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
client:   1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr
client:   370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr

Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
client:   63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr
client:   14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr
client:   6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr

Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
client:   317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr
client:   2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr
client:   4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr
client:   2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr

It seems I only have problems with the smaller block sizes (4K, 8K, 16K).
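
(For anyone who wants to reproduce a single data point without the script,
the runs above roughly correspond to an invocation like this; the job name
and directory are placeholders and libaio is assumed:)

# fio --name=randread-4k --directory=/mnt/cephfs/bench --rw=randread --bs=4k \
      --size=1G --direct=1 --numjobs=3 --iodepth=32 --ioengine=libaio --group_reporting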




Eugen Block wrote on Thu, 25 Jan 2024 at 19:06:


I understand that your MDS shows a high CPU usage, but other than that
what is your performance issue? Do users complain? Do some operations
take longer than expected? Are OSDs saturated during those phases?
Because the cache pressure messages don’t necessarily mean that users
will notice.
MDS daemons are single-threaded so that might be a bottleneck. In that
case multi-active mds might help, which you already tried and
experienced OOM killers. But you might have to disable the mds
balancer as someone else mentioned. And then you could think about
pinning, is it possible to split the CephFS into multiple
subdirectories and pin them to different ranks?
But first I’d still like to know what the performance issue really is.
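
(A minimal sketch of the pinning idea, assuming a CephFS mount at
/mnt/cephfs, two active ranks and placeholder directory names:)

# setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/groupA
# setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/groupB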

Zitat von Özkan Göksu :

> I will try my best to explain my situation.
>
> I don't have a separate mds server. I have 5 identical nodes, 3 of them
> mons, and I use the other 2 as active and standby mds. (currently I have
> left overs from max_mds 4)
>
> root@ud-01:~# ceph -s
>   cluster:
> id: e42fd4b0-313b-11ee-9a00-31da71873773
> health: HEALTH_WARN
> 1 clients failing to respond to cache pressure
>
>   services:
> mon: 3 daemons, quorum ud-01,ud-02,ud-03 (age 9d)
> mgr: ud-01.qycnol(active, since 8d), standbys: ud-02.tfhqfd
> mds: 1/1 daemons up, 4 standby
> osd: 80 osds: 80 up (since 9d), 80 in (since 5M)
>
>   data:
> volumes: 1/1 healthy
> pools:   3 pools, 2305 pgs
> objects: 106.58M objects, 25 TiB
> usage:   45 TiB used, 101 TiB / 146 TiB avail
> pgs: 2303 active+clean
>  2active+clean+scrubbing+deep
>
>   

[ceph-users] Re: Throughput metrics missing when updating Ceph Quincy to Reef

2024-01-26 Thread Martin

Hi Eugen,

Yes, you are right.

After the upgrade from v18.2.0 to v18.2.1 it is necessary to create the
ceph-exporter service manually and deploy it to all hosts.

The dashboard is fine as well.

Thanks for help.
Martin

On 26/01/2024 00:17, Eugen Block wrote:

Ah, there they are (different port):

reef01:~ # curl http://localhost:9926/metrics | grep ceph_osd_op | head
  % Total    % Received % Xferd  Average Speed   Time    Time Time  
Current
 Dload  Upload   Total   Spent Left  
Speed
100  124k  100  124k    0 0   111M  0 --:--:-- --:--:-- 
--:--:--  121M

# HELP ceph_osd_op Client operations
# TYPE ceph_osd_op counter
ceph_osd_op{ceph_daemon="osd.1"} 25
ceph_osd_op{ceph_daemon="osd.4"} 543
ceph_osd_op{ceph_daemon="osd.5"} 12192
# HELP ceph_osd_op_delayed_degraded Count of ops delayed due to target 
object being degraded

# TYPE ceph_osd_op_delayed_degraded counter
ceph_osd_op_delayed_degraded{ceph_daemon="osd.1"} 0
ceph_osd_op_delayed_degraded{ceph_daemon="osd.4"} 0
ceph_osd_op_delayed_degraded{ceph_daemon="osd.5"} 0

I can't check the dashboard right now, that I will definitely do 
tomorrow.

Good night!

Zitat von Eugen Block :


Yeah, it's mentioned in the upgrade docs [2]:


Monitoring & Alerting
  Ceph-exporter: Now the performance metrics for Ceph daemons 
are exported by ceph-exporter, which deploys on each daemon rather 
than using prometheus exporter. This will reduce performance 
bottlenecks.



[2] 
https://docs.ceph.com/en/latest/releases/reef/#major-changes-from-quincy


Zitat von Eugen Block :


Hi,

I got those metrics back after setting:

reef01:~ # ceph config set mgr mgr/prometheus/exclude_perf_counters 
false


reef01:~ # curl http://localhost:9283/metrics | grep ceph_osd_op | head
 % Total    % Received % Xferd  Average Speed   Time Time Time  
Current
    Dload  Upload   Total Spent    Left  
Speed
100  324k  100  324k    0 0  72.5M  0 --:--:-- --:--:-- 
--:--:-- 79.1M

# HELP ceph_osd_op Client operations
# TYPE ceph_osd_op counter
ceph_osd_op{ceph_daemon="osd.0"} 139650.0
ceph_osd_op{ceph_daemon="osd.11"} 9711090.0
ceph_osd_op{ceph_daemon="osd.2"} 3864.0
ceph_osd_op{ceph_daemon="osd.1"} 25.0
ceph_osd_op{ceph_daemon="osd.4"} 543.0
ceph_osd_op{ceph_daemon="osd.5"} 12192.0
ceph_osd_op{ceph_daemon="osd.3"} 3661521.0
ceph_osd_op{ceph_daemon="osd.6"} 2030.0


I found the option in the docs [1], and the same section is in the
quincy docs as well, although there's no such option in my quincy
cluster; maybe that's why the quincy cluster still exports those
performance counters:


quincy-1:~ # ceph config get mgr mgr/prometheus/exclude_perf_counters
Error ENOENT: unrecognized key 'mgr/prometheus/exclude_perf_counters'

Anyway, this should bring back the metrics the "legacy" way (I 
guess). Apparently, the ceph-exporter daemon is now required on your 
hosts to collect those metrics.
After adding the ceph-exporter service (ceph orch apply 
ceph-exporter) and setting mgr/prometheus/exclude_perf_counters back 
to "true" I see that there are "ceph_osd_op" metrics defined but no 
values yet. Apparently, I'm still missing something, I'll check 
tomorrow. But this could/should be in the upgrade docs IMO.


Regards,
Eugen

[1] 
https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics


Zitat von Martin :


Hi,

Confirmed that this happens to me as well.
After upgrading from 18.2.0 to 18.2.1 OSD metrics 
like: ceph_osd_op_* are missing from ceph-mgr.


The Grafana dashboard also doesn't display all graphs correctly.

ceph-dashboard/Ceph - Cluster : Capacity used, Cluster I/O, OSD 
Capacity Utilization, PGs per OSD


curl http://localhost:9283/metrics | grep -i ceph_osd_op
  % Total    % Received % Xferd  Average Speed   Time Time 
Time  Current
 Dload  Upload   Total Spent    
Left  Speed
100 38317  100 38317    0 0   9.8M  0 --:--:-- --:--:-- 
--:--:-- 12.1M


Before upgrading to reef 18.2.1 I could get all the metrics.

Martin

On 18/01/2024 12:32, Jose Vicente wrote:

Hi,
After upgrading from Quincy to Reef the ceph-mgr daemon is no longer
exposing some throughput OSD metrics like ceph_osd_op_*

curl http://localhost:9283/metrics | grep -i ceph_osd_op
  % Total    % Received % Xferd  Average Speed   Time  Time     
Time  Current
                                 Dload  Upload   Total Spent   
 Left  Speed
100  295k  100  295k    0     0   144M      0 --:--:-- --:--:-- 
--:--:--  144M

However I can get other metrics like:
# curl http://localhost:9283/metrics | grep -i ceph_osd_apply
# HELP ceph_osd_apply_latency_ms OSD stat apply_latency_ms
# TYPE ceph_osd_apply_latency_ms gauge
ceph_osd_apply_latency_ms{ceph_daemon="osd.275"} 152.0
ceph_osd_apply_latency_ms{ceph_daemon="osd.274"} 102.0
...
Before upgrading to reef (from quincy) I could get all the metrics.
The MGR prometheus module is enabled.

Rocky Linux release 8.8 (Green 

[ceph-users] Odd auto-scaler warnings about too few/many PGs

2024-01-26 Thread Torkil Svensgaard

Hi

A few years ago we were really strapped for space, so we tweaked pg_num
for some pools to ensure all PGs were as close to the same size as
possible while still observing the power-of-2 rule, in order to get the
most mileage space-wise. We set the auto-scaler to off for the tweaked
pools to get rid of the warnings.


We now have a lot more free space so I flipped the auto-scaler to warn 
for all pools and set the bulk flag for the pools expected to be data 
pools, leading to this:


"
[WRN] POOL_TOO_FEW_PGS: 4 pools have too few placement groups
Pool rbd has 512 placement groups, should have 2048
Pool rbd_internal has 1024 placement groups, should have 2048
Pool cephfs.nvme.data has 32 placement groups, should have 4096
Pool cephfs.ssd.data has 32 placement groups, should have 1024
[WRN] POOL_TOO_MANY_PGS: 4 pools have too many placement groups
Pool libvirt has 256 placement groups, should have 32
Pool cephfs.cephfs.data has 512 placement groups, should have 32
Pool rbd_ec_data has 4096 placement groups, should have 1024
Pool cephfs.hdd.data has 2048 placement groups, should have 1024
"

That's a lot of warnings *ponder*

"
# ceph osd pool autoscale-status
POOL                  SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
libvirt              2567G                3.0         3031T  0.0025                                  1.0     256              warn       False
.mgr                807.5M                2.0         6520G  0.0002                                  1.0       1              warn       False
rbd_ec               9168k                3.0         6520G  0.0000                                  1.0      32              warn       False
nvme                31708G                2.0        209.5T  0.2955                                  1.0    2048              warn       False
.nfs                 36864                3.0         6520G  0.0000                                  1.0      32              warn       False
cephfs.cephfs.meta  24914M                3.0         6520G  0.0112                                  4.0      32              warn       False
cephfs.cephfs.data   16384                3.0         6520G  0.0000                                  1.0     512              warn       False
rbd.ssd.data        798.1G               2.25         6520G  0.2754                                  1.0      64              warn       False
rbd_ec_data         609.2T                1.5         3031T  0.3014                                  1.0    4096              warn       True
rbd                 68170G                3.0         3031T  0.0659                                  1.0     512              warn       True
rbd_internal        69553G                3.0         3031T  0.0672                                  1.0    1024              warn       True
cephfs.nvme.data         0                2.0        209.5T  0.0000                                  1.0      32              warn       True
cephfs.ssd.data     68609M                2.0         6520G  0.0206                                  1.0      32              warn       True
cephfs.hdd.data     111.0T               2.25         3031T  0.0824                                  1.0    2048              warn       True

"

"
# ceph df
--- RAW STORAGE ---
CLASS SIZEAVAIL USED  RAW USED  %RAW USED
hdd3.0 PiB  1.3 PiB  1.6 PiB   1.6 PiB  54.69
nvme   210 TiB  146 TiB   63 TiB63 TiB  30.21
ssd6.4 TiB  4.0 TiB  2.4 TiB   2.4 TiB  37.69
TOTAL  3.2 PiB  1.5 PiB  1.7 PiB   1.7 PiB  53.07

--- POOLS ---
POOLID   PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
rbd  4   512   80 TiB   21.35M  200 TiB  19.31278 TiB
libvirt  5   256  3.0 TiB  810.89k  7.5 TiB   0.89278 TiB
rbd_internal 6  1024   86 TiB   28.22M  204 TiB  19.62278 TiB
.mgr 8 1  4.3 GiB1.06k  1.6 GiB   0.071.0 TiB
rbd_ec  1032   55 MiB   25   27 MiB  0708 GiB
rbd_ec_data 11  4096  683 TiB  180.52M  914 TiB  52.26556 TiB
nvme23  2048   46 TiB   25.18M   62 TiB  31.62 67 TiB
.nfs2532  4.6 KiB   10  108 KiB  0708 GiB
cephfs.cephfs.meta  3132   25 GiB1.66M   73 GiB   3.32708 GiB
cephfs.cephfs.data  32   679489 B   40.41M   48 KiB  0708 GiB
cephfs.nvme.data3432  0 B0  0 B  0 67 TiB
cephfs.ssd.data 3532   77 GiB  425.03k  134 GiB   5.941.0 TiB
cephfs.hdd.data 37  2048  121 TiB   68.42M  250 TiB  23.03371 TiB
rbd.ssd.data3864  934 GiB  239.94k  1.8 TiB  45.82944 GiB
"

The weirdest part:

Pool rbd_ec_data stores 683TB in 4096 pgs -> warning says it should be 1024
Pool rbd_internal stores 86TB in 1024 pgs -> warning says it should be 2048

That makes no sense to me based on the amount of data stored. Is this a
bug, or what am I missing? Ceph version is