[ceph-users] MDS crash in interval_set: FAILED ceph_assert(p->first <= start)

2024-05-07 Thread Dejan Lesjak
Hello,

We have CephFS with two active MDS daemons. Currently rank 1 is repeatedly crashing 
with FAILED ceph_assert(p->first <= start) in the md_log_replay thread. Is there 
any way to work around this and get back to an accessible file system, or should we 
start with disaster recovery?
It seems similar to https://tracker.ceph.com/issues/61009
Crash info:

{
"assert_condition": "p->first <= start",
"assert_file": 
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el9/BUILD/ceph-18.2.2/src/include/interval_set.h",
"assert_func": "void interval_set::erase(T, T, std::function) [with T = inodeno_t; C = std::map]",
"assert_line": 568,
"assert_msg": 
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el9/BUILD/ceph-18.2.2/src/include/interval_set.h:
 In function 'void interval_set::erase(T, T, std::function) 
[with T = inodeno_t; C = std::map]' thread 7fcdaaf8a640 time 
2024-05-08T00:26:22.049974+0200\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el9/BUILD/ceph-18.2.2/src/include/interval_set.h:
 568: FAILED ceph_assert(p->first <= start)\n",
"assert_thread_name": "md_log_replay",
"backtrace": [
"/lib64/libc.so.6(+0x54db0) [0x7fcdb7a54db0]",
"/lib64/libc.so.6(+0xa154c) [0x7fcdb7aa154c]",
"raise()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x188) [0x7fcdb83610ff]",
"/usr/lib64/ceph/libceph-common.so.2(+0x161263) [0x7fcdb8361263]",
"/usr/bin/ceph-mds(+0x1f3b0e) [0x55a5904a9b0e]",
"/usr/bin/ceph-mds(+0x1f3b55) [0x55a5904a9b55]",
"(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4b9d) 
[0x55a5906e1c8d]",
"(EUpdate::replay(MDSRank*)+0x5d) [0x55a5906eacbd]",
"(MDLog::_replay_thread()+0x7a1) [0x55a590694af1]",
"/usr/bin/ceph-mds(+0x1460f1) [0x55a5903fc0f1]",
"/lib64/libc.so.6(+0x9f802) [0x7fcdb7a9f802]",
"/lib64/libc.so.6(+0x3f450) [0x7fcdb7a3f450]"
],
"ceph_version": "18.2.2",
"crash_id": 
"2024-05-07T22:26:22.050652Z_8be89ffb-bb87-4832-9339-57f8bd29f766",
"entity_name": "mds.spod19",
"os_id": "almalinux",
"os_name": "AlmaLinux",
"os_version": "9.3 (Shamrock Pampas Cat)",
"os_version_id": "9.3",
"process_name": "ceph-mds",
"stack_sig": 
"3d0a2ca9b3c7678bf69efc20fff42b588c63f8be1832e1e0c28c99bafc082c15",
"timestamp": "2024-05-07T22:26:22.050652Z",
"utsname_hostname": "spod19.ijs.si",
"utsname_machine": "x86_64",
"utsname_release": "5.14.0-362.8.1.el9_3.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PREEMPT_DYNAMIC Tue Nov 7 14:54:22 EST 2023"
}
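
A hedged sketch of the usual first steps from the CephFS disaster-recovery docs, assuming the filesystem is named 'cephfs' (substitute your own) and rank 1 is the one stuck in replay; take a journal backup before anything destructive:

 # back up and inspect the rank 1 journal before deciding anything
 cephfs-journal-tool --rank=cephfs:1 journal export backup.rank1.bin
 cephfs-journal-tool --rank=cephfs:1 journal inspect
 # last resort, per the disaster-recovery docs (journal entries that cannot
 # be replayed are lost): salvage dentries, then reset the journal
 cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
 cephfs-journal-tool --rank=cephfs:1 journal reset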


Cheers,
Dejan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Numa pinning best practices

2024-05-07 Thread Szabo, Istvan (Agoda)
Hi,

I haven't really found a proper description of how to pin OSDs to NUMA nodes on a 
2-socket system; I've only found this: 
https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#Ceph-Storage-Node-NUMA-Tuning

Does anybody have a good how-to on this topic?

Thank you
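
In case it helps, a hedged sketch of the NUMA-related knobs Ceph itself exposes (OSD id and node number are placeholders; CPU pinning via systemd/numactl is a separate topic):

 # show which NUMA node each OSD's network and storage devices sit on
 ceph osd numa-status
 # osd_numa_auto_affinity (on by default, as far as I know) pins an OSD when
 # NIC and disk agree on a node; to pin explicitly, e.g. osd.12 to node 0:
 ceph config set osd.12 osd_numa_node 0
 ceph orch daemon restart osd.12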


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Removed host in maintenance mode

2024-05-07 Thread Eugen Block

 Error EINVAL: hostname is online, please remove host without --offline.


This is strange; why is it online? I thought you couldn't get it to boot, 
according to your first message. What is the current output of 
'ceph orch host ls', and is the host still listed? Have you failed over the 
mgr, just to be safe?
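
Something along these lines should show that (sketch):

 ceph orch host ls      # is the host still listed, and in which state?
 ceph health detail     # which host does the maintenance warning still reference?
 ceph mgr fail          # fail over the active mgr to rule out stale state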


Zitat von Johan :


Looking at the history I first tried

 ceph orch host rm hostname --offline --force

and then

 ceph orch host rm hostname --force

The second command must have removed the host (partially) because I  
didn't try any other commands after that.



Now when I try these commands again, offline gives me

 Error EINVAL: hostname is online, please remove host without --offline.

and then using only force

 Error EINVAL: host hostname does not exist

As a side note, I had to manually clear the host from the crush map after 
removing it. I also manually removed the keys for the OSDs that remained on 
the host (after the pools recovered/rebalanced).
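
For the archives, a sketch of what that manual cleanup typically corresponds to (hostname and OSD id are placeholders):

 ceph osd crush remove hostname    # drop the now-empty host bucket from the crush map
 ceph auth del osd.12              # remove the leftover keyring of each remaining OSD
 ceph osd rm osd.12                # and the OSD entry itself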


/Johan


Den 2024-05-07 kl. 12:09, skrev Eugen Block:

Hi, did you remove the host from the host list [0]?

ceph orch host rm <hostname> [--force] [--offline]

[0]  
https://docs.ceph.com/en/latest/cephadm/host-management/#offline-host-removal


Zitat von Johan :


Hi all,

In my small cluster of 6 hosts I had trouble with one host (its OSDs)  
and was planning to remove it from the cluster. Before I got to that,  
I needed to power down the host and therefore put it in  
maintenance mode. Due to some mistakes on my part I couldn't boot  
the host again and simply decided to force its removal from the  
cluster.


The host is now removed, but ceph (17.2.7) keeps complaining about  
it being in maintenance mode.


How can I remove the last remnants of this host from the cluster?

/Johan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Problem with take-over-existing-cluster.yml playbook

2024-05-07 Thread vladimir franciz blando
I know that only a few people are using this playbook, but I'm trying my luck here
in case someone has hit the same issue.

But first, who has successfully used this playbook, and what version did you
use? I'm using this guide in my test environment:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/installation_guide_for_red_hat_enterprise_linux/importing-an-existing-ceph-cluster-to-ansible

Error encountered
---
TASK [Generate ceph configuration file] *****************************************
fatal: [vladceph-1]: FAILED! =>
  msg: '''_monitor_addresses'' is undefined. ''_monitor_addresses'' is undefined'
fatal: [vladceph-3]: FAILED! =>
  msg: '''_monitor_addresses'' is undefined. ''_monitor_addresses'' is undefined'
fatal: [vladceph-2]: FAILED! =>
  msg: '''_monitor_addresses'' is undefined. ''_monitor_addresses'' is undefined'
---
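
A hedged guess rather than a confirmed fix: ceph-ansible derives _monitor_addresses from the monitor_address / monitor_interface variables, so a first check could look like the sketch below (paths and the interface name are assumptions):

 # is a monitor address or interface defined for the mon hosts?
 grep -R "monitor_interface\|monitor_address" group_vars/ hosts
 # if not, set one, e.g. in group_vars/all.yml (assuming eth0 carries the
 # public network):  monitor_interface: eth0
 ansible-playbook -i hosts infrastructure-playbooks/take-over-existing-cluster.yml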



Regards,
Vlad Blando 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Removed host in maintenance mode

2024-05-07 Thread Johan

Looking at the history I first tried

 ceph orch host rm hostname --offline --force

and then

 ceph orch host rm hostname --force

The second command must have removed the host (partially) because I 
didn't try any other commands after that.



Now when I try these commands again, offline gives me

 Error EINVAL: hostname is online, please remove host without --offline.

and then using only force

 Error EINVAL: host hostname does not exist

As a side note, I had to manually clear the host from the crush map after 
removing it. I also manually removed the keys for the OSDs that remained on 
the host (after the pools recovered/rebalanced).


/Johan


Den 2024-05-07 kl. 12:09, skrev Eugen Block:

Hi, did you remove the host from the host list [0]?

ceph orch host rm <hostname> [--force] [--offline]

[0] 
https://docs.ceph.com/en/latest/cephadm/host-management/#offline-host-removal


Zitat von Johan :


Hi all,

In my small cluster of 6 hosts I had trouble with one host (its OSDs) and 
was planning to remove it from the cluster. Before I got to that, I 
needed to power down the host and therefore put it in maintenance 
mode. Due to some mistakes on my part I couldn't boot the host again 
and simply decided to force its removal from the cluster.


The host is now removed, but ceph (17.2.7) keeps complaining about it 
being in maintenance mode.


How can I remove the last remnants of this host from the cluster?

/Johan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm upgrade: heartbeat failures not considered

2024-05-07 Thread Eugen Block

Hi,

we're facing an issue during upgrades (and sometimes server reboots);  
it appears to occur when (at least) one of the MONs has to do a full  
sync. I'm wondering if the upgrade procedure could be improved in  
that regard, but I'll come back to that later. First, I'll try to  
summarize the events.
We upgraded to the latest Pacific (16.2.15) last week. Note that the  
filesystems of the MONs are on HDDs (we're planning to move to flash;  
we already noticed issues during disaster recovery on HDDs last year).


According to the logs the MGRs and MONs were upgraded successfully and  
a quorum was formed at:


2024-04-30T12:57:22.347182+ mon.ndeceph03 (mon.0) 1208046 :  
cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down,  
quorum ndeceph03,ndeceph01)


Then ceph started the OSD upgrades a few minutes later, the  
first one at:


2024-04-30T13:00:42.733528+ mon.ndeceph03 (mon.0) 101 : cluster  
[INF] osd.25 marked itself down and dead


The OSDs on ndeceph01 were upgraded one by one:

  2024-04-30T13:01:46.517+ 7fbd78a31700  0 [cephadm INFO  
cephadm.upgrade] Upgrade: Updating osd.19 (5/7)
  2024-04-30T13:01:46.517+ 7fbd78a31700  0  
log_channel(cephadm) log [INF] : Upgrade: Updating osd.19 (5/7)
  2024-04-30T13:02:24.800+ 7fbd78a31700  0 [cephadm INFO  
cephadm.upgrade] Upgrade: Updating osd.22 (6/7)
  2024-04-30T13:02:24.800+ 7fbd78a31700  0  
log_channel(cephadm) log [INF] : Upgrade: Updating osd.22 (6/7)
  2024-04-30T13:02:48.220+ 7fbd78a31700  0 [cephadm INFO  
cephadm.upgrade] Upgrade: Updating osd.29 (7/7)
  2024-04-30T13:02:48.220+ 7fbd78a31700  0  
log_channel(cephadm) log [INF] : Upgrade: Updating osd.29 (7/7)


But the mon service on ndeceph02 (also an OSD server) was still syncing  
(for around 6 minutes):


2024-04-30T13:02:33.124+ 7f1c2700  1  
mon.ndeceph02@2(synchronizing) e58 handle_auth_request failed to  
assign global_id
2024-04-30T13:08:30.123+ 7f1c2700  1  
mon.ndeceph02@2(synchronizing) e58 handle_auth_request failed to  
assign global_id


All HDD OSDs on ndeceph02 (not yet upgraded) were complaining about  
timeouts and apparently tried to reboot multiple times (I wonder why  
the SSDs didn't complain, though):


2024-04-30T13:03:54.374+ 7f2e2d1ce700  1 heartbeat_map is_healthy  
'OSD::osd_op_tp thread 0x7f2e1399b700' had timed out after 15.00954s
2024-04-30T13:03:54.374+ 7f2e2d1ce700  1 osd.0 367411 is_healthy  
false -- internal heartbeat failed
2024-04-30T13:03:54.374+ 7f2e2d1ce700  1 osd.0 367411 not healthy;  
waiting to boot
2024-04-30T13:03:55.330+ 7f2e2d1ce700  1 heartbeat_map is_healthy  
'OSD::osd_op_tp thread 0x7f2e1399b700' had timed out after 15.00954s
2024-04-30T13:03:55.330+ 7f2e2d1ce700  1 osd.0 367411 is_healthy  
false -- internal heartbeat failed
2024-04-30T13:03:55.330+ 7f2e2d1ce700  1 osd.0 367411 not healthy;  
waiting to boot

...
2024-04-30T13:03:59.322+ 7f2e201b4700  1 osd.0 367414 state:  
booting -> active
2024-04-30T13:05:22.184+ 7f2e201b4700  1 osd.0 367433 state:  
booting -> active
2024-04-30T13:06:39.602+ 7f2e201b4700  1 osd.0 367447 state:  
booting -> active
2024-04-30T13:08:02.612+ 7f2e201b4700  1 osd.0 367454 state:  
booting -> active


During this syncing period the disk utilization of the OS filesystem  
was at 100%, but ceph kept upgrading other OSD daemons. There are 3  
main hosts, failure domain is host, replicated pools with min_size 2,  
size 3.
If the OSDs from one host are struggling (because of disk IO on the  
filesystem, apparently) and ceph keeps upgrading others, we get  
inactive PGs.


So what I'm wondering about is, how does the orchestrator decide if  
it's ok to stop one OSD ('ceph osd ok-to-stop <id>' is the manual  
command) while others obviously are not healthy and would cause  
inactive PGs? The cluster did notice slow requests and reported  
messages like these:


2024-04-30T13:02:58.867499+ osd.28 (osd.28) 10529 : cluster [WRN]  
Monitor daemon marked osd.28 down, but it is still running


If some OSDs clearly aren't healthy, I would expect the orchestrator  
to pause the upgrade. After the mon full sync completed, it started  
upgrading the OSDs on ndeceph02 as well, but the inactive PGs were  
only resolved after several more minutes when almost all OSDs from  
that tree branch had been upgraded:


2024-04-30T13:16:17.536359+ mon.ndeceph03 (mon.0) 1461 : cluster  
[INF] Health check cleared: PG_AVAILABILITY (was: Reduced data  
availability: 2 pgs inactive, 11 pgs peering)


During a planned upgrade this issue can be mitigated via staggered  
upgrades (now that I know what the cause is): upgrade MGRs and MONs  
first, wait until everything has settled, then continue with the OSDs.  
But after a reboot there's no way to control that, of course. I helped  
a customer last year with a mon sync issue, so we might be able to  
improve things a bit until we have flash disks.
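
For reference, a minimal sketch of such a staggered upgrade with cephadm (target version and host name are placeholders; --daemon-types/--hosts require a reasonably recent Pacific or later mgr):

 # upgrade only mgr and mon daemons first
 ceph orch upgrade start --ceph-version 16.2.15 --daemon-types mgr,mon
 ceph orch upgrade status
 # once the mons are settled, continue with the OSDs, host by host
 ceph orch upgrade start --ceph-version 16.2.15 --daemon-types osd --hosts ndeceph01
 # pause if anything looks unhealthy
 ceph orch upgrade pause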

[ceph-users] Re: [EXTERN] Re: cache pressure?

2024-05-07 Thread Erich Weiler
I still saw client cache pressure messages, although I think it did help 
a bit in general.  What I additionally just did (about 5 minutes ago) 
was reduce "mds_recall_max_caps" from 30,000 to 10,000 after looking at 
this post:


https://www.spinics.net/lists/ceph-users/msg73188.html

I will try further reducing mds_recall_max_caps if the pressure 
messages keep coming up.  After reducing it to 10,000 a few client cache 
pressure warnings cleared, but I don't know yet whether that was the 
reason or just luck.  If it stays clear then I'll call it solved.
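
For the archives, the change itself is just a runtime config set; a minimal sketch using the values discussed above:

 ceph config set mds mds_recall_max_caps 10000    # lower the recall limit for all MDS daemons
 ceph config get mds mds_recall_max_caps          # confirm the new value
 ceph health detail                               # watch whether the cache-pressure warnings clear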


-erich

On 5/7/24 6:55 AM, Dietmar Rieder wrote:

On 4/26/24 23:51, Erich Weiler wrote:
As Dietmar said, VS Code may cause this. Quite funny to read, 
actually, because we've been dealing with this issue for over a year, 
and yesterday was the very first time Ceph complained about a client 
and we saw VS Code's remote stuff running. Coincidence.


I'm holding my breath that the vscode issue is the one affecting us - 
I got my users to tweak their vscode configs and the problem seemed to 
go away, but I guess I won't consider it 'solved' until a few days 
pass without it coming back...  :)


I wonder if the vscode configs solved your issues, or if you still see 
the cache pressure messages?


Dietmar

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERN] Re: cache pressure?

2024-05-07 Thread Dietmar Rieder

On 4/26/24 23:51, Erich Weiler wrote:
As Dietmar said, VS Code may cause this. Quite funny to read, 
actually, because we've been dealing with this issue for over a year, 
and yesterday was the very first time Ceph complained about a client 
and we saw VS Code's remote stuff running. Coincidence.


I'm holding my breath that the vscode issue is the one affecting us - I 
got my users to tweak their vscode configs and the problem seemed to go 
away, but I guess I won't consider it 'solved' until a few days pass 
without it coming back...  :)


I wonder if the vscode configs solved your issues, or if you still see 
the cache pressure messages?


Dietmar


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Guidance on using large RBD volumes - NTFS

2024-05-07 Thread Robert W. Eckert
Hi - at home I have been running CephFS for a few years with reasonably good 
performance; however, exposing CephFS via SMB has been hit and miss. So I thought 
I could carve out space for an RBD device to share from a Windows machine.


My set up:

Ceph 18.2.2 deployed using cephadm

4 servers running RHEL 9 on AMD 5600G CPUs
64 GB RAM each
10 GbE NICs
4x 4 TB HDD
1x 2 TB NVMe for DB/WAL
rbd pool is set to PG autoscaling - it's currently at 256

I have tested the NIC connection between the servers and my PC, and each 
point-to-point link works well at 10 GbE speeds.

Now the problem

I created an 8 TB RBD using

rbd create winshare --size 8T --pool rbd
rbd map winshare

I prepped the drive and formatted it, and it appears cleanly as an 8 TB 
drive.

When I used fio on the drive/volume, speeds were good around 150-200 Mb/s.

Then I started trying to populate the drive from a few different sources, and 
performance took a nosedive: write speeds are about 6-10 Mb/s, and Windows 
Task Manager shows average response times anywhere from 500 ms to 30 seconds, 
mostly around 4 seconds.


I don't see any obvious bottlenecks: CPU on the servers is around 5-10%, 
memory is fine, and the network shows under 1 Gb/s on all servers.


I am wondering whether I should have used different parameters when creating the 
volume, or whether there is a practical limit to the volume size that I exceeded?
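
In case it helps narrow things down, a hedged sketch of a few non-destructive checks (pool/image names from above; run from a Linux client with admin access):

 # confirm image features and object size
 rbd info rbd/winshare
 # raw write throughput against the image, bypassing Windows/NTFS entirely
 rbd bench --io-type write --io-size 4M --io-total 10G rbd/winshare
 # watch per-image client I/O and latency while copying data (needs the
 # rbd_support mgr module enabled, if I remember correctly)
 rbd perf image iostat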

Thanks,

Rob

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD Mirroring with Journaling and Snapshot mechanism

2024-05-07 Thread Eugen Block

Hi,

I'm not the biggest rbd-mirror expert.
As I understand it, if you use one-way mirroring you can fail over to the  
remote site and continue to work there, but there's no failback to the  
primary site. You would need to stop client IO on DR, demote the image  
and then import the remote images back to the primary site. Once  
everything is good you can promote the image on the primary again. The  
rbd-mirror will then most likely be in a split-brain situation, which  
can be resolved by resyncing the images from the primary again. You can't do a  
resync on the primary site because there's no rbd-mirror daemon running there.


Having two-way mirroring could help, I believe. Let's say you lose the  
primary site: you can (force-)promote the images on the remote site and  
continue working. Once the primary site is back up (but not primary  
yet), you can do the image resync from the remote (currently primary)  
site, because there's an rbd-mirror daemon running on the primary site  
as well. Once the primary site has all images promoted, you'll  
probably have to resync on the remote site again to get out of the  
split-brain. But at least you won't need to export/import images.


But you'll need to test this properly to find out if your requirements  
are met.
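
For reference, a hedged sketch of the per-image commands involved (pool/image names are placeholders; which site to run each on is described above):

 rbd mirror image status mypool/myimage           # check replication/health state
 rbd mirror image promote --force mypool/myimage  # on DR, when the primary site is lost
 rbd mirror image demote mypool/myimage           # on the old primary, once it is back
 rbd mirror image resync mypool/myimage           # on the demoted side, to resolve split-brain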


Regards,
Eugen


Zitat von V A Prabha :


Dear Eugen,
We have a DC and DR replication scenario and plan to explore RBD mirroring
with both the journaling and snapshot mechanisms.
I have 5 TB of storage at the primary DC and 5 TB at the DR site, with two
different Ceph clusters configured.

Please clarify the following queries

1. With one-way mirroring, failover works fine with both the journaling and
snapshot mechanisms, and we are able to promote the workload from the DR site. How
does failback work? We wanted to move the contents from DR back to DC, but it fails.

With the journaling mechanism it deletes the entire volume and recreates it afresh,
which does not solve our problem.
2. How does incremental replication work from DR to DC?
3. Does two-way mirroring help in this situation? As I understand it, that
method is for two different clouds with two different storage back-ends,
replicating both clouds' workloads. Does failback work in that scenario?
Please help / guide us to deploy this solution.

Regards
V.A.Prabha


Thanks & Regards,
Ms V A Prabha
Joint Director
Centre for Development of Advanced Computing (C-DAC)
Tidel Park, 8th Floor, D Block (North)
No. 4, Rajiv Gandhi Salai, Taramani
Chennai - 600113
Ph. No.: 044-22542226/27
Fax No.: 044-22542294





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Removed host in maintenance mode

2024-05-07 Thread Eugen Block

Hi, did you remove the host from the host list [0]?

ceph orch host rm <hostname> [--force] [--offline]

[0]  
https://docs.ceph.com/en/latest/cephadm/host-management/#offline-host-removal


Zitat von Johan :


Hi all,

In my small cluster of 6 hosts I had trouble with one host (its OSDs)  
and was planning to remove it from the cluster. Before I got to that,  
I needed to power down the host and therefore put it in  
maintenance mode. Due to some mistakes on my part I couldn't boot  
the host again and simply decided to force its removal from the  
cluster.


The host is now removed, but ceph (17.2.7) keeps complaining about it  
being in maintenance mode.


How can I remove the last remnants of this host from the cluster?

/Johan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Removed host in maintenance mode

2024-05-07 Thread Johan

Hi all,

In my small cluster of 6 hosts I had trouble with one host (its OSDs) and 
was planning to remove it from the cluster. Before I got to that, I 
needed to power down the host and therefore put it in maintenance mode. 
Due to some mistakes on my part I couldn't boot the host again and 
simply decided to force its removal from the cluster.


The host is now removed, but ceph (17.2.7) keeps complaining about it 
being in maintenance mode.


How can I remove the last remnants of this host from the cluster?

/Johan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dashboard issue slowing to a crawl - active ceph mgr process spiking to 600%+

2024-05-07 Thread Eugen Block

Hi,

it's a lot of output to scan through; I'd recommend omitting all  
unnecessary information before pasting. Anyway, this sticks out:


2024-05-01T15:49:26.977+ 7f85688e8700  0 [dashboard ERROR  
frontend.error] (https://172.20.2.30:8443/#/login): Http failure  
response for https://172.20.2.30:8443/api/osd/settings: 401 Unauthorized


Maybe it's just a role issue; you can change that in the dashboard or  
via the CLI. Can you verify that a user with full access can see the  
contents of the RBD tab?
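
A hedged sketch of that check via the CLI (user name is a placeholder):

 ceph dashboard ac-user-show myuser               # which roles does the user have?
 ceph dashboard ac-user-set-roles myuser administrator
 ceph mgr fail                                    # then fail over the mgr to start fresh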


Regards,
Eugen

Zitat von Zachary Perry :


Hello All,

I'm hoping I can get some help with an issue in the dashboard after  
doing a recent bare metal ceph upgrade from

Octopus to Quincy.

** Please note: this message originally described the problem as affecting  
only the Images tab. Shortly after writing it, I found the same issue on  
another cluster that was upgraded from Octopus to Quincy 17.2.7 within the  
last few months, and there it affects all tabs in the Ceph dashboard: it  
slows to a crawl until I restart or fail over the mgr. Both clusters run  
on top of Ubuntu 20.04.


Everything appears to be working fine besides the Block --> Images  
tab. It doesn't matter which node I fail over to; I've tried  
reboots, reinstalling ceph-mgr-dashboard, different browsers, clients, etc.

It will not load the 4 RBDs I have. They appear in rbd ls, I can  
query them, and the connection on the end appliance is  
fine. The loading icons spin infinitely without any failure message.  
If I access the Images tab and then move to  
any other tab in the dashboard, it lets me navigate but doesn't  
display anything until I either restart the service  
on the active mgr or fail over to another one, so it works as expected  
until I access this one tab.



When I use any other section of the dashboard, CPU utilization  
for ceph-mgr is normal, but when I  
access the Images tab it spikes to as high as 600% and stays  
like that until I restart the service or fail  
over the active mgr.

-- Active MGR before clicking Block; the OSDs spike for a second but  
revert to around 5%:


top - 13:43:37 up 8 days, 23:09,  1 user,  load average: 8.08, 5.02, 4.37
Tasks: 695 total,   1 running, 694 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.0 us,  1.6 sy,  0.0 ni, 89.7 id,  1.2 wa,  0.0 hi,  0.5 si,  0.0 st
MiB Mem : 128474.1 total,   6705.6 free,  65684.0 used,  56084.5 buff/cache
MiB Swap:  40927.0 total,  35839.3 free,   5087.7 used.  49253.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  14156 ceph      20   0 3420632   1.9g  13668 S  55.3   1.5 864:49.51 ceph-osd
  13762 ceph      20   0 3421384   1.8g  13432 S  51.3   1.4 960:22.12 ceph-osd
  14163 ceph      20   0 3422352   1.7g  13016 S  50.0   1.3 902:41.19 ceph-osd
  13803 ceph      20   0 3469596   1.8g  13532 S  44.7   1.4 941:55.10 ceph-osd
  13774 ceph      20   0 3427560   1.7g  13656 S  38.7   1.4 932:02.51 ceph-osd
  13801 ceph      20   0 3439796   1.7g  13448 S  37.7   1.3 981:25.55 ceph-osd
  14025 ceph      20   0 3426360   1.8g  13780 S  36.4   1.4 994:00.75 ceph-osd
   9888 nobody    20   0  126100   8696      0 S  21.2   0.0   1106:19 node_exporter
 126798 ceph      20   0 1787824 528000  39464 S   7.9   0.4   0:14.84 ceph-mgr
  13795 ceph      20   0 3420252   1.7g  13264 S   7.6   1.4 990:00.61 ceph-osd
  13781 ceph      20   0 3484476   1.9g  13248 S   6.3   1.5   1040:10 ceph-osd
  13777 ceph      20   0 3408972   1.8g  13464 S   6.0   1.5   1026:21 ceph-osd
  13797 ceph      20   0 3432068   1.6g  13932 S   6.0   1.3 950:39.35 ceph-osd
  13779 ceph      20   0 3471668   1.7g  12728 S   5.6   1.3 984:53.80 ceph-osd
  13768 ceph      20   0 3496064   1.9g  13504 S   5.3   1.5 918:37.48 ceph-osd
  13786 ceph      20   0 3422044   1.6g  13456 S   5.3   1.3 974:29.08 ceph-osd
  13788 ceph      20   0 3454184   1.9g  13048 S   5.3   1.5 980:35.78 ceph-osd
  13776 ceph      20   0 3445680   1.7g  12880 S   5.0   1.3 998:30.58 ceph-osd
  13785 ceph      20   0 3409548   1.7g  13704 S   5.0   1.3 939:37.08 ceph-osd
  14152 ceph      20   0 3465284   1.7g  13840 S   5.0   1.4 959:39.42 ceph-osd
  10339 nobody    20   0 6256048 531428  60188 S   4.6   0.4 239:37.56 prometheus
  13802 ceph      20   0 3430696   1.8g  13872 S   4.6   1.4 924:15.74 ceph-osd
  13791 ceph      20   0 3498876   1.5g  12648 S   4.3   1.2 962:58.37 ceph-osd
  13800 ceph      20   0 3455268   1.7g  12404 S   4.3   1.3   1000:41 ceph-osd
  13790 ceph      20   0 3434364   1.6g  13516 S   3.3   1.3 974:16.46 ceph-osd
  14217 ceph      20   0 3443436   1.8g  13560 S   3.3   1.4 902:54.22 ceph-osd
  13526 ceph      20   0 1012048 499628  11244 S   3.0   0.4 349:35.28 ceph-mon
  13775 ceph      20   0 3367284   1.6g  13940 S   3.0   1.3 878:38.27 ceph-osd
  13784 ceph      20   0 3380960   1.8g  12892 S   3.0   1.4 910:50.47 ceph-osd
  13789 ceph      20   0 3432876   1.6g  12464 S   2.6   1.2