[ceph-users] Re: is LRC plugin still maintained/supposed to work in Reef?

2024-10-09 Thread Eugen Block

Hi,

I haven't seen any updates in the tracker issue [0]. I'm still  
convinced that LRC doesn't work as expected, but I'd like to get  
some confirmation from the devs. The last response via email was that  
it doesn't have priority, although according to telemetry data it  
appears to be in use. I'll ping Radek again.


[0] https://tracker.ceph.com/issues/61861

Zitat von Michel Jouvin :


Hi,

I am resurrecting this old thread, which I started 18 months ago, after  
some new tests. I stopped my initial tests because the cluster I was  
using didn't have enough OSDs to use 'host' as the failure domain. Thus I  
was using 'osd' as the failure domain, and I understood that this was  
unusual and probably not expected to work...


Recently, in another cluster with 3 datacenters and 6 servers (with  
18 to 24 OSDs per server) in each datacenter, I gave the LRC plugin  
another try. The same thing happened again when one of the  
datacenters went down: all PGs from the EC pool using the LRC plugin  
went down. I don't really understand the reason, but I am wondering  
whether this plugin, which is still documented, is really supported and  
supposed to work in Reef. If not, I would like to avoid spending too  
much time troubleshooting it... If somebody is successfully using  
it, I'm interested to hear about it!


My erasure code profile definition is:

crush-device-class=hdd
crush-failure-domain=host
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc
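
For reference, this is roughly how the profile and a pool on top of it  
were created (profile and pool names below are just placeholders, not the  
real ones). If I read the LRC docs correctly, k=9, m=6, l=5 means  
(k+m)/l = 3 locality groups of l+1 = 6 chunks each, i.e. 18 chunks in  
total, which matches the 18 entries in the acting sets shown further  
down in this thread:

ceph osd erasure-code-profile set lrc_hdd_profile \
    plugin=lrc k=9 m=6 l=5 \
    crush-root=default crush-device-class=hdd \
    crush-locality=datacenter crush-failure-domain=host
ceph osd pool create ec-lrc-test 256 256 erasure lrc_hdd_profile

With crush-locality=datacenter this should place one group of 6 chunks  
in each of the 3 datacenters, spread over the 6 hosts there.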

Best regards,

Michel

On 04/05/2023 at 12:51, Michel Jouvin wrote:

Hi,

I had to restart one of my OSD server today and the problem showed  
up again. This time I managed to capture "ceph health detail"  
output showing the problem with the 2 PGs:


[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
    pg 56.1 is down, acting  
[208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
    pg 56.12 is down, acting  
[NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]


I still don't understand why, if I am supposed to survive a  
datacenter failure, I cannot survive 3 OSDs down on the same  
host hosting shards for the PG. In the second case only 2  
OSDs are down, but I'm surprised they don't seem to be in the same  
"group" of OSDs (I'd have expected all the OSDs of one datacenter to  
be in the same group of 5, if the order given really reflects the  
allocation done)...


Still interested in an explanation of what I'm doing wrong! Best regards,

Michel

On 03/05/2023 at 10:21, Eugen Block wrote:
I think I got it wrong with the locality setting. I'm still  
limited by the number of hosts I have available in my test  
cluster, but as far as I got with failure-domain=osd, I believe  
k=6, m=3, l=3 with locality=datacenter could fit your requirement,  
at least with regard to the recovery bandwidth usage between DCs;  
the resiliency, however, would not match your requirement (one DC  
failure). That profile creates 3 groups of 4 chunks (3 data/coding  
chunks and one parity chunk) across three DCs, 12 chunks in total.  
The min_size=7 would not allow an entire DC to go down, I'm  
afraid; you'd have to reduce it to 6 to allow reads/writes in a  
disaster scenario. I'm still not sure I got it right this time,  
but maybe you're better off without the LRC plugin given the  
limited number of hosts. Instead you could use the jerasure plugin  
with a profile like k=4 m=5, allowing an entire DC to fail without  
losing data access (we have one customer using that).
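
A rough sketch of what I mean (untested here; profile name, rule name  
and id are made up). The profile itself is plain jerasure, and a custom  
CRUSH rule is needed to place exactly three of the nine chunks in each  
DC; with the default min_size = k+1 = 5, losing one DC (3 chunks) still  
leaves 6 chunks, so the pool stays readable and writable:

ceph osd erasure-code-profile set jerasure-k4-m5 \
    plugin=jerasure k=4 m=5 technique=reed_sol_van \
    crush-device-class=hdd crush-failure-domain=host

# custom rule in the decompiled CRUSH map: 3 DCs x 3 hosts each
rule ec-k4-m5-dc {
    id 99
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 3 type datacenter
    step chooseleaf indep 3 type host
    step emit
}

The pool would then be created with that profile and switched to the  
custom rule (ceph osd pool set <pool> crush_rule ec-k4-m5-dc).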


Zitat von Eugen Block :


Hi,

disclaimer: I haven't used LRC in a real setup yet, so there  
might be some misunderstandings on my side. But I tried to play  
around with one of my test clusters (Nautilus). Because I'm  
limited in the number of hosts (6 across 3 virtual DCs) I tried  
two different profiles with lower numbers to get a feeling for  
how that works.


# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc  
k=4 m=2 l=3 crush-failure-domain=host


For every l=3 chunks one additional (local) parity chunk is added, so 2 more chunks  
to store ==> 8 chunks in total. Since my failure domain is host  
and I only have 6 hosts, I get incomplete PGs.


# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc  
k=2 m=2 l=2 crush-failure-domain=host


This gives me 6 chunks in total to store across 6 hosts, which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP                   ACTING               SCRUB_STAMP                 DEEP_SCRUB_STAMP
50.0  1        0         0          0        619    0            0           1    active+clean  72s    18410'1  18415:54  [27,13,0,2,25,7]p27  [27,13,0,2,25,7]p27  2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.1   0    0 0   0 

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-09 Thread Eugen Block
I know it doesn't answer your question; I just wanted to point out  
that I'd be interested as well to know which impacts such  
configurations can have. :-) More comments inline.


Zitat von Frank Schilder :


Hi Eugen,

thanks for looking at this. I followed the thread you refer to and  
it doesn't answer my question. Unfortunately, the statement


... It seems to work well, no complaints yet, but note that it's an  
archive cluster, so

the performance requirements aren't very high. ...


is reproducing the rumor that many PGs somehow impact performance in  
a negative way. What is this based on? As I wrote, since the number  
of PGs per OSD times the number of objects per PG equals the number of  
objects per OSD, which is a constant, I don't see an immediate  
justification for the assumption that more PGs imply less performance.  
What do you base that on? I don't see algorithms at work here for  
which splitting PGs could noticeably impact performance in a bad way.


I just assume that if the PG count reaches a certain number, the  
increased amount of parallel requests could overload an OSD. But I  
have no real proof for that assumption. I tend to be quite hesitant to  
"play around" on customer clusters and rather stick to the defaults as  
closely as possible.


On the contrary, my experience with the pools with the highest PG/OSD  
count rather says that reducing the number of objects per PG by  
splitting PGs speeds everything up. Yet the programmers set a quite  
low limit without really explaining why. The docs just state a rumor  
without any solid information a sysadmin/user could use to decide  
whether or not it's worth going high. This is of really high  
interest, because there is probably a critical value at which any  
drawbacks (if they actually exist) outweigh the benefits, and  
without solid information on which algorithms do the main work and  
what complexity class they have, it's impossible to make an  
informed decision or to diagnose whether this has happened.


I second that; we have usually benefitted from PG splits on each  
cluster we maintain as well. But at the same time we have tried to  
avoid going above the recommendations, as already stated. Many default  
values don't match real-world deployments; I've learned that a lot in  
recent years, both with Ceph and OpenStack. Maybe those recommendations  
are a bit outdated, but I'd like to learn as well how far one could go  
and which impacts are to be expected. Unfortunately, I only have a  
couple of virtual test clusters; I'd love to have a hardware test  
cluster to play with. :-D


Do you have performance metrics before/after? Did you actually  
observe any performance degradation? Was there an increased memory  
consumption? Anything that justifies making a statement alluding to  
(potential) negative performance impact?


Unfortunately, I don't have access to the cluster or metrics. And the  
retention time of their Prometheus instance is not very long, so no, I  
don't have anything to show. I can ask them if they did monitor that  
by any chance, but I'm not very confident that they did. :-/



Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Wednesday, October 9, 2024 9:24 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: What is the problem with many PGs per OSD

Hi,

half a year ago I asked a related question
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/I3TQC42KN2FCKYV774VWJ7AVAWTTXEAA/#GLALD3DSTO6NSM2DY2PH4UCE4UBME3HM),  
when we needed to split huge PGs on a customer cluster. I wasn't sure  
either how far we could go with the ratio of PGs per OSD. We increased  
the pg_num to the target value (4096) before the new hardware arrived;  
temporarily the old OSDs (240 * 8 TB) had around 300 PGs/OSD, as it  
wasn't well balanced yet. The new OSDs are larger drives (12 TB), but  
with the same capacity per node, and after all remapping finished and  
the balancer did its job, they're now at around 250 PGs/OSD for the  
smaller drives and 350 PGs/OSD on the larger drives. All OSDs are  
spinners with RocksDB on SSDs. It seems to work well, no complaints  
yet, but note that it's an archive cluster, so the performance  
requirements aren't very high. It's more about resiliency and  
availability in this case.

This is all I can contribute to your question.

Zitat von Anthony D'Atri :


I’ve sprinkled minimizers below.  Free advice and worth every penny.
 ymmv.  Do not taunt Happy Fun Ball.



during a lot of discussions in the past the comment that having
"many PGs per OSD can lead to issues" came up without ever
explaining what these issues will (not might!) be or how one would
notice. It comes up as kind of a rumor without any factual or even
anecdotal backing.


A handful of years ago Sage IIRC retconned PG ratio guidance from
200 to 1

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-09 Thread Eugen Block

Hi,

half a year ago I asked a related question  
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/I3TQC42KN2FCKYV774VWJ7AVAWTTXEAA/#GLALD3DSTO6NSM2DY2PH4UCE4UBME3HM),  
when we needed to split huge PGs on a customer cluster. I wasn't sure  
either how far we could go with the ratio of PGs per OSD. We increased  
the pg_num to the target value (4096) before the new hardware arrived;  
temporarily the old OSDs (240 * 8 TB) had around 300 PGs/OSD, as it  
wasn't well balanced yet. The new OSDs are larger drives (12 TB), but  
with the same capacity per node, and after all remapping finished and  
the balancer did its job, they're now at around 250 PGs/OSD for the  
smaller drives and 350 PGs/OSD on the larger drives. All OSDs are  
spinners with RocksDB on SSDs. It seems to work well, no complaints  
yet, but note that it's an archive cluster, so the performance  
requirements aren't very high. It's more about resiliency and  
availability in this case.


This is all I can contribute to your question.

Zitat von Anthony D'Atri :

I’ve sprinkled minimizers below.  Free advice and worth every penny.  
 ymmv.  Do not taunt Happy Fun Ball.



during a lot of discussions in the past the comment that having  
"many PGs per OSD can lead to issues" came up without ever  
explaining what these issues will (not might!) be or how one would  
notice. It comes up as kind of a rumor without any factual or even  
anecdotal backing.


A handful of years ago Sage IIRC retconned PG ratio guidance from  
200 to 100 to help avoid OOMing, the idea being that more PGs = more  
RAM usage on each daemon that stores the maps.  With BlueStore’s  
osd_memory_target, my sense is that the ballooning seen with  
Filestore is much less of an issue.


As far as I can tell from experience, any increase in resource  
utilization due to an increased PG count per OSD is more than  
offset by the performance gained from the reduced size of the PGs.  
Everything seems to benefit from smaller PGs: recovery, user IO,  
scrubbing.


My understanding is that there is serialization in the PG code, and  
thus the PG ratio can be thought of as the degree of parallelism the  
OSD device can handle.  SAS/SATA SSDs don't seek, so they can handle  
more than HDDs, and NVMe devices can handle more than SAS/SATA.



Yet, I'm holding back on an increase of PG count due to these rumors.


My personal sense:

HDD OSD:  PG ratio 100-200
SATA/SAS SSD OSD: 200-300
NVMe SSD OSD: 300-400

These are not empirical figures.  ymmv.


My situation: I would like to split PGs on large HDDs. Currently,  
we have on average 135 PGs per OSD and I would like to go for  
something like 450.


The good Mr. Nelson may have more precise advice, but my personal  
sense is that I wouldn't go higher than 200 on an HDD.  If you were  
at like 20 (I've seen it!) that would be a different story; my sense  
is that there are diminishing returns over, say, 150.  Seek thrashing  
fu, elevator scheduling fu, op re-ordering fu, etc.  Assuming you're  
on Nautilus or later, it doesn't hurt to experiment with your actual  
workload since you can scale pg_num back down.  Without Filestore  
colocated journals, the seek thrashing may be less of an issue than  
it used to be.


I heard in related rumors that some users have 1000+ PGs per OSD  
without problems.


On spinners?  Or NVMe?  On a 60-120 TB NVMe OSD I’d be sorely  
tempted to try 500-1000.


I would be very much interested in a non-rumor answer, that is, not  
an answer of the form "it might use more RAM", "it might stress  
xyz". I don't care what a rumor says it might do. I would like to  
know what it will do.


It WILL use more RAM.

I'm looking for answers of the form "a PG per OSD requires X amount  
of RAM fixed plus Y amount per object”


Derive the size of your map and multiply by the number of OSDs per  
system.  My sense is that it's on the order of MBs per OSD.  After a  
certain point the extra RAM might have more impact if spent on raising  
osd_memory_target instead.
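
For example (host name and values below are only placeholders; the  
default target is 4 GiB, given in bytes):

# give every OSD 6 GiB instead of the default 4 GiB
ceph config set osd osd_memory_target 6442450944
# or only for hosts that actually have the spare RAM
ceph config set osd/host:bigbox01 osd_memory_target 8589934592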


or "searching/indexing stuff of kind A in N PGs per OSD requires N  
log N/N²/... operations", "peering of N PGs per OSD requires N/N  
log N/N²/N*#peers/... operations". In other words, what are the  
*actual* resources required to host N PGs with M objects on an OSD  
(note that N*M is a constant per OSD). With that info one could  
make an informed decision, informed by facts not rumors.


An additional question of interest is: Has anyone ever observed any  
detrimental effects of increasing the PG count per OSD to large  
values>500?


Consider this scenario:

An unmanaged lab setup used for successive OpenStack deployments,  
each of which created two RBD pools and the panoply of RGW pools.   
Which nobody cleaned up before redeploys, so they accreted like  
plaque in the arteries of an omnivore.  Such that the PG ratio hits  
9000.  Yes, 9000. Then the building loses power.  The systems don’t  
have nearly enough RAM to boot, peer, and activate, so the entire  
cluster has to

[ceph-users] Re: Question about speeding hdd based cluster

2024-10-08 Thread Eugen Block

Sure:

https://docs.ceph.com/en/latest/ceph-volume/lvm/newdb/

In this case you'll have to prepare the db LV beforehand. I haven't  
done that in a while, here's an example from Clyso:


https://docs.clyso.com/blog/ceph-volume-create-wal-db-on-separate-device-for-existing-osd

Note that in a cephadm deployment you'll need to execute that in a  
shell, for example:


cephadm shell --name osd.6  --env  
CEPH_ARGS='--bluestore_block_db_size=1341967564' --  
ceph-bluestore-tool bluefs-bdev-new-db --dev-target /dev/data_vg1/lv4  
--path /var/lib/ceph/osd/ceph-6


Note that these are two different approaches to achieve the same goal:  
one is via 'ceph-volume lvm new-db', the other via  
'ceph-bluestore-tool bluefs-bdev-new-db'. I would assume they both  
work, but I can't tell which one to prefer. I feel like the docs could  
use some clarification on this topic.
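
For completeness, the 'ceph-volume lvm new-db' route would look roughly  
like this (VG/LV names, sizes and the OSD id are placeholders, and in a  
cephadm deployment the ceph-volume calls would also have to run inside  
'cephadm shell --name osd.6'):

# prepare an LV for the DB on the SSD
lvcreate -L 64G -n osd6-db ssd_vg
# with the OSD stopped, attach the LV as the new DB device
ceph-volume lvm new-db --osd-id 6 --osd-fsid <osd-fsid> --target ssd_vg/osd6-db
# then move the existing RocksDB data off the main device
ceph-volume lvm migrate --osd-id 6 --osd-fsid <osd-fsid> --from data --target ssd_vg/osd6-db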


On a similar topic: Does it make sense to use compression on a  
metadata pool?  Would it matter if the metadata pool is on hdd vs ssd?


As already stated, metadata should be on fast devices, independent of  
compression. The metadata pool doesn't hold a lot of data, so I'd  
say there's not much of a benefit in compressing it.
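
If you want to try it anyway, compression is a per-pool setting,  
roughly like this (the pool name is just an example):

ceph osd pool set cephfs_metadata compression_mode aggressive
ceph osd pool set cephfs_metadata compression_algorithm lz4

But as far as I know, BlueStore compression only applies to object  
data, not to omap, and CephFS metadata is mostly omap, so I wouldn't  
expect much from it.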


Zitat von "Kyriazis, George" :


On Oct 7, 2024, at 2:16 AM, Eugen Block  wrote:

Hi, response inline.


Zitat von "Kyriazis, George" :


Thank you all.

The cluster is used mostly for backup of large files currently,  
but we are hoping to use it for home directories (compiles, etc.)  
soon.  Most usage would be for large files, though.


What I've observed with its current usage is that ceph rebalances,  
and proxmox-initiated VM backups bring the storage to its knees.


Would a safe approach be to move the metadata pool to ssd first,  
see how it goes (since it would be cheaper), and then add DB/WAL  
disks?


Moving the metadata to SSDs first is absolutely reasonable and  
relatively cheap since it usually doesn't contain huge amounts of  
data.


How would ceph behave if we are adding DB/WAL disks "slowly" (ie  
one node at a time)?  We have about 100 OSDs (mix hdd/ssd) spread  
across about 25 hosts.  Hosts are server-grade with plenty of  
memory and processing power.


The answer is as always "it depends". If you rebuild the OSDs  
entirely (host-wise) instead of migrating the DB off to SSDs, you  
might encounter slow requests as you already noticed yourself. But  
the whole process would be faster than migrating each DB  
individually.
If you take the migration approach, it would be less invasive, each  
OSD would just have to catch up after restart, reducing the load  
drastically compared to a rebuild. But then again, it would take  
way more time to complete. How large are the OSDs and how much are  
they utilized? Do you have some history how long a host rebuild  
would usually take?




I have no problem destroying and re-creating the OSDs (in place) if  
that’s what it takes.  It will take time to do them all, but if  
“eventually” it works better, then so be it.  Do you happen to have  
a documentation pointer on how to migrate the DB to SSDs?


On a similar topic: Does it make sense to use compression on a  
metadata pool?  Would it matter if the metadata pool is on hdd vs ssd?


Thank you!

George


Thank you!

George



-Original Message-
From: Eugen Block 
Sent: Wednesday, October 2, 2024 2:18 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Question about speeding hdd based cluster

Hi George,

the docs [0] strongly recommend to have dedicated SSD or NVMe OSDs for
the metadata pool. You'll also benefit from dedicated DB/WAL devices.
But as Joachim already stated, it depends on a couple of factors like the
number of clients, the load they produce, file sizes etc. There's  
no easy answer.


Regards,
Eugen

[0] https://docs.ceph.com/en/latest/cephfs/createfs/#creating-pools

Zitat von Joachim Kraftmayer :

> Hi Kyriazis,
>
> depends on the workload.
> I would recommend to add  ssd/nvme DB/WAL to each osd.
>
>
>
> Joachim Kraftmayer
>
> www.clyso.com
>
> Hohenzollernstr. 27, 80801 Munich
>
> Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306
>
> Kyriazis, George  schrieb am Mi., 2. Okt.
> 2024,
> 07:37:
>
>> Hello ceph-users,
>>
>> I’ve been wondering…. I have a proxmox hdd-based cephfs pool with no
>> DB/WAL drives.  I also have ssd drives in this setup used for  
other pools.

>>
>> What would increase the speed of the hdd-based cephfs more, and in
>> what usage scenarios:
>>
>> 1. Adding ssd/nvme DB/WAL drives for each node 2. Moving the metadata
>> pool for my cephfs to ssd 3. Increasing the performance of the
>> network.  I currently have 10gbe links.
>>
>> It doesn’t 

[ceph-users] Re: Question about speeding hdd based cluster

2024-10-08 Thread Eugen Block

(Resending, apparently my mail didn't get to the ML)
Sure:

https://docs.ceph.com/en/latest/ceph-volume/lvm/newdb/

In this case you'll have to prepare the db LV beforehand. I haven't  
done that in a while, here's an example from Clyso:


https://docs.clyso.com/blog/ceph-volume-create-wal-db-on-separate-device-for-existing-osd

Note that in a cephadm deployment you'll need to execute that in a  
shell, for example:


cephadm shell --name osd.6  --env  
CEPH_ARGS='--bluestore_block_db_size=1341967564' --  
ceph-bluestore-tool bluefs-bdev-new-db --dev-target /dev/data_vg1/lv4  
--path /var/lib/ceph/osd/ceph-6


and don't forget to migrate the data, otherwise you might encounter spillover:

cephadm shell --name osd.6 -- ceph-bluestore-tool bluefs-bdev-migrate  
--dev-target /var/lib/ceph/osd/ceph-6/block.db --path  
/var/lib/ceph/osd/ceph-6 --devs-source /var/lib/ceph/osd/ceph-6/block


Note that these are two different approaches to achieve the same goal:  
one is via 'ceph-volume lvm new-db', the other via  
'ceph-bluestore-tool bluefs-bdev-new-db'. I would assume they both  
work, but I can't tell which one to prefer. I feel like the docs could  
use some clarification on this topic.


On a similar topic: Does it make sense to use compression on a  
metadata pool?  Would it matter if the metadata pool is on hdd vs ssd?


As already stated, metadata should be on fast devices, independent of  
compression. The metadata pool doesn't hold a lot of data, so I'd  
say there's not much of a benefit in compressing it.


Zitat von "Kyriazis, George" :


On Oct 7, 2024, at 2:16 AM, Eugen Block  wrote:

Hi, response inline.


Zitat von "Kyriazis, George" :


Thank you all.

The cluster is used mostly for backup of large files currently,  
but we are hoping to use it for home directories (compiles, etc.)  
soon.  Most usage would be for large files, though.


What I've observed with its current usage is that ceph rebalances,  
and proxmox-initiated VM backups bring the storage to its knees.


Would a safe approach be to move the metadata pool to ssd first,  
see how it goes (since it would be cheaper), and then add DB/WAL  
disks?


Moving the metadata to SSDs first is absolutely reasonable and  
relatively cheap since it usually doesn't contain huge amounts of  
data.


How would ceph behave if we are adding DB/WAL disks "slowly" (ie  
one node at a time)?  We have about 100 OSDs (mix hdd/ssd) spread  
across about 25 hosts.  Hosts are server-grade with plenty of  
memory and processing power.


The answer is as always "it depends". If you rebuild the OSDs  
entirely (host-wise) instead of migrating the DB off to SSDs, you  
might encounter slow requests as you already noticed yourself. But  
the whole process would be faster than migrating each DB  
individually.
If you take the migration approach, it would be less invasive, each  
OSD would just have to catch up after restart, reducing the load  
drastically compared to a rebuild. But then again, it would take  
way more time to complete. How large are the OSDs and how much are  
they utilized? Do you have some history how long a host rebuild  
would usually take?




I have no problem destroying and re-creating the OSDs (in place) if  
that’s what it takes.  It will take time to do them all, but if  
“eventually” it works better, then so be it.  Do you happen to have  
a documentation pointer on how to migrate the DB to SSDs?


On a similar topic: Does it make sense to use compression on a  
metadata pool?  Would it matter if the metadata pool is on hdd vs ssd?


Thank you!

George


Thank you!

George



-Original Message-
From: Eugen Block 
Sent: Wednesday, October 2, 2024 2:18 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Question about speeding hdd based cluster

Hi George,

the docs [0] strongly recommend to have dedicated SSD or NVMe OSDs for
the metadata pool. You'll also benefit from dedicated DB/WAL devices.
But as Joachim already stated, it depends on a couple of factors like the
number of clients, the load they produce, file sizes etc. There's  
no easy answer.


Regards,
Eugen

[0] https://docs.ceph.com/en/latest/cephfs/createfs/#creating-pools

Zitat von Joachim Kraftmayer :

> Hi Kyriazis,
>
> depends on the workload.
> I would recommend to add  ssd/nvme DB/WAL to each osd.
>
>
>
> Joachim Kraftmayer
>
> www.clyso.com
>
> Hohenzollernstr. 27, 80801 Munich
>
> Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306
>
> Kyriazis, George  schrieb am Mi., 2. Okt.
> 2024,
> 07:37:
>
>> Hello ceph-users,
>>
>> I’ve been wondering…. I have a proxmox hdd-based cephfs pool with no
>> DB/WAL drives.  I also have ssd drives in this setup used for  
other pools.

>>
>> What would increase the speed 

[ceph-users] Re: About scrub and deep-scrub

2024-10-07 Thread Eugen Block
Well, if you have more PGs, each individual deep-scrub finishes faster,  
but you have more PGs to scrub. But that way you can stretch the  
deep-scrub interval. As I said, I cannot recommend increasing them  
in general; overloading OSDs with too many PGs can have a negative  
effect as well. If you can, I'd probably rather add more (smaller)  
OSDs to spread the PGs.


Zitat von Phong Tran Thanh :


Hi Eugen

Adding PGs to the cluster only helps to reduce the PG size and the scrub
time, is that right? With >150 and <200 PGs per OSD, I think that's a good number.

My cluster does 2-3 GB/s of I/O in real time over 24h, with around 2K IOPS.
Is increasing the PG count a good choice?

On Mon, Oct 7, 2024 at 15:49 Eugen Block  wrote:


So your PGs for pool 52 have a size of around 320 GB, that is quite a
lot and not surprising that deep-scrubs take a long time. At the same
time, your PGs per OSD are already > 150. We had a similar situation
on a customer cluster this year as well, also with 12 TB drives. We
decided to increase the pg_num anyway to reduce the pg sizes. They
currently have around 380 PGs per large OSD (they have lots of smaller
OSDs as well) which still works fine. But they're using it as an
archive, so the IO is not very high. If you would decide to split PGs,
keep in mind to increase mon_max_pg_per_osd and
osd_max_pg_per_osd_hard_ratio as well. I can't explicitly recommend to
double your PGs per OSD as I'm not familiar with your cluster, the
load etc. It's just something to think about.
Doubling the PG count would reduce the PG size to around 160 GB, which
is still a lot, but I probably wouldn't go further than that.
The OSD utilization is only around 40%, in this case a cluster with
more (smaller) OSDs would probably have made more sense.

Zitat von Phong Tran Thanh :

> Hi Eugen
>
> Can you see and give me some advice, number of PG and PG size..
>
> ID   CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL    %USE   VAR   PGS  STATUS
>   2    hdd  10.98349   1.0      11 TiB  4.5 TiB  4.2 TiB  5.0 MiB  12 GiB  6.5 TiB  40.77  1.00  182  up
>  17    hdd  10.98349   1.0      11 TiB  4.8 TiB  4.6 TiB   22 MiB  14 GiB  6.1 TiB  44.16  1.08  196  up
>  32    hdd  10.98349   1.0      11 TiB  4.3 TiB  4.0 TiB   37 MiB  12 GiB  6.7 TiB  38.80  0.95  173  up
>  47    hdd  10.98349   1.0      11 TiB  4.2 TiB  3.9 TiB  655 KiB  11 GiB  6.8 TiB  38.16  0.93  184  up
>  60    hdd  10.98349   1.0      11 TiB  4.3 TiB  4.0 TiB   19 MiB  12 GiB  6.6 TiB  39.47  0.96  176  up
>  74    hdd  10.98349   1.0      11 TiB  4.2 TiB  3.9 TiB   28 MiB  12 GiB  6.8 TiB  38.10  0.93  187  up
>  83    hdd  10.98349   1.0      11 TiB  4.8 TiB  4.5 TiB  1.9 MiB  14 GiB  6.2 TiB  43.47  1.06  180  up
>  96    hdd  10.98349   1.0      11 TiB  4.3 TiB  4.0 TiB   38 MiB  12 GiB  6.7 TiB  38.80  0.95  181  up
> 110    hdd  10.98349   1.0      11 TiB  4.5 TiB  4.2 TiB  4.3 MiB  13 GiB  6.5 TiB  40.79  1.00  174  up
> 123    hdd  10.98349   1.0      11 TiB  4.2 TiB  3.9 TiB  1.9 MiB  13 GiB  6.8 TiB  38.11  0.93  173  up
> 136    hdd  10.98349   1.0      11 TiB  4.3 TiB  4.0 TiB   43 MiB  12 GiB  6.6 TiB  39.46  0.96  179  up
> .
>
> PG  OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES OMAP_BYTES*
>  OMAP_KEYS*  LOG   LOG_DUPS  STATESINCE
> 52.0  80121 0  00  3231052480
> 0  1747  3000 active+clean94m
> 52.1  79751 0  00  3217115567660
> 0  1727  3000 active+clean21h
> 52.2  80243 0  00  3237118786260
> 0  1618  3000 active+clean30h
> 52.3  79892 0  00  3221660100200
> 0  1627  3000 active+clean 9h
> 52.4  80267 0  00  3237082194860
> 0  1658  3000 active+clean 5h
> 52.5  79996 0  00  3223315044540
> 0  1722  3000 active+clean18h
> 52.6  80190 0  00  3234603944020
> 0  1759  3000 active+clean15h
> 52.7  79998 0  00  3227691435460
> 0  1720  3000 active+clean26h
> 52.8  80292 0  00  3239321731520
> 0  1691  3000 active+clean21h
> 52.9  79808 0  00  3219107427020
> 0  1675  3000 active+clean 7h
> 52.a

[ceph-users] Re: About scrub and deep-scrub

2024-10-07 Thread Eugen Block
So your PGs for pool 52 have a size of around 320 GB, which is quite a  
lot, so it's not surprising that deep-scrubs take a long time. At the same  
time, your PG count per OSD is already > 150. We had a similar situation  
on a customer cluster this year as well, also with 12 TB drives. We  
decided to increase the pg_num anyway to reduce the PG sizes. They  
currently have around 380 PGs per large OSD (they have lots of smaller  
OSDs as well), which still works fine. But they're using it as an  
archive, so the IO is not very high. If you decide to split PGs,  
keep in mind to increase mon_max_pg_per_osd and  
osd_max_pg_per_osd_hard_ratio as well. I can't explicitly recommend  
doubling your PGs per OSD as I'm not familiar with your cluster, the  
load etc.; it's just something to think about.
Doubling the PG count would reduce the PG size to around 160 GB, which  
is still a lot, but I probably wouldn't go further than that.
The OSD utilization is only around 40%; in this case a cluster with  
more (smaller) OSDs would probably have made more sense.
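
If you go down that route, the sequence would look something like this  
(the numbers are only examples and the pool name is a placeholder):

# raise the limits first so the mons/OSDs don't block the additional PGs
ceph config set global mon_max_pg_per_osd 500
ceph config set osd osd_max_pg_per_osd_hard_ratio 5
# then increase pg_num on the pool; since Nautilus pgp_num follows automatically
ceph osd pool set <pool-name> pg_num 4096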


Zitat von Phong Tran Thanh :


Hi Eugen

Can you see and give me some advice, number of PG and PG size..

ID   CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL    %USE   VAR   PGS  STATUS
  2    hdd  10.98349   1.0      11 TiB  4.5 TiB  4.2 TiB  5.0 MiB  12 GiB  6.5 TiB  40.77  1.00  182  up
 17    hdd  10.98349   1.0      11 TiB  4.8 TiB  4.6 TiB   22 MiB  14 GiB  6.1 TiB  44.16  1.08  196  up
 32    hdd  10.98349   1.0      11 TiB  4.3 TiB  4.0 TiB   37 MiB  12 GiB  6.7 TiB  38.80  0.95  173  up
 47    hdd  10.98349   1.0      11 TiB  4.2 TiB  3.9 TiB  655 KiB  11 GiB  6.8 TiB  38.16  0.93  184  up
 60    hdd  10.98349   1.0      11 TiB  4.3 TiB  4.0 TiB   19 MiB  12 GiB  6.6 TiB  39.47  0.96  176  up
 74    hdd  10.98349   1.0      11 TiB  4.2 TiB  3.9 TiB   28 MiB  12 GiB  6.8 TiB  38.10  0.93  187  up
 83    hdd  10.98349   1.0      11 TiB  4.8 TiB  4.5 TiB  1.9 MiB  14 GiB  6.2 TiB  43.47  1.06  180  up
 96    hdd  10.98349   1.0      11 TiB  4.3 TiB  4.0 TiB   38 MiB  12 GiB  6.7 TiB  38.80  0.95  181  up
110    hdd  10.98349   1.0      11 TiB  4.5 TiB  4.2 TiB  4.3 MiB  13 GiB  6.5 TiB  40.79  1.00  174  up
123    hdd  10.98349   1.0      11 TiB  4.2 TiB  3.9 TiB  1.9 MiB  13 GiB  6.8 TiB  38.11  0.93  173  up
136    hdd  10.98349   1.0      11 TiB  4.3 TiB  4.0 TiB   43 MiB  12 GiB  6.6 TiB  39.46  0.96  179  up
.

PG  OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES OMAP_BYTES*
 OMAP_KEYS*  LOG   LOG_DUPS  STATESINCE
52.0  80121 0  00  3231052480
0  1747  3000 active+clean94m
52.1  79751 0  00  3217115567660
0  1727  3000 active+clean21h
52.2  80243 0  00  3237118786260
0  1618  3000 active+clean30h
52.3  79892 0  00  3221660100200
0  1627  3000 active+clean 9h
52.4  80267 0  00  3237082194860
0  1658  3000 active+clean 5h
52.5  79996 0  00  3223315044540
0  1722  3000 active+clean18h
52.6  80190 0  00  3234603944020
0  1759  3000 active+clean15h
52.7  79998 0  00  3227691435460
0  1720  3000 active+clean26h
52.8  80292 0  00  3239321731520
0  1691  3000 active+clean21h
52.9  79808 0  00  3219107427020
0  1675  3000 active+clean 7h
52.a  79751 0  00  3215780613340
0  1822  3000 active+clean26h
52.b  80287 0  00  3239051646420
0  1793  3000 active+clean 6h

Thanks Eugen

On Mon, Oct 7, 2024 at 14:45 Eugen Block  wrote:


Hi,

disabling scrubbing in general is bad idea, because you won't notice
any data corruption except when it might be too late.
But you can fine tune scrubbing, for example increase the interval to
allow fewer scrubs to finish in a longer interval. Or if the client
load is mainly during business hours, adjust osd_scrub_begin_hour and
osd_scrub_end_hour to your needs.
And it also depends on the size of your PGs. The larger the PGs are,
the longer a deep-scrub would take. So splitting PGs can have a quite
positive effect in general. Inspect 'ceph osd df' output as well as
'ceph pg ls' (

[ceph-users] Re: About scrub and deep-scrub

2024-10-07 Thread Eugen Block

Hi,

disabling scrubbing in general is a bad idea, because you won't notice  
any data corruption until it might be too late.
But you can fine-tune scrubbing, for example by increasing the  
intervals so that scrubs are spread over a longer period. Or, if the client  
load is mainly during business hours, adjust osd_scrub_begin_hour and  
osd_scrub_end_hour to your needs.
It also depends on the size of your PGs: the larger the PGs are,  
the longer a deep-scrub takes, so splitting PGs can have quite a  
positive effect in general. Inspect the 'ceph osd df' output as well as  
'ceph pg ls' (BYTES column); you can also share them here if you need  
assistance interpreting those values.
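
For example, something along these lines restricts (deep-)scrubs to the  
night hours and stretches the deep-scrub interval to two weeks (the  
values are only examples):

ceph config set osd osd_scrub_begin_hour 19
ceph config set osd osd_scrub_end_hour 6
ceph config set osd osd_deep_scrub_interval 1209600
ceph config set osd osd_scrub_sleep 0.1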


Regards,
Eugen

Zitat von Phong Tran Thanh :


Hi ceph users!

What about disabling scrub and deep-scrub? I want to disable them because of
their impact on my cluster's I/O.
If I disable scrubbing, how will it affect my cluster?
With scrubbing enabled, it does not complete and takes a long time.


Thank
Skype: tranphong079
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about speeding hdd based cluster

2024-10-07 Thread Eugen Block

Hi, response inline.


Zitat von "Kyriazis, George" :


Thank you all.

The cluster is used mostly for backup of large files currently, but  
we are hoping to use it for home directories (compiles, etc.) soon.   
Most usage would be for large files, though.


What I've observed with its current usage is that ceph rebalances,  
and proxmox-initiated VM backups bring the storage to its knees.


Would a safe approach be to move the metadata pool to ssd first, see  
how it goes (since it would be cheaper), and then add DB/WAL disks?


Moving the metadata to SSDs first is absolutely reasonable and  
relatively cheap since it usually doesn't contain huge amounts of data.
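
Roughly, that means creating a replicated CRUSH rule pinned to the ssd  
device class and switching the metadata pool over to it; the data then  
moves on its own (rule and pool names below are just examples):

ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd pool set cephfs_metadata crush_rule replicated-ssd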


How would ceph behave if we are adding DB/WAL disks "slowly" (ie one  
node at a time)?  We have about 100 OSDs (mix hdd/ssd) spread across  
about 25 hosts.  Hosts are server-grade with plenty of memory and  
processing power.


The answer is, as always, "it depends". If you rebuild the OSDs entirely  
(host-wise) instead of migrating the DB off to SSDs, you might  
encounter slow requests, as you already noticed yourself. But the whole  
process would be faster than migrating each DB individually.
If you take the migration approach, it would be less invasive; each  
OSD would just have to catch up after its restart, reducing the load  
drastically compared to a rebuild. But then again, it would take way  
more time to complete. How large are the OSDs and how much are they  
utilized? Do you have any history of how long a host rebuild  
usually takes?



Thank you!

George



-Original Message-
From: Eugen Block 
Sent: Wednesday, October 2, 2024 2:18 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Question about speeding hdd based cluster

Hi George,

the docs [0] strongly recommend to have dedicated SSD or NVMe OSDs for
the metadata pool. You'll also benefit from dedicated DB/WAL devices.
But as Joachim already stated, it depends on a couple of factors like the
number of clients, the load they produce, file sizes etc. There's  
no easy answer.


Regards,
Eugen

[0] https://docs.ceph.com/en/latest/cephfs/createfs/#creating-pools

Zitat von Joachim Kraftmayer :

> Hi Kyriazis,
>
> depends on the workload.
> I would recommend to add  ssd/nvme DB/WAL to each osd.
>
>
>
> Joachim Kraftmayer
>
> www.clyso.com
>
> Hohenzollernstr. 27, 80801 Munich
>
> Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306
>
> Kyriazis, George  schrieb am Mi., 2. Okt.
> 2024,
> 07:37:
>
>> Hello ceph-users,
>>
>> I’ve been wondering…. I have a proxmox hdd-based cephfs pool with no
>> DB/WAL drives.  I also have ssd drives in this setup used for  
other pools.

>>
>> What would increase the speed of the hdd-based cephfs more, and in
>> what usage scenarios:
>>
>> 1. Adding ssd/nvme DB/WAL drives for each node 2. Moving the metadata
>> pool for my cephfs to ssd 3. Increasing the performance of the
>> network.  I currently have 10gbe links.
>>
>> It doesn’t look like the network is currently saturated, so I’m
>> thinking
>> (3) is not a solution.  However, if I choose any of the other
>> options, would I need to also upgrade the network so that the network
>> does not become a bottleneck?
>>
>> Thank you!
>>
>> George
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
>> email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send  
an email to

ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 9 out of 11 missing shards of shadow object in ERC 8:3 pool.

2024-10-05 Thread Eugen Block

This reminds me of this tracker:

https://tracker.ceph.com/issues/50351

IIRC, the information could actually be lost on the OSDs. I’m  
surprised that the number of missing shards is that high, though. If  
you have the objects mirrored, maybe importing them with  
objectstore-tool could be a way forward. But I’m really not sure if  
that would be the right approach.
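
Just to illustrate what I mean by tracing it with the objectstore tool  
(the path and IDs below are taken from your output, but treat this as a  
sketch and only run it against stopped OSDs): you could check whether the  
shards really are gone on the listed OSDs, e.g. for shard 1 on osd.67:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-67 \
    --pgid 11.3ffs1 --op list | grep 3XHvgPjrJa3erG4rPlW3brboBWagE95

There are also '--op export' / '--op import' operations, but whether  
importing anything from the mirrored cluster is feasible here is exactly  
what I'm not sure about.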


Zitat von Robert Kihlberg :


After an upgrade from Nautilus to Pacific the scrub has found an
inconsistent
object and reports that 9 out of 11 shards are missing. (However, we're not
sure this has to do with the upgrade).

We have been able to trace it to a S3 bucket, but not to a specific S3
object.

# radosgw-admin object stat --bucket=$BUCKET --object=$OBJECT
ERROR: failed to stat object, returned error: (2) No such file or directory

By design, we have a complete mirror of the bucket in another Ceph cluster
and the amount of objects in the buckets match between the clusters. We are
therefore somewhat confident that we are not missing any objects.

Could this be a failed garbage collection where perhaps the primary OSD
failed during gc?

The garbage collector does not show anything that seems relevant though...
radosgw-admin gc list --include-all | grep
"eaa6801e-3967-4541-9b8ca98aa5c2.791015596"

Any suggestions on how we can trace and/or fix this inconsistent object?

# rados list-inconsistent-obj 11.3ff | jq
{
  "epoch": 177981,
  "inconsistents": [
{
  "object": {
"name":
"eaa6801e-3967-4541-9b8ca98aa5c2.791015596.129__shadow_.3XHvgPjrJa3erG4rPlW3brboBWagE95_5",
"nspace": "",
"locator": "",
"snap": "head",
"version": 109853
  },
  "errors": [],
  "union_shard_errors": [
"missing"
  ],
  "selected_object_info": {
"oid": {
  "oid":
"eaa6801e-3967-4541-9b8ca98aa5c2.791015596.129__shadow_.3XHvgPjrJa3erG4rPlW3brboBWagE95_5",
  "key": "",
  "snapid": -2,
  "hash": 4294967295,
  "max": 0,
  "pool": 11,
  "namespace": ""
},
"version": "17636'109853",
"prior_version": "0'0",
"last_reqid": "client.791015590.0:449317175",
"user_version": 109853,
"size": 8388608,
"mtime": "2022-01-24T03:33:42.457722+",
"local_mtime": "2022-01-24T03:33:42.471042+",
"lost": 0,
"flags": [
  "dirty",
  "data_digest"
],
"truncate_seq": 0,
"truncate_size": 0,
"data_digest": "0xe588978d",
"omap_digest": "0x",
"expected_object_size": 0,
"expected_write_size": 0,
"alloc_hint_flags": 0,
"manifest": {
  "type": 0
},
"watchers": {}
  },
  "shards": [
{
  "osd": 14,
  "primary": true,
  "shard": 0,
  "errors": [],
  "size": 1048576
},
{
  "osd": 67,
  "primary": false,
  "shard": 1,
  "errors": [
"missing"
  ]
},
{
  "osd": 77,
  "primary": false,
  "shard": 4,
  "errors": [],
  "size": 1048576
},
{
  "osd": 225,
  "primary": false,
  "shard": 9,
  "errors": [
"missing"
  ]
},
{
  "osd": 253,
  "primary": false,
  "shard": 8,
  "errors": [
"missing"
  ]
},
{
  "osd": 327,
  "primary": false,
  "shard": 6,
  "errors": [
"missing"
  ]
},
{
  "osd": 568,
  "primary": false,
  "shard": 2,
  "errors": [
"missing"
  ]
},
{
  "osd": 610,
  "primary": false,
  "shard": 7,
  "errors": [
"missing"
  ]
},
{
  "osd": 700,
  "primary": false,
  "shard": 3,
  "errors": [
"missing"
  ]
},
{
  "osd": 736,
  "primary": false,
  "shard": 10,
  "errors": [
"missing"
  ]
},
{
  "osd": 764,
  "primary": false,
  "shard": 5,
  "errors": [
"missing"
  ]
}
  ]
}
  ]
}
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm crush_device_class not applied

2024-10-04 Thread Eugen Block
Oh that’s good news! I haven’t come that far yet, given we had a  
national holiday yesterday and I had a day off today. :-)

I’m glad it’s gonna be fixed in Quincy as well.

Thanks!

Zitat von Frédéric Nass :


Hey Eugen,

Check this one here: https://github.com/ceph/ceph/pull/55534

It's fixed in 18.2.4 and should be in upcoming 17.2.8.

Cheers,
Frédéric.


De : Eugen Block 
Envoyé : jeudi 3 octobre 2024 23:21
À : ceph-users@ceph.io
Objet : [ceph-users] Re: cephadm crush_device_class not applied

I think this PR [1] is responsible. And here are the three supported 
classes [2]:

class to_ceph_volume(object):

    _supported_device_classes = [
        "hdd", "ssd", "nvme"
    ]

Why this limitation?

[1] https://github.com/ceph/ceph/pull/49555
[2] 
https://github.com/ceph/ceph/blob/v18.2.2/src/python-common/ceph/deployment/translate.py#L14

Zitat von Eugen Block :


It works as expected in Pacific 16.2.15 (at least how I expect it to 
work). I applied the same spec file and now have my custom device 
classes (the test class was the result of a manual daemon add 
command):

soc9-ceph:~ # ceph osd tree
ID  CLASS   WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1  0.05878  root default
-3  0.05878  host soc9-ceph
  1  hdd-ec  0.00980  osd.1   up   1.0  1.0
  2  hdd-ec  0.00980  osd.2   up   1.0  1.0
  3  hdd-ec  0.00980  osd.3   up   1.0  1.0
  4  hdd-ec  0.00980  osd.4   up   1.0  1.0
  5  hdd-ec  0.00980  osd.5   up   1.0  1.0
  0    test  0.00980  osd.0   up   1.0  1.0

So apparently, there was a change since Quincy. For me it's a 
regression, or is this even a bug? I'd appreciate any comments.

Zitat von Eugen Block :


Apparently, I can only use "well known" device classes in the 
specs, like nvme, ssd or hdd. Every other string (even without 
hyphens etc.) doesn't work.

Zitat von Eugen Block :


Reading the docs again, I noticed that apparently the keyword 
"paths" is required to use with crush_device_class (why?), but 
that doesn't work either. I tried it by specifying the class both 
globally in the spec file as well as per device, still no change, 
the OSDs come up as "hdd".

Zitat von Eugen Block :


Hi,

I'm struggling to create OSDs with a dedicated 
crush_device_class. It worked sometimes when creating a new osd 
via command line (ceph orch daemon add osd 
host:data_devices=/dev/vdg,crush_device_class=test-hdd), but most 
of the time it doesn't work. I tried it with a spec file as well, 
it seems to be correctly parsed and everything, but the new OSDs 
are created with hdd class, not hdd-ec. I have this spec:

cat osd-class.yaml
service_type: osd
service_id: hdd-ec
service_name: hdd-ec
crush_device_class: hdd-ec
placement:
  label: osd
spec:
  data_devices:
    rotational: 1
    size: 10G
  objectstore: bluestore

I see that cephadm has stored it correctly:

ceph config-key get mgr/cephadm/spec.osd.hdd-ec
{"created": "2024-10-03T08:35:41.364216Z", "needs_configuration": 
true, "spec": {"placement": {"label": "osd"}, "service_id": 
"hdd-ec", "service_name": "osd.hdd-ec", "service_type": "osd", 
"spec": {"crush_device_class": "hdd-ec", "data_devices": 
{"rotational": 1, "size": "10G"}, "filter_logic": "AND", 
"objectstore": "bluestore"}}}

And it has the OSDSPEC_AFFINITY set:

cephadm ['--env', 'CEPH_VOLUME_OSDSPEC_AFFINITY=hdd-ec', 
'--image', 
'registry.domain/ceph@sha256:ca901f9ff84d77f8734afad20556775f0ebaea6c62af8cca733161f5338d3f6c', '--timeout', '895', 'ceph-volume', '--fsid', '7d60533e-7e9e-11ef-b140-fa163e2ad8c5', '--config-json', '-', '--', 'lvm', 'batch', '--no-auto', '/dev/vdb', '/dev/vdc', '/dev/vdd', '/dev/vdf', '/dev/vdg', '/dev/vdh',  
'--yes', 

'--no-systemd']

But the OSDs still are created with hdd device class:

ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 0.05878  root default
-3 0.05878  host soc9-ceph
0    hdd  0.00980  osd.0   up   1.0  1.0
1    hdd  0.00980  osd.1   up   1.0  1.0
2    hdd  0.00980  osd.2   up   1.0  1.0
3    hdd  0.00980  osd.3   up   1.0  1.0
4    hdd  0.00980  osd.4   up   1.0  1.0
5    hdd  0.00980  osd.5   up 

[ceph-users] Re: cephadm crush_device_class not applied

2024-10-03 Thread Eugen Block
I think this PR [1] is responsible. And here are the three supported  
classes [2]:


class to_ceph_volume(object):

    _supported_device_classes = [
        "hdd", "ssd", "nvme"
    ]

Why this limitation?

[1] https://github.com/ceph/ceph/pull/49555
[2]  
https://github.com/ceph/ceph/blob/v18.2.2/src/python-common/ceph/deployment/translate.py#L14


Zitat von Eugen Block :

It works as expected in Pacific 16.2.15 (at least how I expect it to  
work). I applied the same spec file and now have my custom device  
classes (the test class was the result of a manual daemon add  
command):


soc9-ceph:~ # ceph osd tree
ID  CLASS   WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1  0.05878  root default
-3  0.05878  host soc9-ceph
 1  hdd-ec  0.00980  osd.1   up   1.0  1.0
 2  hdd-ec  0.00980  osd.2   up   1.0  1.0
 3  hdd-ec  0.00980  osd.3   up   1.0  1.0
 4  hdd-ec  0.00980  osd.4   up   1.0  1.0
 5  hdd-ec  0.00980  osd.5   up   1.0  1.0
 0test  0.00980  osd.0   up   1.0  1.0

So apparently, there was a change since Quincy. For me it's a  
regression, or is this even a bug? I'd appreciate any comments.


Zitat von Eugen Block :

Apparently, I can only use "well known" device classes in the  
specs, like nvme, ssd or hdd. Every other string (even without  
hyphens etc.) doesn't work.


Zitat von Eugen Block :

Reading the docs again, I noticed that apparently the keyword  
"paths" is required to use with crush_device_class (why?), but  
that doesn't work either. I tried it by specifying the class both  
globally in the spec file as well as per device, still no change,  
the OSDs come up as "hdd".


Zitat von Eugen Block :


Hi,

I'm struggling to create OSDs with a dedicated  
crush_device_class. It worked sometimes when creating a new osd  
via command line (ceph orch daemon add osd  
host:data_devices=/dev/vdg,crush_device_class=test-hdd), but most  
of the time it doesn't work. I tried it with a spec file as well,  
it seems to be correctly parsed and everything, but the new OSDs  
are created with hdd class, not hdd-ec. I have this spec:


cat osd-class.yaml
service_type: osd
service_id: hdd-ec
service_name: hdd-ec
crush_device_class: hdd-ec
placement:
  label: osd
spec:
  data_devices:
    rotational: 1
    size: 10G
  objectstore: bluestore

I see that cephadm has stored it correctly:

ceph config-key get mgr/cephadm/spec.osd.hdd-ec
{"created": "2024-10-03T08:35:41.364216Z", "needs_configuration":  
true, "spec": {"placement": {"label": "osd"}, "service_id":  
"hdd-ec", "service_name": "osd.hdd-ec", "service_type": "osd",  
"spec": {"crush_device_class": "hdd-ec", "data_devices":  
{"rotational": 1, "size": "10G"}, "filter_logic": "AND",  
"objectstore": "bluestore"}}}


And it has the OSDSPEC_AFFINITY set:

cephadm ['--env', 'CEPH_VOLUME_OSDSPEC_AFFINITY=hdd-ec',  
'--image',  
'registry.domain/ceph@sha256:ca901f9ff84d77f8734afad20556775f0ebaea6c62af8cca733161f5338d3f6c', '--timeout', '895', 'ceph-volume', '--fsid', '7d60533e-7e9e-11ef-b140-fa163e2ad8c5', '--config-json', '-', '--', 'lvm', 'batch', '--no-auto', '/dev/vdb', '/dev/vdc', '/dev/vdd', '/dev/vdf', '/dev/vdg', '/dev/vdh', '--yes',  
'--no-systemd']


But the OSDs still are created with hdd device class:

ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 0.05878  root default
-3 0.05878  host soc9-ceph
0hdd  0.00980  osd.0   up   1.0  1.0
1hdd  0.00980  osd.1   up   1.0  1.0
2hdd  0.00980  osd.2   up   1.0  1.0
3hdd  0.00980  osd.3   up   1.0  1.0
4hdd  0.00980  osd.4   up   1.0  1.0
5hdd  0.00980  osd.5   up   1.0  1.0

I have tried it with two different indentations:

spec:
crush_device_class: hdd-ec

and as seen above:

crush_device_class: hdd-ec
placement:
label: osd
spec:

According to the docs [0], it's not supposed to be indented, so  
my current spec seems valid. But I see in the mgr log with  
debug_mgr 10 that apparently it is parsed with indentation:


2024-10-03T09:59:23.029+ 7efef1cc6700  0 [cephadm DEBUG  
cephadm.services.osd] Translating DriveGroup  

service_id: hdd-ec
service_name: osd.hd

[ceph-users] Re: cephadm crush_device_class not applied

2024-10-03 Thread Eugen Block
It works as expected in Pacific 16.2.15 (at least how I expect it to  
work). I applied the same spec file and now have my custom device  
classes (the test class was the result of a manual daemon add command):


soc9-ceph:~ # ceph osd tree
ID  CLASS   WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1  0.05878  root default
-3  0.05878  host soc9-ceph
 1  hdd-ec  0.00980  osd.1   up   1.0  1.0
 2  hdd-ec  0.00980  osd.2   up   1.0  1.0
 3  hdd-ec  0.00980  osd.3   up   1.0  1.0
 4  hdd-ec  0.00980  osd.4   up   1.0  1.0
 5  hdd-ec  0.00980  osd.5   up   1.0  1.0
 0test  0.00980  osd.0   up   1.0  1.0

So apparently, there was a change since Quincy. For me it's a  
regression, or is this even a bug? I'd appreciate any comments.


Zitat von Eugen Block :

Apparently, I can only use "well known" device classes in the specs,  
like nvme, ssd or hdd. Every other string (even without hyphens  
etc.) doesn't work.


Zitat von Eugen Block :

Reading the docs again, I noticed that apparently the keyword  
"paths" is required to use with crush_device_class (why?), but that  
doesn't work either. I tried it by specifying the class both  
globally in the spec file as well as per device, still no change,  
the OSDs come up as "hdd".


Zitat von Eugen Block :


Hi,

I'm struggling to create OSDs with a dedicated crush_device_class.  
It worked sometimes when creating a new osd via command line (ceph  
orch daemon add osd  
host:data_devices=/dev/vdg,crush_device_class=test-hdd), but most  
of the time it doesn't work. I tried it with a spec file as well,  
it seems to be correctly parsed and everything, but the new OSDs  
are created with hdd class, not hdd-ec. I have this spec:


cat osd-class.yaml
service_type: osd
service_id: hdd-ec
service_name: hdd-ec
crush_device_class: hdd-ec
placement:
  label: osd
spec:
  data_devices:
    rotational: 1
    size: 10G
  objectstore: bluestore

I see that cephadm has stored it correctly:

ceph config-key get mgr/cephadm/spec.osd.hdd-ec
{"created": "2024-10-03T08:35:41.364216Z", "needs_configuration":  
true, "spec": {"placement": {"label": "osd"}, "service_id":  
"hdd-ec", "service_name": "osd.hdd-ec", "service_type": "osd",  
"spec": {"crush_device_class": "hdd-ec", "data_devices":  
{"rotational": 1, "size": "10G"}, "filter_logic": "AND",  
"objectstore": "bluestore"}}}


And it has the OSDSPEC_AFFINITY set:

cephadm ['--env', 'CEPH_VOLUME_OSDSPEC_AFFINITY=hdd-ec',  
'--image',  
'registry.domain/ceph@sha256:ca901f9ff84d77f8734afad20556775f0ebaea6c62af8cca733161f5338d3f6c', '--timeout', '895', 'ceph-volume', '--fsid', '7d60533e-7e9e-11ef-b140-fa163e2ad8c5', '--config-json', '-', '--', 'lvm', 'batch', '--no-auto', '/dev/vdb', '/dev/vdc', '/dev/vdd', '/dev/vdf', '/dev/vdg', '/dev/vdh', '--yes',  
'--no-systemd']


But the OSDs still are created with hdd device class:

ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 0.05878  root default
-3 0.05878  host soc9-ceph
0hdd  0.00980  osd.0   up   1.0  1.0
1hdd  0.00980  osd.1   up   1.0  1.0
2hdd  0.00980  osd.2   up   1.0  1.0
3hdd  0.00980  osd.3   up   1.0  1.0
4hdd  0.00980  osd.4   up   1.0  1.0
5hdd  0.00980  osd.5   up   1.0  1.0

I have tried it with two different indentations:

spec:
crush_device_class: hdd-ec

and as seen above:

crush_device_class: hdd-ec
placement:
label: osd
spec:

According to the docs [0], it's not supposed to be indented, so my  
current spec seems valid. But I see in the mgr log with debug_mgr  
10 that apparently it is parsed with indentation:


2024-10-03T09:59:23.029+ 7efef1cc6700  0 [cephadm DEBUG  
cephadm.services.osd] Translating DriveGroup  

service_id: hdd-ec
service_name: osd.hdd-ec
placement:
label: osd
spec:
crush_device_class: hdd-ec
data_devices:
  rotational: 1
  size: 10G
filter_logic: AND
objectstore: bluestore
'''))> to ceph-volume command

Now I'm wondering how it's actually supposed to work. Yesterday we  
saw the same behaviour on a customer cluster as well with Quincy  
17.2.7. This is Reef 18.2.2.


Trying to create it manually also doesn&

[ceph-users] Re: cephadm crush_device_class not applied

2024-10-03 Thread Eugen Block
Apparently, I can only use "well known" device classes in the specs,  
like nvme, ssd or hdd. Every other string (even without hyphens etc.)  
doesn't work.
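
One way around it for now (just a sketch, not a real fix): let cephadm  
deploy the OSDs with whatever class it picks and then re-set the class  
manually afterwards, e.g.:

ceph osd crush rm-device-class osd.3
ceph osd crush set-device-class hdd-ec osd.3

That obviously defeats the purpose of putting it into the spec, but at  
least the custom class ends up in the CRUSH map.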


Zitat von Eugen Block :

Reading the docs again, I noticed that apparently the keyword  
"paths" is required to use with crush_device_class (why?), but that  
doesn't work either. I tried it by specifying the class both  
globally in the spec file as well as per device, still no change,  
the OSDs come up as "hdd".


Zitat von Eugen Block :


Hi,

I'm struggling to create OSDs with a dedicated crush_device_class.  
It worked sometimes when creating a new osd via command line (ceph  
orch daemon add osd  
host:data_devices=/dev/vdg,crush_device_class=test-hdd), but most  
of the time it doesn't work. I tried it with a spec file as well,  
it seems to be correctly parsed and everything, but the new OSDs  
are created with hdd class, not hdd-ec. I have this spec:


cat osd-class.yaml
service_type: osd
service_id: hdd-ec
service_name: hdd-ec
crush_device_class: hdd-ec
placement:
 label: osd
spec:
 data_devices:
   rotational: 1
   size: 10G
 objectstore: bluestore

I see that cephadm has stored it correctly:

ceph config-key get mgr/cephadm/spec.osd.hdd-ec
{"created": "2024-10-03T08:35:41.364216Z", "needs_configuration":  
true, "spec": {"placement": {"label": "osd"}, "service_id":  
"hdd-ec", "service_name": "osd.hdd-ec", "service_type": "osd",  
"spec": {"crush_device_class": "hdd-ec", "data_devices":  
{"rotational": 1, "size": "10G"}, "filter_logic": "AND",  
"objectstore": "bluestore"}}}


And it has the OSDSPEC_AFFINITY set:

cephadm ['--env', 'CEPH_VOLUME_OSDSPEC_AFFINITY=hdd-ec', '--image',  
'registry.domain/ceph@sha256:ca901f9ff84d77f8734afad20556775f0ebaea6c62af8cca733161f5338d3f6c', '--timeout', '895', 'ceph-volume', '--fsid', '7d60533e-7e9e-11ef-b140-fa163e2ad8c5', '--config-json', '-', '--', 'lvm', 'batch', '--no-auto', '/dev/vdb', '/dev/vdc', '/dev/vdd', '/dev/vdf', '/dev/vdg', '/dev/vdh', '--yes',  
'--no-systemd']


But the OSDs still are created with hdd device class:

ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 0.05878  root default
-3 0.05878  host soc9-ceph
0hdd  0.00980  osd.0   up   1.0  1.0
1hdd  0.00980  osd.1   up   1.0  1.0
2hdd  0.00980  osd.2   up   1.0  1.0
3hdd  0.00980  osd.3   up   1.0  1.0
4hdd  0.00980  osd.4   up   1.0  1.0
5hdd  0.00980  osd.5   up   1.0  1.0

I have tried it with two different indentations:

spec:
 crush_device_class: hdd-ec

and as seen above:

crush_device_class: hdd-ec
placement:
 label: osd
spec:

According to the docs [0], it's not supposed to be indented, so my  
current spec seems valid. But I see in the mgr log with debug_mgr  
10 that apparently it is parsed with indentation:


2024-10-03T09:59:23.029+ 7efef1cc6700  0 [cephadm DEBUG  
cephadm.services.osd] Translating DriveGroup  

service_id: hdd-ec
service_name: osd.hdd-ec
placement:
 label: osd
spec:
 crush_device_class: hdd-ec
 data_devices:
   rotational: 1
   size: 10G
 filter_logic: AND
 objectstore: bluestore
'''))> to ceph-volume command

Now I'm wondering how it's actually supposed to work. Yesterday we  
saw the same behaviour on a customer cluster as well with Quincy  
17.2.7. This is Reef 18.2.2.


Trying to create it manually also doesn't work as expected:

soc9-ceph:~ # ceph orch daemon add osd  
soc9-ceph:data_devices=/dev/vdf,crush_device_class=hdd-ec

Created osd(s) 3 on host 'soc9-ceph'

soc9-ceph:~ # ceph osd tree | grep osd.3
3    hdd  0.00980  osd.3   up   1.0  1.0

This is in the mgr debug output from the manual creation:

2024-10-03T10:06:00.329+ 7efeeecc0700  0 [orchestrator DEBUG  
root] _oremote orchestrator ->  
cephadm.create_osds(*(DriveGroupSpec.from_json(yaml.safe_load('''service_type:  
osd

service_name: osd
placement:
 host_pattern: soc9-ceph
spec:
 crush_device_class: hdd-ec
 data_devices:
   paths:
   - /dev/vdf
 filter_logic: AND
 objectstore: bluestore
''')),), **{})
2024-10-03T10:06:00.333+ 7efeeecc0700  0 [cephadm DEBUG  
cephadm.services.osd] Processing DriveGroup  
DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd

service_name: osd
placement:
 host_pattern: soc9-ceph
spec:
 crush_device_class: hdd-ec
 data_devices:
   paths:
   - /dev/vdf
 filter_logic: AND
 objectstore: bluestore
'''))

So parsing the manual command also results in the indented  
crush_device_class. Am I doing something wrong here?


Thanks!
Eugen

[0]  
https://docs.ceph.com/en/latest/cephadm/services/osd/#advanced-osd-service-specifications



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm crush_device_class not applied

2024-10-03 Thread Eugen Block
Reading the docs again, I noticed that apparently the keyword "paths"  
is required to use with crush_device_class (why?), but that doesn't  
work either. I tried it by specifying the class both globally in the  
spec file as well as per device, still no change, the OSDs come up as  
"hdd".


Zitat von Eugen Block :


Hi,

I'm struggling to create OSDs with a dedicated crush_device_class.  
It worked sometimes when creating a new osd via command line (ceph  
orch daemon add osd  
host:data_devices=/dev/vdg,crush_device_class=test-hdd), but most of  
the time it doesn't work. I tried it with a spec file as well, it  
seems to be correctly parsed and everything, but the new OSDs are  
created with hdd class, not hdd-ec. I have this spec:


cat osd-class.yaml
service_type: osd
service_id: hdd-ec
service_name: hdd-ec
crush_device_class: hdd-ec
placement:
  label: osd
spec:
  data_devices:
rotational: 1
size: 10G
  objectstore: bluestore

I see that cephadm has stored it correctly:

ceph config-key get mgr/cephadm/spec.osd.hdd-ec
{"created": "2024-10-03T08:35:41.364216Z", "needs_configuration":  
true, "spec": {"placement": {"label": "osd"}, "service_id":  
"hdd-ec", "service_name": "osd.hdd-ec", "service_type": "osd",  
"spec": {"crush_device_class": "hdd-ec", "data_devices":  
{"rotational": 1, "size": "10G"}, "filter_logic": "AND",  
"objectstore": "bluestore"}}}


And it has the OSDSPEC_AFFINITY set:

cephadm ['--env', 'CEPH_VOLUME_OSDSPEC_AFFINITY=hdd-ec', '--image',  
'registry.domain/ceph@sha256:ca901f9ff84d77f8734afad20556775f0ebaea6c62af8cca733161f5338d3f6c', '--timeout', '895', 'ceph-volume', '--fsid', '7d60533e-7e9e-11ef-b140-fa163e2ad8c5', '--config-json', '-', '--', 'lvm', 'batch', '--no-auto', '/dev/vdb', '/dev/vdc', '/dev/vdd', '/dev/vdf', '/dev/vdg', '/dev/vdh', '--yes',  
'--no-systemd']


But the OSDs still are created with hdd device class:

ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 0.05878  root default
-3 0.05878  host soc9-ceph
 0    hdd  0.00980  osd.0   up   1.0  1.0
 1    hdd  0.00980  osd.1   up   1.0  1.0
 2    hdd  0.00980  osd.2   up   1.0  1.0
 3    hdd  0.00980  osd.3   up   1.0  1.0
 4    hdd  0.00980  osd.4   up   1.0  1.0
 5    hdd  0.00980  osd.5   up   1.0  1.0

I have tried it with two different indentations:

spec:
  crush_device_class: hdd-ec

and as seen above:

crush_device_class: hdd-ec
placement:
  label: osd
spec:

According to the docs [0], it's not supposed to be indented, so my  
current spec seems valid. But I see in the mgr log with debug_mgr 10  
that apparently it is parsed with indentation:


2024-10-03T09:59:23.029+ 7efef1cc6700  0 [cephadm DEBUG  
cephadm.services.osd] Translating DriveGroup  

service_id: hdd-ec
service_name: osd.hdd-ec
placement:
  label: osd
spec:
  crush_device_class: hdd-ec
  data_devices:
rotational: 1
size: 10G
  filter_logic: AND
  objectstore: bluestore
'''))> to ceph-volume command

Now I'm wondering how it's actually supposed to work. Yesterday we  
saw the same behaviour on a customer cluster as well with Quincy  
17.2.7. This is Reef 18.2.2.


Trying to create it manually also doesn't work as expected:

soc9-ceph:~ # ceph orch daemon add osd  
soc9-ceph:data_devices=/dev/vdf,crush_device_class=hdd-ec

Created osd(s) 3 on host 'soc9-ceph'

soc9-ceph:~ # ceph osd tree | grep osd.3
 3    hdd  0.00980  osd.3   up   1.0  1.0

This is in the mgr debug output from the manual creation:

2024-10-03T10:06:00.329+ 7efeeecc0700  0 [orchestrator DEBUG  
root] _oremote orchestrator ->  
cephadm.create_osds(*(DriveGroupSpec.from_json(yaml.safe_load('''service_type:  
osd

service_name: osd
placement:
  host_pattern: soc9-ceph
spec:
  crush_device_class: hdd-ec
  data_devices:
paths:
- /dev/vdf
  filter_logic: AND
  objectstore: bluestore
''')),), **{})
2024-10-03T10:06:00.333+ 7efeeecc0700  0 [cephadm DEBUG  
cephadm.services.osd] Processing DriveGroup  
DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd

service_name: osd
placement:
  host_pattern: soc9-ceph
spec:
  crush_device_class: hdd-ec
  data_devices:
paths:
- /dev/vdf
  filter_logic: AND
  objectstore: bluestore
'''))

So parsing the manual command also results in the indented  
crush_device_class. Am I doing something wrong here?


Thanks!
Eugen

[0]  
https://docs.ceph.com/en/latest/cephadm/services/osd/#advanced-osd-service-specifications



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm crush_device_class not applied

2024-10-03 Thread Eugen Block

Hi,

I'm struggling to create OSDs with a dedicated crush_device_class. It  
worked sometimes when creating a new osd via command line (ceph orch  
daemon add osd  
host:data_devices=/dev/vdg,crush_device_class=test-hdd), but most of  
the time it doesn't work. I tried it with a spec file as well, it  
seems to be correctly parsed and everything, but the new OSDs are  
created with hdd class, not hdd-ec. I have this spec:


cat osd-class.yaml
service_type: osd
service_id: hdd-ec
service_name: hdd-ec
crush_device_class: hdd-ec
placement:
  label: osd
spec:
  data_devices:
rotational: 1
size: 10G
  objectstore: bluestore
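
For reference, such a spec is applied via the orchestrator (the file name is just the one from above); a dry run first shows what cephadm would do:

ceph orch apply -i osd-class.yaml --dry-run
ceph orch apply -i osd-class.yaml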

I see that cephadm has stored it correctly:

ceph config-key get mgr/cephadm/spec.osd.hdd-ec
{"created": "2024-10-03T08:35:41.364216Z", "needs_configuration":  
true, "spec": {"placement": {"label": "osd"}, "service_id": "hdd-ec",  
"service_name": "osd.hdd-ec", "service_type": "osd", "spec":  
{"crush_device_class": "hdd-ec", "data_devices": {"rotational": 1,  
"size": "10G"}, "filter_logic": "AND", "objectstore": "bluestore"}}}


And it has the OSDSPEC_AFFINITY set:

cephadm ['--env', 'CEPH_VOLUME_OSDSPEC_AFFINITY=hdd-ec', '--image',  
'registry.domain/ceph@sha256:ca901f9ff84d77f8734afad20556775f0ebaea6c62af8cca733161f5338d3f6c', '--timeout', '895', 'ceph-volume', '--fsid', '7d60533e-7e9e-11ef-b140-fa163e2ad8c5', '--config-json', '-', '--', 'lvm', 'batch', '--no-auto', '/dev/vdb', '/dev/vdc', '/dev/vdd', '/dev/vdf', '/dev/vdg', '/dev/vdh', '--yes',  
'--no-systemd']


But the OSDs still are created with hdd device class:

ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 0.05878  root default
-3 0.05878  host soc9-ceph
 0    hdd  0.00980  osd.0   up   1.0  1.0
 1    hdd  0.00980  osd.1   up   1.0  1.0
 2    hdd  0.00980  osd.2   up   1.0  1.0
 3    hdd  0.00980  osd.3   up   1.0  1.0
 4    hdd  0.00980  osd.4   up   1.0  1.0
 5    hdd  0.00980  osd.5   up   1.0  1.0

I have tried it with two different indentations:

spec:
  crush_device_class: hdd-ec

and as seen above:

crush_device_class: hdd-ec
placement:
  label: osd
spec:

According to the docs [0], it's not supposed to be indented, so my  
current spec seems valid. But I see in the mgr log with debug_mgr 10  
that apparently it is parsed with indentation:


2024-10-03T09:59:23.029+ 7efef1cc6700  0 [cephadm DEBUG  
cephadm.services.osd] Translating DriveGroup  

service_id: hdd-ec
service_name: osd.hdd-ec
placement:
  label: osd
spec:
  crush_device_class: hdd-ec
  data_devices:
rotational: 1
size: 10G
  filter_logic: AND
  objectstore: bluestore
'''))> to ceph-volume command

Now I'm wondering how it's actually supposed to work. Yesterday we saw  
the same behaviour on a customer cluster as well with Quincy 17.2.7.  
This is Reef 18.2.2.


Trying to create it manually also doesn't work as expected:

soc9-ceph:~ # ceph orch daemon add osd  
soc9-ceph:data_devices=/dev/vdf,crush_device_class=hdd-ec

Created osd(s) 3 on host 'soc9-ceph'

soc9-ceph:~ # ceph osd tree | grep osd.3
 3    hdd  0.00980  osd.3   up   1.0  1.0

This is in the mgr debug output from the manual creation:

2024-10-03T10:06:00.329+ 7efeeecc0700  0 [orchestrator DEBUG root]  
_oremote orchestrator ->  
cephadm.create_osds(*(DriveGroupSpec.from_json(yaml.safe_load('''service_type:  
osd

service_name: osd
placement:
  host_pattern: soc9-ceph
spec:
  crush_device_class: hdd-ec
  data_devices:
paths:
- /dev/vdf
  filter_logic: AND
  objectstore: bluestore
''')),), **{})
2024-10-03T10:06:00.333+ 7efeeecc0700  0 [cephadm DEBUG  
cephadm.services.osd] Processing DriveGroup  
DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd

service_name: osd
placement:
  host_pattern: soc9-ceph
spec:
  crush_device_class: hdd-ec
  data_devices:
paths:
- /dev/vdf
  filter_logic: AND
  objectstore: bluestore
'''))

So parsing the manual command also results in the indented  
crush_device_class. Am I doing something wrong here?


Thanks!
Eugen

[0]  
https://docs.ceph.com/en/latest/cephadm/services/osd/#advanced-osd-service-specifications

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about speeding hdd based cluster

2024-10-02 Thread Eugen Block

Hi George,

the docs [0] strongly recommend to have dedicated SSD or NVMe OSDs for  
the metadata pool. You'll also benefit from dedicated DB/WAL devices.  
But as Joachim already stated, it depends on a couple of factors like  
the number of clients, the load they produce, file sizes etc. There's  
no easy answer.
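
If you do move the metadata pool to SSDs, that is just a dedicated replicated rule on the ssd device class plus a pool update (rule and pool names below are only examples, adjust them to your setup):

ceph osd crush rule create-replicated ssd-meta default host ssd
ceph osd pool set cephfs_metadata crush_rule ssd-meta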


Regards,
Eugen

[0] https://docs.ceph.com/en/latest/cephfs/createfs/#creating-pools

Zitat von Joachim Kraftmayer :


Hi Kyriazis,

depends on the workload.
I would recommend to add  ssd/nvme DB/WAL to each osd.
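
If the OSDs are cephadm-managed, that could be expressed with a drivegroup spec roughly like this (a sketch; the device filters are only an example and need to match the actual hardware):

service_type: osd
service_id: hdd-with-db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  objectstore: bluestore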



Joachim Kraftmayer

www.clyso.com

Hohenzollernstr. 27, 80801 Munich

Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306

Kyriazis, George  wrote on Wed., 2 Oct 2024, 07:37:


Hello ceph-users,

I’ve been wondering…. I have a proxmox hdd-based cephfs pool with no
DB/WAL drives.  I also have ssd drives in this setup used for other pools.

What would increase the speed of the hdd-based cephfs more, and in what
usage scenarios:

1. Adding ssd/nvme DB/WAL drives for each node
2. Moving the metadata pool for my cephfs to ssd
3. Increasing the performance of the network.  I currently have 10gbe
links.

It doesn’t look like the network is currently saturated, so I’m thinking
(3) is not a solution.  However, if I choose any of the other options,
would I need to also upgrade the network so that the network does not
become a bottleneck?

Thank you!

George

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Dashboard: frequent queries for balancer status

2024-09-30 Thread Eugen Block

Hi,

I just noticed across different Ceph versions that when browsing the  
dashboard, the MGR is logging lots of prometheus queries for the  
balancer status:


Sep 30 11:15:55 host2 ceph-mgr[3993215]: log_channel(cluster) log  
[DBG] : pgmap v25341: 381 pgs: 381 active+clean; 3.9 GiB data, 69 GiB  
used, 311 GiB / 380 GiB avail; 954 B/s rd, 1 op/s
Sep 30 11:15:56 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch
Sep 30 11:15:56 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch
Sep 30 11:15:56 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch
Sep 30 11:15:56 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch
Sep 30 11:15:56 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch
Sep 30 11:15:56 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch
Sep 30 11:15:56 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch
Sep 30 11:15:56 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch
Sep 30 11:15:57 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch
Sep 30 11:15:59 host2 conmon[3993190]: 192.168.168.62 - -  
[30/Sep/2024:09:15:59] "GET /metrics HTTP/1.1" 200 71320 ""  
"Prometheus/2.33.4"
Sep 30 11:16:02 host2 ceph-mgr[3993215]: log_channel(audit) log [DBG]  
: from='mon.1 -' entity='mon.' cmd=[{"prefix": "balancer status",  
"format": "json"}]: dispatch



At least I assume that this is prometheus if I interpret [0]  
correctly. In this case, the balancer is even disabled.
Why is it necessary to query the balancer status that often? I  
couldn't find any useful config related to this.


This was noticed on Pacific (16.2.15) and Squid (19.1.1), both with  
Prometheus version 2.33.4. I haven't tried other Prometheus versions  
yet. I will check it with a newer Prometheus version as well.


Thanks,
Eugen

[0]  
https://github.com/ceph/ceph/blob/2d93ad5a87a720c9756ea1eac9d95e1067b80976/src/pybind/mgr/dashboard/controllers/prometheus.py#L105

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: device_health_metrics pool automatically recreated

2024-09-29 Thread Eugen Block
So I was able to reproduce it. I created an Octopus cluster, created a  
couple of OSDs, and the device_health_metrics pool was automatically  
created as expected:


2024-09-29T20:14:21.225+ 7ff3b6c8e700  0 mon.soc9-ceph@0(leader)  
e1 handle_command mon_command({"prefix": "osd pool rename", "format":  
"json", "srcpool": "device_health_metrics", "destpool": ".mgr"} v 0) v1
2024-09-29T20:14:22.233+ 7ff3b548b700  0 log_channel(audit) log  
[INF] : from='mgr.14302 192.168.124.186:0/3683659721'  
entity='mgr.soc9-ceph.vgsrao' cmd='[{"prefix": "osd pool rename",  
"format": "json", "srcpool": "device_health_metrics", "destpool":  
".mgr"}]': finished


After creating a test pool, I upgraded the cluster to Quincy. And then  
the device_health_metrics pool is recreated:


2024-09-29T20:24:30.829+ 7f9289bf4700  0 mon.soc9-ceph@0(leader)  
e1 handle_command mon_command({"prefix": "osd pool create", "format":  
"json", "pool": "device_health_metrics", "pg_num": 1, "pg_num_min": 1}  
v 0) v1


This was after the first MGR had been upgraded and failed over to the  
old one. So your assumption seems to be correct. I haven't checked  
other upgrade paths, so this probably isn't a big deal. But perhaps a  
note in the docs could mention that there might be a new pool  
after/during the upgrade?
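
If someone wants to clean up the recreated pool, verifying that it's empty and removing it should be something like this (pool deletion has to be allowed temporarily):

rados -p device_health_metrics ls
ceph config set mon mon_allow_pool_delete true
ceph osd pool rm device_health_metrics device_health_metrics --yes-i-really-really-mean-it
ceph config set mon mon_allow_pool_delete false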


Thanks,
Eugen

Zitat von Eugen Block :


Thanks for chiming in, Patrick.

Although I can't rule it out, I doubt that anyone except me was on  
the cluster after we performed the upgrade. It had a very low  
priority for the customer. Do you think that if I deleted the  
device_health_metrics pool and started a legacy mgr, it would  
recreate the pool? I think I should be able to try that, just to  
confirm.


Zitat von Patrick Donnelly :


On Tue, Aug 27, 2024 at 6:49 AM Eugen Block  wrote:


Hi,

I just looked into one customer cluster that we upgraded some time ago
from Octopus to Quincy (17.2.6) and I'm wondering why there are still
both pools, "device_health_metrics" and ".mgr".

According to the docs [0], it's supposed to be renamed:


Prior to Quincy, the devicehealth module created a
device_health_metrics pool to store device SMART statistics. With
Quincy, this pool is automatically renamed to be the common manager
module pool.


Now only .mgr has data while device_health_metrics is empty, but it
has a newer ID:

ses01:~ # ceph df | grep -E "device_health|.mgr"
.mgr                    1   1   68 MiB   18  204 MiB   0  254 TiB
device_health_metrics  15   1      0 B    0      0 B   0  254 TiB

On a test cluster (meanwhile upgraded to latest Reef) I see the same:

ceph01:~ # ceph df | grep -E "device_health_metrics|.mgr"
.mgr                   38   1  577 KiB    2  1.7 MiB   0   71 GiB
device_health_metrics  45   1      0 B    0      0 B   0   71 GiB

Since there are still many users who haven't upgraded to >= Quincy
yet, this should be clarified/fixed. I briefly checked
tracker.ceph.com, but didn't find anything related to this. I'm
currently trying to reproduce it on a one-node test cluster which I
upgraded from Pacific to Quincy, but no results yet, only that the
renaming was successful. But for the other clusters I don't have
enough logs to find out how/why the device_health_metrics pool had
been recreated.


Probably someone ran a pre-Quincy ceph-mgr on the cluster after the
upgrade? That would explain the larger pool id.

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: device_health_metrics pool automatically recreated

2024-09-29 Thread Eugen Block

Thanks for chiming in, Patrick.

Although I can't rule it out, I doubt that anyone except me was on the  
cluster after we performed the upgrade. It had a very low priority for  
the customer. Do you think that if I deleted the device_health_metrics  
pool and started a legacy mgr, it would recreate the pool? I think I  
should be able to try that, just to confirm.


Zitat von Patrick Donnelly :


On Tue, Aug 27, 2024 at 6:49 AM Eugen Block  wrote:


Hi,

I just looked into one customer cluster that we upgraded some time ago
from Octopus to Quincy (17.2.6) and I'm wondering why there are still
both pools, "device_health_metrics" and ".mgr".

According to the docs [0], it's supposed to be renamed:

> Prior to Quincy, the devicehealth module created a
> device_health_metrics pool to store device SMART statistics. With
> Quincy, this pool is automatically renamed to be the common manager
> module pool.

Now only .mgr has data while device_health_metrics is empty, but it
has a newer ID:

ses01:~ # ceph df | grep -E "device_health|.mgr"
.mgr                    1   1   68 MiB   18  204 MiB   0  254 TiB
device_health_metrics  15   1      0 B    0      0 B   0  254 TiB

On a test cluster (meanwhile upgraded to latest Reef) I see the same:

ceph01:~ # ceph df | grep -E "device_health_metrics|.mgr"
.mgr                   38   1  577 KiB    2  1.7 MiB   0   71 GiB
device_health_metrics  45   1      0 B    0      0 B   0   71 GiB

Since there are still many users who haven't upgraded to >= Quincy
yet, this should be clarified/fixed. I briefly checked
tracker.ceph.com, but didn't find anything related to this. I'm
currently trying to reproduce it on a one-node test cluster which I
upgraded from Pacific to Quincy, but no results yet, only that the
renaming was successful. But for the other clusters I don't have
enough logs to find out how/why the device_health_metrics pool had
been recreated.


Probably someone ran a pre-Quincy ceph-mgr on the cluster after the
upgrade? That would explain the larger pool id.

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph orchestrator not refreshing device list

2024-09-29 Thread Eugen Block
Okay, apparently this is not what I was facing. I see two other  
options right now. The first would be to purge osd.88 from the crush  
tree entirely.
The second would be to create the OSD manually with plain ceph-volume  
(not cephadm ceph-volume), i.e. as a legacy OSD (you'd get warnings  
about a stray daemon). If that works, adopt the OSD with cephadm.

I don't have a better idea right now.
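
To spell both options out (untested here; device path and OSD ID need to be adjusted):

# option 1: remove the destroyed osd.88 entirely
ceph osd purge 88 --yes-i-really-mean-it

# option 2: create a legacy OSD with plain ceph-volume, then adopt it
ceph-volume lvm create --data /dev/sdX
cephadm adopt --style legacy --name osd.NN   # NN = the ID ceph-volume assigned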

Zitat von Bob Gibson :


Here are the contents from the same directory on our osd node:

ceph-osd31.prod.os:/var/lib/ceph/9b3b3539-59a9-4338-8bab-3badfab6e855# ls -l
total 412
-rw-r--r--  1 root root 366903 Sep 14 14:53  
cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b

drwx--  3  167  167   4096 Sep 14 15:01 crash
drwxr-xr-x 12 root root   4096 Sep 15 12:06 custom_config_files
drw-rw  2 root root   4096 Sep 23 17:00 home
drwx--  2  167  167   4096 Sep 26 12:47 osd.84
drwx--  2  167  167   4096 Sep 26 12:47 osd.85
drwx--  2  167  167   4096 Sep 26 12:47 osd.86
drwx--  2  167  167   4096 Sep 26 12:47 osd.87
drwx--  2  167  167   4096 Sep 26 12:47 osd.89
drwx--  2  167  167   4096 Sep 26 12:47 osd.90
drwx--  2  167  167   4096 Sep 26 12:47 osd.91
drwx--  2  167  167   4096 Sep 26 12:47 osd.92
drwx--  2  167  167   4096 Sep 26 12:47 osd.93
drwx--  6 root root   4096 Sep 23 15:59 removed

In our case the osd.88 directory is under the subdirectory named  
“removed”, the same as the other odds which have been converted.


ceph-osd31.prod.os:/var/lib/ceph/9b3b3539-59a9-4338-8bab-3badfab6e855# ls -l  
removed/osd.88_2024-09-23T19\:59\:42.162302Z/

total 64
lrwxrwxrwx 1 167 167   93 Sep 15 12:10 block ->  
/dev/ceph-2a13ec6a-a5f0-4773-8254-c38b915c824a/osd-block-7f8f9778-5ae2-47c1-bd03-a92a3a7a1db1

-rw--- 1 167 167   37 Sep 15 12:10 ceph_fsid
-rw--- 1 167 167  259 Sep 14 15:14 config
-rw--- 1 167 167   37 Sep 15 12:10 fsid
-rw--- 1 167 167   56 Sep 15 12:10 keyring
-rw--- 1 167 1676 Sep 15 12:10 ready
-rw--- 1 167 1673 Sep 14 11:11 require_osd_release
-rw--- 1 167 167   10 Sep 15 12:10 type
-rw--- 1 167 167   38 Sep 14 15:14 unit.configured
-rw--- 1 167 167   48 Sep 14 15:14 unit.created
-rw--- 1 167 167   26 Sep 14 15:06 unit.image
-rw--- 1 167 167   76 Sep 14 15:06 unit.meta
-rw--- 1 167 167 1527 Sep 14 15:06 unit.poststop
-rw--- 1 167 167 2586 Sep 14 15:06 unit.run
-rw--- 1 167 167  334 Sep 14 15:06 unit.stop
-rw--- 1 167 1673 Sep 15 12:10 whoami

On Sep 27, 2024, at 9:30 AM, Eugen Block  wrote:

EXTERNAL EMAIL | USE CAUTION

Oh interesting, I just got into the same situation (I believe) on a
test cluster:

host1:~ # ceph orch ps | grep unknown
osd.1   host6          stopped  72s ago  36m    -    4096M  <unknown>  <unknown>  <unknown>
osd.13  host6          error    72s ago  36m    -    4096M  <unknown>  <unknown>  <unknown>

I still had the remainders on the filesystem:

host6:~ # ll /var/lib/ceph/543967bc-e586-32b8-bd2c-2d8b8b168f02/osd.1
insgesamt 68
lrwxrwxrwx 1 ceph ceph  111 27. Sep 14:43 block ->
/dev/mapper/ceph--0e90997f--456e--4a9b--a8f9--a6f1038c1216-osd--block--81e7f32a--a728--4848--b14d--0b86bb7e1c69
lrwxrwxrwx 1 ceph ceph  108 27. Sep 14:43 block.db ->
/dev/mapper/ceph--9ea6e95f--ad43--4e40--8920--2e772b2efa2f-osd--db--f9c57ec1--77c8--4d9a--85df--1dc053a24000

I just removed those two directories to clear the warning, now my
orchestrator can deploy OSDs again on that node.
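
Concretely, the cleanup amounted to removing the two leftover directories and letting the orchestrator refresh (fsid and OSD IDs are specific to this test cluster):

rm -rf /var/lib/ceph/543967bc-e586-32b8-bd2c-2d8b8b168f02/osd.1
rm -rf /var/lib/ceph/543967bc-e586-32b8-bd2c-2d8b8b168f02/osd.13
ceph orch device ls --refresh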

Hope that helps!

Zitat von Eugen Block :

Right, if you need encryption, a rebuild is required. Your procedure
has already worked 4 times, so I'd say nothing seems wrong with that
per se.
Regarding the stuck device list, do you see the mgr logging anything
suspicious? Especially when you say that it only returns output
after a failover. Those two osd specs are not conflicting since the
first is "unmanaged" after adoption.
Is there something in 'ceph orch osd rm status'? Can you run
'cephadm ceph-volume inventory' locally on that node? Do you see any
hints in the node's syslog? Maybe try a reboot or something?


Zitat von Bob Gibson :

Thanks for your reply Eugen. I’m fairly new to cephadm so I wasn’t
aware that we could manage the drives without rebuilding them.
However, we thought we’d take advantage of this opportunity to also
encrypt the drives, and that does require a rebuild.

I have a theory on why the orchestrator is confused. I want to
create an osd service for each osd node so I can manage drives on a
per-node basis.

I started by creating a spec for the first node:

service_type: osd
service_id: ceph-osd31
placement:
hosts:
- ceph-osd31
spec:
data_devices:
  rotational: 0
  size: '3TB:'
encrypted: true
filter_logic: AND
objectstore: bluestore

But I also see a default spec, “osd”, which has placement set to
“unmanaged”.

`ceph orch ls osd —export` shows the following:

service_type: osd
service_nam

[ceph-users] Re: Ceph orchestrator not refreshing device list

2024-09-27 Thread Eugen Block
Oh interesting, I just got into the same situation (I believe) on a  
test cluster:


host1:~ # ceph orch ps | grep unknown
osd.1   host6          stopped  72s ago  36m    -    4096M  <unknown>  <unknown>  <unknown>
osd.13  host6          error    72s ago  36m    -    4096M  <unknown>  <unknown>  <unknown>


I still had the remainders on the filesystem:

host6:~ # ll /var/lib/ceph/543967bc-e586-32b8-bd2c-2d8b8b168f02/osd.1
insgesamt 68
lrwxrwxrwx 1 ceph ceph  111 27. Sep 14:43 block ->  
/dev/mapper/ceph--0e90997f--456e--4a9b--a8f9--a6f1038c1216-osd--block--81e7f32a--a728--4848--b14d--0b86bb7e1c69
lrwxrwxrwx 1 ceph ceph  108 27. Sep 14:43 block.db ->  
/dev/mapper/ceph--9ea6e95f--ad43--4e40--8920--2e772b2efa2f-osd--db--f9c57ec1--77c8--4d9a--85df--1dc053a24000


I just removed those two directories to clear the warning, now my  
orchestrator can deploy OSDs again on that node.


Hope that helps!

Zitat von Eugen Block :

Right, if you need encryption, a rebuild is required. Your procedure  
has already worked 4 times, so I'd say nothing seems wrong with that  
per se.
Regarding the stuck device list, do you see the mgr logging anything  
suspicious? Especially when you say that it only returns output  
after a failover. Those two osd specs are not conflicting since the  
first is "unmanaged" after adoption.
Is there something in 'ceph orch osd rm status'? Can you run  
'cephadm ceph-volume inventory' locally on that node? Do you see any  
hints in the node's syslog? Maybe try a reboot or something?



Zitat von Bob Gibson :

Thanks for your reply Eugen. I’m fairly new to cephadm so I wasn’t  
aware that we could manage the drives without rebuilding them.  
However, we thought we’d take advantage of this opportunity to also  
encrypt the drives, and that does require a rebuild.


I have a theory on why the orchestrator is confused. I want to  
create an osd service for each osd node so I can manage drives on a  
per-node basis.


I started by creating a spec for the first node:

service_type: osd
service_id: ceph-osd31
placement:
 hosts:
 - ceph-osd31
spec:
 data_devices:
   rotational: 0
   size: '3TB:'
 encrypted: true
 filter_logic: AND
 objectstore: bluestore

But I also see a default spec, “osd”, which has placement set to  
“unmanaged”.


`ceph orch ls osd —export` shows the following:

service_type: osd
service_name: osd
unmanaged: true
spec:
 filter_logic: AND
 objectstore: bluestore
---
service_type: osd
service_id: ceph-osd31
service_name: osd.ceph-osd31
placement:
 hosts:
 - ceph-osd31
spec:
 data_devices:
   rotational: 0
   size: '3TB:'
 encrypted: true
 filter_logic: AND
 objectstore: bluestore

`ceph orch ls osd` shows that I was able to convert 4 drives using my spec:

NAMEPORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd 95  10m ago-
osd.ceph-osd31   4  10m ago43m  ceph-osd31

Despite being able to convert 4 drives, I’m wondering if these  
specs are conflicting with one another, and that has confused the  
orchestrator. If so, how do I safely get from where I am now to  
where I want to be? :-)


Cheers,
/rjg

On Sep 26, 2024, at 3:31 PM, Eugen Block  wrote:

EXTERNAL EMAIL | USE CAUTION

Hi,

this seems a bit unnecessary to rebuild OSDs just to get them managed.
If you apply a spec file that targets your hosts/OSDs, they will
appear as managed. So when you would need to replace a drive, you
could already utilize the orchestrator to remove and zap the drive.
That works just fine.
How to get out of your current situation is not entirely clear to me
yet. I’ll reread your post tomorrow.

Regards,
Eugen

Zitat von Bob Gibson :

Hi,

We recently converted a legacy cluster running Quincy v17.2.7 to
cephadm. The conversion went smoothly and left all osds unmanaged by
the orchestrator as expected. We’re now in the process of converting
the osds to be managed by the orchestrator. We successfully
converted a few of them, but then the orchestrator somehow got
confused. `ceph health detail` reports a “stray daemon” for the osd
we’re trying to convert, and the orchestrator is unable to refresh
its device list so it doesn’t see any available devices.

From the perspective of the osd node, the osd has been wiped and is
ready to be reinstalled. We’ve also rebooted the node for good
measure. `ceph osd tree` shows that the osd has been destroyed, but
the orchestrator won’t reinstall it because it thinks the device is
still active. The orchestrator device information is stale, but
we’re unable to refresh it. The usual recommended workaround of
failing over the mgr hasn’t helped. We’ve also tried `ceph orch
device ls —refresh` to no avail. In fact after running that command
subsequent runs of `ceph orch device ls` produce no output until the
mgr is failed over again.

Is there a way to force the orchestrator to refres

[ceph-users] Re: Restore a pool from snapshot

2024-09-27 Thread Eugen Block

Hi,

it's been a while since I last looked into this, but as far as I know,  
you'd have to iterate over each object in the pool to restore it from  
your snapshot. There's no option to restore all of them with one  
command.
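
A minimal sketch of such a loop (pool 'mypool' and snap 'mysnap' are placeholders; I'd test this on unimportant data first):

rados -p mypool ls | while read -r obj; do
    rados -p mypool rollback "$obj" mysnap
done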


Regards,
Eugen

Zitat von Pavel Kaygorodov :


Hi!

May be a dumb question, sorry, but how I can restore a whole pool  
from a snapshot?
I have made a snapshot with 'rados mksnap', but there is no command  
to restore a whole snapshot, only one object may be specified for  
rollback. Is it possible to restore all objects?


Thanks in advance,
  Pavel.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-26 Thread Eugen Block
Hm, I don't know much about ceph-ansible. Did you check whether there  
is a config set for a specific daemon that would override the global  
setting? For example, 'ceph config show-with-defaults mon.' for each  
mon, and then also check 'ceph config dump | grep rule'. I would also  
grep for crush_rule in all the usual places.
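
And as a workaround on Quincy you can always set the rule explicitly right after creating a pool:

ceph osd pool create foo
ceph osd pool set foo crush_rule rack-aware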


Zitat von Florian Haas :


On 25/09/2024 15:21, Eugen Block wrote:

Hm, do you have any local ceph.conf on your client which has an
override for this option as well?


No.


By the way, how do you bootstrap your cluster? Is it cephadm based?


This one is bootstrapped (on Quincy) with ceph-ansible. And when the  
"ceph config set" change didn't make a difference, I did also make a  
point of cycling all my mons and osds (which shouldn't be necessary,  
but I figured I'd try that, just in case).


And I also confirmed this same issue, in Quincy, after the cluster  
was adopted into cephadm management. At that point, the behaviour  
was still unchanged.


It was only after I upgraded the cluster to Reef, with  
cephadm/ceph orch, that the problem went away.


Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mds daemon damaged - assert failed

2024-09-26 Thread Eugen Block
It could be a bug, sure, but I haven't searched the tracker for long,  
so maybe there is already an existing report; I'd leave it to the devs  
to comment on that. The assert alone isn't of much help (to me),  
though; more mds logs could help track this down.
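
If it happens again, it would help to capture the crash with higher debug levels, e.g.:

ceph config set mds debug_mds 20
ceph config set mds debug_journaler 20

and revert with 'ceph config rm mds debug_mds' and 'ceph config rm mds debug_journaler' afterwards, the logs grow quickly.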


Zitat von "Kyriazis, George" :


On Sep 25, 2024, at 1:05 AM, Eugen Block  wrote:

Great that you got your filesystem back.


cephfs-journal-tool journal export
cephfs-journal-tool event recover_dentries summary

Both failed


Your export command seems to be missing the output file, or was it  
not the exact command?


Yes I didn’t include the output file in my snippet.  Sorry for the  
confusion.  But the command did in fact complain that the journal  
was corrupted.




Also, I understand that the metadata itself is sitting on the  
disk, but it looks like a single point of failure.  What’s the  
logic behind having a simple metadata location, but multiple mds  
servers?


I think there's a misunderstanding, the metadata is in the cephfs  
metadata pool, not on the local disk of your machine.




By “disk” I meant the concept of permanent storage, ie. Ceph.  Yes,  
our understanding matches.  But the question still remains, as to  
why that assert would trigger.  Is it because of a software issue  
(bug?) that caused the journal to be corrupted, or something else  
corrupted the journal that caused the MDS to throw the assertion?   
Basically, I’m trying to find what could be a possible root-cause..


Thank you!

George




Zitat von "Kyriazis, George" :


I managed to recover my filesystem.

cephfs-journal-tool journal export
cephfs-journal-tool event recover_dentries summary

Both failed

But truncating the journal and following some of the instructions  
in  
https://people.redhat.com/bhubbard/nature/default/cephfs/disaster-recovery-experts/ helped me to get the mds  
up.


Then I scrubbed and repaired the filesystem, and I “believe” I’m  
back in business.


What is weird though is that an assert failed as shown in the  
stack dump below.  Was that a legitimate assertion that indicates  
a bigger issue, or was it a false assertion?


Also, I understand that the metadata itself is sitting on the  
disk, but it looks like a single point of failure.  What’s the  
logic behind having a simple metadata location, but multiple mds  
servers?


Thanks!

George


On Sep 24, 2024, at 5:55 AM, Eugen Block  wrote:

Hi,

I would probably start by inspecting the journal with the  
cephfs-journal-tool [0]:


cephfs-journal-tool [--rank=:{mds-rank|all}] journal inspect

And it could be helpful to have the logs prior to the assert.

[0]  
https://docs.ceph.com/en/latest/cephfs/cephfs-journal-tool/#example-journal-inspect


Zitat von "Kyriazis, George" :

Hello ceph users,

I am in the unfortunate situation of having a status of “1 mds  
daemon damaged”.  Looking at the logs, I see that the daemon died  
with an assert as follows:


./src/osdc/Journaler.cc: 1368: FAILED ceph_assert(trim_to > trimming_pos)

ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748)  
reef (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x12a) [0x73a83189d7d9]

2: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
3: (Journaler::_trim()+0x671) [0x57235caa70b1]
4: (Journaler::_finish_write_head(int, Journaler::Header&,  
C_OnFinisher*)+0x171) [0x57235caaa8f1]

5: (Context::complete(int)+0x9) [0x57235c716849]
6: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
7: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
8: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]

   0> 2024-09-23T14:10:26.490-0500 73a822c006c0 -1 *** Caught  
signal (Aborted) **

in thread 73a822c006c0 thread_name:MR_Finisher

ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748)  
reef (stable)

1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x73a83105b050]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x73a8310a9e2c]
3: gsignal()
4: abort()
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x185) [0x73a83189d834]

6: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
7: (Journaler::_trim()+0x671) [0x57235caa70b1]
8: (Journaler::_finish_write_head(int, Journaler::Header&,  
C_OnFinisher*)+0x171) [0x57235caaa8f1]

9: (Context::complete(int)+0x9) [0x57235c716849]
10: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
11: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
12: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]
NOTE: a copy of the executable, or `objdump -rdS ` is  
needed to interpret this.



As listed above, I am running 18.2.2 on a proxmox cluster with a  
hybrid hdd/sdd setup.  2 cephfs filesystems.  The mds responsible  
for the hdd filesystem is the one that died.


Output of ceph -s follows:

root@vis-mgmt:~/bin# ceph -s
cluster:
  id: ec2c9542-dc1b-4af6-9f21-0adbcabb9452
  health: HEALTH_ERR

[ceph-users] Re: Ceph orchestrator not refreshing device list

2024-09-26 Thread Eugen Block
Right, if you need encryption, a rebuild is required. Your procedure  
has already worked 4 times, so I'd say nothing seems wrong with that  
per se.
Regarding the stuck device list, do you see the mgr logging anything  
suspicious? Especially when you say that it only returns output after  
a failover. Those two osd specs are not conflicting since the first is  
"unmanaged" after adoption.
Is there something in 'ceph orch osd rm status'? Can you run 'cephadm  
ceph-volume inventory' locally on that node? Do you see any hints in  
the node's syslog? Maybe try a reboot or something?



Zitat von Bob Gibson :

Thanks for your reply Eugen. I’m fairly new to cephadm so I wasn’t  
aware that we could manage the drives without rebuilding them.  
However, we thought we’d take advantage of this opportunity to also  
encrypt the drives, and that does require a rebuild.


I have a theory on why the orchestrator is confused. I want to  
create an osd service for each osd node so I can manage drives on a  
per-node basis.


I started by creating a spec for the first node:

service_type: osd
service_id: ceph-osd31
placement:
  hosts:
  - ceph-osd31
spec:
  data_devices:
rotational: 0
size: '3TB:'
  encrypted: true
  filter_logic: AND
  objectstore: bluestore

But I also see a default spec, “osd”, which has placement set to “unmanaged”.

`ceph orch ls osd —export` shows the following:

service_type: osd
service_name: osd
unmanaged: true
spec:
  filter_logic: AND
  objectstore: bluestore
---
service_type: osd
service_id: ceph-osd31
service_name: osd.ceph-osd31
placement:
  hosts:
  - ceph-osd31
spec:
  data_devices:
rotational: 0
size: '3TB:'
  encrypted: true
  filter_logic: AND
  objectstore: bluestore

`ceph orch ls osd` shows that I was able to convert 4 drives using my spec:

NAMEPORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd 95  10m ago-
osd.ceph-osd31   4  10m ago43m  ceph-osd31

Despite being able to convert 4 drives, I’m wondering if these specs  
are conflicting with one another, and that has confused the  
orchestrator. If so, how do I safely get from where I am now to  
where I want to be? :-)


Cheers,
/rjg

On Sep 26, 2024, at 3:31 PM, Eugen Block  wrote:

EXTERNAL EMAIL | USE CAUTION

Hi,

this seems a bit unnecessary to rebuild OSDs just to get them managed.
If you apply a spec file that targets your hosts/OSDs, they will
appear as managed. So when you would need to replace a drive, you
could already utilize the orchestrator to remove and zap the drive.
That works just fine.
How to get out of your current situation is not entirely clear to me
yet. I’ll reread your post tomorrow.

Regards,
Eugen

Zitat von Bob Gibson :

Hi,

We recently converted a legacy cluster running Quincy v17.2.7 to
cephadm. The conversion went smoothly and left all osds unmanaged by
the orchestrator as expected. We’re now in the process of converting
the osds to be managed by the orchestrator. We successfully
converted a few of them, but then the orchestrator somehow got
confused. `ceph health detail` reports a “stray daemon” for the osd
we’re trying to convert, and the orchestrator is unable to refresh
its device list so it doesn’t see any available devices.

From the perspective of the osd node, the osd has been wiped and is
ready to be reinstalled. We’ve also rebooted the node for good
measure. `ceph osd tree` shows that the osd has been destroyed, but
the orchestrator won’t reinstall it because it thinks the device is
still active. The orchestrator device information is stale, but
we’re unable to refresh it. The usual recommended workaround of
failing over the mgr hasn’t helped. We’ve also tried `ceph orch
device ls —refresh` to no avail. In fact after running that command
subsequent runs of `ceph orch device ls` produce no output until the
mgr is failed over again.

Is there a way to force the orchestrator to refresh its list of
devices when in this state? If not, can anyone offer any suggestions
on how to fix this problem?

Cheers,
/rjg

P.S. Some additional information in case it’s helpful...

We’re using the following command to replace existing devices so
that they’re managed by the orchestrator:

```
ceph orch osd rm  --replace —zap
```

and we’re currently stuck on osd 88.

```
ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
   stray daemon osd.88 on host ceph-osd31 not managed by cephadm
```

`ceph osd tree` shows that the osd has been destroyed and is ready
to be replaced:

```
ceph osd tree-from ceph-osd31
ID   CLASS  WEIGHTTYPE NAMESTATUS REWEIGHT  PRI-AFF
-46 34.93088  host ceph-osd31
84    ssd   3.49309  osd.84  up   1.0  1.0
85    ssd   3.49309  osd.85  up   1.0  1.0
86    ssd   3.49309  osd.86

[ceph-users] Re: Ceph orchestrator not refreshing device list

2024-09-26 Thread Eugen Block

Hi,

this seems a bit unnecessary to rebuild OSDs just to get them managed.  
If you apply a spec file that targets your hosts/OSDs, they will  
appear as managed. So when you would need to replace a drive, you  
could already utilize the orchestrator to remove and zap the drive.  
That works just fine.
How to get out of your current situation is not entirely clear to me  
yet. I’ll reread your post tomorrow.


Regards,
Eugen

Zitat von Bob Gibson :


Hi,

We recently converted a legacy cluster running Quincy v17.2.7 to  
cephadm. The conversion went smoothly and left all osds unmanaged by  
the orchestrator as expected. We’re now in the process of converting  
the osds to be managed by the orchestrator. We successfully  
converted a few of them, but then the orchestrator somehow got  
confused. `ceph health detail` reports a “stray daemon” for the osd  
we’re trying to convert, and the orchestrator is unable to refresh  
its device list so it doesn’t see any available devices.


From the perspective of the osd node, the osd has been wiped and is  
ready to be reinstalled. We’ve also rebooted the node for good  
measure. `ceph osd tree` shows that the osd has been destroyed, but  
the orchestrator won’t reinstall it because it thinks the device is  
still active. The orchestrator device information is stale, but  
we’re unable to refresh it. The usual recommended workaround of  
failing over the mgr hasn’t helped. We’ve also tried `ceph orch  
device ls —refresh` to no avail. In fact after running that command  
subsequent runs of `ceph orch device ls` produce no output until the  
mgr is failed over again.


Is there a way to force the orchestrator to refresh its list of  
devices when in this state? If not, can anyone offer any suggestions  
on how to fix this problem?


Cheers,
/rjg

P.S. Some additional information in case it’s helpful...

We’re using the following command to replace existing devices so  
that they’re managed by the orchestrator:


```
ceph orch osd rm  --replace —zap
```

and we’re currently stuck on osd 88.

```
ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
stray daemon osd.88 on host ceph-osd31 not managed by cephadm
```

`ceph osd tree` shows that the osd has been destroyed and is ready  
to be replaced:


```
ceph osd tree-from ceph-osd31
ID   CLASS  WEIGHTTYPE NAMESTATUS REWEIGHT  PRI-AFF
-46 34.93088  host ceph-osd31
 84    ssd   3.49309  osd.84  up   1.0  1.0
 85    ssd   3.49309  osd.85  up   1.0  1.0
 86    ssd   3.49309  osd.86  up   1.0  1.0
 87    ssd   3.49309  osd.87  up   1.0  1.0
 88    ssd   3.49309  osd.88   destroyed 0  1.0
 89    ssd   3.49309  osd.89  up   1.0  1.0
 90    ssd   3.49309  osd.90  up   1.0  1.0
 91    ssd   3.49309  osd.91  up   1.0  1.0
 92    ssd   3.49309  osd.92  up   1.0  1.0
 93    ssd   3.49309  osd.93  up   1.0  1.0
```

The cephadm log shows a claim on node `ceph-osd31` for that osd:

```
2024-09-25T14:15:45.699348-0400 mgr.ceph-mon3.qzjgws [INF] Found osd  
claims -> {'ceph-osd31': ['88']}
2024-09-25T14:15:45.699534-0400 mgr.ceph-mon3.qzjgws [INF] Found osd  
claims for drivegroup ceph-osd31 -> {'ceph-osd31': ['88']}

```

`ceph orch device ls` shows that the device list isn’t refreshing:

```
ceph orch device ls ceph-osd31
HOSTPATH  TYPE  DEVICE ID 
SIZE  AVAILABLE  REFRESHED  REJECT REASONS
ceph-osd31  /dev/sdc  ssd   INTEL_SSDSC2KG038T8_PHYG039603PE3P8EGN   
3576G  No 22h agoInsufficient space (<10 extents) on  
vgs, LVM detected, locked
ceph-osd31  /dev/sdd  ssd   INTEL_SSDSC2KG038T8_PHYG039600AY3P8EGN   
3576G  No 22h agoInsufficient space (<10 extents) on  
vgs, LVM detected, locked
ceph-osd31  /dev/sde  ssd   INTEL_SSDSC2KG038T8_PHYG039600CW3P8EGN   
3576G  No 22h agoInsufficient space (<10 extents) on  
vgs, LVM detected, locked
ceph-osd31  /dev/sdf  ssd   INTEL_SSDSC2KG038T8_PHYG039600CM3P8EGN   
3576G  No 22h agoInsufficient space (<10 extents) on  
vgs, LVM detected, locked
ceph-osd31  /dev/sdg  ssd   INTEL_SSDSC2KG038T8_PHYG039600UB3P8EGN   
3576G  No 22h agoInsufficient space (<10 extents) on  
vgs, LVM detected, locked
ceph-osd31  /dev/sdh  ssd   INTEL_SSDSC2KG038T8_PHYG039603753P8EGN   
3576G  No 22h agoInsufficient space (<10 extents) on  
vgs, LVM detected, locked
ceph-osd31  /dev/sdi  ssd   INTEL_SSDSC2KG038T8_PHYG039603R63P8EGN   
3576G  No 22h agoInsufficient space (<10 extents) on  
vgs, LVM detected, locked
ceph-osd31  /dev/sdj  ssd   INTEL_SSDSC2KG038TZ_PHYJ4011032M3P8DGN   
3576G  No 22h agoInsufficient space (<10 extents) on 

[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Eugen Block
I redeployed a different single-node-cluster with quincy 17.2.6 and it  
works there as well.


Zitat von Eugen Block :

Hm, do you have any local ceph.conf on your client which has an  
override for this option as well? By the way, how do you bootstrap  
your cluster? Is it cephadm based?


Zitat von Florian Haas :


Hi Eugen,

I've just torn down and completely respun my cluster, on 17.2.7.

Recreated my CRUSH rule, set osd_pool_default_crush_rule to its rule_id, 1.

Created a new pool.

That new pool still has crush_rule 0, just as before and contrary  
to what you're seeing.


I'm a bit puzzled, because I'm out of ideas as to what could break  
on my cluster and work fine on yours, to cause this. Odd.


Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Eugen Block
Hm, do you have any local ceph.conf on your client which has an  
override for this option as well? By the way, how do you bootstrap  
your cluster? Is it cephadm based?


Zitat von Florian Haas :


Hi Eugen,

I've just torn down and completely respun my cluster, on 17.2.7.

Recreated my CRUSH rule, set osd_pool_default_crush_rule to its rule_id, 1.

Created a new pool.

That new pool still has crush_rule 0, just as before and contrary to  
what you're seeing.


I'm a bit puzzled, because I'm out of ideas as to what could break  
on my cluster and work fine on yours, to cause this. Odd.


Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Eugen Block

Still works:

quincy-1:~ # ceph osd crush rule create-simple simple-rule default osd
quincy-1:~ # ceph osd crush rule dump simple-rule
{
"rule_id": 4,
...

quincy-1:~ # ceph config set mon osd_pool_default_crush_rule 4
quincy-1:~ # ceph osd pool create test-pool6
pool 'test-pool6' created
quincy-1:~ # ceph osd pool ls detail | grep test-pool
pool 24 'test-pool6' replicated size 2 min_size 1 crush_rule 4  
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change  
2615 flags hashpspool stripe_width 0




Zitat von Florian Haas :


On 25/09/2024 09:05, Eugen Block wrote:

Hi,

for me this worked in a 17.2.7 cluster just fine


Huh, interesting!


(except for erasure-coded pools).


Okay, *that* bit is expected.  
https://docs.ceph.com/en/quincy/rados/configuration/pool-pg-config-ref/#confval-osd_pool_default_crush_rule does say that the option sets the "default CRUSH rule to use when creating a replicated  
pool".



quincy-1:~ # ceph osd crush rule create-replicated new-rule default osd hdd


Mine was a rule created with "create-simple"; would that make a difference?

Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Eugen Block

Hi,

for me this worked in a 17.2.7 cluster just fine (except for  
erasure-coded pools).


quincy-1:~ # ceph osd crush rule create-replicated new-rule default osd hdd

quincy-1:~ # ceph config set mon osd_pool_default_crush_rule 1

quincy-1:~ # ceph osd pool create test-pool2
pool 'test-pool2' created

quincy-1:~ # ceph osd pool ls detail | grep test-pool2
pool 20 'test-pool2' replicated size 2 min_size 1 crush_rule 1  
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change  
2593 flags hashpspool stripe_width 0


quincy-1:~ # ceph versions
{
...
"overall": {
"ceph version 17.2.7  
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 11

}
}

Setting the option globally works as well for me.

Regards,
Eugen

Zitat von Florian Haas :


Hello everyone,

my cluster has two CRUSH rules: the default replicated_rule  
(rule_id 0), and another rule named rack-aware (rule_id 1).


Now, if I'm not misreading the config reference, I should be able to  
define that all future-created pools use the rack-aware rule, by  
setting osd_pool_default_crush_rule to 1.


I've verified that this option is defined in  
src/common/options/global.yaml.in, so the "global" configuration  
section should be the applicable one (I did try with "mon" and "osd"  
also, for good measure).


However, setting this option, in Quincy, apparently has no effect:

# ceph config set global osd_pool_default_crush_rule 1
# ceph osd pool create foo
pool 'foo' created
# ceph osd pool ls detail | grep foo
# pool 9 'foo' replicated size 3 min_size 2 crush_rule 0 object_hash  
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 264 flags  
hashpspool stripe_width 0


I am seeing this behaviour in 17.2.7. After an upgrade to Reef  
(18.2.4) it is gone, the option behaves as documented, and new pools  
are created with a crush_rule of 1:


# ceph osd pool create bar
pool 'bar' created
# ceph osd pool ls detail | grep bar
pool 10 'bar' replicated size 3 min_size 2 crush_rule 1 object_hash  
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 302 flags  
hashpspool stripe_width 0 read_balance_score 4.00


However, the documentation at  
https://docs.ceph.com/en/quincy/rados/configuration/pool-pg-config-ref/#confval-osd_pool_default_crush_rule asserts that osd_pool_default_crush_rule should already work in Quincy, and the Reef release notes at https://docs.ceph.com/en/latest/releases/reef/ don't mention a fix covering  
this.


Am I doing something wrong? Is this a documentation bug, and the  
option can't work in Quincy? Was this "accidentally" fixed at some  
point in the Reef cycle?


Thanks in advance for any insight you might be able to share.

Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mds daemon damaged - assert failed

2024-09-24 Thread Eugen Block

Great that you got your filesystem back.


cephfs-journal-tool journal export
cephfs-journal-tool event recover_dentries summary

Both failed


Your export command seems to be missing the output file, or was it not  
the exact command?


Also, I understand that the metadata itself is sitting on the disk,  
but it looks like a single point of failure.  What’s the logic  
behind having a simple metadata location, but multiple mds servers?


I think there's a misunderstanding, the metadata is in the cephfs  
metadata pool, not on the local disk of your machine.



Zitat von "Kyriazis, George" :


I managed to recover my filesystem.

cephfs-journal-tool journal export
cephfs-journal-tool event recover_dentries summary

Both failed

But truncating the journal and following some of the instructions in  
https://people.redhat.com/bhubbard/nature/default/cephfs/disaster-recovery-experts/ helped me to get the mds  
up.


Then I scrubbed and repaired the filesystem, and I “believe” I’m  
back in business.


What is weird though is that an assert failed as shown in the stack  
dump below.  Was that a legitimate assertion that indicates a bigger  
issue, or was it a false assertion?


Also, I understand that the metadata itself is sitting on the disk,  
but it looks like a single point of failure.  What’s the logic  
behind having a simple metadata location, but multiple mds servers?


Thanks!

George


On Sep 24, 2024, at 5:55 AM, Eugen Block  wrote:

Hi,

I would probably start by inspecting the journal with the  
cephfs-journal-tool [0]:


cephfs-journal-tool [--rank=:{mds-rank|all}] journal inspect

And it could be helpful to have the logs prior to the assert.

[0]  
https://docs.ceph.com/en/latest/cephfs/cephfs-journal-tool/#example-journal-inspect


Zitat von "Kyriazis, George" :

Hello ceph users,

I am in the unfortunate situation of having a status of “1 mds  
daemon damaged”.  Looking at the logs, I see that the daemon died  
with an assert as follows:


./src/osdc/Journaler.cc: 1368: FAILED ceph_assert(trim_to > trimming_pos)

ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x12a) [0x73a83189d7d9]

2: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
3: (Journaler::_trim()+0x671) [0x57235caa70b1]
4: (Journaler::_finish_write_head(int, Journaler::Header&,  
C_OnFinisher*)+0x171) [0x57235caaa8f1]

5: (Context::complete(int)+0x9) [0x57235c716849]
6: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
7: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
8: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]

0> 2024-09-23T14:10:26.490-0500 73a822c006c0 -1 *** Caught  
signal (Aborted) **

in thread 73a822c006c0 thread_name:MR_Finisher

ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x73a83105b050]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x73a8310a9e2c]
3: gsignal()
4: abort()
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x185) [0x73a83189d834]

6: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
7: (Journaler::_trim()+0x671) [0x57235caa70b1]
8: (Journaler::_finish_write_head(int, Journaler::Header&,  
C_OnFinisher*)+0x171) [0x57235caaa8f1]

9: (Context::complete(int)+0x9) [0x57235c716849]
10: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
11: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
12: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]
NOTE: a copy of the executable, or `objdump -rdS ` is  
needed to interpret this.



As listed above, I am running 18.2.2 on a proxmox cluster with a  
hybrid hdd/ssd setup.  2 cephfs filesystems.  The mds responsible
for the hdd filesystem is the one that died.


Output of ceph -s follows:

root@vis-mgmt:~/bin# ceph -s
 cluster:
   id: ec2c9542-dc1b-4af6-9f21-0adbcabb9452
   health: HEALTH_ERR
   1 filesystem is degraded
   1 filesystem is offline
   1 mds daemon damaged
   5 pgs not scrubbed in time
   1 daemons have recently crashed
   services:
   mon: 5 daemons, quorum  
vis-hsw-01,vis-skx-01,vis-clx-15,vis-clx-04,vis-icx-00 (age 6m)
   mgr: vis-hsw-02(active, since 13d), standbys: vis-skx-02,  
vis-hsw-04, vis-clx-08, vis-clx-02

   mds: 1/2 daemons up, 5 standby
   osd: 97 osds: 97 up (since 3h), 97 in (since 4d)
   data:
   volumes: 1/2 healthy, 1 recovering; 1 damaged
   pools:   14 pools, 1961 pgs
   objects: 223.70M objects, 304 TiB
   usage:   805 TiB used, 383 TiB / 1.2 PiB avail
   pgs: 1948 active+clean
         9    active+clean+scrubbing+deep
         4    active+clean+scrubbing
   io:
   client:   86 KiB/s rd, 5.5 MiB/s wr, 64 op/s rd, 26 op/s wr



I tried restarting all the mds daemons but they are all marked as
“standby”.  I also tried restarting all the mon

[ceph-users] Re: Mds daemon damaged - assert failed

2024-09-24 Thread Eugen Block

Hi,

I would probably start by inspecting the journal with the  
cephfs-journal-tool [0]:


cephfs-journal-tool [--rank=:{mds-rank|all}] journal inspect

And it could be helpful to have the logs prior to the assert.

[0]  
https://docs.ceph.com/en/latest/cephfs/cephfs-journal-tool/#example-journal-inspect


Zitat von "Kyriazis, George" :


Hello ceph users,

I am in the unfortunate situation of having a status of “1 mds  
daemon damaged”.  Looking at the logs, I see that the daemon died  
with an assert as follows:


./src/osdc/Journaler.cc: 1368: FAILED ceph_assert(trim_to > trimming_pos)

 ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x12a) [0x73a83189d7d9]

 2: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
 3: (Journaler::_trim()+0x671) [0x57235caa70b1]
 4: (Journaler::_finish_write_head(int, Journaler::Header&,  
C_OnFinisher*)+0x171) [0x57235caaa8f1]

 5: (Context::complete(int)+0x9) [0x57235c716849]
 6: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
 7: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
 8: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]

 0> 2024-09-23T14:10:26.490-0500 73a822c006c0 -1 *** Caught  
signal (Aborted) **

 in thread 73a822c006c0 thread_name:MR_Finisher

 ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x73a83105b050]
 2: /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x73a8310a9e2c]
 3: gsignal()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x185) [0x73a83189d834]

 6: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
 7: (Journaler::_trim()+0x671) [0x57235caa70b1]
 8: (Journaler::_finish_write_head(int, Journaler::Header&,  
C_OnFinisher*)+0x171) [0x57235caaa8f1]

 9: (Context::complete(int)+0x9) [0x57235c716849]
 10: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
 11: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
 12: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]
 NOTE: a copy of the executable, or `objdump -rdS ` is  
needed to interpret this.



As listed above, I am running 18.2.2 on a proxmox cluster with a  
hybrid hdd/ssd setup.  2 cephfs filesystems.  The mds responsible
for the hdd filesystem is the one that died.


Output of ceph -s follows:

root@vis-mgmt:~/bin# ceph -s
  cluster:
id: ec2c9542-dc1b-4af6-9f21-0adbcabb9452
health: HEALTH_ERR
1 filesystem is degraded
1 filesystem is offline
1 mds daemon damaged
5 pgs not scrubbed in time
1 daemons have recently crashed
services:
mon: 5 daemons, quorum  
vis-hsw-01,vis-skx-01,vis-clx-15,vis-clx-04,vis-icx-00 (age 6m)
mgr: vis-hsw-02(active, since 13d), standbys: vis-skx-02,  
vis-hsw-04, vis-clx-08, vis-clx-02

mds: 1/2 daemons up, 5 standby
osd: 97 osds: 97 up (since 3h), 97 in (since 4d)
data:
volumes: 1/2 healthy, 1 recovering; 1 damaged
pools:   14 pools, 1961 pgs
objects: 223.70M objects, 304 TiB
usage:   805 TiB used, 383 TiB / 1.2 PiB avail
pgs: 1948 active+clean
         9    active+clean+scrubbing+deep
         4    active+clean+scrubbing
io:
client:   86 KiB/s rd, 5.5 MiB/s wr, 64 op/s rd, 26 op/s wr



I tried restarting all the mds daemons but they are all marked as
“standby”.  I also tried restarting all the mons and then the mds  
daemons again, but that didn’t help.


Much help is appreciated!

Thank you!

George

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: [External Email] Overlapping Roots - How to Fix?

2024-09-21 Thread Eugen Block
I think it would suffice to change rule 0 to use a device class as  
well, as you already mentioned yourself. Do you have pools that use  
that rule? If not, the change wouldn’t even have any impact.
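
'ceph osd pool ls detail' shows the crush_rule per pool, so that would tell you quickly. And instead of editing the map by hand, a device-class rule can also be created and assigned roughly like this (rule and pool names are just examples):

ceph osd crush rule create-replicated replicated-hdd-host default host hdd
ceph osd pool set <pool> crush_rule replicated-hdd-host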


Zitat von Dave Hall :


Oddly, the Nautilus cluster that I'm gradually decommissioning seems to
have the same shadow root pattern in its crush map.  I don't know if that
really means anything, but at least I know it's not something I did
differently when I set up the new Reef cluster.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu



On Fri, Sep 20, 2024 at 12:48 PM Dave Hall  wrote:


Stefan, Anthony,

Anthony's sequence of commands to reclassify the root failed with errors.
so I have tried to look a little deeper.

I can now see the extra root via 'ceph osd crush tree --show-shadow'.
Looking at the decompiled crush tree, I can also see the extra root:

root default {
id -1   # do not change unnecessarily

*id -2 class hdd # do not change unnecessarily*
# weight 361.90518
alg straw2
hash 0  # rjenkins1
item ceph00 weight 90.51434
item ceph01 weight 90.29265
item ceph09 weight 90.80554
item ceph02 weight 90.29265
}


Based on the hints given in the link provided by Stefan, it would appear
that the correct solution might be to get rid of 'id -2' and change id -1
to class hdd,

root default {

*id -1 class hdd # do not change unnecessarily*
# weight 361.90518
alg straw2
hash 0  # rjenkins1
item ceph00 weight 90.51434
item ceph01 weight 90.29265
item ceph09 weight 90.80554
item ceph02 weight 90.29265
}


but I'm no expert and anxious about losing data.

The rest of the rules in my crush map are:

# rules
rule replicated_rule {
id 0
type replicated
step take default
step chooseleaf firstn 0 type host
step emit
}
rule block-1 {
id 1
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 0 type osd
step emit
}
rule default.rgw.buckets.data {
id 2
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 0 type osd
step emit
}
rule ceph-block {
id 3
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 0 type osd
step emit
}
rule replicated-hdd {
id 4
type replicated
step take default class hdd
step choose firstn 0 type osd
step emit
}

# end crush map


Of these, the last - id 4 - is one that I added while trying to figure
this out.  What this tells me is that the 'take' step in rule id 0 should
probably change to 'step take default class hdd'.
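
I.e. something like this (untested on my side):

rule replicated_rule {
        id 0
        type replicated
        step take default class hdd
        step chooseleaf firstn 0 type host
        step emit
}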

I also notice that each of my host stanzas (buckets) has what looks like
two roots.  For example

host ceph00 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 90.51434
alg straw2
hash 0 # rjenkins1
item osd.0 weight 11.35069
item osd.1 weight 11.35069
item osd.2 weight 11.35069
item osd.3 weight 11.35069
item osd.4 weight 11.27789
item osd.5 weight 11.27789
item osd.6 weight 11.27789
item osd.7 weight 11.27789
}


I assume I may need to clean this up somehow, or perhaps this is the real
problem.

Please advise.

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu

On Thu, Sep 19, 2024 at 3:56 AM Stefan Kooman  wrote:


On 19-09-2024 05:10, Anthony D'Atri wrote:
>
>
>>
>> Anthony,
>>
>> So it sounds like I need to make a new crush rule for replicated pools
that specifies default-hdd and the device class?  (Or should I go the other
way around?  I think I'd rather change the replicated pools even though
there's more of them.)
>
> I think it would be best to edit the CRUSH rules in-situ so that each
specifies the device class, that way if you do get different media in the
future, you'll be ready.  Rather than messing around with new rules and
modifying pools, this is arguably one of the few times when one would
decompile, edit, recompile, and inject the CRUSH map in toto.
>
> I haven't tried this myself, but maybe something like the below, to
avoid the PITA and potential for error of editing the decompiled text file
by hand.
>
>
> ceph osd getcrushmap -o original.crush
> crushtool -d original.crush -o original.txt
> crushtool -i original.crush --reclassify --reclassify-root default hdd
--set-subtree-class default hdd -o adjusted.crush
> crushtool -d adjusted.crush -o adjusted.txt
> crushtool -i original.crush --compare adjusted.crush
> ceph osd setcrushmap -i adjusted.crush

This might be of use as well (if a lot of data would move):
https://blog.widodh.nl/2019/02/comparing-two-ceph-crush-maps/

Gr. Stefan




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




[ceph-users] Re: scrubing

2024-09-20 Thread Eugen Block
I assume that the OSD maybe had some backfill going on, hence waiting  
for the scheduled deep-scrub to start. There are config options which  
would allow deep-scrubs during recovery, I believe, but if it's not a  
real issue, you can leave it as is.
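
The option I have in mind is osd_scrub_during_recovery, e.g.:

ceph config set osd osd_scrub_during_recovery true

But again, only if the delayed scrubs actually become a problem.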


Zitat von Albert Shih :


Le 20/09/2024 à 11:01:20+0200, Albert Shih a écrit
Hi,



>
> >   Is they are any way to find which pg ceph status are talking about.
>
> 'ceph health detail' will show you which PG it's warning about.

Too easy for me ;-) ;-)...Thanks ;-)

>
> >   Is they are any way to see the progress or  
scrubbing/remapping/backfill ?

>
> You can see when (deep-)scrubs have been started in the OSD logs or
> depending on your cluster log configuration:
>
> ceph log last 1000 debug cluster | grep scrub

Thanks.

So I think I get some issue with one osd

2024-09-20T08:55:59.760766+ osd.356 (osd.356) 84 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:03.751487+ osd.356 (osd.356) 85 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:05.742816+ osd.356 (osd.356) 86 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:07.822277+ osd.356 (osd.356) 87 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:08.795748+ osd.356 (osd.356) 88 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:09.749838+ osd.356 (osd.356) 89 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:10.778235+ osd.356 (osd.356) 90 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:14.792102+ osd.356 (osd.356) 91 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:15.832620+ osd.356 (osd.356) 92 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:16.791811+ osd.356 (osd.356) 93 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:17.798181+ osd.356 (osd.356) 94 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:19.793526+ osd.356 (osd.356) 95 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:20.809140+ osd.356 (osd.356) 96 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:21.835052+ osd.356 (osd.356) 97 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:22.817378+ osd.356 (osd.356) 98 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:28.887092+ osd.356 (osd.356) 99 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:32.907468+ osd.356 (osd.356) 100 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:33.900065+ osd.356 (osd.356) 101 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:37.853769+ osd.356 (osd.356) 102 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:38.850513+ osd.356 (osd.356) 103 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:46.799896+ osd.356 (osd.356) 104 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:50.915450+ osd.356 (osd.356) 105 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:53.875875+ osd.356 (osd.356) 106 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:55.791675+ osd.356 (osd.356) 107 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:57.833012+ osd.356 (osd.356) 108 : cluster  
[DBG] 4.51 deep-scrub starts
2024-09-20T08:56:59.818262+ osd.356 (osd.356) 109 : cluster  
[DBG] 4.51 deep-scrub starts


but the osd are still in

  queued for deep scrub


Yeah... well, after a few hours the scrub eventually started.

So everything seems fine.

Regards
--
Albert SHIH 🦫 🐸
Observatoire de Paris
France
Heure locale/Local time:
ven. 20 sept. 2024 14:13:20 CEST



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: scrubing

2024-09-20 Thread Eugen Block

Hi,

there's some ratio involved when deep-scrubs are checked:

(mon_warn_pg_not_deep_scrubbed_ratio * deep_scrub_interval) +  
deep_scrub_interval


So based on the defaults, ceph would only warn if the last deep-scrub  
timestamp is older than:


(0.75 * 7 days) + 7 days = 12.25 days
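
You can check the values in play with e.g.:

ceph config get osd osd_deep_scrub_interval
ceph config get mon mon_warn_pg_not_deep_scrubbed_ratio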

Note that the MGR also has a config for deep_scrub_interval. Check out
the docs [0] or my recent blog [1] on that topic.



  Why does ceph tell me «1» pg has not been scrubbed when I see 15?


See my reply above.


  Is there any way to find which pg ceph status is talking about?


'ceph health detail' will show you which PG it's warning about.


  Is there any way to see the progress of scrubbing/remapping/backfill?


You can see when (deep-)scrubs have been started in the OSD logs or  
depending on your cluster log configuration:


ceph log last 1000 debug cluster | grep scrub

The (deep-)scrub duration depends on the PG sizes, so they can vary.  
But from experience (and older Logs) you can see if scrubbing duration  
has increased. I haven't checked if there's a metric for that in  
prometheus.
As for remapping and backfill operations, they are constantly reported  
in 'ceph status', it shows how many objects are degraded, how many PGs  
are remapped etc. If you mean something else, please clarify.


Regards,
Eugen

[0]  
https://docs.ceph.com/en/latest/rados/operations/health-checks/#pg-not-deep-scrubbed
[1]  
https://heiterbiswolkig.blogs.nde.ag/2024/09/06/pgs-not-deep-scrubbed-in-time/


Zitat von Albert Shih :


Hi everyone.

Few time ago I add a new node to my cluster with some HDD.

Currently the cluster does the remapping and backfill.

I now got a warning about

HEALTH_WARN 1 pgs not deep-scrubbed in time

So I check and find something a litle weird.

root@cthulhu1:~# ceph config get osd osd_deep_scrub_interval
604800.00

so that's one week.

If I check the LAST DEEP SCRUB TIMESTAMP I got

root@cthulhu1:~# ceph pg dump pgs | awk '{print $1" "$24}' | grep -v  
2024-09-[1-2][0-9]

dumped pgs
PG_STAT DEEP_SCRUB_STAMP
4.63 2024-09-09T19:00:57.739975+
4.5a 2024-09-09T08:17:15.124704+
4.56 2024-09-09T21:51:07.478651+
4.51 2024-09-08T00:10:30.552347+
4.4c 2024-09-09T10:35:02.048445+
4.4b 2024-09-09T19:53:19.839341+
4.14 2024-09-08T18:36:12.025455+
4.c 2024-09-09T16:00:59.047968+
4.4 2024-09-09T00:19:07.554153+
4.8 2024-09-09T22:19:15.280310+
4.25 2024-09-09T06:45:37.258306+
4.30 2024-09-09T16:56:21.472410+
4.82 2024-09-09T21:14:09.802303+
4.c9 2024-09-08T17:10:56.133363+
4.f7 2024-09-09T08:25:40.011924+

If I check the status of those PG it's or

  active+clean+scrubbing+deep and deep scrubbing for Xs

or

  queued for deep scrub

So my questions are :

  Why does ceph tell me «1» pg has not been scrubbed when I see 15?

  Is there any way to find which pg ceph status is talking about?

  Is there any way to see the progress of scrubbing/remapping/backfill?

Regards


--
Albert SHIH 🦫 🐸
Observatoire de Paris
France
Heure locale/Local time:
ven. 20 sept. 2024 09:35:43 CEST
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: For 750 osd 5 monitor in redhat doc

2024-09-18 Thread Eugen Block

Hi,

you need to consider your resiliency requirements. With 3 MONs, you'll  
lose quorum if you have one MON in maintenance and another one fails  
during that window. You're safer with 5 MONs in such a case. We have  
been running our own cluster with 3 MONs for years without any issues,  
so it really depends.


Zitat von "Szabo, Istvan (Agoda)" :


Hi,

I went through some Redhat doc and saw this sentence which I didn't  
see in Ceph docs:


"The storage cluster can run with only one Ceph Monitor; however, to  
ensure high availability in a production storage cluster, Red Hat  
will only support deployments with at least three Ceph Monitor  
nodes. Red Hat recommends deploying a total of 5 Ceph Monitors for  
storage clusters exceeding 750 Ceph OSDs."


here:
https://docs.redhat.com/en/documentation/red_hat_ceph_storage/4/html/installation_guide/what-is-red-hat-ceph-storage_install#what-is-red-hat-ceph-storage_install

We will reach this number, so I wonder: do we really need 5 mons?
Currently we have 3 collocated mgr/mon on bare-metal servers with
some multisite gw traffic; if needed I'd plan to simply add some mons
on VMs only.


Thank you


This message is confidential and is for the sole use of the intended  
recipient(s). It may also be privileged or otherwise protected by  
copyright or other legal rules. If you have received it by mistake  
please let us know by reply email and delete it from your system. It  
is prohibited to copy this message or disclose its content to  
anyone. Any confidentiality or privilege is not waived or lost by  
any mistaken delivery or unauthorized disclosure of the message. All  
messages sent to and from Agoda may be monitored to ensure  
compliance with company policies, to protect the company's interests  
and to remove potential malware. Electronic messages may be  
intercepted, amended, lost or deleted, or contain viruses.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: Ceph octopus version cluster not starting

2024-09-16 Thread Eugen Block
Have you tried to start it with a higher debug level? Is the ceph.conf  
still correct? Is a keyring present in /var/lib/ceph/mon? Is the mon  
store in good shape?


Can you run something like this?

ceph-monstore-tool /var/lib/ceph/mon/ceph-{MON}/ get monmap -- --out monmap

monmaptool --print monmap

Does it print the expected output?
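
To get more verbose startup output, you could also try running the MON in the foreground with a higher debug level, roughly like this (the id is just a placeholder, and you may need --setuser ceph --setgroup ceph):

ceph-mon -d --id <mon-id> --debug_mon 20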

Zitat von Amudhan P :


No, I don't use cephadm, and I have enough space for log storage.

When I try to start the mon service on any of the nodes, it just keeps waiting to
complete without any error msg in stdout or in the log file.

On Mon, Sep 16, 2024 at 1:21 PM Eugen Block  wrote:


Hi,

I would focus on the MONs first. If they don't start, your cluster is
not usable. It doesn't look like you use cephadm, but please confirm.
Check if the nodes are running out of disk space, maybe that's why
they don't log anything and fail to start.


Zitat von Amudhan P :

> Hi,
>
> Recently added one disk in Ceph cluster using "ceph-volume lvm create
> --data /dev/sdX" but the new OSD didn't start. After some rest of the
other
> nodes OSD service also stopped. So, I restarted all nodes in the cluster
> now after restart.
> MON, MDS, MGR  and OSD services are not starting. Could find any new logs
> also after restart it is totally silent in all nodes.
> Could find some logs in Ceph-volume service.
>
>
> Error in Ceph-volume logs :-
> [2024-09-15 23:38:15,080][ceph_volume.process][INFO  ] stderr Running
> command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-5
> --> Executable selinuxenabled not in PATH:
> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-5
> Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph
prime-osd-dir
> --dev
>
/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e
> --path /var/lib/ceph/osd/ceph-5 --no-mon-config
>  stderr: failed to read label for
>
/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e:
> (2) No such file or directory
> 2024-09-15T23:38:15.059+0530 7fe7767c8100 -1
>
bluestore(/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e)
> _read_bdev_label failed to open
>
/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e:
> (2) No such file or directory
> -->  RuntimeError: command returned non-zero exit status: 1
> [2024-09-15 23:38:15,084][ceph_volume.process][INFO  ] stderr Running
> command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
> --> Executable selinuxenabled not in PATH:
> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
> Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph
prime-osd-dir
> --dev
>
/dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988
> --path /var/lib/ceph/osd/ceph-2 --no-mon-config
>  stderr: failed to read label for
>
/dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988:
> (2) No such file or directory
>
> But I could find "
>
/dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988"
> the path valid and listing folder.
>
> Not sure how to proceed or where to start any idea or suggestion ?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io






[ceph-users] Re: Ceph octopus version cluster not starting

2024-09-16 Thread Eugen Block

Hi,

I would focus on the MONs first. If they don't start, your cluster is  
not usable. It doesn't look like you use cephadm, but please confirm.  
Check if the nodes are running out of disk space, maybe that's why  
they don't log anything and fail to start.



Zitat von Amudhan P :


Hi,

Recently added one disk to the Ceph cluster using "ceph-volume lvm create
--data /dev/sdX", but the new OSD didn't start. After some time, the OSD
services on the other nodes also stopped, so I restarted all nodes in the
cluster.
Now, after the restart, MON, MDS, MGR and OSD services are not starting. I
couldn't find any new logs either; after the restart it is totally silent on all nodes.
I could only find some logs in the ceph-volume service.


Error in Ceph-volume logs :-
[2024-09-15 23:38:15,080][ceph_volume.process][INFO  ] stderr Running
command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-5
--> Executable selinuxenabled not in PATH:
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-5
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir
--dev
/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e
--path /var/lib/ceph/osd/ceph-5 --no-mon-config
 stderr: failed to read label for
/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e:
(2) No such file or directory
2024-09-15T23:38:15.059+0530 7fe7767c8100 -1
bluestore(/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e)
_read_bdev_label failed to open
/dev/ceph-33cd42cd-8570-47de-8703-d7cab1acf2ae/osd-block-21968433-bb53-4415-b9e2-fdc36bc4a28e:
(2) No such file or directory
-->  RuntimeError: command returned non-zero exit status: 1
[2024-09-15 23:38:15,084][ceph_volume.process][INFO  ] stderr Running
command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
--> Executable selinuxenabled not in PATH:
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir
--dev
/dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988
--path /var/lib/ceph/osd/ceph-2 --no-mon-config
 stderr: failed to read label for
/dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988:
(2) No such file or directory

But I can see that the path
"/dev/ceph-9a9b8328-66ad-4997-8b9f-5216b56b73e8/osd-block-ac2ae41d-3b77-4bfd-ba5c-737e4266e988"
is valid and the directory can be listed.

Not sure how to proceed or where to start; any ideas or suggestions?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: no mds services

2024-09-14 Thread Eugen Block

Hi,

I’d suggest to check the servers where the MDS‘s are supposed to be  
running on for a reason why the services stopped. Check daemon logs  
and the service status for hints pointing to a possible root cause.  
Try restarting the services and paste startup logs from a failure here  
if you need more advice.


Regards,
Eugen

Zitat von Ex Calibur :


Hello,

I'm following this guide to upgrade our cephs:
https://ainoniwa.net/pelican/2021-08-11a.html (Proxmox VE 6.4 Ceph upgrade
Nautilus to Octopus)
It's a requirement to upgrade our ProxMox environnement.

Now I've reached the point at that guide where i have to "Upgrade all
CephFS MDS daemons"

But before I started this piece, I checked the status.

root@pmnode1:~# ceph status
  cluster:
id: xxx
health: HEALTH_ERR
noout flag(s) set
1 scrub errors
Possible data damage: 1 pg inconsistent
2 pools have too many placement groups

  services:
mon: 3 daemons, quorum pmnode1,pmnode2,pmnode3 (age 19h)
mgr: pmnode2(active, since 19h), standbys: pmnode1
osd: 15 osds: 12 up (since 12h), 12 in (since 19h)
 flags noout

  data:
pools:   3 pools, 513 pgs
objects: 398.46k objects, 1.5 TiB
usage:   4.5 TiB used, 83 TiB / 87 TiB avail
pgs: 512 active+clean
 1   active+clean+inconsistent

root@pmnode1:~# ceph mds metadata
[]


as you can see there is no mds service running.

What can be wrong and how to solve this?

Thank you in advance.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: ceph-mgr perf throttle-msgr - what is caused fails?

2024-09-13 Thread Eugen Block
I remember having a prometheus issue quite some time ago, it couldn't  
handle 30 nodes or something, not really a big cluster. But we needed  
to increase the polling time. Have you tried increasing  
mgr/prometheus/scrape_interval to 30 seconds or so?
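
Something like this (a mgr module restart may be needed to pick it up):

ceph config set mgr mgr/prometheus/scrape_interval 30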


Zitat von Konstantin Shalygin :

As I said before, currently the Prometheus module performance
degradation is the only _visible_ issue. I mentioned things like this as an
indicator (of future problems)



k
Sent from my iPhone


On 12 Sep 2024, at 23:18, Eugen Block  wrote:

But did you notice any actual issues or did you just see that value  
being that high without any connection to an incident?



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-mgr perf throttle-msgr - what is caused fails?

2024-09-12 Thread Eugen Block
You’re right, in my case it was clear where it came from. But if  
there’s no spike visible, it’s probably going to be difficult to get  
to the bottom of it. But did you notice any actual issues or did you  
just see that value being that high without any connection to an  
incident?


Zitat von Konstantin Shalygin :


Hi, Eugene

Yes, I remember. But in that case, it was clear where/whence the  
problem was. In this case, it is completely unclear to me what  
caused the throttling, only suggestions. There was no sudden spike  
in load or significant change in cluster size. I think it slowly  
approached the limit. It remains to be seen what the limit is.


The visible impact is Prometheus module - the module does not have  
time to prepare data within 15 seconds (scrape interval).


This val is 'in flight'. In one second the val may be zero, or it may be
near the max. The idea that came to me now is to look at the msgr
debug, but I'm not sure that will help given the number of messages



k
Sent from my iPhone


On 8 Sep 2024, at 14:09, Eugen Block  wrote:

Hi,

I don't have an answer, but it reminds me of the issue we had this  
year on a customer cluster. I had created this tracker issue [0]  
where you were the only one yet to comment. Those observations  
might not be related, but do you see any impact on the cluster?

Also, in your output "val" is still smaller than "max":


 "val": 104856554,
 "max": 104857600,


So it probably doesn't have any visible impact, does it? But the  
values are not that far apart, maybe they burst sometime, leading  
to the fail_fail counter to increase? Do you have that monitored?


Thanks,
Eugen

[0] https://tracker.ceph.com/issues/66310

Zitat von Konstantin Shalygin :


Hi, it seems something in the mgr is being throttled because val > max. Am I right?

root@mon1# ceph daemon /var/run/ceph/ceph-mgr.mon1.asok perf dump  
| jq '."throttle-msgr_dispatch_throttler-mgr-0x55930f4aed20"'

{
 "val": 104856554,
 "max": 104857600,
 "get_started": 0,
 "get": 9700833,
 "get_sum": 654452218418,
 "get_or_fail_fail": 1323887918,
 "get_or_fail_success": 9700833,
 "take": 0,
 "take_sum": 0,
 "put": 9698716,
 "put_sum": 654347361864,
 "wait": {
   "avgcount": 0,
   "sum": 0,
   "avgtime": 0
 }
}

The question is: how to determine what exactly is being throttled? Every other
fail_fail in the perf counters is zero. The mgr is not in a container, and it has
resources to work with.



Thanks,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: Mon is unable to build mgr service

2024-09-09 Thread Eugen Block

Hi,

could you please be more specific which instructions you followed  
exactly? A step-by-step history of the used commands could help  
reproduce it. Which Ceph version are you trying to deploy, and on which
distro?
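
If it does turn out to be a caps problem: the manual deployment docs create the mgr keyring roughly like this (using the mgr name from your status output; adjust the keyring path to your setup):

ceph auth get-or-create mgr.mgr-n1 mon 'allow profile mgr' osd 'allow *' mds 'allow *' > /var/lib/ceph/mgr/ceph-mgr-n1/keyring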


Zitat von Jorge Ventura :


I am trying to configure ceph for the first time manually. I followed the
instructions and at this point I have only ceph-mon and ceph-mgr installed.

Here is my ceph status:

root@ceph-n1:~# ceph status

  cluster:
id: a93114e4-b0af-4b56-b019-0900310a14f8
health: HEALTH_WARN
OSD count 0 < osd_pool_default_size 3

  services:
mon: 1 daemons, quorum mon-n1 (age 10m)

    *mgr: mgr-n1 (active, since 9m)*
    osd: 0 osds: 0 up, 0 in

  data:
pools:   0 pools, 0 pgs
objects: 0 objects, 0 B
usage:   0 B used, 0 B / 0 B avail
pgs:



For a reason that I do not understand I am getting this message in syslog
every second:

 *Sep 05 21:05:18 ceph-n1 ceph-mon[1089]: 2024-09-05T21:05:18.636+

7ffae4fa6640 -1 mon.mon-n1@0(leader) e2 get_authorizer failed to build mgr
service session_auth_info (22) Invalid argument*



I don't know if this is something about caps and at this point I have no
idea what this problem is.
--
Ventura
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: Grafana dashboards is missing data

2024-09-09 Thread Eugen Block
Those two daemons are handled differently by cephadm, they're  
different classes (grafana is "class Monitoring(ContainerDaemonForm)"  
while ceph-exporter is "class CephExporter(ContainerDaemonForm)"),  
therefore they have different metadata etc., for example:


soc9-ceph:~ # jq '.ports'  
/var/lib/ceph/{FSID}/ceph-exporter.soc9-ceph/unit.meta

[]

soc9-ceph:~ # jq '.ports' /var/lib/ceph/{FSID}/grafana.soc9-ceph/unit.meta
[
  3000
]

But that's about all I can provide here. Maybe the host OS plays some  
role here as well, not sure.


Zitat von Sake Ceph :

We're using default :) I'm talking about the deployment  
configuration which is shown in the log files when deploying  
grafana/ceph-exporter.


I got the same configuration as you for ceph-exporter (the default)  
when exporting the service.


Kind regards,
Sake


Op 09-09-2024 12:04 CEST schreef Eugen Block :


Can you be more specific about "deploy configuration"? Do you have
your own spec files for grafana and ceph-exporter?
I just ran 'ceph orch apply ceph-exporter' and the resulting config is
this one:

# ceph orch ls ceph-exporter --export
service_type: ceph-exporter
service_name: ceph-exporter
placement:
   host_pattern: '*'
spec:
   prio_limit: 5
   stats_period: 5

Zitat von Sake Ceph :

> Hello Eugen,
>
> Well nothing about enabling port 9926.
>
> For example I see the following when deploying Grafana:
> 2024-09-05 14:27:30,969 7fd2e6583740 INFO firewalld ready
> 2024-09-05 14:27:31,334 7fd2e6583740 DEBUG /bin/firewall-cmd:  
stdout success

> 2024-09-05 14:27:31,350 7fd2e6583740 INFO firewalld ready
> 2024-09-05 14:27:31,593 7fd2e6583740 DEBUG Non-zero exit code 1 from
> /bin/firewall-cmd --permanent --query-port 3000/tcp
> 2024-09-05 14:27:31,594 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout no
> 2024-09-05 14:27:31,594 7fd2e6583740 INFO Enabling firewalld port
> 3000/tcp in current zone...
> 2024-09-05 14:27:31,832 7fd2e6583740 DEBUG /bin/firewall-cmd:  
stdout success
> 2024-09-05 14:27:32,212 7fd2e6583740 DEBUG /bin/firewall-cmd:  
stdout success

>
> But only the following when deploying ceph-exporter:
> 2024-09-05 12:17:48,897 7f3d7cc0e740 INFO firewalld ready
> 2024-09-05 12:17:49,269 7f3d7cc0e740 DEBUG /bin/firewall-cmd:  
stdout success

>
> When looking in the deploy configuration, Grafana shows 'ports':
> [3000], but ceph-exporter shows 'ports': []
>
> Kind regards,
> Sake
>
>> Op 09-09-2024 10:50 CEST schreef Eugen Block :
>>
>>
>> Sorry, clicked "send" too soon. In a test cluster, cephadm.log shows
>> that it would try to open a port if a firewall was enabled:
>>
>> 2024-09-09 10:48:59,686 7f3142b11740 DEBUG firewalld.service is  
not enabled

>> 2024-09-09 10:48:59,686 7f3142b11740 DEBUG Not possible to enable
>> service . firewalld.service is not available
>>
>> Zitat von Eugen Block :
>>
>> > Do you see anything in the cephadm.log related to the firewall?
>> >
>> > Zitat von Sake Ceph :
>> >
>> >> After opening port 9926 manually, the Grafana dashboards show  
the data.

>> >> So is this a bug?
>> >>
>> >> Kind regards,
>> >> Sake
>> >>> Op 06-09-2024 17:39 CEST schreef Sake Ceph :
>> >>>
>> >>>
>> >>> That is working, but I noticed the firewall isn't opened for that
>> >>> port. Shouldn't cephadm manage this, like it does for all the
>> >>> other ports?
>> >>>
>> >>> Kind regards,
>> >>> Sake
>> >>>
>> >>>> Op 06-09-2024 16:14 CEST schreef Björn Lässig
>> :
>> >>>>
>> >>>>
>> >>>> Am Mittwoch, dem 04.09.2024 um 20:01 +0200 schrieb Sake Ceph:
>> >>>> > After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are
>> empty. For
>> >>>> > example the Osd latency under OSD device details or the  
Osd Overview

>> >>>> > has a lot of No data messages.
>> >>>> >
>> >>>>
>> >>>> is the ceph-exporter listening on port 9926 (on every host)?
>> >>>>
>> >>>>   ss -tlpn sport 9926
>> >>>>
>> >>>> Can you connect via browser?
>> >>>>
>> >>>>   curl localhost:9926/metrics
>> >>>>
>> >>>> > I deployed ceph-exporter on all hosts, am I missing something? Did
>> >>>> > even a redeploy of prometheus.
>> >>>>
>> >>>> there is a bug where this exporter does not listen on IPv6.
>> >>>>
>> >>>> greetings
>> >>>> Björn
>> >>>> ___
>> >>>> ceph-users mailing list -- ceph-users@ceph.io
>> >>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> >>> ___
>> >>> ceph-users mailing list -- ceph-users@ceph.io
>> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> >> ___
>> >> ceph-users mailing list -- ceph-users@ceph.io
>> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Grafana dashboards is missing data

2024-09-09 Thread Eugen Block
Can you be more specific about "deploy configuration"? Do you have  
your own spec files for grafana and ceph-exporter?
I just ran 'ceph orch apply ceph-exporter' and the resulting config is  
this one:


# ceph orch ls ceph-exporter --export
service_type: ceph-exporter
service_name: ceph-exporter
placement:
  host_pattern: '*'
spec:
  prio_limit: 5
  stats_period: 5

Zitat von Sake Ceph :


Hello Eugen,

Well nothing about enabling port 9926.

For example I see the following when deploying Grafana:
2024-09-05 14:27:30,969 7fd2e6583740 INFO firewalld ready
2024-09-05 14:27:31,334 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout success
2024-09-05 14:27:31,350 7fd2e6583740 INFO firewalld ready
2024-09-05 14:27:31,593 7fd2e6583740 DEBUG Non-zero exit code 1 from  
/bin/firewall-cmd --permanent --query-port 3000/tcp

2024-09-05 14:27:31,594 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout no
2024-09-05 14:27:31,594 7fd2e6583740 INFO Enabling firewalld port  
3000/tcp in current zone...

2024-09-05 14:27:31,832 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout success
2024-09-05 14:27:32,212 7fd2e6583740 DEBUG /bin/firewall-cmd: stdout success

But only the following when deploying ceph-exporter:
2024-09-05 12:17:48,897 7f3d7cc0e740 INFO firewalld ready
2024-09-05 12:17:49,269 7f3d7cc0e740 DEBUG /bin/firewall-cmd: stdout success

When looking in the deploy configuration, Grafana shows 'ports':  
[3000], but ceph-exporter shows 'ports': []


Kind regards,
Sake


Op 09-09-2024 10:50 CEST schreef Eugen Block :


Sorry, clicked "send" too soon. In a test cluster, cephadm.log shows
that it would try to open a port if a firewall was enabled:

2024-09-09 10:48:59,686 7f3142b11740 DEBUG firewalld.service is not enabled
2024-09-09 10:48:59,686 7f3142b11740 DEBUG Not possible to enable
service . firewalld.service is not available

Zitat von Eugen Block :

> Do you see anything in the cephadm.log related to the firewall?
>
> Zitat von Sake Ceph :
>
>> After opening port 9926 manually, the Grafana dashboards show the data.
>> So is this a bug?
>>
>> Kind regards,
>> Sake
>>> Op 06-09-2024 17:39 CEST schreef Sake Ceph :
>>>
>>>
>>> That is working, but I noticed the firewall isn't opened for that
>>> port. Shouldn't cephadm manage this, like it does for all the
>>> other ports?
>>>
>>> Kind regards,
>>> Sake
>>>
>>>> Op 06-09-2024 16:14 CEST schreef Björn Lässig  
:

>>>>
>>>>
>>>> Am Mittwoch, dem 04.09.2024 um 20:01 +0200 schrieb Sake Ceph:
>>>> > After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are  
empty. For

>>>> > example the Osd latency under OSD device details or the Osd Overview
>>>> > has a lot of No data messages.
>>>> >
>>>>
>>>> is the ceph-exporter listening on port 9926 (on every host)?
>>>>
>>>>   ss -tlpn sport 9926
>>>>
>>>> Can you connect via browser?
>>>>
>>>>   curl localhost:9926/metrics
>>>>
>>>> > I deployed ceph-exporter on all hosts, am I missing something? Did
>>>> > even a redeploy of prometheus.
>>>>
>>>> there is a bug where this exporter does not listen on IPv6.
>>>>
>>>> greetings
>>>> Björn
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: Grafana dashboards is missing data

2024-09-09 Thread Eugen Block
Sorry, clicked "send" too soon. In a test cluster, cephadm.log shows  
that it would try to open a port if a firewall was enabled:


2024-09-09 10:48:59,686 7f3142b11740 DEBUG firewalld.service is not enabled
2024-09-09 10:48:59,686 7f3142b11740 DEBUG Not possible to enable  
service . firewalld.service is not available


Zitat von Eugen Block :


Do you see anything in the cephadm.log related to the firewall?

Zitat von Sake Ceph :


After opening port 9926 manually, the Grafana dashboards show the data.
So is this a bug?

Kind regards,
Sake

Op 06-09-2024 17:39 CEST schreef Sake Ceph :


That is working, but I noticed the firewall isn't opened for that  
port. Shouldn't cephadm manage this, like it does for all the  
other ports?


Kind regards,
Sake


Op 06-09-2024 16:14 CEST schreef Björn Lässig :


Am Mittwoch, dem 04.09.2024 um 20:01 +0200 schrieb Sake Ceph:
> After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are empty. For
> example the Osd latency under OSD device details or the Osd Overview
> has a lot of No data messages.
>

is the ceph-exporter listening on port 9926 (on every host)?

  ss -tlpn sport 9926

Can you connect via browser?

  curl localhost:9926/metrics

> I deployed ceph-exporter on all hosts, am I missing something? Did
> even a redeploy of prometheus.

there is a bug where this exporter does not listen on IPv6.

greetings
Björn
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Grafana dashboards is missing data

2024-09-09 Thread Eugen Block

Do you see anything in the cephadm.log related to the firewall?

Zitat von Sake Ceph :


After opening port 9926 manually, the Grafana dashboards show the data.
So is this a bug?

Kind regards,
Sake

Op 06-09-2024 17:39 CEST schreef Sake Ceph :


That is working, but I noticed the firewall isn't opened for that  
port. Shouldn't cephadm manage this, like it does for all the other  
ports?


Kind regards,
Sake

> Op 06-09-2024 16:14 CEST schreef Björn Lässig :
>
>
> Am Mittwoch, dem 04.09.2024 um 20:01 +0200 schrieb Sake Ceph:
> > After the upgrade from 17.2.7 to 18.2.4 a lot of graphs are empty. For
> > example the Osd latency under OSD device details or the Osd Overview
> > has a lot of No data messages.
> >
>
> is the ceph-exporter listening on port 9926 (on every host)?
>
>   ss -tlpn sport 9926
>
> Can you connect via browser?
>
>   curl localhost:9926/metrics
>
> > I deployed ceph-exporter on all hosts, am I missing something? Did
> > even a redeploy of prometheus.
>
> there is a bug where this exporter does not listen on IPv6.
>
> greetings
> Björn
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-mgr perf throttle-msgr - what is caused fails?

2024-09-08 Thread Eugen Block

Hi,

I don't have an answer, but it reminds me of the issue we had this  
year on a customer cluster. I had created this tracker issue [0] where  
you were the only one yet to comment. Those observations might not be  
related, but do you see any impact on the cluster?

Also, in your output "val" is still smaller than "max":


  "val": 104856554,
  "max": 104857600,


So it probably doesn't have any visible impact, does it? But the  
values are not that far apart, maybe they burst sometime, leading to  
the fail_fail counter to increase? Do you have that monitored?
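
If not, a crude way to watch it would be something like this (socket path and throttle name taken from your dump; note that the address suffix in the throttle name changes when the mgr restarts):

while sleep 10; do
  ceph daemon /var/run/ceph/ceph-mgr.mon1.asok perf dump | \
    jq '."throttle-msgr_dispatch_throttler-mgr-0x55930f4aed20".get_or_fail_fail'
done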


Thanks,
Eugen

[0] https://tracker.ceph.com/issues/66310

Zitat von Konstantin Shalygin :


Hi, it seems something in the mgr is being throttled because val > max. Am I right?

root@mon1# ceph daemon /var/run/ceph/ceph-mgr.mon1.asok perf dump |  
jq '."throttle-msgr_dispatch_throttler-mgr-0x55930f4aed20"'

{
  "val": 104856554,
  "max": 104857600,
  "get_started": 0,
  "get": 9700833,
  "get_sum": 654452218418,
  "get_or_fail_fail": 1323887918,
  "get_or_fail_success": 9700833,
  "take": 0,
  "take_sum": 0,
  "put": 9698716,
  "put_sum": 654347361864,
  "wait": {
"avgcount": 0,
"sum": 0,
"avgtime": 0
  }
}

The question is: how to determine what exactly is being throttled? Every other
fail_fail in the perf counters is zero. The mgr is not in a container, and it has
resources to work with.



Thanks,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] PGs not deep-scrubbed in time

2024-09-07 Thread Eugen Block

Hi,

I finally managed to take the time to get to the bottom of this  
infamous health warning. I decided to write it up in a blog post [0]  
and also contacted Zac to improve the documentation. The short version  
is:


If you want to change the config setting for osd_deep_scrub_interval  
in general, change it either globally or for both MGR *and* OSD  
services. Or configure individual intervals per pool. If you don't  
change it for the MGR, you'll still get the warning because the MGR  
will still compare the last deep-scrub timestamp with the default  
deep-scrub interval (1 week).
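
For example, to raise the interval to two weeks (value in seconds, just an example) either globally:

ceph config set global osd_deep_scrub_interval 1209600

or for the two services involved:

ceph config set osd osd_deep_scrub_interval 1209600
ceph config set mgr osd_deep_scrub_interval 1209600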


Comments are welcome, of course!

Regards,
Eugen

[0]  
https://heiterbiswolkig.blogs.nde.ag/2024/09/06/pgs-not-deep-scrubbed-in-time/


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Somehow throotle recovery even further than basic options?

2024-09-06 Thread Eugen Block
I can’t say anything about the pgremapper, but have you tried  
increasing the crush weight gradually? Add new OSDs with crush initial  
weight 0 and then increase it in small steps. I haven’t used that  
approach for years, but maybe that can help here. Or are all OSDs  
already up and in? Or you could reduce the max misplaced ratio to 1%  
or even lower (default is 5%)?
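
A sketch of both knobs (OSD id and values are just examples):

ceph config set osd osd_crush_initial_weight 0       # new OSDs start with weight 0
ceph osd crush reweight osd.<id> 0.5                 # then raise it in small steps
ceph config set mgr target_max_misplaced_ratio 0.01  # the "max misplaced ratio" mentioned above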


Zitat von "Szabo, Istvan (Agoda)" :


Forgot to paste, somehow I want to reduce this recovery operation:
recovery: 0 B/s, 941.90k keys/s, 188 objects/s
To 2-300Keys/sec




From: Szabo, Istvan (Agoda) 
Sent: Friday, September 6, 2024 11:18 PM
To: Ceph Users 
Subject: [ceph-users] Somehow throotle recovery even further than  
basic options?


Hi,

Four years ago we created our cluster on Octopus with 4 OSDs per disk (SSDs and
NVMe disks).
The 15TB SSDs are still working properly with 4 OSDs each, but the small 1.8T
NVMes holding the index pool are not.
Each new NVMe OSD added to the existing nodes generates slow ops, even
with scrub off, recovery_op_priority 1, and backfill/recovery set to 1.
I even turned off all the heavy index pool sync mechanisms, but the read
latency is still high, which means recovery ops push it even higher.


I'm trying to somehow add resources to the cluster to spread out the 2048
index pool PGs (with replica 3 that means 6144 PG instances), but I can't make
the process more gentle.


The balancer is working in upmap with max deviation 1.

I have this script from DigitalOcean,
https://github.com/digitalocean/pgremapper, has anybody tried
it before, how is it, and could it actually help here?


Thank you the ideas.


This message is confidential and is for the sole use of the intended  
recipient(s). It may also be privileged or otherwise protected by  
copyright or other legal rules. If you have received it by mistake  
please let us know by reply email and delete it from your system. It  
is prohibited to copy this message or disclose its content to  
anyone. Any confidentiality or privilege is not waived or lost by  
any mistaken delivery or unauthorized disclosure of the message. All  
messages sent to and from Agoda may be monitored to ensure  
compliance with company policies, to protect the company's interests  
and to remove potential malware. Electronic messages may be  
intercepted, amended, lost or deleted, or contain viruses.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v19.1.1 Squid RC1 released

2024-09-05 Thread Eugen Block
Turns out, that cluster didn't have new snapshots enabled, so the  
tracker issue is invalid. Although I'd like to point out that the  
error messages in the dashboard and cli could be improved, they don't  
really give any clue why it's failing. The actual root cause was  
hidden in debug level 5:


ceph-mds[67856]: mds.0.server new snapshots are disabled for this fs

Zitat von Eugen Block :


Hi,

since you pointed out the CephFS features, I wanted to raise some  
awareness towards snapshot schedulung/creating before releasing  
19.2.0:


https://tracker.ceph.com/issues/67790

I tried 19.1.1 and am failing to create snapshots:

ceph01:~ # ceph fs subvolume snapshot create cephfs subvol1 test-snap1
Error EPERM: error in mkdir /volumes/_nogroup/subvol1/.snap/test-snap1

This works in Reef.

Thanks,
Eugen

Zitat von Yuri Weinstein :


This is the second release candidate for Squid.

Feature highlights:

RGW
Fixed a regression in bucket ownership for Keystone users and  
implicit tenants.

The User Accounts feature unlocks several new AWS-compatible IAM APIs
 for the self-service management of users, keys, groups, roles, policy and
 more.

RADOS
BlueStore has been optimized for better performance in
snapshot-intensive workloads.
BlueStore RocksDB LZ4 compression is now enabled by default to improve
average performance
and "fast device" space usage. Other improvements include more
flexible EC configurations,
an OpTracker to help debug mgr module issues, and better scrub scheduling.

Dashboard
* Rearranged Navigation Layout: The navigation layout has been reorganized
 for improved usability and easier access to key features.

CephFS Improvements
 * Support for managing CephFS snapshots and clones, as well as
snapshot schedule
   management
 * Manage authorization capabilities for CephFS resources
 * Helpers on mounting a CephFS volume

RGW Improvements
 * Support for managing bucket policies
 * Add/Remove bucket tags
 * ACL Management
 * Several UI/UX Improvements to the bucket form
Monitoring: Grafana dashboards are now loaded into the container at
runtime rather than
 building a grafana image with the grafana dashboards. Official Ceph
grafana images
 can be found in quay.io/ceph/grafana
* Monitoring: RGW S3 Analytics: A new Grafana dashboard is now
available, enabling you to
 visualize per bucket and user analytics data, including total GETs,
PUTs, Deletes,
 Copies, and list metrics.

Crimson/Seastore
Crimson's first tech preview release!
Supporting RBD workloads on Replicated pools.
For more information please visit: https://ceph.io/en/news/crimson

If any of our community members would like to help us with performance
investigations or regression testing of the Squid release candidate,
please feel free to provide feedback via email or in
https://pad.ceph.com/p/squid_scale_testing. For more active
discussions, please use the #ceph-at-scale slack channel in
ceph-storage.slack.com.

* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-19.1.1.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/en/latest/install/get-packages/
* Release git sha1: 1d9f35852eef16b81614e38a05cf88b505cc142b
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: Discovery (port 8765) service not starting

2024-09-05 Thread Eugen Block

Hi,

regarding the scraping endpoints, I wonder if it would make sense to  
handle it the same way as with the dashboard redirect:


ceph config get mgr mgr/dashboard/standby_behaviour
redirect

If you try to access the dashboard via one of the standby MGRs, you're  
redirected to the active one.



Zitat von Matthew Vernon :


Hi,

I tracked it down to 2 issues:

* our ipv6-only deployment (a bug fixed in 18.2.4, though that has  
buggy .debs)


* Discovery service is only run on the active mgr

The latter point is surely a bug? Isn't the point of running a  
service discovery endpoint that one could point e.g. an external  
Prometheus scraper at the service discovery endpoint of any mgr and  
it would then tell Prometheus where to scrape metrics from (i.e. the  
active mgr)?
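
For anyone following along, the endpoint I mean is the mgr service discovery
API on port 8765; from memory the URL looks roughly like the following, but
the exact path may differ between versions:

curl -sk 'https://<mgr-host>:8765/sd/prometheus/sd-config?service=mgr-prometheus'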


Thanks,

Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-09-04 Thread Eugen Block
Another update: Giovanna agreed to switch back to mclock_scheduler and  
adjust osd_snap_trim_cost to 400K. It looks very promising, after a  
few hours the snaptrim queue was processed.


@Sridhar: thanks a lot for your valuable input!

Zitat von Eugen Block :

Quick update: we decided to switch to wpq to see if that would  
confirm our suspicion, and it did. After a few hours all PGs in the  
snaptrim queue had been processed. We haven't looked into the  
average object sizes yet, maybe we'll try that approach next week or  
so. If you have any other ideas, let us know.


Zitat von Eugen Block :


Hi,

as expected the issue is not resolved and turned up again a couple  
of hours later. Here's the tracker issue:


https://tracker.ceph.com/issues/67702

I also attached a log snippet from one osd with debug_osd 10 to the  
tracker. Let me know if you need anything else, I'll stay in touch  
with Giovanna.


Thanks!
Eugen

Zitat von Sridhar Seshasayee :


Hi Eugen,

On Fri, Aug 23, 2024 at 1:37 PM Eugen Block  wrote:


Hi again,

I have a couple of questions about this.
What exactly happened to the PGs? They were queued for snaptrimming,
but we didn't see any progress. Let's assume the average object size
in that pool was around 2 MB (I don't have the actual numbers). Does
that mean if osd_snap_trim_cost (1M default) was too low, those too
large objects weren't trimmed? And then we split the PGs, reducing the
average object size to 1 MB, these objects could be trimmed then,
obviously. Does this explanation make sense?



If you have the OSD logs, I can take a look and see why the snaptrim ops
did not make progress. The cost is one contributing factor in the position
of the op in the queue. Therefore, even if the cost incorrectly
represents the actual average size of the objects in the PG, the op should
still be scheduled based on the set cost and the profile allocations.

The OSDs appear to be NVMe based is what I understand from the
thread. Based on the actions taken to resolve the situation (increased
pg_num to 64), I think something else was up on the cluster. For NVMe
based cluster, the current cost shouldn't cause stalling of the snaptrim
ops. I'd suggest raising an upstream tracker with your observation and
OSD logs to investigate this further.




I just browsed through the changes, if I understand the fix correctly,
the average object size is now calculated automatically, right? Which
makes a lot of sense to me, as an operator I don't want to care too
much about the average object sizes since ceph should know them better
than me. ;-)



Yes, that's correct. This fix was part of the effort to incrementally
include
background OSD operations to be scheduled by mClock.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue Replacing OSD with cephadm: Partition Path Not Accepted

2024-09-04 Thread Eugen Block

Hi,

apparently, I was wrong about specifying a partition in the path  
option of the spec file. In my quick test it doesn't work either.  
Creating a PV, VG, LV on that partition makes it work:


ceph orch daemon add osd soc9-ceph:data_devices=ceph-manual-vg/ceph-osd
Created osd(s) 3 on host 'soc9-ceph'
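
For completeness, the LV used above was created beforehand with nothing more than the usual LVM commands (the partition name is just an example from my test setup):

pvcreate /dev/vdb1
vgcreate ceph-manual-vg /dev/vdb1
lvcreate -l 100%FREE -n ceph-osd ceph-manual-vg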

But if you want easy cluster management, especially for OSDs, I'd
recommend using entire (raw) devices. It really makes (almost)
everything easier. Just recently a customer was quite pleased with the
process compared to prior Ceph versions. They removed a failed disk
and only had to run 'ceph orch osd rm <osd_id> --replace', and Ceph did
the rest (given the existing service specs cover it). They literally
said: "wow, this changes our whole impression of Ceph". They didn't
have many disk replacements in the past 5 years; this was the first
one since they adopted the cluster with cephadm.


Fiddling with partitions seems quite unnecessary, especially on larger  
deployments.


Regards,
Eugen

Zitat von Herbert Faleiros :


On 03/09/2024 03:35, Robert Sander wrote:

Hi,


Hello,


On 9/2/24 20:24, Herbert Faleiros wrote:


/usr/bin/docker: stderr ceph-volume lvm batch: error: /dev/sdb1 is a
partition, please pass LVs or raw block devices


A Ceph OSD nowadays needs a logical volume because it stores  
crucial metadata in the LV tags. This helps to activate the OSD.
IMHO you will have to redeploy the OSD to use LVM on the disk. It  
does not need to be the whole disk if there is other data on it. It  
should be sufficient to make /dev/sdb1 a PV of a new VG for the LV  
of the OSD.


thank you for the suggestion. I understand the need for Ceph OSDs to  
use LVM due to the metadata stored in LV tags. However, I’m facing a  
challenge with the disk replacement process. Since I’ve already  
migrated the OSDs to use ceph-volume, I was hoping that cephadm
would handle the creation of LVM structures automatically.
Unfortunately, it doesn’t seem to recreate these structures on its
own when replacing a disk, and manually creating them isn’t ideal
because ceph-volume uses its own specific naming conventions.


Do you have any recommendations on how to proceed with cephadm in
a way that it can handle the LVM setup automatically, or perhaps
another method that aligns with the conventions used by ceph-volume?


--

Herbert



Regards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: R: Re: CephFS troubleshooting

2024-09-04 Thread Eugen Block
Has it worked before or did it just stop working at some point? What's
the exact command that fails (and the error message, if there is one)?


For the "too many PGs per OSD" I suppose I have to add some other  
OSDs, right?


Either that or reduce the number of PGs. If you had only a few pools
I'd suggest leaving it to the autoscaler, but not for 13 pools. You
can paste 'ceph osd df' and 'ceph osd pool ls detail' if you need more
input for that.
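
Reducing basically boils down to something like this (pool name and target count are placeholders; pg_num can be decreased online these days, the PG merging just takes a while):

ceph osd pool set <pool> pg_num 128

or, per pool, handing it back to the autoscaler:

ceph osd pool set <pool> pg_autoscale_mode on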


Zitat von Eugenio Tampieri :


Hi Eugen,
Sorry, but I had some trouble when I signed up and then I was away  
so I missed your reply.



ceph auth export client.migration
[client.migration]
key = redacted
caps mds = "allow rw fsname=repo"
caps mon = "allow r fsname=repo"
caps osd = "allow rw tag cephfs data=repo"


For the "too many PGs per OSD" I suppose I have to add some other  
OSDs, right?


Thanks,

Eugenio

-Messaggio originale-
Da: Eugen Block 
Inviato: mercoledì 4 settembre 2024 10:07
A: ceph-users@ceph.io
Oggetto: [ceph-users] Re: CephFS troubleshooting

Hi, I already responded to your first attempt:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/GS7KJRJP7BAOF66KJM255G27TJ4KG656/

Please provide the requested details.


Zitat von Eugenio Tampieri :


Hello,
I'm writing to troubleshoot an otherwise functional Ceph quincy
cluster that has issues with cephfs.
I cannot mount it with ceph-fuse (it gets stuck), and if I mount it
with NFS I can list the directories but I cannot read or write
anything.
Here's the output of ceph -s
  cluster:
id: 3b92e270-1dd6-11ee-a738-000c2937f0ec
health: HEALTH_WARN
mon ceph-storage-a is low on available space
1 daemons have recently crashed
too many PGs per OSD (328 > max 250)

  services:
mon:5 daemons, quorum
ceph-mon-a,ceph-storage-a,ceph-mon-b,ceph-storage-c,ceph-storage-d
(age 105m)
mgr:ceph-storage-a.ioenwq(active, since 106m), standbys:
ceph-mon-a.tiosea
mds:1/1 daemons up, 2 standby
osd:4 osds: 4 up (since 104m), 4 in (since 24h)
rbd-mirror: 2 daemons active (2 hosts)
rgw:2 daemons active (2 hosts, 1 zones)

  data:
volumes: 1/1 healthy
pools:   13 pools, 481 pgs
objects: 231.83k objects, 648 GiB
usage:   1.3 TiB used, 1.8 TiB / 3.1 TiB avail
pgs: 481 active+clean

  io:
client:   1.5 KiB/s rd, 8.6 KiB/s wr, 1 op/s rd, 0 op/s wr
Best regards,

Eugenio Tampieri
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an  
email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS troubleshooting

2024-09-04 Thread Eugen Block

Hi, I already responded to your first attempt:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/GS7KJRJP7BAOF66KJM255G27TJ4KG656/

Please provide the requested details.


Zitat von Eugenio Tampieri :


Hello,
I'm writing to troubleshoot an otherwise functional Ceph quincy  
cluster that has issues with cephfs.
I cannot mount it with ceph-fuse (it gets stuck), and if I mount it  
with NFS I can list the directories but I cannot read or write  
anything.

Here's the output of ceph -s
  cluster:
id: 3b92e270-1dd6-11ee-a738-000c2937f0ec
health: HEALTH_WARN
mon ceph-storage-a is low on available space
1 daemons have recently crashed
too many PGs per OSD (328 > max 250)

  services:
mon:5 daemons, quorum  
ceph-mon-a,ceph-storage-a,ceph-mon-b,ceph-storage-c,ceph-storage-d  
(age 105m)
mgr:ceph-storage-a.ioenwq(active, since 106m), standbys:  
ceph-mon-a.tiosea

mds:1/1 daemons up, 2 standby
osd:4 osds: 4 up (since 104m), 4 in (since 24h)
rbd-mirror: 2 daemons active (2 hosts)
rgw:2 daemons active (2 hosts, 1 zones)

  data:
volumes: 1/1 healthy
pools:   13 pools, 481 pgs
objects: 231.83k objects, 648 GiB
usage:   1.3 TiB used, 1.8 TiB / 3.1 TiB avail
pgs: 481 active+clean

  io:
client:   1.5 KiB/s rd, 8.6 KiB/s wr, 1 op/s rd, 0 op/s wr
Best regards,

Eugenio Tampieri
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v19.1.1 Squid RC1 released

2024-09-03 Thread Eugen Block

Hi,

since you pointed out the CephFS features, I wanted to raise some  
awareness towards snapshot schedulung/creating before releasing 19.2.0:


https://tracker.ceph.com/issues/67790

I tried 19.1.1 and am failing to create snapshots:

ceph01:~ # ceph fs subvolume snapshot create cephfs subvol1 test-snap1
Error EPERM: error in mkdir /volumes/_nogroup/subvol1/.snap/test-snap1

This works in Reef.

Thanks,
Eugen

Zitat von Yuri Weinstein :


This is the second release candidate for Squid.

Feature highlights:

RGW
Fixed a regression in bucket ownership for Keystone users and  
implicit tenants.

The User Accounts feature unlocks several new AWS-compatible IAM APIs
  for the self-service management of users, keys, groups, roles, policy and
  more.

RADOS
BlueStore has been optimized for better performance in
snapshot-intensive workloads.
BlueStore RocksDB LZ4 compression is now enabled by default to improve
average performance
and "fast device" space usage. Other improvements include more
flexible EC configurations,
an OpTracker to help debug mgr module issues, and better scrub scheduling.

Dashboard
* Rearranged Navigation Layout: The navigation layout has been reorganized
  for improved usability and easier access to key features.

CephFS Improvements
  * Support for managing CephFS snapshots and clones, as well as
snapshot schedule
management
  * Manage authorization capabilities for CephFS resources
  * Helpers on mounting a CephFS volume

RGW Improvements
  * Support for managing bucket policies
  * Add/Remove bucket tags
  * ACL Management
  * Several UI/UX Improvements to the bucket form
Monitoring: Grafana dashboards are now loaded into the container at
runtime rather than
  building a grafana image with the grafana dashboards. Official Ceph
grafana images
  can be found in quay.io/ceph/grafana
* Monitoring: RGW S3 Analytics: A new Grafana dashboard is now
available, enabling you to
  visualize per bucket and user analytics data, including total GETs,
PUTs, Deletes,
  Copies, and list metrics.

Crimson/Seastore
Crimson's first tech preview release!
Supporting RBD workloads on Replicated pools.
For more information please visit: https://ceph.io/en/news/crimson

If any of our community members would like to help us with performance
investigations or regression testing of the Squid release candidate,
please feel free to provide feedback via email or in
https://pad.ceph.com/p/squid_scale_testing. For more active
discussions, please use the #ceph-at-scale slack channel in
ceph-storage.slack.com.

* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-19.1.1.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/en/latest/install/get-packages/
* Release git sha1: 1d9f35852eef16b81614e38a05cf88b505cc142b
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Discovery (port 8765) service not starting

2024-09-03 Thread Eugen Block
Oh that's interesting :-D I have no explanation for that, except maybe  
some flaw in your custom images? Or in the service specs? Not sure, to  
be honest...


Zitat von Matthew Vernon :


Hi,

On 03/09/2024 11:46, Eugen Block wrote:

Do you see the port definition in the unit.meta file?


Oddly:

"ports": [
9283,
8765,
8765,
8765,
8765
],

which doesn't look right...

Regards,

Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Discovery (port 8765) service not starting

2024-09-03 Thread Eugen Block

Do you see the port definition in the unit.meta file?

jq -r '.ports' /var/lib/ceph/{FSID}/mgr.{MGR}/unit.meta
[
  8443,
  9283,
  8765
]


Zitat von Matthew Vernon :


Hi,

On 02/09/2024 21:24, Eugen Block wrote:
Without having looked too closely, do you run ceph with IPv6?  
There’s a tracker issue:


https://tracker.ceph.com/issues/66426

It will be backported to Reef.


I do run IPv6, but the problem is that nothing is listening on port  
8765 at all, not that it's only doing so on v4. If I do lsof -p [pid  
of mgr] | grep LISTEN I just get


/var/run/ceph/ceph-mgr.moss-be1001.eshmpf.asok type=STREAM (LISTEN)
TCP *:9283 (LISTEN)

i.e. the management socket, and the prometheus endpoint itself.

AFAICT that tracker issue describes a v6-enabled deployment where  
it's only listening on v4.


Thanks,

Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue Replacing OSD with cephadm: Partition Path Not Accepted

2024-09-02 Thread Eugen Block
I would try it with a spec file that contains a path to the partition  
(limit the placement to that host only). Or have you tried it already?  
I don’t use partitions for ceph, but there have been threads from  
other users who use partitions and with spec files it seemed to work.

You can generate a preview with ‚ceph orch apply -i osd-spec.yaml --dry-run‘.
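
Just as a sketch of what I mean (host and device names taken from your mail, the service_id is arbitrary; whether cephadm accepts the partition paths here is exactly what the dry-run should tell you):

service_type: osd
service_id: osd-nodeXXX-sdb1
placement:
  hosts:
    - osd-nodeXXX
spec:
  data_devices:
    paths:
      - /dev/sdb1
  db_devices:
    paths:
      - /dev/sda3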

Zitat von Herbert Faleiros :


I am on a journey, so far successful, to update our clusters to supported
versions. I started with Luminous and Ubuntu 16.04, and now we are on Reef
with Ubuntu 20.04. We still have more updates to do, but at the moment, I
encountered an issue with an OSD, and it was necessary to replace a disk.
Since the cluster was adopted, I'm not entirely sure what the best way to
replace this OSD is, as with cephadm, it doesn't like when the path for the
device is a partition. I could recreate the OSD using traditional methods
and then adopt the OSD, but that doesn't seem like the best approach. Does
anyone know how I should proceed to recreate this OSD? I had the same
problem in my lab, where I am already on Quincy.

What I am trying to do is:

# ceph orch osd rm 6 --replace --zap
# ceph orch daemon add osd
osd-nodeXXX:data_devices=/dev/sdb1,db_devices=/dev/sda3

The error it gives is:

/usr/bin/docker: stderr ceph-volume lvm batch: error: /dev/sdb1 is a
partition, please pass LVs or raw block devices
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Discovery (port 8765) service not starting

2024-09-02 Thread Eugen Block
Without having looked too closely, do you run ceph with IPv6? There’s  
a tracker issue:


https://tracker.ceph.com/issues/66426

It will be backported to Reef.

Zitat von Matthew Vernon :


Hi,

I'm running reef, with locally-built containers based on upstream .debs.

I've now enabled prometheus metrics thus:
ceph mgr module enable prometheus

And that seems to have worked (the active mgr is listening on port  
9283); but per the docs[0] there should also be a service discovery  
endpoint (on port 8765) that I can point Prometheus' http_sd_config  
at.


But there is in fact no listener on that port, nor anything obvious  
in logs if I failover the mgr - see a log extract here:


https://phabricator.wikimedia.org/P68521

Is there something else I need to be doing to get the service  
discovery endpoint working?


Thanks,

Matthew

[0]  
https://docs.ceph.com/en/reef/cephadm/services/monitoring/#deploying-monitoring-without-cephadm

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS cache always increasing

2024-09-02 Thread Eugen Block
Can you tell if the number of objects increases in your cephfs between  
those bursts? I noticed something similar in a 16.2.15 cluster as  
well. It's not that heavily used, but it contains home directories and  
development working directories etc. And when one user checked out a  
git project, the mds memory usage increased a lot, getting near its  
configured limit. Before, there were around 3.7 million objects in the
cephfs; that user added more than a million additional files with his
checkout. It wasn't a real issue (yet) because the usage isn't very
dynamic and the total number of files is relatively stable.
This doesn't really help resolve anything, but if your total number of  
files grows, I'm not surprised that the mds requires more memory.
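
If you want to see where the memory actually goes, comparing the configured limit with what the MDS itself reports is a good start; just a sketch (the daemon name is a placeholder, the second command runs on the MDS host):

ceph config get mds mds_cache_memory_limit
ceph daemon mds.<name> cache status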


Zitat von Alexander Patrakov :


As a workaround, to reduce the impact of the MDS slowed down by
excessive memory consumption, I would suggest installing earlyoom,
disabling swap, and configuring earlyoom as follows (usually through
/etc/sysconfig/earlyoom, but could be in a different place on your
distribution):

EARLYOOM_ARGS="-p -r 600 -m 4,4 -s 1,1"

On Sat, Aug 31, 2024 at 3:44 PM Sake Ceph  wrote:


Oh, it got worse after the upgrade to Reef (was running Quincy).
With Quincy the memory usage was also often around 95% with some swap
usage, but it never exceeded either to the point of crashing.


Kind regards,
Sake
> Op 31-08-2024 09:15 CEST schreef Alexander Patrakov :
>
>
> Got it.
>
> However, to narrow down the issue, I suggest that you test whether it
> still exists after the following changes:
>
> 1. Reduce max_mds to 1.
> 2. Do not reduce max_mds to 1, but migrate all clients from a direct
> CephFS mount to NFS.
>
> On Sat, Aug 31, 2024 at 2:55 PM Sake Ceph  wrote:
> >
> > I was talking about the hosts that the MDS containers are
running on. The clients are all RHEL 9.

> >
> > Kind regards,
> > Sake
> >
> > > Op 31-08-2024 08:34 CEST schreef Alexander Patrakov  
:

> > >
> > >
> > > Hello Sake,
> > >
> > > The combination of two active MDSs and RHEL8 does ring a bell, and I
> > > have seen this with Quincy, too. However, what's relevant is the
> > > kernel version on the clients. If they run the default 4.18.x kernel
> > > from RHEL8, please either upgrade to the mainline kernel or decrease
> > > max_mds to 1. If they run a modern kernel, then it is something I do
> > > not know about.
> > >
> > > On Sat, Aug 31, 2024 at 1:21 PM Sake Ceph  wrote:
> > > >
> > > > @Anthony: it's a small virtualized cluster and indeed SWAP  
shouldn't be used, but this doesn't change the problem.

> > > >
> > > > @Alexander: the problem is in the active nodes, the standby  
replay don't have issues anymore.

> > > >
> > > > Last night's backup run increased the memory usage to 86%
when rsync was running for app2. It dropped to 77.8% when it was
done. When the rsync for app4 was running it increased to 84% and
then dropped to 80%. After a few hours it has now settled at 82%.
> > > > It looks to me like the MDS server is caching something forever
while it isn't being used.

> > > >
> > > > The underlying host is running on RHEL 8. Upgrade to RHEL 9  
is planned, but hit some issues with automatically upgrading hosts.

> > > >
> > > > Kind regards,
> > > > Sake
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> > >
> > >
> > > --
> > > Alexander Patrakov
>
>
>
> --
> Alexander Patrakov
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




--
Alexander Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph orch host drain daemon type

2024-08-30 Thread Eugen Block
The label removal approach is great, but it still doesn't allow you to  
drain only OSDs, no other daemons. I didn't think about the other  
labels, that's a good point. Let's see what the devs have to say. :-)


Zitat von Frédéric Nass :


Hi Eugen,

I know, but removing other services is generally done by removing  
labels on hosts, isn't it?


Either way, another concern would be how to deal with
_no_schedule and _no_conf_keyring labels when not draining all
services on the host. Would it require per-service-type
_no_schedule labels?

I don't know. That's where devs usually chime in. :-)

Cheers,
Frédéric

- Le 30 Aoû 24, à 12:33, Eugen Block ebl...@nde.ag a écrit :


Hi,


What would be nice is if the drain command could take care of OSDs
by default and drain all services only when called with a
--remove-all-services flag or something similar.


but that would mean that you wouldn't be able to drain only specific
services, and OSDs would be drained either way. From my perspective,
it would suffice to have the --daemon-type flag which is already
present in the 'ceph orch ps --daemon-type' command. You could either
drain a specific daemon type or drain the entire host (which could remain
the default, with the same behaviour as today). That would
allow more control over non-osd daemons.

Zitat von Frédéric Nass :


Hi Robert,

Thanks for the suggestion. Unfortunately, removing the 'osds' label
from a host does not remove the OSDs, unlike with other labels
(mons, mgrs, mdss, nfss, crash, _admin, etc.). This is because this
kind of service is tightly bound to the host and less 'portable'
than other services, I think. This is likely the purpose of the
drain command.

What would be nice is if the drain command could take care of OSDs
by default and drain all services only when called with a
--remove-all-services flag or something similar.

Frédéric

- Le 30 Aoû 24, à 1:07, Robert W. Eckert r...@rob.eckert.name a écrit :


If you are using cephadm, couldn't the host be removed from placing
osds? On my
cluster, I labeled the hosts for each service (OSD/MON/MGR/...)  
and have the

services deployed by label.   I believe that if you had that, then
when a label
is removed from the host the services eventually drain.



-----Original Message-----
From: Frédéric Nass 
Sent: Thursday, August 29, 2024 11:30 AM
To: Eugen Block 
Cc: ceph-users ; dev 
Subject: [ceph-users] Re: ceph orch host drain daemon type

Hello Eugen,

A month back, while playing with a lab cluster, I drained a
multi-service host
(OSDs, MGR, MON, etc.) in order to recreate all of its OSDs. During this
operation, all cephadm containers were removed as expected,
including the MGR.
As a result, I got into a situation where the orchestrator  
backend 'cephadm'

was missing and wouldn't load anymore. I didn't have much time to
investigate
this, so I decided to recreate the lab cluster. But I think this  
is due to a

bug.

I would probably have avoided this situation if I had been able to ask the
orchestrator to only drain services of type 'osd'. Makes sense.

Cheers,
Frédéric.

- Le 27 Aoû 24, à 15:19, Eugen Block ebl...@nde.ag a écrit :


Hi,

is there anything on the road map to be able to choose a specific
daemon type to be entirely removed from a host instead of all cephadm
managed daemons? I just did a quick search in tracker and github, but
it may be "hidden" somewhere else.

I was thinking about colocated daemons on a host, for example MON,
MGR, OSDs, node-exporter, crash. That's quite common, but if I just
wanted to drain all OSDs (maybe mark them as "destroyed" in order to
replace the drives), the usual 'ceph orch host drain <host>' command
would remove all daemons. That seems unnecessary if I'm going to add
the OSDs back.

Since there are a couple of other daemon types which can be deployed
multiple times per host, e. g. MDS, RGW, it doesn't only make sense
for OSDs but for other daemons as well. And those other daemons
usually have some cryptic suffix, we wouldn't need that in order to
get rid of them, it doesn't save that much time, but it could be a
nice enhancement.

Should I create a tracker issue in the enhancement section for that?

Thanks!
Eugen
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send
an email to

>> ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph orch host drain daemon type

2024-08-30 Thread Eugen Block

Hi,

What would be nice is if the drain command could take care of OSDs  
by default and drain all services only when called with a  
--remove-all-services flag or something similar.


but that would mean that you wouldn't be able to drain only specific  
services, and OSDs would be drained either way. From my perspective,  
it would suffice to have the --daemon-type flag which is already
present in the 'ceph orch ps --daemon-type' command. You could either
drain a specific daemon type or drain the entire host (which could remain
the default, with the same behaviour as today). That would
allow more control over non-osd daemons.
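
Just to make the idea concrete: the filter already exists for listing, while the drain variant below is purely hypothetical (it's the proposal, not an existing flag):

ceph orch ps --daemon-type osd                  (works today)
ceph orch host drain <host> --daemon-type osd   (proposed, does not exist yet)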


Zitat von Frédéric Nass :


Hi Robert,

Thanks for the suggestion. Unfortunately, removing the 'osds' label  
from a host does not remove the OSDs, unlike with other labels  
(mons, mgrs, mdss, nfss, crash, _admin, etc.). This is because this  
kind of service is tightly bound to the host and less 'portable'  
than other services, I think. This is likely the purpose of the  
drain command.


What would be nice is if the drain command could take care of OSDs  
by default and drain all services only when called with a  
--remove-all-services flag or something similar.


Frédéric

- Le 30 Aoû 24, à 1:07, Robert W. Eckert r...@rob.eckert.name a écrit :

If you are using cephadm, couldn't the host be removed from placing  
osds? On my

cluster, I labeled the hosts for each service (OSD/MON/MGR/...) and have the
services deployed by label.   I believe that if you had that, then  
when a label

is removed from the host the services eventually drain.



-----Original Message-----
From: Frédéric Nass 
Sent: Thursday, August 29, 2024 11:30 AM
To: Eugen Block 
Cc: ceph-users ; dev 
Subject: [ceph-users] Re: ceph orch host drain daemon type

Hello Eugen,

A month back, while playing with a lab cluster, I drained a  
multi-service host

(OSDs, MGR, MON, etc.) in order to recreate all of its OSDs. During this
operation, all cephadm containers were removed as expected,  
including the MGR.

As a result, I got into a situation where the orchestrator backend 'cephadm'
was missing and wouldn't load anymore. I didn't have much time to  
investigate

this, so I decided to recreate the lab cluster. But I think this is due to a
bug.

I would probably have avoided this situation if I had been able to ask the
orchestrator to only drain services of type 'osd'. Makes sense.

Cheers,
Frédéric.

- Le 27 Aoû 24, à 15:19, Eugen Block ebl...@nde.ag a écrit :


Hi,

is there anything on the road map to be able to choose a specific
daemon type to be entirely removed from a host instead of all cephadm
managed daemons? I just did a quick search in tracker and github, but
it may be "hidden" somewhere else.

I was thinking about colocated daemons on a host, for example MON,
MGR, OSDs, node-exporter, crash. That's quite common, but if I just
wanted to drain all OSDs (maybe mark them as "destroyed" in order to
replace the drives), the usual 'ceph orch host drain <host>' command
would remove all daemons. That seems unnecessary if I'm going to add
the OSDs back.

Since there are a couple of other daemon types which can be deployed
multiple times per host, e. g. MDS, RGW, it doesn't only make sense
for OSDs but for other daemons as well. And those other daemons
usually have some cryptic suffix, we wouldn't need that in order to
get rid of them, it doesn't save that much time, but it could be a
nice enhancement.

Should I create a tracker issue in the enhancement section for that?

Thanks!
Eugen
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send  
an email to

ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-30 Thread Eugen Block
Quick update: we decided to switch to wpq to see if that would confirm  
our suspicion, and it did. After a few hours all PGs in the snaptrim  
queue had been processed. We haven't looked into the average object  
sizes yet, maybe we'll try that approach next week or so. If you have  
any other ideas, let us know.


Zitat von Eugen Block :


Hi,

as expected the issue is not resolved and turned up again a couple  
of hours later. Here's the tracker issue:


https://tracker.ceph.com/issues/67702

I also attached a log snippet from one osd with debug_osd 10 to the  
tracker. Let me know if you need anything else, I'll stay in touch  
with Giovanna.


Thanks!
Eugen

Zitat von Sridhar Seshasayee :


Hi Eugen,

On Fri, Aug 23, 2024 at 1:37 PM Eugen Block  wrote:


Hi again,

I have a couple of questions about this.
What exactly happened to the PGs? They were queued for snaptrimming,
but we didn't see any progress. Let's assume the average object size
in that pool was around 2 MB (I don't have the actual numbers). Does
that mean if osd_snap_trim_cost (1M default) was too low, those too
large objects weren't trimmed? And then we split the PGs, reducing the
average object size to 1 MB, these objects could be trimmed then,
obviously. Does this explanation make sense?



If you have the OSD logs, I can take a look and see why the snaptrim ops
did not make progress. The cost is one contributing factor in the position
of the op in the queue. Therefore, even if the cost incorrectly
represents the actual average size of the objects in the PG, the op should
still be scheduled based on the set cost and the profile allocations.

The OSDs appear to be NVMe based is what I understand from the
thread. Based on the actions taken to resolve the situation (increased
pg_num to 64), I think something else was up on the cluster. For NVMe
based cluster, the current cost shouldn't cause stalling of the snaptrim
ops. I'd suggest raising an upstream tracker with your observation and
OSD logs to investigate this further.




I just browsed through the changes, if I understand the fix correctly,
the average object size is now calculated automatically, right? Which
makes a lot of sense to me, as an operator I don't want to care too
much about the average object sizes since ceph should know them better
than me. ;-)



Yes, that's correct. This fix was part of the effort to incrementally
include
background OSD operations to be scheduled by mClock.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD Mirror - Failed to unlink peer

2024-08-27 Thread Eugen Block

Can you share 'ceph versions' output?
Do you see the same behaviour when adding a snapshot schedule, e.g.

rbd -p <pool> mirror snapshot schedule add 30m

I can't reproduce it, unfortunately: creating those mirror snapshots
manually still works for me.
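
If it helps to narrow it down, the peer registration and the mirror snapshots of that image can be inspected with something like this (pool/image names taken from your example):

rbd mirror pool info ceph-ssd
rbd snap ls --all ceph-ssd/vm-101-disk-1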


Zitat von scott.cai...@tecnica-ltd.co.uk:

We have the rbd-mirror daemon running on both sites, however replication
is only one way (i.e. the one on the remote site is the only live
one, the one on the primary site is just there in case we ever need
to set up two-way, but it is not currently set up for any
replication - so it makes sense there's nothing in the log files on
the primary site, as it's doing nothing).


I'm not seeing any errors in the rbd-mirror daemon log at either end -
the primary is blank as expected, and the error appears on the
primary when the snapshot is taken, so the remote cluster never
sees any errors.


When we either manually run the command to take a snapshot, or have  
this run through cron we receive the error, e.g. running the  
following on the primary site:


# rbd mirror image snapshot ceph-ssd/vm-101-disk-1
Snapshot ID: 58393
2024-08-26T12:39:54.958+0100 7b5ad6a006c0 -1  
librbd::mirror::snapshot::CreatePrimaryRequest: 0x7b5ac0019e60  
handle_unlink_peer: failed to unlink peer: (2) No such file or  
directory



This appears in the console as the output for this (we used to only  
get the Snapshot ID: x), not in any rbd log files.


Hope that clarifies it? Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph orch host drain daemon type

2024-08-27 Thread Eugen Block

Hi,

is there anything on the road map to be able to choose a specific  
daemon type to be entirely removed from a host instead of all cephadm  
managed daemons? I just did a quick search in tracker and github, but  
it may be "hidden" somewhere else.


I was thinking about colocated daemons on a host, for example MON,  
MGR, OSDs, node-exporter, crash. That's quite common, but if I just  
wanted to drain all OSDs (maybe mark them as "destroyed" in order to  
replace the drives), the usual 'ceph orch host drain <host>' command
would remove all daemons. That seems unnecessary if I'm going to add  
the OSDs back.


Since there are a couple of other daemon types which can be deployed  
multiple times per host, e. g. MDS, RGW, it doesn't only make sense  
for OSDs but for other daemons as well. And those other daemons  
usually have some cryptic suffix, we wouldn't need that in order to  
get rid of them, it doesn't save that much time, but it could be a  
nice enhancement.


Should I create a tracker issue in the enhancement section for that?

Thanks!
Eugen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] device_health_metrics pool automatically recreated

2024-08-27 Thread Eugen Block

Hi,

I just looked into one customer cluster that we upgraded some time ago  
from Octopus to Quincy (17.2.6) and I'm wondering why there are still  
both pools, "device_health_metrics" and ".mgr".


According to the docs [0], it's supposed to be renamed:

Prior to Quincy, the devicehealth module created a  
device_health_metrics pool to store device SMART statistics. With  
Quincy, this pool is automatically renamed to be the common manager  
module pool.


Now only .mgr has data while device_health_metrics is empty, but it  
has a newer ID:


ses01:~ # ceph df | grep -E "device_health|.mgr"
.mgr                    1    1   68 MiB   18  204 MiB      0    254 TiB
device_health_metrics  15    1      0 B    0      0 B      0    254 TiB


On a test cluster (meanwhile upgraded to latest Reef) I see the same:

ceph01:~ # ceph df | grep -E "device_health_metrics|.mgr"
.mgr                   38    1  577 KiB    2  1.7 MiB      0     71 GiB
device_health_metrics  45    1      0 B    0      0 B      0     71 GiB


Since there are still many users who haven't upgraded to >= Quincy  
yet, this should be clarified/fixed. I briefly checked  
tracker.ceph.com, but didn't find anything related to this. I'm  
currently trying to reproduce it on a one-node test cluster which I  
upgraded from Pacific to Quincy, but no results yet, only that the  
renaming was successful. But for the other clusters I don't have  
enough logs to find out how/why the device_health_metrics pool had  
been recreated.
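
For the record, a quick way to double-check that the leftover pool really holds nothing would be something like this (just a sketch, nothing gets deleted here):

rados -p device_health_metrics ls
ceph df detail | grep device_health_metrics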


Thanks,
Eugen

[0] https://docs.ceph.com/en/latest/mgr/administrator/#module-pool

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD Mirror - Failed to unlink peer

2024-08-26 Thread Eugen Block

Hi,

I think I need some clarification. You have a rbd-mirror daemon  
running on the primary site although you have configured rbd-mirroring  
one-way only? And those errors you see in the rbd-mirror daemon log on  
the primary site?
Maybe the daemon got started/activated by accident (or it was not  
disabled from some two-way mirror tests)? You don't need a rbd-mirror  
daemon on the primary site if you mirror only one-way.



Zitat von scott.cai...@tecnica-ltd.co.uk:

Thanks - side tracked with other work so only just got around to  
testing this.


Unfortunately when enabling rbd-mirror logs on the source cluster  
I'm not seeing any events logged at all, however on the remote  
cluster I can see constant activity (mostly imageReplayer,  
mirrorStatusUpdater, etc. logs).


Currently our sync is only one way (from source to remote), and the  
error appears to be on the source (i.e. as soon as the snapshot is  
taken).


There's no error on the remote cluster in the rbd mirror logs, and  
nothing logged at all on the source cluster in the rbd mirror logs.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-26 Thread Eugen Block

Hi,

as expected the issue is not resolved and turned up again a couple of  
hours later. Here's the tracker issue:


https://tracker.ceph.com/issues/67702

I also attached a log snippet from one osd with debug_osd 10 to the  
tracker. Let me know if you need anything else, I'll stay in touch  
with Giovanna.


Thanks!
Eugen

Zitat von Sridhar Seshasayee :


Hi Eugen,

On Fri, Aug 23, 2024 at 1:37 PM Eugen Block  wrote:


Hi again,

I have a couple of questions about this.
What exactly happened to the PGs? They were queued for snaptrimming,
but we didn't see any progress. Let's assume the average object size
in that pool was around 2 MB (I don't have the actual numbers). Does
that mean if osd_snap_trim_cost (1M default) was too low, those too
large objects weren't trimmed? And then we split the PGs, reducing the
average object size to 1 MB, these objects could be trimmed then,
obviously. Does this explanation make sense?



If you have the OSD logs, I can take a look and see why the snaptrim ops
did not make progress. The cost is one contributing factor in the position
of the op in the queue. Therefore, even if the cost incorrectly
represents the actual average size of the objects in the PG, the op should
still be scheduled based on the set cost and the profile allocations.

The OSDs appear to be NVMe based is what I understand from the
thread. Based on the actions taken to resolve the situation (increased
pg_num to 64), I think something else was up on the cluster. For NVMe
based cluster, the current cost shouldn't cause stalling of the snaptrim
ops. I'd suggest raising an upstream tracker with your observation and
OSD logs to investigate this further.




I just browsed through the changes, if I understand the fix correctly,
the average object size is now calculated automatically, right? Which
makes a lot of sense to me, as an operator I don't want to care too
much about the average object sizes since ceph should know them better
than me. ;-)



Yes, that's correct. This fix was part of the effort to incrementally
include
background OSD operations to be scheduled by mClock.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-23 Thread Eugen Block

Hi again,

I have a couple of questions about this.
What exactly happened to the PGs? They were queued for snaptrimming,  
but we didn't see any progress. Let's assume the average object size  
in that pool was around 2 MB (I don't have the actual numbers). Does  
that mean if osd_snap_trim_cost (1M default) was too low, those too  
large objects weren't trimmed? And then we split the PGs, reducing the  
average object size to 1 MB, these objects could be trimmed then,  
obviously. Does this explanation make sense?


I just browsed through the changes, if I understand the fix correctly,  
the average object size is now calculated automatically, right? Which  
makes a lot of sense to me, as an operator I don't want to care too  
much about the average object sizes since ceph should know them better  
than me. ;-)


Thanks!
Eugen

Zitat von Sridhar Seshasayee :


Hi Eugen,

There was a PR (https://github.com/ceph/ceph/pull/55040) related to mClock
and snaptrim
that was backported and available from v18.2.4. The fix more accurately
determines the
cost (instead of priority with wpq) of snaptrim operation depending on the
average size of
the objects in the PG. Depending on the active mClock profile, this should
help move the
snaptrim queue.

To prevent the cluster getting into a similar situation again is to try and
change the config
osd_snap_trim_cost (I think it's set to 1 MiB by default)  to a value that
more accurately
reflects the average object size of the PGs undergoing snaptrim and see if
it helps. In
general with mClock, lower cost ops spend less time in the queue.

-Sridhar



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-22 Thread Eugen Block
Oh yeah, I think I stumbled upon that as well, but then it slipped my  
mind again. Thanks for pointing that out, I appreciate it!



Zitat von Sridhar Seshasayee :


Hi Eugen,

There was a PR (https://github.com/ceph/ceph/pull/55040) related to mClock
and snaptrim
that was backported and available from v18.2.4. The fix more accurately
determines the
cost (instead of priority with wpq) of snaptrim operation depending on the
average size of
the objects in the PG. Depending on the active mClock profile, this should
help move the
snaptrim queue.

To prevent the cluster getting into a similar situation again is to try and
change the config
osd_snap_trim_cost (I think it's set to 1 MiB by default)  to a value that
more accurately
reflects the average object size of the PGs undergoing snaptrim and see if
it helps. In
general with mClock, lower cost ops spend less time in the queue.

-Sridhar



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-22 Thread Eugen Block
I know, I know, but since the rest seemed to work well I didn't want  
to change it yet but rather analyze what else was going on. And since  
we found a few things, it was worth it. :-)


Zitat von Joachim Kraftmayer :


Hi Eugen,
the first what can into my mind was replace mclock with wpq.
Joachim

Eugen Block  schrieb am Do., 22. Aug. 2024, 14:31:


Just a quick update on this topic. I assisted Giovanna directly off
list. For now the issue seems resolved, although I don't think we
really fixed anything but rather got rid of the current symptoms.

A couple of findings for posterity:

- There's a k8s pod creating new snap-schedules every couple of hours
or so, we removed dozens of them, probably around 30.
- We removed all existing cephfs snapshots after mounting the root
dir, this didn't have any effect on the snaptrims yet.
- We increased the number of parallel snaptrim operations to 32 since
the NVMe OSDs were basically idle, which only marked all 32 PGs as
snaptrimming and none were in snaptrim_wait status. But still no real
progress visible. Inspecting the OSD logs in debug level 10 didn't
reveal anything obvious.
- We then increased pg_num to 64 (and disabled autoscaler for this
pool) since 'ceph osd df' showed only around 40 PGs per OSD. This
actually did slowly get rid of the snaptrimming PGs while backfilling.
Yay!
- All config changes have been reset to default.

My interpretation is that the increasing number of snap-schedules
accumulated so many snapshots, causing slow trimming. Here's a snippet
of the queue (from 'ceph osd pool ls detail output'):

removed_snaps_queue

[5b3ee~1,5be5c~6f,5bf71~a1,5c0b8~a1,5c1fd~2,5c201~9f,5c346~a1,5c48d~a1,5c5d0~1,5c71a~1,5c85d~1,5c85f~1,5c861~1,5c865~1,5c9a6~1,5c9a8~1,5c9aa~1,5c9ad~1,5caef~1,5caf1~1,5caf3~1,5caf6~1,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~4,5ce2a~1,5ce2c~1,5ce2f~a3,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a7,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a9,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a9,5f0bd~a7,5f166~2,5f206~a9,5f34f~a9,5f499~a9,5f5e3~a7,5f68c~2,5f72d~a7,5f875~a1,5f9b7~a7,5fa61~2,5fb01~a7,5fba9~1,5fbab~1,5fc4b~a7,5fcf3~2,5fd95~a9,5fedf~a7,5ff88~2,60028~a1,600ca~6,600d1~1,600d3~1,60173~a7,6021c~2,602bd~bd]

My assumption was, if we have more PGs, more trimming can be done in
parallel to finally catch up. My suspicion is that this could have
something to do with mclock although I have no real evidence, except a
thread I found yesterday [1].
I recommended keeping an eye on the removed_snaps_queue as well as
checking the pod that creates so many snap-schedules (btw., they were all
exactly the same, same retention time etc.) and modifying it so it
doesn't flood the cluster with unnecessary snapshots.
If this situation comes up again, we'll try it with wpq scheduler
instead of mclock, or search for better mclock settings. But since the
general recommendation to stick to wpq hasn't been revoked yet, it
might be the better approach anyway.

We'll see how it goes.

[1] https://www.spinics.net/lists/ceph-users/msg78514.html


Zitat von Giovanna Ratini :

> Hello Eugen,
>
>> Hi (please don't drop the ML from your responses),
> Sorry. I didn't pay attention. I will.
>>
>>> All PGs of pool cephfs are affected and they are in all OSDs
>>
>> then just pick a random one and check if anything stands out. I'm
>> not sure if you mentioned it already, did you also try restarting
>> OSDs?
>>
> Yes, I've done everything, including compaction, reducing defaults,
> and OSD restarts.
>
> The growth seems to have stopped, but there hasn't been a decrease.
> It appears that only the CephFS pool is problematic. I'm an Oracle
> admin and I don't have much experience with Ceph, so perhaps my
> questions might seem a bit naive.
>
> I have a lot of space in this cluster. Could I create a new cephfs
> pool (cephfs01) and copy the data over to it?
> Then I would change the name of the pool in Rook and hope that the
> pods will find their PVs."
>
> Regards,
>
> Gio
>
>>> Oh, not yesterday. I do it now, then I compat all osds with nostrim.
>>> Do I add OSDs?
>>
>> Let's wait for the other results first (compaction, reducing
>> defaults, OSD restart). If that doesn't change anything, I would
>> probably try to add three more OSDs. I assume you have three hosts?
>>
>>
>> Zitat von Giovanna Ratini :
>>
>>> Hello Eugen,
>>>
>>> Am 20.08.2024 um 09:44 schrieb Eugen Block:
>>>

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-22 Thread Eugen Block
Just a quick update on this topic. I assisted Giovanna directly off  
list. For now the issue seems resolved, although I don't think we  
really fixed anything but rather got rid of the current symptoms.


A couple of findings for posterity:

- There's a k8s pod creating new snap-schedules every couple of hours  
or so, we removed dozens of them, probably around 30.
- We removed all existing cephfs snapshots after mounting the root  
dir, this didn't have any effect on the snaptrims yet.
- We increased the number of parallel snaptrim operations to 32 since  
the NVMe OSDs were basically idle, which only marked all 32 PGs as  
snaptrimming and none were in snaptrim_wait status. But still no real  
progress visible. Inspecting the OSD logs in debug level 10 didn't  
reveal anything obvious.
- We then increased pg_num to 64 (and disabled autoscaler for this  
pool) since 'ceph osd df' showed only around 40 PGs per OSD. This  
actually did slowly get rid of the snaptrimming PGs while backfilling.  
Yay!

- All config changes have been reset to default.

My interpretation is that the increasing number of snap-schedules  
accumulated so many snapshots, causing slow trimming. Here's a snippet  
of the queue (from 'ceph osd pool ls detail output'):


removed_snaps_queue  
[5b3ee~1,5be5c~6f,5bf71~a1,5c0b8~a1,5c1fd~2,5c201~9f,5c346~a1,5c48d~a1,5c5d0~1,5c71a~1,5c85d~1,5c85f~1,5c861~1,5c865~1,5c9a6~1,5c9a8~1,5c9aa~1,5c9ad~1,5caef~1,5caf1~1,5caf3~1,5caf6~1,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~4,5ce2a~1,5ce2c~1,5ce2f~a3,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a7,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a9,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a9,5f0bd~a7,5f166~2,5f206~a9,5f34f~a9,5f499~a9,5f5e3~a7,5f68c~2,5f72d~a7,5f875~a1,5f9b7~a7,5fa61~2,5fb01~a7,5fba9~1,5fbab~1,5fc4b~a7,5fcf3~2,5fd95~a9,5fedf~a7,5ff88~2,60028~a1,600ca~6,600d1~1,600d3~1,60173~a7,6021c~2,602bd~bd]


My assumption was, if we have more PGs, more trimming can be done in  
parallel to finally catch up. My suspicion is that this could have  
something to do with mclock although I have no real evidence, except a  
thread I found yesterday [1].
I recommended keeping an eye on the removed_snaps_queue as well as
checking the pod that creates so many snap-schedules (btw., they were all
exactly the same, same retention time etc.) and modifying it so it
doesn't flood the cluster with unnecessary snapshots.
If this situation comes up again, we'll try it with wpq scheduler  
instead of mclock, or search for better mclock settings. But since the  
general recommendation to stick to wpq hasn't been revoked yet, it  
might be the better approach anyway.


We'll see how it goes.

[1] https://www.spinics.net/lists/ceph-users/msg78514.html


Zitat von Giovanna Ratini :


Hello Eugen,


Hi (please don't drop the ML from your responses),

Sorry. I didn't pay attention. I will.



All PGs of pool cephfs are affected and they are in all OSDs


then just pick a random one and check if anything stands out. I'm  
not sure if you mentioned it already, did you also try restarting  
OSDs?


Yes, I've done everything, including compaction, reducing defaults,  
and OSD restarts.


The growth seems to have stopped, but there hasn't been a decrease.  
It appears that only the CephFS pool is problematic. I'm an Oracle  
admin and I don't have much experience with Ceph, so perhaps my  
questions might seem a bit naive.


I have a lot of space in this cluster. Could I create a new cephfs  
pool (cephfs01) and copy the data over to it?
Then I would change the name of the pool in Rook and hope that the  
pods will find their PVs."


Regards,

Gio


Oh, not yesterday. I do it now, then I compat all osds with nostrim.
Do I add OSDs?


Let's wait for the other results first (compaction, reducing  
defaults, OSD restart). If that doesn't change anything, I would  
probably try to add three more OSDs. I assume you have three hosts?



Zitat von Giovanna Ratini :


Hello Eugen,

Am 20.08.2024 um 09:44 schrieb Eugen Block:
 You could also look into the historic_ops of the primary OSD for  
one affected PG:


All PGs of pool cephfs are affected and they are in all OSDs :-(


Did you reduce the default values I mentioned?


Oh, not yesterday. I do it now, then I compat all osds with nostrim.

Do I add OSDs?

Regars,

Gio




ceph tell osd.<id> dump_historic_ops_by_duration

But I'm not sure if that can actually help here. There are plenty  
of places to look at, you could turn on debug logs on one primary  
OSD and inspect the output.


I just get the feeling that this is one of the corner cases

[ceph-users] Re: Pull failed on cluster upgrade

2024-08-22 Thread Eugen Block

Hi,

I haven't dealt with this myself yet, but the docs [0] state:

A bug was discovered in root_squash which would potentially lose  
changes made by a client restricted with root_squash caps. The fix  
required a change to the protocol and a client upgrade is required.


This is a HEALTH_ERR warning because of the danger of inconsistency  
and lost data. It is recommended to either upgrade your clients,  
discontinue using root_squash in the interim, or silence the warning  
if desired.


To evict and permanently block broken clients from connecting to the  
cluster, set the required_client_feature bit client_mds_auth_caps.


Do you have the option to upgrade your client(s) as well?

[0]  
https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-clients-broken-rootsquash
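
If upgrading the clients isn't an option right away, the mute and the eviction route from the quoted docs translate to roughly this (fs name is a placeholder; the second command evicts and blocks the affected clients, so handle it with care - just a sketch):

ceph health mute MDS_CLIENTS_BROKEN_ROOTSQUASH
ceph fs required_client_features <fs_name> add client_mds_auth_caps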


Zitat von Nicola Mori :


The upgrade ended successfully, but now the cluster reports this error:

  MDS_CLIENTS_BROKEN_ROOTSQUASH: 1 MDS report clients with broken  
root_squash implementation


From what I understood this is due to a new feature meant to fix a  
bug in the root_squash implementation, and that will be released  
with version 19. I didn't find anything about a backport to 18.2.4.  
Can someone share some info please? Especially about if and how it  
can be fixed.

Thanks in advance,

Nicola



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to recover cluster, error: unable to read magic from mon data

2024-08-22 Thread Eugen Block
Alright, quorum is good. Do you have mgr keyrings in 'ceph auth ls'  
output? If they are present, is there a keyring file in  
/var/lib/ceph/{fsid}/mgr.{mgr}/ and does it match the auth output?
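
Something along these lines should show whether they match (name and fsid are placeholders):

ceph auth get mgr.<name>
cat /var/lib/ceph/<fsid>/mgr.<name>/keyring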



Zitat von RIT Computer Science House :


One mon survived (it took us a while to find it since it was in a damaged
state), and we have since been able to create a new second mon where an old
mon was - quorum has been re-established. We are not able to use `ceph
orch` now to deploy new mons though, it is giving us an error from the
keyring.

The ceph commands were not working at the time of only one damaged mon, but
now most ceph commands function normally with 2 mons.

On Wed, Aug 21, 2024 at 6:27 AM Eugen Block  wrote:


Hi,

Is there any command-line history available to get at least some sort
of history of events?
Are all MONs down or has one survived?
Could he have tried to change IP addresses or something? There's an
old blog post [0] explaining how to clean up. And here's some more
reading [1] how to modify a monmap in a cephadm managed cluster.
I assume none of the ceph commands work, can you confirm?

[0] https://ceph.io/en/news/blog/2015/ceph-monitor-troubleshooting/
[1]

https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#rados-mon-remove-from-unhealthy

Zitat von RIT Computer Science House :

> Hello,
>
> Our cluster has become unresponsive after a teammate's work on the
cluster.
> We are currently unable to get the full story on what he did to fully
> understand what is going on, and the only error we are able to see in any
> of the logs is the following:
> 2024-08-20T03:12:34.183+ 7f3670246b80 -1 unable to read magic from
mon
> data
>
> Any help would be greatly appreciated!
> I can provide any information necessary to help debugging.
>
> Thanks,
> Tyler
> System Administrator @ Computer Science House
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to recover cluster, error: unable to read magic from mon data

2024-08-21 Thread Eugen Block

Hi,

Is there any command-line history available to get at least some sort  
of history of events?

Are all MONs down or has one survived?
Could he have tried to change IP addresses or something? There's an  
old blog post [0] explaining how to clean up. And here's some more  
reading [1] how to modify a monmap in a cephadm managed cluster.

I assume none of the ceph commands work, can you confirm?

[0] https://ceph.io/en/news/blog/2015/ceph-monitor-troubleshooting/
[1]  
https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#rados-mon-remove-from-unhealthy
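The monmap surgery from [1] boils down to something like this (a rough sketch only -- stop the mon first, run it inside 'cephadm shell', and back up the mon store before touching anything):

ceph-mon -i <mon-id> --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm <dead-mon> /tmp/monmap
ceph-mon -i <mon-id> --inject-monmap /tmp/monmap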


Zitat von RIT Computer Science House :


Hello,

Our cluster has become unresponsive after a teammate's work on the cluster.
We are currently unable to get the full story on what he did to fully
understand what is going on, and the only error we are able to see in any
of the logs is the following:
2024-08-20T03:12:34.183+ 7f3670246b80 -1 unable to read magic from mon
data

Any help would be greatly appreciated!
I can provide any information necessary to help debugging.

Thanks,
Tyler
System Administrator @ Computer Science House


[ceph-users] Re: CephFS troubleshooting

2024-08-20 Thread Eugen Block

Hi,

can you share more details? For example the auth caps of your fuse  
client (ceph auth export client.) and the exact command  
that fails? Did it work before?


I just did that on a small test cluster (17.2.7) without an issue.

BTW, the warning "too many PGs per OSD (328 > max 250)" is serious and  
should be taken care of.
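You can either add OSDs or reduce pg_num on the oversized pools, for example (pool name is a placeholder, check autoscale-status first):

ceph osd pool autoscale-status
ceph osd pool set <pool> pg_num <lower value>

or let the autoscaler take care of it with 'ceph osd pool set <pool> pg_autoscale_mode on'.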


Regards,
Eugen

Zitat von Eugenio Tampieri :


Hello,
I'm writing to troubleshoot an otherwise functional Ceph Quincy  
cluster that has issues with CephFS.
I cannot mount it with ceph-fuse (it gets stuck), and if I mount it  
via NFS I can list the directories but cannot read or write  
anything.

Here's the output of ceph -s
  cluster:
id: 3b92e270-1dd6-11ee-a738-000c2937f0ec
health: HEALTH_WARN
mon ceph-storage-a is low on available space
1 daemons have recently crashed
too many PGs per OSD (328 > max 250)

  services:
    mon: 5 daemons, quorum ceph-mon-a,ceph-storage-a,ceph-mon-b,ceph-storage-c,ceph-storage-d (age 105m)
    mgr: ceph-storage-a.ioenwq (active, since 106m), standbys: ceph-mon-a.tiosea
    mds: 1/1 daemons up, 2 standby
    osd: 4 osds: 4 up (since 104m), 4 in (since 24h)
    rbd-mirror: 2 daemons active (2 hosts)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
volumes: 1/1 healthy
pools:   13 pools, 481 pgs
objects: 231.83k objects, 648 GiB
usage:   1.3 TiB used, 1.8 TiB / 3.1 TiB avail
pgs: 481 active+clean

  io:
client:   1.5 KiB/s rd, 8.6 KiB/s wr, 1 op/s rd, 0 op/s wr
Best regards,

Eugenio Tampieri


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-20 Thread Eugen Block

Hi (please don't drop the ML from your responses),


All PGs of pool cephfs are affected and they are in all OSDs


then just pick a random one and check if anything stands out. I'm not  
sure if you mentioned it already, did you also try restarting OSDs?



Oh, not yesterday. I'll do it now, and then I'll compact all OSDs with nosnaptrim set.
Should I add OSDs?


Let's wait for the other results first (compaction, reducing defaults,  
OSD restart). If that doesn't change anything, I would probably try to  
add three more OSDs. I assume you have three hosts?



Zitat von Giovanna Ratini :


Hello Eugen,

Am 20.08.2024 um 09:44 schrieb Eugen Block:
 You could also look into the historic_ops of the primary OSD for  
one affected PG:


All PGs of pool cephfs are affected and they are in all OSDs :-(


Did you reduce the default values I mentioned?


Oh, not yesterday. I'll do it now, and then I'll compact all OSDs with nosnaptrim set.

Should I add OSDs?

Regards,

Gio




ceph tell osd. dump_historic_ops_by_duration

But I'm not sure if that can actually help here. There are plenty  
of places to look at, you could turn on debug logs on one primary  
OSD and inspect the output.


I just get the feeling that this is one of the corner cases with  
too few OSDs, although the cluster load seems to be low.


Zitat von Giovanna Ratini :


Hello Eugen,

yesterday, after stopping and re-enabling snaptrim, the queue decreased a  
little and then remained stuck.

It didn't grow and it didn't decrease.

Is that good or bad?


Am 19.08.2024 um 15:43 schrieb Eugen Block:
There's a lengthy thread [0] where several approaches are  
proposed. The worst is an OSD recreation, but that's the last  
resort, of course.

What are the current values for these configs?

ceph config get osd osd_pg_max_concurrent_snap_trims
ceph config get osd osd_max_trimming_pgs

Maybe decrease them to 1 each while the nosnaptrim flag is set,  
then unset it. You could also try online (and/or offline osd  
compaction) before unsetting the flag. Are the OSD processes  
utilizing an entire CPU?


[0] https://www.spinics.net/lists/ceph-users/msg75626.html

Zitat von Giovanna Ratini :


Hallo Eugen,

yes, the load is for now not too much.

I stop the snap and now this is the output. No changes in the queue.

root@kube-master02:~# k ceph -s
Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    nosnaptrim flag(s) set
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 30h)
    mgr: a(active, since 29h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 21h), 6 in (since 6d)
 flags nosnaptrim

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.21M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 32 active+clean+snaptrim_wait

  io:
    client:   7.4 MiB/s rd, 7.9 MiB/s wr, 11 op/s rd, 35 op/s wr

Am 19.08.2024 um 14:54 schrieb Eugen Block:

What happens when you disable snaptrimming entirely?

ceph osd set nosnaptrim

So the load on your cluster seems low, but are the OSDs heavily  
utilized? Have you checked iostat?


Zitat von Giovanna Ratini :


Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have  
a scrub unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for  
several days:


 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim"

Today, I restarted the MON, MGR, and MDS, but no changes in  
the growing.


Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting  
anything suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim takes a lot of time, but objects_trimmed is 0:

    "objects_trimmed": 0,
    "snaptrim_duration": 500.5807601752,

That could explain why the queue keeps growing.


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I 

[ceph-users] Re: Bug with Cephadm module osd service preventing orchestrator start

2024-08-20 Thread Eugen Block
Don't worry, it happens a lot and it also happens to me. ;-) Glad it  
worked for you as well.


Zitat von Benjamin Huth :


Thank you so much for the help! Thanks to the issue you linked and the
other guy you replied to with the same issue, I was able to edit the
config-key and get my orchestrator back. Sorry for not checking the issues
as well as I should have, that's my bad there.

On Mon, Aug 19, 2024 at 6:12 AM Eugen Block  wrote:


There's a tracker issue for this:

https://tracker.ceph.com/issues/67329

Zitat von Eugen Block :

> Hi,
>
> what is the output of this command?
>
> ceph config-key get mgr/cephadm/osd_remove_queue
>
> I just tried to cancel a draining on a small 18.2.4 test cluster, it
> went well, though. After scheduling the drain the mentioned key
> looks like this:
>
> # ceph config-key get mgr/cephadm/osd_remove_queue
> [{"osd_id": 1, "started": true, "draining": false, "stopped": false,
> "replace": false, "force": false, "zap": false, "hostname": "host5",
> "original_weight": 0.0233917236328125, "drain_started_at": null,
> "drain_stopped_at": null, "drain_done_at": null,
> "process_started_at": "2024-08-19T07:21:27.783527Z"}, {"osd_id": 13,
> "started": true, "draining": true, "stopped": false, "replace":
> false, "force": false, "zap": false, "hostname": "host5",
> "original_weight": 0.0233917236328125, "drain_started_at":
> "2024-08-19T07:21:30.365237Z", "drain_stopped_at": null,
> "drain_done_at": null, "process_started_at":
> "2024-08-19T07:21:27.794688Z"}]
>
> Here you see the original_weight which the orchestrator failed to
> read, apparently. (Note that there are only small 20 GB OSDs, hence
> the small weight). You probably didn't have the output while the
> OSDs were scheduled for draining, correct? I was able to break my
> cephadm module by injecting that json again (it was already
> completed, hence empty), but maybe I did it incorrectly, not sure yet.
>
> Regards,
> Eugen
>
> Zitat von Benjamin Huth :
>
>> So about a week and a half ago, I started a drain on an incorrect host.
I
>> fairly quickly realized that it was the wrong host, so I stopped the
drain,
>> canceled the osd deletions with "ceph orch osd rm stop OSD_ID", then
>> dumped, edited the crush map to properly reweight those osds and host,
and
>> applied the edited crush map. I then proceeded with a full drain of the
>> correct host and completed that before attempting to upgrade my cluster.
>>
>> I started the upgrade, and all 3 of my managers were upgraded from
18.2.2
>> to 18.2.4. At this point, my managers started back up, but with an
>> orchestrator that had failed to start, so the upgrade was unable to
>> continue. My cluster is in a stage where only the 3 managers are
upgraded
>> to 18.2.4 and every other part is at 18.2.2
>>
>> Since my orchestrator is not able to start, I'm unfortunately not able
to
>> run any ceph orch commands as I receive "Error ENOENT: Module not found"
>> because the cephadm module doesn't load.
>> Output of ceph versions:
>> {
>>"mon": {
>>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
>> reef (stable)": 5
>>},
>>"mgr": {
>>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
>> reef (stable)": 1
>>},
>>"osd": {
>>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
>> reef (stable)": 119
>>},
>>"mds": {
>>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
>> reef (stable)": 4
>>},
>>"overall": {
>>"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
>> reef (stable)": 129
>>}
>> }
>>
>> I mentioned in my previous post that I tried manually downgrading the
>> managers to 18.2.2 because I thought there may be an issue with 18.2.4,
but
>> 18.2.2 also has the PR that I believe is causing this (
>>
https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d
)
>> so no luck
>>
>> Thanks!
>> (so sorry, I did not reply all so you may have received this twice)
>>
>> On Sat, Aug 17, 2024 at 2:55 AM Eugen Block  wrote:
>>
>>> Just to get some background information,

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-20 Thread Eugen Block
Did you reduce the default values I mentioned? You could also look  
into the historic_ops of the primary OSD for one affected PG:


ceph tell osd. dump_historic_ops_by_duration

But I'm not sure if that can actually help here. There are plenty of  
places to look at; you could turn on debug logs on one primary OSD and  
inspect the output.
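To find the primary OSD for a PG and dump its slowest ops, something like this should work (PG 3.12 is just an example, the primary is the first OSD in the acting set):

ceph pg map 3.12
ceph tell osd.<primary> dump_historic_ops_by_duration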


I just get the feeling that this is one of the corner cases with too  
few OSDs, although the cluster load seems to be low.


Zitat von Giovanna Ratini :


Hello Eugen,

yesterday, after stopping and re-enabling snaptrim, the queue decreased a little  
and then remained stuck.

It didn't grow and it didn't decrease.

Is that good or bad?


Am 19.08.2024 um 15:43 schrieb Eugen Block:
There's a lengthy thread [0] where several approaches are proposed.  
The worst is an OSD recreation, but that's the last resort, of course.

What are the current values for these configs?

ceph config get osd osd_pg_max_concurrent_snap_trims
ceph config get osd osd_max_trimming_pgs

Maybe decrease them to 1 each while the nosnaptrim flag is set,  
then unset it. You could also try online (and/or offline osd  
compaction) before unsetting the flag. Are the OSD processes  
utilizing an entire CPU?


[0] https://www.spinics.net/lists/ceph-users/msg75626.html

Zitat von Giovanna Ratini :


Hallo Eugen,

yes, the load is for now not too much.

I stop the snap and now this is the output. No changes in the queue.

root@kube-master02:~# k ceph -s
Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    nosnaptrim flag(s) set
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 30h)
    mgr: a(active, since 29h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 21h), 6 in (since 6d)
 flags nosnaptrim

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.21M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 32 active+clean+snaptrim_wait

  io:
    client:   7.4 MiB/s rd, 7.9 MiB/s wr, 11 op/s rd, 35 op/s wr

Am 19.08.2024 um 14:54 schrieb Eugen Block:

What happens when you disable snaptrimming entirely?

ceph osd set nosnaptrim

So the load on your cluster seems low, but are the OSDs heavily  
utilized? Have you checked iostat?


Zitat von Giovanna Ratini :


Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have a  
scrub unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for several days:

 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim"

Today, I restarted the MON, MGR, and MDS, but no changes in the growing.

Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting  
anything suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim takes a lot of time, but objects_trimmed is 0:

    "objects_trimmed": 0,
    "snaptrim_duration": 500.5807601752,

That could explain why the queue keeps growing.


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the pg dump. The snap trim queue keeps growing:

Query für PG: 3.12
{
    "snap_trimq":  
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Eugen Block
There's a lengthy thread [0] where several approaches are proposed.  
The worst is an OSD recreation, but that's the last resort, of course.

What are the current values for these configs?

ceph config get osd osd_pg_max_concurrent_snap_trims
ceph config get osd osd_max_trimming_pgs

Maybe decrease them to 1 each while the nosnaptrim flag is set, then  
unset it. You could also try online (and/or offline osd compaction)  
before unsetting the flag. Are the OSD processes utilizing an entire  
CPU?
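Roughly like this, for example (osd.0 is just an example id):

ceph config set osd osd_pg_max_concurrent_snap_trims 1
ceph config set osd osd_max_trimming_pgs 1
ceph tell osd.0 compact     # online compaction, repeat for each OSD
ceph osd unset nosnaptrim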


[0] https://www.spinics.net/lists/ceph-users/msg75626.html

Zitat von Giovanna Ratini :


Hallo Eugen,

yes, the load is for now not too much.

I stop the snap and now this is the output. No changes in the queue.

root@kube-master02:~# k ceph -s
Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    nosnaptrim flag(s) set
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 30h)
    mgr: a(active, since 29h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 21h), 6 in (since 6d)
 flags nosnaptrim

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.21M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 32 active+clean+snaptrim_wait

  io:
    client:   7.4 MiB/s rd, 7.9 MiB/s wr, 11 op/s rd, 35 op/s wr

Am 19.08.2024 um 14:54 schrieb Eugen Block:

What happens when you disable snaptrimming entirely?

ceph osd set nosnaptrim

So the load on your cluster seems low, but are the OSDs heavily  
utilized? Have you checked iostat?


Zitat von Giovanna Ratini :


Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have a  
scrub unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for several days:

 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim"

Today, I restarted the MON, MGR, and MDS, but no changes in the growing.

Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting  
anything suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim takes a lot of time, but objects_trimmed is 0:

    "objects_trimmed": 0,
    "snaptrim_duration": 500.5807601752,

That could explain why the queue keeps growing.


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the pg dump. The snap trim queue keeps growing:

Query für PG: 3.12
{
    "snap_trimq":  
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

*    "snap_trimq_len": 5421,*
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query für PG: 3.12
{
    "snap_trimq":  
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,

[ceph-users] Re: Bug with Cephadm module osd service preventing orchestrator start

2024-08-19 Thread Eugen Block

There's a tracker issue for this:

https://tracker.ceph.com/issues/67329

Zitat von Eugen Block :


Hi,

what is the output of this command?

ceph config-key get mgr/cephadm/osd_remove_queue

I just tried to cancel a draining on a small 18.2.4 test cluster, it  
went well, though. After scheduling the drain the mentioned key  
looks like this:


# ceph config-key get mgr/cephadm/osd_remove_queue
[{"osd_id": 1, "started": true, "draining": false, "stopped": false,  
"replace": false, "force": false, "zap": false, "hostname": "host5",  
"original_weight": 0.0233917236328125, "drain_started_at": null,  
"drain_stopped_at": null, "drain_done_at": null,  
"process_started_at": "2024-08-19T07:21:27.783527Z"}, {"osd_id": 13,  
"started": true, "draining": true, "stopped": false, "replace":  
false, "force": false, "zap": false, "hostname": "host5",  
"original_weight": 0.0233917236328125, "drain_started_at":  
"2024-08-19T07:21:30.365237Z", "drain_stopped_at": null,  
"drain_done_at": null, "process_started_at":  
"2024-08-19T07:21:27.794688Z"}]


Here you see the original_weight which the orchestrator failed to  
read, apparently. (Note that there are only small 20 GB OSDs, hence  
the small weight). You probably didn't have the output while the  
OSDs were scheduled for draining, correct? I was able to break my  
cephadm module by injecting that json again (it was already  
completed, hence empty), but maybe I did it incorrectly, not sure yet.


Regards,
Eugen

Zitat von Benjamin Huth :


So about a week and a half ago, I started a drain on an incorrect host. I
fairly quickly realized that it was the wrong host, so I stopped the drain,
canceled the osd deletions with "ceph orch osd rm stop OSD_ID", then
dumped, edited the crush map to properly reweight those osds and host, and
applied the edited crush map. I then proceeded with a full drain of the
correct host and completed that before attempting to upgrade my cluster.

I started the upgrade, and all 3 of my managers were upgraded from 18.2.2
to 18.2.4. At this point, my managers started back up, but with an
orchestrator that had failed to start, so the upgrade was unable to
continue. My cluster is in a stage where only the 3 managers are upgraded
to 18.2.4 and every other part is at 18.2.2

Since my orchestrator is not able to start, I'm unfortunately not able to
run any ceph orch commands as I receive "Error ENOENT: Module not found"
because the cephadm module doesn't load.
Output of ceph versions:
{
   "mon": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 5
   },
   "mgr": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 1
   },
   "osd": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 119
   },
   "mds": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 4
   },
   "overall": {
   "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 129
   }
}

I mentioned in my previous post that I tried manually downgrading the
managers to 18.2.2 because I thought there may be an issue with 18.2.4, but
18.2.2 also has the PR that I believe is causing this (
https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d)
so no luck

Thanks!
(so sorry, I did not reply all so you may have received this twice)

On Sat, Aug 17, 2024 at 2:55 AM Eugen Block  wrote:


Just to get some background information, did you remove OSDs while
performing the upgrade? Or did you start OSD removal and then started
the upgrade? Upgrades should be started with a healthy cluster, but
one can’t guarantee that of course, OSDs and/or entire hosts can
obviously also fail during an upgrade.
Just trying to understand what could cause this (I haven’t upgraded
production clusters to Reef yet, only test clusters). Have you stopped
the upgrade to cancel the process entirely? Can you share this
information please:

ceph versions
ceph orch upgrade status

Zitat von Benjamin Huth :


Just wanted to follow up on this, I am unfortunately still stuck with

this

and can't find where the json for this value is stored. I'm wondering if

I

should attempt to build a manager container  with the code for this
reverted to before the commit that introduced the original_weight

argument.

Please let me know if you guys have any thoughts

Thank you!

On Wed, Aug 14, 2024, 7:37 PM Benjamin Huth 

wrote:



Hey there, so I went to upgrade my ceph from 18.2.2 

[ceph-users] Re: cephadm module fails to load with "got an unexpected keyword argument"

2024-08-19 Thread Eugen Block

Hi,

there's a tracker issue [0] for that. I was assisting with the same  
issue in a different thread [1].


Thanks,
Eugen

[0] https://tracker.ceph.com/issues/67329
[1]  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/SRJPC5ZYTPXF63AKGIIOA2LLLBBWCIT4/
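The workaround that worked there was roughly the following (a sketch only -- back up the key first and double-check the JSON before setting it):

ceph config-key get mgr/cephadm/osd_remove_queue > osd_remove_queue.json
# remove the "original_weight" fields from the JSON
ceph config-key set mgr/cephadm/osd_remove_queue -i osd_remove_queue.json
ceph mgr fail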


Zitat von Alex Sanderson :


Hi everyone,

I recently upgraded from Quincy to Reef v18.2.4 and my dashboard and  
mgr systems have been broken since. Since the upgrade I was slowly  
removing and zapping OSDs that still had the 64k  
"bluestore_bdev_block_size", and decided to have a look at the  
dashboard problem. I restarted the mgrs one at a time and they  
showed up in status as working, but actually the cephadm module  
was failing. The systems were all upgraded to 18 via orch from  
17.2.7 and are running the official docker images.


This is the error message:

debug 2024-08-13T10:08:11.736+ 7fd30dbe0640 -1 mgr load Failed  
to construct class in 'cephadm'
debug 2024-08-13T10:08:11.736+ 7fd30dbe0640 -1 mgr load  
Traceback (most recent call last):

  File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
    self.to_remove_osds.load_from_store()
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 924, in  
load_from_store

    osd_obj = OSD.from_json(osd, rm_util=self.rm_util)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 789, in from_json
    return cls(**inp)
TypeError: __init__() got an unexpected keyword argument 'original_weight'

debug 2024-08-13T10:08:11.736+ 7fd30dbe0640 -1 mgr operator()  
Failed to run module in active mode ('cephadm')


The config-key responsible was mgr/cephadm/osd_remove_queue

This is what it looked like before. After removing the  
original_weight field and setting the key again, the cephadm  
module loads and orch works. It seems like a bug.


[{"osd_id": 89, "started": true, "draining": true, "stopped": false,  
"replace": false, "force": true, "zap": true, "hostname": "goanna",  
"original_weight": 0.930999755859375, "drain_started_at":  
"2024-08-12T13:21:04.458019Z", "drain_stopped_at": null,  
"drain_done_at": null, "process_started_at":  
"2024-08-12T13:20:40.021185Z"}, {"osd_id": 37, "started": true,  
"draining": true, "stopped": false, "replace": false, "force": true,  
"zap": true, "hostname": "gsceph1osd05", "original_weight": 4,  
"drain_started_at": "2024-08-10T06:30:37.569931Z",  
"drain_stopped_at": null, "drain_done_at": null,  
"process_started_at": "2024-08-10T06:30:19.729143Z"}, {"osd_id": 47,  
"started": true, "draining": true, "stopped": false, "replace":  
false, "force": true, "zap": true, "hostname": "gsceph1osd07",  
"original_weight": 4, "drain_started_at":  
"2024-08-10T09:54:49.132830Z", "drain_stopped_at": null,  
"drain_done_at": null, "process_started_at":  
"2024-08-10T09:54:34.367655Z"}]


I thought I should put this out there in case anyone else was having  
a weird issue with a keyword argument problem.  It did not fix the  
problem with the dashboard, still working on that.


Alex



[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-19 Thread Eugen Block

What happens when you disable snaptrimming entirely?

ceph osd set nosnaptrim

So the load on your cluster seems low, but are the OSDs heavily  
utilized? Have you checked iostat?
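For example (device names are just examples):

iostat -x 5 /dev/sdb /dev/sdc

High %util or await on the OSD devices would point at the disks being the bottleneck.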


Zitat von Giovanna Ratini :


Hello Eugen,

*root@kube-master02:~# k ceph -s*

Info: running 'ceph' command with args: [-s]
  cluster:
    id: 3a35629a-6129-4daf-9db6-36e0eda637c7
    health: HEALTH_WARN
    32 pgs not deep-scrubbed in time
    32 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum bx,bz,ca (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 5h), 6 in (since 5d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 4.20M objects, 2.5 TiB
    usage:   7.7 TiB used, 76 TiB / 84 TiB avail
    pgs: 65 active+clean
 20 active+clean+snaptrim_wait
 12 active+clean+snaptrim

  io:
    client:   3.5 MiB/s rd, 3.6 MiB/s wr, 6 op/s rd, 12 op/s wr

If I understand the documentation correctly, I will never have a  
scrub unless the PGs (Placement Groups) are active and clean.


All 32 PGs of the CephFS pool have been in this status for several days:

 * 20 active+clean+snaptrim_wait
 * 12 active+clean+snaptrim"

Today, I restarted the MON, MGR, and MDS, but no changes in the growing.

Am 18.08.2024 um 18:39 schrieb Eugen Block:
Can you share the current ceph status? Are the OSDs reporting  
anything suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim takes a lot of time, but objects_trimmed is 0:

    "objects_trimmed": 0,
    "snaptrim_duration": 500.5807601752,

That could explain why the queue keeps growing.


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the pg dump. The snap trim queue keeps growing:

Query für PG: 3.12
{
    "snap_trimq":  
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

*    "snap_trimq_len": 5421,*
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query für PG: 3.12
{
    "snap_trimq":  
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1,5f875~a1]",

*   "snap_trimq_len": 5741,*
    "state": "active+clean+snaptrim",
    "epoch": 734240,
    "up": [

Do you know a way to see if the snaptrim process is actually working?

Best Regard

Gio


Am 17.08.2024 um 12:59 schrieb Giovanna Ratini:

Hello Eugen,

thank you for your answer.

I restarted all the kube-ceph nodes one after the other. Nothing  
has changed.


ok, I deactivate the snap ... : ceph fs snap-schedule deactivate /

Is there a way to see how many snapshots will be deleted per hour?

Regards,

Gio





Am 17.08.2024 um 10:12 schrieb Eugen Block:

Hi,

have you tried to fail the mgr? Sometimes the PG stats are not  
correct. You could also temporarily disable snapshots to see if  
things settle down.


Zitat von Giovanna Ratini :


Hello all,

We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a  
Kubernetes environment. Last week, we had a problem with the  
MDS falling behind on trimming every 4-5 days (GitHub issue  
link). We resolved the issue using the steps outlined in the  
GitHub issue.


We have 3 hosts (I know, I need to increase this as soon as  
possible, and I will!) and 6 OSDs. After running the commands:


ceph config set mds mds_dir_max_commit_size 80,

ceph fs fail , and

ceph fs set  joinable true,

After that, the sn

[ceph-users] Re: Bug with Cephadm module osd service preventing orchestrator start

2024-08-19 Thread Eugen Block

Hi,

what is the output of this command?

ceph config-key get mgr/cephadm/osd_remove_queue

I just tried to cancel a draining on a small 18.2.4 test cluster, it  
went well, though. After scheduling the drain the mentioned key looks  
like this:


# ceph config-key get mgr/cephadm/osd_remove_queue
[{"osd_id": 1, "started": true, "draining": false, "stopped": false,  
"replace": false, "force": false, "zap": false, "hostname": "host5",  
"original_weight": 0.0233917236328125, "drain_started_at": null,  
"drain_stopped_at": null, "drain_done_at": null, "process_started_at":  
"2024-08-19T07:21:27.783527Z"}, {"osd_id": 13, "started": true,  
"draining": true, "stopped": false, "replace": false, "force": false,  
"zap": false, "hostname": "host5", "original_weight":  
0.0233917236328125, "drain_started_at": "2024-08-19T07:21:30.365237Z",  
"drain_stopped_at": null, "drain_done_at": null, "process_started_at":  
"2024-08-19T07:21:27.794688Z"}]


Here you see the original_weight which the orchestrator failed to  
read, apparently. (Note that there are only small 20 GB OSDs, hence  
the small weight). You probably didn't have the output while the OSDs  
were scheduled for draining, correct? I was able to break my cephadm  
module by injecting that json again (it was already completed, hence  
empty), but maybe I did it incorrectly, not sure yet.


Regards,
Eugen

Zitat von Benjamin Huth :


So about a week and a half ago, I started a drain on an incorrect host. I
fairly quickly realized that it was the wrong host, so I stopped the drain,
canceled the osd deletions with "ceph orch osd rm stop OSD_ID", then
dumped, edited the crush map to properly reweight those osds and host, and
applied the edited crush map. I then proceeded with a full drain of the
correct host and completed that before attempting to upgrade my cluster.

I started the upgrade, and all 3 of my managers were upgraded from 18.2.2
to 18.2.4. At this point, my managers started back up, but with an
orchestrator that had failed to start, so the upgrade was unable to
continue. My cluster is in a stage where only the 3 managers are upgraded
to 18.2.4 and every other part is at 18.2.2

Since my orchestrator is not able to start, I'm unfortunately not able to
run any ceph orch commands as I receive "Error ENOENT: Module not found"
because the cephadm module doesn't load.
Output of ceph versions:
{
"mon": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 5
},
"mgr": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 1
},
"osd": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 119
},
"mds": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 4
},
"overall": {
"ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
reef (stable)": 129
}
}

I mentioned in my previous post that I tried manually downgrading the
managers to 18.2.2 because I thought there may be an issue with 18.2.4, but
18.2.2 also has the PR that I believe is causing this (
https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d)
so no luck

Thanks!
(so sorry, I did not reply all so you may have received this twice)

On Sat, Aug 17, 2024 at 2:55 AM Eugen Block  wrote:


Just to get some background information, did you remove OSDs while
performing the upgrade? Or did you start OSD removal and then started
the upgrade? Upgrades should be started with a healthy cluster, but
one can’t guarantee that of course, OSDs and/or entire hosts can
obviously also fail during an upgrade.
Just trying to understand what could cause this (I haven’t upgraded
production clusters to Reef yet, only test clusters). Have you stopped
the upgrade to cancel the process entirely? Can you share this
information please:

ceph versions
ceph orch upgrade status

Zitat von Benjamin Huth :

> Just wanted to follow up on this, I am unfortunately still stuck with
this
> and can't find where the json for this value is stored. I'm wondering if
I
> should attempt to build a manager container  with the code for this
> reverted to before the commit that introduced the original_weight
argument.
> Please let me know if you guys have any thoughts
>
> Thank you!
>
> On Wed, Aug 14, 2024, 7:37 PM Benjamin Huth 
wrote:
>
>> Hey there, so I went to upgrade my ceph from 18.2.2 to 18.2.4 and have
>> encountered a problem wit

[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-18 Thread Eugen Block
Can you share the current ceph status? Are the OSDs reporting anything  
suspicious? How is the disk utilization?


Zitat von Giovanna Ratini :


More information:

The snaptrim takes a lot of time, but objects_trimmed is 0:

    "objects_trimmed": 0,
    "snaptrim_duration": 500.5807601752,

That could explain why the queue keeps growing.


Am 17.08.2024 um 14:37 schrieb Giovanna Ratini:

Hello again,

I checked the pg dump. The snap trim queue keeps growing:

Query für PG: 3.12
{
    "snap_trimq":  
"[5b974~3b,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1]",

*    "snap_trimq_len": 5421,*
    "state": "active+clean+snaptrim",
    "epoch": 734130,

Query für PG: 3.12
{
    "snap_trimq":  
"[5b976~39,5ba53~1,5ba56~a0,5cc3a~1,5cc3c~1,5cc3e~1,5cc40~1,5cd83~1,5cd85~1,5cd87~1,5cd89~1,5cecc~1,5cece~4,5ced3~2,5cf72~1,5cf74~4,5cf79~a2,5d0b8~1,5d0bb~1,5d0bd~a5,5d1f9~2,5d204~a5,5d349~a7,5d48e~3,5d493~a4,5d5d7~a7,5d71e~a3,5d7c2~3,5d860~1,5d865~4,5d86a~a2,5d9aa~1,5d9ac~1,5d9ae~a5,5daf3~a5,5db9a~2,5dc3a~a5,5dce1~1,5dce3~1,5dd81~a7,5dec8~a7,5e00f~a7,5e156~a8,5e29d~1,5e29f~a7,5e3e6~a8,5e52e~a6,5e5d6~2,5e676~a6,5e71e~2,5e7be~a9,5e907~a5,5e9ad~3,5ea50~a7,5eaf9~1,5eafb~1,5eb99~a7,5ec42~2,5ece2~a7,5ed8a~2,5ee2b~a9,5ef74~a7,5f01c~1,5f0bd~a1,5f15f~1,5f161~1,5f163~1,5f167~1,5f206~a1,5f2a8~1,5f2aa~1,5f2ac~1,5f2ae~1,5f34f~a1,5f3f1~1,5f3f3~1,5f3f5~1,5f3f7~1,5f499~a1,5f53b~1,5f53d~1,5f53f~1,5f541~1,5f5e3~a1,5f685~1,5f687~1,5f689~1,5f68d~1,5f72d~a1,5f7cf~1,5f7d1~1,5f7d3~1,5f875~a1]",

*   "snap_trimq_len": 5741,*
    "state": "active+clean+snaptrim",
    "epoch": 734240,
    "up": [

Do you know a way to see if the snaptrim process is actually working?

Best Regard

Gio


Am 17.08.2024 um 12:59 schrieb Giovanna Ratini:

Hello Eugen,

thank you for your answer.

I restarted all the kube-ceph nodes one after the other. Nothing  
has changed.


ok, I deactivate the snap ... : ceph fs snap-schedule deactivate /

Is there a way to see how many snapshots will be deleted per hour?

Regards,

Gio





Am 17.08.2024 um 10:12 schrieb Eugen Block:

Hi,

have you tried to fail the mgr? Sometimes the PG stats are not  
correct. You could also temporarily disable snapshots to see if  
things settle down.


Zitat von Giovanna Ratini :


Hello all,

We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a  
Kubernetes environment. Last week, we had a problem with the MDS  
falling behind on trimming every 4-5 days (GitHub issue link).  
We resolved the issue using the steps outlined in the GitHub  
issue.


We have 3 hosts (I know, I need to increase this as soon as  
possible, and I will!) and 6 OSDs. After running the commands:


ceph config set mds mds_dir_max_commit_size 80,

ceph fs fail , and

ceph fs set  joinable true,

After that, the snaptrim queue for our PGs has stopped  
decreasing. All PGs of our CephFS are in either  
active+clean+snaptrim_wait or active+clean+snaptrim states. For  
example, the PG 3.12 is in the active+clean+snaptrim state, and  
its snap_trimq_len was 4077 yesterday but has increased to 4538  
today.


I increased the osd_snap_trim_priority to 10 (ceph config set  
osd osd_snap_trim_priority 10), but it didn't help. Only the PGs  
of our CephFS have this problem.


Do you have any ideas on how we can resolve this issue?

Thanks in advance,
Giovanna
p.s. I'm not a ceph expert :-).
Faulkener asked me for more information, so here it is:
MDS Memory: 11GB
mds_cache_memory_limit: 11,811,160,064 bytes

root@kube-master02:~# ceph fs snap-schedule status /
{
    "fs": "rook-cephfs",
    "subvol": null,
    "path": "/",
    "rel_path": "/",
    "schedule": "3h",
    "retention": {"h": 24, "w": 4},
    "start": "2024-05-05T00:00:00",
    "created": "2024-05-05T17:28:18",
    "first": "2024-05-05T18:00:00",
    "last": "2024-08-15T18:00:00",
    "last_pruned": "2024-08-15T18:00:00",
    "created_count": 817,
    "pruned_count": 817,
    "active

[ceph-users] Re: Bug with Cephadm module osd service preventing orchestrator start

2024-08-17 Thread Eugen Block
Just to get some background information, did you remove OSDs while  
performing the upgrade? Or did you start OSD removal and then started  
the upgrade? Upgrades should be started with a healthy cluster, but  
one can’t guarantee that of course, OSDs and/or entire hosts can  
obviously also fail during an upgrade.
Just trying to understand what could cause this (I haven’t upgraded  
production clusters to Reef yet, only test clusters). Have you stopped  
the upgrade to cancel the process entirely? Can you share this  
information please:


ceph versions
ceph orch upgrade status
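For reference, the upgrade itself can be paused or cancelled with:

ceph orch upgrade pause
ceph orch upgrade stop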

Zitat von Benjamin Huth :


Just wanted to follow up on this, I am unfortunately still stuck with this
and can't find where the json for this value is stored. I'm wondering if I
should attempt to build a manager container  with the code for this
reverted to before the commit that introduced the original_weight argument.
Please let me know if you guys have any thoughts

Thank you!

On Wed, Aug 14, 2024, 7:37 PM Benjamin Huth  wrote:


Hey there, so I went to upgrade my ceph from 18.2.2 to 18.2.4 and have
encountered a problem with my managers. After they had been upgraded, my
ceph orch module broke because the cephadm module would not load. This
obviously halted the update because you can't really update without the
orchestrator. Here are the logs related to why the cephadm module fails to
start:

https://pastebin.com/SzHbEDVA

and the relevent part here:

"backtrace": [

" File \\"/usr/share/ceph/mgr/cephadm/module.py\\", line 591, in
__init__\\n self.to_remove_osds.load_from_store()",

" File \\"/usr/share/ceph/mgr/cephadm/services/osd.py\\", line 918, in
load_from_store\\n osd_obj = OSD.from_json(osd, rm_util=self.rm_util)",

" File \\"/usr/share/ceph/mgr/cephadm/services/osd.py\\", line 783, in
from_json\\n return cls(**inp)",

"TypeError: __init__() got an unexpected keyword argument
'original_weight'"

]

Unfortunately, I am at a loss to what passes this the original weight
argument. I have attempted to migrate back to 18.2.2 and successfully
redeployed a manager of that version, but it also has the same issue with
the cephadm module. I believe this may be because I recently started
several OSD drains, then canceled them, causing this to manifest once the
managers restarted.

I went through a good bit of the source and found the module at fault:

https://github.com/ceph/ceph/blob/e0dd396793b679922e487332a2a4bc48e024a42f/src/pybind/mgr/cephadm/services/osd.py#L779

as well as the PR that caused the issue:

https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d

I unfortunately am not familiar enough with the ceph source to find the
ceph-config values I need to delete or smart enough to fix this myself. Any
help would be super appreciated.

Thanks!




[ceph-users] Re: squid release codename

2024-08-17 Thread Eugen Block

Hi,

I just wanted to point out how releases are named:

Each stable release series will receive a name (e.g., ‘Mimic’) and a  
major release number (e.g., 13 for Mimic because ‘M’ is the 13th  
letter of the alphabet).
Releases are named after a species of cephalopod (usually the common  
name, since the latin names are harder to remember or pronounce).


I don’t know how many more (sub)species there are to start over from A  
(the first release was Argonaut) when they reach Z, but we‘ll see. ;-)
So squid is not really (only) a reference to SpongeBob although it  
might have been part of the decision process. :-)



Zitat von Nico Schottelius :


Bike shedding at its best, so I've also to get my paintbrush for a good
place on the shed...

...  that said, naming a *release* of a software with the name of
well known other open source software is pure crazyness.

What's coming next? Ceph Redis? Ceph Apache? Or Apache Ceph?

Seriously, do you really think this is a good idea?

Nico

--
Sustainable and modern Infrastructures by ungleich.ch


[ceph-users] Re: The snaptrim queue of PGs has not decreased for several days.

2024-08-17 Thread Eugen Block

Hi,

have you tried to fail the mgr? Sometimes the PG stats are not  
correct. You could also temporarily disable snapshots to see if things  
settle down.
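i.e. something like:

ceph mgr fail

(or 'ceph mgr fail <active-mgr>' on older releases), which makes a standby take over and often refreshes stale PG statistics.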


Zitat von Giovanna Ratini :


Hello all,

We use Ceph (v18.2.2) and Rook (1.14.3) as the CSI for a Kubernetes  
environment. Last week, we had a problem with the MDS falling behind  
on trimming every 4-5 days (GitHub issue link). We resolved the  
issue using the steps outlined in the GitHub issue.


We have 3 hosts (I know, I need to increase this as soon as  
possible, and I will!) and 6 OSDs. After running the commands:


ceph config set mds mds_dir_max_commit_size 80,

ceph fs fail , and

ceph fs set  joinable true,

After that, the snaptrim queue for our PGs has stopped decreasing.  
All PGs of our CephFS are in either active+clean+snaptrim_wait or  
active+clean+snaptrim states. For example, the PG 3.12 is in the  
active+clean+snaptrim state, and its snap_trimq_len was 4077  
yesterday but has increased to 4538 today.


I increased the osd_snap_trim_priority to 10 (ceph config set osd  
osd_snap_trim_priority 10), but it didn't help. Only the PGs of our  
CephFS have this problem.


Do you have any ideas on how we can resolve this issue?

Thanks in advance,
Giovanna
p.s. I'm not a ceph expert :-).
Faulkener asked me for more information, so here it is:
MDS Memory: 11GB
mds_cache_memory_limit: 11,811,160,064 bytes

root@kube-master02:~# ceph fs snap-schedule status /
{
    "fs": "rook-cephfs",
    "subvol": null,
    "path": "/",
    "rel_path": "/",
    "schedule": "3h",
    "retention": {"h": 24, "w": 4},
    "start": "2024-05-05T00:00:00",
    "created": "2024-05-05T17:28:18",
    "first": "2024-05-05T18:00:00",
    "last": "2024-08-15T18:00:00",
    "last_pruned": "2024-08-15T18:00:00",
    "created_count": 817,
    "pruned_count": 817,
    "active": true
}
I do not understand if the snapshots in the PGs are correlated with  
the snapshots on CephFS. Until we encountered the issue with the  
"MDS falling behind on trimming every 4-5 days," we didn't have any  
problems with snapshots.


Could someone explain me this or send me to the documentation?
Thank you


[ceph-users] Re: Accidentally created systemd units for OSDs

2024-08-17 Thread Eugen Block

Hi,

When things settle down, I *MIGHT* put in a RFE to change the  
default for ceph-volume to --no-systemd to save someone else from  
this anguish.


note that there are still users/operators/admins who don't use  
containers. Changing the ceph-volume default might not be the best  
idea in this case.


Regarding the cleanup, this was the thread [1] Tim was referring to. I  
would set the noout flag, stop an OSD (so the device won't be busy  
anymore), make sure that both ceph-osd@{OSD_ID} and  
ceph-{FSID}@osd.{OSD_ID} are stopped, then double check that everything  
you need is still under /var/lib/ceph/{FSID}/osd.{OSD_ID}, like configs  
and keyrings. Disable the ceph-osd@{OSD_ID} unit (as already pointed out),  
then check if the orchestrator can start the OSD via systemd:


ceph orch daemon start osd.{OSD_ID}

or alternatively, try it manually:

systemctl reset-failed
systemctl start ceph-{FSID}@osd.{OSD_ID}

Watch the log for that OSD to identify any issues. If it works, unset  
the noout flag. You might want to ensure it also works after a reboot,  
though.
I don't think it should be necessary to redeploy the OSDs, but the  
cleanup has to be proper.
As a guidance you can check the cephadm tool's contents and look for  
the "adopt" function. That migrates the contents of the pre-cephadm  
daemons into the FSID specific directories.
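As a rough sketch for one of the OSDs from your output (osd.11 here, adjust the ID and FSID):

ceph osd set noout
systemctl disable --now ceph-osd@11.service
systemctl reset-failed
ceph orch daemon start osd.11     # or: systemctl start ceph-<FSID>@osd.11.service
ceph osd unset noout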


Regards,
Eugen

[1]  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/K2R3MXRD3S2DSXCEGX5IPLCF5L3UUOQI/


Zitat von Dan O'Brien :

OK... I've been in the Circle of Hell where systemd lives and I  
*THINK* I have convinced myself I'm OK. I *REALLY* don't want to  
trash and rebuild the OSDs.


In the manpage for systemd.unit, I found

UNIT GARBAGE COLLECTION
The system and service manager loads a unit's configuration  
automatically when a unit is referenced for the first time. It will  
automatically unload the unit configuration and state again when the  
unit is not needed anymore ("garbage collection").


I've disabled the systemd units (which removes the symlink from the  
target) for the non-cephadm OSDs I created by mistake and I'm PRETTY  
SURE if I wait long enough (or reboot) that I won't see them any  
more, since there won't be a unit for systemd to care about.


I *WILL* have to clean up /var/lib/ceph/osd eventually. I tried just  
now, but it says "device busy." I think that's because there's some  
OTHER systemd cruft that shows a mount:

[root@ceph02 ~]# systemctl --all | grep ceph | grep mount
  var-lib-ceph-osd-ceph\x2d11.mount loadedactive
mounted   /var/lib/ceph/osd/ceph-11
  var-lib-ceph-osd-ceph\x2d25.mountloadedactive
mounted   /var/lib/ceph/osd/ceph-25
  var-lib-ceph-osd-ceph\x2d9.mount  loadedactive
mounted   /var/lib/ceph/osd/ceph-9


When things settle down, I *MIGHT* put in a RFE to change the  
default for ceph-volume to --no-systemd to save someone else from  
this anguish.



[ceph-users] Re: Cephadm Upgrade Issue

2024-08-14 Thread Eugen Block
Of course, I didn’t really think that through. 😄 I believe we had to  
use the workaround to upgrade one mgr manually as you already  
mentioned, and after that all went well. Thanks!
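For reference, the manual workaround is roughly the following (adjust daemon name and image):

ceph orch daemon redeploy mgr.<standby-mgr> <new-image>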


Zitat von Adam King :


If you're referring to https://tracker.ceph.com/issues/57675, it got into
16.2.14, although there was another issue where running a `ceph orch
restart mgr` or `ceph orch redeploy mgr` would cause an endless loop of the
mgr daemons restarting, which would block all operations, that might be
what we were really dealing with here. That didn't have a tracker afaik,
but I believe was fixed by https://github.com/ceph/ceph/pull/41002. That
got into 16.2.4, but if the version being upgraded from was earlier than
this, the issue would have had to have been resolved before upgrade could
actually happen.

On Wed, Aug 14, 2024 at 1:07 PM Eugen Block  wrote:


A few of our customers were affected by that, but as far as I remember
(I can look it up tomorrow), the actual issue popped up if they had
more than two MGRs. But I believe it was resolved in a newer pacific
version (don’t have the exact version in mind). Which version did you
try to upgrade to? There shouldn’t be any reason to remove other
daemons.


Zitat von "Alex Hussein-Kershaw (HE/HIM)" :

> I spotted this: Performing a `ceph orch restart mgr` results in
> endless restart loop | Support |
> SUSE<https://www.suse.com/support/kb/doc/?id=20530>, which
> sounded quite similar, so I gave it a go and did:
>
> ceph orch daemon rm mgr.raynor-sc-1
> < wait a bit for it to be created >
> < repeat for each host >
>
> That seemed to solve my problem. I upgraded and it just worked.
>
> Did get me wondering if I should be doing the same for my monitors
> (and even OSDs) post-adoption? They do seem to have a different
> naming scheme.
>
> Best Wishes,
> Alex
>
>
> 
> From: Alex Hussein-Kershaw (HE/HIM)
> Sent: Wednesday, August 14, 2024 3:06 PM
> To: ceph-users 
> Subject: Cephadm Upgrade Issue
>
> Hi Folks,
>
> I'm prototyping the upgrade process for our Ceph Clusters. I've
> adopted the Cluster following the docs, that works nicely 🙂 I then
> load my docker image into a locally running container registry, as
> I'm in a disconnected environment.  I have a test Cluster with 3 VMs
> and no data, adopted at Octopus and upgrading to Pacific. I'm
> running a MON, MGR, MDS and OSD on each VM.
>
> I then attempt to upgrade:
> ceph orch upgrade start --image localhost:5000/ceph/pacific:v16.2.15
>
> Lots of logs below, but the summary appears to be that we initially
> fail to upgrade the managers and get into a bad state. It looks like
> there is some confusion in manager naming, and we end up with two
> managers on each machine instead of one. Eventually Ceph reports a
> health warning:
>
> $ ceph -s
>   cluster:
> id: e773d9c2-6d8d-4413-8e8f-e38f248f5959
> health: HEALTH_ERR
> 1 failed cephadm daemon(s)
> Module 'cephadm' has failed: 'cephadm'
>
> That does seem to eventually clean itself up and the upgrade
> appears to have completed ("ceph versions" shows everything on
> Pacific), but it feels a bit bumpy. Hoping someone has some guidance
> here. The containers on one host during upgrade are shown below.
> Notice I somehow have two managers, where the names are a single
> character different (a "-" replaced with a "."):
>
> $ docker ps | grep mgr
> 2143b6f0e0e6   localhost:5000/ceph/pacific:v16.2.15
> "/usr/bin/ceph-mgr -…"   About a minute ago   Up About a minute
>ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr.raynor-sc-2
> 59c8cfddac64   ceph-daemon:v5.0.12-stable-5.0-octopus-centos-8
> "/usr/bin/ceph-mgr -…"   14 minutes ago   Up 14 minutes
>ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr-raynor-sc-2
>
> In the output of "ceph -w" I see this sort of stuff:
>
> 2024-08-14T13:45:13.003405+ mon.raynor-sc-1 [INF] Manager daemon
> raynor-sc-3 is now available
> 2024-08-14T13:45:23.179699+ mon.raynor-sc-1 [ERR] Health check
> failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
> 2024-08-14T13:45:22.372376+ mgr.raynor-sc-3 [ERR] Unhandled
> exception from module 'cephadm' while running on mgr.raynor-sc-3:
> 'cephadm'
> 2024-08-14T13:45:24.761961+ mon.raynor-sc-1 [INF] Active manager
> daemon raynor-sc-3 restarted
> 2024-08-14T13:45:24.766395+ mon.raynor-sc-1 [INF] Activating
> manager daemon raynor-sc-3
> 2024-08-14T13:45:31.800989+ mon.raynor-sc-1 [INF] Manager daemon
> raynor-sc-3 is now available
> 2024-08-14T13:45:32.874227+

[ceph-users] Re: Cephadm Upgrade Issue

2024-08-14 Thread Eugen Block
A few of our customers were affected by that, but as far as I remember  
(I can look it up tomorrow), the actual issue popped up if they had  
more than two MGRs. But I believe it was resolved in a newer pacific  
version (don’t have the exact version in mind), which version did you
try to upgrade to? There shouldn’t be any reason to remove other  
daemons.



Zitat von "Alex Hussein-Kershaw (HE/HIM)" :

I spotted this: Performing a `ceph orch restart mgr` results in  
endless restart loop | Support |  
SUSE, which  
sounded quite similar, so I gave it a go and did:


ceph orch daemon rm mgr.raynor-sc-1
< wait a bit for it to be created >
< repeat for each host >

That seemed to solve my problem. I upgraded and it just worked.

Did get me wondering if I should be doing the same for my monitors  
(and even OSDs) post-adoption? They do seem to have a different  
naming scheme.


Best Wishes,
Alex



From: Alex Hussein-Kershaw (HE/HIM)
Sent: Wednesday, August 14, 2024 3:06 PM
To: ceph-users 
Subject: Cephadm Upgrade Issue

Hi Folks,

I'm prototyping the upgrade process for our Ceph Clusters. I've  
adopted the Cluster following the docs; that works nicely 🙂 I then
load my docker image into a locally running container registry, as  
I'm in a disconnected environment.  I have a test Cluster with 3 VMs  
and no data, adopted at Octopus and upgrading to Pacific. I'm  
running a MON, MGR, MDS and OSD on each VM.


I then attempt to upgrade:
ceph orch upgrade start --image localhost:5000/ceph/pacific:v16.2.15

Lots of logs below, but the summary appears to be that we initially  
fail to upgrade the managers and get into a bad state. It looks like  
there is some confusion in manager naming, and we end up with two  
managers on each machine instead of one. Eventually Ceph reports a  
health warning:


$ ceph -s
  cluster:
id: e773d9c2-6d8d-4413-8e8f-e38f248f5959
health: HEALTH_ERR
1 failed cephadm daemon(s)
Module 'cephadm' has failed: 'cephadm'

That does seem to eventually clean itself up and the upgrade
appears to have completed ("ceph versions" shows everything on  
Pacific), but it feels a bit bumpy. Hoping someone has some guidance  
here. The containers on one host during upgrade are shown below.  
Notice I somehow have two managers, where the names are a single  
character different (a "-" replaced with a "."):


$ docker ps | grep mgr
2143b6f0e0e6   localhost:5000/ceph/pacific:v16.2.15   
"/usr/bin/ceph-mgr -…"   About a minute ago   Up About a minute   
   ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr.raynor-sc-2
59c8cfddac64   ceph-daemon:v5.0.12-stable-5.0-octopus-centos-8
"/usr/bin/ceph-mgr -…"   14 minutes ago   Up 14 minutes   
   ceph-e773d9c2-6d8d-4413-8e8f-e38f248f5959-mgr-raynor-sc-2


In the output of "ceph -w" I see this sort of stuff:

2024-08-14T13:45:13.003405+ mon.raynor-sc-1 [INF] Manager daemon  
raynor-sc-3 is now available
2024-08-14T13:45:23.179699+ mon.raynor-sc-1 [ERR] Health check  
failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
2024-08-14T13:45:22.372376+ mgr.raynor-sc-3 [ERR] Unhandled  
exception from module 'cephadm' while running on mgr.raynor-sc-3:  
'cephadm'
2024-08-14T13:45:24.761961+ mon.raynor-sc-1 [INF] Active manager  
daemon raynor-sc-3 restarted
2024-08-14T13:45:24.766395+ mon.raynor-sc-1 [INF] Activating  
manager daemon raynor-sc-3
2024-08-14T13:45:31.800989+ mon.raynor-sc-1 [INF] Manager daemon  
raynor-sc-3 is now available
2024-08-14T13:45:32.874227+ mon.raynor-sc-1 [INF] Health check  
cleared: MGR_MODULE_ERROR (was: Module 'cephadm' has failed:  
'cephadm')

2024-08-14T13:45:32.874269+ mon.raynor-sc-1 [INF] Cluster is now healthy
2024-08-14T13:45:33.664602+ mon.raynor-sc-1 [INF] Active manager  
daemon raynor-sc-3 restarted
2024-08-14T13:45:33.671809+ mon.raynor-sc-1 [INF] Activating  
manager daemon raynor-sc-3
2024-08-14T13:45:34.050292+ mon.raynor-sc-1 [INF] Manager daemon  
raynor-sc-3 is now available
2024-08-14T13:45:38.260385+ mon.raynor-sc-1 [WRN] Health check  
failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2024-08-14T13:45:43.462665+ mgr.raynor-sc-3 [ERR] Unhandled  
exception from module 'cephadm' while running on mgr.raynor-sc-3:  
'cephadm'
2024-08-14T13:45:44.770711+ mon.raynor-sc-1 [ERR] Health check  
failed: Module 'cephadm' has failed: 'cephadm' (MGR_MODULE_ERROR)
2024-08-14T13:45:45.668379+ mon.raynor-sc-1 [INF] Active manager  
daemon raynor-sc-3 restarted
2024-08-14T13:45:45.673206+ mon.raynor-sc-1 [INF] Activating  
manager daemon raynor-sc-3
2024-08-14T13:45:45.673316+ mon.raynor-sc-1 [INF] Active manager  
daemon raynor-sc-3 restarted
2024-08-14T13:45:45.689515+ mon.raynor-sc-1 [INF] Active manager  
daemon raynor-sc-3 restarted
2024-08-14T13:45:45.694315+ mon.raynor-sc

[ceph-users] Re: Ceph Logging Configuration and "Large omap objects found"

2024-08-14 Thread Eugen Block
Hm, then I don't see any other way than to scan each OSD host for the
omap message. Do you have a centralized logging or some configuration  
management like salt where you can target all hosts with a command?
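
Something along these lines could do, as a sketch; the host names, log locations and the exact log wording are assumptions, and containerized OSDs may log to journald instead of /var/log/ceph:

# try the cluster log first with a generously large line count
ceph log last 50000 debug cluster | grep -i 'large omap'

# fall back to grepping the OSD hosts directly
for h in osd-host1 osd-host2 osd-host3; do
    ssh "$h" "grep -ri 'large omap object' /var/log/ceph/ 2>/dev/null | tail -n 3"
done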


Zitat von Janek Bevendorff :

Thanks. I increased the number even further and got a (literal)  
handful of non-debug messages. Unfortunately, none were relevant for  
the problem I'm trying to debug.



On 13/08/2024 14:03, Eugen Block wrote:
Interesting, apparently the number one provides in the 'ceph log  
last ' command is not the number of lines to display but the  
number of lines to search for a match.
So in your case you should still see your osd log output about the  
large omap if you pick a large enough number. My interpretation was  
that the number of lines you provide is the number of lines to be  
displayed in the selected log level. This needs to be documented.


Zitat von Eugen Block :

I just played a bit more with the 'ceph log last' command; it
doesn't have a large retention time, the messages get cleared out
quickly, I suppose because they haven't changed. I'll take a
closer look at whether and how that can be handled properly.



Zitat von Janek Bevendorff :



That's where the 'ceph log last' commands should help you out,  
but I don't know why you don't see it, maybe increase the number  
of lines to display or something?


BTW, which ceph version are we talking about here?


reef.

I tried ceph log last 100 debug cluster and that gives me the  
usual DBG spam that I otherwise see in the MON logs. But there  
are no messages above that level.





--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Identify laggy PGs

2024-08-14 Thread Eugen Block

Hi Boris,


 PGs are roughly 35 GB.


that's not huge. You wrote you drained one OSD which helped with the  
flapping, so you don't have flapping OSDs anymore at all?

If you have identified problematic PGs, you can get the OSD mapping like this:

ceph pg map 26.7
osdmap e14121 pg 26.7 (26.7) -> up [2,5,8] acting [2,5,8]
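
To get an overview instead of checking PGs one by one, something like the sketch below could work; it assumes the pgs_brief columns are PG_STAT, STATE, UP, UP_PRIMARY, ACTING, ACTING_PRIMARY, which may differ slightly between releases:

# count how often each OSD shows up as acting primary of a laggy PG
ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /laggy/ {print $NF}' | sort | uniq -c | sort -rn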


Just out of curiosity, I've checked my PG size, which is around 150 GB; when are
we talking about big PGs?


That depends on a couple of factors. Just one example from a customer  
cluster: They had 240 HDD OSDs with roughly 1 PB a couple of months  
ago, which resulted in PG sizes around 400 GB. This led to very long  
deep-scrubs, utilizing the OSDs for quite some time, with a noticeable
impact on their application. This was not only a performance issue but
also a balancing issue: if the OSDs deviate by only 5 PGs, that makes a
difference of 2 TB; on 8 TB disks that is 25%, which can quickly bring
the OSDs to or above 85% usage.
That's why we quadrupled the PGs for the main pool which improved  
balancing a lot, and deep-scrubs per PG also run much faster with the  
cost of having more PGs to scrub, of course.



Zitat von "Szabo, Istvan (Agoda)" :

Just out of curiosity, I've checked my PG size, which is around 150 GB; when are
we talking about big PGs?

____
From: Eugen Block 
Sent: Wednesday, August 14, 2024 2:23 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Identify laggy PGs

Email received from the internet. If in doubt, don't click any link  
nor open any attachment !



Hi,

how big are those PGs? If they're huge and are deep-scrubbed, for
example, that can cause significant delays. I usually look at 'ceph pg
ls-by-pool {pool}' and the "BYTES" column.

Zitat von Boris :


Hi,

currently we encounter laggy PGs and I would like to find out what is
causing it.
I suspect it might be one or more failing OSDs. We had flapping OSDs and I
synced one out, which helped with the flapping, but it doesn't help with
the laggy ones.

Any tooling to identify or count PG performance and map that to OSDs?


--
The self-help group "UTF-8 problems" will, as an exception, meet in the
large hall this time.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended  
recipient(s). It may also be privileged or otherwise protected by  
copyright or other legal rules. If you have received it by mistake  
please let us know by reply email and delete it from your system. It  
is prohibited to copy this message or disclose its content to  
anyone. Any confidentiality or privilege is not waived or lost by  
any mistaken delivery or unauthorized disclosure of the message. All  
messages sent to and from Agoda may be monitored to ensure  
compliance with company policies, to protect the company's interests  
and to remove potential malware. Electronic messages may be  
intercepted, amended, lost or deleted, or contain viruses.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore issue using 18.2.2

2024-08-14 Thread Eugen Block

Hi Frank,

you may be right about the checksums, but I just wanted to point out  
the risks of having size 2 pools in general. Since there was no  
response to the thread yet, I wanted to bump it a bit.


Zitat von Frank Schilder :


Hi Eugen,

isn't every shard/replica on every OSD read and written with a  
checksum? Even if only the primary holds a checksum, it should be  
possible to identify the damaged shard/replica during deep-scrub  
(even for replication 1).
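
For what it's worth, once a scrub flags a PG as inconsistent, the per-shard errors can usually be listed like this (a sketch; the pool name and pgid are placeholders):

rados list-inconsistent-pg <pool>                       # which PGs of a pool are inconsistent
rados list-inconsistent-obj 2.3 --format=json-pretty    # which shard/OSD reported the errors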


Apart from that, it is unusual to see a virtual disk have  
read-errors. If it's some kind of pass-through mapping, there is
probably something incorrectly configured with a write cache. Still,  
this would only be a problem if the VM dies unexpectedly. There is  
something off with the setup (unless the underlying hardware device  
for the VDs does actually have damage).


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Wednesday, August 14, 2024 9:05 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Bluestore issue using 18.2.2

Hi,

it looks like you're using size 2 pool(s), I strongly advise to
increase that to 3 (and min_size = 2). Although it's unclear why the
PGs get damaged, the repair of a PG with only two replicas is
difficult: which one is the correct one? So to avoid that, avoid pools
with size 2, except for tests and if you don't care about the data.
If you want to use the current situation to learn, you could try to
inspect the PGs with the ceph-objectstore-tool and find out which
replica is the correct one, export it and then inject it into the OSD.
But this can be tricky, of course.

Regards,
Eugen

Zitat von Marianne Spiller :


Hi,

I am trying to gather experience on a Ceph STAGE cluster; it
consists of virtual machines - which is not perfect, I know. The VMs
are running Debian 12 and podman-4.3.1. There is practically no load
on this Ceph - there is just one client using the storage, and it
makes no noise. So this is what happened:

* "During data consistency checks (scrub), at least one PG has been
flagged as being damaged or inconsistent."
* so I listed them (["2.3","2.58"])
* and tried to repair ("ceph pg repair 2.3", "ceph pg repair 2.58")
* they both went well (resulting in "pgs: 129 active+clean"), but
the cluster kept its "HEALTH_WARN" state ("Too many repaired reads
on 1 OSDs")
* so I googled for this message; and the only thing I found was to
restart the OSD to get rid of this message and - more important -
the cluster WARN state ("ceph orch daemon restart osd.3")
* after the restart, my cluster was still in WARN state - and
complained about "2 PGs has been flagged as being damaged or
inconsistent" - but other PGs on other OSDs
* I "ceph pg repair"ed them, too, and the cluster's state was WARN
afterwards, again ("Too many repaired reads on 1 OSDs")
* when I restarted the OSD ("ceph orch daemon restart osd.2"), the
crash occured; Ceph marked this OSD "down" and "out" and suspected a
hardware issue, while the OSD HDDs in fact are QEMU "harddisks"
* I can't judge whether it's a serious bug or just due to my
non-optimal STAGE setup, so I'll attach the gzipped log of osd.2

I need help to understand what happened and how to prevent it in the
future. What is this "Too many repaired reads" and how to deal with
it?

Thanks a lot for reading,
  Marianne



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Identify laggy PGs

2024-08-14 Thread Eugen Block

Hi,

how big are those PGs? If they're huge and are deep-scrubbed, for  
example, that can cause significant delays. I usually look at 'ceph pg  
ls-by-pool {pool}' and the "BYTES" column.
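
A small sketch for listing the largest PGs of a pool (the pool name is a placeholder, and it assumes the JSON output of 'ceph pg ls-by-pool' exposes pg_stats[].stat_sum.num_bytes):

ceph pg ls-by-pool mypool -f json | \
  jq -r '.pg_stats[] | "\(.pgid) \(.stat_sum.num_bytes)"' | sort -k2 -rn | head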


Zitat von Boris :


Hi,

currently we encounter laggy PGs and I would like to find out what is
causing it.
I suspect it might be one or more failing OSDs. We had flapping OSDs and I
synced one out, which helped with the flapping, but it doesn't help with
the laggy ones.

Any tooling to identify or count PG performance and map that to OSDs?


--
The self-help group "UTF-8 problems" will, as an exception, meet in the
large hall this time.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All MDS's Crashed, Failed Assert

2024-08-14 Thread Eugen Block

Hi,

have you checked the MDS journal for any damage (replace {CEPHFS} with  
the name of your filesystem)?


cephfs-journal-tool --rank={CEPHFS}:all journal inspect
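
If the inspect does report damage, the documented disaster-recovery sequence looks roughly like the sketch below. Treat it as a last resort: take a backup first, and ideally get confirmation on the list before resetting anything.

cephfs-journal-tool --rank={CEPHFS}:0 journal export /root/mds-journal-backup.bin
cephfs-journal-tool --rank={CEPHFS}:0 event recover_dentries summary
cephfs-journal-tool --rank={CEPHFS}:0 journal reset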

Zitat von m...@silvenga.com:

I'm looking for guidance around how to recover after all MDS  
continue to crash with a failed assert during journal replay (no MON  
damage).


Context:

So I've been working through failed MDS for the past day, likely  
caused by a large snaptrim operation that caused the cluster to  
grind to a halt.


After evicting all clients and restarting the MDSs (it appears the
clients were overwhelming them), the MDSs are failing to start
with:


debug -1> 2024-07-24T18:44:52.674+ 7f7878c22700 -1  
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osdc/Journaler.cc: In function 'bool Journaler::try_read_entry(ceph::bufferlist&)' thread 7f7878c22700 time  
2024-07-24T18:44:52.676027+
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osdc/Journaler.cc: 1256: FAILED ceph_assert(start_ptr ==  
read_pos)

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x135) [0x7f788aa32e15]

 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f788aa32fdb]
 3: (Journaler::try_read_entry(ceph::buffer::v15_2_0::list&)+0x132)  
[0x5847ef32]

 4: (MDLog::_replay_thread()+0xda) [0x58436bea]
 5: (MDLog::ReplayThread::entry()+0x11) [0x580e52d1]
 6: /lib64/libpthread.so.0(+0x81ca) [0x7f78897d81ca]
 7: clone()
debug  0> 2024-07-24T18:44:52.674+ 7f7878c22700 -1 ***  
Caught signal (Aborted) **

 in thread 7f7878c22700 thread_name:md_log_replay
 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
 1: /lib64/libpthread.so.0(+0x12d20) [0x7f78897e2d20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x18f) [0x7f788aa32e6f]

 5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f788aa32fdb]
 6: (Journaler::try_read_entry(ceph::buffer::v15_2_0::list&)+0x132)  
[0x5847ef32]

 7: (MDLog::_replay_thread()+0xda) [0x58436bea]
 8: (MDLog::ReplayThread::entry()+0x11) [0x580e52d1]
 9: /lib64/libpthread.so.0(+0x81ca) [0x7f78897d81ca]
 10: clone()

Normally, three MDSs are deployed, one active and one on hot
standby. The cluster seems to believe that any restarted MDS is
attempting to replay, but systemd reports an immediate crash with
a SIGABRT.


ceph mds stat
cephfs:1/1 {0=cephfs.sm1.esxjag=up:replay(laggy or crashed)}

Redeployed MDSs also continue to crash (suggesting a bad journal?)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore issue using 18.2.2

2024-08-14 Thread Eugen Block

Hi,

it looks like you're using size 2 pool(s), I strongly advise to  
increase that to 3 (and min_size = 2). Although it's unclear why the  
PGs get damaged, the repair of a PG with only two replicas is
difficult: which one is the correct one? So to avoid that, avoid pools
with size 2, except for tests and if you don't care about the data.
If you want to use the current situation to learn, you could try to  
inspect the PGs with the ceph-objectstore-tool and find out which  
replica is the correct one, export it and then inject it into the OSD.  
But this can be tricky, of course.
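
A rough sketch of the commands involved (pool name, OSD ids, pgid and paths are placeholders; on containerized deployments ceph-objectstore-tool has to be run inside a cephadm shell or the OSD container, and the OSD must be stopped first):

ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 2.3 --op list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 2.3 --op export --file /tmp/pg.2.3.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op import --file /tmp/pg.2.3.export

# and, if available in your release, the "Too many repaired reads" warning
# can be cleared per OSD once you're confident the data is fine:
ceph tell osd.2 clear_shards_repaired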


Regards,
Eugen

Zitat von Marianne Spiller :


Hi,

I am trying to gather experience on a Ceph STAGE cluster; it  
consists of virtual machines - which is not perfect, I know. The VMs  
are running Debian 12 and podman-4.3.1. There is practically no load  
on this Ceph - there is just one client using the storage, and it  
makes no noise. So this is what happened:


* "During data consistency checks (scrub), at least one PG has been  
flagged as being damaged or inconsistent."

* so I listed them (["2.3","2.58"])
* and tried to repair ("ceph pg repair 2.3", "ceph pg repair 2.58")
* they both went well (resulting in "pgs: 129 active+clean"), but  
the cluster kept its "HEALTH_WARN" state ("Too many repaired reads
on 1 OSDs")
* so I googled for this message; and the only thing I found was to  
restart the OSD to get rid of this message and - more important -  
the cluster WARN state ("ceph orch daemon restart osd.3")
* after the restart, my cluster was still in WARN state - and  
complained about "2 PGs has been flagged as being damaged or  
inconsistent" - but other PGs on other OSDs
* I "ceph pg repair"ed them, too, and the cluster's state was WARN  
afterwards, again ("Too many repaired reads on 1 OSDs")
* when I restarted the OSD ("ceph orch daemon restart osd.2"), the  
crash occured; Ceph marked this OSD "down" and "out" and suspected a  
hardware issue, while the OSD HDDs in fact are QEMU "harddisks"
* I can't judge whether it's a serious bug or just due to my  
non-optimal STAGE setup, so I'll attach the gzipped log of osd.2


I need help to understand what happened and how to prevent it in the  
future. What is this "Too many repaired reads" and how to deal with
it?


Thanks a lot for reading,
  Marianne



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading RGW before cluster?

2024-08-13 Thread Eugen Block

Hi Thomas,

I agree, from my point of view this shouldn't be an issue. And  
although I usually stick to the documented process, especially with  
products like SUSE Enterprise Storage (which was decommissioned),  
there are/were customers who had services colocated, for example MON,  
MGR and RGW on the same nodes. Before cephadm, when they upgraded the
first MON node, they automatically upgraded the RGW as well, of course.
And I haven't seen any issues with that, but maybe other  
admins/operators have.
I assume you have multiple (dedicated) RGWs running, so you can  
upgrade only one and see if it still works properly after the upgrade,  
then move on to the next if it does.
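
As a sketch of what that could look like for package-based gateways (host, package and service names are assumptions and depend on the distribution; rgw1 is a placeholder):

ssh rgw1 'yum update -y ceph-radosgw && systemctl restart ceph-radosgw.target'
ceph versions                                                 # confirm the mixed-version picture (rgw section)
curl -s -o /dev/null -w '%{http_code}\n' http://rgw1:8080/    # trivial liveness check, port is an assumption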


Regards,
Eugen


Zitat von Thomas Byrne - STFC UKRI :


Hi all,

The Ceph documentation has always recommended upgrading RGWs last  
when doing an upgrade. Is there a reason for this? As they're mostly
just RADOS clients you could imagine the order doesn't matter as  
long as the cluster and RGW major versions are compatible. Our basic  
testing has shown no obvious issues (with Pacific RGWs and a  
Nautilus cluster FWIW).


I'm asking because in our case it would be handy to upgrade our  
gateway infrastructure first, not for any new RGW features, but just  
for scheduling the operations.


Is this a terrible idea?

If it helps, this is the first step in a Nautilus -> Pacific -> Reef  
plan, with no cephadm.


Thanks,
Tom
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

