[ceph-users] Querying the most recent snapshot

2023-09-21 Thread Dominique Ramaekers
Hi,

A question, to avoid using an overly elaborate method for finding the most
recent snapshot of an RBD image.

So, what would be the preferred way to find the latest snapshot of this image?

root@hvs001:/# rbd snap ls libvirt-pool/CmsrvDOM2-MULTIMEDIA
SNAPID  NAME    SIZE     PROTECTED  TIMESTAMP
   223  snap_5  435 GiB  yes        Fri Sep 15 15:33:39 2023
   262  snap_1  435 GiB  yes        Mon Sep 18 15:39:36 2023
   280  snap_3  435 GiB  yes        Wed Sep 20 15:39:42 2023

I would tend to select the highest snapid. But at some point, the next snapid 
will restart at 1? So maybe not the best idea.

I could select by date/time but I don't have an easy way to convert the text
string to a timestamp...

I've looked at 'rbd help snap ls'; there seems to be no way to sort, or to
format the timestamp output...
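
One sketch I'm considering (assuming jq and GNU date are available, and that
the JSON output carries the same timestamp text as the table above - these
are my assumptions, not something I've verified):

rbd snap ls libvirt-pool/CmsrvDOM2-MULTIMEDIA --format json \
  | jq -r '.[] | "\(.timestamp)\t\(.name)"' \
  | while IFS=$'\t' read -r ts name; do
      # convert the human-readable timestamp to epoch seconds
      printf '%s\t%s\n' "$(date -d "$ts" +%s)" "$name"
    done \
  | sort -n | tail -n 1 | cut -f2

It works on my test box but feels clumsy.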

Any advice will be appreciated.

Greetings,

Dominique.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Join us for the User + Dev Relaunch, happening this Thursday!

2023-09-21 Thread Laura Flores
Hi Ceph users and developers,

Big thanks to Cory Snyder and Jonas Sterr for sharing your insights with an
audience of 50+ users and developers!

Cory shared some valuable troubleshooting tools and tricks that would be
helpful for anyone interested in gathering good debugging info.
See his presentation slides here:
https://github.com/ljflores/ceph_user_dev_monthly_meeting/blob/main/user_dev_meeting_2023_09_21_cory_snyder.pdf

Jonas presented some ideas for how Ceph can cater more effectively to
beginning Ceph users, such as highlighting "must-know" topics in our
documentation (what does a mon/mgr/mds service do, how to create a pool,
how to integrate new disks, etc.)
See his presentation slides here:
https://github.com/ljflores/ceph_user_dev_monthly_meeting/blob/main/user_dev_meeting_2023_09_21_jonas_sterr.pdf

If you'd like to be an upcoming speaker, we invite you to submit a focus
topic to this Google form:
https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4vJDGBrp6d-D3-BlQ/viewform?usp=sf_link

See you all next month!

- Laura

On Tue, Sep 19, 2023 at 1:07 PM Laura Flores  wrote:

> Hi Ceph users and developers,
>
> We invite you to join us at the User + Dev Relaunch, happening this
> Thursday at 10:00 AM EST! See below for more meeting details. Also see this
> blog post to read more about the relaunch:
> https://ceph.io/en/news/blog/2023/user-dev-meeting-relaunch/
>
> We have two guest speakers who will present their focus topics during the
> first 40 minutes of the session:
>
>1. "What to do when Ceph isn't Ceph-ing" by Cory Snyder
>Topics include troubleshooting tips, effective ways to gather help
> from the community, ways to improve cluster health and insights, and more!
>
>2. "Ceph Usability Improvements" by Jonas Sterr
>A continuation of a talk from Cephalocon 2023, updated after trying
> out the Reef Dashboard.
>
> The last 20 minutes of the meeting will be dedicated to open discussion.
> Feel free to add questions for the speakers or additional topics under the
> "Open Discussion" section on the agenda:
> https://pad.ceph.com/p/ceph-user-dev-monthly-minutes
>
> If you have an idea for a focus topic you'd like to present at a future
> meeting, you are welcome to submit it to this Google Form:
> https://docs.google.com/forms/d/e/1FAIpQLSdboBhxVoBZoaHm8xSmeBoemuXoV_rmh4vJDGBrp6d-D3-BlQ/viewform?usp=sf_link
> Any Ceph user or developer is eligible to submit!
>
> Thanks,
> Laura Flores
>
> Meeting link: https://meet.jit.si/ceph-user-dev-monthly
>
> Time conversions:
> UTC:   Thursday, September 21, 14:00 UTC
> Mountain View, CA, US: Thursday, September 21,  7:00 PDT
> Phoenix, AZ, US:   Thursday, September 21,  7:00 MST
> Denver, CO, US:Thursday, September 21,  8:00 MDT
> Huntsville, AL, US:Thursday, September 21,  9:00 CDT
> Raleigh, NC, US:   Thursday, September 21, 10:00 EDT
> London, England:   Thursday, September 21, 15:00 BST
> Paris, France: Thursday, September 21, 16:00 CEST
> Helsinki, Finland: Thursday, September 21, 17:00 EEST
> Tel Aviv, Israel:  Thursday, September 21, 17:00 IDT
> Pune, India:   Thursday, September 21, 19:30 IST
> Brisbane, Australia:   Friday, September 22,  0:00 AEST
> Singapore, Asia:   Thursday, September 21, 22:00 +08
> Auckland, New Zealand: Friday, September 22,  2:00 NZST
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Software Engineer, Ceph Storage 
>
> Chicago, IL
>
> lflo...@ibm.com | lflo...@redhat.com 
> M: +17087388804
>
>
>

-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Sudhin Bengeri
Igor, Travis,

Thanks for your attention to this issue.

We extended the timeout for the liveness probe yesterday, and also extended
the time after which a down OSD deployment is deleted by the operator. Once
all the OSD deployments were recreated by the operator, we observed two OSD
restarts - which is a much lower rate than earlier.

Igor, we are still working on piecing together logs (from our log store)
before the OSD restarts and will send them shortly.

Thanks.
Sudhin




On Thu, Sep 21, 2023 at 3:12 PM Travis Nielsen  wrote:

> If there is nothing obvious in the OSD logs such as failing to start, and
> if the OSDs appear to be running until the liveness probe restarts them,
> you could disable or change the timeouts on the liveness probe. See
> https://rook.io/docs/rook/latest/CRDs/Cluster/ceph-cluster-crd/#health-settings
> .
>
> But of course, we need to understand if there is some issue with the OSDs.
> Please open a Rook issue if it appears related to the liveness probe.
>
> Travis
>
> On Thu, Sep 21, 2023 at 3:12 AM Igor Fedotov 
> wrote:
>
>> Hi!
>>
>> Can you share OSD logs demonstrating such a restart?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 20/09/2023 20:16, sbeng...@gmail.com wrote:
>> > Since upgrading to 18.2.0 , OSDs are very frequently restarting due to
>> livenessprobe failures making the cluster unusable. Has anyone else seen
>> this behavior?
>> >
>> > Upgrade path: ceph 17.2.6 to 18.2.0 (and rook from 1.11.9 to 1.12.1)
>> > on ubuntu 20.04 kernel 5.15.0-79-generic
>> >
>> > Thanks.
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Travis Nielsen
If there is nothing obvious in the OSD logs such as failing to start, and
if the OSDs appear to be running until the liveness probe restarts them,
you could disable or change the timeouts on the liveness probe. See
https://rook.io/docs/rook/latest/CRDs/Cluster/ceph-cluster-crd/#health-settings
.
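
As a rough sketch (not a drop-in config -- the exact fields and defaults
depend on your Rook version, so please verify against the doc above), the
override in the CephCluster spec could look something like:

  healthCheck:
    livenessProbe:
      osd:
        disabled: false          # or true to turn the probe off entirely
        probe:
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5      # raise this if the OSDs respond slowly
          failureThreshold: 3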

But of course, we need to understand if there is some issue with the OSDs.
Please open a Rook issue if it appears related to the liveness probe.

Travis

On Thu, Sep 21, 2023 at 3:12 AM Igor Fedotov  wrote:

> Hi!
>
> Can you share OSD logs demonstrating such a restart?
>
>
> Thanks,
>
> Igor
>
> On 20/09/2023 20:16, sbeng...@gmail.com wrote:
> > Since upgrading to 18.2.0 , OSDs are very frequently restarting due to
> livenessprobe failures making the cluster unusable. Has anyone else seen
> this behavior?
> >
> > Upgrade path: ceph 17.2.6 to 18.2.0 (and rook from 1.11.9 to 1.12.1)
> > on ubuntu 20.04 kernel 5.15.0-79-generic
> >
> > Thanks.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW External IAM Authorization

2023-09-21 Thread Seena Fallah
Hi Community,

I recently proposed a new authorization mechanism for RGW that can let the
RGW daemon ask an external service to authorize a request based on AWS S3
IAM tags (that means the external service would receive the same env as an
IAM policy doc would have to evaluate the policy).
You can find the documentation of the implementation here:
https://github.com/clwluvw/ceph/blob/rgw-external-iam/doc/radosgw/external-iam.rst
And the PR here: https://github.com/ceph/ceph/pull/53345

We would love to hear feedback if anyone else feels this would be a need
for them and what you would think about the APIs.

Best,
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: millions of hex 80 0_0000 omap keys in single index shard for single bucket

2023-09-21 Thread Christopher Durham
 Casey,
What I will probably do is:
1. stop usage of that bucket
2. wait a few minutes to allow anything to replicate, and verify object count, etc.
3. bilog trim
After #3 I will see if any of the '/' objects still exist.


Hopefully that will help. I now know what to look for to see if I can narrow 
things down as soon as the issue starts, and maybe garner more data to track 
this down further. We'll see.

-chris

On Thursday, September 21, 2023 at 11:17:51 AM MDT, Casey Bodley 
 wrote:  
 
 On Thu, Sep 21, 2023 at 12:21 PM Christopher Durham  wrote:
>
>
> Hi Casey,
>
> This is indeed a multisite setup. The other side shows that for
>
> # radosgw-admin sync status
>
> the oldest incremental change not applied is about a minute old, and that is 
> consistent over a number of minutes, always the oldest incremental change a 
> minute or two old.
>
> However:
>
> # radosgw-admin bucket sync status --bucket bucket-in-question
>
> shows a number of shards always behind, although it varies.
>
> The number of objects on each side in that bucket is close, and  to this 
> point I have attributed that to the replication lag.
>
> One thing that came to mind is that the code that writes to say 
> foo/bar/baz/objects ...
>
> will often delete the objects quickly after creating them. Perhaps the 
> replication doesn't occur to
> the other side before they are deleted? Would that perhaps contribute to this?

sync should handle object deletion just fine. it'll see '404 Not
Found' errors when it tries to replicate them, and just continue on to
the next object. that shouldn't cause bucket sync to get stuck

>
> Not sure how this relates to the objects ending in '/' though, although they 
> are in the same prefix hierarchy.
>
> To get out of this situation, what do I need to do:
>
> 1. radosgw-admin bucket sync init --bucket bucket-in-question on both sides?

'bucket sync init' clears the bucket's sync status, but nothing would
trigger rgw to restart the sync on it. you could try 'bucket sync run'
instead, though it's not especially reliable until the reef release so
you may need to rerun the command several times before it catches up
completely. once the bucket sync catches up, the source zone's bilog
entries would be eligible for automatic trimming

> 2. manually delete the 0_ objects in rados? (yuk).

you can use the 'bilog trim' command on a bucket to delete its log
entries, but i'd only consider doing that if you're satisfied that all
of the objects you care about have already replicated

>
> I've done #1 before when I had the other side of a multi site down for awhile 
> before. I have not had that happen in the current situation (link down 
> between sites).
>
> Thanks for anything you or others can offer.

for rgw multisite users in particular, i highly recommend trying out
the reef release. in addition to multisite resharding support, we made
a lot of improvements to multisite stability/reliability that we won't
be able to backport to pacific/quincy

>
> -Chris
>
>
> On Wednesday, September 20, 2023 at 07:33:07 PM MDT, Casey Bodley 
>  wrote:
>
>
> these keys starting with "<80>0_" appear to be replication log entries
> for multisite. can you confirm that this is a multisite setup? is the
> 'bucket sync status' mostly caught up on each zone? in a healthy
> multisite configuration, these log entries would eventually get
> trimmed automatically
>
> On Wed, Sep 20, 2023 at 7:08 PM Christopher Durham  wrote:
> >
> > I am using ceph 17.2.6 on Rocky 8.
> > I have a system that started giving me large omap object warnings.
> >
> > I tracked this down to a specific index shard for a single s3 bucket.
> >
> > rados -p  listomapkeys .dir..bucketid.nn.shardid
> > shows over 3 million keys for that shard. There are only about 2
> > million objects in the entire bucket according to a listing of the bucket
> > and radosgw-admin bucket stats --bucket bucketname. No other shard
> > has anywhere near this many index objects. Perhaps it should be noted that 
> > this
> > shard is the highest numbered shard for this bucket. For a bucket with
> > 16 shards, this is shard 15.
> >
> > If I look at the list of omapkeys generated, there are *many*
> > beginning with "<80>0_", almost the entire set of the three + million
> > keys in the shard. These are index objects in the so-called 'ugly' 
> > namespace. The rest of the omap keys appear to be normal.
> >
> > The 0_ after the <80> indicates some sort of 'bucket log index' 
> > according to src/cls/rgw/cls_rgw.cc.
> > However, using some sed magic previously discussed here, I ran:
> >
> > rados -p  getomapval .dir..bucketid.nn.shardid 
> > --omap-key-file /tmp/key.txt
> >
> > Where /tmp/key.txt contains only the funny <80>0_ key name without a 
> > newline
> >
> > The output of this shows, in a hex dump, the object name to which the index
> > refers, which was at one time a valid object.
> >
> > However, that object no longer exists in the bucket, and based on 
> > expirat

[ceph-users] Re: millions of hex 80 0_0000 omap keys in single index shard for single bucket

2023-09-21 Thread Casey Bodley
On Thu, Sep 21, 2023 at 12:21 PM Christopher Durham  wrote:
>
>
> Hi Casey,
>
> This is indeed a multisite setup. The other side shows that for
>
> # radosgw-admin sync status
>
> the oldest incremental change not applied is about a minute old, and that is 
> consistent over a number of minutes, always the oldest incremental change a 
> minute or two old.
>
> However:
>
> # radosgw-admin bucket sync status --bucket bucket-in-question
>
> shows a number of shards always behind, although it varies.
>
> The number of objects on each side in that bucket is close, and  to this 
> point I have attributed that to the replication lag.
>
> One thing that came to mind is that the code that writes to say 
> foo/bar/baz/objects ...
>
> will often delete the objects quickly after creating them. Perhaps the 
> replication doesn't occur to
> the other side before they are deleted? Would that perhaps contribute to this?

sync should handle object deletion just fine. it'll see '404 Not
Found' errors when it tries to replicate them, and just continue on to
the next object. that shouldn't cause bucket sync to get stuck

>
> Not sure how this relates to the objects ending in '/' though, although they 
> are in the same prefix hierarchy.
>
> To get out of this situation, what do I need to do:
>
> 1. radosgw-admin bucket sync init --bucket bucket-in-question on both sides?

'bucket sync init' clears the bucket's sync status, but nothing would
trigger rgw to restart the sync on it. you could try 'bucket sync run'
instead, though it's not especially reliable until the reef release so
you may need to rerun the command several times before it catches up
completely. once the bucket sync catches up, the source zone's bilog
entries would be eligible for automatic trimming

> 2. manually delete the 0_ objects in rados? (yuk).

you can use the 'bilog trim' command on a bucket to delete its log
entries, but i'd only consider doing that if you're satisfied that all
of the objects you care about have already replicated
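
for reference, a rough sketch of both commands (please double-check the
flags against 'radosgw-admin help' on your release before running anything):

# re-drive sync for the bucket; pre-reef this may need several passes
radosgw-admin bucket sync run --bucket=bucket-in-question

# only once you're satisfied everything has replicated: trim its bilog
radosgw-admin bilog trim --bucket=bucket-in-question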

>
> I've done #1 before when I had the other side of a multi site down for awhile 
> before. I have not had that happen in the current situation (link down 
> between sites).
>
> Thanks for anything you or others can offer.

for rgw multisite users in particular, i highly recommend trying out
the reef release. in addition to multisite resharding support, we made
a lot of improvements to multisite stability/reliability that we won't
be able to backport to pacific/quincy

>
> -Chris
>
>
> On Wednesday, September 20, 2023 at 07:33:07 PM MDT, Casey Bodley 
>  wrote:
>
>
> these keys starting with "<80>0_" appear to be replication log entries
> for multisite. can you confirm that this is a multisite setup? is the
> 'bucket sync status' mostly caught up on each zone? in a healthy
> multisite configuration, these log entries would eventually get
> trimmed automatically
>
> On Wed, Sep 20, 2023 at 7:08 PM Christopher Durham  wrote:
> >
> > I am using ceph 17.2.6 on Rocky 8.
> > I have a system that started giving me large omap object warnings.
> >
> > I tracked this down to a specific index shard for a single s3 bucket.
> >
> > rados -p  listomapkeys .dir..bucketid.nn.shardid
> > shows over 3 million keys for that shard. There are only about 2
> > million objects in the entire bucket according to a listing of the bucket
> > and radosgw-admin bucket stats --bucket bucketname. No other shard
> > has anywhere near this many index objects. Perhaps it should be noted that 
> > this
> > shard is the highest numbered shard for this bucket. For a bucket with
> > 16 shards, this is shard 15.
> >
> > If I look at the list of omapkeys generated, there are *many*
> > beginning with "<80>0_", almost the entire set of the three + million
> > keys in the shard. These are index objects in the so-called 'ugly' 
> > namespace. The rest of the omap keys appear to be normal.
> >
> > The 0_ after the <80> indicates some sort of 'bucket log index' 
> > according to src/cls/rgw/cls_rgw.cc.
> > However, using some sed magic previously discussed here, I ran:
> >
> > rados -p  getomapval .dir..bucketid.nn.shardid 
> > --omap-key-file /tmp/key.txt
> >
> > Where /tmp/key.txt contains only the funny <80>0_ key name without a 
> > newline
> >
> > The output of this shows, in a hex dump, the object name to which the index
> > refers, which was at one time a valid object.
> >
> > However, that object no longer exists in the bucket, and based on 
> > expiration policy, was
> > previously deleted. Let's say, in the hex dump, that the object was:
> >
> > foo/bar/baz/object1.bin
> >
> > The prefix foo/bar/baz/ used to have 32 objects, say 
> > foo/bar/baz/{object1.bin, object2.bin, ... }
> > An s3api listing shows that those objects no longer exist (and that is OK, 
> > as they  were previously deleted).
> > BUT, now, there is a weirdo object left in the bucket:
> >
> > foo/bar/baz/ <- with the slash at the end, and it is an object not a PRE 
> > (fix)

[ceph-users] Re: millions of hex 80 0_0000 omap keys in single index shard for single bucket

2023-09-21 Thread Christopher Durham

Hi Casey,
This is indeed a multisite setup. The other side shows that for 

# radosgw-admin sync status
the oldest incremental change not applied is about a minute old, and that is 
consistent over a number of minutes, always the oldest incremental change a 
minute or two old.

However:
# radosgw-admin bucket sync status --bucket bucket-in-question
shows a number of shards always behind, although it varies.
The number of objects on each side in that bucket is close, and  to this point 
I have attributed that to the replication lag. 

One thing that came to mind is that the code that writes to say 
foo/bar/baz/objects ... 

will often delete the objects quickly after creating them. Perhaps the 
replication doesn't occur to the other side before they are deleted? Would that 
perhaps contribute to this?

Not sure how this relates to the objects ending in '/' though, although they 
are in the same prefix hierarchy.

To get out of this situation, what do I need to do:
1. radosgw-admin bucket sync init --bucket bucket-in-question on both sides?
2. manually delete the 0_ objects in rados? (yuk).

I've done #1 before when I had the other side of a multi site down for awhile 
before. I have not had that happen in the current situation (link down between 
sites). 

Thanks for anything you or others can offer.
-Chris


   On Wednesday, September 20, 2023 at 07:33:07 PM MDT, Casey Bodley 
 wrote:  
 
 these keys starting with "<80>0_" appear to be replication log entries
for multisite. can you confirm that this is a multisite setup? is the
'bucket sync status' mostly caught up on each zone? in a healthy
multisite configuration, these log entries would eventually get
trimmed automatically

On Wed, Sep 20, 2023 at 7:08 PM Christopher Durham  wrote:
>
> I am using ceph 17.2.6 on Rocky 8.
> I have a system that started giving me large omap object warnings.
>
> I tracked this down to a specific index shard for a single s3 bucket.
>
> rados -p  listomapkeys .dir..bucketid.nn.shardid
> shows over 3 million keys for that shard. There are only about 2
> million objects in the entire bucket according to a listing of the bucket
> and radosgw-admin bucket stats --bucket bucketname. No other shard
> has anywhere near this many index objects. Perhaps it should be noted that 
> this
> shard is the highest numbered shard for this bucket. For a bucket with
> 16 shards, this is shard 15.
>
> If I look at the list of omapkeys generated, there are *many*
> beginning with "<80>0_", almost the entire set of the three + million
> keys in the shard. These are index objects in the so-called 'ugly' namespace. 
> The rest of the omap keys appear to be normal.
>
> The 0_ after the <80> indicates some sort of 'bucket log index' according 
> to src/cls/rgw/cls_rgw.cc.
> However, using some sed magic previously discussed here, I ran:
>
> rados -p  getomapval .dir..bucketid.nn.shardid 
> --omap-key-file /tmp/key.txt
>
> Where /tmp/key.txt contains only the funny <80>0_ key name without a 
> newline
>
> The output of this shows, in a hex dump, the object name to which the index
> refers, which was at one time a valid object.
>
> However, that object no longer exists in the bucket, and based on expiration 
> policy, was
> previously deleted. Let's say, in the hex dump, that the object was:
>
> foo/bar/baz/object1.bin
>
> The prefix foo/bar/baz/ used to have 32 objects, say 
> foo/bar/baz/{object1.bin, object2.bin, ... }
> An s3api listing shows that those objects no longer exist (and that is OK, as 
> they  were previously deleted).
> BUT, now, there is a weirdo object left in the bucket:
>
> foo/bar/baz/ <- with the slash at the end, and it is an object not a PRE 
> (fix).
>
> All objects under foo/ have a 3 day lifecycle expiration. If I wait(at most) 
> 3 days, the weirdo object with '/'
> at the end will be deleted, or I can delete it manually using aws s3api. But 
> either way, the log index
> objects, <80>0_ remain.
>
> The bucket in question is heavily used. But with over 3 million of these 
> <80>0_ objects (and growing)
> in a single shard, I am currently at a loss as to what to do or how to stop 
> this from occurring.
> I've poked around at a few other buckets, and I found a few others that have 
> this problem, but not enough to cause a large omap warning. (A few hundred 
> <80>0_000 index objects in a shard), nowhere near enough to cause the 
> large omap warning that led me to this post.
>
> Any ideas?
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recently started OSD crashes (or messages thereof)

2023-09-21 Thread Igor Fedotov

Hi Luke,

Highly likely this is caused by the issue covered in 
https://tracker.ceph.com/issues/53906


Unfortunately it looks like we missed the proper backport in Pacific.

You can apparently work around the issue by setting the 
'bluestore_volume_selection_policy' config parameter to rocksdb_original.
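
E.g. (just a sketch - the new policy is picked up on OSD restart):

ceph config set osd bluestore_volume_selection_policy rocksdb_original
# then restart the OSDs for the setting to take effect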


The potential implication of that "tuning" is less effective free space 
usage for the DB volume - RocksDB/BlueFS might initiate data spillover to 
the main (slow) device despite free space being available on the standalone 
DB volume, which in turn might cause some performance regression. A relevant 
alert will pop up if such a spillover takes place.


The above consequences are not very likely to occur, though, and they are 
rather minor most of the time, so I would encourage you to try that if OSD 
crashes are that common.



Thanks,

Igor


On 21/09/2023 17:48, Luke Hall wrote:

Hi,

Since the recent update to 16.2.14-1~bpo11+1 on Debian Bullseye I've 
started seeing OSD crashes being registered almost daily across all 
six physical machines (6xOSD disks per machine). There's a --block-db 
for each osd on a LV from an NVMe.


If anyone has any idea what might be causing these I'd appreciate some 
insight. Happy to provide any other info which might be useful.


Thanks,

Luke



{
    "assert_condition": "cur2 >= p.length",
    "assert_file": "./src/os/bluestore/BlueStore.h",
    "assert_func": "virtual void 
RocksDBBlueFSVolumeSelector::sub_usage(void*, const bluefs_fnode_t&)",

    "assert_line": 3875,
    "assert_msg": "./src/os/bluestore/BlueStore.h: In function 
'virtual void RocksDBBlueFSVolumeSelector::sub_usage(void*, const 
bluefs_fnode_t&)' thread 7f7f54f25700 time 
2023-09-20T14:24:00.455721+0100\n./src/os/bluestore/BlueStore.h: 3875: 
FAILED ceph_assert(cur2 >= p.length)\n",

    "assert_thread_name": "bstore_kv_sync",
    "backtrace": [
    "/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) 
[0x7f7f68632140]",

    "gsignal()",
    "abort()",
    "(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x16e) [0x55b22a49b5fa]",

    "/usr/bin/ceph-osd(+0xac673b) [0x55b22a49b73b]",
    "(RocksDBBlueFSVolumeSelector::sub_usage(void*, bluefs_fnode_t 
const&)+0x11e) [0x55b22ab0077e]",
    "(BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, 
unsigned long)+0x5bd) [0x55b22ab9b8ed]",
    "(BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0x9a) 
[0x55b22ab9bd7a]",

    "(BlueFS::fsync(BlueFS::FileWriter*)+0x79) [0x55b22aba97a9]",
    "(BlueRocksWritableFile::Sync()+0x15) [0x55b22abbf405]",
"(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, 
rocksdb::IODebugContext*)+0x3f) [0x55b22b0914d1]",
    "(rocksdb::WritableFileWriter::SyncInternal(bool)+0x1f4) 
[0x55b22b26b7c6]",
    "(rocksdb::WritableFileWriter::Sync(bool)+0x18c) 
[0x55b22b26b1f8]",
"(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, 
rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned 
long)+0x366) [0x55b22b0e4a98]",
    "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
unsigned long, bool, unsigned long*, unsigned long, 
rocksdb::PreReleaseCallback*)+0x12cc) [0x55b22b0e0c5a]",
    "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
rocksdb::WriteBatch*)+0x4a) [0x55b22b0df92a]",
    "(RocksDBStore::submit_common(rocksdb::WriteOptions&, 
std::shared_ptr)+0x82) [0x55b22b036c42]",


"(RocksDBStore::submit_transaction_sync(std::shared_ptr)+0x96) 
[0x55b22b037cc6]",

    "(BlueStore::_kv_sync_thread()+0x1201) [0x55b22aafc891]",
    "(BlueStore::KVSyncThread::entry()+0xd) [0x55b22ab2792d]",
    "/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) 
[0x7f7f68626ea7]",

    "clone()"
    ],
    "ceph_version": "16.2.14",
    "crash_id": 
"2023-09-20T13:24:00.562318Z_beb5c664-9ffb-4a4e-8c61-166865fd4e0b",

    "entity_name": "osd.8",
    "os_id": "11",
    "os_name": "Debian GNU/Linux 11 (bullseye)",
    "os_version": "11 (bullseye)",
    "os_version_id": "11",
    "process_name": "ceph-osd",
    "stack_sig": 
"90d1fb6954f0f5b1e98659a93a1b9ce5a5a42cd5e0b2990a65dc336567adcb26",

    "timestamp": "2023-09-20T13:24:00.562318Z",
    "utsname_hostname": "cphosd02",
    "utsname_machine": "x86_64",
    "utsname_release": "5.10.0-23-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 5.10.179-1 (2023-05-12)"
}



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: backfill_wait preventing deep scrubs

2023-09-21 Thread Frank Schilder
Thanks!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mykola Golub 
Sent: Thursday, September 21, 2023 4:53 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] backfill_wait preventing deep scrubs

On Thu, Sep 21, 2023 at 1:57 PM Frank Schilder  wrote:
>
> Hi all,
>
> I replaced a disk in our octopus cluster and it is rebuilding. I noticed that 
> since the replacement there is no scrubbing going on. Apparently, an OSD 
> having a PG in backfill_wait state seems to block deep scrubbing all other 
> PGs on that OSD as well - at least this is how it looks.
>
> Some numbers: the pool in question has 8192 PGs with EC 8+3 and ca 850 OSDs. 
> A total of 144 PGs needed backfilling (were remapped after replacing the 
> disk). After about 2 days we are down to 115 backfill_wait + 3 backfilling. 
> It will take a bit more than a week to complete.
>
> There is plenty of time and IOP/s available to deep-scrub PGs on the side, 
> but since the backfill started there is zero scrubbing/deep scrubbing going 
> on and "PGs not deep scrubbed in time" messages are piling up.
>
> Is there a way to allow (deep) scrub in this situation?

ceph config set osd osd_scrub_during_recovery true


--
Mykola Golub
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Patrick Begou

Hi Eneko,

I have not worked on the Ceph cluster since my last email (I was doing some 
user support) and now osd.2 is back in the cluster:


 -7 0.68217  host mostha1
  2    hdd  0.22739  osd.2   up   1.0  1.0
  5    hdd  0.45479  osd.5   up   1.0  1.0

Maybe it was the reboot suggested by Igor?

I will try to solve my last problem now. While upgrading from 15.2.13 to 
15.2.17 I hit a memory problem on one node (these are old computers used 
to learn Ceph).
Upgrading one of the OSDs failed and it locked the upgrade, as Ceph did not 
accept stopping and upgrading the next OSD in the cluster. But Ceph started 
rebalancing the data and magically finished the upgrade.
But one last OSD is still down and out, and it is a daemon problem, as 
smartctl reports good health for the HDD.
I've changed the faulty memory DIMMs and the node is back in the cluster. 
So this is my new challenge 😁


Using old hardware (2011) for learning seems fine for investigating Ceph 
reliability, as many problems can come up but with no risk!


Patrick



Le 21/09/2023 à 16:31, Eneko Lacunza a écrit :

Hi Patrick,

It seems your disk or controller are damaged. Are other disks 
connected to the same controller working ok? If so, I'd say disk is dead.


Cheers

El 21/9/23 a las 16:17, Patrick Begou escribió:

Hi Igor,

a "systemctl reset-failed" doesn't restart the osd.

I rebooted the node and now it shows some errors on the HDD:

[  107.716769] ata3.00: exception Emask 0x0 SAct 0x80 SErr 0x0 
action 0x0

[  107.716782] ata3.00: irq_stat 0x4008
[  107.716787] ata3.00: failed command: READ FPDMA QUEUED
[  107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 
23 ncq dma 1048576 in
    res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 
0x409 (media error) 

[  107.716802] ata3.00: status: { DRDY ERR }
[  107.716806] ata3.00: error: { UNC }
[  107.728547] ata3.00: configured for UDMA/133
[  107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: 
hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[  107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error 
[current]
[  107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read 
error - auto reallocate failed
[  107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 
00 00 08 00 00
[  107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) 
flags 0x80700 phys_seg 6 prio class 2

[  107.728623] ata3: EH complete
[  109.203256] ata3.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 
action 0x0

[  109.203268] ata3.00: irq_stat 0x4008
[  109.203274] ata3.00: failed command: READ FPDMA QUEUED
[  109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 
29 ncq dma 4096 in
    res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 
0x409 (media error) 

[  109.203289] ata3.00: status: { DRDY ERR }
[  109.203292] ata3.00: error: { UNC }



I think the storage is corrupted and I have to reset it all.

Patrick


Le 21/09/2023 à 13:32, Igor Fedotov a écrit :


May be execute systemctl reset-failed <...> or even restart the node?


On 21/09/2023 14:26, Patrick Begou wrote:

Hi Igor,

the ceph-osd.2.log remains empty on the node where this osd is 
located. This is what I get when manually restarting the osd.


[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl 
restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service 
failed because a timeout was exceeded.
See "systemctl status 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and 
"journalctl -xe" for details.

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5728 (podman) in control group while starting 
unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5882 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5884 (podman) in control group while starting 
unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6031 (bash) in control group while starting unit. 
Ignoring.

[ceph-users] Re: backfill_wait preventing deep scrubs

2023-09-21 Thread Mykola Golub
On Thu, Sep 21, 2023 at 1:57 PM Frank Schilder  wrote:
>
> Hi all,
>
> I replaced a disk in our octopus cluster and it is rebuilding. I noticed that 
> since the replacement there is no scrubbing going on. Apparently, an OSD 
> having a PG in backfill_wait state seems to block deep scrubbing all other 
> PGs on that OSD as well - at least this is how it looks.
>
> Some numbers: the pool in question has 8192 PGs with EC 8+3 and ca 850 OSDs. 
> A total of 144 PGs needed backfilling (were remapped after replacing the 
> disk). After about 2 days we are down to 115 backfill_wait + 3 backfilling. 
> It will take a bit more than a week to complete.
>
> There is plenty of time and IOP/s available to deep-scrub PGs on the side, 
> but since the backfill started there is zero scrubbing/deep scrubbing going 
> on and "PGs not deep scrubbed in time" messages are piling up.
>
> Is there a way to allow (deep) scrub in this situation?

ceph config set osd osd_scrub_during_recovery true


-- 
Mykola Golub
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Recently started OSD crashes (or messages thereof)

2023-09-21 Thread Luke Hall

Hi,

Since the recent update to 16.2.14-1~bpo11+1 on Debian Bullseye I've 
started seeing OSD crashes being registered almost daily across all six 
physical machines (6xOSD disks per machine). There's a --block-db for 
each osd on a LV from an NVMe.


If anyone has any idea what might be causing these I'd appreciate some 
insight. Happy to provide any other info which might be useful.


Thanks,

Luke



{
"assert_condition": "cur2 >= p.length",
"assert_file": "./src/os/bluestore/BlueStore.h",
"assert_func": "virtual void 
RocksDBBlueFSVolumeSelector::sub_usage(void*, const bluefs_fnode_t&)",

"assert_line": 3875,
"assert_msg": "./src/os/bluestore/BlueStore.h: In function 'virtual 
void RocksDBBlueFSVolumeSelector::sub_usage(void*, const 
bluefs_fnode_t&)' thread 7f7f54f25700 time 
2023-09-20T14:24:00.455721+0100\n./src/os/bluestore/BlueStore.h: 3875: 
FAILED ceph_assert(cur2 >= p.length)\n",

"assert_thread_name": "bstore_kv_sync",
"backtrace": [
"/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f7f68632140]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x16e) [0x55b22a49b5fa]",

"/usr/bin/ceph-osd(+0xac673b) [0x55b22a49b73b]",
"(RocksDBBlueFSVolumeSelector::sub_usage(void*, bluefs_fnode_t 
const&)+0x11e) [0x55b22ab0077e]",
"(BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, 
unsigned long)+0x5bd) [0x55b22ab9b8ed]",
"(BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0x9a) 
[0x55b22ab9bd7a]",

"(BlueFS::fsync(BlueFS::FileWriter*)+0x79) [0x55b22aba97a9]",
"(BlueRocksWritableFile::Sync()+0x15) [0x55b22abbf405]",
"(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions 
const&, rocksdb::IODebugContext*)+0x3f) [0x55b22b0914d1]",
"(rocksdb::WritableFileWriter::SyncInternal(bool)+0x1f4) 
[0x55b22b26b7c6]",

"(rocksdb::WritableFileWriter::Sync(bool)+0x18c) [0x55b22b26b1f8]",
"(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup 
const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned 
long)+0x366) [0x55b22b0e4a98]",
"(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned 
long, bool, unsigned long*, unsigned long, 
rocksdb::PreReleaseCallback*)+0x12cc) [0x55b22b0e0c5a]",
"(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
rocksdb::WriteBatch*)+0x4a) [0x55b22b0df92a]",
"(RocksDBStore::submit_common(rocksdb::WriteOptions&, 
std::shared_ptr)+0x82) [0x55b22b036c42]",


"(RocksDBStore::submit_transaction_sync(std::shared_ptr)+0x96) 
[0x55b22b037cc6]",

"(BlueStore::_kv_sync_thread()+0x1201) [0x55b22aafc891]",
"(BlueStore::KVSyncThread::entry()+0xd) [0x55b22ab2792d]",
"/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f7f68626ea7]",
"clone()"
],
"ceph_version": "16.2.14",
"crash_id": 
"2023-09-20T13:24:00.562318Z_beb5c664-9ffb-4a4e-8c61-166865fd4e0b",

"entity_name": "osd.8",
"os_id": "11",
"os_name": "Debian GNU/Linux 11 (bullseye)",
"os_version": "11 (bullseye)",
"os_version_id": "11",
"process_name": "ceph-osd",
"stack_sig": 
"90d1fb6954f0f5b1e98659a93a1b9ce5a5a42cd5e0b2990a65dc336567adcb26",

"timestamp": "2023-09-20T13:24:00.562318Z",
"utsname_hostname": "cphosd02",
"utsname_machine": "x86_64",
"utsname_release": "5.10.0-23-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 5.10.179-1 (2023-05-12)"
}


--
All postal correspondence to:
The Positive Internet Company, 24 Ganton Street, London. W1F 7QY

*Follow us on Twitter* @posipeople

The Positive Internet Company Limited is registered in England and Wales.
Registered company number: 3673639. VAT no: 726 7072 28.
Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Eneko Lacunza

Hi Patrick,

It seems your disk or controller are damaged. Are other disks connected 
to the same controller working ok? If so, I'd say disk is dead.


Cheers

El 21/9/23 a las 16:17, Patrick Begou escribió:

Hi Igor,

a "systemctl reset-failed" doesn't restart the osd.

I rebooted the node and now it shows some errors on the HDD:

[  107.716769] ata3.00: exception Emask 0x0 SAct 0x80 SErr 0x0 
action 0x0

[  107.716782] ata3.00: irq_stat 0x4008
[  107.716787] ata3.00: failed command: READ FPDMA QUEUED
[  107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 
ncq dma 1048576 in
    res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 
0x409 (media error) 

[  107.716802] ata3.00: status: { DRDY ERR }
[  107.716806] ata3.00: error: { UNC }
[  107.728547] ata3.00: configured for UDMA/133
[  107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_OK cmd_age=1s
[  107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error 
[current]
[  107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read 
error - auto reallocate failed
[  107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 
00 00 08 00 00
[  107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) 
flags 0x80700 phys_seg 6 prio class 2

[  107.728623] ata3: EH complete
[  109.203256] ata3.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 
action 0x0

[  109.203268] ata3.00: irq_stat 0x4008
[  109.203274] ata3.00: failed command: READ FPDMA QUEUED
[  109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 
ncq dma 4096 in
    res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 
0x409 (media error) 

[  109.203289] ata3.00: status: { DRDY ERR }
[  109.203292] ata3.00: error: { UNC }



I think the storage is corrupted and I have to reset it all.

Patrick


Le 21/09/2023 à 13:32, Igor Fedotov a écrit :


May be execute systemctl reset-failed <...> or even restart the node?


On 21/09/2023 14:26, Patrick Begou wrote:

Hi Igor,

the ceph-osd.2.log remains empty on the node where this osd is 
located. This is what I get when manually restarting the osd.


[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl 
restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service 
failed because a timeout was exceeded.
See "systemctl status 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and 
"journalctl -xe" for details.

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5728 (podman) in control group while starting 
unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5882 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5884 (podman) in control group while starting 
unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6031 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6033 (podman) in control group while starting 
unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6185 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6187 (podman) in control group while starting 
unit. Ignoring.
sept. 21 13:22:39 

[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Patrick Begou

Hi Igor,

a "systemctl reset-failed" doesn't restart the osd.

I rebooted the node and now it shows some errors on the HDD:

[  107.716769] ata3.00: exception Emask 0x0 SAct 0x80 SErr 0x0 
action 0x0

[  107.716782] ata3.00: irq_stat 0x4008
[  107.716787] ata3.00: failed command: READ FPDMA QUEUED
[  107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 
ncq dma 1048576 in
    res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 
0x409 (media error) 

[  107.716802] ata3.00: status: { DRDY ERR }
[  107.716806] ata3.00: error: { UNC }
[  107.728547] ata3.00: configured for UDMA/133
[  107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_OK cmd_age=1s

[  107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error [current]
[  107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read 
error - auto reallocate failed
[  107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 00 
00 08 00 00
[  107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) flags 
0x80700 phys_seg 6 prio class 2

[  107.728623] ata3: EH complete
[  109.203256] ata3.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 
action 0x0

[  109.203268] ata3.00: irq_stat 0x4008
[  109.203274] ata3.00: failed command: READ FPDMA QUEUED
[  109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 
ncq dma 4096 in
    res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 
0x409 (media error) 

[  109.203289] ata3.00: status: { DRDY ERR }
[  109.203292] ata3.00: error: { UNC }



I think the storage is corrupted and I have to reset it all.

Patrick


Le 21/09/2023 à 13:32, Igor Fedotov a écrit :


May be execute systemctl reset-failed <...> or even restart the node?


On 21/09/2023 14:26, Patrick Begou wrote:

Hi Igor,

the ceph-osd.2.log remains empty on the node where this osd is 
located. This is what I get when manually restarting the osd.


[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl 
restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service 
failed because a timeout was exceeded.
See "systemctl status 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and 
"journalctl -xe" for details.

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5728 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5882 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5884 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6031 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6033 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6185 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6187 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph

[ceph-users] Re: ceph orch osd data_allocate_fraction does not work

2023-09-21 Thread Adam King
Looks like the orchestration side support for this got brought into pacific
with the rest of the drive group stuff, but the actual underlying feature
in ceph-volume (from https://github.com/ceph/ceph/pull/40659) never got a
pacific backport. I've opened the backport now
https://github.com/ceph/ceph/pull/53581 and I think another pacific release
is planned so we can hopefully have it fixed there eventually, but it's
definitely broken as of now.  Sorry about that.
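
If you want to confirm what your containers currently ship, a quick sketch
(assuming cephadm; the flag only exists where the ceph-volume change from
https://github.com/ceph/ceph/pull/40659 has landed):

cephadm shell -- ceph-volume lvm batch --help | grep -- --data-allocate-fraction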

On Thu, Sep 21, 2023 at 7:54 AM Boris Behrens  wrote:

> I have a use case where I want to only use a small portion of the disk for
> the OSD and the documentation states that I can use
> data_allocation_fraction [1]
>
> But cephadm can not use this and throws this error:
> /usr/bin/podman: stderr ceph-volume lvm batch: error: unrecognized
> arguments: --data-allocate-fraction 0.1
>
> So, what I actually want to achieve:
> Split up a single SSD into:
> 3-5x block.db for spinning disks (5x 320GB or 3x 500GB regarding if I have
> 8TB HDDs or 16TB HDDs)
> 1x SSD OSD (100G) for RGW index / meta pools
> 1x SSD OSD (100G) for RGW gc pool because of this bug [2]
>
> My service definition looks like this:
>
> service_type: osd
> service_id: hdd-8tb
> placement:
>   host_pattern: '*'
> crush_device_class: hdd
> spec:
>   data_devices:
> rotational: 1
> size: ':9T'
>   db_devices:
> rotational: 0
> limit: 5
> size: '1T:2T'
>   encrypted: true
>   block_db_size: 3200
> ---
> service_type: osd
> service_id: hdd-16tb
> placement:
>   host_pattern: '*'
> crush_device_class: hdd
> spec:
>   data_devices:
> rotational: 1
> size: '14T:'
>   db_devices:
> rotational: 0
> limit: 1
> size: '1T:2T'
>   encrypted: true
>   block_db_size: 5000
> ---
> service_type: osd
> service_id: gc
> placement:
>   host_pattern: '*'
> crush_device_class: gc
> spec:
>   data_devices:
> rotational: 0
> size: '1T:2T'
>   encrypted: true
>   data_allocate_fraction: 0.05
> ---
> service_type: osd
> service_id: ssd
> placement:
>   host_pattern: '*'
> crush_device_class: ssd
> spec:
>   data_devices:
> rotational: 0
> size: '1T:2T'
>   encrypted: true
>   data_allocate_fraction: 0.05
>
>
> [1]
>
> https://docs.ceph.com/en/pacific/cephadm/services/osd/#ceph.deployment.drive_group.DriveGroupSpec.data_allocate_fraction
> [2] https://tracker.ceph.com/issues/53585
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph orch osd data_allocate_fraction does not work

2023-09-21 Thread Boris Behrens
I have a use case where I want to only use a small portion of the disk for
the OSD and the documentation states that I can use
data_allocation_fraction [1]

But cephadm can not use this and throws this error:
/usr/bin/podman: stderr ceph-volume lvm batch: error: unrecognized
arguments: --data-allocate-fraction 0.1

So, what I actually want to achieve:
Split up a single SSD into:
3-5x block.db for spinning disks (5x 320GB or 3x 500GB regarding if I have
8TB HDDs or 16TB HDDs)
1x SSD OSD (100G) for RGW index / meta pools
1x SSD OSD (100G) for RGW gc pool because of this bug [2]

My service definition looks like this:

service_type: osd
service_id: hdd-8tb
placement:
  host_pattern: '*'
crush_device_class: hdd
spec:
  data_devices:
rotational: 1
size: ':9T'
  db_devices:
rotational: 0
limit: 5
size: '1T:2T'
  encrypted: true
  block_db_size: 3200
---
service_type: osd
service_id: hdd-16tb
placement:
  host_pattern: '*'
crush_device_class: hdd
spec:
  data_devices:
rotational: 1
size: '14T:'
  db_devices:
rotational: 0
limit: 1
size: '1T:2T'
  encrypted: true
  block_db_size: 5000
---
service_type: osd
service_id: gc
placement:
  host_pattern: '*'
crush_device_class: gc
spec:
  data_devices:
rotational: 0
size: '1T:2T'
  encrypted: true
  data_allocate_fraction: 0.05
---
service_type: osd
service_id: ssd
placement:
  host_pattern: '*'
crush_device_class: ssd
spec:
  data_devices:
rotational: 0
size: '1T:2T'
  encrypted: true
  data_allocate_fraction: 0.05


[1]
https://docs.ceph.com/en/pacific/cephadm/services/osd/#ceph.deployment.drive_group.DriveGroupSpec.data_allocate_fraction
[2] https://tracker.ceph.com/issues/53585

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Igor Fedotov


May be execute systemctl reset-failed <...> or even restart the node?


On 21/09/2023 14:26, Patrick Begou wrote:

Hi Igor,

the ceph-osd.2.log remains empty on the node where this osd is 
located. This is what I get when manually restarting the osd.


[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed 
because a timeout was exceeded.
See "systemctl status 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and 
"journalctl -xe" for details.

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5728 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5882 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 5884 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6031 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6033 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6185 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 6187 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 14627 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 14629 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 14776 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 14778 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 15169 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates 

[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Patrick Begou

Hi Igor,

the ceph-osd.2.log remains empty on the node where this osd is located.
This is what I get when manually restarting the osd.


[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed 
because a timeout was exceeded.
See "systemctl status 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and "journalctl 
-xe" for details.

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 5728 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 5882 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 5884 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 6031 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 6033 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 6185 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 6187 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 14627 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 14629 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 14776 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 14778 (podman) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 15169 (bash) in control group while starting unit. Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually 
indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr s

[ceph-users] backfill_wait preventing deep scrubs

2023-09-21 Thread Frank Schilder
Hi all,

I replaced a disk in our octopus cluster and it is rebuilding. I noticed that
since the replacement there has been no scrubbing going on. Apparently, an OSD
having a PG in backfill_wait state seems to block deep scrubbing of all the
other PGs on that OSD as well - at least this is how it looks.

Some numbers: the pool in question has 8192 PGs with EC 8+3 and ca 850 OSDs. A 
total of 144 PGs needed backfilling (were remapped after replacing the disk). 
After about 2 days we are down to 115 backfill_wait + 3 backfilling. It will 
take a bit more than a week to complete.

There is plenty of time and IOPS available to deep-scrub PGs on the side, but
since the backfill started there has been zero scrubbing/deep scrubbing going
on, and "PGs not deep scrubbed in time" messages are piling up.

Is there a way to allow (deep) scrub in this situation?
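
For what it's worth, one setting that may be relevant here (a sketch only;
please check the documentation for your release before relying on it) is
osd_scrub_during_recovery, which defaults to false and prevents new scrubs
from being scheduled on OSDs that have PGs in active recovery/backfill:

ceph config set osd osd_scrub_during_recovery true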

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD not starting after being mounted with ceph-objectstore-tool --op fuse

2023-09-21 Thread Budai Laszlo

Hello,

I have a problem with an OSD not starting after being mounted offline using the 
ceph-objectstore-tool --op fuse command.

The ceph orch ps output now shows the osd in an error state:

osd.0   storage1   error 2m ago   5h    -    4096M  
  


If I check the logs on the node, I can see the following messages in the
system journal:

Sep 21 10:26:13 storage1 systemd[1]: Started Ceph osd.0 for 
82eb0cee-583a-11ee-b10b-abe63a69ab28.
Sep 21 10:26:14 storage1 bash[50983]: Running command: /usr/bin/chown -R 
ceph:ceph /var/lib/ceph/osd/ceph-0
Sep 21 10:26:14 storage1 bash[50983]: Running command: 
/usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-0 
--no-mon-config --dev /dev/mapper/ceph--aac54f64--d2a7--42e6>
Sep 21 10:26:14 storage1 bash[50983]: Running command: /usr/bin/chown -h ceph:ceph 
/dev/mapper/ceph--aac54f64--d2a7--42e6--a09d--1373e3524414-osd--block--57cfd62d--ae4d--4cae--8c64--be255837>
Sep 21 10:26:14 storage1 bash[50983]: Running command: /usr/bin/chown -R 
ceph:ceph /dev/dm-0
Sep 21 10:26:14 storage1 bash[50983]: Running command: /usr/bin/ln -s 
/dev/mapper/ceph--aac54f64--d2a7--42e6--a09d--1373e3524414-osd--block--57cfd62d--ae4d--4cae--8c64--be25583728fa
 /var/lib>
Sep 21 10:26:14 storage1 bash[50983]: Running command: /usr/bin/chown -R 
ceph:ceph /var/lib/ceph/osd/ceph-0
Sep 21 10:26:14 storage1 bash[50983]: --> ceph-volume raw activate successful 
for osd ID: 0
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.607+ 
7f91c87cd540  0 set uid:gid to 167:167 (ceph:ceph)
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.607+ 
7f91c87cd540  0 ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) 
quincy (stable), process ceph-osd, pid>
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.607+ 
7f91c87cd540  0 pidfile_write: ignore empty --pid-file
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.611+ 
7f91c87cd540  1 bdev(0x55d79b319400 /var/lib/ceph/osd/ceph-0/block) open path 
/var/lib/ceph/osd/ceph-0/block
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.611+ 
7f91c87cd540  1 bdev(0x55d79b319400 /var/lib/ceph/osd/ceph-0/block) open size 
10733223936 (0x27fc0, 10 GiB) block>
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.611+ 
7f91c87cd540  1 bluestore(/var/lib/ceph/osd/ceph-0) _set_cache_sizes cache_size 
1073741824 meta 0.45 kv 0.45 data 0.06
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.611+ 
7f91c87cd540  1 bdev(0x55d79b318c00 /var/lib/ceph/osd/ceph-0/block) open path 
/var/lib/ceph/osd/ceph-0/block
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.611+ 
7f91c87cd540  1 bdev(0x55d79b318c00 /var/lib/ceph/osd/ceph-0/block) open size 
10733223936 (0x27fc0, 10 GiB) block>
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.611+ 
7f91c87cd540  1 bluefs add_block_device bdev 1 path 
/var/lib/ceph/osd/ceph-0/block size 10 GiB
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.611+ 
7f91c87cd540  1 bdev(0x55d79b318c00 /var/lib/ceph/osd/ceph-0/block) close
Sep 21 10:26:14 storage1 bash[51214]: debug 2023-09-21T10:26:14.899+ 
7f91c87cd540  1 bdev(0x55d79b319400 /var/lib/ceph/osd/ceph-0/block) close
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.143+ 
7f91c87cd540  0 starting osd.0 osd_data /var/lib/ceph/osd/ceph-0 
/var/lib/ceph/osd/ceph-0/journal
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.143+ 
7f91c87cd540 -1 Falling back to public interface
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.175+ 
7f91c87cd540  0 load: jerasure load: lrc
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.175+ 
7f91c87cd540  1 bdev(0x55d79c12 /var/lib/ceph/osd/ceph-0/block) open path 
/var/lib/ceph/osd/ceph-0/block
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.179+ 
7f91c87cd540 -1 bdev(0x55d79c12 /var/lib/ceph/osd/ceph-0/block) open open 
got: (13) Permission denied
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.179+ 
7f91c87cd540  1 bdev(0x55d79c12 /var/lib/ceph/osd/ceph-0/block) open path 
/var/lib/ceph/osd/ceph-0/block
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.179+ 
7f91c87cd540 -1 bdev(0x55d79c12 /var/lib/ceph/osd/ceph-0/block) open open 
got: (13) Permission denied
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.179+ 
7f91c87cd540  1 mClockScheduler: set_max_osd_capacity #op shards: 5 max osd 
capacity(iops) per shard: 63.00
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.179+ 
7f91c87cd540  1 mClockScheduler: set_osd_mclock_cost_per_io 
osd_mclock_cost_per_io: 0.0114000
Sep 21 10:26:15 storage1 bash[51214]: debug 2023-09-21T10:26:15.179+ 
7f91c87cd540  1 mClockScheduler: set_osd_mclock_cost_per_byte 
osd_mclock_cost_per_byte: 0.026
Sep 2
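
Given the repeated "open got: (13) Permission denied" lines above, one thing
worth ruling out (a sketch only; the device paths are taken from the log
above, and the root cause may well be elsewhere) is that the offline fuse
mount changed the ownership of the underlying device node:

ls -l /var/lib/ceph/osd/ceph-0/block /dev/dm-0
chown ceph:ceph /dev/dm-0   # only if it is no longer owned by ceph:ceph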

[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Igor Fedotov

Hi Patrick,

please share osd restart log to investigate that.


Thanks,

Igor

On 21/09/2023 13:41, Patrick Begou wrote:

Hi,

After a power outage on my test ceph cluster, 2 OSDs fail to restart.
The log file shows:


8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service 
RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled 
restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 1858 (bash) in control group while starting unit. 
Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean 
termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 2815 (podman) in control group while starting unit. 
Ignoring.


This is not critical as it is a test cluster and it is actually
rebalancing onto other OSDs, but I would like to know how to return to
HEALTH_OK status.


Smartctl shows the HDDs are OK.

So is there a way to recover the osd from this state? Version is
15.2.17 (just moved from 15.2.13 to 15.2.17 yesterday, will try to
move to the latest versions as soon as this problem is solved).


Thanks

Patrick

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] After power outage, osd do not restart

2023-09-21 Thread Patrick Begou

Hi,

After a power outage on my test ceph cluster, 2 OSDs fail to restart.
The log file shows:


8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service 
RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled 
restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 1858 (bash) in control group while starting unit. Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean 
termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over 
process 2815 (podman) in control group while starting unit. Ignoring.


This is not critical as it is a test cluster and it is actually
rebalancing onto other OSDs, but I would like to know how to return to
HEALTH_OK status.


Smartctl shows the HDDs are OK.

So is there a way to recover the osd from this state? Version is
15.2.17 (just moved from 15.2.13 to 15.2.17 yesterday, will try to move
to the latest versions as soon as this problem is solved).


Thanks

Patrick

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Igor Fedotov

Hi!

Can you share OSD logs demonstrating such a restart?
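
If this is the rook deployment mentioned below, the log of a restarted OSD
container can usually be pulled with something like the following (a sketch;
the namespace, pod name and container name are assumptions based on rook
defaults) and then shared via ceph-post-file:

kubectl -n rook-ceph logs <rook-ceph-osd-pod> -c osd --previous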


Thanks,

Igor

On 20/09/2023 20:16, sbeng...@gmail.com wrote:

Since upgrading to 18.2.0, OSDs are restarting very frequently due to
livenessprobe failures, making the cluster unusable. Has anyone else seen this
behavior?

Upgrade path: ceph 17.2.6 to 18.2.0 (and rook from 1.11.9 to 1.12.1)
on ubuntu 20.04 kernel 5.15.0-79-generic

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: CephFS warning: clients laggy due to laggy OSDs

2023-09-21 Thread Janek Bevendorff

Hi,

I took a snapshot of MDS.0's logs. We have five active MDS in total, 
each one reporting laggy OSDs/clients, but I cannot find anything 
related to that in the log snippet. Anyhow, I uploaded the log for your 
reference with ceph-post-file ID 79b5138b-61d7-4ba7-b0a9-c6f02f47b881.


This is what ceph status looks like after a couple of days. This is not 
normal:


HEALTH_WARN
55 client(s) laggy due to laggy OSDs
8 clients failing to respond to capability release
1 clients failing to advance oldest client/flush tid
5 MDSs report slow requests

(55 clients are actually "just" 11 unique client IDs, but each MDS makes 
their own report.)


osd mon_osd_laggy_halflife is not configured on our cluster, so it's the 
default of 3600.



Janek


On 20/09/2023 13:17, Dhairya Parmar wrote:

Hi Janek,

The PR Venky mentioned makes use of the OSD's laggy parameters (laggy_interval
and laggy_probability) to determine whether an OSD is laggy. These laggy
parameters are reset to 0 only if the interval between the last modification
made to the OSDMap and the timestamp at which the OSD was marked down exceeds
the grace interval threshold, which is `mon_osd_laggy_halflife * 48`.
mon_osd_laggy_halflife is configurable and defaults to 3600, so the parameters
are only reset once that interval exceeds 172800 seconds. I'd recommend taking
a look at what your configured value is (using cmd:
ceph config get osd mon_osd_laggy_halflife).
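
If you want to inspect the current per-OSD values, they should also be visible
in the full OSD map dump (a sketch; the exact field layout depends on your
release, but laggy_probability/laggy_interval are part of the osd_xinfo
section):

ceph osd dump --format json-pretty | grep -i laggy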

There is also a "hack" to reset the parameters manually (*not recommended,
just for info*): set mon_osd_laggy_weight to 1 using `ceph config set osd
mon_osd_laggy_weight 1` and reboot the OSD(s) that are reported as laggy;
you will see the lagginess go away.


*Dhairya Parmar*

Associate Software Engineer, CephFS

Red Hat Inc. 

dpar...@redhat.com





On Wed, Sep 20, 2023 at 3:25 PM Venky Shankar  wrote:

Hey Janek,

I took a closer look at various places where the MDS would consider a
client as laggy and it seems like a wide variety of reasons are taken
into consideration and not all of them might be a reason to defer client
eviction, so the warning is a bit misleading. I'll post a PR for this. In
the meantime, could you share the debug logs stated in my previous email?

On Wed, Sep 20, 2023 at 3:07 PM Venky Shankar
 wrote:

> Hi Janek,
>
> On Tue, Sep 19, 2023 at 4:44 PM Janek Bevendorff <
> janek.bevendo...@uni-weimar.de> wrote:
>
>> Hi Venky,
>>
>> As I said: There are no laggy OSDs. The maximum ping I have for any OSD
>> in ceph osd perf is around 60ms (just a handful, probably aging disks). The
>> vast majority of OSDs have ping times of less than 1ms. Same for the host
>> machines, yet I'm still seeing this message. It seems that the affected
>> hosts are usually the same, but I have absolutely no clue why.
>>
>
> It's possible that you are running into a bug which does not clear the
> laggy clients list which the MDS sends to monitors via beacons. Could you
> help us out with debug mds logs (by setting debug_mds=20) for the active
> mds for around 15-20 seconds and share the logs please? Also reset the log
> level once done since it can hurt performance.
>
> # ceph config set mds.<> debug_mds 20
>
> and reset via
>
> # ceph config rm mds.<> debug_mds
>
>
>> Janek
>>
>>
>> On 19/09/2023 12:36, Venky Shankar wrote:
>>
>> Hi Janek,
>>
>> On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff <
>> janek.bevendo...@uni-weimar.de> wrote:
>>
>>> Thanks! However, I still don't really understand why I am seeing this.
>>>
>>
>> This is due to a change that was merged recently in pacific:
>>
>> https://github.com/ceph/ceph/pull/52270
>>
>> The MDS would not evict laggy clients if the OSDs report as laggy. Laggy
>> OSDs can cause cephfs clients to not flush dirty data (during cap revokes
>> by the MDS) and thereby show up as laggy and get evicted by the MDS.
>> This behaviour was changed, and therefore you get warnings that some clients
>> are laggy but they are not evicted since the OSDs are laggy.
>>
>>
>>> The first time I had this, one of the clients was a remote user dialling
>>> in via VPN, which could indeed be laggy. But I am also seeing it from
>>> neighbouring hosts that are on the same physical network with reliable ping
>>> times way below 1ms. How is that considered laggy?
>>>
>> Are some of your OSDs reporting as laggy? This can be checked via `perf dump`:
>>
>> > ceph tell mds.<> perf dump
>> (search for op_laggy/osd_laggy)
>>
>>
>>> On 18/09/2023 18:07, Laura Flores wrote:
>>>
>>> Hi Janek,
>>>
>>> There was some docume