[ceph-users] Mysterious Disk-Space Eater

2023-01-11 Thread duluxoz

Hi All,

Got a funny one, which I'm hoping someone can help us with.

We've got three identical(?) Ceph Quincy Nodes running on Rocky Linux 
8.7. Each Node has 4 OSDs, plus Monitor, Manager, and iSCSI G/W services 
running on it (we're only a small shop). Each Node has a separate 16 
GiB partition mounted as /var. Everything is running smoothly and the Ceph 
Cluster is handling things very well.


However, one of the Nodes (not the one currently acting as the Active 
Manager) is running out of space on /var. Normally, all of the Nodes 
have around 10% space used (via a df -H command), but the problem Node 
only takes 1 to 3 days to run out of space, which then knocks it out of 
Quorum. It's currently at 85% and growing.


At first we thought this was caused by an overly large log file, but 
investigations showed that all the logs on all 3 Nodes were of 
comparable size. Also, searching for the 20 largest files on the problem 
Node's /var didn't produce any significant results.


Coincidentally, unrelated to this issue, the problem Node (but not the 
other 2 Nodes) was re-booted a couple of days ago and, when the Cluster 
had re-balanced itself and everything was back online and reporting as 
Healthy, the problem Node's /var was back down to around 10%, the same 
as the other two Nodes.


This led us to suspect that there was some sort of "run-away" process 
or journaling/logging/temporary file(s) or whatever that the re-boot had 
"cleaned up". So we've been keeping an eye on things, but we can't see 
anything causing the issue and now, as I said above, the problem Node's 
/var is back up to 85% and growing.
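
If it is something holding deleted files open, my understanding is that it 
would show up as a gap between df and du, and that lsof should be able to list 
the culprits. So next time it fills up we plan to run something like this 
(hopefully I've got the flags right):

# df -H /var ; du -shx /var   # a large difference suggests deleted-but-still-open files
# lsof +L1 /var               # open files on /var with link count 0 (deleted but still held open)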


I've been looking at the log files, trying to determine the issue, but as 
I don't really know what I'm looking for I don't even know if I'm 
looking in the *correct* log files...


Obviously rebooting the problem Node every couple of days is not a 
viable option, and increasing the size of the /var partition is only 
going to postpone the issue, not resolve it. So if anyone has any ideas 
we'd love to hear them - thanks


Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Current min_alloc_size of OSD?

2023-01-11 Thread Anthony D'Atri

It’s printed in the OSD log at startup.

I don’t immediately see it in `ceph osd metadata`; arguably it should be 
there.  `config show` on the admin socket, I suspect, does not show the 
existing on-disk value.
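
I.e. something along these lines should turn it up from the most recent OSD 
start (log paths / unit names vary by deployment):

grep -i min_alloc_size /var/log/ceph/ceph-osd.*.log
# or: journalctl -u ceph-osd@<id> | grep -i min_alloc_size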

> 
> Hi,
> 
> Ceph 16 Pacific introduced a new smaller default min_alloc_size of 4096 bytes 
> for HDD and SSD OSDs.
> 
> How can I get the current min_alloc_size of OSDs that were created with 
> older Ceph versions? Is there a command that shows this info from the on-disk 
> format of a BlueStore OSD?
> 
> Regards
> -- 
> Robert Sander
> Heinlein Support GmbH
> Schwedter Str. 8/9b, 10119 Berlin
> 
> https://www.heinlein-support.de
> 
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
> 
> Amtsgericht Berlin-Charlottenburg - HRB 93818 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Current min_alloc_size of OSD?

2023-01-11 Thread Robert Sander
Hi,

Ceph 16 Pacific introduced a new smaller default min_alloc_size of 4096 bytes 
for HDD and SSD OSDs.

How can I get the current min_alloc_size of OSDs that were created with older 
Ceph versions? Is there a command that shows this info from the on-disk format 
of a BlueStore OSD?

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg mapping verification

2023-01-11 Thread Stephen Smith6
I think “ceph pg dump” is what you’re after; look for the “UP” and “ACTING” 
fields to map a PG to its OSDs. From there it’s just a matter of verifying your 
PG placement matches the CRUSH rule.
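
If you want to check the calculated mappings offline against a crushmap, 
osdmaptool can also dump them for a given pool - roughly (from memory, so 
double-check the flags):

ceph osd getmap -o osdmap.bin
osdmaptool osdmap.bin --test-map-pgs-dump --pool <pool-id>

That prints each PG with its up/acting OSD sets, which a small script can then 
compare against the CRUSH rule.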

From: Christopher Durham 
Date: Wednesday, January 11, 2023 at 3:56 PM
To: ceph-users@ceph.io 
Subject: [EXTERNAL] [ceph-users] pg mapping verification

Hi,
For a given crush rule and pool that uses it, how can I verify that the pgs in 
that pool follow the rule? I have a requirement to 'prove' that the pgs are 
mapping correctly.
I see: https://pypi.org/project/crush/
This allows me to read in a crushmap file that I could then use to verify a pg 
with some scripting, but this pypi package is very old and seems not to have 
been maintained or updated since 2017.
I am sure there is a way, using osdmaptool or something else, but it is not 
obvious. Before I spend a lot of time searching, I thought I would ask here.
Basically, having a list of pgs like this:
[[1,2,3,4,5],[2,3,4,5,6],...]
Given a read-in crushmap and a specific rule therein, I want to verify that all 
pgs in my list are consistent with the rule specified.
Let me know if there is a proper way to do this, and thanks.
-Chris


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pg mapping verification

2023-01-11 Thread Christopher Durham


Hi,
For a given crush rule and pool that uses it, how can I verify that the pgs in 
that pool follow the rule? I have a requirement to 'prove' that the pgs are 
mapping correctly.
I see: https://pypi.org/project/crush/
This allows me to read in a crushmap file that I could then use to verify a pg 
with some scripting, but this pypi package is very old and seems not to have 
been maintained or updated since 2017.
I am sure there is a way, using osdmaptool or something else, but it is not 
obvious. Before I spend a lot of time searching, I thought I would ask here.
Basically, having a list of pgs like this:
[[1,2,3,4,5],[2,3,4,5,6],...]
Given a read-in crushmap and a specific rule therein, I want to verify that all 
pgs in my list are consistent with the rule specified.
Let me know if there is a proper way to do this, and thanks.
-Chris


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Octopus rbd images stuck in trash

2023-01-11 Thread Jeff Welling

Hello there,

I'm running Ceph 15.2.17 (Octopus) on Debian Buster and I'm starting an 
upgrade but I'm seeing a problem and I wanted to ask how best to proceed 
in case I make things worse by mucking with it without asking experts.


I've moved an rbd image to the trash without clearing the snapshots 
first, and then tried to 'trash purge'. This resulted in an error 
because the image still has snapshots, but I'm unable to remove the 
image from the pool to clear the snapshots either. At least one of these 
images is from a clone of a snapshot from another trashed image, which 
I'm already kicking myself for.


The contents of my trash:

# rbd trash ls
07afadac0ed69c nfsroot_pi08
240ae5a5eb3214 bigdisk
7fd5138848231e nfsroot_pi01
f33e1f5bad0952 bigdisk2
fcdeb1f96a6124 raspios-64bit-lite-manuallysetup-p1
fcdebd2237697a raspios-64bit-lite-manuallysetup-p2
fd51418d5c43da nfsroot_pi02
fd514a6b4d3441 nfsroot_pi03
fd515061816c70 nfsroot_pi04
fd51566859250b nfsroot_pi05
fd5162c5885d9c nfsroot_pi07
fd5171c27c36c2 nfsroot_pi09
fd51743cb8813c nfsroot_pi10
fd517ad3bc3c9d nfsroot_pi11
fd5183bfb1e588 nfsroot_pi12


This is the error I get trying to purge the trash:

# rbd trash purge
Removing images: 0% complete...failed.
rbd: some expired images could not be removed
Ensure that they are closed/unmapped, do not have snapshots (including 
trashed snapshots with linked clones), are not in a group and were moved 
to the trash successfully.



This is the error when I try and restore one of the trashed images:

# rbd trash restore nfsroot_pi08
rbd: error: image does not exist in trash
2023-01-11T12:28:52.982-0800 7f4b69a7c3c0 -1 librbd::api::Trash: 
restore: error getting image id nfsroot_pi08 info from trash: (2) No 
such file or directory


Trying to restore other images gives the same error.
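
One thing I notice is that "rbd trash ls" pairs an ID with each name, so maybe 
restore wants the ID rather than the name? Something like:

# rbd trash restore 07afadac0ed69c

I haven't confirmed whether that's actually the problem, though.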

These trash images are now taking up a significant portion of the 
cluster space. One thought was to upgrade and see if that resolves the 
problem, but I've shot myself in the foot doing that in the past without 
confirming it would solve the problem, so I'm looking for a second 
opinion on how best to clear these?


These are all Debian Buster systems, the kernel version of the host I'm 
running these commands on is:


Linux zim 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1+deb10u1 (2020-04-27) 
x86_64 GNU/Linux


I'm going to be upgrading that too but one step at a time.
The exact ceph version is:

ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
(stable)


This was installed from the ceph repos, not the debian repos, using 
cephadm. If there are any additional details I can share, please let me 
know - any and all thoughts welcome! I've been googling and have found 
folks with similar issues but nothing similar enough to feel helpful.


Thanks in advance, and thank you to any and everyone who contributes to 
Ceph, it's awesome!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Move bucket between realms

2023-01-11 Thread mahnoosh shahidi
Hi all,

Is there any way in rgw to move a bucket from one realm to another one in
the same cluster?

Best regards,
Mahnoosh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: adding OSD to orchestrated system, ignoring osd service spec.

2023-01-11 Thread Eugen Block
I just wanted to see if something like "all available devices" is  
managed and could possibly override your drivegroups.yml. Here's an  
example:


storage01:~ # ceph orch ls osd
NAME   PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd 3  9m ago -
osd.all-available-devices   0  -  8M   *

You can also find more information in the cephadm.log and/or in the  
ceph-volume.log.
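
If that service turns out to be the one grabbing your new devices, setting it 
to unmanaged should stop it from doing so - something like this (double-check 
against your version's docs first):

storage01:~ # ceph orch apply osd --all-available-devices --unmanaged=true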


Zitat von Wyll Ingersoll :

Not really, it's on an airgapped/secure network and I cannot  
copy-and-paste from it.  What are you looking for?  This cluster has  
720 OSDs across 18 storage nodes.
I think we have identified the problem and it may not be a ceph  
issue, but we need to investigate further.  It has something to do with  
the SSD devices that are being ignored - they are slightly different  
from the other ones.
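
For illustration only (this is not our actual spec), a drivegroup along these 
lines would silently skip SSDs whose model string doesn't match the filter, 
which is our current suspicion:

service_type: osd
service_id: osd_hdd_with_ssd_db
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  model: 'EXAMPLE-SSD-MODEL'   # hypothetical filter; SSDs reporting a different model are ignored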


From: Eugen Block 
Sent: Wednesday, January 11, 2023 3:27 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: adding OSD to orchestrated system,  
ignoring osd service spec.


Hi,

can you share the output of

storage01:~ # ceph orch ls osd

Thanks,
Eugen

Zitat von Wyll Ingersoll :


When adding a new OSD to a ceph orchestrated system (16.2.9) on a
storage node that has a specification profile that dictates which
devices to use as the db_devices (SSDs), the newly added OSDs seem
to be ignoring the db_devices (there are several available) and
putting the data and db/wal on the same device.

We installed the new disk (HDD) and then ran "ceph orch device zap
/dev/xyz --force" to initialize the addition process.
The OSDs that were added originally on that node were laid out  
correctly, but the new ones seem to be ignoring the OSD service spec.

How can we make sure the new devices added are laid out correctly?

thanks,
Wyllys


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Serious cluster issue - Incomplete PGs

2023-01-11 Thread Eugen Block
I can't recall having used the objectstore-tool to mark PGs as  
complete, so I can't really confirm if that will work and if it will  
unblock the stuck requests (I would assume it does). Hopefully someone  
can chime in here.


Zitat von Deep Dish :


Eugen,

I never insinuated that my circumstances resulted from buggy software, and
acknowledged operational missteps.   Let's please leave that there.  Ceph
remains a technology I like and will continue to use.   Our operational
understanding has evolved greatly as a result of current circumstances.

Removed OSDs are gone and not recoverable (i.e. lockbox keys gone, VG
groups removed).

My objective with this post is to validate my understanding of an alternate
data recovery scenario (recovering available, though not complete, data):

1. The Cluster has blocked IO due to incomplete PGs.   Therefore any online
operations on affected pools / images / filesystems are blocked.

# ceph -s

  cluster:
    id:
    health: HEALTH_WARN
            1 hosts fail cephadm check
            cephadm background work is paused
            Reduced data availability: 28 pgs inactive, 28 pgs incomplete
            5 pgs not deep-scrubbed in time
            3 slow ops, oldest one blocked for 347227 sec, daemons [osd.25,osd.50,osd.51] have slow ops.

  services:
    mon: 5 daemons, quorum  (age 8h)
    mgr: (active, since 27m)
    mds: 2/2 daemons up, 3 standby
    osd: 70 osds: 70 up (since 3d), 45 in (since 3d); 24 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   9 pools, 1056 pgs
    objects: 10.64M objects, 40 TiB
    usage:   61 TiB used, 266 TiB / 327 TiB avail
    pgs:     2.652% pgs not active
             1027 active+clean
             24   remapped+incomplete
             4    incomplete
             1    active+clean+scrubbing+deep

2. Since the PGs are incomplete and the supporting data is lost, I found a
documented process that will mark the PGs as complete and unblock IO for the
cluster.   I fully understand that marking PGs that have 0 objects will have no
impact on data integrity; however, those PGs containing objects will suffer
complete data loss for only those affected PGs.
Link:
https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1


Based on the above-referenced link, commands to this effect would mark
incomplete PGs as complete (examples):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op
mark-complete --pgid 2.50
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op
mark-complete --pgid 2.57

3.  My cluster, at present, has a total of 28 incomplete PGs.   Of these, 7
reference approximately 644 GB of now lost / irrecoverable data; the rest
reference 0 objects and 0 bytes (empty).   The cluster holds a total of
61.3T of data, leaving ~60.8T available for recovery.
4.  If I were to mark ALL incomplete PGs as complete, the cluster would be
operable - meaning I could interact with pool images and surviving files on
CephFS pools.
5.  Although data loss may affect the contents of RBD images, these images
could still be mapped (rbd map) and made available for alternate recovery
methods (e.g. dd the contents to a separate volume for use at a recovery
facility, or attempt to read them via recovery tools that understand the
filesystem on those block devices (XFS in this case)).  Lost data would be
equivalent to blocks of zeros in the overall image data stream where data
was lost.
6.  The above could be successful in extracting available / recoverable data.
7.  Upon marking the 2 incomplete PGs affecting the CephFS volume as
complete, CephFS would be accessible minus the affected files.   How would
these files be represented?  (Corrupted, or simply 0 bytes?)
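
For completeness, the way I would double-check the 7-vs-21 split above (i.e.
which incomplete PGs actually reference objects) is something like:

# ceph pg ls incomplete

which, if I'm reading the docs right, lists the object and byte counts per PG.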

Thank you.

Date: Tue, 10 Jan 2023 08:15:31 +
From: Eugen Block 
Subject: [ceph-users] Re: Serious cluster issue - Incomplete PGs
To: ceph-users@ceph.io
Message-ID:
<20230110081531.horde.nfeixevxkbyy6jfymgyb...@webmail.nde.ag>
Content-Type: text/plain; charset=utf-8; format=flowed; DelSp=Yes

Hi,


Backups will be challenging.   I honestly didn't anticipate this kind of
failure with ceph to be possible, we've been using it for several years now
and were encouraged by orchestrator and performance improvements in the 17
code branch.


that's exactly what a backup is for, to be prepared for the
unexpected. Besides the fact that ceph didn't actually fail (you
removed too many OSDs too early), you can't expect bug-free software,
no matter how long it has been running successfully.


- Identifying the pools / images / files that are affected by incomplete
pages;


The PG IDs start with a number which reflects the pool ID in your cluster;
check the output of 'ceph osd pool ls detail'. There's no easy way to
tell which images or files are affected; you can query each OSD and
list a PG's objects, but that doesn't work for missing OSDs/PGs, of
course. I'm not sure how promising it is, but maybe try a for loop
over all rbd images and just execute a 

[ceph-users] Re: adding OSD to orchestrated system, ignoring osd service spec.

2023-01-11 Thread Wyll Ingersoll


Not really, it's on an airgapped/secure network and I cannot copy-and-paste from 
it.  What are you looking for?  This cluster has 720 OSDs across 18 storage 
nodes.
I think we have identified the problem and it may not be a ceph issue, but we 
need to investigate further.  It has something to do with the SSD devices that 
are being ignored - they are slightly different from the other ones.

From: Eugen Block 
Sent: Wednesday, January 11, 2023 3:27 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: adding OSD to orchestrated system, ignoring osd 
service spec.

Hi,

can you share the output of

storage01:~ # ceph orch ls osd

Thanks,
Eugen

Zitat von Wyll Ingersoll :

> When adding a new OSD to a ceph orchestrated system (16.2.9) on a
> storage node that has a specification profile that dictates which
> devices to use as the db_devices (SSDs), the newly added OSDs seem
> to be ignoring the db_devices (there are several available) and
> putting the data and db/wal on the same device.
>
> We installed the new disk (HDD) and then ran "ceph orch device zap
> /dev/xyz --force" to initialize the addition process.
> The OSDs that were added originally on that node were laid out
> correctly, but the new ones seem to be ignoring the OSD service spec.
>
> How can we make sure the new devices added are laid out correctly?
>
> thanks,
> Wyllys
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Permanently ignore some warning classes

2023-01-11 Thread Nicola Mori

Dear Ceph users,

my cluster is built with old hardware on a gigabit network, so I often 
experience warnings like OSD_SLOW_PING_TIME_BACK. These in turn trigger 
alert mails too often, forcing me to disable alerts, which is not 
sustainable. So my question is: is it possible to tell Ceph to ignore 
(or at least to not send alerts for) a given class of warnings?
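
For example, would something like

ceph health mute OSD_SLOW_PING_TIME_BACK --sticky

be the intended mechanism here (I'm not sure whether it is persistent or 
appropriate), or is there a config-level option?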

Thank you,

Nicola


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD crash with "FAILED ceph_assert(v.length() == p->shard_info->bytes)"

2023-01-11 Thread Yu Changyuan
One of my OSDs (the other OSDs are fine) crashed, and trying
"ceph-bluestore-tool fsck" also crashed with the same error. Besides destroying
this OSD and re-creating it, are there any other steps I can take to restore
the OSD?
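
For example, would something like

ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3

be expected to get past this, or will it just hit the same assert? I have not
tried it yet.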

Below is part of the error message:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc:
 3228: FAILED ceph_assert(v.length() == p->shard_info->bytes)

 ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x158) [0x562c19f7f73c]
 2: /usr/bin/ceph-osd(+0x57f956) [0x562c19f7f956]
 3: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned 
int)+0x7cf) [0x562c1a56d1ef]
 4: (BlueStore::_do_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr, unsigned long, unsigned long, 
ceph::buffer::v15_2_0::list&, unsigned int)+0x5dd) [0x562c1a5c5ebd]
 5: (BlueStore::_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr&, unsigned long, unsigned long, 
ceph::buffer::v15_2_0::list&, unsigned int)+0xd1) [0x562c1a5c70e1]
 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ceph::os::Transaction*)+0x2077) [0x562c1a5cb237]
 7: 
(BlueStore::queue_transactions(boost::intrusive_ptr&,
 std::vector >&, 
boost::intrusive_ptr, ThreadPool::TPHandle*)+0x316) [0x562c1a5e66d6]
 8: (non-virtual thunk to 
PrimaryLogPG::queue_transactions(std::vector >&, 
boost::intrusive_ptr)+0x58) [0x562c1a22a878]
 9: (ReplicatedBackend::do_repop(boost::intrusive_ptr)+0xeb0) 
[0x562c1a41cff0]
 10: 
(ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x267) 
[0x562c1a42d357]
 11: (PGBackend::handle_message(boost::intrusive_ptr)+0x52) 
[0x562c1a25dd52]
 12: (PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x5de) [0x562c1a20168e]
 13: (OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309) [0x562c1a088fc9]
 14: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, 
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68) [0x562c1a2e7e78]
 15: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0xc28) [0x562c1a0a64c8]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) 
[0x562c1a7232a4]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x562c1a726184]
 18: /lib64/libpthread.so.0(+0x81ca) [0x7f2d2a4081ca]
 19: clone()

   -10> 2023-01-10T09:28:02.143+ 7f2cff3de700 -1 *** Caught signal
(Aborted) **


And this is the "meta" file of the crash log:

{
"crash_id": 
"2023-01-10T09:28:02.137396Z_a504670d-32c3-46ee-8398-84389c9c2d95",
"timestamp": "2023-01-10T09:28:02.137396Z",
"process_name": "ceph-osd",
"entity_name": "osd.3",
"ceph_version": "16.2.10",
"utsname_hostname": "dskm1-r0",
"utsname_sysname": "Linux",
"utsname_release": "5.18.19",
"utsname_version": "#1-NixOS SMP PREEMPT_DYNAMIC Sun Aug 21 13:18:56 UTC 
2022",
"utsname_machine": "x86_64",
"os_name": "CentOS Stream",
"os_id": "centos",
"os_version_id": "8",
"os_version": "8",
"assert_condition": "v.length() == p->shard_info->bytes",
"assert_func": "void BlueStore::ExtentMap::fault_range(KeyValueDB*, 
uint32_t, uint32_t)",
"assert_file": 
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc",
"assert_line": 3228,
"assert_thread_name": "tp_osd_tp",
"assert_msg": 
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc:
 In function 'void BlueStore::ExtentMap::fault_range(KeyValueDB*, uint32_t, 
uint32_t)' thread 7f2cff3de700 time 
2023-01-10T09:28:02.016735+\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueStore.cc:
 3228: FAILED ceph_assert(v.length() == p->shard_info->bytes)\n",
"backtrace": [
"/lib64/libpthread.so.0(+0x12cf0) [0x7f2d2a412cf0]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1a9) [0x562c19f7f78d]",
"/usr/bin/ceph-osd(+0x57f956) [0x562c19f7f956]",
"(BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned 
int)+0x7cf) [0x562c1a56d1ef]",
"(BlueStore::_do_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr, unsigned long, unsigned long, 
ceph::buffer::v15_2_0::list&, unsigned int)+0x5dd) [0x562c1a5c5ebd]",
"(BlueStore::_write(BlueStore::TransContext*, 

[ceph-users] Re: OSD crash on Onode::put

2023-01-11 Thread Frank Schilder
Hi Anthony and Serkan,

I checked the drive temperatures and there is nothing special about this slot. 
The disks in this slot are from different vendors and were not populated 
incrementally. It might be a very weird coincidence. I seem to have an OSD 
developing this problem in another slot on a different host now. Let's see what 
happens in the future. No reason to turn superstitious :)

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Snap trimming best practice

2023-01-11 Thread Frank Schilder
Hi Istvan,

our experience is the opposite. We put as many PGs in pools as the OSDs can 
manage. We aim for between 100 and 200 PGs per OSD for HDDs and accept more than 
200 for SSDs. The smaller the PGs, the better all internal operations (snaptrim, 
recovery, scrubbing, etc.) work on our cluster.

We had problems with snaptrim on our file system taking more than a day and 
starting to overlap with the next day's snaptrim. After bumping the PG count 
this went away immediately. On a busy day (many TB deleted) a snaptrim takes 
maybe 2 hours on an FS with 3PB data, all on HDD, ca. 160 PGs/OSD.
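
(For what it's worth, bumping the PG count of a pool is just the usual

ceph osd pool set <pool> pg_num <target>

on Nautilus and later; pgp_num should follow along gradually, but verify with 
'ceph osd pool get <pool> pgp_num'.)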

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 11 January 2023 09:06:51
To: Ceph Users
Subject: [ceph-users] Snap trimming best practice

Hi,

I wonder whether you have ever faced issues with snaptrimming if you follow the 
ceph PG allocation recommendation (100 PGs/OSD)?

We have a Nautilus cluster and we are afraid to increase the PGs of the pools, 
because it seems that, even with 4 OSDs per NVMe, the higher the PG number, the 
slower the snaptrimming.

Eg.:

We have these pools:
Db1 pool size 64,504G with 512 PGs
Db2 pool size 92,242G with 256 PGs
Db2 snapshots remove faster than Db1's.

Our OSDs are very underutilized from a PG point of view for this reason; each 
OSD holds a maximum of 25 gigantic PGs, which makes all maintenance very 
difficult due to backfillfull and OSD-full issues.

Any recommendation if you use this feature?

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crash on Onode::put

2023-01-11 Thread Frank Schilder
Hi Dongdong.

> is simple and can be applied cleanly.

I understand this statement from a developer's perspective. Now, try to explain 
to a user with a cephadm-deployed containerized cluster how to build a 
container from source, point cephadm at this container, and what to do for 
the next upgrade. I think "simple" depends on context. Applying a patch to a 
production system is currently an expert operation, I'm afraid.

If you have instructions for building a ceph-container with the patch applied, 
I would be very interested. I was asking for a source container for exactly 
this reason. As far as I can tell from the conversation, this is quite a 
project in itself. The thread was "Re: Building ceph packages in containers? 
[was: Ceph debian/ubuntu packages build]", but I can't find it on the mailing 
list any more. There seems to be an archived version: 
https://www.spinics.net/lists/ceph-users/msg73231.html

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dongdong Tao 
Sent: 11 January 2023 04:30:14
To: Frank Schilder
Cc: Igor Fedotov; ceph-users@ceph.io; cobanser...@gmail.com
Subject: Re: [ceph-users] Re: OSD crash on Onode::put

Hi Frank,

I don't have an operational workaround; the patch 
https://github.com/ceph/ceph/pull/46911/commits/f43f596aac97200a70db7a70a230eb9343018159
 is simple and can be applied cleanly.

Yes, restarting the OSD will clear the pool entries. You can restart it when the 
bluestore_onode items are very low (e.g. fewer than 10) if that really helps, but 
I think you'll need to tune and monitor the performance until you find a 
threshold that is most suitable for your cluster.
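
If it helps, the onode count can be watched via the admin socket, e.g.:

ceph daemon osd.<id> dump_mempools

and looking at the bluestore_cache_onode items (that is the pool I am referring 
to, if I recall the name correctly).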

But it can't help with the crash, since in general, the crash itself is 
basically a restart.

Regards,
Dongdong

On Tue, Jan 10, 2023 at 8:21 PM Serkan Çoban 
mailto:cobanser...@gmail.com>> wrote:
Is slot 19 inside the chassis? Did you check the chassis temperature? I
sometimes see a higher failure rate for HDDs inside the chassis than for
those in the front. In our case it was related to the temperature difference.

On Tue, Jan 10, 2023 at 1:28 PM Frank Schilder 
mailto:fr...@dtu.dk>> wrote:
>
> Following up on my previous post, we have identical OSD hosts. The very 
> strange observation now is, that all outlier OSDs are in exactly the same 
> disk slot on these hosts. We have 5 problematic OSDs and they are all in slot 
> 19 on 5 different hosts. This is an extremely strange and unlikely 
> co-incidence.
>
> Are there any specific conditions for this problem to be present or amplified 
> that could have to do with hardware?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to 
> ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-11 Thread Konstantin Shalygin
Hi,

> On 10 Jan 2023, at 07:10, David Orman  wrote:
> 
> We ship all of this to our centralized monitoring system (and a lot more) and 
> have dashboards/proactive monitoring/alerting with 100PiB+ of Ceph. If you're 
> running Ceph in production, I believe host-level monitoring is critical, 
> above and beyond Ceph level. Things like inlet/outlet temperature, hardware 
> state of various components, and various other details are probably best 
> served by monitoring external to Ceph itself.

I agree with David's suggestions

> 
> I did a quick glance and didn't see this data (OSD errors re: reads/writes) 
> exposed in the Pacific version of Ceph's Prometheus-style exporter, but I may 
> have overlooked it. This would be nice to have, as well, if it does not exist.
> 
> We collect drive counters at the host level, and alert at levels prior to 
> general impact. Even a failing drive can cause latency spikes which are 
> frustrating, before it starts returning errors (correctable errors) - the OSD 
> will not see these other than longer latency on operations. Seeing a change 
> in the smart counters either at a high rate or above thresholds you define is 
> most certainly something I would suggest ensuring is covered in whatever 
> host-level monitoring you're already performing for production usage.

Seems to me that there is no need to reinvent the wheel and create even more 
GIL problems for ceph-mgr. Last year a production-ready exporter for smartctl 
data was released, with NVMe support [1].
Golang, CI, and tested in production with Ceph - ready to go.


[1] https://github.com/prometheus-community/smartctl_exporter
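
On the Prometheus side it only needs a scrape job, roughly like this (adjust 
the target and port to your deployment; 9633 is the default port if I remember 
correctly):

scrape_configs:
  - job_name: 'smartctl'
    static_configs:
      - targets: ['storage01:9633']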
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Intel Cache Solution with HA Cluster on the iSCSI Gateway node

2023-01-11 Thread Kamran Zafar Syed
Hi there,
Is there someone who has some experience implementing the Intel Cache
Accelerator Solution on top of an iSCSI Gateway?
Thanks and Regards,
Koki
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: adding OSD to orchestrated system, ignoring osd service spec.

2023-01-11 Thread Eugen Block

Hi,

can you share the output of

storage01:~ # ceph orch ls osd

Thanks,
Eugen

Zitat von Wyll Ingersoll :

When adding a new OSD to a ceph orchestrated system (16.2.9) on a  
storage node that has a specification profile that dictates which  
devices to use as the db_devices (SSDs), the newly added OSDs seem  
to be ignoring the db_devices (there are several available) and  
putting the data and db/wal on the same device.


We installed the new disk (HDD) and then ran "ceph orch device zap  
/dev/xyz --force" to initialize the addition process.
The OSDs that were added originally on that node were laid out  
correctly, but the new ones seem to be ignoring the OSD service spec.


How can we make sure the new devices added are laid out correctly?

thanks,
Wyllys


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Snap trimming best practice

2023-01-11 Thread Szabo, Istvan (Agoda)
Hi,

I wonder whether you have ever faced issues with snaptrimming if you follow the 
ceph PG allocation recommendation (100 PGs/OSD)?

We have a Nautilus cluster and we are afraid to increase the PGs of the pools, 
because it seems that, even with 4 OSDs per NVMe, the higher the PG number, the 
slower the snaptrimming.

Eg.:

We have these pools:
Db1 pool size 64,504G with 512 PGs
Db2 pool size 92,242G with 256 PGs
Db2 snapshots remove faster than Db1's.

Our OSDs are very underutilized from a PG point of view for this reason; each 
OSD holds a maximum of 25 gigantic PGs, which makes all maintenance very 
difficult due to backfillfull and OSD-full issues.

Any recommendation if you use this feature?

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io