[ceph-users] Re: Question regarding bluestore labels

2024-06-10 Thread Igor Fedotov

Hi Bailey,

yes, this should be doable using the following steps:

1. Copy the very first block 0~4096 from a different OSD to that 
non-working one.


2. Use ceph-bluestore-tool's set-label-key command to modify "osd_uuid" 
at the target OSD.


3. Adjust "size" field at target OSD if DB volume size at target OSD is 
different.
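
For illustration, a minimal sketch of the above, assuming the target OSD is stopped,
/dev/healthy-db and /dev/broken-db are hypothetical paths to a healthy OSD's DB volume
and the target DB volume, and the uuid/size values come from your own cluster:

# 1. copy the 4K bluestore label from a healthy DB volume
dd if=/dev/healthy-db of=/dev/broken-db bs=4096 count=1
# 2. set the target OSD's own uuid in the copied label
ceph-bluestore-tool --dev /dev/broken-db --command set-label-key -k osd_uuid -v <target-osd-uuid>
# 3. fix the size field if the DB volumes differ in size (value in bytes)
ceph-bluestore-tool --dev /dev/broken-db --command set-label-key -k size -v <db-volume-size-in-bytes>
# sanity check
ceph-bluestore-tool --dev /dev/broken-db --command show-label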



Hope this helps.

Thanks,

Igor

On 6/8/2024 3:38 AM, Bailey Allison wrote:

I have a question regarding bluestore labels, specifically for a block.db
partition.

  


To make a long story short, we are currently in a position where, when checking
the label of a block.db partition, it appears corrupted.

  


I have seen another thread on here suggesting copying the label from a
working OSD to the non-working OSD, then re-adding the correct values to the
label with ceph-bluestore-tool.

  


Where this was mentioned it was with an OSD's main device in mind; would the same logic
apply if we were working with a DB device instead? This is assuming the only
issue with the DB is the corrupted label, and there are no other issues.

  


Regards,

  


Bailey

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: A couple OSDs not starting after host reboot

2024-04-05 Thread Igor Fedotov

On 05/04/2024 17:28, xu chenhui wrote:

Hi, Igor

Thank you for providing the repair procedure. I will try it when I am back to 
my workstation. Can you provide any possible reasons for this problem?

Unfortunately no. I recall a few cases like that but I doubt anyone 
knows the root cause.

ceph version: v16.2.5


You'd better upgrade to the latest Pacific release.




error info:
systemd[1]: Started Ceph osd.307 for 02eac9e0-d147-11ee-95de-f0b2b90ee048.
bash[39068]: Running command: /usr/bin/chown -R ceph:ceph 
/var/lib/ceph/osd/ceph-307
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir 
--dev 
/dev/ceph-6b69b64c-7293-4530-9e25-28279308198e/osd-block-81fbaf55-7de2-4f21-97bf-7d79f045ee79
 --path /var/lib/ceph/osd/ceph-307 --no-mon-config
bash[39068]: stderr: failed to read label for 
/dev/ceph-6b69b64c-7293-4530-9e25-28279308198e/osd-block-81fbaf55-7de2-4f21-97bf-7d79f045ee79:
 (2) No such file or directory
bash[39068]: -->  RuntimeError: command returned non-zero exit status: 1

2024-04-03T14:25:24.349+ 7f90206fb3c0 10 
bluestore(/dev/ceph-6b69b64c-7293-4530-9e25-28279308198e/osd-block-81fbaf55-7de2-4f21-97bf-7d79f045ee79)
 _read_bdev_label
2024-04-03T14:25:24.349+ 7f90206fb3c0  2 
bluestore(/dev/ceph-6b69b64c-7293-4530-9e25-28279308198e/osd-block-81fbaf55-7de2-4f21-97bf-7d79f045ee79)
 _read_bdev_label unable to decode label at offset 102: void 
bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) 
decode past end of struct encoding: Malformed input

thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: A couple OSDs not starting after host reboot

2024-04-05 Thread Igor Fedotov

Hi chenhui,

there is still work in progress to support multiple labels to avoid 
the issue (https://github.com/ceph/ceph/pull/55374), but this is of 
little help for your current case.


If your disk is fine (meaning it's able to read/write the block at offset 0) 
you might want to try to recover the label using the label from a different 
OSD sitting on a similar (!! that's important !!) main device. One needs 
to update the osd_uuid, whoami and osd_key fields after copying though. Here 
is the step-by-step procedure:


1. Copy OSD label (4K data block at offset 0) from source OSD's main 
device to the same location on the broken one:


> dd if=<source-osd-main-device> of=<broken-osd-main-device> count=1 bs=4096


2. Learn broken OSD uuid, N denotes broken OSD id:

> ceph report | grep '"osd": N' -A 1
    "osd": N,
    "uuid": "6a4ca4ab-6a43-473c-b09c-b13bdd9def5c",

3. Set obtained uuid to copied OSD osd label

> ceph-bluestore-tool --dev <broken-osd-main-device> --command set-label-key -k osd_uuid -v 6a4ca4ab-6a43-473c-b09c-b13bdd9def5c


4. Update whoami field in the copied label

> ceph-bluestore-tool --dev <broken-osd-main-device> --command set-label-key -k whoami -v N


5. Learn the OSD's key

> ceph auth ls | grep osd.1 -A 2
osd.1
    key: AQDrvg9maKxvKxAAqAzqCeR6y0UqBSVIyDhppg==

6. Update osd_key field in the copied label

> ceph-bluestore-tool --dev <broken-osd-main-device> --command set-label-key -k osd_key -v AQDrvg9maKxvKxAAqAzqCeR6y0UqBSVIyDhppg==


7. Prime OSD dir if it's been lost:

> ceph-bluestore-tool --dev <broken-osd-main-device> --path <osd-dir, e.g. /var/lib/ceph/osd/ceph-N> --command prime-osd-dir



At this point the OSD should be able to start if the corrupted label was the 
only problem.
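
Before starting the OSD it may be worth dumping the repaired label and checking that
osd_uuid, whoami and osd_key now carry the values set above (the device path below is
a placeholder):

> ceph-bluestore-tool --dev <broken-osd-main-device> --command show-label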


Hope this helps,

Igor.


On 05/04/2024 05:50, xu chenhui wrote:

Hi,
Has there been any progress on this issue? Is there a quick recovery method? I 
have the same problem as you: the first 4k block of OSD metadata is invalid. It 
would be a heavy price to recreate the OSD.

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Igor Fedotov

Hi Torkil,

highly likely you're facing a well-known issue with RocksDB performance 
dropping after bulk data removal. The latter might occur at source OSDs 
after PG migration completes.


You might want to use DB compaction (preferably an offline one using 
ceph-kvstore-tool) to get an OSD out of this "degraded" state, or as a 
preventive measure. I'd recommend doing that for all the OSDs right now, 
and once again after rebalancing is completed. This should improve 
things, but unfortunately there is no 100% guarantee.
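
For reference, a sketch of both variants (the OSD id and data path are placeholders,
and the path shown is for a non-containerized OSD; cephadm deployments keep the OSD
directory elsewhere). The offline variant requires the OSD to be stopped:

# offline compaction (OSD stopped)
systemctl stop ceph-osd@123
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-123 compact
systemctl start ceph-osd@123

# online compaction via the admin socket (slower, but no downtime)
ceph tell osd.123 compact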


Also, I'm curious whether you have DB/WAL on fast (SSD or NVMe) drives? This might 
be crucial.



Thanks,

Igor

On 3/22/2024 9:59 AM, Torkil Svensgaard wrote:

Good morning,

Cephadm Reef 18.2.1. We recently added 4 hosts and changed a failure 
domain from host to datacenter which is the reason for the large 
misplaced percentage.


We were seeing some pretty crazy spikes in "OSD Read Latencies" and 
"OSD Write Latencies" on the dashboard. Most of the time everything is 
well but then for periods of time, 1-4 hours, latencies will go to 10+ 
seconds for one or more OSDs. This also happens outside scrub hours 
and it is not the same OSDs every time. The OSDs affected are HDD with 
DB/WAL on NVMe.


Log snippet:

"
...
2024-03-22T06:48:22.859+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.859+ 7fb185b54700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  1 heartbeat_map 
clear_timeout 'OSD::osd_op_tp thread 0x7fb169898700' had timed out 
after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  0 
bluestore(/var/lib/ceph/osd/ceph-112) log_latency slow operation 
observed for submit_transact, latency = 17.716707230s
2024-03-22T06:48:22.880+ 7fb1748ae700  0 
bluestore(/var/lib/ceph/osd/ceph-112) log_latency_fn slow operation 
observed for _txc_committed_kv, latency = 17.732601166s, txc = 
0x55a5bcda0f00
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s

...
"

"
[root@dopey ~]# ceph -s
  cluster:
    id: 8ee2d228-ed21-4580-8bbf-0649f229e21d
    health: HEALTH_WARN
    1 failed cephadm daemon(s)
    Low space hindering backfill (add storage if this doesn't 
resolve itself): 1 pg backfill_toofull


  services:
    mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 3d)
    mgr: jolly.tpgixt(active, since 10d), standbys: dopey.lxajvk, 
lazy.xuhetq

    mds: 1/1 daemons up, 2 standby
    osd: 540 osds: 539 up (since 6m), 539 in (since 15h); 6250 
remapped pgs


  data:
    volumes: 1/1 healthy
    pools:   15 pools, 10849 pgs
    objects: 546.35M objects, 1.1 PiB
    usage:   1.9 PiB used, 2.3 PiB / 4.2 PiB avail
    pgs: 1425479651/3163081036 objects misplaced (45.066%)
 6224 active+remapped+backfill_wait
 4516 active+clean
 67   active+clean+scrubbing
 25   active+remapped+backfilling
 16   active+clean+scrubbing+deep
 1    active+remapped+backfill_wait+backfill_toofull

  io:
    client:   117 MiB/s rd, 68 MiB/s wr, 274 op/s rd, 183 op/s wr
    recovery: 438 MiB/s, 192 objects/s
"

Anyone know what the issue might be? Given that it happens on and off 
with large periods of time in between with normal low latencies, I 
think it unlikely that it is just because the cluster is busy.


Also, how come there's only a small number of PGs doing backfill when 
we have such a large misplaced percentage? Can this be just from a 
backfill reservation logjam?


Mvh.

Torkil


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS space usage

2024-03-20 Thread Igor Fedotov

Thorne,

if that's a bug in Ceph which causes space leakage you might be unable 
to reclaim the space without total purge of the pool.


The problem is that we are still uncertain whether this is a leak or something 
else. Hence the need for more thorough research.



Thanks,

Igor

On 3/20/2024 9:13 PM, Thorne Lawler wrote:


Alexander,

Thanks for explaining this. As I suspected, this is a highly abstract 
pursuit of what caused the problem, and while I'm sure this makes 
sense for Ceph developers, it isn't going to happen in this case.


I don't care how it got this way- the tools used to create this pool 
will never be used in our environment again after I recover this disk 
space - the entire reason I need to recover the missing space is so I 
can move enough filesystems around to remove the current structure and 
the tools that made it.


I only need to get that disk space back. Any analysis I do will be 
solely directed towards achieving that.


Thanks.

On 21/03/2024 3:10 am, Alexander E. Patrakov wrote:

Hi Thorne,

The idea is quite simple. By retesting the leak with a separate pool, 
used by nobody except you, in the case if the leak exists and is 
reproducible (which is not a given), you can definitely pinpoint it 
without giving any chance to the alternate hypothesis "somebody wrote 
some data in parallel". And then, even if the leak is small but 
reproducible, one can say that multiple such events accumulated to 10 
TB of garbage in the original pool.


On Wed, Mar 20, 2024 at 7:29 PM Thorne Lawler  wrote:

Alexander,

I'm happy to create a new pool if it will help, but I don't
presently see how creating a new pool will help us to identify
the source of the 10TB discrepancy in this original cephfs pool.

Please help me to understand what you are hoping to find...?

On 20/03/2024 6:35 pm, Alexander E. Patrakov wrote:

Thorne,

That's why I asked you to create a separate pool. All writes go
to the original pool, and it is possible to see object counts
per-pool.

On Wed, Mar 20, 2024 at 6:32 AM Thorne Lawler
 wrote:

Alexander,

Thank you, but as I said to Igor: The 5.5TB of files on this
filesystem are virtual machine disks. They are under
constant, heavy write load. There is no way to turn this off.

On 19/03/2024 9:36 pm, Alexander E. Patrakov wrote:

Hello Thorne,

Here is one more suggestion on how to debug this. Right now, there is
uncertainty on whether there is really a disk space leak or if
something simply wrote new data during the test.

If you have at least three OSDs you can reassign, please set their
CRUSH device class to something different than before. E.g., "test".
Then, create a new pool that targets this device class and add it to
CephFS. Then, create an empty directory on CephFS and assign this pool
to it using setfattr. Finally, try reproducing the issue using only
files in this directory. This way, you will be sure that nobody else
is writing any data to the new pool.
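
A rough sketch of that test setup (not from the original message; the OSD ids, rule,
pool and directory names are hypothetical, and the filesystem is assumed to be called
"cephfs" and mounted at /mnt/cephfs):

# move three spare OSDs into a dedicated device class
ceph osd crush rm-device-class osd.10 osd.11 osd.12
ceph osd crush set-device-class test osd.10 osd.11 osd.12
# CRUSH rule and pool that only target the "test" class
ceph osd crush rule create-replicated test_rule default host test
ceph osd pool create cephfs.test.data 32 32 replicated test_rule
ceph fs add_data_pool cephfs cephfs.test.data
# pin a fresh directory to the new pool and run the experiment only there
mkdir /mnt/cephfs/leak-test
setfattr -n ceph.dir.layout.pool -v cephfs.test.data /mnt/cephfs/leak-test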

On Tue, Mar 19, 2024 at 5:40 PM Igor Fedotov wrote:

Hi Thorn,

given the amount of files at CephFS volume I presume you don't have
severe write load against it. Is that correct?

If so we can assume that the numbers you're sharing mostly refer to
your experiment. At peak I can see a bytes_used increase of 629,461,893,120
bytes (45978612027392 - 45349150134272). With replica factor = 3 this
roughly matches your written data (200GB I presume?).


More interesting is that after the file's removal we can see a 419,450,880
byte delta (= 45349569585152 - 45349150134272). I could see two options
(apart from someone else writing additional stuff to CephFS during the
experiment) to explain this:

1. File removal wasn't completed at the last probe half an hour after
file's removal. Did you see stale object counter when making that probe?

2. Some space is leaking. If that's the case this could be a reason for
your issue if huge(?) files at CephFS are created/removed periodically.
So if we're certain that the leak really occurred (and option 1. above
isn't the case) it makes sense to run more experiments with
writing/removing a bunch of huge files to the volume to confirm space
leakage.

On 3/18/2024 3:12 AM, Thorne Lawler wrote:

Thanks Igor,

I have tried that, and the number of objects and bytes_used took a
long time to drop, but they seem to have dropped back to almost the
original level:

   * Before creating the file:
   o 3885835 objects
   o 45349150134272 bytes_used
   * After creating the file:
   o 3931663 objects
   o

[ceph-users] Re: CephFS space usage

2024-03-20 Thread Igor Fedotov

Hi Thorne,

unfortunately I'm unaware of any tools high-level enough to easily map 
files to rados objects without a deep understanding of how this works. You 
might want to try the "rados ls" command to get the list of all the objects 
in the cephfs data pool, and then learn how that mapping is performed 
and parse your listing.
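
For what it's worth, a hedged sketch of such a mapping: CephFS names a file's data
objects <inode-in-hex>.<block-index>, so a file's inode gives the object name prefix
(the mount point and file path below are placeholders):

# hex inode of the file = prefix of its RADOS object names
ino_hex=$(printf '%x' "$(stat -c %i /mnt/cephfs/some/file)")
# list that file's objects in the data pool
rados -p cephfs.shared.data ls | grep "^${ino_hex}\."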



Thanks,

Igor

On 3/20/2024 1:30 AM, Thorne Lawler wrote:


Igor,

Those files are VM disk images, and they're under constant heavy use, 
so yes - there *is* constant severe write load against this disk.


Apart from writing more test files into the filesystems, there must be 
Ceph diagnostic tools to describe what those objects are being used 
for, surely?


We're talking about an extra 10TB of space. How hard can it be to 
determine which file those objects are associated with?


On 19/03/2024 8:39 pm, Igor Fedotov wrote:


Hi Thorn,

given the amount of files at CephFS volume I presume you don't have 
severe write load against it. Is that correct?


If so we can assume that the numbers you're sharing mostly refer 
to your experiment. At peak I can see a bytes_used increase of 
629,461,893,120 bytes (45978612027392 - 45349150134272). With 
replica factor = 3 this roughly matches your written data (200GB I 
presume?).



More interesting is that after the file's removal we can see a 
419,450,880 byte delta (= 45349569585152 - 45349150134272). I could 
see two options (apart from someone else writing additional stuff to 
CephFS during the experiment) to explain this:


1. File removal wasn't completed at the last probe half an hour after 
file's removal. Did you see stale object counter when making that probe?


2. Some space is leaking. If that's the case this could be a reason 
for your issue if huge(?) files at CephFS are created/removed 
periodically. So if we're certain that the leak really occurred (and 
option 1. above isn't the case) it makes sense to run more 
experiments with writing/removing a bunch of huge files to the volume 
to confirm space leakage.


On 3/18/2024 3:12 AM, Thorne Lawler wrote:


Thanks Igor,

I have tried that, and the number of objects and bytes_used took a 
long time to drop, but they seem to have dropped back to almost the 
original level:


  * Before creating the file:
  o 3885835 objects
  o 45349150134272 bytes_used
  * After creating the file:
  o 3931663 objects
  o 45924147249152 bytes_used
  * Immediately after deleting the file:
  o 3935995 objects
  o 45978612027392 bytes_used
  * Half an hour after deleting the file:
  o 3886013 objects
  o 45349569585152 bytes_used

Unfortunately, this is all production infrastructure, so there is 
always other activity taking place.


What tools are there to visually inspect the object map and see how 
it relates to the filesystem?


Not sure if there is anything like that at CephFS level but you can 
use rados tool to view objects in cephfs data pool and try to build 
some mapping between them and CephFS file list. Could be a bit tricky 
though.


On 15/03/2024 7:18 pm, Igor Fedotov wrote:

ceph df detail --format json-pretty

--

Regards,

Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard 
ITGOV40172

P +61 499 449 170




--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

--

Regards,

Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170




--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: A

[ceph-users] Re: OSD does not die when disk has failures

2024-03-20 Thread Igor Fedotov

Hi Robert,

I presume the plan was to support handling EIO at upper layers. But 
apparently that hasn't been completed. Or there are some bugs...


Will take a look.


Thanks,

Igor

On 3/19/2024 3:36 PM, Robert Sander wrote:

Hi,

On 3/19/24 13:00, Igor Fedotov wrote:


translating EIO to upper layers rather than crashing an OSD is a 
valid default behavior. One can alter this by setting 
bluestore_fail_eio parameter to true.


What benefit lies in this behavior when in the end client IO stalls?

Regards


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD does not die when disk has failures

2024-03-19 Thread Igor Fedotov

Hi Daniel,

translating EIO to upper layers rather than crashing an OSD is a valid 
default behavior. One can alter this by setting bluestore_fail_eio 
parameter to true.
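
For reference, a minimal example of flipping that setting cluster-wide (a sketch only;
whether it takes effect without an OSD restart is an assumption, so restarting the OSDs
afterwards is the safe option):

ceph config set osd bluestore_fail_eio true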



Thanks,

Igor

On 3/19/2024 2:50 PM, Daniel Schreiber wrote:

Hi,

in our cluster (17.2.6) disks fail from time to time. Block devices 
are HDD, DB devices are NVME. However, the OSD process does not 
reliably die. That leads to blocked client IO for all requests for 
which the OSD with the broken disk is the primary OSD. All pools on 
these OSDs are EC pools (cephfs data or rbd data). Client IO recovers 
if I manually stop the OSD.


It seems like the error was triggered during deep scrub, because the 
cluster reported scrub errors afterwards.



OSD Log:

2024-03-11T20:12:43+01:00 urzceph1-osd05 bash[9695]: debug 
2024-03-11T19:12:43.392+ 7fe4cad3f700  4 rocksdb: (Original Log 
Time 2024/03/11-19:12:43.395747) 
[db/db_impl/db_impl_compaction_flush.cc:2818] Compaction nothing to do
2024-03-11T20:15:58+01:00 urzceph1-osd05 bash[9575]: debug 
2024-03-11T19:15:58.285+ 7f9182765700 -1 bdev(0x55f72b8af800 
/var/lib/ceph/osd/ceph-17/block) _aio_thread got r=-5 ((5) 
Input/output error)
2024-03-11T20:15:58+01:00 urzceph1-osd05 bash[9575]: debug 
2024-03-11T19:15:58.289+ 7f9182765700 -1 bdev(0x55f72b8af800 
/var/lib/ceph/osd/ceph-17/block) _aio_thread translating the error to 
EIO for upper layer
2024-03-11T20:15:58+01:00 urzceph1-osd05 bash[9575]: debug 
2024-03-11T19:15:58.289+ 7f9182765700 -1 bdev(0x55f72b8af800 
/var/lib/ceph/osd/ceph-17/block) _aio_thread got r=-5 ((5) 
Input/output error)
2024-03-11T20:15:58+01:00 urzceph1-osd05 bash[9575]: debug 
2024-03-11T19:15:58.289+ 7f9182765700 -1 bdev(0x55f72b8af800 
/var/lib/ceph/osd/ceph-17/block) _aio_thread translating the error to 
EIO for upper layer
2024-03-11T20:17:02+01:00 urzceph1-osd05 bash[10152]: debug 
2024-03-11T19:17:02.357+ 7fcffadf4700  4 rocksdb: 
[db/db_impl/db_impl_write.cc:1736] [L] New memtable created with log 
file: #73918. Immutable memtables: 0.


Kernel Log:

[Mon Mar 11 20:15:43 2024] ata9.00: exception Emask 0x0 SAct 
0x SErr 0xc action 0x0

[Mon Mar 11 20:15:43 2024] ata9.00: irq_stat 0x4008
[Mon Mar 11 20:15:43 2024] ata9: SError: { CommWake 10B8B }
[Mon Mar 11 20:15:43 2024] ata9.00: failed command: READ FPDMA QUEUED
[Mon Mar 11 20:15:43 2024] ata9.00: cmd 
60/f8:38:60:b2:8e/00:00:37:00:00/40 tag 7 ncq dma 126976 in
    res 
43/40:f0:68:b2:8e/00:00:37:00:00/40 Emask 0x409 (media error) 

[Mon Mar 11 20:15:43 2024] ata9.00: status: { DRDY SENSE ERR }
[Mon Mar 11 20:15:43 2024] ata9.00: error: { UNC }
[Mon Mar 11 20:15:43 2024] ata9: hard resetting link
[Mon Mar 11 20:15:43 2024] ata9: SATA link up 6.0 Gbps (SStatus 133 
SControl 300)

[Mon Mar 11 20:15:43 2024] ata9.00: configured for UDMA/133
[Mon Mar 11 20:15:43 2024] sd 8:0:0:0: [sdj] tag#7 FAILED Result: 
hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
[Mon Mar 11 20:15:43 2024] sd 8:0:0:0: [sdj] tag#7 Sense Key : Medium 
Error [current]
[Mon Mar 11 20:15:43 2024] sd 8:0:0:0: [sdj] tag#7 Add. Sense: 
Unrecovered read error - auto reallocate failed
[Mon Mar 11 20:15:43 2024] sd 8:0:0:0: [sdj] tag#7 CDB: Read(16) 88 00 
00 00 00 00 37 8e b2 60 00 00 00 f8 00 00
[Mon Mar 11 20:15:43 2024] blk_update_request: I/O error, dev sdj, 
sector 932098664 op 0x0:(READ) flags 0x0 phys_seg 29 prio class 0

[Mon Mar 11 20:15:43 2024] ata9: EH complete

Is this expected behavior or a bug? If it is expected how can we keep 
client IO flowing?


Kind regards,

Daniel

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS space usage

2024-03-19 Thread Igor Fedotov

Hi Thorn,

given the amount of files at CephFS volume I presume you don't have 
severe write load against it. Is that correct?


If so we can assume that the numbers you're sharing mostly refer to 
your experiment. At peak I can see a bytes_used increase of 629,461,893,120 
bytes (45978612027392 - 45349150134272). With replica factor = 3 this 
roughly matches your written data (200GB I presume?).



More interesting is that after the file's removal we can see a 419,450,880 
byte delta (= 45349569585152 - 45349150134272). I could see two options 
(apart from someone else writing additional stuff to CephFS during the 
experiment) to explain this:


1. File removal wasn't completed at the last probe half an hour after 
file's removal. Did you see stale object counter when making that probe?


2. Some space is leaking. If that's the case this could be a reason for 
your issue if huge(?) files at CephFS are created/removed periodically. 
So if we're certain that the leak really occurred (and option 1. above 
isn't the case) it makes sense to run more experiments with 
writing/removing a bunch of huge files to the volume to confirm space 
leakage.


On 3/18/2024 3:12 AM, Thorne Lawler wrote:


Thanks Igor,

I have tried that, and the number of objects and bytes_used took a 
long time to drop, but they seem to have dropped back to almost the 
original level:


  * Before creating the file:
  o 3885835 objects
  o 45349150134272 bytes_used
  * After creating the file:
  o 3931663 objects
  o 45924147249152 bytes_used
  * Immediately after deleting the file:
  o 3935995 objects
  o 45978612027392 bytes_used
  * Half an hour after deleting the file:
  o 3886013 objects
  o 45349569585152 bytes_used

Unfortunately, this is all production infrastructure, so there is 
always other activity taking place.


What tools are there to visually inspect the object map and see how it 
relates to the filesystem?


Not sure if there is anything like that at CephFS level but you can use 
rados tool to view objects in cephfs data pool and try to build some 
mapping between them and CephFS file list. Could be a bit tricky though.


On 15/03/2024 7:18 pm, Igor Fedotov wrote:

ceph df detail --format json-pretty

--

Regards,

Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170




--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS space usage

2024-03-15 Thread Igor Fedotov

Hi Thorn,

so the problem is apparently bound to huge file sizes. I presume they're 
split into multiple chunks on the Ceph side, hence producing millions of objects.


And possibly something is wrong with this mapping.

If this pool has no write load at the moment you might want to run the 
following experiment:


1) Put one more huge file into the filesystem, e.g. 200GB in size, and 
note the pool stats (through "ceph df detail --format json-pretty") before 
and after this operation.


2) then remove the file, wait until object count is stabilized (i.e. 
removal is completed) and learn the final stats.


Are there any leaks? What was the stored space (object count) increase in 
the middle of the above procedure?



As it looks like compression is (was?) enabled on the pool in question, 
it's worth using non-compressible data for the experiment, e.g. generate 
it through /dev/urandom.
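
One possible way to run that experiment (a sketch; the mount point, file name and
~200GiB size are placeholders chosen for illustration):

ceph df detail --format json-pretty > before.json
# write incompressible data
dd if=/dev/urandom of=/mnt/cephfs/leak-test.bin bs=4M count=51200 status=progress
ceph df detail --format json-pretty > after-write.json
rm /mnt/cephfs/leak-test.bin
# wait until the pool's object count stops decreasing, then:
ceph df detail --format json-pretty > after-delete.json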



Thanks,

Igor


On 3/15/2024 2:05 AM, Thorne Lawler wrote:


Igor,

Yes. Just a bit.

root@pmx101:/mnt/pve/iso# du -h | wc -l
10
root@pmx101:/mnt/pve/iso# du -h
0   ./snippets
0   ./tmp
257M    ./xcp_nfs_sr/2ba36cf5-291a-17d2-b510-db1a295ce0c2
5.5T    ./xcp_nfs_sr/5aacaebb-4469-96f9-729e-fe45eef06a14
5.5T    ./xcp_nfs_sr
0   ./failover_test
11G ./template/iso
11G ./template
0   ./xcpiso
5.5T    .
root@pmx101:/mnt/pve/iso# du --inodes
1   ./snippets
1   ./tmp
5   ./xcp_nfs_sr/2ba36cf5-291a-17d2-b510-db1a295ce0c2
53  ./xcp_nfs_sr/5aacaebb-4469-96f9-729e-fe45eef06a14
59  ./xcp_nfs_sr
1   ./failover_test
2   ./template/iso
3   ./template
1   ./xcpiso
67  .
root@pmx101:/mnt/pve/iso# rados lssnap -p cephfs.shared.data
0 snaps

What/where are all the other objects?!?

On 15/03/2024 3:36 am, Igor Fedotov wrote:


Thorn,

you might want to assess the number of files on the mounted fs by 
running "du -h | wc". Does it differ drastically from the number of 
objects in the pool = ~3.8 M?


And just in case - please run "rados lssnap -p cephfs.shared.data".


Thanks,

Igor

On 3/14/2024 1:42 AM, Thorne Lawler wrote:


Igor, Etienne, Bogdan,

The system is a four node cluster. Each node has 12 3.8TB SSDs, and 
each SSD is an OSD.


I have not defined any separate DB / WAL devices - this cluster is 
mostly at cephadm defaults.


Everything is currently configured to have x3 replicas.

The system also does various RBD workloads from other pools.

There are no subvolumes and no snapshots on the CephFS volume in 
question.


The CephFS volume I am concerned about is called 'shared'. For the 
purposes of this question I am omitting information about the other 
pools.


[root@san1 ~]# rados df
POOL_NAME              USED  OBJECTS  CLONES    COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED      RD_OPS       RD      WR_OPS       WR  USED COMPR  UNDER COMPR
cephfs.shared.data   41 TiB  3834689       0  11504067                   0        0         0  3219785418  175 TiB  9330001764  229 TiB     7.0 MiB       12 MiB
cephfs.shared.meta  757 MiB       85       0       255                   0        0         0  5306018840   26 TiB  9170232158   24 TiB         0 B          0 B


total_objects    13169948
total_used   132 TiB
total_avail  33 TiB
total_space  166 TiB

[root@san1 ~]# ceph df detail
--- RAW STORAGE ---
CLASS SIZE   AVAIL USED  RAW USED  %RAW USED
ssd    166 TiB  33 TiB  132 TiB   132 TiB  79.82
TOTAL  166 TiB  33 TiB  132 TiB   132 TiB  79.82

--- POOLS ---
POOL                ID  PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
cephfs.shared.meta   3   32  251 MiB  208 MiB   42 MiB       84  752 MiB  625 MiB  127 MiB      0    3.4 TiB            N/A          N/A    N/A         0 B          0 B
cephfs.shared.data   4  512   14 TiB   14 TiB      0 B    3.83M   41 TiB   41 TiB      0 B  79.90    3.4 TiB            N/A          N/A    N/A     7.0 MiB       12 MiB


[root@san1 ~]# ceph osd pool get cephfs.shared.data size
size: 3

...however running 'du' in the root directory of the 'shared' volume 
says:


# du -sh .
5.5T    .

So yeah - 14TB is replicated to 41TB, that's fine, but 14TB is a lot 
more than 5.5TB, so... where is that space going?


On 14/03/2024 2:09 am, Igor Fedotov wrote:

Hi Thorn,

could you please share the output of "ceph df detail" command 
representing the problem?



And please give an overview of your OSD layout - amount of OSDs, 
shared or dedicated DB/WAL, main and DB volume sizes.



Thanks,

Igor


On 3/13/2024 5:58 AM, Thorne Lawler wrote:

Hi everyone!

My Ceph cluster (17.2.6) has a CephFS volume which is showing 41TB 
usage for the data pool, but there are only 5.5TB of files in it. 
There are fewer than 100 files on the filesystem in total, so 
where is all that space going?


How can I analyze my cephfs to understand what is using that 
space, and if possible, how can I reclaim that space?

[ceph-users] Re: CephFS space usage

2024-03-14 Thread Igor Fedotov

Thorn,

you might want to assess the number of files on the mounted fs by running 
"du -h | wc". Does it differ drastically from the number of objects in the 
pool = ~3.8 M?


And just in case - please run "rados lssnap -p cephfs.shared.data".


Thanks,

Igor

On 3/14/2024 1:42 AM, Thorne Lawler wrote:


Igor, Etienne, Bogdan,

The system is a four node cluster. Each node has 12 3.8TB SSDs, and 
each SSD is an OSD.


I have not defined any separate DB / WAL devices - this cluster is 
mostly at cephadm defaults.


Everything is currently configured to have x3 replicas.

The system also does various RBD workloads from other pools.

There are no subvolumes and no snapshots on the CephFS volume in question.

The CephFS volume I am concerned about is called 'shared'. For the 
purposes of this question I am omitting information about the other pools.


[root@san1 ~]# rados df
POOL_NAME              USED  OBJECTS  CLONES    COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED      RD_OPS       RD      WR_OPS       WR  USED COMPR  UNDER COMPR
cephfs.shared.data   41 TiB  3834689       0  11504067                   0        0         0  3219785418  175 TiB  9330001764  229 TiB     7.0 MiB       12 MiB
cephfs.shared.meta  757 MiB       85       0       255                   0        0         0  5306018840   26 TiB  9170232158   24 TiB         0 B          0 B


total_objects    13169948
total_used   132 TiB
total_avail  33 TiB
total_space  166 TiB

[root@san1 ~]# ceph df detail
--- RAW STORAGE ---
CLASS SIZE   AVAIL USED  RAW USED  %RAW USED
ssd    166 TiB  33 TiB  132 TiB   132 TiB  79.82
TOTAL  166 TiB  33 TiB  132 TiB   132 TiB  79.82

--- POOLS ---
POOL                ID  PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
cephfs.shared.meta   3   32  251 MiB  208 MiB   42 MiB       84  752 MiB  625 MiB  127 MiB      0    3.4 TiB            N/A          N/A    N/A         0 B          0 B
cephfs.shared.data   4  512   14 TiB   14 TiB      0 B    3.83M   41 TiB   41 TiB      0 B  79.90    3.4 TiB            N/A          N/A    N/A     7.0 MiB       12 MiB


[root@san1 ~]# ceph osd pool get cephfs.shared.data size
size: 3

...however running 'du' in the root directory of the 'shared' volume says:

# du -sh .
5.5T    .

So yeah - 14TB is replicated to 41TB, that's fine, but 14TB is a lot 
more than 5.5TB, so... where is that space going?


On 14/03/2024 2:09 am, Igor Fedotov wrote:

Hi Thorn,

could you please share the output of "ceph df detail" command 
representing the problem?



And please give an overview of your OSD layout - amount of OSDs, 
shared or dedicated DB/WAL, main and DB volume sizes.



Thanks,

Igor


On 3/13/2024 5:58 AM, Thorne Lawler wrote:

Hi everyone!

My Ceph cluster (17.2.6) has a CephFS volume which is showing 41TB 
usage for the data pool, but there are only 5.5TB of files in it. 
There are fewer than 100 files on the filesystem in total, so where 
is all that space going?


How can I analyze my cephfs to understand what is using that space, 
and if possible, how can I reclaim that space?


Thank you.


--

Regards,

Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170




--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS space usage

2024-03-13 Thread Igor Fedotov

Hi Thorn,

could you please share the output of "ceph df detail" command 
representing the problem?



And please give an overview of your OSD layout - amount of OSDs, shared 
or dedicated DB/WAL, main and DB volume sizes.



Thanks,

Igor


On 3/13/2024 5:58 AM, Thorne Lawler wrote:

Hi everyone!

My Ceph cluster (17.2.6) has a CephFS volume which is showing 41TB 
usage for the data pool, but there are only 5.5TB of files in it. 
There are fewer than 100 files on the filesystem in total, so where is 
all that space going?


How can I analyze my cephfs to understand what is using that space, 
and if possible, how can I reclaim that space?


Thank you.


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluestore_min_alloc_size and bluefs_shared_alloc_size

2024-03-13 Thread Igor Fedotov

Hi Joel,

generally speaking you need OSD redeployment to apply the 64K to 4K 
min_alloc_size downgrade for the block device only. Other improvements 
(including support for 4K units in BlueFS) are applied to existing OSDs 
automatically when the relevant Ceph release is installed.


So yes - if you have octopus deployed OSDs - you need to redeploy them 
with new default settings.


The latest Pacific minor release (v16.2.15) has got all the "space 
allocation" improvements I'm aware of. Quincy is one step behind, as the 
changes brought by https://github.com/ceph/ceph/pull/54877 haven't been 
published yet - they will come in the next minor Quincy release.


One of the bugs (https://tracker.ceph.com/issues/63618) fixed by the 
above PR could be of particular interest for you - that's a pretty 
severe issue for legacy deployed OSDs which pops up after upgrade to 
pacific coupled with custom bluefs_shared_alloc_size setting (<64K). 
Just for you to be aware and as an example of why I discourage everyone 
from using custom settings ;)



Thanks,

Igor

On 3/12/2024 7:45 PM, Joel Davidow wrote:

Hi Igor,

Thanks, that's very helpful.

So in this case the Ceph developers recommend that all osds originally 
built under octopus be redeployed with default settings and that 
default settings continue to be used going forward. Is that correct?


Thanks for your assistance,
Joel


On Tue, Mar 12, 2024 at 4:13 AM Igor Fedotov  
wrote:


Hi Joel,

my primary statement would be - do not adjust "alloc size"
settings on your own and use default values!

We've had pretty long and convoluted evolution of this stuff so
tuning recommendations and their aftermaths greatly depend on the
exact Ceph version. While using improper settings could result in
severe performance impact and even data loss.

Current state-of-the-arts is that we support minimal allocation
size at 4K for everything : both HDDs and SSDs, user and bluefs
data. Effective bluefs_shared_alloc_size (i.e. allocation unit we
generally use when BlueFS allocates space for DB [meta]data) is at
64K but BlueFS can fallback to 4K allocations on its own if main
disk space fragmentation is high. Higher base value (=64K)
generally provides less overhead for both performance and metadata
mem/disk footprint. This approach shouldn't be applied to OSDs
which run legacy Ceph versions though. They could lack proper
support for some aspects of this stuff.

Using legacy 64K min allocation size for block device (aka
bfm_bytes_per_block) can sometimes result in a significant space
waste - then one should upgrade to a version which supports 4K
alloc unit and redeploy legacy OSDs. Again with no custom tunings
for both new or old OSDs.

So in short your choice should be: upgrade, redeploy with default
settings if needed and keep using defaults.


Hope this helps.

Thanks,

Igor

On 29/02/2024 01:55, Joel Davidow wrote:

Summary
--
The relationship of the values configured for bluestore_min_alloc_size and 
bluefs_shared_alloc_size are reported to impact space amplification, partial 
overwrites in erasure coded pools, and storage capacity as an osd becomes more 
fragmented and/or more full.


Previous discussions including this topic

comment #7 in bug 63618 in Dec 2023 
-https://tracker.ceph.com/issues/63618#note-7

pad writeup related to bug 62282 likely from late 2023 
-https://pad.ceph.com/p/RCA_62282

email sent 13 Sept 2023 in mail list discussion of cannot create new osd 
-https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/5M4QAXJDCNJ74XVIBIFSHHNSETCCKNMC/

comment #9 in bug 58530 likely from early 2023 
-https://tracker.ceph.com/issues/58530#note-9

email sent 30 Sept 2021 in mail list discussion of flapping osds 
-https://www.mail-archive.com/ceph-users@ceph.io/msg13072.html

email sent 25 Feb 2020 in mail list discussion of changing allocation size 
-https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/B3DGKH6THFGHALLX6ATJ4GGD4SVFNEKU/


Current situation
-
We have three Ceph clusters that were originally built via cephadm on 
octopus and later upgraded to pacific. All osds are HDD (will be moving to 
wal+db on SSD) and were resharded after the upgrade to enable rocksdb sharding.

The value for bluefs_shared_alloc_size has remained unchanged at 65535.

The value for bluestore_min_alloc_size_hdd was 65535 in octopus but is reported 
as 4096 by ceph daemon osd. config show in pacific. However, the osd label 
after upgrading to pacific retains the value of 65535 for bfm_bytes_per_block. 
BitmapFreelistManager.h in Ceph source code 
(src/os/bluestore/BitmapFreelistManager.h) indicates that bytes_per_block is 
bdev_block_size.  This indicates that the physical layout of the osd has not changed 
from 65535 despite

[ceph-users] Re: bluestore_min_alloc_size and bluefs_shared_alloc_size

2024-03-12 Thread Igor Fedotov

Hi Joel,

my primary statement would be - do not adjust "alloc size" settings on 
your own and use default values!


We've had pretty long and convoluted evolution of this stuff so tuning 
recommendations and their aftermaths greatly depend on the exact Ceph 
version. While using improper settings could result in severe 
performance impact and even data loss.


The current state of the art is that we support a minimal allocation size of 
4K for everything: both HDDs and SSDs, user and bluefs data. The effective 
bluefs_shared_alloc_size (i.e. the allocation unit we generally use when 
BlueFS allocates space for DB [meta]data) is at 64K, but BlueFS can 
fall back to 4K allocations on its own if main disk space fragmentation 
is high. The higher base value (=64K) generally provides less overhead for 
both performance and metadata mem/disk footprint. This approach 
shouldn't be applied to OSDs which run legacy Ceph versions though. They 
could lack proper support for some aspects of this stuff.


Using legacy 64K min allocation size for block device (aka 
bfm_bytes_per_block) can sometimes result in a significant space waste - 
then one should upgrade to a version which supports 4K alloc unit and 
redeploy legacy OSDs. Again with no custom tunings for both new or old 
OSDs.


So in short your choice should be: upgrade, redeploy with default 
settings if needed and keep using defaults.
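
To check what an existing OSD was actually deployed with, one can compare the value
recorded in its main-device label with the currently configured one (a hedged example;
the OSD id and path are placeholders and the exact set of label keys varies by release):

ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0 | grep bfm_bytes_per_block
ceph daemon osd.0 config show | grep bluestore_min_alloc_size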



Hope this helps.

Thanks,

Igor

On 29/02/2024 01:55, Joel Davidow wrote:

Summary
--
The relationship of the values configured for bluestore_min_alloc_size and 
bluefs_shared_alloc_size are reported to impact space amplification, partial 
overwrites in erasure coded pools, and storage capacity as an osd becomes more 
fragmented and/or more full.


Previous discussions including this topic

comment #7 in bug 63618 in Dec 2023 
-https://tracker.ceph.com/issues/63618#note-7

pad writeup related to bug 62282 likely from late 2023 
-https://pad.ceph.com/p/RCA_62282

email sent 13 Sept 2023 in mail list discussion of cannot create new osd 
-https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/5M4QAXJDCNJ74XVIBIFSHHNSETCCKNMC/

comment #9 in bug 58530 likely from early 2023 
-https://tracker.ceph.com/issues/58530#note-9

email sent 30 Sept 2021 in mail list discussion of flapping osds 
-https://www.mail-archive.com/ceph-users@ceph.io/msg13072.html

email sent 25 Feb 2020 in mail list discussion of changing allocation size 
-https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/B3DGKH6THFGHALLX6ATJ4GGD4SVFNEKU/


Current situation
-
We have three Ceph clusters that were originally built via cephadm on octopus 
and later upgraded to pacific. All osds are HDD (will be moving to wal+db on 
SSD) and were resharded after the upgrade to enable rocksdb sharding.

The value for bluefs_shared_alloc_size has remained unchanged at 65535.

The value for bluestore_min_alloc_size_hdd was 65535 in octopus but is reported as 
4096 by ceph daemon osd. config show in pacific. However, the osd label 
after upgrading to pacific retains the value of 65535 for bfm_bytes_per_block. 
BitmapFreelistManager.h in Ceph source code 
(src/os/bluestore/BitmapFreelistManager.h) indicates that bytes_per_block is 
bdev_block_size.  This indicates that the physical layout of the osd has not changed 
from 65535 despite the return of the ceph dameon command reporting it as 4096. This 
interpretation is supported by the Minimum Allocation Size part of the Bluestore 
configuration reference for quincy 
(https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#minimum-allocation-size)
Questions
--
What are the pros and cons of the following three cases with two variations per 
case - when using co-located wal+db on HDD and when using separate wal+db on 
SSD:
1) bluefs_shared_alloc_size, bluestore_min_alloc_size, and bfm_bytes_per_block 
all equal2) bluefs_shared_alloc_size greater than but a multiple of 
bluestore_min_alloc_size with bfm_bytes_per_block equal to 
bluestore_min_alloc_size
3) bluefs_shared_alloc_size greater than but a multiple of 
bluestore_min_alloc_size with bfm_bytes_per_block equal to 
bluefs_shared_alloc_size

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-01 Thread Igor Fedotov

I played with this feature a while ago and recall it had a visible 
negative impact on user operations due to the need to submit tons of 
discard operations - effectively each data overwrite operation triggers 
one or more discard operation submissions to disk.


And I doubt this has been widely used if any.

Nevertheless recently we've got a PR to rework some aspects of thread 
management for this stuff, see https://github.com/ceph/ceph/pull/55469


The author claimed they needed this feature for their cluster so you 
might want to ask him about their user experience.



W.r.t documentation - actually there are just two options

- bdev_enable_discard - enables issuing discard to disk

- bdev_async_discard - instructs whether discard requests are issued 
synchronously (along with disk extents release) or asynchronously (using 
a background thread).
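
For completeness, a minimal example of enabling both (a sketch only; given the
performance caveat above, test on a single OSD first, and assume an OSD restart is
needed for the change to fully apply):

ceph config set osd bdev_enable_discard true
ceph config set osd bdev_async_discard true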


Thanks,

Igor

On 01/03/2024 13:06, jst...@proxforge.de wrote:
Is there any update on this? Has someone tested the option and do they have 
performance values from before and after?

Is there any good documentation regarding this option?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: How can I clone data from a faulty bluestore disk?

2024-02-02 Thread Igor Fedotov

Hi Carl,

you might want to use ceph-objectstore-tool to export PGs from faulty 
OSDs and import them back to healthy ones.


The process could be quite tricky though.
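
A rough outline of that export/import flow (a sketch only; the PG id, OSD ids and
data paths are placeholders, and both OSDs must be stopped while the tool runs):

# on the host with the faulty OSD: list its PGs, then export one
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list-pgs
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 3.1a --op export --file /tmp/pg3.1a.export
# on a host with a healthy OSD: import the exported PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --op import --file /tmp/pg3.1a.export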

There is also a pending PR (https://github.com/ceph/ceph/pull/54991) to 
make the tool more tolerant of disk errors.


The patch is worth trying in some cases, though it is not a silver bullet.

And generally, whether the recovery is doable greatly depends on the actual 
error(s).



Thanks,

Igor

On 02/02/2024 19:03, Carl J Taylor wrote:

Hi,
I have a small cluster with some faulty disks within it and I want to clone
the data from the faulty disks onto new ones.

The cluster is currently down and I am unable to do things like
ceph-bluestore-tool fsck, but ceph-bluestore-tool bluefs-export does appear to
be working.

Any help would be appreciated

Many thanks
Carl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: Stuck in upgrade process to reef

2024-01-17 Thread Igor Fedotov

Hi Jan,

w.r.t. osd.0 - if this is the only occurrence then I'd propose simply 
redeploying the OSD. This looks like some BlueStore metadata inconsistency 
which could have occurred long before the upgrade. Likely the upgrade just 
revealed the issue. And honestly I can hardly imagine how to 
investigate it at this point.


Let's see how further upgrades go and come back to this question if more 
similar issues pop up.


Meanwhile I'd recommend running fsck for every OSD prior to the upgrade to 
get a clear understanding of whether its metadata is consistent or not.
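
For example (a sketch; the OSD must be stopped and the path is a placeholder for a
non-containerized OSD directory):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
# or a deeper, slower check:
ceph-bluestore-tool fsck --deep 1 --path /var/lib/ceph/osd/ceph-0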


This way - if it occurs once again - we can prove/disprove my statement 
above about the issue being unrelated to upgrades.



Thanks,

Igor

On 17/01/2024 15:07, Jan Marek wrote:

Hi Igor,

many thanks for advice!

I've tried to start osd.1 and it started already, now it's
resynchronizing data.

I will start daemons one-by-one.

What do you mean about osd.0, which has a problem with
bluestore fsck? Is there a way to repair it?

Sincerely
Jan


Dne Út, led 16, 2024 at 08:15:03 CET napsal(a) Igor Fedotov:

Hi Jan,

I've just filed an upstream ticket for your case, see
https://tracker.ceph.com/issues/64053 for more details.


You might want to tune (or preferably just remove) your custom
bluestore_cache_.*_ratio settings to fix the issue.

This is reproducible and fixable in my lab this way.

Hope this helps.


Thanks,

Igor


On 15/01/2024 12:54, Jan Marek wrote:

Hi Igor,

I've tried to start the ceph-osd daemon as you advised me and I'm
sending the log osd.1.start.log

About memory: According to 'top' the podman ceph daemon doesn't reach
2% of the whole server memory (64GB)...

I have switch on autotune of memory...

My ceph config dump - see attached dump.txt

Sincerely
Jan Marek

Dne Čt, led 11, 2024 at 04:02:02 CET napsal(a) Igor Fedotov:

Hi Jan,

unfortunately this wasn't very helpful. Moreover the log looks a bit messy -
looks like a mixture of outputs from multiple running instances or
something. I'm not an expert in using containerized setups though.

Could you please simplify things by running the ceph-osd process manually like
you did for ceph-objectstore-tool, and enforce log output to a file. The command
line should look somewhat like the following:

ceph-osd -i 0 --log-to-file --log-file <path-to-log-file> --debug-bluestore 5/20
--debug-prioritycache 10

Please don't forget to run repair prior to that.


Also you haven't answered my questions about custom [memory] settings and
RAM usage during OSD startup. It would be nice to hear some feedback.


Thanks,

Igor

On 11/01/2024 16:47, Jan Marek wrote:

Hi Igor,

I've tried to start osd.1 with debug_prioritycache and
debug_bluestore 5/20, see attached file...

Sincerely
Jan

Dne St, led 10, 2024 at 01:03:07 CET napsal(a) Igor Fedotov:

Hi Jan,

indeed this looks like some memory allocation problem - may be OSD's RAM
usage threshold reached or something?

Curious if you have any custom OSD settings or may be any memory caps for
Ceph containers?

Could you please set debug_bluestore to 5/20 and debug_prioritycache to 10
and try to start OSD once again. Please monitor process RAM usage along the
process and share the resulting log.


Thanks,

Igor

On 10/01/2024 11:20, Jan Marek wrote:

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2024-01-16 Thread Igor Fedotov

Hi Jan,

I've just fired an upstream ticket for your case, see 
https://tracker.ceph.com/issues/64053 for more details.



You might want to tune (or preferably just remove) your custom 
bluestore_cache_.*_ratio settings to fix the issue.


This is reproducible and fixable in my lab this way.
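
For example, assuming the ratios were set via the mon config database, 
dropping them could look like this (adjust to whichever 
bluestore_cache_*_ratio options 'ceph config dump | grep bluestore_cache' 
actually shows):

ceph config rm osd bluestore_cache_meta_ratio
ceph config rm osd bluestore_cache_kv_ratio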

Hope this helps.


Thanks,

Igor


On 15/01/2024 12:54, Jan Marek wrote:

Hi Igor,

I've tried to start the ceph-osd daemon as you advised me and I'm
sending the log osd.1.start.log

About memory: According to 'top' podman ceph daemon don't reach
2% of whole server memory (64GB)...

I have switch on autotune of memory...

My ceph config dump - see attached dump.txt

Sincerely
Jan Marek

Dne Čt, led 11, 2024 at 04:02:02 CET napsal(a) Igor Fedotov:

Hi Jan,

unfortunately this wasn't very helpful. Moreover the log looks a bit messy -
looks like a mixture of outputs from multiple running instances or
something. I'm not an expert in using containerized setups though.

Could you please simplify things by running ceph-osd process manually like
you did for ceph-objectstore-tool. And enforce log output to a file. Command
line should look somewhat the following:

ceph-osd -i 0 --log-to-file --log-file  --debug-bluestore 5/20
--debug-prioritycache 10

Please don't forget to run repair prior to that.


Also you haven't answered my questions about custom [memory] settings and
RAM usage during OSD startup. It would be nice to hear some feedback.


Thanks,

Igor

On 11/01/2024 16:47, Jan Marek wrote:

Hi Igor,

I've tried to start osd.1 with debug_prioritycache and
debug_bluestore 5/20, see attached file...

Sincerely
Jan

Dne St, led 10, 2024 at 01:03:07 CET napsal(a) Igor Fedotov:

Hi Jan,

indeed this looks like some memory allocation problem - may be OSD's RAM
usage threshold reached or something?

Curious if you have any custom OSD settings or may be any memory caps for
Ceph containers?

Could you please set debug_bluestore to 5/20 and debug_prioritycache to 10
and try to start OSD once again. Please monitor process RAM usage along the
process and share the resulting log.


Thanks,

Igor

On 10/01/2024 11:20, Jan Marek wrote:

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific bluestore_volume_selection_policy

2024-01-11 Thread Igor Fedotov
back*)+0x135e) [0x55d51a43d96c]
 16: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
rocksdb::WriteBatch*)+0x5d) [0x55d51a43c56f]
 17: (RocksDBStore::submit_common(rocksdb::WriteOptions&, 
std::shared_ptr)+0x85) [0x55d51a388635]
 18: 
(RocksDBStore::submit_transaction_sync(std::shared_ptr)+0x9b) 
[0x55d51a38904b]

 19: (BlueStore::_kv_sync_thread()+0x22bc) [0x55d519e016dc]
 20: (BlueStore::KVSyncThread::entry()+0x11) [0x55d519e2de71]
 21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f490cf23609]
 22: clone()
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.



On Jan 10, 2024, at 12:06 PM, Igor Fedotov  wrote:

Hi Reed,

it looks to me like your settings aren't effective. You might want to 
check OSD log rather than crash info and see the assertion's backtrace.


Does it mention RocksDBBlueFSVolumeSelector as the one in 
https://tracker.ceph.com/issues/53906:


ceph version 17.0.0-10229-g7e035110 (7e035110784fba02ba81944e444be9a36932c6a3) 
quincy (dev)
  1: /lib64/libpthread.so.0(+0x12c20) [0x7f2beb318c20]
  2: gsignal()
  3: abort()
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1b0) [0x56347eb33bec]
  5: /usr/bin/ceph-osd(+0x5d5daf) [0x56347eb33daf]
  6: (RocksDBBlueFSVolumeSelector::add_usage(void*, bluefs_fnode_t const&)+0) 
[0x56347f1f7d00]
  7: (BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned 
long)+0x735) [0x56347f295b45]


If so - then there is still a mess with proper parameter changes.

Thanks
Igor

On 10/01/2024 20:13, Reed Dier wrote:

Well, sadly, that setting doesn’t seem to resolve the issue.

I set the value in ceph.conf for the OSDs with small WAL/DB devices that keep 
running into the issue,


$  ceph tell osd.12 config show | grep bluestore_volume_selection_policy
 "bluestore_volume_selection_policy": "rocksdb_original",
$ ceph crash info 2024-01-10T16:39:05.925534Z_f0c57ca3-b7e6-4511-b7ae-5834541d6c67 | 
egrep "(assert_condition|entity_name)"
 "assert_condition": "cur >= p.length",
 "entity_name": "osd.12",

So, I guess that configuration item doesn’t in fact prevent the crash as was 
purported.
Looks like I may need to fast track moving to quincy…

Reed
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2024-01-11 Thread Igor Fedotov

Hi Jan,

unfortunately this wasn't very helpful. Moreover the log looks a bit 
messy - looks like a mixture of outputs from multiple running instances 
or something. I'm not an expert in using containerized setups though.


Could you please simplify things by running the ceph-osd process manually 
like you did for ceph-objectstore-tool, and enforce log output to a 
file. The command line should look somewhat like the following:


ceph-osd -i 0 --log-to-file --log-file  --debug-bluestore 
5/20 --debug-prioritycache 10


Please don't forget to run repair prior to that.


Also you haven't answered my questions about custom [memory] settings 
and RAM usage during OSD startup. It would be nice to hear some feedback.



Thanks,

Igor

On 11/01/2024 16:47, Jan Marek wrote:

Hi Igor,

I've tried to start osd.1 with debug_prioritycache and
debug_bluestore 5/20, see attached file...

Sincerely
Jan

Dne St, led 10, 2024 at 01:03:07 CET napsal(a) Igor Fedotov:

Hi Jan,

indeed this looks like some memory allocation problem - may be OSD's RAM
usage threshold reached or something?

Curious if you have any custom OSD settings or may be any memory caps for
Ceph containers?

Could you please set debug_bluestore to 5/20 and debug_prioritycache to 10
and try to start OSD once again. Please monitor process RAM usage along the
process and share the resulting log.


Thanks,

Igor

On 10/01/2024 11:20, Jan Marek wrote:

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific bluestore_volume_selection_policy

2024-01-10 Thread Igor Fedotov

Hi Reed,

it looks to me like your settings aren't effective. You might want to 
check OSD log rather than crash info and see the assertion's backtrace.


Does it mention RocksDBBlueFSVolumeSelector as the one in 
https://tracker.ceph.com/issues/53906:


ceph version 17.0.0-10229-g7e035110 (7e035110784fba02ba81944e444be9a36932c6a3) 
quincy (dev)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f2beb318c20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1b0) [0x56347eb33bec]
 5: /usr/bin/ceph-osd(+0x5d5daf) [0x56347eb33daf]
 6: (RocksDBBlueFSVolumeSelector::add_usage(void*, bluefs_fnode_t const&)+0) 
[0x56347f1f7d00]
 7: (BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned 
long)+0x735) [0x56347f295b45]


If so - then there is still a mess with proper parameter changes.

Thanks
Igor

On 10/01/2024 20:13, Reed Dier wrote:

Well, sadly, that setting doesn’t seem to resolve the issue.

I set the value in ceph.conf for the OSDs with small WAL/DB devices that keep 
running into the issue,


$  ceph tell osd.12 config show | grep bluestore_volume_selection_policy
 "bluestore_volume_selection_policy": "rocksdb_original",
$ ceph crash info 2024-01-10T16:39:05.925534Z_f0c57ca3-b7e6-4511-b7ae-5834541d6c67 | 
egrep "(assert_condition|entity_name)"
 "assert_condition": "cur >= p.length",
 "entity_name": "osd.12",


So, I guess that configuration item doesn’t in fact prevent the crash as was 
purported.
Looks like I may need to fast track moving to quincy…

Reed


On Jan 8, 2024, at 9:47 AM, Reed Dier  wrote:

I ended up setting it in ceph.conf which appears to have worked (as far as I 
can tell).


[osd]
bluestore_volume_selection_policy = rocksdb_original
$ ceph config show osd.0  | grep bluestore_volume_selection_policy
bluestore_volume_selection_policy   rocksdb_original  file  
(mon[rocksdb_original])

So far so good…

Reed


On Jan 8, 2024, at 2:04 AM, Eugen Block <ebl...@nde.ag> wrote:

Hi,

I just did the same in my lab environment and the config got applied to the 
daemon after a restart:

pacific:~ # ceph tell osd.0 config show | grep bluestore_volume_selection_policy
"bluestore_volume_selection_policy": "rocksdb_original",

This is also a (tiny single-node) cluster running 16.2.14. Maybe you have some 
typo or something while doing the loop? Have you tried to set it for one OSD 
only and see if it starts with the config set?


Zitat von Reed Dier <reed.d...@focusvq.com>:


After ~3 uneventful weeks after upgrading from 15.2.17 to 16.2.14 I’ve started seeing OSD 
crashes with "cur >= fnode.size” and "cur >= p.length”, which seems to be 
resolved in the next point release for pacific later this month, but until then, I’d love to 
keep the OSDs from flapping.


$ for crash in $(ceph crash ls | grep osd | awk '{print $1}') ; do ceph crash info $crash 
| egrep "(assert_condition|crash_id)" ; done
"assert_condition": "cur >= fnode.size",
"crash_id": 
"2024-01-03T09:07:55.698213Z_348af2d3-d4a7-4c27-9f71-70e6dc7c1af7",
"assert_condition": "cur >= p.length",
"crash_id": 
"2024-01-03T14:21:55.794692Z_4557c416-ffca-4165-aa91-d63698d41454",
"assert_condition": "cur >= fnode.size",
"crash_id": 
"2024-01-03T22:53:43.010010Z_15dc2b2a-30fb-4355-84b9-2f9560f08ea7",
"assert_condition": "cur >= p.length",
"crash_id": 
"2024-01-04T02:34:34.408976Z_2954a2c2-25d2-478e-92ad-d79c42d3ba43",
"assert_condition": "cur2 >= p.length",
"crash_id": 
"2024-01-04T21:57:07.100877Z_12f89c2c-4209-4f5a-b243-f0445ba629d2",
"assert_condition": "cur >= p.length",
"crash_id": 
"2024-01-05T00:35:08.561753Z_a189d967-ab02-4c61-bf68-1229222fd259",
"assert_condition": "cur >= fnode.size",
"crash_id": 
"2024-01-05T04:11:48.625086Z_a598cbaf-2c4f-4824-9939-1271eeba13ea",
"assert_condition": "cur >= p.length",
"crash_id": 
"2024-01-05T13:49:34.911210Z_953e38b9-8ae4-4cfe-8f22-d4b7cdf65cea",
"assert_condition": "cur >= p.length",
"crash_id": 
"2024-01-05T13:54:25.732770Z_4924b1c0-309c-4471-8c5d-c3aaea49166c",
"assert_condition": "cur >= p.length",
"crash_id": 
"2024-01-05T16:35:16.485416Z_0bca3d2a-2451-4275-a049-a65c58c1aff1”,

As noted in
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/YNJ35HXN4HXF4XWB6IOZ2RKXX7EQCEIY/


You can apparently work around the issue by setting
'bluestore_volume_selection_policy' config parameter to rocksdb_original.

However, after trying to set that parameter with `ceph config set osd.$osd 
bluestore_volume_selection_policy rocksdb_original` it doesn’t appear to set?


$ ceph config show-with-defaults osd.0 | grep bluestore_volume_selection_policy

[ceph-users] Re: Stuck in upgrade process to reef

2024-01-10 Thread Igor Fedotov

Hi Jan,

indeed this looks like some memory allocation problem - may be OSD's RAM 
usage threshold reached or something?


Curious if you have any custom OSD settings or may be any memory caps 
for Ceph containers?


Could you please set debug_bluestore to 5/20 and debug_prioritycache to 
10 and try to start OSD once again. Please monitor process RAM usage 
along the process and share the resulting log.
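
If exporting CEPH_ARGS inside the container is inconvenient, persisting 
the levels in the mon config store before the startup attempt should work 
as well, e.g.:

ceph config set osd.1 debug_bluestore 5/20
ceph config set osd.1 debug_prioritycache 10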



Thanks,

Igor

On 10/01/2024 11:20, Jan Marek wrote:

Hi Igor,

I've tried to repair osd.1 with command:

ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command repair

and then start osd.1 ceph-osd podman service.

It semms, that there is problem with memory allocation, see
attached log...

Sincerely
Jan

Dne Út, led 09, 2024 at 02:23:32 CET napsal(a) Igor Fedotov:

Hi Marek,

I haven't looked through those upgrade logs yet but here are some comments
regarding last OSD startup attempt.

First of answering your question


_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)
Is it a mandatory part of fsck?

This is caused by previous non-graceful OSD process shutdown. BlueStore is 
unable to find up-to-date allocation map and recovers it from RocksDB. And 
since fsck is a read-only procedure the recovered allocmap is not saved - hence 
all the following BlueStore startups (within fsck or OSD init) cause another 
rebuild attempt. To avoid that you might want to run repair instead of fsck - 
this will persist up-to-date allocation map and avoid its rebuilding on the 
next startup. This will work till the next non-graceful shutdown only - hence 
unsuccessful OSD attempt might break the allocmap state again.

Secondly - looking at OSD startup log one can see that actual OSD log ends with 
that allocmap recovery as well:


2024-01-09T11:25:30.718449+01:00 osd1 ceph-osd[1734062]: 
bluestore(/var/lib/ceph/osd/ceph-1) _init_alloc::NCB::restore_allocator() 
failed! Run Full Recovery from ONodes (might take a while) ...

Subsequent log line indicating OSD daemon termination is from systemd:

2024-01-09T11:25:33.516258+01:00 osd1 systemd[1]: Stopping 
ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service - Ceph osd.1 for 
2c565e24-7850-47dc-a751-a6357cbbaf2a...

And honestly these lines provide almost no clue why termination happened. No 
obvious OSD failures or something are shown. Perhaps containerized environment 
hides the details e.g. by cutting off OSD log's tail.
So you might want to proceed the investigation by running repair prior to 
starting the OSD as per above. This will result in no alloc map recovery and 
hopefully workaround the problem during startup - if the issue is caused by 
allocmap recovery.
Additionally you might want to increase debug_bluestore log level for osd.1 
before starting it up to get more insight on what's happening.

Alternatively you might want to play with OSD log target settings to write 
OSD.1 log to some file rather than using system wide logging infra - hopefully 
this will be more helpful.

Thanks,
Igor

On 09/01/2024 13:31, Jan Marek wrote:

Hi Igor,

I've sent you logs via filesender.cesnet.cz, if someone would
be interested, they are here:

https://filesender.cesnet.cz/?s=download=047b1ec4-4df0-4e8a-90fc-31706eb168a4

Some points:

1) I've found, that on the osd1 server was bad time (3 minutes in
future). I've corrected that. Yes, I know, that it's bad, but we
moved servers to any other net segment, where they have no access
to the timeservers in Internet, then I must reconfigure it to use
our own NTP servers.

2) I've tried to start osd.1 service by this sequence:

a)

ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

(without setting log properly :-( )

b)

export CEPH_ARGS="--log-file osd.1.log --debug-bluestore 5/20"
ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

- here I have one question: Why is it in this log stil this line:

_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)

Is it a mandatory part of fsck?

Log is attached.

c)

systemctl start ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service

still crashing, gzip-ed log attached too.

Many thanks for exploring problem.

Sincerely
Jan Marek

Dne Po, led 08, 2024 at 12:00:05 CET napsal(a) Igor Fedotov:

Hi Jan,

indeed fsck logs for the OSDs other than osd.0 look good so it would be
interesting to see OSD startup logs for them. Preferably to have that for
multiple (e.g. 3-4) OSDs to get the pattern.

Original upgrade log(s) would be nice to see as well.

You might want to use Google Drive or any other publicly available file
sharing site for that.


Thanks,

Igor

On 05/01/2024 10:25, Jan Marek wrote:

Hi Igor,

I've tried to start only osd.1, which seems to be fsck'd OK, but
it crashed :-(

I search logs and I've found, that I have logs from 22.12.2023,
when I did the upgrade (I have set logging to journald).

[ceph-users] Re: Stuck in upgrade process to reef

2024-01-09 Thread Igor Fedotov

Hi Marek,

I haven't looked through those upgrade logs yet but here are some 
comments regarding last OSD startup attempt.


First of answering your question


_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)



Is it a mandatory part of fsck?


This is caused by previous non-graceful OSD process shutdown. BlueStore is 
unable to find up-to-date allocation map and recovers it from RocksDB. And 
since fsck is a read-only procedure the recovered allocmap is not saved - hence 
all the following BlueStore startups (within fsck or OSD init) cause another 
rebuild attempt. To avoid that you might want to run repair instead of fsck - 
this will persist up-to-date allocation map and avoid its rebuilding on the 
next startup. This will work till the next non-graceful shutdown only - hence 
unsuccessful OSD attempt might break the allocmap state again.
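
Concretely, something along these lines before the next startup attempt 
(same path as in your fsck run):

ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command repair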

Secondly - looking at OSD startup log one can see that actual OSD log ends with 
that allocmap recovery as well:


2024-01-09T11:25:30.718449+01:00 osd1 ceph-osd[1734062]: 
bluestore(/var/lib/ceph/osd/ceph-1) _init_alloc::NCB::restore_allocator() 
failed! Run Full Recovery from ONodes (might take a while) ...


Subsequent log line indicating OSD daemon termination is from systemd:

2024-01-09T11:25:33.516258+01:00 osd1 systemd[1]: Stopping 
ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service - Ceph osd.1 for 
2c565e24-7850-47dc-a751-a6357cbbaf2a...


And honestly these lines provide almost no clue why the termination happened. No 
obvious OSD failures or anything similar are shown. Perhaps the containerized 
environment hides the details, e.g. by cutting off the tail of the OSD log.
So you might want to proceed with the investigation by running repair prior to 
starting the OSD as per above. This will result in no alloc map recovery and 
hopefully work around the problem during startup - if the issue is indeed caused by 
allocmap recovery.
Additionally you might want to increase debug_bluestore log level for osd.1 
before starting it up to get more insight on what's happening.

Alternatively you might want to play with OSD log target settings to write 
OSD.1 log to some file rather than using system wide logging infra - hopefully 
this will be more helpful.
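
One low-effort way to do that, assuming the mon config store reaches this 
daemon, might be:

ceph config set osd.1 log_to_file true
ceph config set osd.1 debug_bluestore 5/20

With cephadm the log file then typically ends up under 
/var/log/ceph/<fsid>/ on the host.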

Thanks,
Igor

On 09/01/2024 13:31, Jan Marek wrote:

Hi Igor,

I've sent you logs via filesender.cesnet.cz, if someone would
be interested, they are here:

https://filesender.cesnet.cz/?s=download=047b1ec4-4df0-4e8a-90fc-31706eb168a4

Some points:

1) I've found, that on the osd1 server was bad time (3 minutes in
future). I've corrected that. Yes, I know, that it's bad, but we
moved servers to any other net segment, where they have no access
to the timeservers in Internet, then I must reconfigure it to use
our own NTP servers.

2) I've tried to start osd.1 service by this sequence:

a)

ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

(without setting log properly :-( )

b)

export CEPH_ARGS="--log-file osd.1.log --debug-bluestore 5/20"
ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

- here I have one question: Why is it in this log stil this line:

_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)

Is it a mandatory part of fsck?

Log is attached.

c)

systemctl start ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service

still crashing, gzip-ed log attached too.

Many thanks for exploring problem.

Sincerely
Jan Marek

Dne Po, led 08, 2024 at 12:00:05 CET napsal(a) Igor Fedotov:

Hi Jan,

indeed fsck logs for the OSDs other than osd.0 look good so it would be
interesting to see OSD startup logs for them. Preferably to have that for
multiple (e.g. 3-4) OSDs to get the pattern.

Original upgrade log(s) would be nice to see as well.

You might want to use Google Drive or any other publicly available file
sharing site for that.


Thanks,

Igor

On 05/01/2024 10:25, Jan Marek wrote:

Hi Igor,

I've tried to start only osd.1, which seems to be fsck'd OK, but
it crashed :-(

I search logs and I've found, that I have logs from 22.12.2023,
when I've did a upgrade (I have set logging to journald).

Would you be interested in those logs? This file have 30MB in
bzip2 format, how I can share it with you?

It contains crash log from start osd.1 too, but I can cut out
from it and send it to list...

Sincerely
Jan Marek

Dne Čt, led 04, 2024 at 02:43:48 CET napsal(a) Jan Marek:

Hi Igor,

I've ran this oneliner:

for i in {0..12}; do export CEPH_ARGS="--log-file osd."${i}".log --debug-bluestore 
5/20" ; ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} 
--command fsck ; done;

On osd.0 it crashed very quickly, on osd.1 it is still working.

I've send those logs in one e-mail.

But!

I've tried to list disk devices in monitor view, and I've got
very interesting screenshot - some part I've emphasized by red rectangles.

[ceph-users] Re: Stuck in upgrade process to reef

2024-01-08 Thread Igor Fedotov

Hi Jan,

indeed fsck logs for the OSDs other than osd.0 look good so it would be 
interesting to see OSD startup logs for them. Preferably to have that 
for multiple (e.g. 3-4) OSDs to get the pattern.


Original upgrade log(s) would be nice to see as well.

You might want to use Google Drive or any other publicly available file 
sharing site for that.



Thanks,

Igor

On 05/01/2024 10:25, Jan Marek wrote:

Hi Igor,

I've tried to start only osd.1, which seems to be fsck'd OK, but
it crashed :-(

I search logs and I've found, that I have logs from 22.12.2023,
when I've did a upgrade (I have set logging to journald).

Would you be interested in those logs? This file have 30MB in
bzip2 format, how I can share it with you?

It contains crash log from start osd.1 too, but I can cut out
from it and send it to list...

Sincerely
Jan Marek

Dne Čt, led 04, 2024 at 02:43:48 CET napsal(a) Jan Marek:

Hi Igor,

I've ran this oneliner:

for i in {0..12}; do export CEPH_ARGS="--log-file osd."${i}".log --debug-bluestore 
5/20" ; ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} 
--command fsck ; done;

On osd.0 it crashed very quickly, on osd.1 it is still working.

I've send those logs in one e-mail.

But!

I've tried to list disk devices in monitor view, and I've got
very interesting screenshot - some part I've emphasized by red
rectangulars.

I've got a json from syslog, which was as a part cephadm call,
where it seems to be correct (for my eyes).

Can be this coincidence for this problem?

Sincerely
Jan Marek

Dne Čt, led 04, 2024 at 12:32:47 CET napsal(a) Igor Fedotov:

Hi Jan,

may I see the fsck logs from all the failing OSDs to see the pattern. IIUC
the full node is suffering from the issue, right?


Thanks,

Igor

On 1/2/2024 10:53 AM, Jan Marek wrote:

Hello once again,

I've tried this:

export CEPH_ARGS="--log-file /tmp/osd.0.log --debug-bluestore 5/20"
ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.0 --command fsck

And I've sending /tmp/osd.0.log file attached.

Sincerely
Jan Marek

Dne Ne, pro 31, 2023 at 12:38:13 CET napsal(a) Igor Fedotov:

Hi Jan,

this doesn't look like RocksDB corruption but rather like some BlueStore
metadata inconsistency. Also assertion backtrace in the new log looks
completely different from the original one. So in an attempt to find any
systematic pattern I'd suggest to run fsck with verbose logging for every
failing OSD. Relevant command line:

CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20"
bin/ceph-bluestore-tool --path  --command fsck

Unlikely this will fix anything it's rather a way to collect logs to get
better insight.


Additionally you might want to run similar fsck for a couple of healthy OSDs
- curious if it succeeds as I have a feeling that the problem with crashing
OSDs had been hidden before the upgrade and revealed rather than caused by
it.


Thanks,

Igor

On 12/29/2023 3:28 PM, Jan Marek wrote:

Hello Igor,

I'm attaching a part of syslog creating while starting OSD.0.

Many thanks for help.

Sincerely
Jan Marek

Dne St, pro 27, 2023 at 04:42:56 CET napsal(a) Igor Fedotov:

Hi Jan,

IIUC the attached log is for ceph-kvstore-tool, right?

Can you please share full OSD startup log as well?


Thanks,

Igor

On 12/27/2023 4:30 PM, Jan Marek wrote:

Hello,

I've problem: my ceph cluster (3x mon nodes, 6x osd nodes, every
osd node have 12 rotational disk and one NVMe device for
bluestore DB). CEPH is installed by ceph orchestrator and have
bluefs storage on osd.

I've started process upgrade from version 17.2.6 to 18.2.1 by
invocating:

ceph orch upgrade start --ceph-version 18.2.1

After upgrade of mon and mgr processes orchestrator tried to
upgrade the first OSD node, but they are falling down.

I've stop the process of upgrade, but I have 1 osd node
completely down.

After upgrade I've got some error messages and I've found
/var/lib/ceph/crash directories, I attach to this message
files, which I've found here.

Please, can you advice, what now I can do? It seems, that rocksdb
is even non-compatible or corrupted :-(

Thanks in advance.

Sincerely
Jan Marek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

[ceph-users] Re: Stuck in upgrade process to reef

2024-01-04 Thread Igor Fedotov

Hi Jan,

may I see the fsck logs from all the failing OSDs to see the pattern. 
IIUC the full node is suffering from the issue, right?



Thanks,

Igor

On 1/2/2024 10:53 AM, Jan Marek wrote:

Hello once again,

I've tried this:

export CEPH_ARGS="--log-file /tmp/osd.0.log --debug-bluestore 5/20"
ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.0 --command fsck

And I've sending /tmp/osd.0.log file attached.

Sincerely
Jan Marek

Dne Ne, pro 31, 2023 at 12:38:13 CET napsal(a) Igor Fedotov:

Hi Jan,

this doesn't look like RocksDB corruption but rather like some BlueStore
metadata inconsistency. Also assertion backtrace in the new log looks
completely different from the original one. So in an attempt to find any
systematic pattern I'd suggest to run fsck with verbose logging for every
failing OSD. Relevant command line:

CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20"
bin/ceph-bluestore-tool --path  --command fsck

Unlikely this will fix anything it's rather a way to collect logs to get
better insight.


Additionally you might want to run similar fsck for a couple of healthy OSDs
- curious if it succeeds as I have a feeling that the problem with crashing
OSDs had been hidden before the upgrade and revealed rather than caused by
it.


Thanks,

Igor

On 12/29/2023 3:28 PM, Jan Marek wrote:

Hello Igor,

I'm attaching a part of syslog creating while starting OSD.0.

Many thanks for help.

Sincerely
Jan Marek

Dne St, pro 27, 2023 at 04:42:56 CET napsal(a) Igor Fedotov:

Hi Jan,

IIUC the attached log is for ceph-kvstore-tool, right?

Can you please share full OSD startup log as well?


Thanks,

Igor

On 12/27/2023 4:30 PM, Jan Marek wrote:

Hello,

I've problem: my ceph cluster (3x mon nodes, 6x osd nodes, every
osd node have 12 rotational disk and one NVMe device for
bluestore DB). CEPH is installed by ceph orchestrator and have
bluefs storage on osd.

I've started process upgrade from version 17.2.6 to 18.2.1 by
invocating:

ceph orch upgrade start --ceph-version 18.2.1

After upgrade of mon and mgr processes orchestrator tried to
upgrade the first OSD node, but they are falling down.

I've stop the process of upgrade, but I have 1 osd node
completely down.

After upgrade I've got some error messages and I've found
/var/lib/ceph/crash directories, I attach to this message
files, which I've found here.

Please, can you advice, what now I can do? It seems, that rocksdb
is even non-compatible or corrupted :-(

Thanks in advance.

Sincerely
Jan Marek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2023-12-30 Thread Igor Fedotov

Hi Jan,

this doesn't look like RocksDB corruption but rather like some BlueStore 
metadata inconsistency. Also assertion backtrace in the new log looks 
completely different from the original one. So in an attempt to find any 
systematic pattern I'd suggest to run fsck with verbose logging for 
every failing OSD. Relevant command line:


CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20" 
bin/ceph-bluestore-tool --path  --command fsck


Unlikely this will fix anything it's rather a way to collect logs to get 
better insight.



Additionally you might want to run similar fsck for a couple of healthy 
OSDs - curious if it succeeds as I have a feeling that the problem with 
crashing OSDs had been hidden before the upgrade and revealed rather 
than caused by it.



Thanks,

Igor

On 12/29/2023 3:28 PM, Jan Marek wrote:

Hello Igor,

I'm attaching a part of syslog creating while starting OSD.0.

Many thanks for help.

Sincerely
Jan Marek

Dne St, pro 27, 2023 at 04:42:56 CET napsal(a) Igor Fedotov:

Hi Jan,

IIUC the attached log is for ceph-kvstore-tool, right?

Can you please share full OSD startup log as well?


Thanks,

Igor

On 12/27/2023 4:30 PM, Jan Marek wrote:

Hello,

I've problem: my ceph cluster (3x mon nodes, 6x osd nodes, every
osd node have 12 rotational disk and one NVMe device for
bluestore DB). CEPH is installed by ceph orchestrator and have
bluefs storage on osd.

I've started process upgrade from version 17.2.6 to 18.2.1 by
invocating:

ceph orch upgrade start --ceph-version 18.2.1

After upgrade of mon and mgr processes orchestrator tried to
upgrade the first OSD node, but they are falling down.

I've stop the process of upgrade, but I have 1 osd node
completely down.

After upgrade I've got some error messages and I've found
/var/lib/ceph/crash directories, I attach to this message
files, which I've found here.

Please, can you advice, what now I can do? It seems, that rocksdb
is even non-compatible or corrupted :-(

Thanks in advance.

Sincerely
Jan Marek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2023-12-27 Thread Igor Fedotov

Hi Jan,

IIUC the attached log is for ceph-kvstore-tool, right?

Can you please share full OSD startup log as well?


Thanks,

Igor

On 12/27/2023 4:30 PM, Jan Marek wrote:

Hello,

I've problem: my ceph cluster (3x mon nodes, 6x osd nodes, every
osd node have 12 rotational disk and one NVMe device for
bluestore DB). CEPH is installed by ceph orchestrator and have
bluefs storage on osd.

I've started process upgrade from version 17.2.6 to 18.2.1 by
invocating:

ceph orch upgrade start --ceph-version 18.2.1

After upgrade of mon and mgr processes orchestrator tried to
upgrade the first OSD node, but they are falling down.

I've stop the process of upgrade, but I have 1 osd node
completely down.

After upgrade I've got some error messages and I've found
/var/lib/ceph/crash directories, I attach to this message
files, which I've found here.

Please, can you advice, what now I can do? It seems, that rocksdb
is even non-compatible or corrupted :-(

Thanks in advance.

Sincerely
Jan Marek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD has Rocksdb corruption that crashes ceph-bluestore-tool repair

2023-12-18 Thread Igor Fedotov

Hi Malcolm,

you might want to try ceph-objectstore-tool's export command to save the 
PG into a file and then import it to another OSD.
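
Roughly (PG id, paths and OSD ids are placeholders; the OSDs must be 
stopped while the tool runs):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<bad-id> \
    --pgid <pgid> --op export --file /tmp/<pgid>.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<good-id> \
    --op import --file /tmp/<pgid>.export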



Thanks,

Igor

On 18/12/2023 02:59, Malcolm Haak wrote:

Hello all,

I had an OSD go offline due to UWE. When restarting the OSD service,
to try and at least get it to  drain cleanly of that data that wasn't
damaged, the ceph-osd process would crash.

I then attempted to repair it using ceph-bluestore-tool. I can run
fsck and it will complete without issue, however when attempting to
run repair it crashes in the exact same way that ceph-osd crashes.

I'll attach the tail end of the output here:

  2023-12-17T20:24:53.320+1000 7fdb7bf17740 -1 rocksdb: submit_common
error: Corruption: block checksum mismatch: stored = 1106056583,
computed = 657190205, type = 1  in db/020524.sst offset 21626321 size
4014 code =  Rocksdb transaction:
PutCF( prefix = S key = 'per_pool_omap' value size = 1)
   -442> 2023-12-17T20:24:53.386+1000 7fdb7bf17740 -1
/usr/src/debug/ceph/ceph-18.2.0/src/os/bluestore/BlueStore.cc: In
function 'unsigned int BlueStoreRepairer::apply(KeyValueDB*)' thread
7fdb7bf17740 time 2023-12-17T20:24:53.341999+1000
/usr/src/debug/ceph/ceph-18.2.0/src/os/bluestore/BlueStore.cc: 17982:
FAILED ceph_assert(ok)

  ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x136) [0x7fdb7b6502c9]
  2: /usr/lib/ceph/libceph-common.so.2(+0x2504a4) [0x7fdb7b6504a4]
  3: (BlueStoreRepairer::apply(KeyValueDB*)+0x5af) [0x559afb98cc7f]
  4: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x45fc)
[0x559afba2436c]
  5: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x204)
[0x559afba31014]
  6: main()
  7: /usr/lib/libc.so.6(+0x27cd0) [0x7fdb7ae45cd0]
  8: __libc_start_main()
  9: _start()

   -441> 2023-12-17T20:24:53.390+1000 7fdb7bf17740 -1 *** Caught signal
(Aborted) **
  in thread 7fdb7bf17740 thread_name:ceph-bluestore-

  ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef
(stable)
  1: /usr/lib/libc.so.6(+0x3e710) [0x7fdb7ae5c710]
  2: /usr/lib/libc.so.6(+0x8e83c) [0x7fdb7aeac83c]
  3: raise()
  4: abort()
  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x191) [0x7fdb7b650324]
  6: /usr/lib/ceph/libceph-common.so.2(+0x2504a4) [0x7fdb7b6504a4]
  7: (BlueStoreRepairer::apply(KeyValueDB*)+0x5af) [0x559afb98cc7f]
  8: (BlueStore::_fsck_on_open(BlueStore::FSCKDepth, bool)+0x45fc)
[0x559afba2436c]
  9: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x204)
[0x559afba31014]
  10: main()
  11: /usr/lib/libc.so.6(+0x27cd0) [0x7fdb7ae45cd0]
  12: __libc_start_main()
  13: _start()
  NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

The reason I need to get this OSD functioning is I had two other OSD's
fail causing a single PG to be in down state. The weird thing is, I
got one of those back up without issue (ceph-osd crashed due to root
filling and alert not sending) but the PG is still down. So I need to
get this other one back up (or the data extracted) to get that PG back
from down.

Thanks in advance
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.1 QE Validation status

2023-12-01 Thread Igor Fedotov

Hi Yuri,

Looks like that's not as critical and complicated as originally thought. A 
user has to change bluefs_shared_alloc_size to be exposed to the issue. So 
hopefully I'll submit a patch on Monday to close this gap and we'll be able 
to proceed.



Thanks,

Igor

On 01/12/2023 18:16, Yuri Weinstein wrote:

Venky, pls review the test results for smoke and fs after the PRs were merged.

Radek, Igor, Adam - any updates on https://tracker.ceph.com/issues/63618?

Thx

On Thu, Nov 30, 2023 at 8:08 AM Yuri Weinstein  wrote:

The fs PRs:
https://github.com/ceph/ceph/pull/54407
https://github.com/ceph/ceph/pull/54677
were approved/tested and ready for merge.

What is the status/plan for https://tracker.ceph.com/issues/63618?

On Wed, Nov 29, 2023 at 10:51 AM Igor Fedotov  wrote:

https://tracker.ceph.com/issues/63618 to be considered as a blocker for
the next Reef release.

On 07/11/2023 00:30, Yuri Weinstein wrote:

Details of this release are summarized here:

https://tracker.ceph.com/issues/63443#note-1

Seeking approvals/reviews for:

smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures)
rados - Neha, Radek, Travis, Ernesto, Adam King
rgw - Casey
fs - Venky
orch - Adam King
rbd - Ilya
krbd - Ilya
upgrade/quincy-x (reef) - Laura PTL
powercycle - Brad
perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)

Please reply to this email with approval and/or trackers of known
issues/PRs to address them.

TIA
YuriW
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-29 Thread Igor Fedotov
https://tracker.ceph.com/issues/63618 to be considered as a blocker for 
the next Reef release.


On 07/11/2023 00:30, Yuri Weinstein wrote:

Details of this release are summarized here:

https://tracker.ceph.com/issues/63443#note-1

Seeking approvals/reviews for:

smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures)
rados - Neha, Radek, Travis, Ernesto, Adam King
rgw - Casey
fs - Venky
orch - Adam King
rbd - Ilya
krbd - Ilya
upgrade/quincy-x (reef) - Laura PTL
powercycle - Brad
perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)

Please reply to this email with approval and/or trackers of known
issues/PRs to address them.

TIA
YuriW
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CLT Meeting minutes 2023-11-23

2023-11-23 Thread Igor Fedotov

[1] was closed by mistake. Reopened.


On 11/23/2023 7:18 PM, Konstantin Shalygin wrote:

Hi,


On Nov 23, 2023, at 16:10, Nizamudeen A  wrote:

RCs for reef, quincy and pacific
  for next week when there is more time to discuss


Just a little noise: is pacific ready? 16.2.15 should be the last release 
(at least that was the plan), but [1] is still not merged. Why the ticket 
is now closed - I don't know.


Also many users report OOM issues in the 16.2.14 release: that patch should 
also be merged to main first [2]




Thanks,
k

[1] https://tracker.ceph.com/issues/62815
[2] https://tracker.ceph.com/issues/59580


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: migrate wal/db to block device

2023-11-15 Thread Igor Fedotov

Hi Chris,

haven't checked your actions thoroughly, but migration is to be done on a 
down OSD, which is apparently not the case here.


Maybe that's the culprit and we/you somehow missed the relevant error 
during the migration process?
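
A quick sanity check before re-running the migrate might be (service name 
pattern as used by cephadm):

cephadm unit --fsid ${fsid} --name osd.${osdid} stop
systemctl is-active ceph-${fsid}@osd.${osdid}.service
ceph orch ps | grep "osd.${osdid} "

i.e. confirm the unit is inactive and the daemon shows as stopped before 
touching the LVs.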



Thanks,

Igor

On 11/15/2023 5:33 AM, Chris Dunlop wrote:

Hi,

What's the correct way to migrate an OSD wal/db from a fast device to 
the (slow) block device?


I have an osd with wal/db on a fast LV device and block on a slow LV 
device. I want to move the wal/db onto the block device so I can 
reconfigure the fast device before moving the wal/db back to the fast 
device.


This link says to use "ceph-volume lvm migrate" (I'm on pacific, but 
the quincy and reef docs are the same):


https://docs.ceph.com/en/pacific/ceph-volume/lvm/migrate/

I tried:

$ cephadm  unit --fsid ${fsid} --name osd.${osdid} stop
$ cephadm shell --fsid ${fsid} --name osd.${osdid} -- \
  ceph-volume lvm migrate --osd-id ${osdid} --osd-fsid ${osd_fsid} \
  --from db wal --target ${block_vglv}
$ systemctl stop ${osd_service}
$ systemctl start ${osd_service}

"cephadm ceph-volume lvm list" now shows only the (slow) block device 
whereas before the migrate it was showing both the block and db 
devices.  However "lsof" shows the new osd process still has the 
original fast wal/db device open and "iostat" shows this device is 
still getting i/o.


Also:

$ ls -l /var/lib/ceph/${fsid}/osd.${osdid}/block*

...shows both the "block" and "block.db" symlinks to the original 
separate devices.


And there are now no lv_tags on the original wal/db LV:

$ lvs -o lv_tags ${original_db_vg_lv}

Now I'm concerned there's device mismatch for this osd: "cephadm 
ceph-volume lvm list" believes there's no separate wal/db, but the osd 
is currently *using* the original separate wal/db.


I guess if the server were to restart this osd would be in all sorts 
of trouble.


What's going on there, and what can be done to fix it?  Is it a matter 
of recreating the tags on the original db device?  (But then what 
happens to whatever did get migrated to the block device - e.g. is 
that space lost?)
Or is it a matter of using ceph-bluestore-tool to do a 
bluefs-bdev-migrate, e.g. something like:


$ cephadm  unit --fsid ${fsid} --name osd.${osdid} stop
$ osddir=/var/lib/ceph/osd/ceph-${osdid}
$ cephadm shell --fsid ${fsid} --name osd.${osdid} -- \
  ceph-bluestore-tool --path ${osddir} --devs-source ${osddir}/block.db \
  --dev-target ${osddir}/block bluefs-bdev-migrate
$ rm /var/lib/ceph/${fsid}/osd.${osdid}/block.db
$ systemctl stop ${osd_service}
$ systemctl start ${osd_service}

Or... something else?


And how *should* moving the wal/db be done?

Cheers,

Chris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: migrate wal/db to block device

2023-11-15 Thread Igor Fedotov

Hi Eugen,

this scenario is supported, see the last example on the relevant doc page:

Moves BlueFS data from main, DB and WAL devices to main device, WAL and 
DB are removed:


ceph-volume lvm migrate --osd-id 1 --osd-fsid <osd-fsid> --from db wal --target vgname/data


Thanks,
Igor

On 11/15/2023 11:20 AM, Eugen Block wrote:

Hi,

AFAIU, you can’t migrate back to the slow device. It’s either 
migrating from the slow device to a fast device or remove between fast 
devices. I’m not aware that your scenario was considered in that tool. 
The docs don’t specifically say that, but they also don’t mention 
going back to slow device only. Someone please correct me, but I’d say 
you’ll have to rebuild that OSD to detach it from the fast device.


Regards,
Eugen

Zitat von Chris Dunlop :


Hi,

What's the correct way to migrate an OSD wal/db from a fast device to 
the (slow) block device?


I have an osd with wal/db on a fast LV device and block on a slow LV 
device. I want to move the wal/db onto the block device so I can 
reconfigure the fast device before moving the wal/db back to the fast 
device.


This link says to use "ceph-volume lvm migrate" (I'm on pacific, but 
the quincy and reef docs are the same):


https://docs.ceph.com/en/pacific/ceph-volume/lvm/migrate/

I tried:

$ cephadm  unit --fsid ${fsid} --name osd.${osdid} stop
$ cephadm shell --fsid ${fsid} --name osd.${osdid} -- \
  ceph-volume lvm migrate --osd-id ${osdid} --osd-fsid ${osd_fsid} \
  --from db wal --target ${block_vglv}
$ systemctl stop ${osd_service}
$ systemctl start ${osd_service}

"cephadm ceph-volume lvm list" now shows only the (slow) block device 
whereas before the migrate it was showing both the block and db 
devices.  However "lsof" shows the new osd process still has the 
original fast wal/db device open and "iostat" shows this device is 
still getting i/o.


Also:

$ ls -l /var/lib/ceph/${fsid}/osd.${osdid}/block*

...shows both the "block" and "block.db" symlinks to the original 
separate devices.


And there are now no lv_tags on the original wal/db LV:

$ lvs -o lv_tags ${original_db_vg_lv}

Now I'm concerned there's device mismatch for this osd: "cephadm 
ceph-volume lvm list" believes there's no separate wal/db, but the 
osd is currently *using* the original separate wal/db.


I guess if the server were to restart this osd would be in all sorts 
of trouble.


What's going on there, and what can be done to fix it?  Is it a 
matter of recreating the tags on the original db device?  (But then 
what happens to whatever did get migrated to the block device - e.g. 
is that space lost?)
Or is it a matter of using ceph-bluestore-tool to do a 
bluefs-bdev-migrate, e.g. something like:


$ cephadm  unit --fsid ${fsid} --name osd.${osdid} stop
$ osddir=/var/lib/ceph/osd/ceph-${osdid}
$ cephadm shell --fsid ${fsid} --name osd.${osdid} -- \
  ceph-bluestore-tool --path ${osddir} --devs-source 
${osddir}/block.db \

  --dev-target ${osddir}/block bluefs-bdev-migrate
$ rm /var/lib/ceph/${fsid}/osd.${osdid}/block.db
$ systemctl stop ${osd_service}
$ systemctl start ${osd_service}

Or... something else?


And how *should* moving the wal/db be done?

Cheers,

Chris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Allocation - used space is unreasonably higher than stored space

2023-11-13 Thread Igor Fedotov

Hi Motahare,

On 13/11/2023 14:44, Motahare S wrote:

Hello everyone,

Recently we have noticed that the results of "ceph df" stored and used
space does not match; as the amount of stored data *1.5 (ec factor) is
still like 5TB away from used amount:

POOL                       ID    PGS    STORED    OBJECTS   USED      %USED   MAX AVAIL
default.rgw.buckets.data   12    1024   144 TiB   70.60M    221 TiB   18.68   643 TiB

blob and alloc configs are as below:
bluestore_min_alloc_size_hdd : 65536
bluestore_min_alloc_size_ssd  : 4096
bluestore_max_blob_size_hdd : 524288

bluestore_max_blob_size_ssd : 65536

bluefs_shared_alloc_size : 65536

 From sources across web about how ceph actually writes on the disk, I
presumed that It will zero-pad the extents of an object to match the
4KB bdev_block_size, and then writes it in a blob which matches the
min_alloc_size, however it can re-use parts of the blob's unwritten (but
allocated because of min_alloc_size) space for another extent later.
The problem though, was that we tested different configs in a minimal ceph
octopus cluster with a 2G osd and bluestore_min_alloc_size_hdd = 65536.
When we uploaded a 1KB file with aws s3 client, the amount of used/stored
space was 64KB/1KB. We then uploaded another 1KB, and it went 128K/2K; kept
doing it until 100% of the pool was used, but only 32MB stored. I expected
ceph to start writing new 1KB files in the wasted 63KB(60KB)s of
min_alloc_size blocks, but the cluster was totally acting as a full cluster
and could no longer receive any new object. Is this behaviour expected for
s3? Does ceph really use 64x space if your dataset is made of 1KB files?
and all your object sizes should be a multiple of 64KB? Note that 5TB /
(70.6M*1.5) ~ 50 so for every rados object about 50KB is wasted on average.
we didn't observe this problem in RBD pools, probably because it cuts all
objects in 4MB.


The above analysis is correct, indeed BlueStore will waste up to 64K for 
every object unaligned to 64K (i.e. both 1K and 65K objects will waste 
63K).


Hence n*1K objects take n*64K bytes.

And since S3 objects are unaligned it tends to waste 32K bytes on average 
per object (assuming their sizes are distributed equally).


The only correction to the above math would be due to the actual k+m EC 
layout. E.g. for 2+1 EC the object count multiplier would be 3, not 1.5. 
Hence the overhead per rados object is rather less than 50K in your case.
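
As a rough illustration, assuming a 2+1 EC profile (i.e. three shards per 
rados object): expected raw usage is 144 TiB * 1.5 = 216 TiB, leaving about 
221 - 216 = 5 TiB of allocation overhead spread over roughly 70.6M * 3 
shards, i.e. ~25 KB per shard - in line with the ~32 KB average one would 
expect for shards unaligned to a 64 KB min_alloc_size.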



I know that min_alloc_hdd is changed to 4KB in pacific, but I'm still
curious how allocation really works and why it doesn't behave as expected?
Also, re-deploying OSDs is a headache.




Sincerely
Motahare
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-20 Thread Igor Fedotov

Zakhar,

my general concern about downgrading to previous versions is that this 
procedure is generally neither assumed nor tested by the dev team, although 
it is possible most of the time. But in this specific case it is not doable 
due to (at least) https://github.com/ceph/ceph/pull/52212 which enables 
4K bluefs allocation unit support - once some daemon gets it, there is 
no way back.


I'm still thinking that setting "fit_to_fast" mode without enabling 
dynamic compaction levels is quite safe but definitely it's better to be 
tested in the real environment and under actual payload first. Also you 
might want to apply such a workaround gradually - one daemon first, bake 
it for a while, then apply for the full node, bake a bit more and 
finally go forward and update the remaining. Or even better - bake it in 
a test cluster first.
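
For a single OSD that could be as simple as (OSD id is a placeholder):

ceph config set osd.<id> bluestore_volume_selection_policy fit_to_fast

followed by a restart of that one daemon and watching it for a while before 
touching the rest.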


Alternatively you might consider building the updated code yourself and 
making patched binaries on top of .14...



Thanks,

Igor


On 20/10/2023 15:10, Zakhar Kirpichenko wrote:

Thank you, Igor.

It is somewhat disappointing that fixing this bug in Pacific has such 
a low priority, considering its impact on existing clusters.


The document attached to the PR explicitly says about 
`level_compaction_dynamic_level_bytes` that "enabling it on an 
existing DB requires special caution", we'd rather not experiment with 
something that has the potential to cause data corruption or loss in a 
production cluster. Perhaps a downgrade to the previous version, 
16.2.13 which worked for us without any issues, is an option, or would 
you advise against such a downgrade from 16.2.14?


/Z

On Fri, 20 Oct 2023 at 14:46, Igor Fedotov  wrote:

Hi Zakhar,

Definitely we expect one more (and apparently the last) Pacific
minor release. There is no specific date yet though - the plans
are to release Quincy and Reef minor releases prior to it.
Hopefully to be done before the Christmas/New Year.

Meanwhile you might want to workaround the issue by tuning
bluestore_volume_selection_policy. Unfortunately most likely my
original proposal to set it to rocksdb_original wouldn't work in
this case so you better try "fit_to_fast" mode. This should be
coupled with enabling 'level_compaction_dynamic_level_bytes' mode
in RocksDB - there is pretty good spec on applying this mode to
BlueStore attached to https://github.com/ceph/ceph/pull/37156.


Thanks,

Igor

On 20/10/2023 06:03, Zakhar Kirpichenko wrote:

Igor, I noticed that there's no roadmap for the next 16.2.x
release. May I ask what time frame we are looking at with regards
to a possible fix?

We're experiencing several OSD crashes caused by this issue per day.

/Z

On Mon, 16 Oct 2023 at 14:19, Igor Fedotov
 wrote:

That's true.

On 16/10/2023 14:13, Zakhar Kirpichenko wrote:

Many thanks, Igor. I found previously submitted bug reports
and subscribed to them. My understanding is that the issue
is going to be fixed in the next Pacific minor release.

/Z

On Mon, 16 Oct 2023 at 14:03, Igor Fedotov
 wrote:

Hi Zakhar,

please see my reply for the post on the similar issue at:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/YNJ35HXN4HXF4XWB6IOZ2RKXX7EQCEIY/


Thanks,

Igor

On 16/10/2023 09:26, Zakhar Kirpichenko wrote:
> Hi,
>
> After upgrading to Ceph 16.2.14 we had several OSD crashes
> in bstore_kv_sync thread:
>
>
>     1. "assert_thread_name": "bstore_kv_sync",
>     2. "backtrace": [
>     3. "/lib64/libpthread.so.0(+0x12cf0)
[0x7ff2f6750cf0]",
>     4. "gsignal()",
>     5. "abort()",
>     6. "(ceph::__ceph_assert_fail(char const*, char
const*, int, char
>     const*)+0x1a9) [0x564dc5f87d0b]",
>     7. "/usr/bin/ceph-osd(+0x584ed4) [0x564dc5f87ed4]",
>     8. "(RocksDBBlueFSVolumeSelector::sub_usage(void*,
bluefs_fnode_t
>     const&)+0x15e) [0x564dc6604a9e]",
>     9. "(BlueFS::_flush_range_F(BlueFS::FileWriter*,
unsigned long, unsigned
>     long)+0x77d) [0x564dc66951cd]",
>     10. "(BlueFS::_flush_F(BlueFS::FileWriter*, bool,
bool*)+0x90)
>     [0x564dc6695670]",
>     11. "(BlueFS::fsync(BlueFS::FileWriter*)+0x18b)
[0x564dc66b1a6b]",
>     12. "(BlueRocksWritableFile::Sync()+0x18)
[0x564dc66c1768]",
>     13.

[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-20 Thread Igor Fedotov

Hi Zakhar,

Definitely we expect one more (and apparently the last) Pacific minor 
release. There is no specific date yet though - the plans are to release 
Quincy and Reef minor releases prior to it. Hopefully to be done before 
the Christmas/New Year.


Meanwhile you might want to workaround the issue by tuning 
bluestore_volume_selection_policy. Unfortunately most likely my original 
proposal to set it to rocksdb_original wouldn't work in this case so you 
better try "fit_to_fast" mode. This should be coupled with enabling 
'level_compaction_dynamic_level_bytes' mode in RocksDB - there is pretty 
good spec on applying this mode to BlueStore attached to 
https://github.com/ceph/ceph/pull/37156.



Thanks,

Igor

On 20/10/2023 06:03, Zakhar Kirpichenko wrote:
Igor, I noticed that there's no roadmap for the next 16.2.x release. 
May I ask what time frame we are looking at with regards to a possible 
fix?


We're experiencing several OSD crashes caused by this issue per day.

/Z

On Mon, 16 Oct 2023 at 14:19, Igor Fedotov  wrote:

That's true.

On 16/10/2023 14:13, Zakhar Kirpichenko wrote:

Many thanks, Igor. I found previously submitted bug reports and
subscribed to them. My understanding is that the issue is going
to be fixed in the next Pacific minor release.

/Z

On Mon, 16 Oct 2023 at 14:03, Igor Fedotov
 wrote:

Hi Zakhar,

please see my reply for the post on the similar issue at:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/YNJ35HXN4HXF4XWB6IOZ2RKXX7EQCEIY/


Thanks,

Igor

On 16/10/2023 09:26, Zakhar Kirpichenko wrote:
> Hi,
>
> After upgrading to Ceph 16.2.14 we had several OSD crashes
> in bstore_kv_sync thread:
>
>
>     1. "assert_thread_name": "bstore_kv_sync",
>     2. "backtrace": [
>     3. "/lib64/libpthread.so.0(+0x12cf0) [0x7ff2f6750cf0]",
>     4. "gsignal()",
>     5. "abort()",
>     6. "(ceph::__ceph_assert_fail(char const*, char const*,
int, char
>     const*)+0x1a9) [0x564dc5f87d0b]",
>     7. "/usr/bin/ceph-osd(+0x584ed4) [0x564dc5f87ed4]",
>     8. "(RocksDBBlueFSVolumeSelector::sub_usage(void*,
bluefs_fnode_t
>     const&)+0x15e) [0x564dc6604a9e]",
>     9. "(BlueFS::_flush_range_F(BlueFS::FileWriter*,
unsigned long, unsigned
>     long)+0x77d) [0x564dc66951cd]",
>     10. "(BlueFS::_flush_F(BlueFS::FileWriter*, bool,
bool*)+0x90)
>     [0x564dc6695670]",
>     11. "(BlueFS::fsync(BlueFS::FileWriter*)+0x18b)
[0x564dc66b1a6b]",
>     12. "(BlueRocksWritableFile::Sync()+0x18)
[0x564dc66c1768]",
>     13.
"(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions
>     const&, rocksdb::IODebugContext*)+0x1f) [0x564dc6b6496f]",
>     14.
"(rocksdb::WritableFileWriter::SyncInternal(bool)+0x402)
>     [0x564dc6c761c2]",
>     15. "(rocksdb::WritableFileWriter::Sync(bool)+0x88)
[0x564dc6c77808]",
>     16.
"(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup
>     const&, rocksdb::log::Writer*, unsigned long*, bool,
bool, unsigned
>     long)+0x309) [0x564dc6b780c9]",
>     17. "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions
const&,
>     rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned
long*, unsigned
>     long, bool, unsigned long*, unsigned long,
>     rocksdb::PreReleaseCallback*)+0x2629) [0x564dc6b80c69]",
>     18. "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>     rocksdb::WriteBatch*)+0x21) [0x564dc6b80e61]",
>     19. "(RocksDBStore::submit_common(rocksdb::WriteOptions&,
>  std::shared_ptr)+0x84)
[0x564dc6b1f644]",
>     20.

"(RocksDBStore::submit_transaction_sync(std::shared_ptr)+0x9a)
>     [0x564dc6b2004a]",
>     21. "(BlueStore::_kv_sync_thread()+0x30d8)
[0x564dc6602ec8]",
>     22. "(BlueStore::KVSyncThread::entry()+0x11)
[0x564dc662ab61]",
>     23. "/lib64/libpthread.so.0(+0x81ca) [0x7ff2f67461ca]",
>     24. "clone()"
>     25. ],
>
>
> I am attaching two instances of crash info for further
reference:
> https://pastebin.com/E6myaHNU
>
> OSD configuration is

[ceph-users] Re: Fixing BlueFS spillover (pacific 16.2.14)

2023-10-16 Thread Igor Fedotov

Hi Chris,

for the first question (osd.76) you might want to try ceph-volume's "lvm 
migrate --from data --target " command. Looks like some 
persistent DB remnants are still kept at main device causing the alert.


W.r.t osd.86's question - the line "SLOW    0 B 3.0 GiB 
59 GiB" means that RocksDB higher levels  data (usually L3+) are spread 
over DB and main (aka slow) devices as 3 GB and 59 GB respectively.


In other words SLOW row refers to DB data which is originally supposed 
to be at SLOW device (due to RocksDB data mapping mechanics). But 
improved bluefs logic (introduced by 
https://github.com/ceph/ceph/pull/29687) permitted extra DB disk usage 
for a part of this data.


Resizing the DB volume followed by a DB compaction should do the trick and 
move all the data to the DB device. Alternatively, ceph-volume's lvm migrate 
command should do the same, but the result will be rather temporary 
without resizing the DB volume.
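
To illustrate, a rough sketch of both options (VG/LV names and sizes are 
placeholders, the OSD should be stopped for the ceph-bluestore-tool and 
ceph-volume steps, and on a cephadm deployment these would typically be 
run inside "cephadm shell --name osd.76"):

# option 1: grow the DB LV, let BlueFS see the new size, then compact
lvextend -L +60G ceph-db-vg/osd-76-db
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-76
# start the OSD again and trigger a compaction so the data moves back to the DB device
ceph tell osd.76 compact

# option 2: just move the spilled-over DB data off the slow device (temporary without a resize)
ceph-volume lvm migrate --osd-id 76 --osd-fsid <osd fsid> --from data --target ceph-db-vg/osd-76-db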


Hope this helps.


Thanks,

Igor

On 06/10/2023 06:55, Chris Dunlop wrote:

Hi,

tl;dr why are my osds still spilling?

I've recently upgraded to 16.2.14 from 16.2.9 and started receiving 
bluefs spillover warnings (due to the "fix spillover alert" per the 
16.2.14 release notes). E.g. from 'ceph health detail', the warning on 
one of these (there are a few):


osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of 
60 GiB) to slow device


This is a 15T HDD with only a 60G SSD for the db so it's not 
surprising it spilled as it's way below the recommendation for rbd 
usage at db size 1-2% of the storage size.


There was some spare space on the db ssd so I increased the size of 
the db LV up over 400G and did an bluefs-bdev-expand.


However, days later, I'm still getting the spillover warning for that 
osd, including after running a manual compact:


# ceph tell osd.76 compact

See attached perf-dump-76 for the perf dump output:

# cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq 
-r '.bluefs'


In particular, if my understanding is correct, that's telling me the 
db available size is 487G (i.e. the LV expand worked), of which it's 
using 59G, and there's 128K spilled to the slow device:


"db_total_bytes": 512309059584,  # 487G
"db_used_bytes": 63470305280,    # 59G
"slow_used_bytes": 131072,   # 128K

A "bluefs stats" also says the db is using 128K of slow storage 
(although perhaps it's getting the info from the same place as the 
perf dump?):


# ceph tell osd.76 bluefs stats
1 : device size 0x7747ffe000 : using 0xea620(59 GiB)
2 : device size 0xe8d7fc0 : using 0x6554d689000(6.3 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV  WAL      DB       SLOW     *        *        REAL     FILES
LOG      0 B      10 MiB   0 B      0 B      0 B      8.8 MiB  1
WAL      0 B      2.5 GiB  0 B      0 B      0 B      751 MiB  8
DB       0 B      56 GiB   128 KiB  0 B      0 B      50 GiB   842
SLOW     0 B      0 B      0 B      0 B      0 B      0 B      0
TOTAL    0 B      58 GiB   128 KiB  0 B      0 B      0 B      850
MAXIMUMS:
LOG      0 B      22 MiB   0 B      0 B      0 B      18 MiB
WAL      0 B      3.9 GiB  0 B      0 B      0 B      1.0 GiB
DB       0 B      71 GiB   282 MiB  0 B      0 B      62 GiB
SLOW     0 B      0 B      0 B      0 B      0 B      0 B
TOTAL    0 B      74 GiB   282 MiB  0 B      0 B      0 B
SIZE <<  0 B      453 GiB  14 TiB


I had a look at the "DUMPING STATS" output in the logs but I don't 
know how to interpret it. I did try calculating the total of the sizes 
on the "Sum" lines but that comes to 100G so I don't know what that 
all means. See attached log-stats-76.


I also tried "ceph-kvstore-tool bluestore-kv ... stats":

$ {
  cephadm unit --fsid $clusterid --name osd.76 stop
  cephadm shell --fsid $clusterid --name osd.76 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-76 stats
  cephadm unit --fsid $clusterid --name osd.76 start
}

Output attached as bluestore-kv-stats-76. I can't see anything 
interesting in there, although again I don't really know how to 
interpret it.


So... why is this osd db still spilling onto slow storage, and how do 
I fix things so it's no longer using the slow storage?



And a bonus issue...  on another osd that hasn't yet been resized 
(i.e.  again with a grossly undersized 60G db on SSD with a 15T HDD) 
I'm also getting a spillover warning. The "bluefs stats" seems to be 
saying the db is NOT currently spilling (i.e. "0 B" the DB/SLOW 
position in the matrix), but there's "something" currently using 59G 
on the slow device:


$ ceph tell osd.85 bluefs stats
1 : device size 0xee000 : using 0x3a390(15 GiB)
2 : device size 0xe8d7fc0 : using 

[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-16 Thread Igor Fedotov

That's true.

On 16/10/2023 14:13, Zakhar Kirpichenko wrote:
Many thanks, Igor. I found previously submitted bug reports and 
subscribed to them. My understanding is that the issue is going to be 
fixed in the next Pacific minor release.


/Z

On Mon, 16 Oct 2023 at 14:03, Igor Fedotov  wrote:

Hi Zakhar,

please see my reply for the post on the similar issue at:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/YNJ35HXN4HXF4XWB6IOZ2RKXX7EQCEIY/


Thanks,

Igor

On 16/10/2023 09:26, Zakhar Kirpichenko wrote:
> Hi,
>
> After upgrading to Ceph 16.2.14 we had several OSD crashes
> in bstore_kv_sync thread:
>
>
>     1. "assert_thread_name": "bstore_kv_sync",
>     2. "backtrace": [
>     3. "/lib64/libpthread.so.0(+0x12cf0) [0x7ff2f6750cf0]",
>     4. "gsignal()",
>     5. "abort()",
>     6. "(ceph::__ceph_assert_fail(char const*, char const*, int,
char
>     const*)+0x1a9) [0x564dc5f87d0b]",
>     7. "/usr/bin/ceph-osd(+0x584ed4) [0x564dc5f87ed4]",
>     8. "(RocksDBBlueFSVolumeSelector::sub_usage(void*,
bluefs_fnode_t
>     const&)+0x15e) [0x564dc6604a9e]",
>     9. "(BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned
long, unsigned
>     long)+0x77d) [0x564dc66951cd]",
>     10. "(BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0x90)
>     [0x564dc6695670]",
>     11. "(BlueFS::fsync(BlueFS::FileWriter*)+0x18b)
[0x564dc66b1a6b]",
>     12. "(BlueRocksWritableFile::Sync()+0x18) [0x564dc66c1768]",
>     13.
"(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions
>     const&, rocksdb::IODebugContext*)+0x1f) [0x564dc6b6496f]",
>     14. "(rocksdb::WritableFileWriter::SyncInternal(bool)+0x402)
>     [0x564dc6c761c2]",
>     15. "(rocksdb::WritableFileWriter::Sync(bool)+0x88)
[0x564dc6c77808]",
>     16.
"(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup
>     const&, rocksdb::log::Writer*, unsigned long*, bool, bool,
unsigned
>     long)+0x309) [0x564dc6b780c9]",
>     17. "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>     rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned
long*, unsigned
>     long, bool, unsigned long*, unsigned long,
>     rocksdb::PreReleaseCallback*)+0x2629) [0x564dc6b80c69]",
>     18. "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>     rocksdb::WriteBatch*)+0x21) [0x564dc6b80e61]",
>     19. "(RocksDBStore::submit_common(rocksdb::WriteOptions&,
>  std::shared_ptr)+0x84)
[0x564dc6b1f644]",
>     20.

"(RocksDBStore::submit_transaction_sync(std::shared_ptr)+0x9a)
>     [0x564dc6b2004a]",
>     21. "(BlueStore::_kv_sync_thread()+0x30d8) [0x564dc6602ec8]",
>     22. "(BlueStore::KVSyncThread::entry()+0x11) [0x564dc662ab61]",
>     23. "/lib64/libpthread.so.0(+0x81ca) [0x7ff2f67461ca]",
>     24. "clone()"
>     25. ],
>
>
> I am attaching two instances of crash info for further reference:
> https://pastebin.com/E6myaHNU
>
> OSD configuration is rather simple and close to default:
>
> osd.6  dev       bluestore_cache_size_hdd   4294967296
> osd.6  dev       bluestore_cache_size_ssd   4294967296
> osd    advanced  debug_rocksdb              1/5
> osd    advanced  osd_max_backfills          2
> osd    basic     osd_memory_target          17179869184
> osd    advanced  osd_recovery_max_active    2
> osd    advanced  osd_scrub_sleep            0.10
> osd    advanced  rbd_balance_parent_reads   false
>
> debug_rocksdb is a recent change, otherwise this configuration
has been
> running without issues for months. The crashes happened on two
different
> hosts with identical hardware, the hosts and storage (NVME
DB/WAL, HDD
> block) don't exhibit any issues. We have not experienced such
crashes with
> Ceph < 16.2.14.
>
> Is this a known issue, or should I open a bug report?
>
> Best regards,
> Zakhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-16 Thread Igor Fedotov

Hi Zakhar,

please see my reply for the post on the similar issue at: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/YNJ35HXN4HXF4XWB6IOZ2RKXX7EQCEIY/



Thanks,

Igor

On 16/10/2023 09:26, Zakhar Kirpichenko wrote:

Hi,

After upgrading to Ceph 16.2.14 we had several OSD crashes
in bstore_kv_sync thread:


1. "assert_thread_name": "bstore_kv_sync",
2. "backtrace": [
3. "/lib64/libpthread.so.0(+0x12cf0) [0x7ff2f6750cf0]",
4. "gsignal()",
5. "abort()",
6. "(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1a9) [0x564dc5f87d0b]",
7. "/usr/bin/ceph-osd(+0x584ed4) [0x564dc5f87ed4]",
8. "(RocksDBBlueFSVolumeSelector::sub_usage(void*, bluefs_fnode_t
const&)+0x15e) [0x564dc6604a9e]",
9. "(BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned
long)+0x77d) [0x564dc66951cd]",
10. "(BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0x90)
[0x564dc6695670]",
11. "(BlueFS::fsync(BlueFS::FileWriter*)+0x18b) [0x564dc66b1a6b]",
12. "(BlueRocksWritableFile::Sync()+0x18) [0x564dc66c1768]",
13. "(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions
const&, rocksdb::IODebugContext*)+0x1f) [0x564dc6b6496f]",
14. "(rocksdb::WritableFileWriter::SyncInternal(bool)+0x402)
[0x564dc6c761c2]",
15. "(rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x564dc6c77808]",
16. "(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup
const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned
long)+0x309) [0x564dc6b780c9]",
17. "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned
long, bool, unsigned long*, unsigned long,
rocksdb::PreReleaseCallback*)+0x2629) [0x564dc6b80c69]",
18. "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
rocksdb::WriteBatch*)+0x21) [0x564dc6b80e61]",
19. "(RocksDBStore::submit_common(rocksdb::WriteOptions&,
std::shared_ptr)+0x84) [0x564dc6b1f644]",
20. 
"(RocksDBStore::submit_transaction_sync(std::shared_ptr)+0x9a)
[0x564dc6b2004a]",
21. "(BlueStore::_kv_sync_thread()+0x30d8) [0x564dc6602ec8]",
22. "(BlueStore::KVSyncThread::entry()+0x11) [0x564dc662ab61]",
23. "/lib64/libpthread.so.0(+0x81ca) [0x7ff2f67461ca]",
24. "clone()"
25. ],


I am attaching two instances of crash info for further reference:
https://pastebin.com/E6myaHNU

OSD configuration is rather simple and close to default:

osd.6  dev       bluestore_cache_size_hdd   4294967296
osd.6  dev       bluestore_cache_size_ssd   4294967296
osd    advanced  debug_rocksdb              1/5
osd    advanced  osd_max_backfills          2
osd    basic     osd_memory_target          17179869184
osd    advanced  osd_recovery_max_active    2
osd    advanced  osd_scrub_sleep            0.10
osd    advanced  rbd_balance_parent_reads   false

debug_rocksdb is a recent change, otherwise this configuration has been
running without issues for months. The crashes happened on two different
hosts with identical hardware, the hosts and storage (NVME DB/WAL, HDD
block) don't exhibit any issues. We have not experienced such crashes with
Ceph < 16.2.14.

Is this a known issue, or should I open a bug report?

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?

2023-10-04 Thread Igor Fedotov

Hi Zakhar,

To reduce rocksdb logging verbosity you might want to set debug_rocksdb 
to 3 (or 0).


I presume it produces a  significant part of the logging traffic.
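
For example, assuming the setting is managed via the config database 
(the value sets the log level and the in-memory gather level):

ceph config set osd debug_rocksdb 3/3
# or, to silence it completely
ceph config set osd debug_rocksdb 0/0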


Thanks,

Igor

On 04/10/2023 20:51, Zakhar Kirpichenko wrote:

Any input from anyone, please?

On Tue, 19 Sept 2023 at 09:01, Zakhar Kirpichenko  wrote:


Hi,

Our Ceph 16.2.x cluster managed by cephadm is logging a lot of very
detailed messages; Ceph logs alone on hosts with monitors and several OSDs
have already eaten through 50% of the endurance of the flash system drives
over a couple of years.

Cluster logging settings are default, and it seems that all daemons are
writing lots and lots of debug information to the logs, such as for
example: https://pastebin.com/ebZq8KZk (it's just a snippet, but there's
lots and lots of various information).

Is there a way to reduce the amount of logging and, for example, limit the
logging to warnings or important messages so that it doesn't include every
successful authentication attempt, compaction etc, etc, when the cluster is
healthy and operating normally?

I would very much appreciate your advice on this.

Best regards,
Zakhar




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-28 Thread Igor Fedotov

Hi Sudhin,

It looks like manual DB compactions are (periodically?) issued via admin 
socket for your OSDs, which (my working hypothesis) triggers DB access 
stalls.


Here are the log lines indicating such calls

debug 2023-09-22T11:24:55.234+ 7fc4efa20700  1 osd.1 1192508 
triggering manual compaction


debug 2023-09-21T15:35:22.696+ 7faf22c8b700  1 osd.2 1180406 
finished manual compaction in 722.287 seconds


So I'm curious if you do have some external stuff performing manual OSD 
compactions? If so - would the primary issue go away when it's disabled?


You might want to disable it cluster-wide and let OSDs run after that 
for a while to make sure that's the case. Then try to reproduce it again 
by manually running compaction for a specific OSD via CLI. Would it fail 
again?



If the above hypothesis is confirmed I could see two potential root causes:

1. Hybrid allocator might cause severe BlueFS stalls which make the OSD 
unresponsive. See https://tracker.ceph.com/issues/62815


2. Default RocksDB settings were changed in Reef. See 
https://github.com/ceph/ceph/pull/51900



The easiest way to verify if you're facing 1. is to set 
bluestore_allocator to bitmap for all the OSDs (and restart them) via 
the "ceph config set" command. Then monitor OSD behavior during manual 
compactions.
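
For example (a sketch only - bluestore_allocator is read at OSD startup, 
so each OSD needs to be restarted by whatever means the deployment uses):

ceph config set osd bluestore_allocator bitmap
# restart the OSDs, then re-run a manual compaction on one of them and watch it
ceph tell osd.1 compact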


For validating 2.  one should revert bluestore_rocksdb_options back to 
the original value 
"compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0"


I'd recommend doing that for a single OSD first. Just in case - we don't 
have much knowledge on how OSDs survive such a reversion, hence it's 
better/safer to do that gradually.
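
A possible sketch for a single OSD (osd.1 is a placeholder; the option is 
read on OSD start, so restart that daemon afterwards and extend to the 
rest only if it behaves):

ceph config set osd.1 bluestore_rocksdb_options "compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0"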



Hope this helps - awaiting your feedback.

Thanks,

Igor

On 27/09/2023 22:04, sbeng...@gmail.com wrote:

Hi Igor,

I have copied three OSD logs to
https://drive.google.com/file/d/1aQxibFJR6Dzvr3RbuqnpPhaSMhPSL--F/view?usp=sharing

Hopefully they include some meaningful information.

Thank you.

Sudhin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-26 Thread Igor Fedotov

Hi Sudhin,

any publicly available cloud storage, e.g. Google drive should work.


Thanks,

Igor

On 26/09/2023 22:52, sbeng...@gmail.com wrote:

Hi Igor,
Please let where can I upload the OSD logs.
Thanks.
Sudhin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recently started OSD crashes (or messages thereof)

2023-09-21 Thread Igor Fedotov

Hi Luke,

highly likely this is caused by the issue covered by 
https://tracker.ceph.com/issues/53906


Unfortunately it looks like we missed a proper backport in Pacific.

You can apparently work around the issue by setting the 
'bluestore_volume_selection_policy' config parameter to rocksdb_original.


The potential implication of that "tuning" is less effective free 
space usage for the DB volume - RocksDB/BlueFS might initiate data spillover 
to the main (slow) device despite having free space available at the standalone 
DB volume, which in turn might cause some performance regression. 
A relevant alert will pop up if such a spillover takes place.


The above consequences are not highly likely to occur though. And they 
are rather minor most of the time so I would encourage you to try that 
if OSD crashes are that common.
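
A minimal sketch of that workaround (the option only takes effect on OSD 
start, so the daemons need to be restarted afterwards - ideally gradually):

ceph config set osd bluestore_volume_selection_policy rocksdb_original
# then restart the OSDs, e.g. one host at a time, and watch for spillover alerts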



Thanks,

Igor


On 21/09/2023 17:48, Luke Hall wrote:

Hi,

Since the recent update to 16.2.14-1~bpo11+1 on Debian Bullseye I've 
started seeing OSD crashes being registered almost daily across all 
six physical machines (6xOSD disks per machine). There's a --block-db 
for each osd on a LV from an NVMe.


If anyone has any idea what might be causing these I'd appreciate some 
insight. Happy to provide any other info which might be useful.


Thanks,

Luke



{
    "assert_condition": "cur2 >= p.length",
    "assert_file": "./src/os/bluestore/BlueStore.h",
    "assert_func": "virtual void 
RocksDBBlueFSVolumeSelector::sub_usage(void*, const bluefs_fnode_t&)",

    "assert_line": 3875,
    "assert_msg": "./src/os/bluestore/BlueStore.h: In function 
'virtual void RocksDBBlueFSVolumeSelector::sub_usage(void*, const 
bluefs_fnode_t&)' thread 7f7f54f25700 time 
2023-09-20T14:24:00.455721+0100\n./src/os/bluestore/BlueStore.h: 3875: 
FAILED ceph_assert(cur2 >= p.length)\n",

    "assert_thread_name": "bstore_kv_sync",
    "backtrace": [
    "/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) 
[0x7f7f68632140]",

    "gsignal()",
    "abort()",
    "(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x16e) [0x55b22a49b5fa]",

    "/usr/bin/ceph-osd(+0xac673b) [0x55b22a49b73b]",
    "(RocksDBBlueFSVolumeSelector::sub_usage(void*, bluefs_fnode_t 
const&)+0x11e) [0x55b22ab0077e]",
    "(BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, 
unsigned long)+0x5bd) [0x55b22ab9b8ed]",
    "(BlueFS::_flush_F(BlueFS::FileWriter*, bool, bool*)+0x9a) 
[0x55b22ab9bd7a]",

    "(BlueFS::fsync(BlueFS::FileWriter*)+0x79) [0x55b22aba97a9]",
    "(BlueRocksWritableFile::Sync()+0x15) [0x55b22abbf405]",
"(rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, 
rocksdb::IODebugContext*)+0x3f) [0x55b22b0914d1]",
    "(rocksdb::WritableFileWriter::SyncInternal(bool)+0x1f4) 
[0x55b22b26b7c6]",
    "(rocksdb::WritableFileWriter::Sync(bool)+0x18c) 
[0x55b22b26b1f8]",
"(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, 
rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned 
long)+0x366) [0x55b22b0e4a98]",
    "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
unsigned long, bool, unsigned long*, unsigned long, 
rocksdb::PreReleaseCallback*)+0x12cc) [0x55b22b0e0c5a]",
    "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
rocksdb::WriteBatch*)+0x4a) [0x55b22b0df92a]",
    "(RocksDBStore::submit_common(rocksdb::WriteOptions&, 
std::shared_ptr)+0x82) [0x55b22b036c42]",


"(RocksDBStore::submit_transaction_sync(std::shared_ptr)+0x96) 
[0x55b22b037cc6]",

    "(BlueStore::_kv_sync_thread()+0x1201) [0x55b22aafc891]",
    "(BlueStore::KVSyncThread::entry()+0xd) [0x55b22ab2792d]",
    "/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) 
[0x7f7f68626ea7]",

    "clone()"
    ],
    "ceph_version": "16.2.14",
    "crash_id": 
"2023-09-20T13:24:00.562318Z_beb5c664-9ffb-4a4e-8c61-166865fd4e0b",

    "entity_name": "osd.8",
    "os_id": "11",
    "os_name": "Debian GNU/Linux 11 (bullseye)",
    "os_version": "11 (bullseye)",
    "os_version_id": "11",
    "process_name": "ceph-osd",
    "stack_sig": 
"90d1fb6954f0f5b1e98659a93a1b9ce5a5a42cd5e0b2990a65dc336567adcb26",

    "timestamp": "2023-09-20T13:24:00.562318Z",
    "utsname_hostname": "cphosd02",
    "utsname_machine": "x86_64",
    "utsname_release": "5.10.0-23-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 5.10.179-1 (2023-05-12)"
}



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Igor Fedotov
]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 15171 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 15646 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 15648 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 15792 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 15794 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 25561 (bash) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 25563 (podman) in control group while starting unit. 
Ignoring.
sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This 
usually indicates unclean termination of a previous run, or service 
implementation deficiencies.


Patrick

Le 21/09/2023 à 12:44, Igor Fedotov a écrit :

Hi Patrick,

please share osd restart log to investigate that.


Thanks,

Igor

On 21/09/2023 13:41, Patrick Begou wrote:

Hi,

After a power outage on my test ceph cluster, 2 osd fail to 
restart.  The log file show:


8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service 
RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled 
restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 1858 (bash) in control group while starting unit. 
Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean 
termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 2815 (podman) in control group while starting 
unit. Ignoring.


This is not critical as it is a test cluster and it is actually 
rebalancing on other osd but I would like to know how to return to 
HEALTH_OK status.


Smartctl show the HDD are OK.

So is there a way to recover the osd from this state? Version is 
15.2.17 (just moved from 15.2.13 to 15.2.17 yesterday, will try to 
move to latest versions as soon as this problem is solved)


Thanks

Patrick

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After power outage, osd do not restart

2023-09-21 Thread Igor Fedotov

Hi Patrick,

please share osd restart log to investigate that.


Thanks,

Igor

On 21/09/2023 13:41, Patrick Begou wrote:

Hi,

After a power outage on my test ceph cluster, 2 osd fail to restart.  
The log file show:


8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service 
RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled 
restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 
250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 1858 (bash) in control group while starting unit. 
Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean 
termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: 
ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found 
left-over process 2815 (podman) in control group while starting unit. 
Ignoring.


This is not critical as it is a test cluster and it is actually 
rebalancing on other osd but I would like to know how to return to 
HEALTH_OK status.


Smartctl show the HDD are OK.

So is there a way to recover the osd from this state? Version is 
15.2.17 (just moved from 15.2.13 to 15.2.17 yesterday, will try to 
move to latest versions as soon as this problem is solved)


Thanks

Patrick

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Igor Fedotov

Hi!

Can you share OSD logs demonstrating such a restart?


Thanks,

Igor

On 20/09/2023 20:16, sbeng...@gmail.com wrote:

Since upgrading to 18.2.0 , OSDs are very frequently restarting due to 
livenessprobe failures making the cluster unusable. Has anyone else seen this 
behavior?

Upgrade path: ceph 17.2.6 to 18.2.0 (and rook from 1.11.9 to 1.12.1)
on ubuntu 20.04 kernel 5.15.0-79-generic

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cannot create new OSDs - ceph version 17.2.6 (810db68029296377607028a6c6da1ec06f5a2b27) quincy (stable)

2023-09-13 Thread Igor Fedotov

Hi Martin,

it looks like you're using custom osd settings. Namely:

- bluestore_allocator set to bitmap (which is fine)

- bluestore_min_alloc_size set to 128K

The latter is apparently out of sync with bluefs_shared_alloc_size (set 
to 64K by default), which causes the assertion at some point due to 
unexpected allocation request alignment.


Generally one should have bluefs_shared_alloc_size equal or higher than 
(and aligned to) bluestore_min_alloc_size.


I'm curious why you have raised bluestore_min_alloc_size though. I 
recall no case where we've heard of any benefit from that, particularly 
for SSD devices...


I'd recommend setting it back to the default.
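
A sketch of what that could look like, assuming the override was made via 
the config database (use whichever of bluestore_min_alloc_size / _ssd / 
_hdd was actually changed; the value is only applied when an OSD is 
created, so it matters for newly deployed OSDs):

ceph config rm osd bluestore_min_alloc_size
# if the override lives in ceph.conf instead, remove it there and redeploy the affected OSDs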


Thanks,

Igor

On 12/09/2023 19:44, Konold, Martin wrote:

Hi Igor,

I recreated the log with full debugging enabled.

https://www.konsec.com/download/full-debug-20-ceph-osd.43.log.gz

and another without the debug settings

https://www.konsec.com/download/failed-ceph-osd.43.log.gz

I hope you can draw some conclusions from it and I am looking forward 
to your response.


Regards
ppa. Martin Konold

--
Martin Konold - Prokurist, CTO
KONSEC GmbH -⁠ make things real
Amtsgericht Stuttgart, HRB 23690
Geschäftsführer: Andreas Mack
Im Köller 3, 70794 Filderstadt, Germany

On 2023-09-11 22:08, Igor Fedotov wrote:

Hi Martin,

could you please share the full existing log and also set
debug_bluestore and debug_bluefs to 20 and collect new osd startup
log.


Thanks,

Igor

On 11/09/2023 20:53, Konold, Martin wrote:

Hi,

I want to create a new OSD on a 4TB Samsung MZ1L23T8HBLA-00A07 
enterprise nvme device in a hyper-converged proxmox 8 environment.


Creating the OSD works but it cannot be initialized and therefore 
not started.


In the log I see an entry about a failed assert.

./src/os/bluestore/fastbmap_allocator_impl.cc: 405: FAILED 
ceph_assert((aligned_extent.length % l0_granularity) == 0)


Is this the culprit?

In addition at the end of the logfile there is a failed mount and a 
failed osd init mentioned.


2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs 
_check_allocations OP_FILE_UPDATE_INC invalid extent 1: 
0x14~1: duplicate reference, ino 30
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs mount failed to 
replay log: (14) Bad address

2023-09-11T16:30:04.708+0200 7f99aa28f3c0 20 bluefs _stop_alloc
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 
bluestore(/var/lib/ceph/osd/ceph-43) _open_bluefs failed bluefs 
mount: (14) Bad address
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 10 bluefs 
maybe_verify_layout no memorized_layout in bluefs superblock
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 
bluestore(/var/lib/ceph/osd/ceph-43) _open_db failed to prepare db 
environment:
2023-09-11T16:30:04.708+0200 7f99aa28f3c0  1 bdev(0x5565c261fc00 
/var/lib/ceph/osd/ceph-43/block) close
2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1 osd.43 0 OSD:init: 
unable to mount object store
2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1  ** ERROR: osd init 
failed: (5) Input/output error


I verified that the hardware of the new nvme is working fine.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-12 Thread Igor Fedotov

Hey Konstantin,

forgot to mention - indeed clusters having a 4K bluestore min alloc size 
are more likely to be exposed to the issue. The key point is the 
difference between bluestore and bluefs allocation sizes. The issue is 
likely to pop up when user and DB data are collocated but different 
allocation units are in use. As a result the allocator needs to locate 
properly aligned chunks for BlueFS among a bunch of inappropriate 
misaligned chunks, which might be ineffective in the current 
implementation and cause the slowdown.



Thanks,

Igor

On 12/09/2023 15:47, Konstantin Shalygin wrote:

Hi Igor,


On 12 Sep 2023, at 15:28, Igor Fedotov  wrote:

Default hybrid allocator (as well as AVL one it's based on) could take 
dramatically long time to allocate pretty large (hundreds of MBs) 64K-aligned 
chunks for BlueFS. At the original cluster it was exposed as 20-30 sec OSD 
stalls.

For the chunks, does this mean bluestore min alloc size?
This cluster was deployed pre-Pacific (64k) and not redeployed to the Pacific 
default (4k)?


Thanks,
k
Sent from my iPhone



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-12 Thread Igor Fedotov

Hi All,

as promised, here is a postmortem analysis of what happened.

The following ticket (https://tracker.ceph.com/issues/62815) with 
accompanying materials provides a low-level overview of the issue.


In a few words it is as follows:

The default hybrid allocator (as well as the AVL one it's based on) could take 
a dramatically long time to allocate pretty large (hundreds of MBs) 
64K-aligned chunks for BlueFS. At the original cluster this was exposed as 
20-30 sec OSD stalls.


This is apparently not specific to the recent 16.2.14 Pacific release, as 
I saw that at least once before, but 
https://github.com/ceph/ceph/pull/51773 made it more likely to pop up: 
RocksDB could preallocate huge WALs in a single shot from now on.


The issue is definitely bound to aged/fragmented main OSD volumes which 
colocate DB ones. I don't expect it to pop up for standalone DB/WALs.


As already mentioned in this thread, the proposed workaround is to 
switch bluestore_allocator to bitmap. This might cause a minor overall 
performance drop, so I'm not sure one should apply this unconditionally.
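
For reference, a couple of commands that may help here (osd.0 is a 
placeholder, the score command assumes the admin socket is reachable on 
the OSD host, and the allocator switch only takes effect after an OSD 
restart):

# rough fragmentation estimate of the main-device allocator
ceph daemon osd.0 bluestore allocator score block
# the workaround itself
ceph config set osd bluestore_allocator bitmap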


I'd like to apologize for the inconvenience this could cause. 
We're currently working on a proper fix...


Thanks,

Igor

On 07/09/2023 10:05, J-P Methot wrote:

Hi,

We're running latest Pacific on our production cluster and we've been 
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed 
out after 15.00954s' error. We have reasons to believe this 
happens each time the RocksDB compaction process is launched on an 
OSD. My question is, does the cluster detecting that an OSD has timed 
out interrupt the compaction process? This seems to be what's 
happening, but it's not immediately obvious. We are currently facing 
an infinite loop of random OSDs timing out and if the compaction 
process is interrupted without finishing, it may explain that.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cannot create new OSDs - ceph version 17.2.6 (810db68029296377607028a6c6da1ec06f5a2b27) quincy (stable)

2023-09-11 Thread Igor Fedotov

Hi Martin,

could you please share the full existing log and also set 
debug_bluestore and debug_bluefs to 20 and collect new osd startup log.



Thanks,

Igor

On 11/09/2023 20:53, Konold, Martin wrote:

Hi,

I want to create a new OSD on a 4TB Samsung MZ1L23T8HBLA-00A07 
enterprise nvme device in a hyper-converged proxmox 8 environment.


Creating the OSD works but it cannot be initialized and therefore not 
started.


In the log I see an entry about a failed assert.

./src/os/bluestore/fastbmap_allocator_impl.cc: 405: FAILED 
ceph_assert((aligned_extent.length % l0_granularity) == 0)


Is this the culprit?

In addition at the end of the logfile there is a failed mount and a 
failed osd init mentioned.


2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs _check_allocations 
OP_FILE_UPDATE_INC invalid extent 1: 0x14~1: duplicate 
reference, ino 30
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 bluefs mount failed to 
replay log: (14) Bad address

2023-09-11T16:30:04.708+0200 7f99aa28f3c0 20 bluefs _stop_alloc
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 
bluestore(/var/lib/ceph/osd/ceph-43) _open_bluefs failed bluefs mount: 
(14) Bad address
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 10 bluefs 
maybe_verify_layout no memorized_layout in bluefs superblock
2023-09-11T16:30:04.708+0200 7f99aa28f3c0 -1 
bluestore(/var/lib/ceph/osd/ceph-43) _open_db failed to prepare db 
environment:
2023-09-11T16:30:04.708+0200 7f99aa28f3c0  1 bdev(0x5565c261fc00 
/var/lib/ceph/osd/ceph-43/block) close
2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1 osd.43 0 OSD:init: unable 
to mount object store
2023-09-11T16:30:04.940+0200 7f99aa28f3c0 -1  ** ERROR: osd init 
failed: (5) Input/output error


I verified that the hardware of the new nvme is working fine.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: A couple OSDs not starting after host reboot

2023-08-29 Thread Igor Fedotov

Hi All,

from the log output (the line with the "Malformed input" string) it rather 
looks like corruption of the device label (the very first 4K data block at 
the main OSD device, containing some basic OSD meta, e.g. OSD UUID). There 
is also some chance that the wrong device has been attached.


Alison, to investigate further could you please share the 4K superblock 
content (can be retrieved using dd tool: "dd 
if=/var/lib/ceph/osd/ceph-665/block count=1 bs=4096 of=./superb.out") 
and /var/lib/ceph/osd/ceph-665 file listing.



Thanks,

Igor

On 8/25/2023 8:58 PM, Eugen Block wrote:

Hi,
one thing coming to mind is maybe the device names have changed from 
/dev/sdX to /dev/sdY? Something like that has been reported a couple 
of times in the last months.


Zitat von Alison Peisker :


Hi all,

We rebooted all the nodes in our 17.2.5 cluster after performing 
kernel updates, but 2 of the OSDs on different nodes are not coming 
back up. This is a production cluster using cephadm.


The error message from the OSD log is ceph-osd[87340]:  ** ERROR: 
unable to open OSD superblock on /var/lib/ceph/osd/ceph-665: (2) No 
such file or directory


The error message from ceph-volume is 2023-08-23T16:12:43.452-0500 
7f0cad968600  2 
bluestore(/dev/mapper/ceph--febad5a5--ba44--41aa--a39e--b9897f757752-osd--block--87e548f4--b9b5--4ed8--aca8--de703a341a50) 
_read_bdev_label unable to decode label at offset 102: void 
bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) 
decode past end of struct encoding: Malformed input


We tried restarting the daemons and rebooting the node again, but 
still see the same error.

Has anyone experienced this issue before? How do we fix this?

Thanks,
Alison
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lots of space allocated in completely empty OSDs

2023-08-14 Thread Igor Fedotov

Hi Andras,

does

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2925 --op 
meta-list


show nothing as well?


On 8/11/2023 11:00 PM, Andras Pataki wrote:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2925  --op list 


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume lvm migrate error

2023-08-02 Thread Igor Fedotov

Hi Roland,

could you please share the content of the relevant OSD subfolder?

Also you might want to run:

ceph-bluestore-tool --path  --command bluefs-bdev-sizes

to make sure DB/WAL are effectively in use.
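
As a side note, in case the goal is to move a collocated DB off the main 
device: "--from db wal" only has something to migrate when separate DB/WAL 
volumes already exist, so one possible sequence (only a sketch, not a 
verified fix for this particular error, with the OSD stopped and reusing 
the ids from your command) would be:

ceph-volume lvm new-db --osd-id 14 --osd-fsid 4de2a617-4452-420d-a99b-9e0cd6b2a99b --target NodeC-nvme1/NodeC-nvme-LV-RocksDB1
ceph-volume lvm migrate --osd-id 14 --osd-fsid 4de2a617-4452-420d-a99b-9e0cd6b2a99b --from data --target NodeC-nvme1/NodeC-nvme-LV-RocksDB1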


Thanks,

Igor

On 8/2/2023 12:04 PM, Roland Giesler wrote:
I need some help with this please.  The command below gives an error 
which is not helpful to me.


ceph-volume lvm migrate --osd-id 14 --osd-fsid 
4de2a617-4452-420d-a99b-9e0cd6b2a99b --from db wal --target 
NodeC-nvme1/NodeC-nvme-LV-RocksDB1

--> Source device list is empty
Unable to migrate to : NodeC-nvme1/NodeC-nvme-LV-RocksDB1

Alternatively I have tried to only specify --from db instead of 
including wal, but it makes no difference.


Here is the OSD in question.

# ls -la /dev/ceph-025b887e-4f06-468f-845c-0ddf9ad04990/
lrwxrwxrwx  1 root root    7 Dec 25  2022 
osd-block-4de2a617-4452-420d-a99b-9e0cd6b2a99b -> ../dm-4



What is happening here?  I want to move the DB/WAL to NVMe storage 
without trashing the data OSD and having to go through rebalancing for 
each drive I do this for.



thanks

Roland

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD stuck on booting state after upgrade (v15.2.17 -> v17.2.6)

2023-07-27 Thread Igor Fedotov

Hi,

looks like you've run into https://tracker.ceph.com/issues/58156


IIRC there is no way to work around the issue other than having a custom 
build with the proper patch (Quincy backport is 
https://github.com/ceph/ceph/pull/51102). Unfortunately the fix hasn't 
been merged into Quincy/Pacific yet, hence you'll need to build that 
stuff yourself.



Hope this helps,

Igor

On 26/07/2023 10:34, s.smagu...@gmail.com wrote:

Updating the Ceph cluster from the Octopus version (v15.2.17) to Quincy 
(v17.2.6).

We used ceph-deploy to update all ceph packages on all hosts, and then we restarted the 
services one by one (mon -> mgr -> osd -> rgw).
During the restart on the first node, all osd encountered an issue where they didn't change to the 
"up" state and got stuck in "booting" state.

# ceph daemon osd.3 status
{
 "cluster_fsid": "f95b201c-4cd6-4c36-a54e-7f2b68608b8f",
 "osd_fsid": "b0141718-a2ac-4a26-808b-17b6741b789e",
 "whoami": 3,
 "state": "booting",
 "oldest_map": 4437792,
 "newest_map": 4441114,
 "num_pgs": 29
}

While changing "ceph osd require-osd-release quincy," the monitor service 
crashed with errors.

# ceph report | jq '.osdmap.require_osd_release'
"nautilus"

-2> 2023-07-25T12:10:20.977+0600 7f245a84f700  5 
mon.ceph-ph-mon1-dc3@0(leader).paxos(paxos updating c 81819224..81819937) 
is_readable = 1 - now=2023-07-25T12:10:20.981801+0600 
lease_expire=2023-07-25T12:10:25.959818+0600 has v0 lc 81819937
-1> 2023-07-25T12:10:20.997+0600 7f245a84f700 -1 
/build/ceph-17.2.6/src/mon/OSDMonitor.cc: In function 'bool 
OSDMonitor::prepare_command_impl(MonOpRequestRef, const cmdmap_t&)' thread 
7f245a84f700 time 2023-07-25T12:10:20.981991+0600
/build/ceph-17.2.6/src/mon/OSDMonitor.cc: 11631: FAILED 
ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus)

  ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14f) [0x7f24629d3878]
  2: /usr/lib/ceph/libceph-common.so.2(+0x27da8a) [0x7f24629d3a8a]
  3: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr, 
std::map, std::allocator 
>, boost::variant, std::allocator >, bool, long, double, 
std::vector, std::allocator >, 
std::allocator, std::allocator > > >, std::vector 
>, std::vector > >, std::less, std::allocator, std::allocator > const, 
boost::variant, std::allocator >, 
bool, long, double, std:
:vector, std::allocator >, 
std::allocator, std::allocator > > >, 
std::vector >, std::vector > > > > > 
const&)+0xcb13) [0x5569f209a823]
  4: (OSDMonitor::prepare_command(boost::intrusive_ptr)+0x45f) 
[0x5569f20ab89f]
  5: (OSDMonitor::prepare_update(boost::intrusive_ptr)+0x162) 
[0x5569f20baa42]
  6: (PaxosService::dispatch(boost::intrusive_ptr)+0x716) 
[0x5569f201fd86]
  7: (PaxosService::C_RetryMessage::_finish(int)+0x6c) [0x5569f1f4f93c]
  8: (C_MonOp::finish(int)+0x4b) [0x5569f1ebbb3b]
  9: (Context::complete(int)+0xd) [0x5569f1ebaa0d]
  10: (void finish_contexts > 
>(ceph::common::CephContext*, std::__cxx11::list >&, 
int)+0xb0) [0x5569f1ef11e0]
  11: (Paxos::finish_round()+0xb1) [0x5569f2015a61]
  12: (Paxos::handle_last(boost::intrusive_ptr)+0x11e3) 
[0x5569f20172a3]
  13: (Paxos::dispatch(boost::intrusive_ptr)+0x49f) 
[0x5569f2019f7f]
  14: (Monitor::dispatch_op(boost::intrusive_ptr)+0x14f4) 
[0x5569f1eb7f34]
  15: (Monitor::_ms_dispatch(Message*)+0xa68) [0x5569f1eb8bd8]
  16: (Dispatcher::ms_dispatch2(boost::intrusive_ptr const&)+0x5d) 
[0x5569f1ef2c4d]
  17: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr 
const&)+0x460) [0x7f2462c71da0]
  18: (DispatchQueue::entry()+0x58f) [0x7f2462c6f63f]
  19: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2462d40b61]
  20: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f24624f4609]
  21: clone()

  0> 2023-07-25T12:10:21.009+0600 7f245a84f700 -1 *** Caught signal 
(Aborted) **
  in thread 7f245a84f700 thread_name:ms_dispatch

  ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f24625003c0]
  2: gsignal()
  3: abort()
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1b7) [0x7f24629d38e0]
  5: /usr/lib/ceph/libceph-common.so.2(+0x27da8a) [0x7f24629d3a8a]
  6: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr, 
std::map, std::allocator 
>, boost::variant, std::allocator >, bool, long, double, 
std::vector, std::allocator >, 
std::allocator, std::allocator > > >, std::vector 
>, std::vector > >, std::less, std::allocator, std::allocator > const, 
boost::variant, std::allocator >, 
bool, long, double, std:
:vector, std::allocator >, 
std::allocator, std::allocator > > >, 
std::vector >, std::vector > > > > > 
const&)+0xcb13) [0x5569f209a823]
  7: (OSDMonitor::prepare_command(boost::intrusive_ptr)+0x45f) 
[0x5569f20ab89f]
  8: (OSDMonitor::prepare_update(boost::intrusive_ptr)+0x162) 
[0x5569f20baa42]
  9: 

[ceph-users] Re: Bluestore compression - Which algo to choose? Zstd really still that bad?

2023-06-27 Thread Igor Fedotov

Hi Christian,

I can't say anything about your primary question on zstd 
benefits/drawbacks but I'd like to emphasize that compression ratio at 
BlueStore is (to a major degree) determined by the input data flow 
characteristics (primarily write block size), object store allocation 
unit size (bluestore_min_alloc_size) and some parameters (e.g. maximum 
blob size) that determine how input data chunks are logically split when 
landing on disk.


E.g. if one has min_alloc_size set to 4K and the write block size is in 
(4K-8K] then the resulting compressed block would never be less than 4K. 
Hence the compression ratio is never more than 2.


Similarly if min_alloc_size is 64K there would be no benefit in 
compression at all for the above input since target allocation units are 
always larger than input blocks.


The rationale of the above behavior is that compression is applied 
exclusively on input blocks - there is no additional processing to merge 
input and existing data and compress them all together.
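
A small worked example may help (the numbers are illustrative only):

min_alloc_size = 4K, incoming write = 8K, compressed down to 3K
  -> stored size rounds up to one 4K allocation unit, so the ratio is 8K / 4K = 2
min_alloc_size = 64K, incoming write = 8K, compressed down to 3K
  -> the data occupies a 64K allocation unit either way, so compression gains nothing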



Thanks,

Igor


On 26/06/2023 11:48, Christian Rohmann wrote:

Hey ceph-users,

we've been using the default "snappy" to have Ceph compress data on 
certain pools - namely backups / copies of volumes of a VM environment.

So it's write once, and no random access.
I am now wondering if switching to another algo (there is snappy, 
zlib, lz4, or zstd) would improve the compression ratio (significantly)?


* Does anybody have any real world data on snappy vs. $anyother?

Using zstd is tempting as it's used in various other applications 
(btrfs, MongoDB, ...) for inline-compression with great success.
For Ceph though there is still a warning ([1]) in the docs about it being 
not recommended. But I am wondering if this still stands 
with e.g. [2] merged.
And there was [3] trying to improve the performance, but this reads 
as if it only led to a dead end with no code changes?



In any case does anybody have any numbers to help with the decision on 
the compression algo?




Regards


Christian


[1] 
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#confval-bluestore_compression_algorithm

[2] https://github.com/ceph/ceph/pull/33790
[3] https://github.com/facebook/zstd/issues/910
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs

2023-06-22 Thread Igor Fedotov
Quincy brings support for a 4K allocation unit but doesn't start using it 
immediately. Instead it falls back to 4K when bluefs is unable to 
allocate more space with the default size. And even this mode isn't 
permanent - bluefs attempts to bring larger units back from time to time.



Thanks,

Igor

On 22/06/2023 00:04, Fox, Kevin M wrote:

Does quincy automatically switch existing things to 4k or do you need to do a 
new osd to get the 4k size?

Thanks,
Kevin


From: Igor Fedotov 
Sent: Wednesday, June 21, 2023 5:56 AM
To: Carsten Grommel; ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs



Hi Carsten,

please also note a workaround to bring the osds back for e.g. data
recovery - set bluefs_shared_alloc_size to 32768.

This will hopefully allow the OSD to start up and pull data out of it. But I
wouldn't encourage you to use such OSDs long term as fragmentation
might evolve and this workaround will become ineffective as well.

Please do not apply this change to healthy OSDs as it's irreversible.


BTW, having two namespaces at the NVMe drive is a good alternative to Logical
Volumes if for some reason one needs two "physical" disks for the OSD setup...

Thanks,

Igor

On 21/06/2023 11:41, Carsten Grommel wrote:

Hi Igor,

thank you for your answer!


first of all Quincy does have a fix for the issue, see
https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
https://tracker.ceph.com/issues/58588)

Thank you I somehow missed that release, good to know!


SSD or HDD? Standalone or shared DB volume? I presume the latter... What
is disk size and current utilization?

Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
possible

We use 4 TB NVMe SSDs, shared db yes and mainly Micron with some Dell
and Samsung in this cluster:

Micron_7400_MTFDKCB3T8TDZ_214733D291B1 cloud5-1561:nvme5n1  osd.5

All Disks are at ~ 88% utilization. I noticed that around 92% our
disks tend to run into this bug.

Here are some bluefs-bdev-sizes from different OSDs on different hosts
in this cluster:

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec0 : using 0x2e1b390(2.9 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-24/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec0 : using 0x2d4e318d000(2.8 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec0 : using 0x2f2da93d000(2.9 TiB)


Generally, given my assumption that DB volume is currently collocated
and you still want to stay on Pacific, you might want to consider
redeploying OSDs with a standalone DB volume setup.

Just create large enough (2x of the current DB size seems to be pretty
conservative estimation for that volume's size) additional LV on top of
the same physical disk. And put DB there...

Separating DB from main disk would result in much less fragmentation at
DB volume and hence work around the problem. The cost would be having
some extra spare space at DB volume unavailable for user data .

I guess that makes sense, so the suggestion would be to deploy the OSD and
DB on the same NVMe

but with different logical volumes, or to update to Quincy.

Thank you!

Carsten

*Von: *Igor Fedotov 
*Datum: *Dienstag, 20. Juni 2023 um 12:48
*An: *Carsten Grommel , ceph-users@ceph.io

*Betreff: *Re: [ceph-users] Ceph Pacific bluefs enospc bug with newly
created OSDs

Hi Carsten,

first of all Quincy does have a fix for the issue, see
https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
https://tracker.ceph.com/issues/58588)

Could you please share a bit more info on OSD disk layout?

SSD or HDD? Standalone or shared DB volume? I presume the latter... What
is disk size and current utilization?

Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
possible


Generally, given my assumption that DB volume is currently collocated
and you still want to stay on Pacific, you might want to consider
redeploying OSDs with a standalone DB volume setup.

Just create large enough (2x of the current DB size seems to be pretty
conservative estimation for that volume's size) additional LV on top of
the same physical disk. And put DB there...

Separating DB from main disk would result in much less fragmentation at
DB volume and hence work around the problem. The cost would be having
some extra spare space at DB volume unavailable for user data .


Hope this helps,

Igor


On 20/06/2023 10:29, Carsten Grommel wrote:

Hi all,

we are experiencing the “bluefs enospc bug” again after redeploying

all OSDs of our Pacific Cluster.

I know that our cluster is a bit too utilized at the moment with

87.26 % raw usage but still this should not ha

[ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs

2023-06-21 Thread Igor Fedotov

Hi Carsten,

please also note a workaround to bring the osds back for e.g. data 
recovery - set bluefs_shared_alloc_size to 32768.


This will hopefully allow OSD to startup and pull data out of it. But I 
wouldn't discourage you from using such OSDs long term as fragmentation 
might evolve and this workaround will become ineffective as well.


Please do not apply this change to healthy OSDs as it's irreversible.
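
E.g. something like this for a single affected OSD (the osd id is just an 
example; the option is picked up when the OSD starts):

# pin the override to the broken OSD only
ceph config set osd.12 bluefs_shared_alloc_size 32768
systemctl start ceph-osd@12   # or restart the corresponding cephadm unit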


BTW, having two namespace at NVMe drive is a good alternative to Logical 
Volumes if for some reasons one needs two "physical" disks for OSD setup...


Thanks,

Igor

On 21/06/2023 11:41, Carsten Grommel wrote:


Hi Igor,

thank you for your answer!

>first of all Quincy does have a fix for the issue, see
>https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
>https://tracker.ceph.com/issues/58588)

Thank you I somehow missed that release, good to know!

>SSD or HDD? Standalone or shared DB volume? I presume the latter... What
>is disk size and current utilization?
>
>Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
>possible

We use 4 TB NVMe SSDs, shared db yes and mainly Micron with some Dell 
and Samsung in this cluster:


Micron_7400_MTFDKCB3T8TDZ_214733D291B1 cloud5-1561:nvme5n1  osd.5

All disks are at ~88% utilization. I noticed that at around 92% utilization our 
disks tend to run into this bug.


Here are some bluefs-bdev-sizes from different OSDs on different hosts 
in this cluster:


ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec0 : using 0x2e1b390(2.9 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-24/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec0 : using 0x2d4e318d000(2.8 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec0 : using 0x2f2da93d000(2.9 TiB)

>Generally, given my assumption that DB volume is currently collocated
>and you still want to stay on Pacific, you might want to consider
>redeploying OSDs with a standalone DB volume setup.
>
>Just create large enough (2x of the current DB size seems to be pretty
>conservative estimation for that volume's size) additional LV on top of
>the same physical disk. And put DB there...
>
>Separating DB from main disk would result in much less fragmentation at
>DB volume and hence work around the problem. The cost would be having
>some extra spare space at DB volume unavailable for user data .

I guess that makes sense, so the suggestion would be to deploy the OSD and 
DB on the same NVMe


but with different logical volumes, or to update to Quincy.

Thank you!

Carsten

*Von: *Igor Fedotov 
*Datum: *Dienstag, 20. Juni 2023 um 12:48
*An: *Carsten Grommel , ceph-users@ceph.io 

*Betreff: *Re: [ceph-users] Ceph Pacific bluefs enospc bug with newly 
created OSDs


Hi Carsten,

first of all Quincy does have a fix for the issue, see
https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
https://tracker.ceph.com/issues/58588)

Could you please share a bit more info on OSD disk layout?

SSD or HDD? Standalone or shared DB volume? I presume the latter... What
is disk size and current utilization?

Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
possible


Generally, given my assumption that DB volume is currently collocated
and you still want to stay on Pacific, you might want to consider
redeploying OSDs with a standalone DB volume setup.

Just create large enough (2x of the current DB size seems to be pretty
conservative estimation for that volume's size) additional LV on top of
the same physical disk. And put DB there...

Separating DB from main disk would result in much less fragmentation at
DB volume and hence work around the problem. The cost would be having
some extra spare space at DB volume unavailable for user data .


Hope this helps,

Igor


On 20/06/2023 10:29, Carsten Grommel wrote:
> Hi all,
>
> we are experiencing the “bluefs enospc bug” again after redeploying 
all OSDs of our Pacific Cluster.
> I know that our cluster is a bit too utilized at the moment with 
87.26 % raw usage but still this should not happen afaik.
> We never had this problem with previous ceph versions and right now 
I am kind of out of ideas on how to tackle these crashes.

> Compacting the database did not help in the past either.
> Redeploy seems to be no help in the long run as well. For documentation 
I used these commands to redeploy the osds:

>
> systemctl stop ceph-osd@${OSDNUM}
> ceph osd destroy --yes-i-really-mean-it ${OSDNUM}
> blkdiscard ${DEVICE}
> sgdisk -Z ${DEVICE}
> dmsetup remove ${DMDEVICE}
> ceph-volume lvm create --osd-id ${OSDNUM} --data ${DEVICE}
>
> Any ideas or possible solutions on this?  I am not yet ready to 
u

[ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs

2023-06-20 Thread Igor Fedotov

Hi Carsten,

first of all Quincy does have a fix for the issue, see 
https://tracker.ceph.com/issues/53466 (and its Quincy counterpart 
https://tracker.ceph.com/issues/58588)


Could you please share a bit more info on OSD disk layout?

SSD or HDD? Standalone or shared DB volume? I presume the latter... What 
is disk size and current utilization?


Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if 
possible



Generally, given my assumption that DB volume is currently collocated 
and you still want to stay on Pacific, you might want to consider 
redeploying OSDs with a standalone DB volume setup.


Just create large enough (2x of the current DB size seems to be pretty 
conservative estimation for that volume's size) additional LV on top of 
the same physical disk. And put DB there...


Separating DB from main disk would result in much less fragmentation at 
DB volume and hence work around the problem. The cost would be having 
some extra spare space at DB volume unavailable for user data .
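
A rough sketch of what that redeployment could look like (the VG name, LV 
names, sizes and the osd id are placeholders, assuming a VG already exists on 
the NVMe):

# carve the disk into a data LV and a DB LV
lvcreate -l 90%FREE -n osd-5-block ceph-nvme-vg
lvcreate -l 100%FREE -n osd-5-db ceph-nvme-vg
# redeploy the OSD with a standalone block.db
ceph-volume lvm create --osd-id 5 --data ceph-nvme-vg/osd-5-block --block.db ceph-nvme-vg/osd-5-db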



Hope this helps,

Igor


On 20/06/2023 10:29, Carsten Grommel wrote:

Hi all,

we are experiencing the “bluefs enospc bug” again after redeploying all OSDs of 
our Pacific Cluster.
I know that our cluster is a bit too utilized at the moment with 87.26 % raw 
usage but still this should not happen afaik.
We never had this problem with previous ceph versions and right now I am kind 
of out of ideas on how to tackle these crashes.
Compacting the database did not help in the past either.
Redeploy seems to be no help in the long run as well. For documentation I used 
these commands to redeploy the osds:

systemctl stop ceph-osd@${OSDNUM}
ceph osd destroy --yes-i-really-mean-it ${OSDNUM}
blkdiscard ${DEVICE}
sgdisk -Z ${DEVICE}
dmsetup remove ${DMDEVICE}
ceph-volume lvm create --osd-id ${OSDNUM} --data ${DEVICE}

Any ideas or possible solutions on this?  I am not yet ready to upgrade our 
clusters to quincy, also I do presume that this bug is still present in quincy 
as well?

Follow our cluster information:

Crash Info:
ceph crash info 2023-06-19T21:23:51.285180Z_ac4105d7-cb09-45c8-a6e3-8a6bb6727b25
{
 "assert_condition": "abort",
 "assert_file": "/build/ceph/src/os/bluestore/BlueFS.cc",
 "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, 
uint64_t)",
 "assert_line": 2810,
 "assert_msg": "/build/ceph/src/os/bluestore/BlueFS.cc: In function 'int 
BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fd561810100 time 
2023-06-19T23:23:51.261617+0200\n/build/ceph/src/os/bluestore/BlueFS.cc: 2810: ceph_abort_msg(\"bluefs 
enospc\")\n",
 "assert_thread_name": "ceph-osd",
 "backtrace": [
 "/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fd56225f730]",
 "gsignal()",
 "abort()",
 "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string, std::allocator > const&)+0x1a7) [0x557bb3c65762]",
 "(BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
long)+0x1175) [0x557bb42e7945]",
 "(BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0xa1) 
[0x557bb42e7ad1]",
 "(BlueFS::_flush(BlueFS::FileWriter*, bool, 
std::unique_lock&)+0x2e) [0x557bb42f803e]",
 "(BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) 
[0x557bb431134b]",
 "(rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, 
rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x44) [0x557bb478e602]",
 "(rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned 
long)+0x333) [0x557bb4956feb]",
 "(rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x5d1) 
[0x557bb4955569]",
 "(rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, 
rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0x11d) [0x557bb4b142e1]",
 "(rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, 
rocksdb::BlockHandle*, bool)+0x7d6) [0x557bb4b140ca]",
 "(rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, 
rocksdb::BlockHandle*, bool)+0x48) [0x557bb4b138e0]",
 "(rocksdb::BlockBasedTableBuilder::Flush()+0x9a) [0x557bb4b13890]",
 "(rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice 
const&)+0x192) [0x557bb4b133c8]",
 "(rocksdb::BuildTable(std::__cxx11::basic_string, std::allocator > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, 
rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase*, std::vector >, std::allocator > > >, 
rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector >, 
std::allocator > > > const*, unsigned int, std::__cxx11::basic_string, 
std::allocator > const&, std::vector >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, 
bool, rocksdb::InternalStats*, 

[ceph-users] Re: BlueStore fragmentation woes

2023-05-31 Thread Igor Fedotov
/ceph-183)  allocation stats probe 2: cnt: 17987 
frags: 17987 size: 32446676992
May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-28T18:35:22.790+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -1: 21145,  21145, 44858146816
May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-28T18:35:22.790+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -2: 21986,  21986, 47407562752
May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-28T18:35:22.790+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -6: 0,  0, 0
May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-28T18:35:22.790+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -10: 0,  0, 0
May 28 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-28T18:35:22.790+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -18: 0,  0, 0
May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-29T18:35:22.815+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  allocation stats probe 3: cnt: 17509 
frags: 17509 size: 31015436288
May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-29T18:35:22.815+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -1: 17987,  17987, 32446676992
May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-29T18:35:22.815+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -3: 21986,  21986, 47407562752
May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-29T18:35:22.815+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -7: 0,  0, 0
May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-29T18:35:22.815+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -11: 0,  0, 0
May 29 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-29T18:35:22.815+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -19: 0,  0, 0
May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-30T18:35:22.826+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  allocation stats probe 4: cnt: 21016 
frags: 21016 size: 45432438784
May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-30T18:35:22.826+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -1: 17509,  17509, 31015436288
May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-30T18:35:22.826+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -2: 17987,  17987, 32446676992
May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-30T18:35:22.826+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -4: 21986,  21986, 47407562752
May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-30T18:35:22.826+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -12: 0,  0, 0
May 30 11:35:22 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2490690]: 
debug 2023-05-30T18:35:22.826+ 7fe190013700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -20: 0,  0, 0

Thanks,
Kevin


From: Fox, Kevin M 
Sent: Thursday, May 25, 2023 9:36 AM
To: Igor Fedotov; Hector Martin; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: BlueStore fragmentation woes

Ok, I'm gathering the "allocation stats probe" stuff. Not sure I follow what 
you mean by the historic probes. just:
| egrep "allocation stats probe|probe"   ?

That gets something like:
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  allocation stats probe 110: cnt: 27637 
frags: 27637 size: 63777406976
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -1: 24503,  24503, 58141900800
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -2: 24594,  24594, 56951898112
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -6: 19737,  19737, 37299027968
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
blu

[ceph-users] Re: BlueStore fragmentation woes

2023-05-29 Thread Igor Fedotov

Hi Stefan,

given that allocation probes include every allocation (including short 
4K ones) your stats look pretty high indeed.


Although you omitted the historic probes, so it's hard to tell if there is 
a negative trend in them...


As I mentioned in my reply to Hector, one might want to investigate further 
by e.g. building a histogram (chunk size vs. number of chunks) 
using the output of the 'ceph tell osd.N bluestore allocator dump block' 
command and monitoring how it evolves over time. A script to build such a 
histogram is still to be written. ;)



As for the Pacific release being the culprit - likely it is. But there were 
two major updates which could have had an impact. Both came in the same PR 
(https://github.com/ceph/ceph/pull/34588):


1. 4K allocation unit for spinners

2. Switch to avl/hybrid allocator.

Honestly I'd rather bet on 1.
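
For reference, the allocator in use and the min alloc size that would apply to 
newly created spinners can be checked like this (existing OSDs keep the 
min_alloc_size they were created with):

ceph config get osd bluestore_allocator
ceph config get osd bluestore_min_alloc_size_hdd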

>BlueFS 4K allocation unit will not be backported to Pacific [3]. Would 
it make sense to skip re-provisiong OSDs in Pacific altogether and do 
re-provisioning in Quincy release with BlueFS 4K alloc size support [4]?


IIRC this feature doesn't require OSD redeployment - new superblock 
format is applied on-the-fly and 4K allocations are enabled immediately. 
So there is no specific requirement to re-provision OSD at Quincy+. 
Hence you're free to go with Pacific and enable 4K for BlueFS later in 
Quincy.



Thanks,

Igor

On 26/05/2023 16:03, Stefan Kooman wrote:

On 5/25/23 22:12, Igor Fedotov wrote:


On 25/05/2023 20:36, Stefan Kooman wrote:

On 5/25/23 18:17, Igor Fedotov wrote:

Perhaps...

I don't like the idea to use fragmentation score as a real index. 
IMO it's mostly like a very imprecise first turn marker to alert 
that something might be wrong. But not a real quantitative 
high-quality estimate.


Chiming in on the high fragmentation issue. We started collecting 
"fragmentation_rating" of each OSD this afternoon. All OSDs that 
have been provisioned a year ago have a fragmentation rating of ~ 
0.9. Not sure for how long they are on this level.


Could you please collect allocation probes from existing OSD logs? 
Just a few samples from different OSDs...


10 OSDs from one host, but I have checked other nodes and they are 
similar:


CNT    FRAG    Size    Ratio    Avg Frag size
21350923    37146899    317040259072    1.73982637659271 8534.77053554322
20951932    38122769    317841477632    1.8195347808498 8337.31352599283
21188454    37298950    278389411840    1.76034315670223 7463.73321072041
21605451    39369462    270427185152    1.82220042525379 6868.95810646333
19215230    36063713    290967818240    1.87682962941375 8068.16032059705
19293599    35464928    269238423552    1.83817068033807 7591.68109835159
19963538    36088151    315796836352    1.80770317365589 8750.70702159277
18030613    31753098    297826177024    1.76106591606176 9379.43683554909
17889602    31718012    299550142464    1.77298589426417 9444.16511551859
18475332    33264944    266053271552    1.80050588536109 7998.0074985847
18618154    31914219    254801883136    1.71414518324427 7983.96110323113
16437108    29421873    275350355968    1.78996651965784 9358.69568766067
17164338    28605353    249404649472    1.66655731202683 8718.81040838755
17895480    29658102    309047177216    1.65729569701399 10420.3288941416
19546560    34588509    301368737792    1.76954456436324 8712.97279081905
18525784    34806856    314875801600    1.87883309014075 9046.37297893266
18550989    35236438    273069948928    1.89943716747393 7749.64679823767
19085807    34605572    255512043520    1.81315738967705 7383.55209155335
17203820    31205542    277097357312    1.81387284916954 8879.74826112618
18003801    33723670    269696761856    1.87314167713807 7997.25420916525
18655425    33227176    306511810560    1.78109992133655 9224.7325069094
26380965    45627920    33528040    1.72957736762093 7348.15680925188
24923956    44721109    328790982656    1.79430219664968 7352.03106559813
25312482    43035393    287792226304    1.70016488308021 6687.33817079351
25841471    46276699    288168476672    1.79079197929561 6227.07502693742
25618384    43785917    321591488512    1.70915999229303 7344.63294469772
26006097    45056206    298747666432    1.73252472295247 6630.55532088077
26684805    45196730    351100243968    1.69372532420604 7768.26650883814
24025872    42450135    353265467392    1.76685095966548 8321.89267223768
24080466    45510525    371726323712    1.88993539410741 8167.91991988666
23195936    45095051    326473826304    1.94409274969546 7239.68193990955
23653302    43312705    307549573120    1.83114835298683 7100.67803707942
21589455    40034670    322982109184    1.85436223378497 8067.56017182107
22469039    42042723    314323701760    1.87114023879704 7476.29266924504
23647633    43486098    370003841024    1.83891969230071 8508.55464254346
23750561    37387139    320471453696    1.57415814304344 8571.70305799542
23142315    38640274    329341046784    1.66968058294946 8523.2585768

[ceph-users] Re: BlueStore fragmentation woes

2023-05-29 Thread Igor Fedotov
So fragmentation score calculation was improved recently indeed, 
see https://github.com/ceph/ceph/pull/49885



And yeah, one can see some fragmentation in allocations for the first two 
OSDs. It doesn't look as dramatic as the fragmentation scores suggest, though.



Additionally you might want to collect a free extents dump using the 'ceph 
tell osd.N bluestore allocator dump block' command and do more 
analysis on that data.


E.g. I'd recommend building something like a histogram showing the number of 
chunks per size range:


[1-4K]: N1 chunks

(4K-16K]: N2 chunks

(16K-64K]: N3 chunks

...

[16M-inf) : Nn chunks


This should be even more informative about the fragmentation state - 
particularly if observed as it evolves over time.


Looking for volunteers to write a script for building such a histogram... ;)
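
In the meantime, a quick-and-dirty sketch along those lines (it assumes the 
allocator dump is JSON with an 'extents' array whose 'length' values convert 
via printf - adjust the jq filter to whatever the actual output looks like; 
osd.0 and the intermediate buckets are just examples):

ceph tell osd.0 bluestore allocator dump block \
  | jq -r '.extents[].length' \
  | while read -r len; do printf '%d\n' "$len"; done \
  | awk '{
      # bucket each free chunk by its size in bytes
      if      ($1 <= 4096)     b = "[1-4K]";
      else if ($1 <= 16384)    b = "(4K-16K]";
      else if ($1 <= 65536)    b = "(16K-64K]";
      else if ($1 <= 1048576)  b = "(64K-1M]";
      else if ($1 <= 16777216) b = "(1M-16M]";
      else                     b = "[16M-inf)";
      hist[b]++
    } END { for (k in hist) print k, hist[k] }'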


Thanks,

Igor


On 28/05/2023 08:31, Hector Martin wrote:

So chiming in, I think something is definitely wrong with at *least* the
frag score.

Here's what happened so far:

1. I had 8 OSDs (all 8T HDDs)
2. I added 2 more (osd.0,1) , with Quincy defaults
3. I marked 2 old ones out (the ones that seemed to be struggling the
most with IOPS)
4. I added 2 more (osd.2,3), but this time I had previously set
bluestore_min_alloc_size_hdd to 16K as an experiment

This has all happened in the space of a ~week. That means there was data
movement into the first 2 new OSDs, then before that completed I added 2
new OSDs. So I would expect some data thashing on the first 2, but
nothing extreme.

The fragmentation scores for the 4 new OSDs are, respectively:

0.746, 0.835, 0.160, 0.067

That seems ridiculous for the first two, it's only been a week. The
newest two seem in better shape, though those mostly would've seen only
data moving in, not out. The rebalance isn't done yet, but it's almost
done and all 4 OSDs have a similar fullness level at this time.

Looking at alloc stats:

ceph-0)  allocation stats probe 6: cnt: 2219302 frags: 2328003 size:
1238454677504
ceph-0)  probe -1: 1848577,  1970325, 1022324588544
ceph-0)  probe -2: 848301,  862622, 505329963008
ceph-0)  probe -6: 2187448,  2187448, 1055241568256
ceph-0)  probe -14: 0,  0, 0
ceph-0)  probe -22: 0,  0, 0

ceph-1)  allocation stats probe 6: cnt: 1882396 frags: 1947321 size:
1054829641728
ceph-1)  probe -1: 2212293,  2345923, 1215418728448
ceph-1)  probe -2: 1471623,  1525498, 826984652800
ceph-1)  probe -6: 2095298,  2095298, 165933312
ceph-1)  probe -14: 0,  0, 0
ceph-1)  probe -22: 0,  0, 0

ceph-2)  allocation stats probe 3: cnt: 2760200 frags: 2760200 size:
1554513903616
ceph-2)  probe -1: 2584046,  2584046, 1498140393472
ceph-2)  probe -3: 1696921,  1696921, 869424496640
ceph-2)  probe -7: 0,  0, 0
ceph-2)  probe -11: 0,  0, 0
ceph-2)  probe -19: 0,  0, 0

ceph-3)  allocation stats probe 3: cnt: 2544818 frags: 2544818 size:
1432225021952
ceph-3)  probe -1: 2688015,  2688015, 1515260739584
ceph-3)  probe -3: 1086875,  1086875, 622025424896
ceph-3)  probe -7: 0,  0, 0
ceph-3)  probe -11: 0,  0, 0
ceph-3)  probe -19: 0,  0, 0

So OSDs 2 and 3 (the latest ones to be added, note that these 4 new OSDs
are 0-3 since those IDs were free) are in good shape, but 0 and 1 are
already suffering from at least some fragmentation of objects, which is
a bit worrying when they are only ~70% full right now and only a week old.

I did delete a couple million small objects during the rebalance to try
to reduce load (I had some nasty directories), but that was cumulatively
only about 60GB of data. So while that could explain a high frag score
if there are now a million little holes in the free space map of the
OSDs (how is it calculated?), it should not actually cause new data
moving in to end up fragmented since there should be plenty of
unfragmented free space going around still.

I am now restarting OSDs 0 and 1 to see whether that makes the frag
score go down over time. I will do further analysis later with the raw
bluestore free space map, since I still have a bunch of rebalancing and
moving data around planned (I'm moving my cluster to new machines).

On 26/05/2023 00.29, Igor Fedotov wrote:

Hi Hector,

I can advise two tools for further fragmentation analysis:

1) One might want to use ceph-bluestore-tool's free-dump command to get
a list of free chunks for an OSD and try to analyze whether it's really
highly fragmented and lacks long enough extents. free-dump just returns
a list of extents in json format, I can take a look to the output if
shared...

2) You might want to look for allocation probs in OSD logs and see how
fragmentation in allocated chunks has evolved.

E.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508>
probe -1: 35168547,  46401246, 1199516209152
probe -3: 27275094,  35681802, 200121712640
probe -5: 34847167,  52539758, 271272230912
probe -9: 44291522,  60025613, 523997483008
probe -17: 10646313,  10646313, 155178434560

The first probe refers to the last day while others match days (or
rather probes) -1, -3, -5, -9, -17

'

[ceph-users] Re: BlueStore fragmentation woes

2023-05-26 Thread Igor Fedotov

yeah, definitely this makes sense

On 26/05/2023 09:39, Konstantin Shalygin wrote:

Hi Igor,

Should we backport this to the Pacific, Quincy and Reef releases?


Thanks,
k
Sent from my iPhone


On 25 May 2023, at 23:13, Igor Fedotov  wrote:

You might be facing the issue fixed by https://github.com/ceph/ceph/pull/49885

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueStore fragmentation woes

2023-05-25 Thread Igor Fedotov

Yeah this looks fine. Please collect all of them for a given OSD.

Then restart the OSD, wait for more probes to come (1-2 days) and collect them too.


A side note - in the attached probe I can't see any fragmentation at all 
- amount of allocations is equal to amount of fragments, e.g.


cnt: 27637 frags: 27637


And the average requested chunk is 63777406976 / 27637 = ~2308 bytes. 
I.e. in average one needed less than a single alloc unit. Which would 
tell us nothing about the fragmentation...


Thanks,
Igor



On 25/05/2023 19:36, Fox, Kevin M wrote:

Ok, I'm gathering the "allocation stats probe" stuff. Not sure I follow what 
you mean by the historic probes. just:
| egrep "allocation stats probe|probe"   ?

That gets something like:
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  allocation stats probe 110: cnt: 27637 
frags: 27637 size: 63777406976
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -1: 24503,  24503, 58141900800
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -2: 24594,  24594, 56951898112
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -6: 19737,  19737, 37299027968
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -14: 20373,  20373, 35302801408
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -30: 19072,  19072, 33645854720

if that is the right query, then I'll gather the metrics, restart and gather 
some more after and let you know.

Thanks,
Kevin

________
From: Igor Fedotov 
Sent: Thursday, May 25, 2023 9:29 AM
To: Fox, Kevin M; Hector Martin; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: BlueStore fragmentation woes

Just run through the available logs for a specific OSD (one which you suspect
suffers from high fragmentation) and collect all allocation stats probes
you can find (the "allocation stats probe" string is a perfect grep pattern;
please append the lines with historic probes following the day-0 line as well.
Given this is printed once per day there wouldn't be too many).

Then do OSD restart and wait a couple more days. Would allocation stats
show much better disparity between cnt and frags columns?

Can a similar pattern (eventual degradation in stats prior to restart
and severe improvement afterwards) be observed for other OSDs?


On 25/05/2023 19:20, Fox, Kevin M wrote:

If you can give me instructions on what you want me to gather before the 
restart and after restart I can do it. I have some running away right now.

Thanks,
Kevin

________
From: Igor Fedotov 
Sent: Thursday, May 25, 2023 9:17 AM
To: Fox, Kevin M; Hector Martin; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: BlueStore fragmentation woes

Perhaps...

I don't like the idea to use fragmentation score as a real index. IMO
it's mostly like a very imprecise first turn marker to alert that
something might be wrong. But not a real quantitative high-quality estimate.

So in fact I'd like to see a series of allocation probes showing
eventual degradation without OSD restart and immediate severe
improvement after the restart.

Can you try to collect something like that? Would the same behavior
persist with an alternative allocator?


Thanks,

Igor


On 25/05/2023 18:41, Fox, Kevin M wrote:

Is this related to https://tracker.ceph.com/issues/58022 ?

We still see run away osds at times, somewhat randomly, that causes runaway 
fragmentation issues.

Thanks,
Kevin

________
From: Igor Fedotov 
Sent: Thursday, May 25, 2023 8:29 AM
To: Hector Martin; ceph-users@ceph.io
Subject: [ceph-users] Re: BlueStore fragmentation woes

Check twice before you click! This email originated from outside PNNL.


Hi Hector,

I can advise two tools for further fragmentation analysis:

1) One might want to use ceph-bluestore-tool's free-dump command to get
a list of free chunks for an OSD and try to analyze whether it's really
highly fragmented and lacks long enough extents. free-dump just returns
a list of extents in json format, I can take a look to the output if
shared...

2) You might want to look for allocation probs in OSD logs and see how
fragmentation in allocated chunks has evolved.

E.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508&

[ceph-users] Re: BlueStore fragmentation woes

2023-05-25 Thread Igor Fedotov
Just run through the available logs for a specific OSD (one which you suspect 
suffers from high fragmentation) and collect all allocation stats probes 
you can find (the "allocation stats probe" string is a perfect grep pattern; 
please append the lines with historic probes following the day-0 line as well. 
Given this is printed once per day there wouldn't be too many).
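
E.g. something along these lines (the log path and the cephadm-style unit name 
are examples - adjust the osd id and fsid to your deployment):

grep -h "allocation stats probe\|probe -" /var/log/ceph/ceph-osd.183.log*
# or, for containerized/cephadm OSDs:
journalctl -u ceph-<fsid>@osd.183 | grep "allocation stats probe\|probe -"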


Then do OSD restart and wait a couple more days. Would allocation stats 
show much better disparity between cnt and frags columns?


Can a similar pattern (eventual degradation in stats prior to restart 
and severe improvement afterwards) be observed for other OSDs?



On 25/05/2023 19:20, Fox, Kevin M wrote:

If you can give me instructions on what you want me to gather before the 
restart and after restart I can do it. I have some running away right now.

Thanks,
Kevin

____
From: Igor Fedotov 
Sent: Thursday, May 25, 2023 9:17 AM
To: Fox, Kevin M; Hector Martin; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: BlueStore fragmentation woes

Perhaps...

I don't like the idea to use fragmentation score as a real index. IMO
it's mostly like a very imprecise first turn marker to alert that
something might be wrong. But not a real quantitative high-quality estimate.

So in fact I'd like to see a series of allocation probes showing
eventual degradation without OSD restart and immediate severe
improvement after the restart.

Can you try to collect something like that? Would the same behavior
persist with an alternative allocator?


Thanks,

Igor


On 25/05/2023 18:41, Fox, Kevin M wrote:

Is this related to https://tracker.ceph.com/issues/58022 ?

We still see run away osds at times, somewhat randomly, that causes runaway 
fragmentation issues.

Thanks,
Kevin

____
From: Igor Fedotov 
Sent: Thursday, May 25, 2023 8:29 AM
To: Hector Martin; ceph-users@ceph.io
Subject: [ceph-users] Re: BlueStore fragmentation woes

Check twice before you click! This email originated from outside PNNL.


Hi Hector,

I can advise two tools for further fragmentation analysis:

1) One might want to use ceph-bluestore-tool's free-dump command to get
a list of free chunks for an OSD and try to analyze whether it's really
highly fragmented and lacks long enough extents. free-dump just returns
a list of extents in json format, I can take a look to the output if
shared...

2) You might want to look for allocation probs in OSD logs and see how
fragmentation in allocated chunks has evolved.

E.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508>
probe -1: 35168547,  46401246, 1199516209152
probe -3: 27275094,  35681802, 200121712640
probe -5: 34847167,  52539758, 271272230912
probe -9: 44291522,  60025613, 523997483008
probe -17: 10646313,  10646313, 155178434560

The first probe refers to the last day while others match days (or
rather probes) -1, -3, -5, -9, -17

'cnt' column represents the amount of allocations performed in the
previous 24 hours and 'frags' one shows amount of fragments in the
resulted allocations. So significant mismatch between frags and cnt
might indicate some issues with high fragmentation indeed.

Apart from retrospective analysis you might also want how OSD behavior
changes after reboot - e.g. wouldn't rebooted OSD produce less
fragmentation... Which in turn might indicate some issues with BlueStore
allocator..

Just FYI: allocation probe printing interval is controlled by
bluestore_alloc_stats_dump_interval parameter.


Thanks,

Igor



On 24/05/2023 17:18, Hector Martin wrote:

On 24/05/2023 22.07, Mark Nelson wrote:

Yep, bluestore fragmentation is an issue.  It's sort of a natural result
of using copy-on-write and never implementing any kind of
defragmentation scheme.  Adam and I have been talking about doing it
now, probably piggybacking on scrub or other operations that already
area reading all of the extents for an object anyway.


I wrote a very simply prototype for clone to speed up the rbd mirror use
case here:

https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf


Adam ended up going the extra mile and completely changed how shared
blobs works which probably eliminates the need to do defrag on clone
anymore from an rbd-mirror perspective, but I think we still need to
identify any times we are doing full object reads of fragmented objects
and consider defragmenting at that time.  It might be clone, or scrub,
or other things, but the point is that if we are already doing most of
the work (seeks on HDD especially!) the extra cost of a large write to
clean it up isn't that bad, especially if we are doing it over the
course of months or years and can help keep freespace less fragmented.

Note that my particular issue seemed to specifically be free space
fragmentation. I don't use RBD mirror and I would not *expect* most of
my cephfs use cases to lead to any weird cow/fragmentation issues with
objects other than those force

[ceph-users] Re: BlueStore fragmentation woes

2023-05-25 Thread Igor Fedotov

Perhaps...

I don't like the idea to use fragmentation score as a real index. IMO 
it's mostly like a very imprecise first turn marker to alert that 
something might be wrong. But not a real quantitative high-quality estimate.


So in fact I'd like to see a series of allocation probes showing 
eventual degradation without OSD restart and immediate severe 
improvement after the restart.


Can you try to collect something like that? Would the same behavior 
persist with an alternative allocator?



Thanks,

Igor


On 25/05/2023 18:41, Fox, Kevin M wrote:

Is this related to https://tracker.ceph.com/issues/58022 ?

We still see run away osds at times, somewhat randomly, that causes runaway 
fragmentation issues.

Thanks,
Kevin


From: Igor Fedotov 
Sent: Thursday, May 25, 2023 8:29 AM
To: Hector Martin; ceph-users@ceph.io
Subject: [ceph-users] Re: BlueStore fragmentation woes

Check twice before you click! This email originated from outside PNNL.


Hi Hector,

I can advise two tools for further fragmentation analysis:

1) One might want to use ceph-bluestore-tool's free-dump command to get
a list of free chunks for an OSD and try to analyze whether it's really
highly fragmented and lacks long enough extents. free-dump just returns
a list of extents in json format, I can take a look to the output if
shared...

2) You might want to look for allocation probs in OSD logs and see how
fragmentation in allocated chunks has evolved.

E.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508>
probe -1: 35168547,  46401246, 1199516209152
probe -3: 27275094,  35681802, 200121712640
probe -5: 34847167,  52539758, 271272230912
probe -9: 44291522,  60025613, 523997483008
probe -17: 10646313,  10646313, 155178434560

The first probe refers to the last day while others match days (or
rather probes) -1, -3, -5, -9, -17

'cnt' column represents the amount of allocations performed in the
previous 24 hours and 'frags' one shows amount of fragments in the
resulted allocations. So significant mismatch between frags and cnt
might indicate some issues with high fragmentation indeed.

Apart from retrospective analysis you might also want how OSD behavior
changes after reboot - e.g. wouldn't rebooted OSD produce less
fragmentation... Which in turn might indicate some issues with BlueStore
allocator..

Just FYI: allocation probe printing interval is controlled by
bluestore_alloc_stats_dump_interval parameter.


Thanks,

Igor



On 24/05/2023 17:18, Hector Martin wrote:

On 24/05/2023 22.07, Mark Nelson wrote:

Yep, bluestore fragmentation is an issue.  It's sort of a natural result
of using copy-on-write and never implementing any kind of
defragmentation scheme.  Adam and I have been talking about doing it
now, probably piggybacking on scrub or other operations that already
area reading all of the extents for an object anyway.


I wrote a very simply prototype for clone to speed up the rbd mirror use
case here:

https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf


Adam ended up going the extra mile and completely changed how shared
blobs works which probably eliminates the need to do defrag on clone
anymore from an rbd-mirror perspective, but I think we still need to
identify any times we are doing full object reads of fragmented objects
and consider defragmenting at that time.  It might be clone, or scrub,
or other things, but the point is that if we are already doing most of
the work (seeks on HDD especially!) the extra cost of a large write to
clean it up isn't that bad, especially if we are doing it over the
course of months or years and can help keep freespace less fragmented.

Note that my particular issue seemed to specifically be free space
fragmentation. I don't use RBD mirror and I would not *expect* most of
my cephfs use cases to lead to any weird cow/fragmentation issues with
objects other than those forced by the free space becoming fragmented
(unless there is some weird pathological use case I'm hitting). Most of
my write workloads are just copying files in bulk and incrementally
writing out files.

Would simply defragging objects during scrub/etc help with free space
fragmentation itself? Those seem like two somewhat unrelated issues...
note that if free space is already fragmented, you wouldn't even have a
place to put down a defragmented object.

Are there any stats I can look at to figure out how bad object and free
space fragmentation is? It would be nice to have some clearer data
beyond my hunch/deduction after seeing the I/O patterns and the sole
fragmentation number :). Also would be interesting to get some kind of
trace of the bluestore ops the OSD is doing, so I can find out whether
it's doing something pathological that causes more fragmentation for
some reason.


Mark


On 5/24/23 07:17, Hector Martin wrote:

Hi,

I've been seeing relatively large fragmentation numbers on all my OSDs:

ceph daemon osd.13 bluest

[ceph-users] Re: BlueStore fragmentation woes

2023-05-25 Thread Igor Fedotov

Hi Hector,

I can advise two tools for further fragmentation analysis:

1) One might want to use ceph-bluestore-tool's free-dump command to get 
a list of free chunks for an OSD and try to analyze whether it's really 
highly fragmented and lacks long enough extents. free-dump just returns 
a list of extents in json format, I can take a look to the output if 
shared...


2) You might want to look for allocation probs in OSD logs and see how 
fragmentation in allocated chunks has evolved.


E.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508>
probe -1: 35168547,  46401246, 1199516209152
probe -3: 27275094,  35681802, 200121712640
probe -5: 34847167,  52539758, 271272230912
probe -9: 44291522,  60025613, 523997483008
probe -17: 10646313,  10646313, 155178434560

The first probe refers to the last day while others match days (or 
rather probes) -1, -3, -5, -9, -17


'cnt' column represents the amount of allocations performed in the 
previous 24 hours and 'frags' one shows amount of fragments in the 
resulted allocations. So significant mismatch between frags and cnt 
might indicate some issues with high fragmentation indeed.


Apart from retrospective analysis you might also want how OSD behavior 
changes after reboot - e.g. wouldn't rebooted OSD produce less 
fragmentation... Which in turn might indicate some issues with BlueStore 
allocator..


Just FYI: allocation probe printing interval is controlled by 
bluestore_alloc_stats_dump_interval parameter.
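
For 1) the invocation would be roughly as follows (the path and osd id are 
examples, flags per the ceph-bluestore-tool help; the OSD has to be stopped 
while ceph-bluestore-tool runs against it):

systemctl stop ceph-osd@13
# dump the free extent list (JSON) - redirect the output to a file if you want to share it
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-13 --allocator block free-dump
systemctl start ceph-osd@13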



Thanks,

Igor



On 24/05/2023 17:18, Hector Martin wrote:

On 24/05/2023 22.07, Mark Nelson wrote:

Yep, bluestore fragmentation is an issue.  It's sort of a natural result
of using copy-on-write and never implementing any kind of
defragmentation scheme.  Adam and I have been talking about doing it
now, probably piggybacking on scrub or other operations that already
area reading all of the extents for an object anyway.


I wrote a very simply prototype for clone to speed up the rbd mirror use
case here:

https://github.com/markhpc/ceph/commit/29fc1bfd4c90dd618eb9e0d4ae6474d8cfa5dfdf


Adam ended up going the extra mile and completely changed how shared
blobs works which probably eliminates the need to do defrag on clone
anymore from an rbd-mirror perspective, but I think we still need to
identify any times we are doing full object reads of fragmented objects
and consider defragmenting at that time.  It might be clone, or scrub,
or other things, but the point is that if we are already doing most of
the work (seeks on HDD especially!) the extra cost of a large write to
clean it up isn't that bad, especially if we are doing it over the
course of months or years and can help keep freespace less fragmented.

Note that my particular issue seemed to specifically be free space
fragmentation. I don't use RBD mirror and I would not *expect* most of
my cephfs use cases to lead to any weird cow/fragmentation issues with
objects other than those forced by the free space becoming fragmented
(unless there is some weird pathological use case I'm hitting). Most of
my write workloads are just copying files in bulk and incrementally
writing out files.

Would simply defragging objects during scrub/etc help with free space
fragmentation itself? Those seem like two somewhat unrelated issues...
note that if free space is already fragmented, you wouldn't even have a
place to put down a defragmented object.

Are there any stats I can look at to figure out how bad object and free
space fragmentation is? It would be nice to have some clearer data
beyond my hunch/deduction after seeing the I/O patterns and the sole
fragmentation number :). Also would be interesting to get some kind of
trace of the bluestore ops the OSD is doing, so I can find out whether
it's doing something pathological that causes more fragmentation for
some reason.


Mark


On 5/24/23 07:17, Hector Martin wrote:

Hi,

I've been seeing relatively large fragmentation numbers on all my OSDs:

ceph daemon osd.13 bluestore allocator score block
{
  "fragmentation_rating": 0.77251526920454427
}

These aren't that old, as I recreated them all around July last year.
They mostly hold CephFS data with erasure coding, with a mix of large
and small files. The OSDs are at around 80%-85% utilization right now.
Most of the data was written sequentially when the OSDs were created (I
rsynced everything from a remote backup). Since then more data has been
added, but not particularly quickly.

At some point I noticed pathologically slow writes, and I couldn't
figure out what was wrong. Eventually I did some block tracing and
noticed the I/Os were very small, even though CephFS-side I was just
writing one large file sequentially, and that's when I stumbled upon the
free space fragmentation problem. Indeed, deleting some large files
opened up some larger free extents and resolved the problem, but only
until those get filled up and I'm back to fragmented tiny extents. So
effectively I'm stuck at the 

[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

2023-05-10 Thread Igor Fedotov

Hey Zakhar,

You do need to restart OSDs to bring performance back to normal anyway, 
don't you? So yeah, we're not aware of a better way so far - all the 
information I have is from you and Nikola. And you both tell us about 
the need for a restart.


Apparently there is no need to restart every OSD, only the "degraded/slow" 
ones. We actually need to verify that. So please identify the 
slowest OSDs (in terms of subop_w_lat) and restart those first. 
Hopefully just a fraction of your OSDs will require this.
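
A simple way to rank the OSDs on a host would be something like this (default 
admin socket paths assumed; the jq path assumes subop_w_latency lives in the 
'osd' section of perf dump):

# print "avgtime socket" pairs, slowest OSDs first
for s in /var/run/ceph/ceph-osd.*.asok; do
  printf '%s %s\n' "$(ceph daemon "$s" perf dump | jq '.osd.subop_w_latency.avgtime')" "$s"
done | sort -rn | head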



Thanks,
Igor

On 5/10/2023 6:01 AM, Zakhar Kirpichenko wrote:
Thank you, Igor. I will try to see how to collect the perf values. Not 
sure about restarting all OSDs as it's a production cluster, is there 
a less invasive way?


/Z

On Tue, 9 May 2023 at 23:58, Igor Fedotov  wrote:

Hi Zakhar,

Let's leave questions regarding cache usage/tuning to a different
topic for now. And concentrate on performance drop.

Could you please do the same experiment I asked from Nikola once
your cluster reaches "bad performance" state (Nikola, could you
please use this improved scenario as well?):

- collect perf counters for every OSD

- reset perf counters for every OSD

-  leave the cluster running for 10 mins and collect perf counters
again.

- Then restart OSDs one-by-one starting with the worst OSD (in
terms of subop_w_lat from the prev step). Wouldn't it be sufficient
to restart just a few OSDs before the cluster is back to normal?

- if partial OSD restart is sufficient - please leave the
remaining OSDs run as-is without reboot.

- after the restart (no matter whether partial or complete - the key
thing is that it should be successful) reset all the perf counters and
let the cluster run for 30 mins and collect perf counters again.

- wait 24 hours and collect the counters one more time

- share all four counters snapshots.


Thanks,

Igor

On 5/8/2023 11:31 PM, Zakhar Kirpichenko wrote:

Don't mean to hijack the thread, but I may be observing something
similar with 16.2.12: OSD performance noticeably peaks after OSD
restart and then gradually reduces over 10-14 days, while commit
and apply latencies increase across the board.

Non-default settings are:

        "bluestore_cache_size_hdd": {
            "default": "1073741824",
            "mon": "4294967296",
            "final": "4294967296"
        },
        "bluestore_cache_size_ssd": {
            "default": "3221225472",
            "mon": "4294967296",
            "final": "4294967296"
        },
...
        "osd_memory_cache_min": {
            "default": "134217728",
            "mon": "2147483648",
            "final": "2147483648"
        },
        "osd_memory_target": {
            "default": "4294967296",
            "mon": "17179869184",
            "final": "17179869184"
        },
        "osd_scrub_sleep": {
            "default": 0,
            "mon": 0.10001,
            "final": 0.10001
        },
        "rbd_balance_parent_reads": {
            "default": false,
            "mon": true,
            "final": true
        },

All other settings are default, the usage is rather simple
Openstack / RBD.

I also noticed that OSD cache usage doesn't increase over time
(see my message "Ceph 16.2.12, bluestore cache doesn't seem to be
used much" dated 26 April 2023, which received no comments),
despite OSDs are being used rather heavily and there's plenty of
host and OSD cache / target memory available. It may be worth
checking if available memory is being used in a good way.

/Z

On Mon, 8 May 2023 at 22:35, Igor Fedotov 
wrote:

Hey Nikola,

On 5/8/2023 10:13 PM, Nikola Ciprich wrote:
> OK, starting collecting those for all OSDs..
> I have hour samples of all OSDs perf dumps loaded in DB, so
I can easily examine,
> sort, whatever..
>
You didn't reset the counters every hour, did you? So having
average
subop_w_latency growing that way means the current values
were much
higher than before.

Curious if subop latencies were growing for every OSD or just
a subset
(may be even just a single one) of them?


Next time you reach the bad state please do the following if
possible:

- reset perf counters for every OSD

- 

[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

2023-05-09 Thread Igor Fedotov

Hi Zakhar,

Let's leave questions regarding cache usage/tuning to a different topic 
for now. And concentrate on performance drop.


Could you please do the same experiment I asked from Nikola once your 
cluster reaches "bad performance" state (Nikola, could you please use 
this improved scenario as well?):


- collect perf counters for every OSD

- reset perf counters for every OSD

-  leave the cluster running for 10 mins and collect perf counters again.

- Then restart OSDs one-by-one starting with the worst OSD (in terms of 
subop_w_lat from the prev step). Wouldn't it be sufficient to restart just a 
few OSDs before the cluster is back to normal?


- if partial OSD restart is sufficient - please leave the remaining OSDs 
run as-is without reboot.


- after the restart (no matter whether partial or complete - the key thing 
is that it should be successful) reset all the perf counters and let the 
cluster run for 30 mins and collect perf counters again.


- wait 24 hours and collect the counters one more time

- share all four counters snapshots.
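
For the collect/reset steps a per-host loop like this might save some typing 
(default admin socket paths assumed; 'perf reset all' clears every counter):

ts=$(date +%Y%m%d-%H%M)
for s in /var/run/ceph/ceph-osd.*.asok; do
  osd=$(basename "$s" .asok)
  # snapshot first, then reset, so the next snapshot covers a clean interval
  ceph daemon "$s" perf dump > "${osd}-${ts}.json"
  ceph daemon "$s" perf reset all
done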


Thanks,

Igor

On 5/8/2023 11:31 PM, Zakhar Kirpichenko wrote:
Don't mean to hijack the thread, but I may be observing something 
similar with 16.2.12: OSD performance noticeably peaks after OSD 
restart and then gradually reduces over 10-14 days, while commit and 
apply latencies increase across the board.


Non-default settings are:

        "bluestore_cache_size_hdd": {
            "default": "1073741824",
            "mon": "4294967296",
            "final": "4294967296"
        },
        "bluestore_cache_size_ssd": {
            "default": "3221225472",
            "mon": "4294967296",
            "final": "4294967296"
        },
...
        "osd_memory_cache_min": {
            "default": "134217728",
            "mon": "2147483648",
            "final": "2147483648"
        },
        "osd_memory_target": {
            "default": "4294967296",
            "mon": "17179869184",
            "final": "17179869184"
        },
        "osd_scrub_sleep": {
            "default": 0,
            "mon": 0.10001,
            "final": 0.10001
        },
        "rbd_balance_parent_reads": {
            "default": false,
            "mon": true,
            "final": true
        },

All other settings are default, the usage is rather simple Openstack / 
RBD.


I also noticed that OSD cache usage doesn't increase over time (see my 
message "Ceph 16.2.12, bluestore cache doesn't seem to be used much" 
dated 26 April 2023, which received no comments), despite OSDs are 
being used rather heavily and there's plenty of host and OSD cache / 
target memory available. It may be worth checking if available memory 
is being used in a good way.


/Z

On Mon, 8 May 2023 at 22:35, Igor Fedotov  wrote:

Hey Nikola,

On 5/8/2023 10:13 PM, Nikola Ciprich wrote:
> OK, starting collecting those for all OSDs..
> I have hour samples of all OSDs perf dumps loaded in DB, so I
can easily examine,
> sort, whatever..
>
You didn't reset the counters every hour, did you? So having average
subop_w_latency growing that way means the current values were much
higher than before.

Curious if subop latencies were growing for every OSD or just a
subset
(may be even just a single one) of them?


Next time you reach the bad state please do the following if possible:

- reset perf counters for every OSD

-  leave the cluster running for 10 mins and collect perf counters
again.

- Then start restarting OSD one-by-one starting with the worst OSD
(in
terms of subop_w_lat from the prev step). Wouldn't it be sufficient to
restart just a few OSDs before the cluster is back to normal?

>> currently values for avgtime are around 0.0003 for subop_w_lat
and 0.001-0.002
>> for op_w_lat
> OK, so there is no visible trend on op_w_lat, still between
0.001 and 0.002
>
> subop_w_lat seems to have increased since yesterday though! I
see values from
> 0.0004 to as high as 0.001
>
> If some other perf data might be interesting, please let me know..
>
> During OSD restarts, I noticed strange thing - restarts on first
6 machines
> went smooth, but then on another 3, I saw rocksdb logs recovery
on all SSD
> OSDs. but first didn't see any mention of daemon crash in ceph -s
>
> later, crash info appeared, but only about 3 daemons (in total,
at least 20
> of them crashed though)
>
> crash report was similar 

[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

2023-05-08 Thread Igor Fedotov

Hey Nikola,

On 5/8/2023 10:13 PM, Nikola Ciprich wrote:

OK, starting collecting those for all OSDs..
I have hour samples of all OSDs perf dumps loaded in DB, so I can easily 
examine,
sort, whatever..

You didn't reset the counters every hour, did you? So having the average 
subop_w_latency growing that way means the current values were much 
higher than before.


Curious if subop latencies were growing for every OSD or just a subset 
(may be even just a single one) of them?



Next time you reach the bad state please do the following if possible:

- reset perf counters for every OSD

-  leave the cluster running for 10 mins and collect perf counters again.

- Then start restarting OSD one-by-one starting with the worst OSD (in 
terms of subop_w_lat from the prev step). Wouldn't it be sufficient to 
restart just a few OSDs before the cluster is back to normal?



currently values for avgtime are around 0.0003 for subop_w_lat and 0.001-0.002
for op_w_lat

OK, so there is no visible trend on op_w_lat, still between 0.001 and 0.002

subop_w_lat seems to have increased since yesterday though! I see values from
0.0004 to as high as 0.001

If some other perf data might be interesting, please let me know..

During OSD restarts, I noticed strange thing - restarts on first 6 machines
went smooth, but then on another 3, I saw rocksdb logs recovery on all SSD
OSDs, but at first didn't see any mention of a daemon crash in ceph -s

later, crash info appeared, but only about 3 daemons (in total, at least 20
of them crashed though)

crash report was similar for all three OSDs:

[root@nrbphav4a ~]# ceph crash info 
2023-05-08T17:45:47.056675Z_a5759fe9-60c6-423a-88fc-57663f692bd3
{
 "backtrace": [
 "/lib64/libc.so.6(+0x54d90) [0x7f64a6323d90]",
 "(BlueStore::_txc_create(BlueStore::Collection*, BlueStore::OpSequencer*, 
std::__cxx11::list >*, 
boost::intrusive_ptr)+0x413) [0x55a1c9d07c43]",
 "(BlueStore::queue_transactions(boost::intrusive_ptr&, 
std::vector >&, 
boost::intrusive_ptr, ThreadPool::TPHandle*)+0x22b) [0x55a1c9d27e9b]",
 "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr >&&, eversion_t const&, eversion_t const&, std::vector >&&, std::optional&, Context*, unsigned long, osd_reqid_t, 
boost::intrusive_ptr)+0x8ad) [0x55a1c9bbcfdd]",
 "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, 
PrimaryLogPG::OpContext*)+0x38f) [0x55a1c99d1cbf]",
 "(PrimaryLogPG::simple_opc_submit(std::unique_ptr >)+0x57) [0x55a1c99d6777]",
 "(PrimaryLogPG::handle_watch_timeout(std::shared_ptr)+0xb73) 
[0x55a1c99da883]",
 "/usr/bin/ceph-osd(+0x58794e) [0x55a1c992994e]",
 "(CommonSafeTimer::timer_thread()+0x11a) [0x55a1c9e226aa]",
 "/usr/bin/ceph-osd(+0xa80eb1) [0x55a1c9e22eb1]",
 "/lib64/libc.so.6(+0x9f802) [0x7f64a636e802]",
 "/lib64/libc.so.6(+0x3f450) [0x7f64a630e450]"
 ],
 "ceph_version": "17.2.6",
 "crash_id": 
"2023-05-08T17:45:47.056675Z_a5759fe9-60c6-423a-88fc-57663f692bd3",
 "entity_name": "osd.98",
 "os_id": "almalinux",
 "os_name": "AlmaLinux",
 "os_version": "9.0 (Emerald Puma)",
 "os_version_id": "9.0",
 "process_name": "ceph-osd",
 "stack_sig": 
"b1a1c5bd45e23382497312202e16cfd7a62df018c6ebf9ded0f3b3ca3c1dfa66",
 "timestamp": "2023-05-08T17:45:47.056675Z",
 "utsname_hostname": "nrbphav4h",
 "utsname_machine": "x86_64",
 "utsname_release": "5.15.90lb9.01",
 "utsname_sysname": "Linux",
 "utsname_version": "#1 SMP Fri Jan 27 15:52:13 CET 2023"
}


I was trying to figure out why these particular 3 nodes could behave differently
and found out from colleagues, that those 3 nodes were added to cluster lately
with direct install of 17.2.5 (others were installed 15.2.16 and later upgraded)

not sure whether this is related to our problem though..

I see very similar crash reported here:https://tracker.ceph.com/issues/56346
so I'm not reporting..

Do you think this might somehow be the cause of the problem? Anything else I 
should
check in perf dumps or elsewhere?


Hmm... don't know yet. Could you please share the last 20K lines prior to the crash 
from e.g. two sample OSDs?


And the crash isn't permanent, OSDs are able to start after the 
second(?) shot, aren't they?



with best regards

nik







--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amt

[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

2023-05-03 Thread Igor Fedotov



On 5/2/2023 9:02 PM, Nikola Ciprich wrote:


however, probably worth noting: historically we're using the following OSD options:
ceph config set osd bluestore_rocksdb_options 
compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
ceph config set osd bluestore_cache_autotune 0
ceph config set osd bluestore_cache_size_ssd 2G
ceph config set osd bluestore_cache_kv_ratio 0.2
ceph config set osd bluestore_cache_meta_ratio 0.8
ceph config set osd osd_min_pg_log_entries 10
ceph config set osd osd_max_pg_log_entries 10
ceph config set osd osd_pg_log_dups_tracked 10
ceph config set osd osd_pg_log_trim_min 10

so maybe I'll start resetting those to defaults (ie enabling cache autotune etc)
as a first step..


Generally I wouldn't recommend using non-default settings unless there 
are explicit rationales. So yeah better to revert to defaults whenever 
possible.


I doubt this is a root cause for your issue though..







Thanks,

Igor

On 5/2/2023 11:32 AM, Nikola Ciprich wrote:

Hello dear CEPH users and developers,

we're dealing with strange problems.. we're having 12 node alma linux 9 cluster,
initially installed CEPH 15.2.16, then upgraded to 17.2.5. It's running bunch
of KVM virtual machines accessing volumes using RBD.

everything is working well, but there is a strange and, for us, quite serious issue
   - speed of write operations (both sequential and random) is constantly degrading
     drastically to almost unusable numbers (in ~1 week it drops from ~70k 4k writes/s
     from 1 VM to ~7k writes/s)

When I restart all OSD daemons, numbers immediately return to normal..

volumes are stored on replicated pool of 4 replicas, on top of 7*12 = 84
INTEL SSDPE2KX080T8 NVMEs.

I've updated cluster to 17.2.6 some time ago, but the problem persists. This is
especially annoying in connection with https://tracker.ceph.com/issues/56896
as restarting OSDs is quite painful when half of them crash..

I don't see anything suspicious, node load is quite low, no log errors,
network latency and throughput are OK too

Anyone having a similar issue?

I'd like to ask for hints on what should I check further..

we're running lots of 14.2.x and 15.2.x clusters, none showing similar
issue, so I'm suspecting this is something related to quincy

thanks a lot in advance

with best regards

nikola ciprich




--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

2023-05-02 Thread Igor Fedotov



On 5/2/2023 11:32 AM, Nikola Ciprich wrote:

I've updated cluster to 17.2.6 some time ago, but the problem persists. This is
especially annoying in connection with https://tracker.ceph.com/issues/56896
as restarting OSDs is quite painful when half of them crash..
with best regards

Feel free to set osd_fast_shutdown_timeout to zero to work around the 
above. IMO this assertion is nonsense and I don't see any usage of 
this timeout parameter other than just throwing an assertion.
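
For illustration, a one-liner sketch of that (assuming the central config 
database is in use; a ceph.conf entry works as well):

```
ceph config set osd osd_fast_shutdown_timeout 0
```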



--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

2023-05-02 Thread Igor Fedotov

Hi Nikola,

I'd suggest starting to monitor perf counters for your OSDs, 
op_w_lat/subop_w_lat ones specifically. I presume they rise over time, 
don't they?


Does subop_w_lat grow for every OSD or just a subset of them? How large 
is the delta between the best and the worst OSDs after a one week 
period? How many "bad" OSDs are at this point?
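
For illustration, a quick way to sample those two counters for a single OSD 
(a sketch; the jq paths assume the standard "osd" perf-counter section and the 
OSD id is illustrative):

```
ceph tell osd.0 perf dump | jq '.osd.op_w_lat.avgtime, .osd.subop_w_lat.avgtime'
```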



And some more questions:

How large are space utilization/fragmentation for your OSDs?

Is the same performance drop observed for artificial benchmarks, e.g. 4k 
random writes to a fresh RBD image using fio?


Is there any RAM utilization growth for OSD processes over time? Or may 
be any suspicious growth in mempool stats?



As a blind and brute-force approach you might also want to compact 
RocksDB through ceph-kvstore-tool and switch the bluestore allocator to 
bitmap (presuming the default hybrid one is in effect right now). Please do 
one modification at a time to realize which action is actually helpful, if 
any.
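
For illustration, a rough sketch of both actions (the OSD id/path are 
illustrative; the OSD must be stopped for the offline compaction, and on 
containerized deployments the tool has to be run inside the OSD container):

```
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-98 compact
# switch the allocator; takes effect once the OSD is restarted
ceph config set osd bluestore_allocator bitmap
```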



Thanks,

Igor

On 5/2/2023 11:32 AM, Nikola Ciprich wrote:

Hello dear CEPH users and developers,

we're dealing with strange problems.. we're having 12 node alma linux 9 cluster,
initially installed CEPH 15.2.16, then upgraded to 17.2.5. It's running bunch
of KVM virtual machines accessing volumes using RBD.

everything is working well, but there is a strange and, for us, quite serious issue
  - speed of write operations (both sequential and random) is constantly degrading
    drastically to almost unusable numbers (in ~1 week it drops from ~70k 4k writes/s
    from 1 VM to ~7k writes/s)

When I restart all OSD daemons, numbers immediately return to normal..

volumes are stored on replicated pool of 4 replicas, on top of 7*12 = 84
INTEL SSDPE2KX080T8 NVMEs.

I've updated cluster to 17.2.6 some time ago, but the problem persists. This is
especially annoying in connection with https://tracker.ceph.com/issues/56896
as restarting OSDs is quite painful when half of them crash..

I don't see anything suspicious, node load is quite low, no log errors,
network latency and throughput are OK too

Anyone having a similar issue?

I'd like to ask for hints on what should I check further..

we're running lots of 14.2.x and 15.2.x clusters, none showing similar
issue, so I'm suspecting this is something related to quincy

thanks a lot in advance

with best regards

nikola ciprich




--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-27 Thread Igor Fedotov

Hi Zakhar,

you might want to try offline DB compaction using ceph-kvstore-tool for 
this specific OSD.


Periodically we observe OSD perf drop due to degraded RocksDB 
performance, particularly after bulk data removal/migration.. Compaction 
is quite helpful in this case.



Thanks,

Igor



On 26/04/2023 20:22, Zakhar Kirpichenko wrote:

Hi,

I have a Ceph 16.2.12 cluster with uniform hardware, same drive make/model,
etc. A particular OSD is showing higher latency than usual in `ceph osd
perf`, usually mid to high tens of milliseconds while other OSDs show low
single digits, although its drive's I/O stats don't look different from
those of other drives. The workload is mainly random 4K reads and writes,
the cluster is being used as Openstack VM storage.

Is there a way to trace, which particular PG, pool and disk image or object
cause this OSD's excessive latency? Is there a way to tell Ceph to

I would appreciate any advice or pointers.

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mysteriously dead OSD process

2023-04-15 Thread Igor Fedotov

Hi J-P Methot,

perhaps my response is a bit late, but this to some degree reminds me of an 
issue we were facing yesterday.


First of all you might want to set debug-osd to 20 for this specific OSD 
and see if log would be more helpful. Please share if possible.


Secondly I'm curious if the last reported PG (2.99s3) is always the same 
before the crash? If so you might want to remove it from the OSD using 
ceph-objectstore-tool's export-remove command - in our case this helped 
to bring the OSD up. The exported PG can be loaded into another OSD or (if that's 
the only problematic OSD) just thrown away and fixed by scrubbing...
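
For illustration, a rough sketch of those steps (OSD id, PG id and paths are 
illustrative; the OSD must be stopped for the objectstore-tool part, and on 
cephadm deployments the tool would be run via `cephadm shell --name osd.N`):

```
# raise the log level for just this OSD
ceph config set osd.87 debug_osd 20
# export and remove the suspicious PG from the stopped OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-87 \
  --pgid 2.99s3 --op export-remove --file /root/pg2.99s3.export
# if needed, the export can later be imported into another (stopped) OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 \
  --op import --file /root/pg2.99s3.export
```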



Thanks,

Igor

On 05/04/2023 23:36, J-P Methot wrote:

Hi,


We currently use Ceph Pacific 16.2.10 deployed with Cephadm on this 
storage cluster. Last night, one of our OSD died. However, since its 
storage is a SSD, we ran hardware checks and found no issue with the 
SSD itself. However, if we try starting the service again, the 
container just crashes 1 second after booting up. If I look at the 
logs, there's no error. You can see the OSD starting up normally and 
then the last line before the crash is :


debug 2023-04-05T18:32:57.433+ 7f8078e0c700  1 osd.87 pg_epoch: 
207175 pg[2.99s3( v 207174'218628609 
(207134'218623666,207174'218628609] local-lis/les=207140/207141 
n=38969 ec=41966/315 lis/c=207140/207049 les/c/f=207141/207050/0 
sis=207175 pruub=11.464111328s) 
[5,228,217,NONE,17,25,167,114,158,178,159]/[5,228,217,87,17,25,167,114,158,178,159]p5(0) 
r=3 lpr=207175 pi=[207049,207175)/1 crt=207174'218628605 mlcod 0'0 
remapped NOTIFY pruub 12054.601562500s@ mbc={}] state: 
transitioning to Stray


I don't really see how this line could cause the OSD to crash. Systemd 
just writes :


Stopping Ceph osd.83 for (uuid)

What could cause this OSD to boot up and then suddenly die? Outside 
the ceph daemon logs and the systemd logs, is there another way I 
could gain more information?



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-04 Thread Igor Fedotov

Originally you mentioned 14TB HDDs not 15TB. Could this be a trick?

If not - please share "ceph osd df tree" output?


On 4/4/2023 2:18 PM, Work Ceph wrote:
Thank you guys for your replies. The "used space" there is exactly 
that. It is the accounting for Rocks.DB and WAL.

```
RAW USED: The sum of USED space and the space allocated the db and wal 
BlueStore partitions.
```

There is one detail I do not understand. We are off-loading WAL and 
RocksDB to an NVME device; however, Ceph still seems to think that we 
use our data plane disks to store those elements. We have about 375TB 
(5 * 5 * 15) in HDD disks, and Ceph seems to be discounting from the 
usable space the volume (space) dedicated to WAL and Rocks.DB, which 
are applied into different disks; therefore, it shows as usable space 
364 TB (after removing the space dedicated to WAL and Rocks.DB, which 
are in another device). Is that a bug of some sort?



On Tue, Apr 4, 2023 at 6:31 AM Igor Fedotov  wrote:

Please also note that total cluster size reported below as SIZE
apparently includes DB volumes:

# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    373 TiB  364 TiB  9.3 TiB   9.3 TiB       2.50

On 4/4/2023 12:22 PM, Igor Fedotov wrote:
> Do you have standalone DB volumes for your OSD?
>
> If so then highly likely RAW usage is that high due to DB volumes
> space is considered as in-use one already.
>
> Could you please share "ceph osd df tree" output to prove that?
>
>
> Thanks,
>
> Igor
>
> On 4/4/2023 4:25 AM, Work Ceph wrote:
>> Hello guys!
>>
>>
>> We noticed an unexpected situation. In a recently deployed Ceph
>> cluster we
>> are seeing a raw usage, that is a bit odd. We have the
following setup:
>>
>>
>> We have a new cluster with 5 nodes with the following setup:
>>
>>     - 128 GB of RAM
>>     - 2 cpus Intel(R) Intel Xeon Silver 4210R
>>     - 1 NVME of 2 TB for the rocks DB caching
>>     - 5 HDDs of 14TB
>>     - 1 NIC dual port of 25GiB in BOND mode.
>>
>>
>> Right after deploying the Ceph cluster, we see a raw usage of
about
>> 9TiB.
>> However, no load has been applied onto the cluster. Have you guys
>> seen such
>> a situation? Or, can you guys help understand it?
>>
>>
>> We are using Ceph Octopus, and we have set the following
configurations:
>>
>> ```
>>
>> ceph_conf_overrides:
>>
>>    global:
>>
>>  osd pool default size: 3
>>
>>  osd pool default min size: 1
>>
>>  osd pool default pg autoscale mode: "warn"
>>
>>  perf: true
>>
>>  rocksdb perf: true
>>
>>    mon:
>>
>>  mon osd down out interval: 120
>>
>>    osd:
    >>
>>  bluestore min alloc size hdd: 65536
>>
>>
>> ```
>>
>>
>> Any tip or help on how to explain this situation is welcome!
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
-- 
Igor Fedotov

Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at
https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us athttps://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web:https://croit.io  | YouTube:https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-04 Thread Igor Fedotov
Please also note that total cluster size reported below as SIZE 
apparently includes DB volumes:


# ceph df
--- RAW STORAGE ---
CLASS  SIZE AVAILUSED RAW USED  %RAW USED
hdd373 TiB  364 TiB  9.3 TiB   9.3 TiB   2.50

On 4/4/2023 12:22 PM, Igor Fedotov wrote:

Do you have standalone DB volumes for your OSD?

If so then highly likely RAW usage is that high due to DB volumes 
space is considered as in-use one already.


Could you please share "ceph osd df tree" output to prove that?


Thanks,

Igor

On 4/4/2023 4:25 AM, Work Ceph wrote:

Hello guys!


We noticed an unexpected situation. In a recently deployed Ceph 
cluster we

are seeing a raw usage, that is a bit odd. We have the following setup:


We have a new cluster with 5 nodes with the following setup:

    - 128 GB of RAM
    - 2 cpus Intel(R) Intel Xeon Silver 4210R
    - 1 NVME of 2 TB for the rocks DB caching
    - 5 HDDs of 14TB
    - 1 NIC dual port of 25GiB in BOND mode.


Right after deploying the Ceph cluster, we see a raw usage of about 
9TiB.
However, no load has been applied onto the cluster. Have you guys 
seen such

a situation? Or, can you guys help understand it?


We are using Ceph Octopus, and we have set the following configurations:

```

ceph_conf_overrides:

   global:

 osd pool default size: 3

 osd pool default min size: 1

 osd pool default pg autoscale mode: "warn"

 perf: true

 rocksdb perf: true

   mon:

 mon osd down out interval: 120

   osd:

 bluestore min alloc size hdd: 65536


```


Any tip or help on how to explain this situation is welcome!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recently deployed cluster showing 9Tb of raw usage without any load deployed

2023-04-04 Thread Igor Fedotov

Do you have standalone DB volumes for your OSD?

If so then highly likely RAW usage is that high due to DB volumes space 
is considered as in-use one already.


Could you please share "ceph osd df tree" output to prove that?


Thanks,

Igor

On 4/4/2023 4:25 AM, Work Ceph wrote:

Hello guys!


We noticed an unexpected situation. In a recently deployed Ceph cluster we
are seeing a raw usage, that is a bit odd. We have the following setup:


We have a new cluster with 5 nodes with the following setup:

- 128 GB of RAM
- 2 cpus Intel(R) Intel Xeon Silver 4210R
- 1 NVME of 2 TB for the rocks DB caching
- 5 HDDs of 14TB
- 1 NIC dual port of 25GiB in BOND mode.


Right after deploying the Ceph cluster, we see a raw usage of about 9TiB.
However, no load has been applied onto the cluster. Have you guys seen such
a situation? Or, can you guys help understand it?


We are using Ceph Octopus, and we have set the following configurations:

```

ceph_conf_overrides:

   global:

 osd pool default size: 3

 osd pool default min size: 1

 osd pool default pg autoscale mode: "warn"

 perf: true

 rocksdb perf: true

   mon:

 mon osd down out interval: 120

   osd:

 bluestore min alloc size hdd: 65536


```


Any tip or help on how to explain this situation is welcome!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Igor Fedotov


On 3/27/2023 12:19 PM, Boris Behrens wrote:

Nonetheless the IOPS the bench command generates are still VERY low
compared to the nautilus cluster (~150 vs ~250). But this is something I
would pin to this bug: https://tracker.ceph.com/issues/58530


I've just run "ceph tell bench" against main, octopus and nautilus 
branches (fresh osd deployed with vstart.sh) - I don't see any 
difference between releases - a SATA drive shows around 110 IOPS in my case..


So I suspect some difference between clusters in your case. E.g. are you 
sure disk caching is off for both?
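
For illustration, a quick way to check that on a SATA/SAS drive (a sketch; the 
device name is illustrative, and NVMe drives need vendor tooling instead):

```
hdparm -W /dev/sdX      # show whether the volatile write cache is enabled
hdparm -W 0 /dev/sdX    # disable it
```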



@Igor do you want me to update the ticket with my findings and the logs
from pastebin?
Feel free to update if you like, but IMO we still lack an understanding of 
what the trigger for the perf improvements in your case was - OSD 
redeployment, disk trimming or both?

--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web <https://croit.io/> | LinkedIn <http://linkedin.com/company/croit> | 
Youtube <https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> | 
Twitter <https://twitter.com/croit_io>


Meet us at the SC22 Conference! Learn more <https://croit.io/croit-sc22>
Technology Fast50 Award Winner by Deloitte 
<https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>!


<https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Igor Fedotov

Hi Boris,

I wouldn't recommend to take absolute "osd bench" numbers too seriously. 
It's definitely not a full-scale quality benchmark tool.


The idea was just to make brief OSDs comparison from c1 and c2.

And for your reference -  IOPS numbers I'm getting in my lab with 
data/DB colocated:


1) OSD on top of Intel S4600 (SATA SSD) - ~110 IOPS

2) OSD on top of Samsung DCT 983 (M.2 NVMe) - 310 IOPS

3) OSD on top of Intel 905p (Optane NVMe) - 546 IOPS.
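
(For reference, a sketch of how such a per-OSD measurement can be run - the OSD 
ids and byte counts are illustrative, and IIRC small-block runs are additionally 
capped by osd_bench_small_size_max_iops:)

```
ceph tell osd.40 bench                  # default: 1 GiB written in 4 MiB blocks
ceph tell osd.162 bench 12288000 4096   # small-block variant: 4 KiB writes
```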


Could you please provide a bit more info on the H/W and OSD setup?

What are the disk models? NVMe or SATA? Are DB and main disk shared?


Thanks,

Igor

On 3/23/2023 12:45 AM, Boris Behrens wrote:

Hey Igor,

sadly we do not have the data from the time where c1 was on nautilus.
The RocksDB warning persisted the recreation.

Here are the measurements.
I've picked the same SSD models from the clusters to have some comparability.
For the 8TB disks it's even the same chassis configuration
(CPU/Memory/Board/Network)

The IOPS seem VERY low to me. Or are these normal values for SSDs? After
recreation the IOPS are a lot better on the pacific cluster.

I also blkdiscarded the SSDs before recreating them.

Nautilus Cluster
osd.22  = 8TB
osd.343 = 2TB
https://pastebin.com/EfSSLmYS

Pacific Cluster before recreating OSDs
osd.40  = 8TB
osd.162 = 2TB
https://pastebin.com/wKMmSW9T

Pacific Cluster after recreation OSDs
osd.40  = 8TB
osd.162 = 2TB
https://pastebin.com/80eMwwBW

Am Mi., 22. März 2023 um 11:09 Uhr schrieb Igor Fedotov <
igor.fedo...@croit.io>:


Hi Boris,

first of all I'm not sure if it's valid to compare two different clusters
(pacific vs . nautilus, C1 vs. C2 respectively). The perf numbers
difference might be caused by a bunch of other factors: different H/W, user
load, network etc... I can see that you got ~2x latency increase after
Octopus to Pacific upgrade at C1 but Octopus numbers had been much above
Nautilus at C2 before the upgrade. Did you observe even lower numbers at C1
when it was running Nautilus if any?


You might want to try "ceph tell osd.N bench" to compare OSDs performance
for both C1 and C2. Would it be that different?


Then redeploy a single OSD at C1, wait till rebalance completion and
benchmark it again. What would be the new numbers? Please also collect perf
counters from the to-be-redeployed OSD beforehand.

W.r.t. rocksdb warning - I presume this might be caused by newer RocksDB
version running on top of DB with a legacy format.. Perhaps redeployment
would fix that...


Thanks,

Igor
On 3/21/2023 5:31 PM, Boris Behrens wrote:

Hi Igor,
i've offline compacted all the OSDs and reenabled the bluefs_buffered_io

It didn't change anything and the commit and apply latencies are around
5-10 times higher than on our nautlus cluster. The pacific cluster got a 5
minute mean over all OSDs 2.2ms, while the nautilus cluster is around 0.2 -
0.7 ms.

I also see these kind of logs. Google didn't really help:
2023-03-21T14:08:22.089+ 7efe7b911700  3 rocksdb:
[le/block_based/filter_policy.cc:579] Using legacy Bloom filter with high
(20) bits/key. Dramatic filter space and/or accuracy improvement is
available with format_version>=5.




Am Di., 21. März 2023 um 10:46 Uhr schrieb Igor Fedotov:


Hi Boris,

additionally you might want to manually compact RocksDB for every OSD.


Thanks,

Igor
On 3/21/2023 12:22 PM, Boris Behrens wrote:

Disabling the write cache and the bluefs_buffered_io did not change
anything.
What we see is that larger disks seem to be the leader in terms of
slowness (we have 70% 2TB, 20% 4TB and 10% 8TB SSDs in the cluster), but
removing some of the 8TB disks and replacing them with 2TB disks (because they are
by far the majority and we have a lot of them) did also not change
anything.

Are there any other ideas I could try? Customers start to complain about the
slower performance and our k8s team mentions problems with ETCD because the
latency is too high.

Would it be an option to recreate every OSD?

Cheers
  Boris

Am Di., 28. Feb. 2023 um 22:46 Uhr schrieb Boris Behrens
  :


Hi Josh,
thanks a lot for the breakdown and the links.
I disabled the write cache but it didn't change anything. Tomorrow I will
try to disable bluefs_buffered_io.

It doesn't sound that I can mitigate the problem with more SSDs.


Am Di., 28. Feb. 2023 um 15:42 Uhr schrieb Josh Baergen  
:


Hi Boris,

OK, what I'm wondering is whether https://tracker.ceph.com/issues/58530 is 
involved. There are two
aspects to that ticket:
* A measurable increase in the number of bytes written to disk in
Pacific as compared to Nautilus
* The same, but for IOPS

Per the current theory, both are due to the loss of rocksdb log
recycling when using default recovery options in rocksdb 6.8; Octopus
uses version 6.1.2, Pacific uses 6.8.1.

16.2.11 largely addressed the bytes-written amplification, but the
IOPS amplification remains. In practice, whether this results in a
write performance degradation depends on the speed of the unde

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-22 Thread Igor Fedotov

Hi Boris,

first of all I'm not sure if it's valid to compare two different 
clusters (pacific vs . nautilus, C1 vs. C2 respectively). The perf 
numbers difference might be caused by a bunch of other factors: 
different H/W, user load, network etc... I can see that you got ~2x 
latency increase after Octopus to Pacific upgrade at C1 but Octopus 
numbers had been much above Nautilus at C2 before the upgrade. Did you 
observe even lower numbers at C1 when it was running Nautilus if any?



You might want to try "ceph tell osd.N bench" to compare OSDs 
performance for both C1 and C2. Would it be that different?



Then redeploy a single OSD at C1, wait till rebalance completion and  
benchmark it again. What would be the new numbers? Please also collect 
perf counters from the to-be-redeployed OSD beforehand.


W.r.t. rocksdb warning - I presume this might be caused by newer RocksDB 
version running on top of DB with a legacy format.. Perhaps redeployment 
would fix that...



Thanks,

Igor

On 3/21/2023 5:31 PM, Boris Behrens wrote:

Hi Igor,
i've offline compacted all the OSDs and reenabled the bluefs_buffered_io

It didn't change anything and the commit and apply latencies are around
5-10 times higher than on our nautilus cluster. The pacific cluster got a 5
minute mean over all OSDs 2.2ms, while the nautilus cluster is around 0.2 -
0.7 ms.

I also see these kind of logs. Google didn't really help:
2023-03-21T14:08:22.089+ 7efe7b911700  3 rocksdb:
[le/block_based/filter_policy.cc:579] Using legacy Bloom filter with high
(20) bits/key. Dramatic filter space and/or accuracy improvement is
available with format_version>=5.




Am Di., 21. März 2023 um 10:46 Uhr schrieb Igor Fedotov <
igor.fedo...@croit.io>:


Hi Boris,

additionally you might want to manually compact RocksDB for every OSD.


Thanks,

Igor
On 3/21/2023 12:22 PM, Boris Behrens wrote:

Disabling the write cache and the bluefs_buffered_io did not change
anything.
What we see is that larger disks seem to be the leader in terms of
slowness (we have 70% 2TB, 20% 4TB and 10% 8TB SSDs in the cluster), but
removing some of the 8TB disks and replacing them with 2TB disks (because they are
by far the majority and we have a lot of them) did also not change
anything.

Are there any other ideas I could try? Customers start to complain about the
slower performance and our k8s team mentions problems with ETCD because the
latency is too high.

Would it be an option to recreate every OSD?

Cheers
  Boris

Am Di., 28. Feb. 2023 um 22:46 Uhr schrieb Boris Behrens  
:


Hi Josh,
thanks a lot for the breakdown and the links.
I disabled the write cache but it didn't change anything. Tomorrow I will
try to disable bluefs_buffered_io.

It doesn't sound that I can mitigate the problem with more SSDs.


Am Di., 28. Feb. 2023 um 15:42 Uhr schrieb Josh 
Baergen:


Hi Boris,

OK, what I'm wondering is whether https://tracker.ceph.com/issues/58530 is 
involved. There are two
aspects to that ticket:
* A measurable increase in the number of bytes written to disk in
Pacific as compared to Nautilus
* The same, but for IOPS

Per the current theory, both are due to the loss of rocksdb log
recycling when using default recovery options in rocksdb 6.8; Octopus
uses version 6.1.2, Pacific uses 6.8.1.

16.2.11 largely addressed the bytes-written amplification, but the
IOPS amplification remains. In practice, whether this results in a
write performance degradation depends on the speed of the underlying
media and the workload, and thus the things I mention in the next
paragraph may or may not be applicable to you.

There's no known workaround or solution for this at this time. In some
cases I've seen that disabling bluefs_buffered_io (which itself can
cause IOPS amplification in some cases) can help; I think most folks
do this by setting it in local conf and then restarting OSDs in order
to gain the config change. Something else to consider is
https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches
,
as sometimes disabling these write caches can improve the IOPS
performance of SSDs.

Josh

On Tue, Feb 28, 2023 at 7:19 AM Boris Behrens   
 wrote:

Hi Josh,
we upgraded 15.2.17 -> 16.2.11 and we only use rbd workload.



Am Di., 28. Feb. 2023 um 15:00 Uhr schrieb Josh Baergen <

jbaer...@digitalocean.com>:

Hi Boris,

Which version did you upgrade from and to, specifically? And what
workload are you running (RBD, etc.)?

Josh

On Tue, Feb 28, 2023 at 6:51 AM Boris Behrens   
 wrote:

Hi,
today I did the first update from octopus to pacific, and it looks

like the

avg apply latency went up from 1ms to 2ms.

All 36 OSDs are 4TB SSDs and nothing else changed.
Someone knows if this is an issue, or am I just missing a config

value?

Cheers
  Boris
___
ceph-users mailing list --ceph-users@ceph.io
To unsubscribe send an email toceph-users-le...@ceph.io

--
The self-help group "UTF-8 problems" is meeting in the large hall this time, as an exception.

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-21 Thread Igor Fedotov

Hi Boris,

additionally you might want to manually compact RocksDB for every OSD.


Thanks,

Igor

On 3/21/2023 12:22 PM, Boris Behrens wrote:

Disabling the write cache and the bluefs_buffered_io did not change
anything.
What we see is that larger disks seem to be the leader in terms of
slowness (we have 70% 2TB, 20% 4TB and 10% 8TB SSDs in the cluster), but
removing some of the 8TB disks and replacing them with 2TB disks (because they are
by far the majority and we have a lot of them) did also not change
anything.

Are there any other ideas I could try? Customers start to complain about the
slower performance and our k8s team mentions problems with ETCD because the
latency is too high.

Would it be an option to recreate every OSD?

Cheers
  Boris

Am Di., 28. Feb. 2023 um 22:46 Uhr schrieb Boris Behrens:


Hi Josh,
thanks a lot for the breakdown and the links.
I disabled the write cache but it didn't change anything. Tomorrow I will
try to disable bluefs_buffered_io.

It doesn't sound that I can mitigate the problem with more SSDs.


Am Di., 28. Feb. 2023 um 15:42 Uhr schrieb Josh Baergen <
jbaer...@digitalocean.com>:


Hi Boris,

OK, what I'm wondering is whether
https://tracker.ceph.com/issues/58530  is involved. There are two
aspects to that ticket:
* A measurable increase in the number of bytes written to disk in
Pacific as compared to Nautilus
* The same, but for IOPS

Per the current theory, both are due to the loss of rocksdb log
recycling when using default recovery options in rocksdb 6.8; Octopus
uses version 6.1.2, Pacific uses 6.8.1.

16.2.11 largely addressed the bytes-written amplification, but the
IOPS amplification remains. In practice, whether this results in a
write performance degradation depends on the speed of the underlying
media and the workload, and thus the things I mention in the next
paragraph may or may not be applicable to you.

There's no known workaround or solution for this at this time. In some
cases I've seen that disabling bluefs_buffered_io (which itself can
cause IOPS amplification in some cases) can help; I think most folks
do this by setting it in local conf and then restarting OSDs in order
to gain the config change. Something else to consider is

https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches
,
as sometimes disabling these write caches can improve the IOPS
performance of SSDs.
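
For illustration, a sketch of that config change (assuming the central config 
database; a ceph.conf entry on the OSD hosts works as well, and an OSD restart 
is needed either way):

```
ceph config set osd bluefs_buffered_io false
# or: put "bluefs_buffered_io = false" under [osd] in ceph.conf, then restart the OSDs
```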

Josh

On Tue, Feb 28, 2023 at 7:19 AM Boris Behrens  wrote:

Hi Josh,
we upgraded 15.2.17 -> 16.2.11 and we only use rbd workload.



Am Di., 28. Feb. 2023 um 15:00 Uhr schrieb Josh Baergen <

jbaer...@digitalocean.com>:

Hi Boris,

Which version did you upgrade from and to, specifically? And what
workload are you running (RBD, etc.)?

Josh

On Tue, Feb 28, 2023 at 6:51 AM Boris Behrens  wrote:

Hi,
today I did the first update from octopus to pacific, and it looks

like the

avg apply latency went up from 1ms to 2ms.

All 36 OSDs are 4TB SSDs and nothing else changed.
Someone knows if this is an issue, or am I just missing a config

value?

Cheers
  Boris
___
ceph-users mailing list --ceph-users@ceph.io
To unsubscribe send an email toceph-users-le...@ceph.io



--
The self-help group "UTF-8 problems" is meeting in the large hall this time, as an exception.



--
The self-help group "UTF-8 problems" is meeting in the large hall this time, as an exception.




--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web <https://croit.io/> | LinkedIn <http://linkedin.com/company/croit> | 
Youtube <https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> | 
Twitter <https://twitter.com/croit_io>


Meet us at the SC22 Conference! Learn more <https://croit.io/croit-sc22>
Technology Fast50 Award Winner by Deloitte 
<https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>!


<https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueFS spillover warning gone after upgrade to Quincy

2023-01-16 Thread Igor Fedotov

Hi Benoit and Peter,

looks like your findings are valid and spillover alert is broken for 
now.  I've just created https://tracker.ceph.com/issues/58440 to track this.



Thanks,

Igor

On 1/13/2023 9:54 AM, Benoît Knecht wrote:

Hi Peter,

On Thursday, January 12th, 2023 at 15:12, Peter van Heusden  
wrote:

I have a Ceph installation where some of the OSDs were misconfigured to use
1GB SSD partitions for rocksdb. This caused a spillover ("BlueFS spillover
detected"). I recently upgraded to quincy using cephadm (17.2.5) the
spillover warning vanished. This is
despite bluestore_warn_on_bluefs_spillover still being set to true.

I noticed this on Pacific as well, and I think it's due to this commit:
https://github.com/ceph/ceph/commit/d17cd6604b4031ca997deddc5440248aff451269.
It removes the logic that would normally update the spillover health check, so
it never triggers anymore.

As others mentioned, you can get the relevant metrics from Prometheus and setup
alerts there instead. But it does make me wonder how many people might have
spillover in their clusters and not even realize it, since there's no warning by
default.
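
For illustration, one way to check for spillover manually in the meantime (a 
sketch; it assumes the "bluefs" perf-counter section is present - any non-zero 
slow_used_bytes would indicate spillover onto the slow device):

```
for osd in $(ceph osd ls); do
  echo -n "osd.$osd bluefs slow_used_bytes: "
  ceph tell "osd.$osd" perf dump | jq '.bluefs.slow_used_bytes'
done
```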

Cheers,


--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web <https://croit.io/> | LinkedIn <http://linkedin.com/company/croit> | 
Youtube <https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> | 
Twitter <https://twitter.com/croit_io>


Meet us at the SC22 Conference! Learn more <https://croit.io/croit-sc22>
Technology Fast50 Award Winner by Deloitte 
<https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>!


<https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crash on Onode::put

2023-01-12 Thread Igor Fedotov

Hi Frank,

IMO all the below logic is a bit of overkill and no one can provide 100% 
valid guidance on specific numbers atm. Generally I agree with 
Dongdong's point that a crash is effectively an OSD restart and hence there is not 
much sense in performing such a restart manually - well, the rationale 
might be to do that gracefully and avoid some potential issues though...


Anyway I'd rather recommend to do periodic(!) manual OSD restart e.g. on 
a daily basis at off-peak hours instead of using tricks with mempool 
stats analysis..
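
For illustration, a rough sketch of such a rolling restart on a non-containerized 
systemd deployment (unit names and the health check are illustrative; cephadm 
setups would rather loop over `ceph orch daemon restart osd.N`):

```
for unit in $(systemctl list-units 'ceph-osd@*' --no-legend | awk '{print $1}'); do
  systemctl restart "$unit"
  # wait until the cluster settles before touching the next OSD
  until ceph health | grep -q HEALTH_OK; do sleep 30; done
done
```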



Thanks,

Igor


On 1/10/2023 1:15 PM, Frank Schilder wrote:

Hi Dongdong and Igor,

thanks for pointing to this issue. I guess if it's a memory leak issue (well, 
cache pool trim issue), checking for some indicator and an OSD restart should 
be a work-around? Dongdong promised a work-around but talks only about a patch 
(fix).

Looking at the tracker items, my conclusion is that unusually low values of 
.mempool.by_pool.bluestore_cache_onode.items of an OSD might be such an 
indicator. I just run a very simple check on all our OSDs:

for o in $(ceph osd ls); do n_onode="$(ceph tell "osd.$o" dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items")"; echo -n "$o: "; ((n_onode<10)) && echo "$n_onode"; 
done; echo ""

and found 2 with seemingly very unusual values:

: 3098
1112: 7403

Comparing two OSDs with same disk on the same host gives:

# ceph daemon osd. dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
3200
1971200
260924
900303680

# ceph daemon osd.1030 dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
60281
37133096
8908591
255862680

OSD  does look somewhat bad. Shortly after restarting this OSD I get

# ceph daemon osd. dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
20775
12797400
803582
24017100

So, the above procedure seems to work and, yes, there seems to be a leak of 
items in cache_other that pushes other pools down to 0. There seem to be 2 
useful indicators:

- very low .mempool.by_pool.bluestore_cache_onode.items
- very high 
.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items

Here a command to get both numbers with OSD ID in an awk-friendly format:

for o in $(ceph osd ls); do printf "%6d %8d %7.2f\n" "$o" $(ceph tell "osd.$o" 
dump_mempools | jq 
".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items");
 done

Pipe it to a file and do things like:

awk '$2<5 || $3>200' FILE

For example, I still get:

# awk '$2<5 || $3>200' cache_onode.txt
   109249225   43.74
   109346193   43.70
   109847550   43.47
   110148873   43.34
   110248008   43.31
   110348152   43.29
   110549235   43.59
   110746694   43.35
   110948511   43.08
   111314612  739.46
   111413199  693.76
   111645300  205.70

flagging 3 more outliers.

Would it be possible to provide a bit of guidance to everyone about when to 
consider restarting an OSD? What values of the above variables are critical and 
what are tolerable? Of course a proper fix would be better, but I doubt that 
everyone is willing to apply a patch. Therefore, some guidance on how to 
mitigate this problem to acceptable levels might be useful. I'm thinking here 
how few onode items are acceptable before performance drops painfully.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov
Sent: 09 January 2023 13:34:42
To: Dongdong Tao;ceph-users@ceph.io
Cc:d...@ceph.io
Subject: [ceph-users] Re: OSD crash on Onode::put

Hi Dongdong,

thanks a lot for your post, it's really helpful.


Thanks,

Igor

On 1/5/2023 6:12 AM, Dongdong Tao wrote:

I see many users recently reporting that they have been struggling
with this Onode::put race condition issue[1] on both the latest
Octopus and pacific.
Igor opened a PR [2]  to address this issue, I've reviewed it
carefully, and looks good to me. I'm hoping this could get some
priority from the community.

For those who had been hitting this issue, I would like to share a
workaround that could unblock you:

During the investigation of this issue, I found this race condition
always happens after the bluestore onode cache size becomes 0.
Setting debug_bluestore = 1/30 will allow you to see the cache size
afte

[ceph-users] Re: OSD crash on Onode::put

2023-01-09 Thread Igor Fedotov

Hi Dongdong,

thanks a lot for your post, it's really helpful.


Thanks,

Igor

On 1/5/2023 6:12 AM, Dongdong Tao wrote:


I see many users recently reporting that they have been struggling 
with this Onode::put race condition issue[1] on both the latest 
Octopus and pacific.
Igor opened a PR [2]  to address this issue, I've reviewed it 
carefully, and looks good to me. I'm hoping this could get some 
priority from the community.


For those who had been hitting this issue, I would like to share a 
workaround that could unblock you:


During the investigation of this issue, I found this race condition 
always happens after the bluestore onode cache size becomes 0.
Setting debug_bluestore = 1/30 will allow you to see the cache size 
after the crash:

---
2022-10-25T00:47:26.562+ 7f424f78e700 30 
bluestore.MempoolThread(0x564a9dae2a68) _resize_shards 
max_shard_onodes: 0 max_shard_buffer: 8388608

---

This is apparently wrong as this means the bluestore metadata cache is 
basically disabled,
but it makes much sense to explain why we are hitting the race 
condition so easily -- An onode will be trimmed right away after it's 
unpinned.
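
(A sketch of applying that cluster-wide, assuming the central config database - 
it can equally be set in ceph.conf or via the admin socket:)

```
ceph config set osd debug_bluestore 1/30
```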


Keep going with the investigation, it turns out the culprit for the 
0-sized cache is the leak that happened in bluestore_cache_other mempool
Please refer to the bug tracker [3] which has the detail of the leak 
issue, it was already fixed by  [4], and the next Pacific point 
release will have it.

But it was never backported to Octopus.
So if you are hitting the same:
For those who are on Octopus, you can manually backport this patch to 
fix the leak and prevent the race condition from happening.
For those who are on Pacific, you can wait for the next Pacific point 
release.


By the way, I'm backporting the fix to ubuntu Octopus and Pacific 
through this SRU [5], so it will be landed in ubuntu's package soon.


[1] https://tracker.ceph.com/issues/56382
[2] https://github.com/ceph/ceph/pull/47702
[3] https://tracker.ceph.com/issues/56424
[4] https://github.com/ceph/ceph/pull/46911
[5] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1996010

Cheers,
Dongdong



--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web <https://croit.io/> | LinkedIn <http://linkedin.com/company/croit> | 
Youtube <https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> | 
Twitter <https://twitter.com/croit_io>


Meet us at the SC22 Conference! Learn more <https://croit.io/croit-sc22>
Technology Fast50 Award Winner by Deloitte 
<https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>!


<https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: LVM osds loose connection to disk

2022-11-17 Thread Igor Fedotov

Hi Frank,

interesting findings indeed.

Unfortunately I'm absolutely unfamiliar with this disk scheduler stuff 
in Linux. From my experience I've never faced any issues with that and 
never needed to tune anything at this level..


But - given that AFAIK you're the only one who faced the issue and Octopus 
is a pretty mature release - it still looks to me like a very uncommon 
interoperability issue specific to disk/Ceph/OS/whatever...


Anyway looking forward to learn whether switching the scheduler helps..


As for a benchmark tool - I can't suggest anything other than fio - but 
you might want to try a different engine - IIRC fio supports a bunch of 
payloads: RGW/CephFS/RBD/Rados, and perhaps using a more complicated stack 
(RGW or CephFS) would help.. Or alternatively just more randomness 
should be introduced: in access pattern/block sizes/etc...
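
For illustration, a minimal fio invocation using the RBD engine (a sketch; 
pool/image/client names are illustrative and the test image must exist 
beforehand):

```
fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=fio-test --rw=randwrite --bs=4k --iodepth=32 \
    --runtime=600 --time_based=1
```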



Thanks,

Igor

On 11/17/2022 3:23 PM, Frank Schilder wrote:

Hi Igor,

I might have a smoking gun. Could it be that ceph (the kernel??) has issues 
with certain disk schedulers? There was a recommendation on this list to use 
bfq with bluestore. This was actually the one change other than ceph version 
during upgrade: to make bfq default. Now, this might be a problem with certain 
drives that have a preferred scheduler different than bfq. Here my observation:

I managed to get one of the OSDs to hang today. It was not the usual abort, I 
don't know why the op_thread_timeout and suicide_timeout didn't trigger. The 
OSD's worker thread was unresponsive for a bit more than 10 minutes before I 
took action. Hence, nothing in the log (should maybe have used kill sigabort). 
Now, this time I tried to check if I can access the disk with dd. And, I could 
not. A

dd if=/dev/sdn of=disk-dump bs=4096 count=100

got stuck right away in D-state:

1652472 D+   dd if=/dev/sdn of=disk-dump bs=4096 count=100

This time, since I was curious about the disk scheduler, I went to another 
terminal on the same machine and did:

# cat /sys/block/sdn/queue/scheduler
mq-deadline kyber [bfq] none
# echo none >> /sys/block/sdn/queue/scheduler
# cat /sys/block/sdn/queue/scheduler
[none] mq-deadline kyber bfq

Going back to the stuck session, I see now (you can see my attempts to 
interrupt the dd):

# dd if=/dev/sdn of=disk-dump bs=4096 count=100
^C^C3+0 records in
2+0 records out
8192 bytes (8.2 kB) copied, 336.712 s, 0.0 kB/s

Suddenly, the disk responds again! Also, the ceph container stopped (a docker 
stop container returned without the container stopping - as before in this 
situation).

Could it be that recommendations for disk scheduler choice should be 
reconsidered, or is this pointing towards a bug in either how ceph or the 
kernel schedules disk IO? To confirm this hypothesis, I will retry the stress 
test with the scheduler set to the default kernel choice.

I did day-long fio benchmarks with all schedulers and all sorts of workloads on 
our drives and could not find anything like that. It looks like it is very 
difficult to impossible to reproduce a realistic ceph-osd IO pattern for 
testing. Is there any tool available for this?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 14 November 2022 13:03:58
To: Igor Fedotov;ceph-users@ceph.io
Subject: [ceph-users] Re: LVM osds loose connection to disk

I can't reproduce the problem with artificial workloads, I need to get one of 
these OSDs running in the meta-data pool until it crashes. My plan is to reduce 
time-outs and increase log level for these specific OSDs to capture what 
happened before an abort in the memory log. I can spare about 100G of RAM for 
log entries. I found the following relevant options with settings I think will 
work for my case:

osd_op_thread_suicide_timeout 30 # default 150
osd_op_thread_timeout 10 # default 15
debug_bluefs 1/20 # default 1/5
debug_bluestore 1/20 # default 1/5
bluestore_kv_sync_util_logging_s 3 # default 10
log_max_recent 10 # default 1
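
For reference, these could be applied to just the affected OSDs like this (a 
sketch; the OSD ids are illustrative and the values mirror the list above):

```
for osd in osd.959 osd.1112; do
  ceph config set "$osd" osd_op_thread_suicide_timeout 30
  ceph config set "$osd" osd_op_thread_timeout 10
  ceph config set "$osd" debug_bluefs 1/20
  ceph config set "$osd" debug_bluestore 1/20
  # ...and analogously for the remaining options in the list
done
```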

It would be great if someone could confirm that these settings will achieve 
what I want (or what is missing). I would like to capture at least 1 minute 
worth of log entries in RAM with high debug settings. Does anyone have a good 
estimate for how many log-entries are created per second with these settings 
for tuning log_max_recent?

Thanks for your help!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 11 November 2022 10:25:17
To: Igor Fedotov;ceph-users@ceph.io
Subject: [ceph-users] Re: LVM osds loose connection to disk

Hi Igor,

thanks for your reply. We only exchanged the mimic containers with the octopus 
ones. We didn't even reboot the servers during upgrade, only later for trouble 
shooting. The only change since the upgrade is the ceph container.

I'm trying to go down the stack and run a benchmark on the OSD directly. 
Unfortu

[ceph-users] Re: LVM osds loose connection to disk

2022-11-10 Thread Igor Fedotov

Hi Frank,

unfortunately IMO it's not an easy task to identify what the 
relevant differences between mimic and octopus are in this respect.. At least 
the question would be what minor Ceph releases are/were in use.


I recall there were some tricks with setting/clearing bluefs_buffered_io 
somewhere in that period. But I can hardly recall anything else... 
Curious if OS/Container or other third-party software was upgraded with 
Ceph upgrade as well? Just in case I presume you were using containers 
in mimic too, right?



Instead I'd rather approach the issue from a different side:

1) Learn how to [easily] reproduce the issue, preferably in a test lab 
rather than in the field. Better use exactly the same disk(s), H/W, OS 
and S/W versions as in the production.


2) (can be done in prod as well) - once an OSD stuck - set debug-bluefs 
and debug-bdev to 20 and collect verbose log - check what's happening 
there.  Meanwhile monitor disk activity - is there any load to the disk 
while in this state at all? Do disk reads (e.g. via dd) out of OSD 
container succeed at this point?



Thanks,

Igor

On 11/10/2022 5:23 PM, Frank Schilder wrote:

Hi all,

I have some kind of update on the matter of stuck OSDs. It seems not to be an 
LVM issue and it also seems not to be connected to the OSD size.

After moving all data from the tiny 100G OSDs to spare SSDs, I redeployed the 
400G disks with 1 OSD per disk and started to move data from the slow spare 
SSDs back to the fast ones. After moving about 100 out of 1024 PGs of the pool 
fast OSDs started failing again. It is kind of the same observation as before, 
I can't stop a container with a failed OSD. However, when I restart docker, 
everything comes up clean.

When I look at the logs, I see that the OSD aborts with suicide timeout after 
many osd_op_tp thread timeouts. However, the OSD process and with it the 
container does not terminate because of a hanging thread. The syslog has the 
message (all messages below in chronological order):

kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 122 seconds

about 30 seconds before the OSD aborts with suicide timeout with

ceph-osd: 2022-11-09T17:36:53.691+0100 7f663a23c700 -1 *** Caught signal 
(Aborted) **#012 in thread 7f663a23c700 thread_name:msgr-worker-2#012#012 ceph 
version 15.2.17 ...

What I see in the syslog is, that the thread bstore_kv_sync seems not to be 
terminated with the abort. These messages continue to show up:

kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 368 seconds.
kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 491 seconds.
kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 614 seconds.

On docker stop of the container, the launch script receives the TERM signal, but the 
OSD cannot be deactivated due to this thread:

journal: osd_lvm_start: deactivating OSD 959
journal: osd_lvm_start: unmounting /var/lib/ceph/osd/ceph-959
journal: umount: /var/lib/ceph/osd/ceph-959: target is busy.

It's probably busy because of the hanging bstore_kv_sync thread. As a consequence, 
the container is still running, still has a ceph-osd process shown with docker top, 
and these messages continue to show up:

INFO: task bstore_kv_sync:1283156 blocked for more than 737 seconds.
INFO: task bstore_kv_sync:1283156 blocked for more than 860 seconds.

Although the bstore_kv_sync thread is unkillable, a restart of docker clears everything 
out and the OSD restarts fine. I'm somewhat hesitant to accept the simple "must be 
the firmware" statement, because these disks worked fine for 4 years with mimic. The 
only thing that changed was the ceph version from mimic to octopus, everything else 
stayed the same: OS version, kernel version, docker version, firmware version.

Since it happens only on this type of disk, it could very well have to do with 
firmware, but not without ceph having had a serious change in low-level disk 
access between mimic and octopus. So, I'm wondering what features of the 
firmware octopus is using that mimic was not. It would be great if somebody has 
some pointers for what part of the software stack I should look at; I would 
like to avoid hunting ghosts.

Many thanks and best regards!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 10 October 2022 23:33:32
To: Igor Fedotov; ceph-users@ceph.io
Subject: [ceph-users] Re: LVM osds loose connection to disk

Hi Igor.

The problem of OSD crashes was resolved after migrating just a little bit of the meta-data 
pool to other disks (we decided to evacuate the small OSDs onto larger disks to make space). 
Therefore, I don't think it's an LVM or disk issue. The cluster is working perfectly now 
after migrating some data away from the small OSDs. I rather believe that it's tightly 
related to "OSD crashes during upgrade mimic->octopus"; it happens only on OSDs 
where the repair command errs out wi

[ceph-users] Re: Is it a bug that OSD crashed when it's full?

2022-11-01 Thread Igor Fedotov
::basic_string, std::allocator > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, 
rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase*, std::vector >, std::allocator > > 
>, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector >, 
std::allocator > > > const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&, std::vector >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, 
rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, 
rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x55858e58be45]
  18: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, 
rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) 
[0x55858e3f0ea5]
  19: (rocksdb::DBImpl::RecoverLogFiles(std::vector > const&, unsigned long*, bool, bool*)+0x1c2e) 
[0x55858e3f35de]
  20: (rocksdb::DBImpl::Recover(std::vector > const&, bool, bool, bool, 
unsigned long*)+0xae8) [0x55858e3f4938]
  21: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string, 
std::allocator > const&, std::vector > const&, std::vector >*, rocksdb::DB**, bool, bool)+0x59d) [0x55858e3ee65d]
  22: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string, 
std::allocator > const&, std::vector > const&, std::vector >*, rocksdb::DB**)+0x15) [0x55858e3ef9f5]
  23: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string, std::allocator > const&)+0x10c1) [0x55858e367601]
  24: (BlueStore::_open_db(bool, bool, bool)+0x8c7) [0x55858ddde857]
  25: (BlueStore::_open_db_and_around(bool, bool)+0x2f7) [0x55858de4c8f7]
  26: (BlueStore::_mount()+0x204) [0x55858de4f7b4]
  27: (OSD::init()+0x380) [0x55858d91d1d0]
  28: main()
  29: __libc_start_main()
  30: _start()
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: LVM osds loose connection to disk

2022-10-09 Thread Igor Fedotov
 80: ceph_abort_msg("hit suicide timeout")

  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
(stable)
  1: (ceph::__ceph_abort(char const*, int, char const*, 
std::__cxx11::basic_string, std::allocator > const&)+0xe5) [0x556b9b10cb32]
  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, 
unsigned long)+0x295)
[0x556b9b82c795]
  3: (ceph::HeartbeatMap::is_healthy()+0x112) [0x556b9b82d292]
  4: (OSD::handle_osd_ping(MOSDPing*)+0xc2f) [0x556b9b1e253f]
  5: (OSD::heartbeat_dispatch(Message*)+0x1db) [0x556b9b1e44eb]
  6: (DispatchQueue::fast_dispatch(boost::intrusive_ptr const&)+0x155) 
[0x556b9bb83aa5]
  7: (ProtocolV2::handle_message()+0x142a) [0x556b9bbb941a]
  8: (ProtocolV2::handle_read_frame_dispatch()+0x258) [0x556b9bbcb418]
  9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x556b9bbcb515]
  10: 
(ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr&&, int)+0x92) [0x556b9bbcc912]
  11: (ProtocolV2::run_continuation(Ct&)+0x3c) [0x556b9bbb480c]
  12: (AsyncConnection::process()+0x8a9) [0x556b9bb8b6c9]
  13: (EventCenter::process_events(unsigned int, std::chrono::duration >*)+0xcb7) [0x556b9b9e22c7]
  14: (()+0xde78ac) [0x556b9b9e78ac]
  15: (()+0xc2ba3) [0x7fbdf84c8ba3]
  16: (()+0x81ca) [0x7fbdf8e751ca]
  17: (clone()+0x43) [0x7fbdf7adfdd3]

2022-10-08T16:10:52.078+0200 7fbdf4678700 -1 *** Caught signal (Aborted) **
  in thread 7fbdf4678700 thread_name:msgr-worker-2

  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
(stable)
  1: (()+0x12ce0) [0x7fbdf8e7fce0]
  2: (gsignal()+0x10f) [0x7fbdf7af4a9f]
  3: (abort()+0x127) [0x7fbdf7ac7e05]
  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string, std::allocator > const&)+0x1b6) [0x556b9b10cc03]
  5: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, 
unsigned long)+0x295) [0x556b9b82c795]
  6: (ceph::HeartbeatMap::is_healthy()+0x112) [0x556b9b82d292]
  7: (OSD::handle_osd_ping(MOSDPing*)+0xc2f) [0x556b9b1e253f]
  8: (OSD::heartbeat_dispatch(Message*)+0x1db) [0x556b9b1e44eb]
  9: (DispatchQueue::fast_dispatch(boost::intrusive_ptr const&)+0x155) 
[0x556b9bb83aa5]
  10: (ProtocolV2::handle_message()+0x142a) [0x556b9bbb941a]
  11: (ProtocolV2::handle_read_frame_dispatch()+0x258) [0x556b9bbcb418]
  12: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x556b9bbcb515]
  13: 
(ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr&&, int)+0x92) [0x556b9bbcc912]
  14: (ProtocolV2::run_continuation(Ct&)+0x3c) [0x556b9bbb480c]
  15: (AsyncConnection::process()+0x8a9) [0x556b9bb8b6c9]
  16: (EventCenter::process_events(unsigned int, std::chrono::duration >*)+0xcb7) [0x556b9b9e22c7]
  17: (()+0xde78ac) [0x556b9b9e78ac]
  18: (()+0xc2ba3) [0x7fbdf84c8ba3]
  19: (()+0x81ca) [0x7fbdf8e751ca]
  20: (clone()+0x43) [0x7fbdf7adfdd3]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

What I'm most interested in right now is whether anyone has an idea what the underlying 
cause of these disks freezing might be, and why the crashed OSD is not 
recognised as down. Any hints on what to check if it happens again are also 
welcome.

Many thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov

Hi Frank,

there are no tools to defragment an OSD atm. The only way to defragment an OSD is 
to redeploy it...



Thanks,

Igor


On 10/7/2022 3:04 AM, Frank Schilder wrote:

Hi Igor,

sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to 
de-fragment the OSD. It doesn't look like the fsck command does that. Is there 
any such tool?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 07 October 2022 01:53:20
To: Igor Fedotov; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, 
so the numbers I included are representative. I don't believe that we had one 
such extreme outlier. Maybe it ran full during conversion. Most of the data is 
OMAP after all.

I can't dump the free-dumps into paste bin, they are too large. Not sure if you 
can access ceph-post-files. I will send you a tgz in a separate e-mail directly 
to you.


And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...

As I said before, I need more time to check this and give you the answer you 
actually want. The stupid answer is they don't, because the other 3 are taken 
down the moment 16 crashes and don't reach the same point. I need to take them 
out of the grouped management and start them by hand, which I can do tomorrow. 
I'm too tired now to play on our production system.

The free-dumps are on their separate way. I included one for OSD 17 as well (on 
the same disk).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16  so please share that OSD
utilization... And honestly I trust allocator's stats more so it's
rather CLI stats are incorrect if any. Anyway free dump should provide
additional proofs..

And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...


W.r.t fsck - you can try to run it - since fsck opens the DB in read-only mode
there is some chance it will work.


Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs are only 
50-60% used. For example:

ID    CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29   ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44   ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58   ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984   ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M more. It 
sounds like some stats data/counters are incorrect/corrupted. Is it possible to 
run an fsck on a bluestore device to have it checked for that? Any idea how an 
incorrect utilisation might come about?

I will look into starting these OSDs individually. This will be a bit of work 
as our deployment method is to start/stop all OSDs sharing the same disk 
simultaneously (OSDs are grouped by disk). If one fails all others also go 
down. Its for simplifying disk management and this debugging is a new use case 
we never needed before.

Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for
additional bluefs space allocations which prevents osd from startup.

   From the following log line one can see that bluefs needs ~1M more
space while the total available one is approx 622M. the problem is that
bluefs needs continuous(!) 64K chunks though. Which apparently aren't
available due to high disk fragmentation.

   -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
allocate on 0x11 min_size 0x11 >

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov

Just FYI:

standalone ceph-bluestore-tool's quick-fix behaves pretty similar to the 
action performed on start-up with bluestore_fsck_quick_fix_on_mount = true
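
For reference, a rough sketch of the standalone variant, assuming osd.16 and the 
default data directory (the OSD must be stopped first):

ceph-bluestore-tool quick-fix --path /var/lib/ceph/osd/ceph-16
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-16     # optional sanity check afterwards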




On 10/7/2022 10:18 AM, Frank Schilder wrote:

Hi Stefan,

super thanks!

I found a quick-fix command in the help output:

# ceph-bluestore-tool -h
[...]
Positional options:
   --command arg  fsck, repair, quick-fix, bluefs-export,
  bluefs-bdev-sizes, bluefs-bdev-expand,
  bluefs-bdev-new-db, bluefs-bdev-new-wal,
  bluefs-bdev-migrate, show-label, set-label-key,
  rm-label-key, prime-osd-dir, bluefs-log-dump,
  free-dump, free-score, bluefs-stats

but it's not documented in https://docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool/. I 
guess I will stick with the tested command "repair". Nothing I found mentions 
what exactly is executed on start-up with bluestore_fsck_quick_fix_on_mount = true.
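
For comparison, the on-mount variant would look roughly like this in that host's 
ceph.conf (a sketch mirroring what was done on the already-converted host), after 
which the OSDs on that host are restarted:

[osd]
bluestore_fsck_quick_fix_on_mount = true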

Thanks for your quick answer!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 07 October 2022 09:07:37
To: Frank Schilder; Igor Fedotov; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/7/22 09:03, Frank Schilder wrote:

Hi Igor and Stefan,

thanks a lot for your help! Our cluster is almost finished with recovery and I 
would like to switch to off-line conversion of the SSD OSDs. In one of Stefan's 
e-mails I could find the command for manual compaction:

ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" compact

Unfortunately, I can't find the command for performing the omap conversion. It 
is not mentioned here 
https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus
 even though it does mention the option to skip conversion in step 5. How to 
continue with an off-line conversion is not mentioned. I know it has been 
posted before, but I seem unable to find it on this list. If someone could send 
me the command, I would be most grateful.

for osd in `ls /var/lib/ceph/osd/`; do ceph-bluestore-tool repair --path /var/lib/ceph/osd/$osd; done

That's what I use.

Gr. Stefan


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
For format updates one can use the quick-fix command instead of repair; it 
might work a bit faster..


On 10/7/2022 10:07 AM, Stefan Kooman wrote:

On 10/7/22 09:03, Frank Schilder wrote:

Hi Igor and Stefan,

thanks a lot for your help! Our cluster is almost finished with 
recovery and I would like to switch to off-line conversion of the SSD 
OSDs. In one of Stefan's e-mails I could find the command for manual compaction:


ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" 
compact


Unfortunately, I can't find the command for performing the omap 
conversion. It is not mentioned here 
https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus 
even though it does mention the option to skip conversion in step 5. 
How to continue with an off-line conversion is not mentioned. I know 
it has been posted before, but I seem unable to find it on this list. 
If someone could send me the command, I would be most grateful.


for osd in `ls /var/lib/ceph/osd/`; do ceph-bluestore-tool repair --path /var/lib/ceph/osd/$osd; done


That's what I use.

Gr. Stefan


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov

Hi Frank,

one more thing I realized during the night :)

When performing the conversion, the DB gets a significant bunch of new data 
(approx. on par with the original OMAP volume) without the old one being 
immediately removed. Hence one should expect the DB size to grow dramatically 
at this point. This growth should go away after compaction (either enforced or 
regular background one).


But the point is that during that peak usage one might (temporarily) be 
out of free space. And I believe that's the root cause of your 
outage. So please be careful when doing further conversions; I think 
your OSDs are exposed to this issue due to the limited space available ...
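
A minimal sketch of a pre-conversion check, assuming osd.16 (the bluestore-tool 
commands require the OSD to be stopped): look at the utilisation and at 
free-space fragmentation before starting the omap conversion:

ceph osd df tree | grep 'osd.16'
ceph-bluestore-tool free-score --path /var/lib/ceph/osd/ceph-16
ceph-bluestore-tool bluefs-stats --path /var/lib/ceph/osd/ceph-16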



Thanks,

Igor

On 10/7/2022 2:53 AM, Frank Schilder wrote:

Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, 
so the numbers I included are representative. I don't believe that we had one 
such extreme outlier. Maybe it ran full during conversion. Most of the data is 
OMAP after all.

I can't dump the free-dumps into paste bin, they are too large. Not sure if you 
can access ceph-post-files. I will send you a tgz in a separate e-mail directly 
to you.


And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...

As I said before, I need more time to check this and give you the answer you 
actually want. The stupid answer is they don't, because the other 3 are taken 
down the moment 16 crashes and don't reach the same point. I need to take them 
out of the grouped management and start them by hand, which I can do tomorrow. 
I'm too tired now to play on our production system.

The free-dumps are on their separate way. I included one for OSD 17 as well (on 
the same disk).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16  so please share that OSD
utilization... And honestly I trust allocator's stats more so it's
rather CLI stats are incorrect if any. Anyway free dump should provide
additional proofs..

And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...


W.r.t fsck - you can try to run it - since fsck opens the DB in read-only mode
there is some chance it will work.


Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs are only 
50-60% used. For example:

ID    CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29   ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44   ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58   ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984   ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M more. It 
sounds like some stats data/counters are incorrect/corrupted. Is it possible to 
run an fsck on a bluestore device to have it checked for that? Any idea how an 
incorrect utilisation might come about?

I will look into starting these OSDs individually. This will be a bit of work 
as our deployment method is to start/stop all OSDs sharing the same disk 
simultaneously (OSDs are grouped by disk). If one fails all others also go 
down. Its for simplifying disk management and this debugging is a new use case 
we never needed before.

Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for
additional bluefs space allocations which prevents osd from startup.

   From the following log line one can see that bluefs needs ~1M more
space while the total available one is approx 622M. the problem is that
bluefs needs continuous(!) 64K chunks though. Which apparently aren't
available due to high disk fragmentation.

   -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/l

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
well, I've just realized that you're apparently unable to collect these 
high-level stats for broken OSDs, are you?

But if that's the case you shouldn't make any assumptions about a faulty 
OSD's utilization from healthy ones - it's definitely a very doubtful 
approach ;)




On 10/7/2022 2:19 AM, Igor Fedotov wrote:
The log I inspected was for osd.16  so please share that OSD 
utilization... And honestly I trust allocator's stats more so it's 
rather CLI stats are incorrect if any. Anyway free dump should provide 
additional proofs..


And once again - do other non-starting OSDs show the same ENOSPC 
error?  Evidently I'm unable to make any generalization about the root 
cause due to lack of the info...



W.r.t fsck - you can try to run it - since fsck opens the DB in read-only mode
there is some chance it will work.



Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs 
are only 50-60% used. For example:


ID    CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29   ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44   ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58   ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984   ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984


Yes, these drives are small, but it should be possible to find 1M 
more. It sounds like some stats data/counters are 
incorrect/corrupted. Is it possible to run an fsck on a bluestore 
device to have it checked for that? Any idea how an incorrect 
utilisation might come about?


I will look into starting these OSDs individually. This will be a bit 
of work as our deployment method is to start/stop all OSDs sharing 
the same disk simultaneously (OSDs are grouped by disk). If one fails 
all others also go down. Its for simplifying disk management and this 
debugging is a new use case we never needed before.


Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for
additional bluefs space allocations which prevents osd from startup.

  From the following log line one can see that bluefs needs ~1M more
space while the total available one is approx 622M. the problem is that
bluefs needs continuous(!) 64K chunks though. Which apparently aren't
available due to high disk fragmentation.

  -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
allocate on 0x11 min_size 0x11 > allocated total 0x3
bluefs_shared_alloc_size 0x1 allocated 0x3 available 0x 25134000


To double check the above root cause analysis it would be helpful to get
ceph-bluestore-tool's free_dump command output - small chances there is
a bug in allocator which "misses" some long enough chunks. But given
disk space utilization (>90%) and pretty small disk size this is
unlikely IMO.

So to work around the issue and bring OSD up you should either expand
the main device for OSD or add standalone DB volume.


Curious whether other non-starting OSDs report the same error...


Thanks,

Igor



On 10/7/2022 1:02 AM, Frank Schilder wrote:

Hi Igor,

the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one 
crashing the show. I collected its startup log here: 
https://pastebin.com/25D3piS6 . The line sticking out is line 603:


/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc: 
2931: ceph_abort_msg("bluefs enospc")


This smells a lot like rocksdb corruption. Can I do something about 
that? I still need to convert most of our OSDs and I cannot afford 
to loose more. The rebuild simply takes too long in the current 
situation.


Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Igor Fedotov 
Sent: 06 October 2022 17:03:53
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
The log I inspected was for osd.16 so please share that OSD's 
utilization... And honestly I trust the allocator's stats more, so it's 
rather the CLI stats that are incorrect, if any. Anyway the free dump should provide 
additional proof..


And once again - do other non-starting OSDs show the same ENOSPC error?  
Evidently I'm unable to make any generalization about the root cause due 
to lack of the info...



W.r.t fsck - you can try to run it - since fsck opens the DB in read-only mode
there is some chance it will work.



Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs are only 
50-60% used. For example:

ID    CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29   ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44   ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58   ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984   ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M more. It 
sounds like some stats data/counters are incorrect/corrupted. Is it possible to 
run an fsck on a bluestore device to have it checked for that? Any idea how an 
incorrect utilisation might come about?

I will look into starting these OSDs individually. This will be a bit of work 
as our deployment method is to start/stop all OSDs sharing the same disk 
simultaneously (OSDs are grouped by disk). If one fails all others also go 
down. Its for simplifying disk management and this debugging is a new use case 
we never needed before.

Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for
additional bluefs space allocations which prevents osd from startup.

  From the following log line one can see that bluefs needs ~1M more
space while the total available one is approx 622M. the problem is that
bluefs needs continuous(!) 64K chunks though. Which apparently aren't
available due to high disk fragmentation.

  -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
allocate on 0x11 min_size 0x11 > allocated total 0x3
bluefs_shared_alloc_size 0x1 allocated 0x3 available 0x 25134000


To double check the above root cause analysis it would be helpful to get
ceph-bluestore-tool's free_dump command output - small chances there is
a bug in allocator which "misses" some long enough chunks. But given
disk space utilization (>90%) and pretty small disk size this is
unlikely IMO.

So to work around the issue and bring OSD up you should either expand
the main device for OSD or add standalone DB volume.


Curious whether other non-starting OSDs report the same error...


Thanks,

Igor



On 10/7/2022 1:02 AM, Frank Schilder wrote:

Hi Igor,

the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one crashing the 
show. I collected its startup log here: https://pastebin.com/25D3piS6 . The 
line sticking out is line 603:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc:
 2931: ceph_abort_msg("bluefs enospc")

This smells a lot like rocksdb corruption. Can I do something about that? I 
still need to convert most of our OSDs and I cannot afford to loose more. The 
rebuild simply takes too long in the current situation.

Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Igor Fedotov 
Sent: 06 October 2022 17:03:53
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

But could you please share full OSD startup log for any one which is
unable to restart after host reboot?


On 10/6/2022 5:12 PM, Frank Schilder wrote:

Hi Igor and Stefan.


Not sure why you're talking about replicated(!) 4(2) pool.

Its because in the production cluster its the 4(2) poo

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for 
additional bluefs space allocations which prevents osd from startup.


From the following log line one can see that bluefs needs ~1M more 
space while the total available is approx 622M. The problem is that 
bluefs needs contiguous(!) 64K chunks though, which apparently aren't 
available due to high disk fragmentation.


    -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1 
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to 
allocate on 0x11 min_size 0x11 > allocated total 0x3 
bluefs_shared_alloc_size 0x1 allocated 0x3 available 0x 25134000



To double check the above root cause analysis it would be helpful to get 
ceph-bluestore-tool's free_dump command output - small chances there is 
a bug in allocator which "misses" some long enough chunks. But given 
disk space utilization (>90%) and pretty small disk size this is 
unlikely IMO.


So to work around the issue and bring the OSD up you should either expand 
the main device for the OSD or add a standalone DB volume.
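
A rough sketch of both options, assuming osd.16 and an LVM-backed OSD (device 
names are placeholders, not from this thread):

# keep the free-dump for later analysis
ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-16 > /tmp/osd.16-free.json
# option A: enlarge the LV backing the OSD first, then let bluefs grow into the new space
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-16
# option B: attach a separate DB volume on another device
ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-16 --dev-target /dev/vg0/osd16-db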



Curious whether other non-starting OSDs report the same error...


Thanks,

Igor



On 10/7/2022 1:02 AM, Frank Schilder wrote:

Hi Igor,

the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one crashing the 
show. I collected its startup log here: https://pastebin.com/25D3piS6 . The 
line sticking out is line 603:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc:
 2931: ceph_abort_msg("bluefs enospc")

This smells a lot like rocksdb corruption. Can I do something about that? I 
still need to convert most of our OSDs and I cannot afford to loose more. The 
rebuild simply takes too long in the current situation.

Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Igor Fedotov 
Sent: 06 October 2022 17:03:53
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

But could you please share full OSD startup log for any one which is
unable to restart after host reboot?


On 10/6/2022 5:12 PM, Frank Schilder wrote:

Hi Igor and Stefan.


Not sure why you're talking about replicated(!) 4(2) pool.

It's because in the production cluster it's the 4(2) pool that has that problem. On the 
test cluster it was an EC pool. Seems to affect all sorts of pools.

I have to take this one back. It is indeed an EC pool that is also on these SSD 
OSDs that is affected. The meta-data pool was all active all the time until we 
lost the 3rd host. So, the bug reported is confirmed to affect EC pools.


If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?

Fortunately not. After losing disks on the 3rd host, we had to start taking 
somewhat more desperate measures. We set the file system off-line to stop 
client IO and started rebooting hosts in reverse order of failing. This brought 
back the OSDs on the still un-converted hosts. We rebooted the converted host 
with the original fail of OSDs last. Unfortunately, here it seems we lost a 
drive for good. It looks like the OSDs crashed while the conversion was going 
on or something. They don't boot up and I need to look into that with more 
detail.

We are currently trying to encourage fs clients to reconnect to the file 
system. Unfortunately, on many we get

# ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle

Is there a server-side way to encourage the FS clients to reconnect to the 
cluster? What is a clean way to get them back onto the file system? I tried 
remounts without success.

Before executing the next conversion, I will compact the rocksdb on all SSD 
OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number 
of objects per PG, which is potentially the main reason for our observations.

Thanks for your help,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Igor Fedotov 
Sent: 06 October 2022 14:39
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Are crashing OSDs still bound to two hosts?

If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?


On 10/6/2022 3:35 PM, Frank Schilder wrote:

Hi Igor.


Not sure why you're talking about replicated(!) 4(2) pool.

Its because in the production cluster its the 4(2) pool that has that problem. 
On the test cluster it was an EC pool. Seems to affect all sorts of pools.

I just lost another 

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

Sorry - no clue about CephFS related questions...

But could you please share a full OSD startup log for any one which is 
unable to restart after the host reboot?



On 10/6/2022 5:12 PM, Frank Schilder wrote:

Hi Igor and Stefan.


Not sure why you're talking about replicated(!) 4(2) pool.

It's because in the production cluster it's the 4(2) pool that has that problem. On the 
test cluster it was an EC pool. Seems to affect all sorts of pools.

I have to take this one back. It is indeed an EC pool that is also on these SSD 
OSDs that is affected. The meta-data pool was all active all the time until we 
lost the 3rd host. So, the bug reported is confirmed to affect EC pools.


If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?

Fortunately not. After losing disks on the 3rd host, we had to start taking 
somewhat more desperate measures. We set the file system off-line to stop 
client IO and started rebooting hosts in reverse order of failing. This brought 
back the OSDs on the still un-converted hosts. We rebooted the converted host 
with the original fail of OSDs last. Unfortunately, here it seems we lost a 
drive for good. It looks like the OSDs crashed while the conversion was going 
on or something. They don't boot up and I need to look into that with more 
detail.

We are currently trying to encourage fs clients to reconnect to the file 
system. Unfortunately, on many we get

# ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle

Is there a server-side way to encourage the FS clients to reconnect to the 
cluster? What is a clean way to get them back onto the file system? I tried 
remounts without success.

Before executing the next conversion, I will compact the rocksdb on all SSD 
OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number 
of objects per PG, which is potentially the main reason for our observations.

Thanks for your help,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 14:39
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Are crashing OSDs still bound to two hosts?

If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?


On 10/6/2022 3:35 PM, Frank Schilder wrote:

Hi Igor.


Not sure why you're talking about replicated(!) 4(2) pool.

Its because in the production cluster its the 4(2) pool that has that problem. 
On the test cluster it was an EC pool. Seems to affect all sorts of pools.

I just lost another disk, we have PGs down now. I really hope the stuck 
bstore_kv_sync thread does not lead to rocksdb corruption.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Igor Fedotov 
Sent: 06 October 2022 14:26
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/6/2022 2:55 PM, Frank Schilder wrote:

Hi Igor,

it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to 
make a bad situation worse for now and wait for recovery to finish. The 
inactive PGs are activating very slowly.

Got it.


By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even 
inactive here? This "feature" is new in octopus, I reported it about 2 months 
ago as a bug. Testing with mimic I cannot reproduce this problem: 
https://tracker.ceph.com/issues/56995

Not sure why you're talking about replicated(!) 4(2) pool. In the above
ticket I can see EC 4+2 one (pool 4 'fs-data' erasure profile
ec-4-2...). Which means 6 shards per object and may be this setup has
some issues with mapping to unique osds within a host (just 3 hosts are
available!) ...  One can see that pg 4.* are marked as inactive only.
Not a big expert in this stuff so mostly just speculating


Do you have the same setup in the production cluster in question? If so
- then you lack 2 of 6 shards and IMO the cluster properly marks the
relevant PGs as inactive. The same would apply to 3x replicated PGs as
well though since two replicas are down..



I found this in the syslog, maybe it helps:

kernel: task:bstore_kv_sync  state:D stack:0 pid:3646032 ppid:3645340 
flags:0x
kernel: Call Trace:
kernel: __schedule+0x2a2/0x7e0
kernel: schedule+0x4e/0xb0
kernel: io_schedule+0x16/0x40
kernel: wait_on_page_bit_common+0x15c/0x3e0
kernel: ? __page_cache_alloc+0xb0/0xb0
kernel: wait_on_page_bit+0x3f/0x50
kernel: wait_on_page_writeback+0x26/0x70
kernel: __filemap_fdatawait_range+0x98/0x100
kernel: ? __filemap_fdatawrite_range+0xd8/0x110
kernel: file_fdatawait_range+0x1a/0x30
kernel: sync_file_range+0xc2/0xf0
kernel: ksys_sync_file_range+0x41/0x80
kernel: __x64_sys_sync_file_range+0x1e/0x30
kernel: d

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov



On 10/6/2022 3:16 PM, Stefan Kooman wrote:

On 10/6/22 13:41, Frank Schilder wrote:

Hi Stefan,

thanks for looking at this. The conversion has happened on 1 host 
only. Status is:


- all daemons on all hosts upgraded
- all OSDs on 1 OSD-host were restarted with 
bluestore_fsck_quick_fix_on_mount = true in its local ceph.conf, 
these OSDs completed conversion and rebooted, I would assume that the 
freshly created OMAPs are compacted by default?


As far as I know it's not.


According to https://tracker.ceph.com/issues/51711 compaction is applied 
after OMAP upgrade starting v15.2.14






- unfortunately, the converted SSD-OSDs on this host died
- now SSD OSDs on other (un-converted) hosts also start crashing 
randomly and very badly (not possible to restart due to stuck D-state 
processes)


Does compaction even work properly on upgraded but unconverted OSDs?


yes, compaction is available irrespective of the data format the OSD 
uses for keeping data in the DB. Hence both converted and unconverted OSDs can 
benefit from it.
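
For reference, a minimal sketch of both variants, assuming osd.16: compaction can 
be triggered online per OSD, or offline while the OSD is stopped:

ceph tell osd.16 compact
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-16 compact    # offline, OSD stopped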



We have done several measurements based on production data (clones of 
data disks from prod.), in this case the conversion from octopus to 
pacific (and the resharding as well). We would save half the time by 
compacting them beforehand. It would take, in our case, many hours to 
do a conversion, so it would pay off immensely. So yes, you can do 
this. Not sure if I have tested this on the Octopus conversion, but as the 
conversion to pacific involves a similar process it's safe to assume 
it will be the same.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

Are crashing OSDs still bound to two hosts?

If not - does any dead OSD unconditionally mean its underlying disk is 
unavailable any more?



On 10/6/2022 3:35 PM, Frank Schilder wrote:

Hi Igor.


Not sure why you're talking about replicated(!) 4(2) pool.

Its because in the production cluster its the 4(2) pool that has that problem. 
On the test cluster it was an EC pool. Seems to affect all sorts of pools.

I just lost another disk, we have PGs down now. I really hope the stuck 
bstore_kv_sync thread does not lead to rocksdb corruption.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 14:26
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/6/2022 2:55 PM, Frank Schilder wrote:

Hi Igor,

it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to 
make a bad situation worse for now and wait for recovery to finish. The 
inactive PGs are activating very slowly.

Got it.


By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even 
inactive here? This "feature" is new in octopus, I reported it about 2 months 
ago as a bug. Testing with mimic I cannot reproduce this problem: 
https://tracker.ceph.com/issues/56995

Not sure why you're talking about replicated(!) 4(2) pool. In the above
ticket I can see EC 4+2 one (pool 4 'fs-data' erasure profile
ec-4-2...). Which means 6 shards per object and may be this setup has
some issues with mapping to unique osds within a host (just 3 hosts are
available!) ...  One can see that pg 4.* are marked as inactive only.
Not a big expert in this stuff so mostly just speculating


Do you have the same setup in the production cluster in question? If so
- then you lack 2 of 6 shards and IMO the cluster properly marks the
relevant PGs as inactive. The same would apply to 3x replicated PGs as
well though since two replicas are down..



I found this in the syslog, maybe it helps:

kernel: task:bstore_kv_sync  state:D stack:0 pid:3646032 ppid:3645340 
flags:0x
kernel: Call Trace:
kernel: __schedule+0x2a2/0x7e0
kernel: schedule+0x4e/0xb0
kernel: io_schedule+0x16/0x40
kernel: wait_on_page_bit_common+0x15c/0x3e0
kernel: ? __page_cache_alloc+0xb0/0xb0
kernel: wait_on_page_bit+0x3f/0x50
kernel: wait_on_page_writeback+0x26/0x70
kernel: __filemap_fdatawait_range+0x98/0x100
kernel: ? __filemap_fdatawrite_range+0xd8/0x110
kernel: file_fdatawait_range+0x1a/0x30
kernel: sync_file_range+0xc2/0xf0
kernel: ksys_sync_file_range+0x41/0x80
kernel: __x64_sys_sync_file_range+0x1e/0x30
kernel: do_syscall_64+0x3b/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7ffbb6f77ae7
kernel: RSP: 002b:7ffba478c3c0 EFLAGS: 0293 ORIG_RAX: 0115
kernel: RAX: ffda RBX: 002d RCX: 7ffbb6f77ae7
kernel: RDX: 2000 RSI: 00015f849000 RDI: 002d
kernel: RBP: 00015f849000 R08:  R09: 2000
kernel: R10: 0007 R11: 0293 R12: 2000
kernel: R13: 0007 R14: 0001 R15: 560a1ae20380
kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
kernel:  Tainted: GE 5.14.13-1.el7.elrepo.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.

It is quite possible that this was the moment when these OSDs got stuck and 
were marked down. The time stamp is about right.

Right. this is a primary thread which submits transactions to DB. And it
stuck for >123 seconds. Given that the disk is completely unresponsive I
presume something has happened at lower level (controller or disk FW)
though.. May be this was somehow caused by "fragmented" DB access and
compaction would heal this. On the other hand the compaction had to be
applied after omap upgrade so I'm not sure another one would change the
state...




Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Igor Fedotov 
Sent: 06 October 2022 13:45:17
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

   From your response to Stefan I'm getting that one of the two damaged hosts
has all OSDs down and unable to start. Is that correct? If so you can
reboot it with no problem and proceed with manual compaction [and other
experiments] quite "safely" for the rest of the cluster.


On 10/6/2022 2:35 PM, Frank Schilder wrote:

Hi Igor,

I can't access these drives. They have an OSD- or LVM process hanging in 
D-state. Any attempt to do something with these gets stuck as well.

I somehow need to wait for recovery to finish and protect the still running 
OSDs from crashing similarly badly.

After we have full redunda

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

On 10/6/2022 2:55 PM, Frank Schilder wrote:

Hi Igor,

it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to 
make a bad situation worse for now and wait for recovery to finish. The 
inactive PGs are activating very slowly.

Got it.



By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even 
inactive here? This "feature" is new in octopus, I reported it about 2 months 
ago as a bug. Testing with mimic I cannot reproduce this problem: 
https://tracker.ceph.com/issues/56995


Not sure why you're talking about a replicated(!) 4(2) pool. In the above 
ticket I can see an EC 4+2 one (pool 4 'fs-data' erasure profile 
ec-4-2...). Which means 6 shards per object, and maybe this setup has 
some issues with mapping to unique OSDs within a host (just 3 hosts are 
available!) ... One can see that only pgs 4.* are marked as inactive. 
Not a big expert in this stuff so mostly just speculating



Do you have the same setup in the production cluster in question? If so 
- then you lack 2 of 6 shards and IMO the cluster properly marks the 
relevant PGs as inactive. The same would apply to 3x replicated PGs as 
well though since two replicas are down..





I found this in the syslog, maybe it helps:

kernel: task:bstore_kv_sync  state:D stack:0 pid:3646032 ppid:3645340 
flags:0x
kernel: Call Trace:
kernel: __schedule+0x2a2/0x7e0
kernel: schedule+0x4e/0xb0
kernel: io_schedule+0x16/0x40
kernel: wait_on_page_bit_common+0x15c/0x3e0
kernel: ? __page_cache_alloc+0xb0/0xb0
kernel: wait_on_page_bit+0x3f/0x50
kernel: wait_on_page_writeback+0x26/0x70
kernel: __filemap_fdatawait_range+0x98/0x100
kernel: ? __filemap_fdatawrite_range+0xd8/0x110
kernel: file_fdatawait_range+0x1a/0x30
kernel: sync_file_range+0xc2/0xf0
kernel: ksys_sync_file_range+0x41/0x80
kernel: __x64_sys_sync_file_range+0x1e/0x30
kernel: do_syscall_64+0x3b/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7ffbb6f77ae7
kernel: RSP: 002b:7ffba478c3c0 EFLAGS: 0293 ORIG_RAX: 0115
kernel: RAX: ffda RBX: 002d RCX: 7ffbb6f77ae7
kernel: RDX: 2000 RSI: 00015f849000 RDI: 002d
kernel: RBP: 00015f849000 R08:  R09: 2000
kernel: R10: 0007 R11: 0293 R12: 2000
kernel: R13: 0007 R14: 0001 R15: 560a1ae20380
kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
kernel:  Tainted: GE 5.14.13-1.el7.elrepo.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.

It is quite possible that this was the moment when these OSDs got stuck and 
were marked down. The time stamp is about right.


Right, this is the primary thread which submits transactions to the DB. And it 
got stuck for >123 seconds. Given that the disk is completely unresponsive I 
presume something has happened at a lower level (controller or disk FW) 
though.. Maybe this was somehow caused by "fragmented" DB access and 
compaction would heal this. On the other hand the compaction had to be 
applied after the omap upgrade so I'm not sure another one would change the 
state...






Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Igor Fedotov 
Sent: 06 October 2022 13:45:17
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

  From your response to Stefan I'm getting that one of the two damaged hosts
has all OSDs down and unable to start. Is that correct? If so you can
reboot it with no problem and proceed with manual compaction [and other
experiments] quite "safely" for the rest of the cluster.


On 10/6/2022 2:35 PM, Frank Schilder wrote:

Hi Igor,

I can't access these drives. They have an OSD- or LVM process hanging in 
D-state. Any attempt to do something with these gets stuck as well.

I somehow need to wait for recovery to finish and protect the still running 
OSDs from crashing similarly badly.

After we have full redundancy again and service is back, I can add the setting 
osd_compact_on_start=true and start rebooting servers. Right now I need to 
prevent the ship from sinking.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Igor Fedotov 
Sent: 06 October 2022 13:28:11
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

IIUC the OSDs that expose "had timed out after 15" are failing to start
up. Is that correct or did I miss something?  I meant trying compaction
for them...


On 10/6/2022 2:27 PM, Frank Schilder wrote:

Hi Igor,

thanks for your response.


And what's the target Octopus release?

ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
