[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

Hi Frank,

you might want to compact RocksDB with ceph-kvstore-tool for those OSDs which are showing

"heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15"



I have seen such an error quite often after bulk data removal and the resulting 
severe drop in DB performance.
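
For example, a minimal sketch of such an offline compaction for one OSD 
(assuming the default /var/lib/ceph/osd/ceph-<ID> data path, and with the OSD 
stopped first):

systemctl stop ceph-osd@<ID>
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<ID> compact
systemctl start ceph-osd@<ID>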


Thanks,
Igor

On 10/6/2022 2:06 PM, Frank Schilder wrote:

Hi all,

we are stuck with a really unpleasant situation and we would appreciate help. 
Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way 
through with bluestore_fsck_quick_fix_on_mount = false and started the OSD OMAP 
conversion today in the morning. Everything went well at the beginning. The 
conversion went much faster than expected and OSDs came slowly back up. 
Unfortunately, trouble was only around the corner.

We have 12 hosts with 2 SSDs, 4 OSDs per disk and >65 HDDs. On the host where 
we started the conversion, the OSDs on the SSDs either crashed or didn't come up. 
These OSDs are part of our FS meta data pool, which is replicated 4(2). So far so 
unusual.

The problems now are:
   - I cannot restart the crashed OSDs, because a D-state LVM process is 
blocking access to the drives, and
   - that OSDs on other hosts in that pool also started crashing. And they 
crash badly (cannot restart either). The OSD processes' last log lines look 
something like this:

2022-10-06T12:21:09.473+0200 7f18a1ddd700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1885d35700' had timed out after 15
2022-10-06T12:21:09.473+0200 7f18a1ddd700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15
2022-10-06T12:21:12.526+0200 7f18a25de700  0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.73:6903/861744 conn(0x55f32322e800 0x55f32320a000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=330 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:12.526+0200 7f18a1ddd700  0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.81:6891/3651159 conn(0x55f341dae000 0x55f30c65b000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=117 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:13.437+0200 7f18a1ddd700  0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.77:6917/3645255 conn(0x55f323b44000 0x55f325576000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=306 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:45.224+0200 7f189054a700  4 rocksdb: [db/db_impl.cc:777] --- DUMPING STATS ---
2022-10-06T12:21:45.224+0200 7f189054a700  4 rocksdb: [db/db_impl.cc:778]
** DB Stats **

With stats following. At this point it gets stuck. Trying to stop the OSD 
results in the log lines:

2022-10-06T12:26:08.990+0200 7f189fdd9700 -1 received  signal: Terminated from  (PID: 3728898) UID: 0
2022-10-06T12:26:08.990+0200 7f189fdd9700 -1 osd.959 900605 *** Got signal Terminated ***
2022-10-06T12:26:08.990+0200 7f189fdd9700  0 osd.959 900605 prepare_to_stop telling mon we are shutting down and dead
2022-10-06T12:26:13.990+0200 7f189fdd9700  0 osd.959 900605 prepare_to_stop starting shutdown

and the OSD process gets stuck in Dl-state. It is not possible to terminate the 
process. I'm slowly losing redundancy and already lost service:

# ceph status
   cluster:
 id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
 health: HEALTH_ERR
 953 OSD(s) reporting legacy (not per-pool) BlueStore stats
 953 OSD(s) reporting legacy (not per-pool) BlueStore omap usage 
stats
 nosnaptrim flag(s) set
 1 scrub errors
 Reduced data availability: 22 pgs inactive
 Possible data damage: 1 pg inconsistent
 Degraded data redundancy: 130313597/11974669139 objects degraded 
(1.088%),  pgs degraded, 1120 pgs undersized
  
   services:

 mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 18h)
 mgr: ceph-25(active, since 26h), standbys: ceph-26, ceph-03, ceph-02, 
ceph-01
 mds: con-fs2:1 {0=ceph-11=up:active} 11 up:standby
 osd: 1086 osds: 1038 up (since 34m), 1038 in (since 24m); 1120 remapped pgs
  flags nosnaptrim
  
   task status:
  
   data:

 pools:   14 pools, 17375 pgs
 objects: 1.39G objects, 2.5 PiB
 usage:   3.1 PiB used, 8.3 PiB / 11 PiB avail
 pgs: 0.127% pgs not active
  130313597/11974669139 objects degraded (1.088%)
  339825/11974669139 objects misplaced (0.003%)
  16238 active+clean
  747   active+undersized+degraded+remapped+backfilling
  342   active+undersized+degraded+remapped+backfill_wait
  22    forced_recovery+undersized+degraded+remapped+backfilling+peered
  16    active+clean+scrubbing+deep
  5     active+undersized+remapped+backfilling
  4     active+undersized+remapped+backfill_wait
  1     active+clean+inconsistent

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

And what's the target Octopus release?


On 10/6/2022 2:06 PM, Frank Schilder wrote:

Hi all,

we are stuck with a really unpleasant situation and we would appreciate help. 
Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way 
through with bluestore_fsck_quick_fix_on_mount = false and started the OSD OMAP 
conversion today in the morning. Everything went well at the beginning. The 
conversion went much faster than expected and OSDs came slowly back up. 
Unfortunately, trouble was only around the corner.

We have 12 hosts with 2 SSDs, 4 OSDs per disk and >65 HDDs. On the host where 
we started the conversion, the OSDs on the SSDs either crashed or didn't come up. 
These OSDs are part of our FS meta data pool, which is replicated 4(2). So far so 
unusual.

The problems now are:
   - I cannot restart the crashed OSDs, because a D-state LVM process is 
blocking access to the drives, and
   - that OSDs on other hosts in that pool also started crashing. And they 
crash badly (cannot restart either). The OSD processes' last log lines look 
something like this:

2022-10-06T12:21:09.473+0200 7f18a1ddd700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1885d35700' had timed out after 15
2022-10-06T12:21:09.473+0200 7f18a1ddd700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15
2022-10-06T12:21:12.526+0200 7f18a25de700  0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.73:6903/861744 conn(0x55f32322e800 0x55f32320a000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=330 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:12.526+0200 7f18a1ddd700  0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.81:6891/3651159 conn(0x55f341dae000 0x55f30c65b000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=117 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:13.437+0200 7f18a1ddd700  0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.77:6917/3645255 conn(0x55f323b44000 0x55f325576000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=306 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:45.224+0200 7f189054a700  4 rocksdb: [db/db_impl.cc:777] --- DUMPING STATS ---
2022-10-06T12:21:45.224+0200 7f189054a700  4 rocksdb: [db/db_impl.cc:778]
** DB Stats **

With stats following. At this point it gets stuck. Trying to stop the OSD 
results in the log lines:

2022-10-06T12:26:08.990+0200 7f189fdd9700 -1 received  signal: Terminated from  (PID: 3728898) UID: 0
2022-10-06T12:26:08.990+0200 7f189fdd9700 -1 osd.959 900605 *** Got signal Terminated ***
2022-10-06T12:26:08.990+0200 7f189fdd9700  0 osd.959 900605 prepare_to_stop telling mon we are shutting down and dead
2022-10-06T12:26:13.990+0200 7f189fdd9700  0 osd.959 900605 prepare_to_stop starting shutdown

and the OSD process gets stuck in Dl-state. It is not possible to terminate the 
process. I'm slowly losing redundancy and already lost service:

# ceph status
   cluster:
 id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
 health: HEALTH_ERR
 953 OSD(s) reporting legacy (not per-pool) BlueStore stats
 953 OSD(s) reporting legacy (not per-pool) BlueStore omap usage 
stats
 nosnaptrim flag(s) set
 1 scrub errors
 Reduced data availability: 22 pgs inactive
 Possible data damage: 1 pg inconsistent
 Degraded data redundancy: 130313597/11974669139 objects degraded 
(1.088%),  pgs degraded, 1120 pgs undersized
  
   services:

 mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 18h)
 mgr: ceph-25(active, since 26h), standbys: ceph-26, ceph-03, ceph-02, 
ceph-01
 mds: con-fs2:1 {0=ceph-11=up:active} 11 up:standby
 osd: 1086 osds: 1038 up (since 34m), 1038 in (since 24m); 1120 remapped pgs
  flags nosnaptrim
  
   task status:
  
   data:

 pools:   14 pools, 17375 pgs
 objects: 1.39G objects, 2.5 PiB
 usage:   3.1 PiB used, 8.3 PiB / 11 PiB avail
 pgs: 0.127% pgs not active
  130313597/11974669139 objects degraded (1.088%)
  339825/11974669139 objects misplaced (0.003%)
  16238 active+clean
  747   active+undersized+degraded+remapped+backfilling
  342   active+undersized+degraded+remapped+backfill_wait
  22    forced_recovery+undersized+degraded+remapped+backfilling+peered
  16    active+clean+scrubbing+deep
  5 active+undersized+remapped+backfilling
  4 active+undersized+remapped+backfill_wait
  1 active+clean+inconsistent
  
   io:

 client:   77 MiB/s rd, 42 MiB/s wr, 2.58k op/s rd, 1.81k op/s wr
 recovery: 8.5 MiB/s, 2.52k keys/s, 3.39k objects/s

If SSDs on yet another host go down, we are stuck. Right now I hope recovery 
gets the inactive PGs up, but it

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor,

thanks for your response.

> And what's the target Octopus release?

ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)

I'm afraid I don't have the luxury right now to take OSDs down or add extra 
load with an on-line compaction. I would really appreciate a way to make the 
OSDs more crash tolerant until I have full redundancy again. Is there a setting 
that increases the OPS timeout or is there a way to restrict the load to 
tolerable levels?
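
The knobs I am looking at would be something like the ones below, but I am not 
sure these are the right ones or whether they are safe to touch (just a sketch 
of what I have in mind):

# relax the op thread timeouts behind the "timed out after 15" messages
ceph config set osd osd_op_thread_timeout 60            # default 15
ceph config set osd osd_op_thread_suicide_timeout 600   # default 150
ceph config set osd osd_heartbeat_grace 60              # default 20
# and throttle recovery/backfill load
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep_hdd 0.2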

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 13:15
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

you might want to compact RocksDB by ceph-kvstore-tool for those OSDs
which are showing

"heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out 
after 15"



I could see such an error after bulk data removal and following severe
DB performance drop pretty often.

Thanks,
Igor


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Stefan Kooman

On 10/6/22 13:06, Frank Schilder wrote:

Hi all,

we are stuck with a really unpleasant situation and we would appreciate help. 
Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way 
through with bluestore_fsck_quick_fix_on_mount = false and started the OSD OMAP 
conversion today in the morning. Everything went well at the beginning. The 
conversion went much faster than expected and OSDs came slowly back up. 
Unfortunately, trouble was only around the corner.


That sucks. Not sure how far into the upgrade process you are based on 
the info in this mail, but just to make sure you are not hit by RocksDB 
degradation:


Have you done an offline compaction of the OSDs after the conversion? We 
have seen that degraded RocksDB can severely impact the performance. So 
make sure the OSDs are compacted, i.e.:


stop osd processes: systemctl stop ceph-osd.target
df|grep "/var/lib/ceph/osd"|awk '{print $6}'|cut -d '-' -f 2|sort -n|xargs -n 1 -P 10 -I OSD ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-OSD compact
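
The same idea written out as a loop, in case the one-liner is hard to read (a 
sketch; it assumes the default /var/lib/ceph/osd/ceph-<id> mount layout and 
runs all compactions in parallel):

systemctl stop ceph-osd.target
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-kvstore-tool bluestore-kv "$osd" compact &   # one offline compaction per OSD data dir
done
wait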


There can be a ton of other things happening of course. In that case try 
to gather debug logs.


Gr. Stefan


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
IIUC the OSDs that expose "had timed out after 15" are failing to start 
up. Is that correct, or did I miss something? I meant trying compaction 
for them...



On 10/6/2022 2:27 PM, Frank Schilder wrote:

Hi Igor,

thanks for your response.


And what's the target Octopus release?

ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)

I'm afraid I don't have the luxury right now to take OSDs down or add extra 
load with an on-line compaction. I would really appreciate a way to make the 
OSDs more crash tolerant until I have full redundancy again. Is there a setting 
that increases the OPS timeout or is there a way to restrict the load to 
tolerable levels?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 13:15
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

you might want to compact RocksDB by ceph-kvstore-tool for those OSDs
which are showing

"heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 
15"



I could see such an error after bulk data removal and following severe
DB performance drop pretty often.

Thanks,
Igor


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx



[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor,

I can't access these drives. They have an OSD- or LVM process hanging in 
D-state. Any attempt to do something with these gets stuck as well.

I somehow need to wait for recovery to finish and protect the still running 
OSDs from crashing similarly badly.

After we have full redundancy again and service is back, I can add the setting 
osd_compact_on_start=true and start rebooting servers. Right now I need to 
prevent the ship from sinking.
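
The osd_compact_on_start part would be something like this, assuming the option 
is honoured in this release (just what I have in mind, not tested yet):

ceph config set osd osd_compact_on_start true
# then restart/reboot the OSD hosts one at a time and let each OSD compact on the way up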

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 13:28:11
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

IIUC the OSDs that expose "had timed out after 15" are failing to start
up. Is that correct, or did I miss something? I meant trying compaction
for them...


On 10/6/2022 2:27 PM, Frank Schilder wrote:
> Hi Igor,
>
> thanks for your response.
>
>> And what's the target Octopus release?
> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
> (stable)
>
> I'm afraid I don't have the luxury right now to take OSDs down or add extra 
> load with an on-line compaction. I would really appreciate a way to make the 
> OSDs more crash tolerant until I have full redundancy again. Is there a 
> setting that increases the OPS timeout or is there a way to restrict the load 
> to tolerable levels?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 06 October 2022 13:15
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> Hi Frank,
>
> you might want to compact RocksDB by ceph-kvstore-tool for those OSDs
> which are showing
>
> "heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed 
> out after 15"
>
>
>
> I could see such an error after bulk data removal and following severe
> DB performance drop pretty often.
>
> Thanks,
> Igor

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx



[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Stefan,

thanks for looking at this. The conversion has happened on 1 host only. Status 
is:

- all daemons on all hosts upgraded
- all OSDs on 1 OSD-host were restarted with bluestore_fsck_quick_fix_on_mount 
= true in its local ceph.conf, these OSDs completed conversion and rebooted, I 
would assume that the freshly created OMAPs are compacted by default?
- unfortunately, the converted SSD-OSDs on this host died
- now SSD OSDs on other (un-converted) hosts also start crashing randomly and 
very badly (not possible to restart due to stuck D-state processes)

Does compaction even work properly on upgraded but unconverted OSDs?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 06 October 2022 13:27
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/6/22 13:06, Frank Schilder wrote:
> Hi all,
>
> we are stuck with a really unpleasant situation and we would appreciate help. 
> Yesterday we completed the ceph deamon upgrade from mimic to octopus all he 
> way through with bluestore_fsck_quick_fix_on_mount = false and started the 
> OSD OMAP conversion today in the morning. Everything went well at the 
> beginning. The conversion went much faster than expected and OSDs came slowly 
> back up. Unfortunately, trouble was only around the corner.

That sucks. Not sure how far into the upgrade process you are based on
the info in this mail, but just to make sure you are not hit by RocksDB
degradation:

Have you done an offline compaction of the OSDs after the conversion? We
have seen that degraded RocksDB can severely impact the performance. So
make sure the OSDs are compacted, i.e.:

stop osd processes: systemctl stop ceph-osd.target
df|grep "/var/lib/ceph/osd"|awk '{print $6}'|cut -d '-' -f 2|sort -n|xargs -n 1 -P 10 -I OSD ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-OSD compact

There can be a ton of other things happening of course. In that case try
to gather debug logs.

Gr. Stefan


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
From your response to Stefan I gather that one of the two damaged hosts 
has all OSDs down and unable to start. Is that correct? If so, you can 
reboot it with no problem and proceed with manual compaction [and other 
experiments] quite "safely" for the rest of the cluster.



On 10/6/2022 2:35 PM, Frank Schilder wrote:

Hi Igor,

I can't access these drives. They have an OSD- or LVM process hanging in 
D-state. Any attempt to do something with these gets stuck as well.

I somehow need to wait for recovery to finish and protect the still running 
OSDs from crashing similarly badly.

After we have full redundancy again and service is back, I can add the setting 
osd_compact_on_start=true and start rebooting servers. Right now I need to 
prevent the ship from sinking.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 13:28:11
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

IIUC the OSDs that expose "had timed out after 15" are failing to start
up. Is that correct, or did I miss something? I meant trying compaction
for them...


On 10/6/2022 2:27 PM, Frank Schilder wrote:

Hi Igor,

thanks for your response.


And what's the target Octopus release?

ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)

I'm afraid I don't have the luxury right now to take OSDs down or add extra 
load with an on-line compaction. I would really appreciate a way to make the 
OSDs more crash tolerant until I have full redundancy again. Is there a setting 
that increases the OPS timeout or is there a way to restrict the load to 
tolerable levels?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 13:15
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

you might want to compact RocksDB by ceph-kvstore-tool for those OSDs
which are showing

"heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 
15"



I could see such an error after bulk data removal and following severe
DB performance drop pretty often.

Thanks,
Igor

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx



[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor,

It has the SSD OSDs down; the HDD OSDs are running just fine. For now I don't 
want to make a bad situation worse and will wait for recovery to finish. The 
inactive PGs are activating very slowly.

By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are 
PGs even inactive here? This "feature" is new in octopus, I reported it about 2 
months ago as a bug. Testing with mimic I cannot reproduce this problem: 
https://tracker.ceph.com/issues/56995

I found this in the syslog, maybe it helps:

kernel: task:bstore_kv_sync  state:D stack:0 pid:3646032 ppid:3645340 
flags:0x
kernel: Call Trace:
kernel: __schedule+0x2a2/0x7e0
kernel: schedule+0x4e/0xb0
kernel: io_schedule+0x16/0x40
kernel: wait_on_page_bit_common+0x15c/0x3e0
kernel: ? __page_cache_alloc+0xb0/0xb0
kernel: wait_on_page_bit+0x3f/0x50
kernel: wait_on_page_writeback+0x26/0x70
kernel: __filemap_fdatawait_range+0x98/0x100
kernel: ? __filemap_fdatawrite_range+0xd8/0x110
kernel: file_fdatawait_range+0x1a/0x30
kernel: sync_file_range+0xc2/0xf0
kernel: ksys_sync_file_range+0x41/0x80
kernel: __x64_sys_sync_file_range+0x1e/0x30
kernel: do_syscall_64+0x3b/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7ffbb6f77ae7
kernel: RSP: 002b:7ffba478c3c0 EFLAGS: 0293 ORIG_RAX: 0115
kernel: RAX: ffda RBX: 002d RCX: 7ffbb6f77ae7
kernel: RDX: 2000 RSI: 00015f849000 RDI: 002d
kernel: RBP: 00015f849000 R08:  R09: 2000
kernel: R10: 0007 R11: 0293 R12: 2000
kernel: R13: 0007 R14: 0001 R15: 560a1ae20380
kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
kernel:  Tainted: GE 5.14.13-1.el7.elrepo.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.

It is quite possible that this was the moment when these OSDs got stuck and 
were marked down. The time stamp is about right.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 13:45:17
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

 From your response to Stefan I gather that one of the two damaged hosts
has all OSDs down and unable to start. Is that correct? If so, you can
reboot it with no problem and proceed with manual compaction [and other
experiments] quite "safely" for the rest of the cluster.


On 10/6/2022 2:35 PM, Frank Schilder wrote:
> Hi Igor,
>
> I can't access these drives. They have an OSD- or LVM process hanging in 
> D-state. Any attempt to do something with these gets stuck as well.
>
> I somehow need to wait for recovery to finish and protect the still running 
> OSDs from crashing similarly badly.
>
> After we have full redundancy again and service is back, I can add the 
> setting osd_compact_on_start=true and start rebooting servers. Right now I 
> need to prevent the ship from sinking.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 06 October 2022 13:28:11
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> IIUC the OSDs that expose "had timed out after 15" are failing to start
> up. Is that correct or I missed something?  I meant trying compaction
> for them...
>
>
> On 10/6/2022 2:27 PM, Frank Schilder wrote:
>> Hi Igor,
>>
>> thanks for your response.
>>
>>> And what's the target Octopus release?
>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
>> (stable)
>>
>> I'm afraid I don't have the luxury right now to take OSDs down or add extra 
>> load with an on-line compaction. I would really appreciate a way to make the 
>> OSDs more crash tolerant until I have full redundancy again. Is there a 
>> setting that increases the OPS timeout or is there a way to restrict the 
>> load to tolerable levels?
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Igor Fedotov 
>> Sent: 06 October 2022 13:15
>> To: Frank Schilder; ceph-users@ceph.io
>> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>>
>> Hi Frank,
>>
>> you might want to compact RocksDB by ceph-kvstore-tool for those OSDs
>> which are showing
>>
>> "heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed 
>> out after 15"
>>
>>
>>
>> I could see such an error after bulk data removal and following severe
>> DB performance drop pretty often.
>>
>> Thanks,
>> Igor
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Stefan Kooman

On 10/6/22 13:41, Frank Schilder wrote:

Hi Stefan,

thanks for looking at this. The conversion has happened on 1 host only. Status 
is:

- all daemons on all hosts upgraded
- all OSDs on 1 OSD-host were restarted with bluestore_fsck_quick_fix_on_mount 
= true in its local ceph.conf, these OSDs completed conversion and rebooted, I 
would assume that the freshly created OMAPs are compacted by default?


As far as I know it's not.


- unfortunately, the converted SSD-OSDs on this host died
- now SSD OSDs on other (un-converted) hosts also start crashing randomly and 
very badly (not possible to restart due to stuck D-state processes)

Does compaction even work properly on upgraded but unconverted OSDs?


We have done several measurements based on production data (clones of 
data disks from prod.), in this case the conversion from octopus to 
pacific (and the resharding as well). We would save half the time by 
compacting them beforehand. It would take, in our case, many hours to 
do a conversion, so it would pay off immensely. So yes, you can do this. 
Not sure if I have tested this on an Octopus conversion, but as the 
conversion to pacific involves a similar process it's safe to assume it 
will be the same.


Gr. Stefan


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

On 10/6/2022 2:55 PM, Frank Schilder wrote:

Hi Igor,

it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to 
make a bad situation worse for now and wait for recovery to finish. The 
inactive PGs are activating very slowly.

Got it.



By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even 
inactive here? This "feature" is new in octopus, I reported it about 2 months 
ago as a bug. Testing with mimic I cannot reproduce this problem: 
https://tracker.ceph.com/issues/56995


Not sure why you're talking about a replicated(!) 4(2) pool. In the above 
ticket I can see an EC 4+2 one (pool 4 'fs-data' erasure profile 
ec-4-2...). Which means 6 shards per object, and maybe this setup has 
some issues with mapping to unique OSDs within a host (just 3 hosts are 
available!) ... One can see that only pg 4.* are marked as inactive. 
Not a big expert in this stuff, so mostly just speculating.



Do you have the same setup in the production cluster in question? If so, 
then you lack 2 of 6 shards and IMO the cluster properly marks the 
relevant PGs as inactive. The same would apply to 3x replicated PGs as 
well, though, since two replicas are down...
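
To double-check what the pool actually tolerates, one could look at the pool 
and profile settings, e.g. (a sketch; pool and profile names taken from the 
tracker ticket):

ceph osd pool get fs-data min_size        # PGs go inactive once fewer than min_size shards are up
ceph osd erasure-code-profile get ec-4-2  # shows k, m and the crush-failure-domain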





I found this in the syslog, maybe it helps:

kernel: task:bstore_kv_sync  state:D stack:0 pid:3646032 ppid:3645340 
flags:0x
kernel: Call Trace:
kernel: __schedule+0x2a2/0x7e0
kernel: schedule+0x4e/0xb0
kernel: io_schedule+0x16/0x40
kernel: wait_on_page_bit_common+0x15c/0x3e0
kernel: ? __page_cache_alloc+0xb0/0xb0
kernel: wait_on_page_bit+0x3f/0x50
kernel: wait_on_page_writeback+0x26/0x70
kernel: __filemap_fdatawait_range+0x98/0x100
kernel: ? __filemap_fdatawrite_range+0xd8/0x110
kernel: file_fdatawait_range+0x1a/0x30
kernel: sync_file_range+0xc2/0xf0
kernel: ksys_sync_file_range+0x41/0x80
kernel: __x64_sys_sync_file_range+0x1e/0x30
kernel: do_syscall_64+0x3b/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7ffbb6f77ae7
kernel: RSP: 002b:7ffba478c3c0 EFLAGS: 0293 ORIG_RAX: 0115
kernel: RAX: ffda RBX: 002d RCX: 7ffbb6f77ae7
kernel: RDX: 2000 RSI: 00015f849000 RDI: 002d
kernel: RBP: 00015f849000 R08:  R09: 2000
kernel: R10: 0007 R11: 0293 R12: 2000
kernel: R13: 0007 R14: 0001 R15: 560a1ae20380
kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
kernel:  Tainted: GE 5.14.13-1.el7.elrepo.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.

It is quite possible that this was the moment when these OSDs got stuck and 
were marked down. The time stamp is about right.


Right, this is the primary thread which submits transactions to the DB, and it 
was stuck for >123 seconds. Given that the disk is completely unresponsive, I 
presume something has happened at a lower level (controller or disk FW) 
though... Maybe this was somehow caused by "fragmented" DB access, and 
compaction would heal this. On the other hand, the compaction had to be 
applied after the omap upgrade, so I'm not sure another one would change the 
state...
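
If the host is still reachable, it may be worth checking the device health 
from the OS side, e.g. (a sketch; the device name is just a placeholder):

dmesg -T | grep -iE 'reset|timeout|abort' | tail -n 50   # look for controller/disk resets
smartctl -a /dev/sdX                                     # SMART health and error log for the affected SSD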






Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 13:45:17
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

  From your response to Stefan I'm getting that one of two damaged hosts
has all OSDs down and unable to start. I that correct? If so you can
reboot it with no problem and proceed with manual compaction [and other
experiments] quite "safely" for the rest of the cluster.


On 10/6/2022 2:35 PM, Frank Schilder wrote:

Hi Igor,

I can't access these drives. They have an OSD- or LVM process hanging in 
D-state. Any attempt to do something with these gets stuck as well.

I somehow need to wait for recovery to finish and protect the still running 
OSDs from crashing similarly badly.

After we have full redundancy again and service is back, I can add the setting 
osd_compact_on_start=true and start rebooting servers. Right now I need to 
prevent the ship from sinking.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 13:28:11
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

IIUC the OSDs that expose "had timed out after 15" are failing to start
up. Is that correct or I missed something?  I meant trying compaction
for them...


On 10/6/2022 2:27 PM, Frank Schilder wrote:

Hi Igor,

thanks for your response.


And what's the target Octopus release?

ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)

I'm afraid I don't have the luxury right now to take OS

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor.

> Not sure why you're talking about replicated(!) 4(2) pool.

It's because in the production cluster it's the 4(2) pool that has that problem. 
On the test cluster it was an EC pool. It seems to affect all sorts of pools.

I just lost another disk, we have PGs down now. I really hope the stuck 
bstore_kv_sync thread does not lead to rocksdb corruption.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 14:26
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/6/2022 2:55 PM, Frank Schilder wrote:
> Hi Igor,
>
> it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to 
> make a bad situation worse for now and wait for recovery to finish. The 
> inactive PGs are activating very slowly.
Got it.

>
> By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are 
> PGs even inactive here? This "feature" is new in octopus, I reported it about 
> 2 months ago as a bug. Testing with mimic I cannot reproduce this problem: 
> https://tracker.ceph.com/issues/56995

Not sure why you're talking about replicated(!) 4(2) pool. In the above
ticket I can see EC 4+2 one (pool 4 'fs-data' erasure profile
ec-4-2...). Which means 6 shards per object and may be this setup has
some issues with mapping to unique osds within a host (just 3 hosts are
available!) ...  One can see that pg 4.* are marked as inactive only.
Not a big expert in this stuff so mostly just speculating


Do you have the same setup in the production cluster in question? If so
- then you lack 2 of 6 shards and IMO the cluster properly marks the
relevant PGs as inactive. The same would apply to 3x replicated PGs as
well though since two replicas are down..


>
> I found this in the syslog, maybe it helps:
>
> kernel: task:bstore_kv_sync  state:D stack:0 pid:3646032 ppid:3645340 
> flags:0x
> kernel: Call Trace:
> kernel: __schedule+0x2a2/0x7e0
> kernel: schedule+0x4e/0xb0
> kernel: io_schedule+0x16/0x40
> kernel: wait_on_page_bit_common+0x15c/0x3e0
> kernel: ? __page_cache_alloc+0xb0/0xb0
> kernel: wait_on_page_bit+0x3f/0x50
> kernel: wait_on_page_writeback+0x26/0x70
> kernel: __filemap_fdatawait_range+0x98/0x100
> kernel: ? __filemap_fdatawrite_range+0xd8/0x110
> kernel: file_fdatawait_range+0x1a/0x30
> kernel: sync_file_range+0xc2/0xf0
> kernel: ksys_sync_file_range+0x41/0x80
> kernel: __x64_sys_sync_file_range+0x1e/0x30
> kernel: do_syscall_64+0x3b/0x90
> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> kernel: RIP: 0033:0x7ffbb6f77ae7
> kernel: RSP: 002b:7ffba478c3c0 EFLAGS: 0293 ORIG_RAX: 0115
> kernel: RAX: ffda RBX: 002d RCX: 7ffbb6f77ae7
> kernel: RDX: 2000 RSI: 00015f849000 RDI: 002d
> kernel: RBP: 00015f849000 R08:  R09: 2000
> kernel: R10: 0007 R11: 0293 R12: 2000
> kernel: R13: 0007 R14: 0001 R15: 560a1ae20380
> kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
> kernel:  Tainted: GE 5.14.13-1.el7.elrepo.x86_64 #1
> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
> message.
>
> It is quite possible that this was the moment when these OSDs got stuck and 
> were marked down. The time stamp is about right.

Right. this is a primary thread which submits transactions to DB. And it
stuck for >123 seconds. Given that the disk is completely unresponsive I
presume something has happened at lower level (controller or disk FW)
though.. May be this was somehow caused by "fragmented" DB access and
compaction would heal this. On the other hand the compaction had to be
applied after omap upgrade so I'm not sure another one would change the
state...



>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 06 October 2022 13:45:17
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
>   From your response to Stefan I'm getting that one of two damaged hosts
> has all OSDs down and unable to start. I that correct? If so you can
> reboot it with no problem and proceed with manual compaction [and other
> experiments] quite "safely" for the rest of the cluster.
>
>
> On 10/6/2022 2:35 PM, Frank Schilder wrote:
>> Hi Igor,
>>
>> I can't access these drives. They have an OSD- or LVM process hanging in 
>> D-state. Any attempt to do something with these gets stuck as well.
>>
>> I somehow need to wait for recovery to finish and protect the still running 
>> OSDs from crashing similarly badly.
>>
>> After we have full redundancy again and service is back, I can add the 
>> setting osd_compact_on_start=true and start rebooting

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

Are the crashing OSDs still bound to two hosts?

If not, does every OSD that died necessarily mean its underlying disk is 
unavailable from now on?



On 10/6/2022 3:35 PM, Frank Schilder wrote:

Hi Igor.


Not sure why you're talking about replicated(!) 4(2) pool.

It's because in the production cluster it's the 4(2) pool that has that problem. 
On the test cluster it was an EC pool. It seems to affect all sorts of pools.

I just lost another disk, we have PGs down now. I really hope the stuck 
bstore_kv_sync thread does not lead to rocksdb corruption.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 14:26
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/6/2022 2:55 PM, Frank Schilder wrote:

Hi Igor,

it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to 
make a bad situation worse for now and wait for recovery to finish. The 
inactive PGs are activating very slowly.

Got it.


By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even 
inactive here? This "feature" is new in octopus, I reported it about 2 months 
ago as a bug. Testing with mimic I cannot reproduce this problem: 
https://tracker.ceph.com/issues/56995

Not sure why you're talking about replicated(!) 4(2) pool. In the above
ticket I can see EC 4+2 one (pool 4 'fs-data' erasure profile
ec-4-2...). Which means 6 shards per object and may be this setup has
some issues with mapping to unique osds within a host (just 3 hosts are
available!) ...  One can see that pg 4.* are marked as inactive only.
Not a big expert in this stuff so mostly just speculating


Do you have the same setup in the production cluster in question? If so
- then you lack 2 of 6 shards and IMO the cluster properly marks the
relevant PGs as inactive. The same would apply to 3x replicated PGs as
well though since two replicas are down..



I found this in the syslog, maybe it helps:

kernel: task:bstore_kv_sync  state:D stack:0 pid:3646032 ppid:3645340 
flags:0x
kernel: Call Trace:
kernel: __schedule+0x2a2/0x7e0
kernel: schedule+0x4e/0xb0
kernel: io_schedule+0x16/0x40
kernel: wait_on_page_bit_common+0x15c/0x3e0
kernel: ? __page_cache_alloc+0xb0/0xb0
kernel: wait_on_page_bit+0x3f/0x50
kernel: wait_on_page_writeback+0x26/0x70
kernel: __filemap_fdatawait_range+0x98/0x100
kernel: ? __filemap_fdatawrite_range+0xd8/0x110
kernel: file_fdatawait_range+0x1a/0x30
kernel: sync_file_range+0xc2/0xf0
kernel: ksys_sync_file_range+0x41/0x80
kernel: __x64_sys_sync_file_range+0x1e/0x30
kernel: do_syscall_64+0x3b/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7ffbb6f77ae7
kernel: RSP: 002b:7ffba478c3c0 EFLAGS: 0293 ORIG_RAX: 0115
kernel: RAX: ffda RBX: 002d RCX: 7ffbb6f77ae7
kernel: RDX: 2000 RSI: 00015f849000 RDI: 002d
kernel: RBP: 00015f849000 R08:  R09: 2000
kernel: R10: 0007 R11: 0293 R12: 2000
kernel: R13: 0007 R14: 0001 R15: 560a1ae20380
kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
kernel:  Tainted: GE 5.14.13-1.el7.elrepo.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.

It is quite possible that this was the moment when these OSDs got stuck and 
were marked down. The time stamp is about right.

Right. this is a primary thread which submits transactions to DB. And it
stuck for >123 seconds. Given that the disk is completely unresponsive I
presume something has happened at lower level (controller or disk FW)
though.. May be this was somehow caused by "fragmented" DB access and
compaction would heal this. On the other hand the compaction had to be
applied after omap upgrade so I'm not sure another one would change the
state...




Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 13:45:17
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

   From your response to Stefan I gather that one of the two damaged hosts
has all OSDs down and unable to start. Is that correct? If so, you can
reboot it with no problem and proceed with manual compaction [and other
experiments] quite "safely" for the rest of the cluster.


On 10/6/2022 2:35 PM, Frank Schilder wrote:

Hi Igor,

I can't access these drives. They have an OSD- or LVM process hanging in 
D-state. Any attempt to do something with these gets stuck as well.

I somehow need to wait for recovery to finish and protect the still running 
OSDs from crashing similarly badly.

After we have full redundancy again and service is back, I can add the setting 
o

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov



On 10/6/2022 3:16 PM, Stefan Kooman wrote:

On 10/6/22 13:41, Frank Schilder wrote:

Hi Stefan,

thanks for looking at this. The conversion has happened on 1 host 
only. Status is:


- all daemons on all hosts upgraded
- all OSDs on 1 OSD-host were restarted with 
bluestore_fsck_quick_fix_on_mount = true in its local ceph.conf, 
these OSDs completed conversion and rebooted, I would assume that the 
freshly created OMAPs are compacted by default?


As far as I know it's not.


According to https://tracker.ceph.com/issues/51711 compaction is applied 
after OMAP upgrade starting v15.2.14






- unfortunately, the converted SSD-OSDs on this host died
- now SSD OSDs on other (un-converted) hosts also start crashing 
randomly and very badly (not possible to restart due to stuck D-state 
processes)


Does compaction even work properly on upgraded but unconverted OSDs?


Yes, compaction is available irrespective of the data format which the OSD 
uses for keeping data in the DB. Hence both converted and unconverted OSDs can 
benefit from it.
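
Besides the offline path via ceph-kvstore-tool that Stefan showed, an online 
compaction of a running OSD is also possible, e.g. (a sketch; it adds load 
while it runs):

ceph tell osd.<ID> compact   # trigger a RocksDB compaction on a running OSD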



We have done several measurements based on production data (clones of 
data disks from prod.). In this case the conversion from octopus to 
pacific, and the resharding as well). We would save half the time by 
compacting them before hand. It would take, in our case, many hours to 
do a conversion, so it would pay off immensely. So yes, you can do 
this. Not sure if I have tested this on Octopus conversion, but as the 
conversion to pacific involves a similar process it's safe to assume 
it will be the same.


Gr. Stefan


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx



[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor and Stefan.

> > Not sure why you're talking about replicated(!) 4(2) pool.
> 
> It's because in the production cluster it's the 4(2) pool that has that 
> problem. On the test cluster it was an EC pool. Seems to affect all sorts 
> of pools.

I have to take this one back. It is indeed an EC pool that is also on these SSD 
OSDs that is affected. The meta-data pool was all active all the time until we 
lost the 3rd host. So, the bug reported is confirmed to affect EC pools.

> If not - does any died OSD unconditionally mean its underlying disk is
> unavailable any more?

Fortunately not. After losing disks on the 3rd host, we had to start taking 
somewhat more desperate measures. We set the file system off-line to stop 
client IO and started rebooting hosts in reverse order of failing. This brought 
back the OSDs on the still un-converted hosts. We rebooted the converted host 
with the original fail of OSDs last. Unfortunately, here it seems we lost a 
drive for good. It looks like the OSDs crashed while the conversion was going 
on or something. They don't boot up and I need to look into that in more detail.

We are currently trying to encourage fs clients to reconnect to the file 
system. Unfortunately, on many we get

# ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle

Is there a server-side way to encourage the FS clients to reconnect to the 
cluster? What is a clean way to get them back onto the file system? I tried 
remounts without success.
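
That is, something along the lines of the following (assuming the mount point 
is defined in fstab):

umount -f /shares/nfs/ait_pnora01 || umount -l /shares/nfs/ait_pnora01   # force, then lazy unmount
mount /shares/nfs/ait_pnora01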

Before executing the next conversion, I will compact the rocksdb on all SSD 
OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number 
of objects per PG, which is potentially the main reason for our observations.

Thanks for your help,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 14:39
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Are crashing OSDs still bound to two hosts?

If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?


On 10/6/2022 3:35 PM, Frank Schilder wrote:
> Hi Igor.
>
>> Not sure why you're talking about replicated(!) 4(2) pool.
> Its because in the production cluster its the 4(2) pool that has that 
> problem. On the test cluster it was an EC pool. Seems to affect all sorts of 
> pools.
>
> I just lost another disk, we have PGs down now. I really hope the stuck 
> bstore_kv_sync thread does not lead to rocksdb corruption.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 06 October 2022 14:26
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> On 10/6/2022 2:55 PM, Frank Schilder wrote:
>> Hi Igor,
>>
>> it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want 
>> to make a bad situation worse for now and wait for recovery to finish. The 
>> inactive PGs are activating very slowly.
> Got it.
>
>> By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why 
>> are PGs even inactive here? This "feature" is new in octopus, I reported it 
>> about 2 months ago as a bug. Testing with mimic I cannot reproduce this 
>> problem: https://tracker.ceph.com/issues/56995
> Not sure why you're talking about replicated(!) 4(2) pool. In the above
> ticket I can see EC 4+2 one (pool 4 'fs-data' erasure profile
> ec-4-2...). Which means 6 shards per object and may be this setup has
> some issues with mapping to unique osds within a host (just 3 hosts are
> available!) ...  One can see that pg 4.* are marked as inactive only.
> Not a big expert in this stuff so mostly just speculating
>
>
> Do you have the same setup in the production cluster in question? If so
> - then you lack 2 of 6 shards and IMO the cluster properly marks the
> relevant PGs as inactive. The same would apply to 3x replicated PGs as
> well though since two replicas are down..
>
>
>> I found this in the syslog, maybe it helps:
>>
>> kernel: task:bstore_kv_sync  state:D stack:0 pid:3646032 ppid:3645340 
>> flags:0x
>> kernel: Call Trace:
>> kernel: __schedule+0x2a2/0x7e0
>> kernel: schedule+0x4e/0xb0
>> kernel: io_schedule+0x16/0x40
>> kernel: wait_on_page_bit_common+0x15c/0x3e0
>> kernel: ? __page_cache_alloc+0xb0/0xb0
>> kernel: wait_on_page_bit+0x3f/0x50
>> kernel: wait_on_page_writeback+0x26/0x70
>> kernel: __filemap_fdatawait_range+0x98/0x100
>> kernel: ? __filemap_fdatawrite_range+0xd8/0x110
>> kernel: file_fdatawait_range+0x1a/0x30
>> kernel: sync_file_range+0xc2/0xf0
>> kernel: ksys_sync_file_range+0x41/0x80
>> kernel: __x64_sys_sync_file_range+0x1e/0x30
>> kernel: do_syscall_64+0x3b/0x90
>> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Stefan Kooman

On 10/6/22 16:12, Frank Schilder wrote:

Hi Igor and Stefan.


Not sure why you're talking about replicated(!) 4(2) pool.


It's because in the production cluster it's the 4(2) pool that has that problem. 
On the test cluster it was an EC pool. Seems to affect all sorts of pools.


I have to take this one back. It is indeed an EC pool that is also on these SSD 
OSDs that is affected. The meta-data pool was all active all the time until we 
lost the 3rd host. So, the bug reported is confirmed to affect EC pools.


If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?


Fortunately not. After losing disks on the 3rd host, we had to start taking 
somewhat more desperate measures. We set the file system off-line to stop 
client IO and started rebooting hosts in reverse order of failing. This brought 
back the OSDs on the still un-converted hosts. We rebooted the converted host 
with the original fail of OSDs last. Unfortunately, here it seems we lost a 
drive for good. It looks like the OSDs crashed while the conversion was going 
on or something. They don't boot up and I need to look into that with more 
detail.


Maybe don't do the online conversion, but opt for an offline one? That way you 
can inspect whether it works or not. Time-wise it hardly matters (online 
conversions used to be much slower, but that is not the case anymore). 
If an already upgraded OSD reboots (because it crashed, for example), it 
will immediately do the conversion. It might be better to have a bit 
more control over it and do it manually.
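
A sketch of the offline variant for a single OSD, assuming the default data 
path and that the quick-fix command is available in this release:

systemctl stop ceph-osd@<ID>
ceph-bluestore-tool quick-fix --path /var/lib/ceph/osd/ceph-<ID>   # offline omap conversion/repair
systemctl start ceph-osd@<ID>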


We recently observed that OSDs that are restarted might take some time 
to do their standard RocksDB compactions. We therefore set the "noup" 
flag to give them some time to do the housekeeping and only after that 
finishes unset the noup flag. It helped prevent a lot of slow ops we 
would have had otherwise. It might help in this case as well.
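
Roughly (a sketch):

ceph osd set noup                   # restarted OSDs stay marked down until the flag is cleared
systemctl restart ceph-osd.target   # or restart the OSDs one by one
# wait for compaction/housekeeping to settle, then
ceph osd unset noup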




We are currently trying to encourage fs clients to reconnect to the file 
system. Unfortunately, on many we get

# ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle

Is there a server-sided way to encourage the FS clients to reconnect to the 
cluster? What is a clean way to get them back onto the file system? I tried a 
remounts without success.


Not that I know of. You probably need to reboot those hosts.



Before executing the next conversion, I will compact the rocksdb on all SSD 
OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number 
of objects per PG, which is potentially the main reason for our observations.


Yup, pretty much certain that's the reason. Nowadays one of our default 
maintenance routines before doing upgrades / conversions, etc. is to do 
offline compaction of all OSDs beforehand.


I hope it helps.


Gr. Stefan


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

Sorry - no clue about CephFS related questions...

But could you please share the full OSD startup log for any one which is 
unable to restart after the host reboot?



On 10/6/2022 5:12 PM, Frank Schilder wrote:

Hi Igor and Stefan.


Not sure why you're talking about replicated(!) 4(2) pool.

It's because in the production cluster it's the 4(2) pool that has that problem. 
On the test cluster it was an EC pool. Seems to affect all sorts of pools.

I have to take this one back. It is indeed an EC pool that is also on these SSD 
OSDs that is affected. The meta-data pool was all active all the time until we 
lost the 3rd host. So, the bug reported is confirmed to affect EC pools.


If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?

Fortunately not. After losing disks on the 3rd host, we had to start taking 
somewhat more desperate measures. We set the file system off-line to stop 
client IO and started rebooting hosts in reverse order of failing. This brought 
back the OSDs on the still un-converted hosts. We rebooted the converted host 
with the original fail of OSDs last. Unfortunately, here it seems we lost a 
drive for good. It looks like the OSDs crashed while the conversion was going 
on or something. They don't boot up and I need to look into that with more 
detail.

We are currently trying to encourage fs clients to reconnect to the file 
system. Unfortunately, on many we get

# ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle

Is there a server-sided way to encourage the FS clients to reconnect to the 
cluster? What is a clean way to get them back onto the file system? I tried a 
remounts without success.

Before executing the next conversion, I will compact the rocksdb on all SSD 
OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number 
of objects per PG, which is potentially the main reason for our observations.

Thanks for your help,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 14:39
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Are crashing OSDs still bound to two hosts?

If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?


On 10/6/2022 3:35 PM, Frank Schilder wrote:

Hi Igor.


Not sure why you're talking about replicated(!) 4(2) pool.

Its because in the production cluster its the 4(2) pool that has that problem. 
On the test cluster it was an EC pool. Seems to affect all sorts of pools.

I just lost another disk, we have PGs down now. I really hope the stuck 
bstore_kv_sync thread does not lead to rocksdb corruption.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 14:26
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/6/2022 2:55 PM, Frank Schilder wrote:

Hi Igor,

it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to 
make a bad situation worse for now and wait for recovery to finish. The 
inactive PGs are activating very slowly.

Got it.


By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even 
inactive here? This "feature" is new in octopus, I reported it about 2 months 
ago as a bug. Testing with mimic I cannot reproduce this problem: 
https://tracker.ceph.com/issues/56995

Not sure why you're talking about replicated(!) 4(2) pool. In the above
ticket I can see EC 4+2 one (pool 4 'fs-data' erasure profile
ec-4-2...). Which means 6 shards per object and may be this setup has
some issues with mapping to unique osds within a host (just 3 hosts are
available!) ...  One can see that pg 4.* are marked as inactive only.
Not a big expert in this stuff so mostly just speculating


Do you have the same setup in the production cluster in question? If so
- then you lack 2 of 6 shards and IMO the cluster properly marks the
relevant PGs as inactive. The same would apply to 3x replicated PGs as
well though since two replicas are down..



I found this in the syslog, maybe it helps:

kernel: task:bstore_kv_sync  state:D stack:0 pid:3646032 ppid:3645340 
flags:0x
kernel: Call Trace:
kernel: __schedule+0x2a2/0x7e0
kernel: schedule+0x4e/0xb0
kernel: io_schedule+0x16/0x40
kernel: wait_on_page_bit_common+0x15c/0x3e0
kernel: ? __page_cache_alloc+0xb0/0xb0
kernel: wait_on_page_bit+0x3f/0x50
kernel: wait_on_page_writeback+0x26/0x70
kernel: __filemap_fdatawait_range+0x98/0x100
kernel: ? __filemap_fdatawrite_range+0xd8/0x110
kernel: file_fdatawait_range+0x1a/0x30
kernel: sync_file_range+0xc2/0xf0
kernel: ksys_sync_file_range+0x41/0x80
kernel: __x64_sys_sync_file_range+0x1e/0x30
kernel: do_syscall_64+0x3b/0x90
kerne

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor.

> But could you please share full OSD startup log for any one which is
> unable to restart after host reboot?

Will do. I also would like to know what happened here and if it is possible to 
recover these OSDs. The rebuild takes ages with the current throttled recovery 
settings.

> Sorry - no clue about CephFS related questions...

Just for the general audience: in the past we did cluster maintenance by 
setting "ceph fs set FS down true" (freezing all client IO in D-state), waiting 
for all MDSes to become standby and doing the job. After that, we set "ceph fs 
set FS down false", the MDSes started again, all clients connected more or less 
instantaneously and continued exactly at the point where they were frozen.

This time, a huge number of clients just crashed instead of freezing, and of the 
few that remained up only a small number reconnected. This is, in our 
experience, very unusual behaviour. Was there a change, or are we looking at a 
potential bug here?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 17:03
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

But could you please share full OSD startup log for any one which is
unable to restart after host reboot?


On 10/6/2022 5:12 PM, Frank Schilder wrote:
> Hi Igor and Stefan.
>
>>> Not sure why you're talking about replicated(!) 4(2) pool.
>> Its because in the production cluster its the 4(2) pool that has that 
>> problem. On the test cluster it was an > > EC pool. Seems to affect all 
>> sorts of pools.
> I have to take this one back. It is indeed an EC pool that is also on these 
> SSD OSDs that is affected. The meta-data pool was all active all the time 
> until we lost the 3rd host. So, the bug reported is confirmed to affect EC 
> pools.
>
>> If not - does any died OSD unconditionally mean its underlying disk is
>> unavailable any more?
> Fortunately not. After loosing disks on the 3rd host, we had to start taking 
> somewhat more desperate measures. We set the file system off-line to stop 
> client IO and started rebooting hosts in reverse order of failing. This 
> brought back the OSDs on the still un-converted hosts. We rebooted the 
> converted host with the original fail of OSDs last. Unfortunately, here it 
> seems we lost a drive for good. It looks like the OSDs crashed while the 
> conversion was going on or something. They don't boot up and I need to look 
> into that with more detail.
>
> We are currently trying to encourage fs clients to reconnect to the file 
> system. Unfortunately, on many we get
>
> # ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
> ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle
>
> Is there a server-sided way to encourage the FS clients to reconnect to the 
> cluster? What is a clean way to get them back onto the file system? I tried a 
> remounts without success.
>
> Before executing the next conversion, I will compact the rocksdb on all SSD 
> OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high 
> number of objects per PG, which is potentially the main reason for our 
> observations.
>
> Thanks for your help,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 06 October 2022 14:39
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> Are crashing OSDs still bound to two hosts?
>
> If not - does any died OSD unconditionally mean its underlying disk is
> unavailable any more?
>
>
> On 10/6/2022 3:35 PM, Frank Schilder wrote:
>> Hi Igor.
>>
>>> Not sure why you're talking about replicated(!) 4(2) pool.
>> Its because in the production cluster its the 4(2) pool that has that 
>> problem. On the test cluster it was an EC pool. Seems to affect all sorts of 
>> pools.
>>
>> I just lost another disk, we have PGs down now. I really hope the stuck 
>> bstore_kv_sync thread does not lead to rocksdb corruption.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Igor Fedotov 
>> Sent: 06 October 2022 14:26
>> To: Frank Schilder; ceph-users@ceph.io
>> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>>
>> On 10/6/2022 2:55 PM, Frank Schilder wrote:
>>> Hi Igor,
>>>
>>> it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want 
>>> to make a bad situation worse for now and wait for recovery to finish. The 
>>> inactive PGs are activating very slowly.
>> Got it.
>>
>>> By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why 
>>> are PGs even inactive here? This "feature" is new in octopus, I reported

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Stefan,

to answer your question as well:

> ... conversion from octopus to
> pacific, and the resharding as well). We would save half the time by
> compacting them before hand. It would take, in our case, many hours to
> do a conversion, so it would pay off immensely. ...

With experiments on our test cluster I expected the HDD conversion to take 9 
hours and the SSD conversion around 20 minutes. I was really surprised that the 
HDD conversion took only a few minutes. The SSDs took probably 10-20 minutes, 
but I'm not sure, because I don't remember having seen them come up. If they 
were up, then only for a very short moment.

Given this short conversion time I was already celebrating that all our hosts 
could be done in 2 days. Well. Now we managed the first one in 24-48 hours 
:( I really hope the next ones go more smoothly.

So, for us the on-line conversion actually worked great ... until the unconverted 
OSDs started crashing. Things have been stable for about an hour now. I really hope 
nothing more crashes. Recovery will likely take more than 24 hours. A long way 
to go in such a fragile situation.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 06 October 2022 14:16
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Stefan and anyone else reading this, we are probably misunderstanding each 
other here:

> There is a strict MDS maintenance dance you have to perform [1].
> ...
> [1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/

Our ceph fs shut-down was *after* completing the upgrade to octopus, *not part 
of it*. We are not in the middle of the upgrade procedure [1], we are done with 
it (with the bluestore_fsck_quick_fix_on_mount = false version). As I explained 
at the beginning of this thread, all our daemons are on octopus:

# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "osd": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1046
    },
    "mds": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 12
    },
    "overall": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1068
    }
}

and the upgrade has been finalised with

ceph osd require-osd-release octopus

and enabling v2 for the monitors.
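
For completeness, that finalisation amounted to essentially these two commands (the second being the standard way to enable the v2 protocol on the monitors):

ceph osd require-osd-release octopus
ceph mon enable-msgr2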

The conversion I'm talking about happens *after* the complete upgrade, at which 
point I would expect the system to behave normally. This includes FS maintenance, 
shutdown and startup. Ceph fs clients should not crash on "ceph fs set XYZ 
down true", they should freeze. Etc.

It's just the omap conversion that was postponed to post-upgrade as explained in 
[1], nothing else.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 06 October 2022 18:12:09
To: Frank Schilder; Igor Fedotov; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/6/22 17:22, Frank Schilder wrote:

>
> Just for the general audience. In the past we did cluster maintenance by 
> setting "ceph fs set FS down true" (freezing all client IO in D-state), 
> waited for all MDSes becoming standby and doing the job. After that, we set 
> "ceph fs set FS down false", the MDSes started again, all clients connected 
> more or less instantaneously and continued exactly at the point where they 
> were frozen.
>
> This time, a huge number of clients just crashed instead of freezing and of 
> the few ones that remained up only a small number reconnected. This is in our 
> experience very unusual behaviour. Was there a change or are we looking at a 
> potential bug here?

There is a strict MDS maintenance dance you have to perform [1], in order
to avoid MDSes committing suicide, for example. We would just have
the last remaining "up:active" MDS restart, and as soon as it became
up:active again all clients would reconnect virtually instantly, even if
rejoining had taken 3.5 minutes. Especially for MDSes I would not
deviate from best practices.

Gr. Stefan

[1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor,

the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one crashing the 
show. I collected its startup log here: https://pastebin.com/25D3piS6 . The 
line sticking out is line 603:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc:
 2931: ceph_abort_msg("bluefs enospc")

This smells a lot like rocksdb corruption. Can I do something about that? I 
still need to convert most of our OSDs and I cannot afford to lose more. The 
rebuild simply takes too long in the current situation.

Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 17:03:53
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

But could you please share full OSD startup log for any one which is
unable to restart after host reboot?


On 10/6/2022 5:12 PM, Frank Schilder wrote:
> Hi Igor and Stefan.
>
>>> Not sure why you're talking about replicated(!) 4(2) pool.
>> Its because in the production cluster its the 4(2) pool that has that 
>> problem. On the test cluster it was an > > EC pool. Seems to affect all 
>> sorts of pools.
> I have to take this one back. It is indeed an EC pool that is also on these 
> SSD OSDs that is affected. The meta-data pool was all active all the time 
> until we lost the 3rd host. So, the bug reported is confirmed to affect EC 
> pools.
>
>> If not - does any died OSD unconditionally mean its underlying disk is
>> unavailable any more?
> Fortunately not. After loosing disks on the 3rd host, we had to start taking 
> somewhat more desperate measures. We set the file system off-line to stop 
> client IO and started rebooting hosts in reverse order of failing. This 
> brought back the OSDs on the still un-converted hosts. We rebooted the 
> converted host with the original fail of OSDs last. Unfortunately, here it 
> seems we lost a drive for good. It looks like the OSDs crashed while the 
> conversion was going on or something. They don't boot up and I need to look 
> into that with more detail.
>
> We are currently trying to encourage fs clients to reconnect to the file 
> system. Unfortunately, on many we get
>
> # ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
> ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle
>
> Is there a server-sided way to encourage the FS clients to reconnect to the 
> cluster? What is a clean way to get them back onto the file system? I tried a 
> remounts without success.
>
> Before executing the next conversion, I will compact the rocksdb on all SSD 
> OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high 
> number of objects per PG, which is potentially the main reason for our 
> observations.
>
> Thanks for your help,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 06 October 2022 14:39
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> Are crashing OSDs still bound to two hosts?
>
> If not - does any died OSD unconditionally mean its underlying disk is
> unavailable any more?
>
>
> On 10/6/2022 3:35 PM, Frank Schilder wrote:
>> Hi Igor.
>>
>>> Not sure why you're talking about replicated(!) 4(2) pool.
>> Its because in the production cluster its the 4(2) pool that has that 
>> problem. On the test cluster it was an EC pool. Seems to affect all sorts of 
>> pools.
>>
>> I just lost another disk, we have PGs down now. I really hope the stuck 
>> bstore_kv_sync thread does not lead to rocksdb corruption.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Igor Fedotov 
>> Sent: 06 October 2022 14:26
>> To: Frank Schilder; ceph-users@ceph.io
>> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>>
>> On 10/6/2022 2:55 PM, Frank Schilder wrote:
>>> Hi Igor,
>>>
>>> it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want 
>>> to make a bad situation worse for now and wait for recovery to finish. The 
>>> inactive PGs are activating very slowly.
>> Got it.
>>
>>> By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why 
>>> are PGs even inactive here? This "feature" is new in octopus, I reported it 
>>> about 2 months ago as a bug. Testing with mimic I cannot reproduce this 
>>> problem: https://tracker.ceph.com/issues/56995
>> Not sure why you're talking about replicated(!) 4(2) pool. In the above
>> ticket I can see EC 4+2 one (pool 4 'fs-data' erasure profile
>> ec-4-2...). Which means

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov

Hi Frank,

the abort message "bluefs enospc" indicates a lack of free space for 
additional bluefs space allocations, which prevents the OSD from starting up.


From the following log line one can see that bluefs needs ~1M more 
space while the total available is approx 622M. The problem is that 
bluefs needs contiguous(!) 64K chunks though, which apparently aren't 
available due to high disk fragmentation.


    -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1 
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to 
allocate on 0x11 min_size 0x11 > allocated total 0x3 
bluefs_shared_alloc_size 0x1 allocated 0x3 available 0x 25134000



To double-check the above root cause analysis it would be helpful to get 
ceph-bluestore-tool's free_dump command output - there is a small chance of 
a bug in the allocator which "misses" some long-enough chunks. But given the 
disk space utilization (>90%) and the pretty small disk size this is 
unlikely IMO.
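
Something along these lines should produce it (a sketch; the OSD must be stopped and the path adjusted to the failing one):

ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-16 > osd.16-free-dump.json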


So to work around the issue and bring the OSD up you should either expand 
the OSD's main device or add a standalone DB volume.
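
For example, attaching a separate DB volume could look roughly like this (a sketch; the target device is a placeholder and the OSD must be stopped):

ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-16 --dev-target /dev/<new-db-device>

or, after enlarging the underlying device/LV of the main volume:

ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-16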



Curious whether other non-starting OSDs report the same error...


Thanks,

Igor



On 10/7/2022 1:02 AM, Frank Schilder wrote:

Hi Igor,

the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one crashing the 
show. I collected its startup log here: https://pastebin.com/25D3piS6 . The 
line sticking out is line 603:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc:
 2931: ceph_abort_msg("bluefs enospc")

This smells a lot like rocksdb corruption. Can I do something about that? I 
still need to convert most of our OSDs and I cannot afford to loose more. The 
rebuild simply takes too long in the current situation.

Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 17:03:53
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

But could you please share full OSD startup log for any one which is
unable to restart after host reboot?


On 10/6/2022 5:12 PM, Frank Schilder wrote:

Hi Igor and Stefan.


Not sure why you're talking about replicated(!) 4(2) pool.

Its because in the production cluster its the 4(2) pool that has that problem. On the 
test cluster it was an > > EC pool. Seems to affect all sorts of pools.

I have to take this one back. It is indeed an EC pool that is also on these SSD 
OSDs that is affected. The meta-data pool was all active all the time until we 
lost the 3rd host. So, the bug reported is confirmed to affect EC pools.


If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?

Fortunately not. After loosing disks on the 3rd host, we had to start taking 
somewhat more desperate measures. We set the file system off-line to stop 
client IO and started rebooting hosts in reverse order of failing. This brought 
back the OSDs on the still un-converted hosts. We rebooted the converted host 
with the original fail of OSDs last. Unfortunately, here it seems we lost a 
drive for good. It looks like the OSDs crashed while the conversion was going 
on or something. They don't boot up and I need to look into that with more 
detail.

We are currently trying to encourage fs clients to reconnect to the file 
system. Unfortunately, on many we get

# ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle

Is there a server-sided way to encourage the FS clients to reconnect to the 
cluster? What is a clean way to get them back onto the file system? I tried a 
remounts without success.

Before executing the next conversion, I will compact the rocksdb on all SSD 
OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number 
of objects per PG, which is potentially the main reason for our observations.

Thanks for your help,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 14:39
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Are crashing OSDs still bound to two hosts?

If not - does any died OSD unconditionally mean its underlying disk is
unavailable any more?


On 10/6/2022 3:35 PM, Frank Schilder wrote:

Hi Igor.


Not sure why you're talking about replicated(!) 4(2) pool.

Its because in the production cluster its the 4(2) pool that has that problem. 
On the test cluster it was an EC pool. Seems to affect all sorts of pools.

I just lost another disk, we have PGs down now. I really hope the stuck 
bstore_kv_s

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor,

I suspect there is something wrong with the data reported. These OSDs are only 
50-60% used. For example:

ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29  ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44  ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58  ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984  ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M more. It 
sounds like some stats data/counters are incorrect/corrupted. Is it possible to 
run an fsck on a bluestore device to have it checked for that? Any idea how an 
incorrect utilisation might come about?

I will look into starting these OSDs individually. This will be a bit of work 
as our deployment method is to start/stop all OSDs sharing the same disk 
simultaneously (OSDs are grouped by disk). If one fails, all others also go 
down. It's for simplifying disk management, and this debugging is a new use case 
we never needed before.
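
(For what it's worth, starting one of them by hand outside the grouped unit would be something like the following, assuming plain systemd-managed OSDs:)

systemctl start ceph-osd@16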

Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for
additional bluefs space allocations which prevents osd from startup.

 From the following log line one can see that bluefs needs ~1M more
space while the total available one is approx 622M. the problem is that
bluefs needs continuous(!) 64K chunks though. Which apparently aren't
available due to high disk fragmentation.

 -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
allocate on 0x11 min_size 0x11 > allocated total 0x3
bluefs_shared_alloc_size 0x1 allocated 0x3 available 0x 25134000


To double check the above root cause analysis it would be helpful to get
ceph-bluestore-tool's free_dump command output - small chances there is
a bug in allocator which "misses" some long enough chunks. But given
disk space utilization (>90%) and pretty small disk size this is
unlikely IMO.

So to work around the issue and bring OSD up you should either expand
the main device for OSD or add standalone DB volume.


Curious whether other non-starting OSDs report the same error...


Thanks,

Igor



On 10/7/2022 1:02 AM, Frank Schilder wrote:
> Hi Igor,
>
> the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one crashing 
> the show. I collected its startup log here: https://pastebin.com/25D3piS6 . 
> The line sticking out is line 603:
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc:
>  2931: ceph_abort_msg("bluefs enospc")
>
> This smells a lot like rocksdb corruption. Can I do something about that? I 
> still need to convert most of our OSDs and I cannot afford to loose more. The 
> rebuild simply takes too long in the current situation.
>
> Thanks for your help and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 06 October 2022 17:03:53
> To: Frank Schilder; ceph-users@ceph.io
> Cc: Stefan Kooman
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> Sorry - no clue about CephFS related questions...
>
> But could you please share full OSD startup log for any one which is
> unable to restart after host reboot?
>
>
> On 10/6/2022 5:12 PM, Frank Schilder wrote:
>> Hi Igor and Stefan.
>>
 Not sure why you're talking about replicated(!) 4(2) pool.
>>> Its because in the production cluster its the 4(2) pool that has that 
>>> problem. On the test cluster it was an > > EC pool. Seems to affect all 
>>> sorts of pools.
>> I have to take this one back. It is indeed an EC pool that is also on these 
>> SSD OSDs that is affected. The meta-data pool was all active all the time 
>> until we lost the 3rd host. So, the bug reported is confirmed to affect EC 
>> pools.
>>
>>> If not - does any died OSD unconditionally mean its underlying disk is
>>> unavailable any more?
>> Fortunately not. After loosing disks on the 3rd host, we had to start taking 
>> somewhat more

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
The log I inspected was for osd.16, so please share that OSD's 
utilization... And honestly, I trust the allocator's stats more, so it's 
rather the CLI stats that are incorrect, if any. Anyway, the free dump should 
provide additional proof.


And once again - do other non-starting OSDs show the same ENOSPC error?  
Evidently I'm unable to make any generalization about the root cause due 
to lack of info...



W.r.t. fsck - you can try to run it - since fsck opens the DB in read-only 
mode there is some chance it will work.
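
E.g. something like this (a sketch, with the OSD stopped):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-16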



Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs are only 
50-60% used. For example:

ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29  ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44  ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58  ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984  ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M more. It 
sounds like some stats data/counters are incorrect/corrupted. Is it possible to 
run an fsck on a bluestore device to have it checked for that? Any idea how an 
incorrect utilisation might come about?

I will look into starting these OSDs individually. This will be a bit of work 
as our deployment method is to start/stop all OSDs sharing the same disk 
simultaneously (OSDs are grouped by disk). If one fails all others also go 
down. Its for simplifying disk management and this debugging is a new use case 
we never needed before.

Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for
additional bluefs space allocations which prevents osd from startup.

  From the following log line one can see that bluefs needs ~1M more
space while the total available one is approx 622M. the problem is that
bluefs needs continuous(!) 64K chunks though. Which apparently aren't
available due to high disk fragmentation.

  -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
allocate on 0x11 min_size 0x11 > allocated total 0x3
bluefs_shared_alloc_size 0x1 allocated 0x3 available 0x 25134000


To double check the above root cause analysis it would be helpful to get
ceph-bluestore-tool's free_dump command output - small chances there is
a bug in allocator which "misses" some long enough chunks. But given
disk space utilization (>90%) and pretty small disk size this is
unlikely IMO.

So to work around the issue and bring OSD up you should either expand
the main device for OSD or add standalone DB volume.


Curious whether other non-starting OSDs report the same error...


Thanks,

Igor



On 10/7/2022 1:02 AM, Frank Schilder wrote:

Hi Igor,

the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one crashing the 
show. I collected its startup log here: https://pastebin.com/25D3piS6 . The 
line sticking out is line 603:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc:
 2931: ceph_abort_msg("bluefs enospc")

This smells a lot like rocksdb corruption. Can I do something about that? I 
still need to convert most of our OSDs and I cannot afford to loose more. The 
rebuild simply takes too long in the current situation.

Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 17:03:53
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

But could you please share full OSD startup log for any one which is
unable to restart after host reboot?


On 10/6/2022 5:12 PM, Frank Schilder wrote:

Hi Igor and Stefan.


Not sure why you're talking about replicated(!) 4(2) pool.

Its because in the production cluster its the 4(2) pool that has that problem. On the 
test cluster it was 

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
Well, I've just realized that you're apparently unable to collect these 
high-level stats for the broken OSDs, aren't you?


But if that's the case you shouldn't make any assumptions about faulty 
OSDs' utilization from healthy ones - it's definitely a very doubtful 
approach ;)




On 10/7/2022 2:19 AM, Igor Fedotov wrote:
The log I inspected was for osd.16  so please share that OSD 
utilization... And honestly I trust allocator's stats more so it's 
rather CLI stats are incorrect if any. Anyway free dump should provide 
additional proofs..


And once again - do other non-starting OSDs show the same ENOSPC 
error?  Evidently I'm unable to make any generalization about the root 
cause due to lack of the info...



W.r.t fsck - you can try to run it - since fsck opens DB in read-pnly 
there are some chances it will work.



Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs 
are only 50-60% used. For example:


ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29  ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44  ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58  ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984  ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984


Yes, these drives are small, but it should be possible to find 1M 
more. It sounds like some stats data/counters are 
incorrect/corrupted. Is it possible to run an fsck on a bluestore 
device to have it checked for that? Any idea how an incorrect 
utilisation might come about?


I will look into starting these OSDs individually. This will be a bit 
of work as our deployment method is to start/stop all OSDs sharing 
the same disk simultaneously (OSDs are grouped by disk). If one fails 
all others also go down. Its for simplifying disk management and this 
debugging is a new use case we never needed before.


Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for
additional bluefs space allocations which prevents osd from startup.

  From the following log line one can see that bluefs needs ~1M more
space while the total available one is approx 622M. the problem is that
bluefs needs continuous(!) 64K chunks though. Which apparently aren't
available due to high disk fragmentation.

  -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
allocate on 0x11 min_size 0x11 > allocated total 0x3
bluefs_shared_alloc_size 0x1 allocated 0x3 available 0x 25134000


To double check the above root cause analysis it would be helpful to get
ceph-bluestore-tool's free_dump command output - small chances there is
a bug in allocator which "misses" some long enough chunks. But given
disk space utilization (>90%) and pretty small disk size this is
unlikely IMO.

So to work around the issue and bring OSD up you should either expand
the main device for OSD or add standalone DB volume.


Curious whether other non-starting OSDs report the same error...


Thanks,

Igor



On 10/7/2022 1:02 AM, Frank Schilder wrote:

Hi Igor,

the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one 
crashing the show. I collected its startup log here: 
https://pastebin.com/25D3piS6 . The line sticking out is line 603:


/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc: 
2931: ceph_abort_msg("bluefs enospc")


This smells a lot like rocksdb corruption. Can I do something about 
that? I still need to convert most of our OSDs and I cannot afford 
to loose more. The rebuild simply takes too long in the current 
situation.


Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 06 October 2022 17:03:53
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

But could you please share full OSD startup 

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, 
so the numbers I included are representative. I don't believe that we had one 
such extreme outlier. Maybe it ran full during conversion. Most of the data is 
OMAP after all.

I can't dump the free-dumps into pastebin, they are too large. Not sure if you 
can access ceph-post-files. I will send you a tgz in a separate e-mail directly 
to you.

> And once again - do other non-starting OSDs show the same ENOSPC error?
> Evidently I'm unable to make any generalization about the root cause due
> to lack of the info...

As I said before, I need more time to check this and give you the answer you 
actually want. The stupid answer is they don't, because the other 3 are taken 
down the moment 16 crashes and don't reach the same point. I need to take them 
out of the grouped management and start them by hand, which I can do tomorrow. 
I'm too tired now to play on our production system.

The free-dumps are on their separate way. I included one for OSD 17 as well (on 
the same disk).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16  so please share that OSD
utilization... And honestly I trust allocator's stats more so it's
rather CLI stats are incorrect if any. Anyway free dump should provide
additional proofs..

And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...


W.r.t fsck - you can try to run it - since fsck opens DB in read-pnly
there are some chances it will work.


Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:
> Hi Igor,
>
> I suspect there is something wrong with the data reported. These OSDs are 
> only 50-60% used. For example:
>
> ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
>  29  ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
>  44  ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
>  58  ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
> 984  ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984
>
> Yes, these drives are small, but it should be possible to find 1M more. It 
> sounds like some stats data/counters are incorrect/corrupted. Is it possible 
> to run an fsck on a bluestore device to have it checked for that? Any idea 
> how an incorrect utilisation might come about?
>
> I will look into starting these OSDs individually. This will be a bit of work 
> as our deployment method is to start/stop all OSDs sharing the same disk 
> simultaneously (OSDs are grouped by disk). If one fails all others also go 
> down. Its for simplifying disk management and this debugging is a new use 
> case we never needed before.
>
> Thanks for your help at this late hour!
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 07 October 2022 00:37:34
> To: Frank Schilder; ceph-users@ceph.io
> Cc: Stefan Kooman
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> Hi Frank,
>
> the abort message "bluefs enospc" indicates lack of free space for
> additional bluefs space allocations which prevents osd from startup.
>
>   From the following log line one can see that bluefs needs ~1M more
> space while the total available one is approx 622M. the problem is that
> bluefs needs continuous(!) 64K chunks though. Which apparently aren't
> available due to high disk fragmentation.
>
>   -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
> bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
> allocate on 0x11 min_size 0x11 > allocated total 0x3
> bluefs_shared_alloc_size 0x1 allocated 0x3 available 0x 25134000
>
>
> To double check the above root cause analysis it would be helpful to get
> ceph-bluestore-tool's free_dump command output - small chances there is
> a bug in allocator which "misses" some long enough chunks. But given
> disk space utilization (>90%) and pretty small disk size this is
> unlikely IMO.
>
> So to work around the issue and bring OSD up you should either expand
> the main device for OSD or add standalone DB volume.
>
>
> Curious whether oth

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor,

sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to 
de-fragment the OSD. It doesn't look like the fsck command does that. Is there 
any such tool?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 07 October 2022 01:53:20
To: Igor Fedotov; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, 
so the numbers I included are representative. I don't believe that we had one 
such extreme outlier. Maybe it ran full during conversion. Most of the data is 
OMAP after all.

I can't dump the free-dumps into paste bin, they are too large. Not sure if you 
can access ceph-post-files. I will send you a tgz in a separate e-mail directly 
to you.

> And once again - do other non-starting OSDs show the same ENOSPC error?
> Evidently I'm unable to make any generalization about the root cause due
> to lack of the info...

As I said before, I need more time to check this and give you the answer you 
actually want. The stupid answer is they don't, because the other 3 are taken 
down the moment 16 crashes and don't reach the same point. I need to take them 
out of the grouped management and start them by hand, which I can do tomorrow. 
I'm too tired now to play on our production system.

The free-dumps are on their separate way. I included one for OSD 17 as well (on 
the same disk).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16  so please share that OSD
utilization... And honestly I trust allocator's stats more so it's
rather CLI stats are incorrect if any. Anyway free dump should provide
additional proofs..

And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...


W.r.t fsck - you can try to run it - since fsck opens DB in read-pnly
there are some chances it will work.


Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:
> Hi Igor,
>
> I suspect there is something wrong with the data reported. These OSDs are 
> only 50-60% used. For example:
>
> ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
>  29  ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
>  44  ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
>  58  ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
> 984  ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984
>
> Yes, these drives are small, but it should be possible to find 1M more. It 
> sounds like some stats data/counters are incorrect/corrupted. Is it possible 
> to run an fsck on a bluestore device to have it checked for that? Any idea 
> how an incorrect utilisation might come about?
>
> I will look into starting these OSDs individually. This will be a bit of work 
> as our deployment method is to start/stop all OSDs sharing the same disk 
> simultaneously (OSDs are grouped by disk). If one fails all others also go 
> down. Its for simplifying disk management and this debugging is a new use 
> case we never needed before.
>
> Thanks for your help at this late hour!
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 07 October 2022 00:37:34
> To: Frank Schilder; ceph-users@ceph.io
> Cc: Stefan Kooman
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> Hi Frank,
>
> the abort message "bluefs enospc" indicates lack of free space for
> additional bluefs space allocations which prevents osd from startup.
>
>   From the following log line one can see that bluefs needs ~1M more
> space while the total available one is approx 622M. the problem is that
> bluefs needs continuous(!) 64K chunks though. Which apparently aren't
> available due to high disk fragmentation.
>
>   -4> 20

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi Igor and Stefan,

thanks a lot for your help! Our cluster is almost finished with recovery and I 
would like to switch to off-line conversion of the SSD OSDs. In one of Stefan's 
e-mails I could find the command for manual compaction:

ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" compact

Unfortunately, I can't find the command for performing the omap conversion. It 
is not mentioned here 
https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus
 even though it does mention the option to skip conversion in step 5. How to 
continue with an off-line conversion is not mentioned. I know it has been 
posted before, but I seem unable to find it on this list. If someone could send 
me the command, I would be most grateful.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi Stefan,

super thanks!

I found a quick-fix command in the help output:

# ceph-bluestore-tool -h
[...]
Positional options:
  --command arg  fsck, repair, quick-fix, bluefs-export,
 bluefs-bdev-sizes, bluefs-bdev-expand,
 bluefs-bdev-new-db, bluefs-bdev-new-wal,
 bluefs-bdev-migrate, show-label, set-label-key,
 rm-label-key, prime-osd-dir, bluefs-log-dump,
 free-dump, free-score, bluefs-stats

but it's not documented in 
https://docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool/. I guess I will 
stick with the tested command "repair". Nothing I found mentions what exactly 
is executed on start-up with bluestore_fsck_quick_fix_on_mount = true.
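
For reference, per stopped OSD the two variants would be invoked like this (quick-fix being the undocumented one from the help output above):

ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-${OSD_ID}
ceph-bluestore-tool quick-fix --path /var/lib/ceph/osd/ceph-${OSD_ID}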

Thanks for your quick answer!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 07 October 2022 09:07:37
To: Frank Schilder; Igor Fedotov; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/7/22 09:03, Frank Schilder wrote:
> Hi Igor and Stefan,
>
> thanks a lot for your help! Our cluster is almost finished with recovery and 
> I would like to switch to off-line conversion of the SSD OSDs. In one of 
> Stefan's I coud find the command for manual compaction:
>
> ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" compact
>
> Unfortunately, I can't find the command for performing the omap conversion. 
> It is not mentioned here 
> https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus
>  even though it does mention the option to skip conversion in step 5. How to 
> continue with an off-line conversion is not mentioned. I know it has been 
> posted before, but I seem unable to find it on this list. If someone could 
> send me the command, I would be most grateful.

for osd in `ls /var/lib/ceph/osd/`; do ceph-bluestore-tool repair --path
  /var/lib/ceph/osd/$osd;done

That's what I use.

Gr. Stefan


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov

Hi Frank,

one more thing I realized during the night :)

When performing the conversion, the DB gets a significant bunch of new data 
(approx. on par with the original OMAP volume) without the old data being 
immediately removed. Hence one should expect the DB size to grow dramatically 
at this point. This should go away after compaction (either enforced or the 
regular background one).


But the point is that during that peak usage one might (temporarily) run 
out of free space. And I believe that's the root cause of your 
outage. So please be careful when doing further conversions; I think 
your OSDs are exposed to this issue due to the limited space available ...
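
A possible pre-check sketch before converting each OSD (free-score requires the OSD to be stopped; what counts as enough headroom is a judgment call):

ceph osd df
ceph-bluestore-tool free-score --path /var/lib/ceph/osd/ceph-<ID>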



Thanks,

Igor

On 10/7/2022 2:53 AM, Frank Schilder wrote:

Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, 
so the numbers I included are representative. I don't believe that we had one 
such extreme outlier. Maybe it ran full during conversion. Most of the data is 
OMAP after all.

I can't dump the free-dumps into paste bin, they are too large. Not sure if you 
can access ceph-post-files. I will send you a tgz in a separate e-mail directly 
to you.


And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...

As I said before, I need more time to check this and give you the answer you 
actually want. The stupid answer is they don't, because the other 3 are taken 
down the moment 16 crashes and don't reach the same point. I need to take them 
out of the grouped management and start them by hand, which I can do tomorrow. 
I'm too tired now to play on our production system.

The free-dumps are on their separate way. I included one for OSD 17 as well (on 
the same disk).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16  so please share that OSD
utilization... And honestly I trust allocator's stats more so it's
rather CLI stats are incorrect if any. Anyway free dump should provide
additional proofs..

And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...


W.r.t fsck - you can try to run it - since fsck opens DB in read-pnly
there are some chances it will work.


Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs are only 
50-60% used. For example:

ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29  ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44  ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58  ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984  ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M more. It 
sounds like some stats data/counters are incorrect/corrupted. Is it possible to 
run an fsck on a bluestore device to have it checked for that? Any idea how an 
incorrect utilisation might come about?

I will look into starting these OSDs individually. This will be a bit of work 
as our deployment method is to start/stop all OSDs sharing the same disk 
simultaneously (OSDs are grouped by disk). If one fails all others also go 
down. Its for simplifying disk management and this debugging is a new use case 
we never needed before.

Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for
additional bluefs space allocations which prevents osd from startup.

   From the following log line one can see that bluefs needs ~1M more
space while the total available one is approx 622M. the problem is that
bluefs needs continuous(!) 64K chunks though. Which apparently aren't
available due to high disk fragmentation.

   -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_fre

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
For format updates one can use the quick-fix command instead of repair; it 
might work a bit faster.


On 10/7/2022 10:07 AM, Stefan Kooman wrote:

On 10/7/22 09:03, Frank Schilder wrote:

Hi Igor and Stefan,

thanks a lot for your help! Our cluster is almost finished with 
recovery and I would like to switch to off-line conversion of the SSD 
OSDs. In one of Stefan's I coud find the command for manual compaction:


ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" 
compact


Unfortunately, I can't find the command for performing the omap 
conversion. It is not mentioned here 
https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus 
even though it does mention the option to skip conversion in step 5. 
How to continue with an off-line conversion is not mentioned. I know 
it has been posted before, but I seem unable to find it on this list. 
If someone could send me the command, I would be most grateful.


for osd in `ls /var/lib/ceph/osd/`; do ceph-bluestore-tool repair 
--path  /var/lib/ceph/osd/$osd;done


That's what I use.

Gr. Stefan


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx



[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov

Just FYI:

standalone ceph-bluestore-tool's quick-fix behaves pretty similarly to the 
action performed on start-up with bluestore_fsck_quick_fix_on_mount = true




On 10/7/2022 10:18 AM, Frank Schilder wrote:

Hi Stefan,

super thanks!

I found a quick-fix command in the help output:

# ceph-bluestore-tool -h
[...]
Positional options:
   --command arg  fsck, repair, quick-fix, bluefs-export,
  bluefs-bdev-sizes, bluefs-bdev-expand,
  bluefs-bdev-new-db, bluefs-bdev-new-wal,
  bluefs-bdev-migrate, show-label, set-label-key,
  rm-label-key, prime-osd-dir, bluefs-log-dump,
  free-dump, free-score, bluefs-stats

but its not documented in https://docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool/. I 
guess I will stick with the tested command "repair". Nothing I found mentions 
what exactly is executed on start-up with bluestore_fsck_quick_fix_on_mount = true.

Thanks for your quick answer!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 07 October 2022 09:07:37
To: Frank Schilder; Igor Fedotov; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/7/22 09:03, Frank Schilder wrote:

Hi Igor and Stefan,

thanks a lot for your help! Our cluster is almost finished with recovery and I 
would like to switch to off-line conversion of the SSD OSDs. In one of Stefan's 
I coud find the command for manual compaction:

ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" compact

Unfortunately, I can't find the command for performing the omap conversion. It 
is not mentioned here 
https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus
 even though it does mention the option to skip conversion in step 5. How to 
continue with an off-line conversion is not mentioned. I know it has been 
posted before, but I seem unable to find it on this list. If someone could send 
me the command, I would be most grateful.

for osd in `ls /var/lib/ceph/osd/`; do ceph-bluestore-tool repair --path
   /var/lib/ceph/osd/$osd;done

That's what I use.

Gr. Stefan


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx



[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov

Hi Frank,

there are no tools to defragment an OSD atm. The only way to defragment an 
OSD is to redeploy it...



Thanks,

Igor


On 10/7/2022 3:04 AM, Frank Schilder wrote:

Hi Igor,

sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to 
de-fragment the OSD. It doesn't look like the fsck command does that. Is there 
any such tool?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 07 October 2022 01:53:20
To: Igor Fedotov; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, 
so the numbers I included are representative. I don't believe that we had one 
such extreme outlier. Maybe it ran full during conversion. Most of the data is 
OMAP after all.

I can't dump the free-dumps into paste bin, they are too large. Not sure if you 
can access ceph-post-files. I will send you a tgz in a separate e-mail directly 
to you.


And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...

As I said before, I need more time to check this and give you the answer you 
actually want. The stupid answer is they don't, because the other 3 are taken 
down the moment 16 crashes and don't reach the same point. I need to take them 
out of the grouped management and start them by hand, which I can do tomorrow. 
I'm too tired now to play on our production system.

The free-dumps are on their separate way. I included one for OSD 17 as well (on 
the same disk).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16, so please share that OSD's
utilization... And honestly, I trust the allocator's stats more, so it's
rather the CLI stats that are incorrect, if any. Anyway, the free dump should
provide additional proof.
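
For anyone wanting to reproduce these dumps: they come from the free-dump/free-score
commands listed in the ceph-bluestore-tool help quoted earlier, run against the
stopped OSD (the id and output file are just examples):

ceph-bluestore-tool free-dump  --path /var/lib/ceph/osd/ceph-16 > osd.16-free-dump.json
ceph-bluestore-tool free-score --path /var/lib/ceph/osd/ceph-16   # single fragmentation score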

And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...


W.r.t. fsck - you can try to run it - since fsck opens the DB in read-only
mode, there are some chances it will work.
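
A minimal sketch of that check (the path matches the log above; run it with the OSD
stopped):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-16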


Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs are only 
50-60% used. For example:

 ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 29   ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
 44   ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
 58   ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984   ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M more. It 
sounds like some stats data/counters are incorrect/corrupted. Is it possible to 
run an fsck on a bluestore device to have it checked for that? Any idea how an 
incorrect utilisation might come about?

I will look into starting these OSDs individually. This will be a bit of work 
as our deployment method is to start/stop all OSDs sharing the same disk 
simultaneously (OSDs are grouped by disk). If one fails, all others also go
down. It's for simplifying disk management, and this debugging is a new use case
we never needed before.
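
If it helps anyone with a similar grouped setup, starting just one of them by hand
would look roughly like this (the unit name is an assumption, ours differs):

systemctl start ceph-osd@17          # start only this OSD, not the whole group
journalctl -u ceph-osd@17 -f         # watch whether it hits the same ENOSPC abort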

Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates a lack of free space for
additional bluefs allocations, which prevents the OSD from starting up.

From the following log line one can see that bluefs needs ~1M more
space while the total available space is approx 622M. The problem is that
bluefs needs contiguous(!) 64K chunks, which apparently aren't
available due to high disk fragmentation.

   -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_fr

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi all,

trying to respond to 4 past emails :)

We started using manual conversion and, if the conversion fails, it fails in
the last step. So far, we have a failure on 1 out of 8 OSDs. The OSD can be
repaired by running a compaction + another repair, which will complete the
last step. Looks like we are just on the edge and can get away with
double-compaction.
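
In shell terms the workaround is roughly the following (the path is an example and
this assumes the failure is the same last-step one we hit):

OSD=/var/lib/ceph/osd/ceph-16
ceph-bluestore-tool repair --path $OSD || {
    ceph-kvstore-tool bluestore-kv $OSD compact   # compact the RocksDB
    ceph-bluestore-tool repair --path $OSD        # the second repair completes the last step
}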

For the interested future reader, we have subdivided 400G high-performance SSDs 
into 4x100G OSDs for our FS meta data pool. The increased concurrency improves 
performance a lot. But yes, we are on the edge. OMAP+META is almost 50%.

In our case, just merging 2x100 into 1x200 will probably not improve things as 
we will end up with an even more insane number of objects per PG than what we 
have already today. I will plan for having more OSDs for the meta-data pool 
available and also plan for having the infamous 60G temp space available with a 
bit more margin than what we have now.
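
For reference, such a split can be created in one go with ceph-volume (the device
name is a placeholder):

ceph-volume lvm batch --osds-per-device 4 /dev/sdX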

Thanks to everyone who helped!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 13:21:29
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Hi Frank,

there are no tools to defragment an OSD atm. The only way to defragment an OSD is
to redeploy it...


Thanks,

Igor


On 10/7/2022 3:04 AM, Frank Schilder wrote:
> Hi Igor,
>
> sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to 
> de-fragment the OSD. It doesn't look like the fsck command does that. Is 
> there any such tool?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 07 October 2022 01:53:20
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus
>
> Hi Igor,
>
> I added a sample of OSDs on identical disks. The usage is quite well 
> balanced, so the numbers I included are representative. I don't believe that 
> we had one such extreme outlier. Maybe it ran full during conversion. Most of 
> the data is OMAP after all.
>
> I can't dump the free-dumps into paste bin, they are too large. Not sure if 
> you can access ceph-post-files. I will send you a tgz in a separate e-mail 
> directly to you.
>
>> And once again - do other non-starting OSDs show the same ENOSPC error?
>> Evidently I'm unable to make any generalization about the root cause due
>> to lack of the info...
> As I said before, I need more time to check this and give you the answer you 
> actually want. The stupid answer is they don't, because the other 3 are taken 
> down the moment 16 crashes and don't reach the same point. I need to take 
> them out of the grouped management and start them by hand, which I can do 
> tomorrow. I'm too tired now to play on our production system.
>
> The free-dumps are on their separate way. I included one for OSD 17 as well 
> (on the same disk).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 07 October 2022 01:19:44
> To: Frank Schilder; ceph-users@ceph.io
> Cc: Stefan Kooman
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> The log I inspected was for osd.16  so please share that OSD
> utilization... And honestly I trust allocator's stats more so it's
> rather CLI stats are incorrect if any. Anyway free dump should provide
> additional proofs..
>
> And once again - do other non-starting OSDs show the same ENOSPC error?
> Evidently I'm unable to make any generalization about the root cause due
> to lack of the info...
>
>
> W.r.t fsck - you can try to run it - since fsck opens DB in read-pnly
> there are some chances it will work.
>
>
> Thanks,
>
> Igor
>
>
> On 10/7/2022 1:59 AM, Frank Schilder wrote:
>> Hi Igor,
>>
>> I suspect there is something wrong with the data reported. These OSDs are 
>> only 50-60% used. For example:
>>
>> IDCLASS WEIGHT   REWEIGHT  SIZE RAW USE   DATA  OMAP 
>> META  AVAIL%USE   VAR   PGS  STATUS TYPE NAME
>> 29   ssd  0.09099   1.0   93 GiB49 GiB17 GiB   16 
>> GiB15 GiB   44 GiB  52.42  1.91  104 up  
>> osd.29
>> 44   ssd  0.09099   1.0   93 GiB50 GiB23 GiB   10 
>> GiB16 GiB   43 GiB  53.88  1.96  121 up  
>> osd.44
>> 58   ssd  0.09099

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Szabo, Istvan (Agoda)
Finally how is your pg distribution? How many pg/disk?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Frank Schilder 
Sent: Friday, October 7, 2022 6:50 PM
To: Igor Fedotov ; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus



Hi all,

trying to respond to 4 past emails :)

We started using manual conversion and, if the conversion fails, it fails in
the last step. So far, we have a failure on 1 out of 8 OSDs. The OSD can be
repaired by running a compaction + another repair, which will complete the
last step. Looks like we are just on the edge and can get away with
double-compaction.

For the interested future reader, we have subdivided 400G high-performance SSDs 
into 4x100G OSDs for our FS meta data pool. The increased concurrency improves 
performance a lot. But yes, we are on the edge. OMAP+META is almost 50%.

In our case, just merging 2x100 into 1x200 will probably not improve things as 
we will end up with an even more insane number of objects per PG than what we 
have already today. I will plan for having more OSDs for the meta-data pool 
available and also plan for having the infamous 60G temp space available with a 
bit more margin than what we have now.

Thanks to everyone who helped!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 13:21:29
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Hi Frank,

there are no tools to defragment an OSD atm. The only way to defragment an OSD is to
redeploy it...


Thanks,

Igor


On 10/7/2022 3:04 AM, Frank Schilder wrote:
> Hi Igor,
>
> sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to 
> de-fragment the OSD. It doesn't look like the fsck command does that. Is 
> there any such tool?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 07 October 2022 01:53:20
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus
>
> Hi Igor,
>
> I added a sample of OSDs on identical disks. The usage is quite well 
> balanced, so the numbers I included are representative. I don't believe that 
> we had one such extreme outlier. Maybe it ran full during conversion. Most of 
> the data is OMAP after all.
>
> I can't dump the free-dumps into paste bin, they are too large. Not sure if 
> you can access ceph-post-files. I will send you a tgz in a separate e-mail 
> directly to you.
>
>> And once again - do other non-starting OSDs show the same ENOSPC error?
>> Evidently I'm unable to make any generalization about the root cause
>> due to lack of the info...
> As I said before, I need more time to check this and give you the answer you 
> actually want. The stupid answer is they don't, because the other 3 are taken 
> down the moment 16 crashes and don't reach the same point. I need to take 
> them out of the grouped management and start them by hand, which I can do 
> tomorrow. I'm too tired now to play on our production system.
>
> The free-dumps are on their separate way. I included one for OSD 17 as well 
> (on the same disk).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 07 October 2022 01:19:44
> To: Frank Schilder; ceph-users@ceph.io
> Cc: Stefan Kooman
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> The log I inspected was for osd.16  so please share that OSD
> utilization... And honestly I trust allocator's stats more so it's
> rather CLI stats are incorrect if any. Anyway free dump should provide
> additional proofs..
>
> And once again - do other non-starting OSDs show the same ENOSPC error?
> Evidently I'm unable to make any generalization about the root cause
> due to lack of the info...
>
>
> W.r.t fsck - you can try to run it - since fsck opens DB in read-pnly
> there are some chances it will work.
>
>
> Thanks,
>
> Igor
>
>
> On 10/7/2022 1:59 AM, Frank Schilder wrote:
>> Hi Igor,
>>
>> I suspect there is something wrong with the data reported. These OSDs are 
>> only 50-60% u

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-12 Thread Frank Schilder
Hi Szabo.

You mean like copy+paste what I wrote before:

> ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
>  29  ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
>  44  ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
>  58  ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
> 984  ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

This is representative of the entire pool. I'm done with getting the cluster
up again and these disks are now almost empty. The problem seems to be that
100G OSDs are just a bit too small for octopus.
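
For the PG-per-disk question: the PGS column in the listing above is the per-OSD PG
count; a rough one-liner to pull it for the ssd class only (CLASS is the second
column of "ceph osd df"):

ceph osd df | awk '$2 == "ssd"'   # PGS is near the end of each line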

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 07 October 2022 14:28
To: Frank Schilder
Cc: Igor Fedotov; ceph-users@ceph.io
Subject: RE: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Finally how is your pg distribution? How many pg/disk?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Frank Schilder 
Sent: Friday, October 7, 2022 6:50 PM
To: Igor Fedotov ; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus



Hi all,

trying to respond to 4 past emails :)

We started using manual conversion and, if the conversion fails, it fails in
the last step. So far, we have a failure on 1 out of 8 OSDs. The OSD can be
repaired by running a compaction + another repair, which will complete the
last step. Looks like we are just on the edge and can get away with
double-compaction.

For the interested future reader, we have subdivided 400G high-performance SSDs 
into 4x100G OSDs for our FS meta data pool. The increased concurrency improves 
performance a lot. But yes, we are on the edge. OMAP+META is almost 50%.

In our case, just merging 2x100 into 1x200 will probably not improve things as 
we will end up with an even more insane number of objects per PG than what we 
have already today. I will plan for having more OSDs for the meta-data pool 
available and also plan for having the infamous 60G temp space available with a 
bit more margin than what we have now.

Thanks to everyone who helped!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 13:21:29
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Hi Frank,

there are no tools to defragment an OSD atm. The only way to defragment an OSD is to
redeploy it...


Thanks,

Igor


On 10/7/2022 3:04 AM, Frank Schilder wrote:
> Hi Igor,
>
> sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to 
> de-fragment the OSD. It doesn't look like the fsck command does that. Is 
> there any such tool?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 07 October 2022 01:53:20
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus
>
> Hi Igor,
>
> I added a sample of OSDs on identical disks. The usage is quite well 
> balanced, so the numbers I included are representative. I don't believe that 
> we had one such extreme outlier. Maybe it ran full during conversion. Most of 
> the data is OMAP after all.
>
> I can't dump the free-dumps into paste bin, they are too large. Not sure if 
> you can access ceph-post-files. I will send you a tgz in a separate e-mail 
> directly to you.
>
>> And once again - do other non-starting OSDs show the same ENOSPC error?
>> Evidently I'm unable to make any generalization about the root cause
>> due to lack of the info...
> As I said before, I need more time to check this and give you the answer you 
> actually want. The stupid answer is they don't, because the other 3 are taken 
> down the moment 16 crashes and don't reach the same point. I need to take 
> them out of the grouped management and start t

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-12 Thread Alexander E. Patrakov
Fri, 7 Oct 2022 at 19:50, Frank Schilder :
> For the interested future reader, we have subdivided 400G high-performance 
> SSDs into 4x100G OSDs for our FS meta data pool. The increased concurrency 
> improves performance a lot. But yes, we are on the edge. OMAP+META is almost 
> 50%.

Please be careful with that. In the past, I had to help a customer who
ran out of disk space on small SSD partitions. This happened
because MONs keep the full history of OSD and PG maps until the cluster
is clean again. So, during a prolonged semi-outage (when the cluster is
not healthy) these maps slowly accumulate and eat disk space -
and the problematic part is that this growth is replicated to the
OSDs.
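
A rough way to keep an eye on that during a long recovery (jq and the ceph report
field names are assumptions, verify them on your release):

du -sh /var/lib/ceph/mon/*/store.db                # size of each MON's RocksDB
ceph report 2>/dev/null | \
    jq '.osdmap_last_committed - .osdmap_first_committed'   # retained osdmap epochs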


-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io