[ceph-users] Re: Stuck in upgrade

2022-10-07 Thread Jan Marek
Hello,

My cluster is now healthy.

I've studied the OSDMonitor.cc file and found some problematic
logic there.

Assumptions:

1) require_osd_release can only be raised.

2) ceph-mon version 17.2.3 can set require_osd_release to a
minimum value of 'octopus'.

I see two variants:

1) If I am allowed to set require_osd_release to octopus, then the
current value must be 'nautilus' (I would be raising
require_osd_release from nautilus to octopus). In that case,
line 11618 of OSDMonitor.cc should read:

ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);

2) If line 11618 of OSDMonitor.cc is to stay as:

ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);

then it makes no sense to allow setting require_osd_release to
'octopus', because this assertion already requires that
require_osd_release has been set to octopus.

I suggest variant 1) and I'm sending the attached patch.

There is another question: should a MON daemon check
require_osd_release when joining the cluster, given that it
cannot raise its value?

This is a potentially dangerous situation; see my old e-mail below...
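
For reference, a minimal sketch of how require_osd_release is
normally inspected and raised during an upgrade (assuming the
monitor accepts the new value; in my case it crashed on the
assert instead):

# show the current value stored in the osdmap
ceph osd dump | grep require_osd_release
# raise it one release at a time as the upgrade progresses
ceph osd require-osd-release octopus
# verify the change took effect
ceph osd dump | grep require_osd_release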

Sincerely
Jan Marek

On Mon, Oct 03, 2022 at 11:26:51 CEST, Jan Marek wrote:
> Hello,
> 
> I have a problem with our ceph cluster - I'm stuck in the upgrade
> process between versions 16.2.7 and 17.2.3.
> 
> My problem is that I have upgraded the MON, MGR and MDS processes, and
> when I started upgrading the OSDs, ceph told me that I cannot add an OSD
> of that version to the cluster, because of a problem with
> require_osd_release.
> 
> In my osdmap I have:
> 
> # ceph osd dump | grep require_osd_release
> require_osd_release nautilus
> 
> When I tried to set this to octopus or pacific, my MON daemon crashed with
> the assertion:
> 
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
> 
> in OSDMonitor.cc on line 11618.
> 
> Please, is there a way to repair it?
> 
> Can I (temporarily) change the ceph_assert to this line:
> 
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
> 
> and set require_osd_release to, say, pacific?
> 
> I've tried to downgrade the ceph-mon process back to version 16.2,
> but it cannot join the cluster...
> 
> Sincerely
> Jan Marek
> -- 
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html


> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


-- 
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade

2022-10-07 Thread Dan van der Ster
Hi Jan,

It looks like you got into this situation by not setting
require-osd-release to pacific while you were running 16.2.7.
The code has that expectation, and unluckily for you if you had
upgraded to 16.2.8 you would have had a HEALTH_WARN that pointed out
the mismatch between require_osd_release and the running version:
https://tracker.ceph.com/issues/53551
https://github.com/ceph/ceph/pull/44259
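
A quick way to check for that mismatch on any release (a rough
sketch; 16.2.8 and later also surface it as a health warning):

ceph osd dump | grep require_osd_release   # what the osdmap requires
ceph versions                              # what the daemons actually run
ceph health detail                         # on 16.2.8+ the mismatch shows up here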

Cheers, Dan

On Fri, Oct 7, 2022 at 10:05 AM Jan Marek  wrote:
>
> Hello,
>
> I've now cluster healthy.
>
> I've studied OSDMonitor.cc file and I've found, that there is
> some problematic logic.
>
> Assumptions:
>
> 1) require_osd_release can be only raise.
>
> 2) ceph-mon in version 17.2.3 can set require_osd_release to
> minimal value 'octopus'.
>
> I have two variants:
>
> 1) If I can set require_osd_release to octopus, I have to have
> set require_osd_release actually to 'nautilus' (I will raise
> require_osd_release from nautilus to octopus). Then I have to
> have on line 11618 in OSDMonitor.cc this line:
>
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
>
> 2) If I would have to preserve on line 11618 in file
> OSDMonitor.cc line:
>
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
>
> it is nonsense to can set require_osd_release parameter to
> 'octopus', because this line ensures, that I alredy set
> require_osd_release parameter to octopus.
>
> I suggest to use variant 1) and I've sendig attached patch.
>
> There is another question, if MON daemon have to check
> require_osd_release, when it is joining to the cluster, when it
> cannot raise it's value.
>
> It is potentially dangerous situation, see my old e-mail below...
>
> Sincerely
> Jan Marek
>
> Dne Po, říj 03, 2022 at 11:26:51 CEST napsal Jan Marek:
> > Hello,
> >
> > I've problem with our ceph cluster - I've stucked in upgrade
> > process between versions 16.2.7 and 17.2.3.
> >
> > My problem is, that I have upgraded MON, MGR, MDS processes, and
> > when I started upgrade OSDs, ceph tell me, that I cannot add OSD
> > with that version to cluster, because I have problem with
> > require_osd_release.
> >
> > In my osdmap I have:
> >
> > # ceph osd dump | grep require_osd_release
> > require_osd_release nautilus
> >
> > When I tried set this to octopus or pacific, my MON daemon crashed with
> > assertion:
> >
> > ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
> >
> > in OSDMonitor.cc on line 11618.
> >
> > Please, is there a way to repair it?
> >
> > Can I (temporary) change ceph_assert to this line:
> >
> > ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
> >
> > and set require_osd_release to, say, pacific?
> >
> > I've tried to downgrade ceph-mon process back to version 16.2,
> > but it cannot join to cluster...
> >
> > Sincerely
> > Jan Marek
> > --
> > Ing. Jan Marek
> > University of South Bohemia
> > Academic Computer Centre
> > Phone: +420389032080
> > http://www.gnu.org/philosophy/no-word-attachments.cs.html
>
>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> --
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade

2022-10-07 Thread Jan Marek
Hi Dan,

thanks for pointing that out; it's at least the minimum that can be done.

But can you imagine what I would have to do if I did not have
the ability to change OSDMonitor.cc, recompile, and raise
require-osd-release? Or if require-osd-release were lower than
nautilus?

The min_mon_release parameter is raised automatically once every
MON daemon in the cluster runs that version. Is there a reason
not to raise the require-osd-release parameter automatically as well?

Sincerely
Jan Marek

On Fri, Oct 07, 2022 at 11:08:52 CEST, Dan van der Ster wrote:
> Hi Jan,
> 
> It looks like you got into this situation by not setting
> require-osd-release to pacific while you were running 16.2.7.
> The code has that expectation, and unluckily for you if you had
> upgraded to 16.2.8 you would have had a HEALTH_WARN that pointed out
> the mismatch between require_osd_release and the running version:
> https://tracker.ceph.com/issues/53551
> https://github.com/ceph/ceph/pull/44259
> 
> Cheers, Dan
> 
> On Fri, Oct 7, 2022 at 10:05 AM Jan Marek  wrote:
> >
> > Hello,
> >
> > I've now cluster healthy.
> >
> > I've studied OSDMonitor.cc file and I've found, that there is
> > some problematic logic.
> >
> > Assumptions:
> >
> > 1) require_osd_release can be only raise.
> >
> > 2) ceph-mon in version 17.2.3 can set require_osd_release to
> > minimal value 'octopus'.
> >
> > I have two variants:
> >
> > 1) If I can set require_osd_release to octopus, I have to have
> > set require_osd_release actually to 'nautilus' (I will raise
> > require_osd_release from nautilus to octopus). Then I have to
> > have on line 11618 in OSDMonitor.cc this line:
> >
> > ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
> >
> > 2) If I would have to preserve on line 11618 in file
> > OSDMonitor.cc line:
> >
> > ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
> >
> > it is nonsense to can set require_osd_release parameter to
> > 'octopus', because this line ensures, that I alredy set
> > require_osd_release parameter to octopus.
> >
> > I suggest to use variant 1) and I've sendig attached patch.
> >
> > There is another question, if MON daemon have to check
> > require_osd_release, when it is joining to the cluster, when it
> > cannot raise it's value.
> >
> > It is potentially dangerous situation, see my old e-mail below...
> >
> > Sincerely
> > Jan Marek
> >
> > Dne Po, říj 03, 2022 at 11:26:51 CEST napsal Jan Marek:
> > > Hello,
> > >
> > > I've problem with our ceph cluster - I've stucked in upgrade
> > > process between versions 16.2.7 and 17.2.3.
> > >
> > > My problem is, that I have upgraded MON, MGR, MDS processes, and
> > > when I started upgrade OSDs, ceph tell me, that I cannot add OSD
> > > with that version to cluster, because I have problem with
> > > require_osd_release.
> > >
> > > In my osdmap I have:
> > >
> > > # ceph osd dump | grep require_osd_release
> > > require_osd_release nautilus
> > >
> > > When I tried set this to octopus or pacific, my MON daemon crashed with
> > > assertion:
> > >
> > > ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
> > >
> > > in OSDMonitor.cc on line 11618.
> > >
> > > Please, is there a way to repair it?
> > >
> > > Can I (temporary) change ceph_assert to this line:
> > >
> > > ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
> > >
> > > and set require_osd_release to, say, pacific?
> > >
> > > I've tried to downgrade ceph-mon process back to version 16.2,
> > > but it cannot join to cluster...
> > >
> > > Sincerely
> > > Jan Marek
> > > --
> > > Ing. Jan Marek
> > > University of South Bohemia
> > > Academic Computer Centre
> > > Phone: +420389032080
> > > http://www.gnu.org/philosophy/no-word-attachments.cs.html
> >
> >
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> > --
> > Ing. Jan Marek
> > University of South Bohemia
> > Academic Computer Centre
> > Phone: +420389032080
> > http://www.gnu.org/philosophy/no-word-attachments.cs.html
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2023-12-27 Thread Igor Fedotov

Hi Jan,

IIUC the attached log is for ceph-kvstore-tool, right?

Can you please share full OSD startup log as well?


Thanks,

Igor

On 12/27/2023 4:30 PM, Jan Marek wrote:

Hello,

I have a problem with my ceph cluster (3x mon nodes, 6x osd nodes;
every osd node has 12 rotational disks and one NVMe device for
the bluestore DB). Ceph was installed by the ceph orchestrator and
the OSDs use bluefs storage.

I've started the upgrade process from version 17.2.6 to 18.2.1 by
invoking:

ceph orch upgrade start --ceph-version 18.2.1

After the upgrade of the mon and mgr processes, the orchestrator tried to
upgrade the first OSD node, but its OSDs keep falling down.

I've stopped the upgrade process, but I now have 1 osd node
completely down.
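
For reference, a sketch of how such an orchestrated upgrade can be
inspected and halted (assuming cephadm):

ceph orch upgrade status   # target version and progress
ceph orch upgrade stop     # halt the upgrade
ceph orch ps               # see which daemons are down and which image they run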

After the upgrade I got some error messages and found
/var/lib/ceph/crash directories; I'm attaching the files I found
there to this message.

Please, can you advise what I can do now? It seems that rocksdb
is either incompatible or corrupted :-(

Thanks in advance.

Sincerely
Jan Marek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2023-12-30 Thread Igor Fedotov

Hi Jan,

this doesn't look like RocksDB corruption but rather like some BlueStore 
metadata inconsistency. Also, the assertion backtrace in the new log looks 
completely different from the original one. So, in an attempt to find any 
systematic pattern, I'd suggest running fsck with verbose logging for 
every failing OSD. Relevant command line:


CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20" 
bin/ceph-bluestore-tool --path  --command fsck


This is unlikely to fix anything; it's rather a way to collect logs to get 
better insight.



Additionally you might want to run a similar fsck for a couple of healthy 
OSDs - I'm curious whether it succeeds, as I have a feeling that the problem with 
the crashing OSDs had been hidden before the upgrade and was revealed rather 
than caused by it.



Thanks,

Igor

On 12/29/2023 3:28 PM, Jan Marek wrote:

Hello Igor,

I'm attaching a part of the syslog created while starting OSD.0.

Many thanks for the help.

Sincerely
Jan Marek

On Wed, Dec 27, 2023 at 04:42:56 CET, Igor Fedotov wrote:

Hi Jan,

IIUC the attached log is for ceph-kvstore-tool, right?

Can you please share full OSD startup log as well?


Thanks,

Igor

On 12/27/2023 4:30 PM, Jan Marek wrote:

Hello,

I've problem: my ceph cluster (3x mon nodes, 6x osd nodes, every
osd node have 12 rotational disk and one NVMe device for
bluestore DB). CEPH is installed by ceph orchestrator and have
bluefs storage on osd.

I've started process upgrade from version 17.2.6 to 18.2.1 by
invocating:

ceph orch upgrade start --ceph-version 18.2.1

After upgrade of mon and mgr processes orchestrator tried to
upgrade the first OSD node, but they are falling down.

I've stop the process of upgrade, but I have 1 osd node
completely down.

After upgrade I've got some error messages and I've found
/var/lib/ceph/crash directories, I attach to this message
files, which I've found here.

Please, can you advice, what now I can do? It seems, that rocksdb
is even non-compatible or corrupted :-(

Thanks in advance.

Sincerely
Jan Marek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2024-01-04 Thread Igor Fedotov

Hi Jan,

may I see the fsck logs from all the failing OSDs, to look for a pattern? 
IIUC the whole node is suffering from the issue, right?



Thanks,

Igor

On 1/2/2024 10:53 AM, Jan Marek wrote:

Hello once again,

I've tried this:

export CEPH_ARGS="--log-file /tmp/osd.0.log --debug-bluestore 5/20"
ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.0 --command fsck

And I'm sending the /tmp/osd.0.log file attached.

Sincerely
Jan Marek

On Sun, Dec 31, 2023 at 12:38:13 CET, Igor Fedotov wrote:

Hi Jan,

this doesn't look like RocksDB corruption but rather like some BlueStore
metadata inconsistency. Also assertion backtrace in the new log looks
completely different from the original one. So in an attempt to find any
systematic pattern I'd suggest to run fsck with verbose logging for every
failing OSD. Relevant command line:

CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20"
bin/ceph-bluestore-tool --path  --command fsck

Unlikely this will fix anything it's rather a way to collect logs to get
better insight.


Additionally you might want to run similar fsck for a couple of healthy OSDs
- curious if it succeeds as I have a feeling that the problem with crashing
OSDs had been hidden before the upgrade and revealed rather than caused by
it.


Thanks,

Igor

On 12/29/2023 3:28 PM, Jan Marek wrote:

Hello Igor,

I'm attaching a part of syslog creating while starting OSD.0.

Many thanks for help.

Sincerely
Jan Marek

Dne St, pro 27, 2023 at 04:42:56 CET napsal(a) Igor Fedotov:

Hi Jan,

IIUC the attached log is for ceph-kvstore-tool, right?

Can you please share full OSD startup log as well?


Thanks,

Igor

On 12/27/2023 4:30 PM, Jan Marek wrote:

Hello,

I've problem: my ceph cluster (3x mon nodes, 6x osd nodes, every
osd node have 12 rotational disk and one NVMe device for
bluestore DB). CEPH is installed by ceph orchestrator and have
bluefs storage on osd.

I've started process upgrade from version 17.2.6 to 18.2.1 by
invocating:

ceph orch upgrade start --ceph-version 18.2.1

After upgrade of mon and mgr processes orchestrator tried to
upgrade the first OSD node, but they are falling down.

I've stop the process of upgrade, but I have 1 osd node
completely down.

After upgrade I've got some error messages and I've found
/var/lib/ceph/crash directories, I attach to this message
files, which I've found here.

Please, can you advice, what now I can do? It seems, that rocksdb
is even non-compatible or corrupted :-(

Thanks in advance.

Sincerely
Jan Marek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2024-01-04 Thread Jan Marek
Hi Igor,

I've tried to start only osd.1, which seemed to fsck OK, but
it crashed :-(

I searched the logs and found that I have logs from 22.12.2023,
when I did the upgrade (I have logging set to journald).

Would you be interested in those logs? The file is 30MB in
bzip2 format; how can I share it with you?

It also contains the crash log from starting osd.1, but I can cut
that out and send it to the list...

Sincerely
Jan Marek

On Thu, Jan 04, 2024 at 02:43:48 CET, Jan Marek wrote:
> Hi Igor,
> 
> I've run this one-liner:
> 
> for i in {0..12}; do export CEPH_ARGS="--log-file osd."${i}".log --debug-bluestore 5/20" ; ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} --command fsck ; done;
> 
> On osd.0 it crashed very quickly, on osd.1 it is still working.
> 
> I've sent those logs in one e-mail.
> 
> But!
> 
> I've tried to list the disk devices in the monitor view, and I've got a
> very interesting screenshot - some parts I've highlighted with red
> rectangles.
> 
> I've got a json from syslog, which was part of a cephadm call,
> and there it seems to be correct (to my eyes).
> 
> Could this be connected to this problem?
> 
> Sincerely
> Jan Marek
> 
> On Thu, Jan 04, 2024 at 12:32:47 CET, Igor Fedotov wrote:
> > Hi Jan,
> > 
> > may I see the fsck logs from all the failing OSDs to see the pattern. IIUC
> > the full node is suffering from the issue, right?
> > 
> > 
> > Thanks,
> > 
> > Igor
> > 
> > On 1/2/2024 10:53 AM, Jan Marek wrote:
> > > Hello once again,
> > > 
> > > I've tried this:
> > > 
> > > export CEPH_ARGS="--log-file /tmp/osd.0.log --debug-bluestore 5/20"
> > > ceph-bluestore-tool --path 
> > > /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.0 --command fsck
> > > 
> > > And I've sending /tmp/osd.0.log file attached.
> > > 
> > > Sincerely
> > > Jan Marek
> > > 
> > > Dne Ne, pro 31, 2023 at 12:38:13 CET napsal(a) Igor Fedotov:
> > > > Hi Jan,
> > > > 
> > > > this doesn't look like RocksDB corruption but rather like some BlueStore
> > > > metadata inconsistency. Also assertion backtrace in the new log looks
> > > > completely different from the original one. So in an attempt to find any
> > > > systematic pattern I'd suggest to run fsck with verbose logging for 
> > > > every
> > > > failing OSD. Relevant command line:
> > > > 
> > > > CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20"
> > > > bin/ceph-bluestore-tool --path  --command fsck
> > > > 
> > > > Unlikely this will fix anything it's rather a way to collect logs to get
> > > > better insight.
> > > > 
> > > > 
> > > > Additionally you might want to run similar fsck for a couple of healthy 
> > > > OSDs
> > > > - curious if it succeeds as I have a feeling that the problem with 
> > > > crashing
> > > > OSDs had been hidden before the upgrade and revealed rather than caused 
> > > > by
> > > > it.
> > > > 
> > > > 
> > > > Thanks,
> > > > 
> > > > Igor
> > > > 
> > > > On 12/29/2023 3:28 PM, Jan Marek wrote:
> > > > > Hello Igor,
> > > > > 
> > > > > I'm attaching a part of syslog creating while starting OSD.0.
> > > > > 
> > > > > Many thanks for help.
> > > > > 
> > > > > Sincerely
> > > > > Jan Marek
> > > > > 
> > > > > Dne St, pro 27, 2023 at 04:42:56 CET napsal(a) Igor Fedotov:
> > > > > > Hi Jan,
> > > > > > 
> > > > > > IIUC the attached log is for ceph-kvstore-tool, right?
> > > > > > 
> > > > > > Can you please share full OSD startup log as well?
> > > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > 
> > > > > > Igor
> > > > > > 
> > > > > > On 12/27/2023 4:30 PM, Jan Marek wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > I've problem: my ceph cluster (3x mon nodes, 6x osd nodes, every
> > > > > > > osd node have 12 rotational disk and one NVMe device for
> > > > > > > bluestore DB). CEPH is installed by ceph orchestrator and have
> > > > > > > bluefs storage on osd.
> > > > > > > 
> > > > > > > I've started process upgrade from version 17.2.6 to 18.2.1 by
> > > > > > > invocating:
> > > > > > > 
> > > > > > > ceph orch upgrade start --ceph-version 18.2.1
> > > > > > > 
> > > > > > > After upgrade of mon and mgr processes orchestrator tried to
> > > > > > > upgrade the first OSD node, but they are falling down.
> > > > > > > 
> > > > > > > I've stop the process of upgrade, but I have 1 osd node
> > > > > > > completely down.
> > > > > > > 
> > > > > > > After upgrade I've got some error messages and I've found
> > > > > > > /var/lib/ceph/crash directories, I attach to this message
> > > > > > > files, which I've found here.
> > > > > > > 
> > > > > > > Please, can you advice, what now I can do? It seems, that rocksdb
> > > > > > > is even non-compatible or corrupted :-(
> > > > > > > 
> > > > > > > Thanks in advance.
> > > > > > > 
> > > > > > > Sincerely
> > > > > > > Jan Marek
> > > > > > > 
> > > > > > > ___
> > > > > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > > > > To unsubscribe send an email

[ceph-users] Re: Stuck in upgrade process to reef

2024-01-08 Thread Igor Fedotov

Hi Jan,

indeed the fsck logs for the OSDs other than osd.0 look good, so it would be 
interesting to see the OSD startup logs for them. Preferably for 
multiple (e.g. 3-4) OSDs, to spot the pattern.


Original upgrade log(s) would be nice to see as well.

You might want to use Google Drive or any other publicly available file 
sharing site for that.



Thanks,

Igor

On 05/01/2024 10:25, Jan Marek wrote:

Hi Igor,

I've tried to start only osd.1, which seems to be fsck'd OK, but
it crashed :-(

I search logs and I've found, that I have logs from 22.12.2023,
when I've did a upgrade (I have set logging to journald).

Would you be interested in those logs? This file have 30MB in
bzip2 format, how I can share it with you?

It contains crash log from start osd.1 too, but I can cut out
from it and send it to list...

Sincerely
Jan Marek

Dne Čt, led 04, 2024 at 02:43:48 CET napsal(a) Jan Marek:

Hi Igor,

I've ran this oneliner:

for i in {0..12}; do export CEPH_ARGS="--log-file osd."${i}".log --debug-bluestore 
5/20" ; ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} 
--command fsck ; done;

On osd.0 it crashed very quickly, on osd.1 it is still working.

I've send those logs in one e-mail.

But!

I've tried to list disk devices in monitor view, and I've got
very interesting screenshot - some part I've emphasized by red
rectangulars.

I've got a json from syslog, which was as a part cephadm call,
where it seems to be correct (for my eyes).

Can be this coincidence for this problem?

Sincerely
Jan Marek

Dne Čt, led 04, 2024 at 12:32:47 CET napsal(a) Igor Fedotov:

Hi Jan,

may I see the fsck logs from all the failing OSDs to see the pattern. IIUC
the full node is suffering from the issue, right?


Thanks,

Igor

On 1/2/2024 10:53 AM, Jan Marek wrote:

Hello once again,

I've tried this:

export CEPH_ARGS="--log-file /tmp/osd.0.log --debug-bluestore 5/20"
ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.0 --command fsck

And I've sending /tmp/osd.0.log file attached.

Sincerely
Jan Marek

Dne Ne, pro 31, 2023 at 12:38:13 CET napsal(a) Igor Fedotov:

Hi Jan,

this doesn't look like RocksDB corruption but rather like some BlueStore
metadata inconsistency. Also assertion backtrace in the new log looks
completely different from the original one. So in an attempt to find any
systematic pattern I'd suggest to run fsck with verbose logging for every
failing OSD. Relevant command line:

CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20"
bin/ceph-bluestore-tool --path  --command fsck

Unlikely this will fix anything it's rather a way to collect logs to get
better insight.


Additionally you might want to run similar fsck for a couple of healthy OSDs
- curious if it succeeds as I have a feeling that the problem with crashing
OSDs had been hidden before the upgrade and revealed rather than caused by
it.


Thanks,

Igor

On 12/29/2023 3:28 PM, Jan Marek wrote:

Hello Igor,

I'm attaching a part of syslog creating while starting OSD.0.

Many thanks for help.

Sincerely
Jan Marek

Dne St, pro 27, 2023 at 04:42:56 CET napsal(a) Igor Fedotov:

Hi Jan,

IIUC the attached log is for ceph-kvstore-tool, right?

Can you please share full OSD startup log as well?


Thanks,

Igor

On 12/27/2023 4:30 PM, Jan Marek wrote:

Hello,

I've problem: my ceph cluster (3x mon nodes, 6x osd nodes, every
osd node have 12 rotational disk and one NVMe device for
bluestore DB). CEPH is installed by ceph orchestrator and have
bluefs storage on osd.

I've started process upgrade from version 17.2.6 to 18.2.1 by
invocating:

ceph orch upgrade start --ceph-version 18.2.1

After upgrade of mon and mgr processes orchestrator tried to
upgrade the first OSD node, but they are falling down.

I've stop the process of upgrade, but I have 1 osd node
completely down.

After upgrade I've got some error messages and I've found
/var/lib/ceph/crash directories, I attach to this message
files, which I've found here.

Please, can you advice, what now I can do? It seems, that rocksdb
is even non-compatible or corrupted :-(

Thanks in advance.

Sincerely
Jan Marek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io

[ceph-users] Re: Stuck in upgrade process to reef

2024-01-09 Thread Igor Fedotov

Hi Marek,

I haven't looked through those upgrade logs yet but here are some 
comments regarding the last OSD startup attempt.


First of all, answering your question:


_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)



Is it a mandatory part of fsck?


This is caused by a previous non-graceful OSD process shutdown. BlueStore is 
unable to find an up-to-date allocation map and recovers it from RocksDB. And 
since fsck is a read-only procedure, the recovered allocmap is not saved - hence 
all the following BlueStore startups (within fsck or OSD init) trigger another 
rebuild attempt. To avoid that you might want to run repair instead of fsck - 
this will persist the up-to-date allocation map and avoid rebuilding it on the 
next startup. This will only work until the next non-graceful shutdown - hence an 
unsuccessful OSD startup attempt might break the allocmap state again.

Secondly - looking at the OSD startup log, one can see that the actual OSD log 
ends with that allocmap recovery as well:


2024-01-09T11:25:30.718449+01:00 osd1 ceph-osd[1734062]: 
bluestore(/var/lib/ceph/osd/ceph-1) _init_alloc::NCB::restore_allocator() 
failed! Run Full Recovery from ONodes (might take a while) ...


The subsequent log line, indicating the OSD daemon's termination, is from systemd:

2024-01-09T11:25:33.516258+01:00 osd1 systemd[1]: Stopping 
ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service - Ceph osd.1 for 
2c565e24-7850-47dc-a751-a6357cbbaf2a...


And honestly these lines provide almost no clue as to why the termination happened. No 
obvious OSD failures are shown. Perhaps the containerized environment 
hides the details, e.g. by cutting off the tail of the OSD log.
So you might want to proceed with the investigation by running repair prior to 
starting the OSD, as per the above. This will result in no alloc map recovery and 
hopefully work around the problem during startup - if the issue is caused by the 
allocmap recovery.
Additionally you might want to increase the debug_bluestore log level for osd.1 
before starting it up, to get more insight into what's happening.

Alternatively you might want to play with the OSD log target settings to write 
the osd.1 log to a file rather than using the system-wide logging infra - hopefully 
this will be more helpful.
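
A rough sketch of that sequence, using the fsid and OSD id from this 
thread (run the tool while the OSD container is stopped):

# repair persists the allocation map so startup does not rebuild it
ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command repair
# raise BlueStore logging for osd.1 only, then try to start it again
ceph config set osd.1 debug_bluestore 5/20
systemctl start ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service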

Thanks,
Igor

On 09/01/2024 13:31, Jan Marek wrote:

Hi Igor,

I've sent you the logs via filesender.cesnet.cz; if anyone else is
interested, they are here:

https://filesender.cesnet.cz/?s=download&token=047b1ec4-4df0-4e8a-90fc-31706eb168a4

Some points:

1) I've found that the osd1 server had the wrong time (3 minutes in
the future). I've corrected that. Yes, I know that's bad, but we
moved the servers to another network segment, where they have no access
to the timeservers on the Internet, so I had to reconfigure them to use
our own NTP servers.

2) I've tried to start the osd.1 service with this sequence:

a)

ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

(without setting log properly :-( )

b)

export CEPH_ARGS="--log-file osd.1.log --debug-bluestore 5/20"
ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

- here I have one question: why is this line still in the log:

_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)

Is it a mandatory part of fsck?

Log is attached.

c)

systemctl start ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service

still crashing, gzip-ed log attached too.

Many thanks for looking into the problem.

Sincerely
Jan Marek

On Mon, Jan 08, 2024 at 12:00:05 CET, Igor Fedotov wrote:

Hi Jan,

indeed fsck logs for the OSDs other than osd.0 look good so it would be
interesting to see OSD startup logs for them. Preferably to have that for
multiple (e.g. 3-4) OSDs to get the pattern.

Original upgrade log(s) would be nice to see as well.

You might want to use Google Drive or any other publicly available file
sharing site for that.


Thanks,

Igor

On 05/01/2024 10:25, Jan Marek wrote:

Hi Igor,

I've tried to start only osd.1, which seems to be fsck'd OK, but
it crashed :-(

I search logs and I've found, that I have logs from 22.12.2023,
when I've did a upgrade (I have set logging to journald).

Would you be interested in those logs? This file have 30MB in
bzip2 format, how I can share it with you?

It contains crash log from start osd.1 too, but I can cut out
from it and send it to list...

Sincerely
Jan Marek

Dne Čt, led 04, 2024 at 02:43:48 CET napsal(a) Jan Marek:

Hi Igor,

I've ran this oneliner:

for i in {0..12}; do export CEPH_ARGS="--log-file osd."${i}".log --debug-bluestore 
5/20" ; ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} 
--command fsck ; done;

On osd.0 it crashed very quickly, on osd.1 it is still working.

I've send those logs in one e-mail.

But!

I've tried to list disk devices in monitor view, and I've got
very interesting screenshot - some part I've emphasized by red
rectangular

[ceph-users] Re: Stuck in upgrade process to reef

2024-01-10 Thread Igor Fedotov

Hi Jan,

indeed this looks like some memory allocation problem - maybe the OSD's RAM 
usage threshold was reached or something?


Curious whether you have any custom OSD settings or maybe any memory caps 
for the Ceph containers?


Could you please set debug_bluestore to 5/20 and debug_prioritycache to 
10 and try to start the OSD once again. Please monitor the process's RAM usage 
while it runs and share the resulting log.
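
One possible way to watch RAM usage while the OSD starts (just a 
sketch; with podman the OSD process is still visible in the host's 
process table):

# print resident memory (RSS, in KiB) of all ceph-osd processes every 5 seconds
while sleep 5; do date; ps -o pid=,rss=,args= -C ceph-osd; done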



Thanks,

Igor

On 10/01/2024 11:20, Jan Marek wrote:

Hi Igor,

I've tried to repair osd.1 with the command:

ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command repair

and then started the osd.1 ceph-osd podman service.

It seems that there is a problem with memory allocation; see the
attached log...

Sincerely
Jan

On Tue, Jan 09, 2024 at 02:23:32 CET, Igor Fedotov wrote:

Hi Marek,

I haven't looked through those upgrade logs yet but here are some comments
regarding last OSD startup attempt.

First of answering your question


_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)
Is it a mandatory part of fsck?

This is caused by previous non-graceful OSD process shutdown. BlueStore is 
unable to find up-to-date allocation map and recovers it from RocksDB. And 
since fsck is a read-only procedure the recovered allocmap is not saved - hence 
all the following BlueStore startups (within fsck or OSD init) cause another 
rebuild attempt. To avoid that you might want to run repair instead of fsck - 
this will persist up-to-date allocation map and avoid its rebuilding on the 
next startup. This will work till the next non-graceful shutdown only - hence 
unsuccessful OSD attempt might break the allocmap state again.

Secondly - looking at OSD startup log one can see that actual OSD log ends with 
that allocmap recovery as well:


2024-01-09T11:25:30.718449+01:00 osd1 ceph-osd[1734062]: 
bluestore(/var/lib/ceph/osd/ceph-1) _init_alloc::NCB::restore_allocator() 
failed! Run Full Recovery from ONodes (might take a while) ...

Subsequent log line indicating OSD daemon termination is from systemd:

2024-01-09T11:25:33.516258+01:00 osd1 systemd[1]: Stopping 
ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service - Ceph osd.1 for 
2c565e24-7850-47dc-a751-a6357cbbaf2a...

And honestly these lines provide almost no clue why termination happened. No 
obvious OSD failures or something are shown. Perhaps containerized environment 
hides the details e.g. by cutting off OSD log's tail.
So you might want to proceed the investigation by running repair prior to 
starting the OSD as per above. This will result in no alloc map recovery and 
hopefully workaround the problem during startup - if the issue is caused by 
allocmap recovery.
Additionally you might want to increase debug_bluestore log level for osd.1 
before starting it up to get more insight on what's happening.

Alternatively you might want to play with OSD log target settings to write 
OSD.1 log to some file rather than using system wide logging infra - hopefully 
this will be more helpful.

Thanks,
Igor

On 09/01/2024 13:31, Jan Marek wrote:

Hi Igor,

I've sent you logs via filesender.cesnet.cz, if someone would
be interested, they are here:

https://filesender.cesnet.cz/?s=download&token=047b1ec4-4df0-4e8a-90fc-31706eb168a4

Some points:

1) I've found, that on the osd1 server was bad time (3 minutes in
future). I've corrected that. Yes, I know, that it's bad, but we
moved servers to any other net segment, where they have no access
to the timeservers in Internet, then I must reconfigure it to use
our own NTP servers.

2) I've tried to start osd.1 service by this sequence:

a)

ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

(without setting log properly :-( )

b)

export CEPH_ARGS="--log-file osd.1.log --debug-bluestore 5/20"
ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck

- here I have one question: Why is it in this log stil this line:

_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes 
(might take a while)

Is it a mandatory part of fsck?

Log is attached.

c)

systemctl start ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service

still crashing, gzip-ed log attached too.

Many thanks for exploring problem.

Sincerely
Jan Marek

Dne Po, led 08, 2024 at 12:00:05 CET napsal(a) Igor Fedotov:

Hi Jan,

indeed fsck logs for the OSDs other than osd.0 look good so it would be
interesting to see OSD startup logs for them. Preferably to have that for
multiple (e.g. 3-4) OSDs to get the pattern.

Original upgrade log(s) would be nice to see as well.

You might want to use Google Drive or any other publicly available file
sharing site for that.


Thanks,

Igor

On 05/01/2024 10:25, Jan Marek wrote:

Hi Igor,

I've tried to start only osd.1, which seems to be fsck'd OK, but
it crashed :-(

I search logs and I've found, that I have logs from 22.12.2023,
when I've 

[ceph-users] Re: Stuck in upgrade process to reef

2024-01-11 Thread Igor Fedotov

Hi Jan,

unfortunately this wasn't very helpful. Moreover, the log looks a bit 
messy - like a mixture of outputs from multiple running instances 
or something. I'm not an expert in containerized setups, though.


Could you please simplify things by running the ceph-osd process manually, 
like you did for ceph-objectstore-tool, and force the log output to a 
file. The command line should look something like the following:


ceph-osd -i 0 --log-to-file --log-file  --debug-bluestore 
5/20 --debug-prioritycache 10


Please don't forget to run repair prior to that.
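
A sketch of the whole sequence, assuming a cephadm deployment where 
"cephadm shell --name osd.1" provides the daemon's container context so 
the default data path resolves (the log file name is only an example):

# stop the osd.1 systemd unit first, then enter the daemon's container context
cephadm shell --name osd.1
# inside the shell: repair first, then run the daemon in the foreground
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 --command repair
ceph-osd -i 1 --log-to-file --log-file /var/log/ceph/osd.1.manual.log --debug-bluestore 5/20 --debug-prioritycache 10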


Also you haven't answered my questions about custom [memory] settings 
and RAM usage during OSD startup. It would be nice to hear some feedback.



Thanks,

Igor

On 11/01/2024 16:47, Jan Marek wrote:

Hi Igor,

I've tried to start osd.1 with debug_prioritycache and
debug_bluestore 5/20, see attached file...

Sincerely
Jan

On Wed, Jan 10, 2024 at 01:03:07 CET, Igor Fedotov wrote:

Hi Jan,

indeed this looks like some memory allocation problem - may be OSD's RAM
usage threshold reached or something?

Curious if you have any custom OSD settings or may be any memory caps for
Ceph containers?

Could you please set debug_bluestore to 5/20 and debug_prioritycache to 10
and try to start OSD once again. Please monitor process RAM usage along the
process and share the resulting log.


Thanks,

Igor

On 10/01/2024 11:20, Jan Marek wrote:

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2024-01-16 Thread Igor Fedotov

Hi Jan,

I've just filed an upstream ticket for your case; see 
https://tracker.ceph.com/issues/64053 for more details.



You might want to tune (or preferably just remove) your custom 
bluestore_cache_.*_ratio settings to fix the issue.


This is reproducible and fixable in my lab this way.
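
If the overrides were set centrally, removing them could look like this 
(a sketch; take the exact option names from your ceph config dump):

ceph config dump | grep bluestore_cache      # list any cache ratio overrides
ceph config rm osd bluestore_cache_meta_ratio
ceph config rm osd bluestore_cache_kv_ratio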

Hope this helps.


Thanks,

Igor


On 15/01/2024 12:54, Jan Marek wrote:

Hi Igor,

I've tried to start the ceph-osd daemon as you advised me and I'm
sending the log osd.1.start.log

About memory: according to 'top', the podman ceph daemon doesn't reach
2% of the whole server's memory (64GB)...

I have memory autotuning switched on...

My ceph config dump - see attached dump.txt

Sincerely
Jan Marek

On Thu, Jan 11, 2024 at 04:02:02 CET, Igor Fedotov wrote:

Hi Jan,

unfortunately this wasn't very helpful. Moreover the log looks a bit messy -
looks like a mixture of outputs from multiple running instances or
something. I'm not an expert in using containerized setups though.

Could you please simplify things by running ceph-osd process manually like
you did for ceph-objectstore-tool. And enforce log output to a file. Command
line should look somewhat the following:

ceph-osd -i 0 --log-to-file --log-file  --debug-bluestore 5/20
--debug-prioritycache 10

Please don't forget to run repair prior to that.


Also you haven't answered my questions about custom [memory] settings and
RAM usage during OSD startup. It would be nice to hear some feedback.


Thanks,

Igor

On 11/01/2024 16:47, Jan Marek wrote:

Hi Igor,

I've tried to start osd.1 with debug_prioritycache and
debug_bluestore 5/20, see attached file...

Sincerely
Jan

Dne St, led 10, 2024 at 01:03:07 CET napsal(a) Igor Fedotov:

Hi Jan,

indeed this looks like some memory allocation problem - may be OSD's RAM
usage threshold reached or something?

Curious if you have any custom OSD settings or may be any memory caps for
Ceph containers?

Could you please set debug_bluestore to 5/20 and debug_prioritycache to 10
and try to start OSD once again. Please monitor process RAM usage along the
process and share the resulting log.


Thanks,

Igor

On 10/01/2024 11:20, Jan Marek wrote:

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2024-01-17 Thread Jan Marek
Hi Igor,

many thanks for the advice!

I've tried to start osd.1 and it has started; now it's
resynchronizing data.

I will start the daemons one by one.

What do you suggest for osd.0, which has a problem with
bluestore fsck? Is there a way to repair it?

Sincerely
Jan


On Tue, Jan 16, 2024 at 08:15:03 CET, Igor Fedotov wrote:
> Hi Jan,
> 
> I've just fired an upstream ticket for your case, see
> https://tracker.ceph.com/issues/64053 for more details.
> 
> 
> You might want to tune (or preferably just remove) your custom
> bluestore_cache_.*_ratio settings to fix the issue.
> 
> This is reproducible and fixable in my lab this way.
> 
> Hope this helps.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 15/01/2024 12:54, Jan Marek wrote:
> > Hi Igor,
> > 
> > I've tried to start ceph-sod daemon as you advice me and I'm
> > sending log osd.1.start.log
> > 
> > About memory: According to 'top' podman ceph daemon don't reach
> > 2% of whole server memory (64GB)...
> > 
> > I have switch on autotune of memory...
> > 
> > My ceph config dump - see attached dump.txt
> > 
> > Sincerely
> > Jan Marek
> > 
> > Dne Čt, led 11, 2024 at 04:02:02 CET napsal(a) Igor Fedotov:
> > > Hi Jan,
> > > 
> > > unfortunately this wasn't very helpful. Moreover the log looks a bit 
> > > messy -
> > > looks like a mixture of outputs from multiple running instances or
> > > something. I'm not an expert in using containerized setups though.
> > > 
> > > Could you please simplify things by running ceph-osd process manually like
> > > you did for ceph-objectstore-tool. And enforce log output to a file. 
> > > Command
> > > line should look somewhat the following:
> > > 
> > > ceph-osd -i 0 --log-to-file --log-file  --debug-bluestore 5/20
> > > --debug-prioritycache 10
> > > 
> > > Please don't forget to run repair prior to that.
> > > 
> > > 
> > > Also you haven't answered my questions about custom [memory] settings and
> > > RAM usage during OSD startup. It would be nice to hear some feedback.
> > > 
> > > 
> > > Thanks,
> > > 
> > > Igor
> > > 
> > > On 11/01/2024 16:47, Jan Marek wrote:
> > > > Hi Igor,
> > > > 
> > > > I've tried to start osd.1 with debug_prioritycache and
> > > > debug_bluestore 5/20, see attached file...
> > > > 
> > > > Sincerely
> > > > Jan
> > > > 
> > > > Dne St, led 10, 2024 at 01:03:07 CET napsal(a) Igor Fedotov:
> > > > > Hi Jan,
> > > > > 
> > > > > indeed this looks like some memory allocation problem - may be OSD's 
> > > > > RAM
> > > > > usage threshold reached or something?
> > > > > 
> > > > > Curious if you have any custom OSD settings or may be any memory caps 
> > > > > for
> > > > > Ceph containers?
> > > > > 
> > > > > Could you please set debug_bluestore to 5/20 and debug_prioritycache 
> > > > > to 10
> > > > > and try to start OSD once again. Please monitor process RAM usage 
> > > > > along the
> > > > > process and share the resulting log.
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Igor
> > > > > 
> > > > > On 10/01/2024 11:20, Jan Marek wrote:
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2024-01-17 Thread Igor Fedotov

Hi Jan,

w.r.t. osd.0 - if this is the only occurrence then I'd propose simply 
redeploying the OSD. This looks like some BlueStore metadata inconsistency 
which could have occurred long before the upgrade. Likely the upgrade just 
revealed the issue. And honestly I can hardly imagine how to 
investigate it at this point.


Let's see how further upgrades go and come back to this question if more 
similar issues pop up.


Meanwhile I'd recommend running fsck for every OSD prior to the upgrade, to 
get a clear understanding of whether the metadata is consistent or not.
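
For example, a variant of the loop already used earlier in this thread, 
run per node with the OSDs stopped (adjust the fsid and the OSD ids):

for i in $(seq 0 11); do
  CEPH_ARGS="--log-file osd.${i}.fsck.log --debug-bluestore 5/20" \
  ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} --command fsck
done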


This way - if it occurs once again - we can prove or disprove my statement 
above about the issue being unrelated to the upgrade.



Thanks,

Igor

On 17/01/2024 15:07, Jan Marek wrote:

Hi Igor,

many thanks for advice!

I've tried to start osd.1 and it started already, now it's
resynchronizing data.

I will start daemons one-by-one.

What do you mean about osd.0, which have a problem with
bluestore fsck? Is there a way to repair it?

Sincerely
Jan


Dne Út, led 16, 2024 at 08:15:03 CET napsal(a) Igor Fedotov:

Hi Jan,

I've just fired an upstream ticket for your case, see
https://tracker.ceph.com/issues/64053 for more details.


You might want to tune (or preferably just remove) your custom
bluestore_cache_.*_ratio settings to fix the issue.

This is reproducible and fixable in my lab this way.

Hope this helps.


Thanks,

Igor


On 15/01/2024 12:54, Jan Marek wrote:

Hi Igor,

I've tried to start ceph-sod daemon as you advice me and I'm
sending log osd.1.start.log

About memory: According to 'top' podman ceph daemon don't reach
2% of whole server memory (64GB)...

I have switch on autotune of memory...

My ceph config dump - see attached dump.txt

Sincerely
Jan Marek

Dne Čt, led 11, 2024 at 04:02:02 CET napsal(a) Igor Fedotov:

Hi Jan,

unfortunately this wasn't very helpful. Moreover the log looks a bit messy -
looks like a mixture of outputs from multiple running instances or
something. I'm not an expert in using containerized setups though.

Could you please simplify things by running ceph-osd process manually like
you did for ceph-objectstore-tool. And enforce log output to a file. Command
line should look somewhat the following:

ceph-osd -i 0 --log-to-file --log-file  --debug-bluestore 5/20
--debug-prioritycache 10

Please don't forget to run repair prior to that.


Also you haven't answered my questions about custom [memory] settings and
RAM usage during OSD startup. It would be nice to hear some feedback.


Thanks,

Igor

On 11/01/2024 16:47, Jan Marek wrote:

Hi Igor,

I've tried to start osd.1 with debug_prioritycache and
debug_bluestore 5/20, see attached file...

Sincerely
Jan

Dne St, led 10, 2024 at 01:03:07 CET napsal(a) Igor Fedotov:

Hi Jan,

indeed this looks like some memory allocation problem - may be OSD's RAM
usage threshold reached or something?

Curious if you have any custom OSD settings or may be any memory caps for
Ceph containers?

Could you please set debug_bluestore to 5/20 and debug_prioritycache to 10
and try to start OSD once again. Please monitor process RAM usage along the
process and share the resulting log.


Thanks,

Igor

On 10/01/2024 11:20, Jan Marek wrote:

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io