[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-28 Thread Janne Johansson
Den fre 27 maj 2022 kl 18:26 skrev Sarunas Burdulis
:
> Thanks. I don't recall creating any of the default.* pools, so they
> might have been created by ceph-deploy, years ago (kraken?). They all have
> min_size 1, replica 2.

Those are automatically created by radosgw when it starts.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-27 Thread Sarunas Burdulis

On 5/27/22 11:41, Bogdan Adrian Velica wrote:

Hi,

Can you please tell us the size of your ceph cluster? How many OSDs do 
you have?


16 OSDs.

$ ceph df
--- RAW STORAGE ---
CLASS SIZE    AVAIL USED  RAW USED  %RAW USED
hdd    8.9 TiB  8.3 TiB  595 GiB   595 GiB   6.55
ssd    7.6 TiB  7.0 TiB  664 GiB   664 GiB   8.49
TOTAL   17 TiB   15 TiB  1.2 TiB   1.2 TiB   7.45

--- POOLS ---
POOL                      ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
rbd    0   64  4.4 GiB    1.11k  8.7 GiB 0.06    6.8 TiB
.rgw.root  1   32   41 KiB   12  633 KiB 0    6.8 TiB
default.rgw.control    2   32   24 KiB    8   47 KiB 0    6.8 TiB
default.rgw.data.root  3   32   10 KiB    0   21 KiB 0    6.8 TiB
default.rgw.gc 4   32  1.3 MiB   32  4.7 MiB 0    6.8 TiB
default.rgw.log    5   32  5.5 MiB  179   11 MiB 0    6.8 TiB
default.rgw.users.uid  6   32  2.5 KiB    1   72 KiB 0    6.8 TiB
mathfs_data   12   32  140 GiB    1.06M  388 GiB 2.69    6.0 TiB
mathfs_metadata   13   32  598 MiB   75.75k  1.8 GiB 0.01    4.6 TiB
default.rgw.lc    15   32  245 KiB   32  491 KiB 0    6.8 TiB
libvirt   21   32  172 GiB   44.47k  491 GiB 3.38    4.6 TiB
monthly_archive_metadata  36   32  426 MiB   20.66k  853 MiB 0    6.8 TiB
monthly_archive_data  37   32   39 GiB  263.23k   93 GiB 0.66    6.8 TiB
device_health_metrics 38    1   84 MiB   22  168 MiB 0    6.8 TiB
lensfun_metadata  41   32  246 MiB  544  493 MiB 0    6.8 TiB
lensfun_data  42   32  131 GiB   37.65k  263 GiB 1.84    6.8 TiB
default.rgw.users.keys    43   32 13 B    1  128 KiB 0    6.8 TiB


The default recommendations are to have a min_size of 2 and replica 3 
per replicated pool.


Thanks. I don't recall creating any of the default.* pools, so they 
might have been created by ceph-deploy, years ago (kraken?). They all have 
min_size 1, replica 2.


--
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-27 Thread Bogdan Adrian Velica
Hi,

Can you please tell us the size of your ceph cluster? How many OSDs do you
have?
The default recommendations are to have a min_size of 2 and replica 3 per
replicated pool.

Thank you,
Bogdan Velica
croit.io

On Fri, May 27, 2022 at 6:33 PM Sarunas Burdulis 
wrote:

> On 5/27/22 04:54, Robert Sander wrote:
> > Am 26.05.22 um 20:21 schrieb Sarunas Burdulis:
> >> size 2 min_size 1
> >
> > With such a setting you are guaranteed to lose data.
>
> What would you suggest?
>
> --
> Sarunas Burdulis
> Dartmouth Mathematics
> math.dartmouth.edu/~sarunas
>
> · https://useplaintext.email ·
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-27 Thread Sarunas Burdulis

On 5/27/22 04:54, Robert Sander wrote:

Am 26.05.22 um 20:21 schrieb Sarunas Burdulis:

size 2 min_size 1


With such a setting you are guaranteed to lose data.


What would you suggest?

--
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas

· https://useplaintext.email ·
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-27 Thread Sarunas Burdulis

On 5/26/22 14:38, Wesley Dillingham wrote:

pool 13 'mathfs_metadata' replicated size 2 min_size 2 crush_rule 0
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change

The problem is you have size=2 and min_size=2 on this pool. I would 
increase the size of this pool to 3 (but I would also do that to all of 
your pools which are size=2). The ok-to-stop command is failing because 
you would drop below min_size by stopping any OSD serving this PG, and 
those PGs would then become inactive.


Thank you. Having size>min_size for all pools solved this (all OSDs 
became ok-to-stop and upgrade to 16.2.8 completed).
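
For the archives, the per-pool change is just the usual CLI, e.g. (pool name is a placeholder, repeat for every pool where size was not greater than min_size):

    ceph osd pool set <pool> size 3
    ceph osd pool set <pool> min_size 2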


--
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas

· https://useplaintext.email ·
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-27 Thread Robert Sander

Am 26.05.22 um 20:21 schrieb Sarunas Burdulis:

size 2 min_size 1


With such a setting you are guaranteed to lose data.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Wesley Dillingham
pool 13 'mathfs_metadata' replicated size 2 min_size 2 crush_rule 0
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change

The problem is you have size=2 and min_size=2 on this pool. I would
increase the size of this pool to 3 (but I would also do that to all of
your pools which are size=2). The ok-to-stop command is failing because you
would drop below min_size by stopping any OSD serving this PG, and those PGs
would then become inactive.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Thu, May 26, 2022 at 2:22 PM Sarunas Burdulis 
wrote:

> On 5/26/22 14:09, Wesley Dillingham wrote:
> > What does "ceph osd pool ls detail" say?
>
> $ ceph osd pool ls detail
> pool 0 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 44740 flags
> hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 1 '.rgw.root' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 44740 lfor
> 0/0/31483 owner 18446744073709551615 flags hashpspool stripe_width 0
> application rgw
> pool 2 'default.rgw.control' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31469 owner 18446744073709551615 flags hashpspool
> stripe_width 0 application rgw
> pool 3 'default.rgw.data.root' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31471 owner 18446744073709551615 flags hashpspool
> stripe_width 0 application rgw
> pool 4 'default.rgw.gc' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31471 owner 18446744073709551615 flags hashpspool
> stripe_width 0 application rgw
> pool 5 'default.rgw.log' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31387 owner 18446744073709551615 flags hashpspool
> stripe_width 0 application rgw
> pool 6 'default.rgw.users.uid' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31387 flags hashpspool stripe_width 0 application rgw
> pool 12 'mathfs_data' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/31370/31368 flags hashpspool stripe_width 0 application cephfs
> pool 13 'mathfs_metadata' replicated size 2 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/27164/27162 flags hashpspool stripe_width 0 application cephfs
> pool 15 'default.rgw.lc' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31374 flags hashpspool stripe_width 0 application rgw
> pool 21 'libvirt' replicated size 3 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 56244 lfor
> 0/33144/33142 flags hashpspool,selfmanaged_snaps stripe_width 0
> application rbd
> pool 36 'monthly_archive_metadata' replicated size 2 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 45338 lfor 0/27845/27843 flags hashpspool stripe_width 0
> application cephfs
> pool 37 'monthly_archive_data' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 45334 lfor 0/44535/44533 flags hashpspool stripe_width 0 application cephfs
> pool 38 'device_health_metrics' replicated size 2 min_size 1 crush_rule
> 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change
> 56507 flags hashpspool stripe_width 0 pg_num_min 1 application
> mgr_devicehealth
> pool 41 'lensfun_metadata' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 54066 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
> recovery_priority 5 application cephfs
> pool 42 'lensfun_data' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 54066 flags hashpspool stripe_width 0 application cephfs
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Sarunas Burdulis

On 5/26/22 14:09, Wesley Dillingham wrote:

What does "ceph osd pool ls detail" say?


$ ceph osd pool ls detail
pool 0 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 44740 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 1 '.rgw.root' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 44740 lfor 
0/0/31483 owner 18446744073709551615 flags hashpspool stripe_width 0 
application rgw
pool 2 'default.rgw.control' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31469 owner 18446744073709551615 flags hashpspool 
stripe_width 0 application rgw
pool 3 'default.rgw.data.root' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31471 owner 18446744073709551615 flags hashpspool 
stripe_width 0 application rgw
pool 4 'default.rgw.gc' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31471 owner 18446744073709551615 flags hashpspool 
stripe_width 0 application rgw
pool 5 'default.rgw.log' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31387 owner 18446744073709551615 flags hashpspool 
stripe_width 0 application rgw
pool 6 'default.rgw.users.uid' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31387 flags hashpspool stripe_width 0 application rgw
pool 12 'mathfs_data' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/31370/31368 flags hashpspool stripe_width 0 application cephfs
pool 13 'mathfs_metadata' replicated size 2 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/27164/27162 flags hashpspool stripe_width 0 application cephfs
pool 15 'default.rgw.lc' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31374 flags hashpspool stripe_width 0 application rgw
pool 21 'libvirt' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 56244 lfor 
0/33144/33142 flags hashpspool,selfmanaged_snaps stripe_width 0 
application rbd
pool 36 'monthly_archive_metadata' replicated size 2 min_size 1 
crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on 
last_change 45338 lfor 0/27845/27843 flags hashpspool stripe_width 0 
application cephfs
pool 37 'monthly_archive_data' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
45334 lfor 0/44535/44533 flags hashpspool stripe_width 0 application cephfs
pool 38 'device_health_metrics' replicated size 2 min_size 1 crush_rule 
0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 
56507 flags hashpspool stripe_width 0 pg_num_min 1 application 
mgr_devicehealth
pool 41 'lensfun_metadata' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
54066 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 
recovery_priority 5 application cephfs
pool 42 'lensfun_data' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
54066 flags hashpspool stripe_width 0 application cephfs



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Wesley Dillingham
What does "ceph osd pool ls detail" say?

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Thu, May 26, 2022 at 11:24 AM Sarunas Burdulis <
saru...@math.dartmouth.edu> wrote:

> Running
>
> `ceph osd ok-to-stop 0`
>
> shows:
>
> {"ok_to_stop":false,"osds":[1],
> "num_ok_pgs":25,"num_not_ok_pgs":2,
> "bad_become_inactive":["13.a","13.11"],
>
> "ok_become_degraded":["0.4","0.b","0.11","0.1a","0.1e","0.3c","2.5","2.10","3.19","3.1a","4.7","4.19","4.1e","6.10","12.1","12.6","15.9","21.17","21.18","36.8","36.13","41.7","41.1b","42.6","42.1a"]}
> Error EBUSY: unsafe to stop osd(s) at this time (2 PGs are or would
> become offline)
>
> What are “bad_become_inactive” PGs?
> What can be done to make an OSD “ok-to-stop” (or override it)?
>
> `ceph -s` still reports HEALTH_OK and all PGs active+clean.
>
> Upgrade to 16.2.8 still complains about non-stoppable OSDs and won't
> proceed.
>
> --
> Sarunas Burdulis
> Dartmouth Mathematics
> math.dartmouth.edu/~sarunas
>
> · https://useplaintext.email ·
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Sarunas Burdulis

Running

`ceph osd ok-to-stop 0`

shows:

{"ok_to_stop":false,"osds":[1],
"num_ok_pgs":25,"num_not_ok_pgs":2,
"bad_become_inactive":["13.a","13.11"],
"ok_become_degraded":["0.4","0.b","0.11","0.1a","0.1e","0.3c","2.5","2.10","3.19","3.1a","4.7","4.19","4.1e","6.10","12.1","12.6","15.9","21.17","21.18","36.8","36.13","41.7","41.1b","42.6","42.1a"]}
Error EBUSY: unsafe to stop osd(s) at this time (2 PGs are or would 
become offline)


What are “bad_become_inactive” PGs?
What can be done to make an OSD “ok-to-stop” (or override it)?

`ceph -s` still reports HEALTH_OK and all PGs active+clean.

Upgrade to 16.2.8 still complains about non-stoppable OSDs and won't 
proceed.
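
(For reference: the number before the dot in a PG ID is the pool ID, so 13.a and 13.11 both live in pool 13. Something like the following should confirm which pool that is and what its size/min_size are:

    ceph pg map 13.a
    ceph osd pool ls detail | grep "^pool 13 "
)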


--
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas

· https://useplaintext.email ·
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-25 Thread Sarunas Burdulis

On 25/05/2022 15.39, Tim Olow wrote:

Do you have any pools with only one replica?


All pools are 'replicated size' 2 or 3, 'min_size' 1 or 2.

--
Sarunas Burdulis
Dartmouth Mathematics
https://math.dartmouth.edu/~sarunas

· https://useplaintext.email ·
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-25 Thread Tim Olow
Do you have any pools with only one replica?

Tim

On 5/25/22, 1:48 PM, "Sarunas Burdulis"  wrote:

> ceph health detail says my 5-node cluster is healthy, yet when I ran 
> ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go 
> fine until we got to the OSD section, now for the past hour, every 15 
> seconds a new log entry of   'Upgrade: unsafe to stop osd(s) at this time 
> (1 PGs are or would become offline)' appears in the logs.

Hi,

Has there been any solution or workaround to this?

We have a seemingly healthy cluster, which is stuck on OSD upgrade step 
when upgrading from 15.2.16 to 16.2.8 with the same error(s).

-- 
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas

· https://useplaintext.email ·


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-25 Thread Sarunas Burdulis
ceph health detail says my 5-node cluster is healthy, yet when I ran 
ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go 
fine until we got to the OSD section, now for the past hour, every 15 
seconds a new log entry of   'Upgrade: unsafe to stop osd(s) at this time 
(1 PGs are or would become offline)' appears in the logs.


Hi,

Has there been any solution or workaround to this?

We have a seemingly healthy cluster, which is stuck on OSD upgrade step 
when upgrading from 15.2.16 to 16.2.8 with the same error(s).


--
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas

· https://useplaintext.email ·
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread Zach Heise (SSCC)

  
Yes, these 8 PGs have been in this 'remapped' state for quite a while. I don't 
know why CRUSH has not seen fit to designate new OSDs for them so that acting 
and up match.

For the error in question - ceph upgrade is saying that only 1 PG would become 
offline if OSD(s) were stopped. So if these 8 PGs were causing the problem, I 
thought it would tell me specifically.

Greg, is there a way I could check if CRUSH is failing to map properly and 
figure out why? Because HEALTH_OK is shown even with these 8 
active+clean+remapped PGs I assumed it was normal/okay for it to be in this state.
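
One rough way to check, I suppose (a sketch; it assumes crushtool is available and that rule 0 with 3 replicas is what these pools use, so adjust the rule ID and replica count as needed):

    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -i /tmp/crushmap --test --rule 0 --num-rep 3 --show-bad-mappings
    ceph pg map 7.11    # compare up vs acting for one of the remapped PGs

If --show-bad-mappings prints nothing, the rule itself can satisfy the requested replica count.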

 

   

PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
7.11  42  0  0  42  0  1.76E+08  0  0  631  631  active+clean+remapped  2022-02-10T05:16:21.791091+  8564'631  10088:11277  [15,7]  15  [15,7,11]  15  8564'631  2022-02-10T05:16:21.791028+  8564'631  2022-02-08T17:38:26.806576+  0
9.17  23  0  0  23  0  88554155  0  0  2700  2700  active+clean+remapped  2022-02-09T22:40:19.229658+  9668'2700  10088:15778  [22,9]  22  [22,9,2]  22  9668'2700  2022-02-09T22:40:19.229581+  9668'2700  2022-02-06T13:09:04.264912+  0
11.10  3  0  0  3  0  9752576  0  0  6323  6323  active+clean+remapped  2022-02-10T16:56:10.410048+  6255'6323  10088:23237  [0,19]  0  [0,19,2]  0  6255'6323  2022-02-10T16:56:10.409954+  6255'6323  2022-02-05T18:08:35.490642+  0
11.19  2  0  0  2  0  4194304  0  0  10008  10008  active+clean+remapped  2022-02-09T21:52:33.190075+  9862'14908  10088:38973  [19,9]  19  [19,9,12]  19  9862'14908  2022-02-09T21:52:33.190002+  8852'14906  2022-02-04T21:34:27.141103+  0
11.1a  2  0  0  2  0  4194323  0  0  8522  8522  active+clean+remapped  2022-02-10T10:08:29.451623+  5721'8522  10088:29920  [12,24]  12  [12,24,28]  12  5721'8522  2022-02-10T10:08:29.451543+  5721'8522  2022-02-09T04:45:34.096178+  0
7.1a  67  0  0  67  0  2.81E+08  0  0  1040  1040  active+clean+remapped  2022-02-09T18:39:53.571433+  8537'1040  10088:13580  [20,3]  20  [20,3,28]  20  8537'1040  2022-02-09T18:39:53.571328+  8537'1040  2022-02-09T18:39:53.571328+  0
7.e  63  0  0  63  0  2.6E+08  0  0  591  591  active+clean+remapped  2022-02-10T11:40:11.560673+  8442'591  10088:11607  [25,3]  25  [25,3,33]  25  8442'591  2022-02-10T11:40:11.560567+  8442'591  2022-02-10T11:40:11.560567+  0
9.d  29  0  0  29  0  1.17E+08  0  0  2448  2448  active+clean+remapped  2022-02-10T14:22:42.203264+  9784'2448  10088:16349  [22,2]  22  [22,2,8]  22  9784'2448  2022-02-10T14:22:42.203183+  9784'2448  2022-02-06T15:38:36.38980

[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread Gregory Farnum
“Up” is the set of OSDs which are alive from the calculated crush mapping.
“Acting” includes those extras which have been added in to bring the PG up
to proper size. So the PG does have 3 live OSDs serving it.

But perhaps the safety check *is* looking at up instead of acting? That
seems like a plausible bug. (Also, if crush is failing to map properly,
that’s not a great sign for your cluster health or design.)
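
(For reference, a quick way to list the PGs where up and acting differ, assuming the usual pgs_brief column order of PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY:

    ceph pg ls remapped
    ceph pg dump pgs_brief | awk 'NR>1 && $3 != $5 {print $1, "up="$3, "acting="$5}'

Either should show the 8 remapped PGs.)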

On Thu, Feb 10, 2022 at 11:26 AM 胡 玮文  wrote:

> I believe this is the reason.
>
> I mean the number of OSDs in the “up” set should be at least 1 greater than
> the min_size for the upgrade to proceed. Otherwise, once any OSD is stopped, it
> can drop below min_size and prevent the PG from becoming active. So just clean
> up the misplaced objects and the upgrade should proceed automatically.
>
> But I’m a little confused. I think if you have only 2 up OSDs in a
> replica x3 pool, it should be in a degraded state, and should give you a
> HEALTH_WARN.
>
> On Feb 11, 2022, at 03:06, Zach Heise (SSCC) wrote:
>
> 
>
> Hi Weiwen, thanks for replying.
>
> All of my replicated pools, including the newest ssdpool I made most
> recently, have a min_size of 2. My other two EC pools have a min_size of 3.
>
> Looking at pg dump output again, it does look like the two EC pools have
> exactly 4 OSDs listed in the "Acting" column, and everything else has 3
> OSDs in Acting. So that's as it should be, I believe?
>
> I do have some 'misplaced' objects on 8 different PGs (the
> active+clean+remapped ones listed in my original ceph -s output), that only
> have 2 "up" OSDs listed, but in the "Acting" columns each have 3 OSDs as
> they should. Apparently these 231 misplaced objects aren't enough to cause
> ceph to drop out of HEALTH_OK status.
>
> Zach
>
>
> On 2022-02-10 12:41 PM, huw...@outlook.com
> wrote:
>
> Hi Zach,
>
> How about your min_size setting? Have you checked the number of OSDs in
> the acting set of every PG is at least 1 greater than the min_size of the
> corresponding pool?
>
> Weiwen Hu
>
>
>
> On Feb 10, 2022, at 05:02, Zach Heise (SSCC) <he...@ssc.wisc.edu> wrote:
>
> Hello,
>
> ceph health detail says my 5-node cluster is healthy, yet when I ran ceph
> orch upgrade start --ceph-version 16.2.7 everything seemed to go fine until
> we got to the OSD section, now for the past hour, every 15 seconds a new
> log entry of  'Upgrade: unsafe to stop osd(s) at this time (1 PGs are or
> would become offline)' appears in the logs.
>
> ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything too.
> Yet somehow 1 PG is (apparently) holding up all the OSD upgrades and not
> letting the process finish. Should I stop the upgrade and try it again? (I
> haven't done that before so was just nervous to try it). Any other ideas?
>
>  cluster:
>id: 9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
>health: HEALTH_OK
>   services:
>mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
>mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
>mds: 1/1 daemons up, 1 hot standby
>osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
>   data:
>volumes: 1/1 healthy
>pools:   7 pools, 193 pgs
>objects: 3.72k objects, 14 GiB
>usage:   43 GiB used, 64 TiB / 64 TiB avail
>pgs: 231/11170 objects misplaced (2.068%)
> 185 active+clean
> 8   active+clean+remapped
>   io:
>client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
>   progress:
>Upgrade to 16.2.7 (5m)
>  [=...] (remaining: 24m)
>
> --
> Zach
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io ceph-users-le...@ceph.io>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread 胡 玮文
I believe this is the reason.

I mean the number of OSDs in the “up” set should be at least 1 greater than the 
min_size for the upgrade to proceed. Otherwise, once any OSD is stopped, it can 
drop below min_size and prevent the PG from becoming active. So just clean up 
the misplaced objects and the upgrade should proceed automatically.

But I’m a little confused. I think if you have only 2 up OSDs in a replica x3 
pool, it should be in a degraded state, and should give you a HEALTH_WARN.

On Feb 11, 2022, at 03:06, Zach Heise (SSCC) wrote:



Hi Weiwen, thanks for replying.

All of my replicated pools, including the newest ssdpool I made most recently, 
have a min_size of 2. My other two EC pools have a min_size of 3.

Looking at pg dump output again, it does look like the two EC pools have 
exactly 4 OSDs listed in the "Acting" column, and everything else has 3 OSDs in 
Acting. So that's as it should be, I believe?

I do have some 'misplaced' objects on 8 different PGs (the 
active+clean+remapped ones listed in my original ceph -s output), that only 
have 2 "up" OSDs listed, but in the "Acting" columns each have 3 OSDs as they 
should. Apparently these 231 misplaced objects aren't enough to cause ceph to 
drop out of HEALTH_OK status.

Zach


On 2022-02-10 12:41 PM, huw...@outlook.com wrote:

Hi Zach,

How about your min_size setting? Have you checked that the number of OSDs in the 
acting set of every PG is at least 1 greater than the min_size of the 
corresponding pool?

Weiwen Hu



On Feb 10, 2022, at 05:02, Zach Heise (SSCC) wrote:

Hello,

ceph health detail says my 5-node cluster is healthy, yet when I ran ceph orch 
upgrade start --ceph-version 16.2.7 everything seemed to go fine until we got 
to the OSD section, now for the past hour, every 15 seconds a new log entry of  
'Upgrade: unsafe to stop osd(s) at this time (1 PGs are or would become 
offline)' appears in the logs.

ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything too. Yet 
somehow 1 PG is (apparently) holding up all the OSD upgrades and not letting 
the process finish. Should I stop the upgrade and try it again? (I haven't done 
that before so was just nervous to try it). Any other ideas?

 cluster:
   id: 9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
   health: HEALTH_OK
  services:
   mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
   mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
   mds: 1/1 daemons up, 1 hot standby
   osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
  data:
   volumes: 1/1 healthy
   pools:   7 pools, 193 pgs
   objects: 3.72k objects, 14 GiB
   usage:   43 GiB used, 64 TiB / 64 TiB avail
   pgs: 231/11170 objects misplaced (2.068%)
185 active+clean
8   active+clean+remapped
  io:
   client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
  progress:
   Upgrade to 16.2.7 (5m)
 [=...] (remaining: 24m)

--
Zach
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread Zach Heise (SSCC)

  
Hi Weiwen, thanks for replying.

All of my replicated pools, including the newest ssdpool I made most recently, 
have a min_size of 2. My other two EC pools have a min_size of 3.

Looking at pg dump output again, it does look like the two EC pools have 
exactly 4 OSDs listed in the "Acting" column, and everything else has 3 OSDs in 
Acting. So that's as it should be, I believe?

I do have some 'misplaced' objects on 8 different PGs (the 
active+clean+remapped ones listed in my original ceph -s output), that only 
have 2 "up" OSDs listed, but in the "Acting" columns each have 3 OSDs as they 
should. Apparently these 231 misplaced objects aren't enough to cause ceph to 
drop out of HEALTH_OK status.

Zach

On 2022-02-10 12:41 PM, huw...@outlook.com wrote:


Hi Zach,

How about your min_size setting? Have you checked that the number of OSDs in the acting set of every PG is at least 1 greater than the min_size of the corresponding pool?

Weiwen Hu


  
On Feb 10, 2022, at 05:02, Zach Heise (SSCC) wrote:

Hello,

ceph health detail says my 5-node cluster is healthy, yet when I ran ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go fine until we got to the OSD section, now for the past hour, every 15 seconds a new log entry of  'Upgrade: unsafe to stop osd(s) at this time (1 PGs are or would become offline)' appears in the logs.

ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything too. Yet somehow 1 PG is (apparently) holding up all the OSD upgrades and not letting the process finish. Should I stop the upgrade and try it again? (I haven't done that before so was just nervous to try it). Any other ideas?

 cluster:
   id: 9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
   health: HEALTH_OK
  services:
   mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
   mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
   mds: 1/1 daemons up, 1 hot standby
   osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
  data:
   volumes: 1/1 healthy
   pools:   7 pools, 193 pgs
   objects: 3.72k objects, 14 GiB
   usage:   43 GiB used, 64 TiB / 64 TiB avail
   pgs: 231/11170 objects misplaced (2.068%)
185 active+clean
8   active+clean+remapped
  io:
   client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
  progress:
   Upgrade to 16.2.7 (5m)
 [=...] (remaining: 24m)

-- 
Zach
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

  

  

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread 胡 玮文
Hi Zach,

How about your min_size setting? Have you checked that the number of OSDs in the 
acting set of every PG is at least 1 greater than the min_size of the 
corresponding pool?

Weiwen Hu

> On Feb 10, 2022, at 05:02, Zach Heise (SSCC) wrote:
> 
> Hello,
> 
> ceph health detail says my 5-node cluster is healthy, yet when I ran ceph 
> orch upgrade start --ceph-version 16.2.7 everything seemed to go fine until 
> we got to the OSD section, now for the past hour, every 15 seconds a new log 
> entry of  'Upgrade: unsafe to stop osd(s) at this time (1 PGs are or would 
> become offline)' appears in the logs.
> 
> ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything too. 
> Yet somehow 1 PG is (apparently) holding up all the OSD upgrades and not 
> letting the process finish. Should I stop the upgrade and try it again? (I 
> haven't done that before so was just nervous to try it). Any other ideas?
> 
>  cluster:
>id: 9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
>health: HEALTH_OK
>   services:
>mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
>mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
>mds: 1/1 daemons up, 1 hot standby
>osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
>   data:
>volumes: 1/1 healthy
>pools:   7 pools, 193 pgs
>objects: 3.72k objects, 14 GiB
>usage:   43 GiB used, 64 TiB / 64 TiB avail
>pgs: 231/11170 objects misplaced (2.068%)
> 185 active+clean
> 8   active+clean+remapped
>   io:
>client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
>   progress:
>Upgrade to 16.2.7 (5m)
>  [=...] (remaining: 24m)
> 
> -- 
> Zach
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread Zach Heise (SSCC)

  
That's an excellent point! Between my last ceph upgrade and now, I did make a 
new crush ruleset and a new pool that uses that crush rule. It was just for 
SSDs, of which I have 5, one per host.

All of my other pools are using the default crush rulesets "replicated_rule" 
for the Replica x3, and "erasure-code" for the EC pools.

{
    "rule_id": 2,
    "rule_name": "highspeedSSD",
    "ruleset": 2,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -27,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

They're on OSDs 5, 14, 26, 32 and 33. The new "ssdpool" is number 13 and is a 
replica x3.

So I looked at all the PGs starting with 13 (table below) and yes, just as one 
would hope, they are only using the SSD OSDs. But, each SSD and its 
corresponding OSD is on a different host. Every one of the 32 PGs has its 3x 
OSDs therefore also on different hosts. No PG for this pool is being replicated 
to the same host.

If one of the PGs for this pool were not being properly replicated to 3 
different OSDs on 3 different hosts, that would be a warning or error I'd think?
 

 

PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
13.1a  36  0  0  0  0  1.51E+08  0  0  190  190  active+clean  2022-02-10T02:52:45.595274+  8595'190  10088:2829  [5,33,14]  5  [5,33,14]  5  8595'190  2022-02-10T02:52:45.595191+  8595'190  2022-02-09T01:48:54.354375+  0
13.1b  34  0  0  0  0  1.43E+08  0  0  186  186  active+clean  2022-02-10T09:57:13.864888+  8595'186  10088:2764  [32,5,26]  32  [32,5,26]  32  8595'186  2022-02-10T09:57:13.864816+  8595'186  2022-02-10T09:57:13.864816+  0
13.1c  30  0  0  0  0  1.26E+08  0  0  174  174  active+clean  2022-02-10T00:51:38.215782+  8531'174  10088:2565  [32,26,33]  32  [32,26,33]  32  8531'174  2022-02-10T00:51:38.215699+  8531'174  2022-02-04T00:13:03.051036+  0
13.1d  31  0  0  0  0  1.26E+08  0  0  182  182  active+clean  2022-02-09T22:34:36.010164+  8472'182  10088:2954  [14,5,33]  14  [14,5,33]  14  8472'182  2022-02-09T22:34:36.010085+  8472'182  2022-02-09T22:34:36.010085+  0
13.1e  22  0  0  0  0  92274688  0  0  140  140  active+clean  2022-02-09T18:39:59.737323+  8563'140  10088:2286  [33,14,32]  33  [33,14,32]  33  8563'140  2022-02-09T09:27:55.107416+  8563'140  2022-02-08T07:54:49.474418+  0
13.1f  31  0  0  0  0  1.3E+08  0  0  208  208  active+clean  2022-02-10T02:51:33.770146+  8412'208  10088:2784  [32,26,5]  32  [32,26,5]  32  8412'208  2022-02-10T02:51:33.770066+  8412'208  2022-02-08T19:42:17.625397+  0
13.a  32  0  0  0  0

[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread Gregory Farnum
I don’t know how to get better errors out of cephadm, but the only way I
can think of for this to happen is if your crush rule is somehow placing
multiple replicas of a pg on a single host that cephadm wants to upgrade.
So check your rules, your pool sizes, and osd tree?
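
Concretely, that check amounts to the standard CLI (roughly; nothing here is cephadm-specific):

    ceph osd crush rule dump     # the rules, including any new SSD-only rule
    ceph osd pool ls detail      # size / min_size / crush_rule per pool
    ceph osd tree                # which OSDs sit on which hosts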
-Greg

On Thu, Feb 10, 2022 at 8:25 AM Zach Heise (SSCC) 
wrote:

> It could be an issue with the devicehealthpool as you are correct, it is a
> single PG - but when the cluster is reporting that everything is healthy,
> it's difficult to know where to go from there. What I don't understand is why
> it's refusing to upgrade ANY of the osd daemons; I have 33 of them, why would
> a single PG going offline be a problem for all of them?
>
> I did try stopping the upgrade and restarting it, but it just picks up at
> the same place (11/56 daemons upgraded) and immediately reports the same
> issue.
>
> Is there any way to at least tell which PG is the problematic one?
>
>
> Zach
>
>
> On 2022-02-09 4:19 PM, anthony.da...@gmail.com wrote:
>
> Speculation:  might the devicehealth pool be involved?  It seems to typically 
> have just 1 PG.
>
>
>
>
> On Feb 9, 2022, at 1:41 PM, Zach Heise (SSCC)  
>  wrote:
>
> Good afternoon, thank you for your reply. Yes I know you are right, 
> eventually we'll switch to an odd number of mons rather than even. We're 
> still in 'testing' mode right now and only my coworkers and I are using the 
> cluster.
>
> Of the 7 pools, all but 2 are replica x3. The last two are EC 2+2.
>
> Zach Heise
>
>
> On 2022-02-09 3:38 PM, sascha.art...@gmail.com wrote:
>
> Hello,
>
> all your pools running replica > 1?
> also having 4 monitors is pretty bad for split brain situations..
>
> Zach Heise (SSCC)  wrote on Wed, 9 Feb 2022, 22:02:
>
>Hello,
>
>ceph health detail says my 5-node cluster is healthy, yet when I ran
>ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go
>fine until we got to the OSD section, now for the past hour, every 15
>seconds a new log entry of  'Upgrade: unsafe to stop osd(s) at
>this time
>(1 PGs are or would become offline)' appears in the logs.
>
>ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything
>too. Yet somehow 1 PG is (apparently) holding up all the OSD upgrades
>and not letting the process finish. Should I stop the upgrade and
>try it
>again? (I haven't done that before so was just nervous to try it).
>Any
>other ideas?
>
>   cluster:
> id: 9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
> health: HEALTH_OK
>
>   services:
> mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
> mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
> mds: 1/1 daemons up, 1 hot standby
> osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
>
>   data:
> volumes: 1/1 healthy
> pools:   7 pools, 193 pgs
> objects: 3.72k objects, 14 GiB
> usage:   43 GiB used, 64 TiB / 64 TiB avail
> pgs: 231/11170 objects misplaced (2.068%)
>  185 active+clean
>  8   active+clean+remapped
>
>   io:
> client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
>
>   progress:
> Upgrade to 16.2.7 (5m)
>   [=...] (remaining: 24m)
>
>-- Zach
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread Zach Heise (SSCC)

  
It could be an issue with the devicehealthpool as you are correct, it is a 
single PG - but when the cluster is reporting that everything is healthy, it's 
difficult to know where to go from there. What I don't understand is why it's 
refusing to upgrade ANY of the osd daemons; I have 33 of them, why would a 
single PG going offline be a problem for all of them?

I did try stopping the upgrade and restarting it, but it just picks up at the 
same place (11/56 daemons upgraded) and immediately reports the same issue.

Is there any way to at least tell which PG is the problematic one?
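
(One crude way that ought to work: ask each OSD directly; any blocking PGs should show up as "bad_become_inactive" in the ok-to-stop output. A sketch, run from any admin/mon node:

    for id in $(ceph osd ls); do
        echo "== osd.$id =="
        ceph osd ok-to-stop "$id"
    done
)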

Zach

On 2022-02-09 4:19 PM, anthony.da...@gmail.com wrote:

Speculation:  might the devicehealth pool be involved?  It seems to typically have just 1 PG.




  
On Feb 9, 2022, at 1:41 PM, Zach Heise (SSCC)  wrote:

Good afternoon, thank you for your reply. Yes I know you are right, eventually we'll switch to an odd number of mons rather than even. We're still in 'testing' mode right now and only my coworkers and I are using the cluster.

Of the 7 pools, all but 2 are replica x3. The last two are EC 2+2.

Zach Heise


On 2022-02-09 3:38 PM, sascha.art...@gmail.com wrote:


  Hello,

all your pools running replica > 1?
also having 4 monitors is pretty bad for split brain situations..

Zach Heise (SSCC)  wrote on Wed, 9 Feb 2022, 22:02:

   Hello,

   ceph health detail says my 5-node cluster is healthy, yet when I ran
   ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go
   fine until we got to the OSD section, now for the past hour, every 15
   seconds a new log entry of  'Upgrade: unsafe to stop osd(s) at
   this time
   (1 PGs are or would become offline)' appears in the logs.

   ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything
   too. Yet somehow 1 PG is (apparently) holding up all the OSD upgrades
   and not letting the process finish. Should I stop the upgrade and
   try it
   again? (I haven't done that before so was just nervous to try it).
   Any
   other ideas?

  cluster:
id: 9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
health: HEALTH_OK

  services:
mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
mds: 1/1 daemons up, 1 hot standby
osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs

  data:
volumes: 1/1 healthy
pools:   7 pools, 193 pgs
objects: 3.72k objects, 14 GiB
usage:   43 GiB used, 64 TiB / 64 TiB avail
pgs: 231/11170 objects misplaced (2.068%)
 185 active+clean
 8   active+clean+remapped

  io:
client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

  progress:
Upgrade to 16.2.7 (5m)
  [=...] (remaining: 24m)

   -- Zach


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

  
  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


  

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-09 Thread Anthony D'Atri


Speculation:  might the devicehealth pool be involved?  It seems to typically 
have just 1 PG.
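
(Easy to check, assuming the pool still has its default name:

    ceph osd pool ls detail | grep device_health
    ceph osd pool get device_health_metrics size

which would show whether that one-PG pool is replicated at all.)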



> On Feb 9, 2022, at 1:41 PM, Zach Heise (SSCC)  wrote:
> 
> Good afternoon, thank you for your reply. Yes I know you are right, 
> eventually we'll switch to an odd number of mons rather than even. We're 
> still in 'testing' mode right now and only my coworkers and I are using the 
> cluster.
> 
> Of the 7 pools, all but 2 are replica x3. The last two are EC 2+2.
> 
> Zach Heise
> 
> 
> On 2022-02-09 3:38 PM, sascha.art...@gmail.com wrote:
>> Hello,
>> 
>> all your pools running replica > 1?
>> also having 4 monitors is pretty bad for split brain situations..
>> 
>> Zach Heise (SSCC)  wrote on Wed, 9 Feb 2022, 22:02:
>> 
>>Hello,
>> 
>>ceph health detail says my 5-node cluster is healthy, yet when I ran
>>ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go
>>fine until we got to the OSD section, now for the past hour, every 15
>>seconds a new log entry of  'Upgrade: unsafe to stop osd(s) at
>>this time
>>(1 PGs are or would become offline)' appears in the logs.
>> 
>>ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything
>>too. Yet somehow 1 PG is (apparently) holding up all the OSD upgrades
>>and not letting the process finish. Should I stop the upgrade and
>>try it
>>again? (I haven't done that before so was just nervous to try it).
>>Any
>>other ideas?
>> 
>>   cluster:
>> id: 9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
>> health: HEALTH_OK
>> 
>>   services:
>> mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
>> mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
>> mds: 1/1 daemons up, 1 hot standby
>> osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
>> 
>>   data:
>> volumes: 1/1 healthy
>> pools:   7 pools, 193 pgs
>> objects: 3.72k objects, 14 GiB
>> usage:   43 GiB used, 64 TiB / 64 TiB avail
>> pgs: 231/11170 objects misplaced (2.068%)
>>  185 active+clean
>>  8   active+clean+remapped
>> 
>>   io:
>> client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
>> 
>>   progress:
>> Upgrade to 16.2.7 (5m)
>>   [=...] (remaining: 24m)
>> 
>>-- Zach
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-09 Thread Zach Heise (SSCC)
Good afternoon, thank you for your reply. Yes I know you are right, 
eventually we'll switch to an odd number of mons rather than even. We're 
still in 'testing' mode right now and only my coworkers and I are using 
the cluster.


Of the 7 pools, all but 2 are replica x3. The last two are EC 2+2.

Zach Heise


On 2022-02-09 3:38 PM, sascha.art...@gmail.com wrote:

Hello,

all your pools running replica > 1?
also having 4 monitors is pretty bad for split brain situations..

Zach Heise (SSCC)  wrote on Wed, 9 Feb 2022, 22:02:


Hello,

ceph health detail says my 5-node cluster is healthy, yet when I ran
ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go
fine until we got to the OSD section, now for the past hour, every 15
seconds a new log entry of  'Upgrade: unsafe to stop osd(s) at
this time
(1 PGs are or would become offline)' appears in the logs.

ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything
too. Yet somehow 1 PG is (apparently) holding up all the OSD upgrades
and not letting the process finish. Should I stop the upgrade and
try it
again? (I haven't done that before so was just nervous to try it).
Any
other ideas?

   cluster:
     id:     9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
     health: HEALTH_OK

   services:
     mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
     mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
     mds: 1/1 daemons up, 1 hot standby
     osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs

   data:
     volumes: 1/1 healthy
     pools:   7 pools, 193 pgs
     objects: 3.72k objects, 14 GiB
     usage:   43 GiB used, 64 TiB / 64 TiB avail
     pgs:     231/11170 objects misplaced (2.068%)
              185 active+clean
              8   active+clean+remapped

   io:
     client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

   progress:
     Upgrade to 16.2.7 (5m)
       [=...] (remaining: 24m)

-- 
Zach
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-09 Thread sascha a.
Hello,

all your pools running replica > 1?
also having 4 monitors is pretty bad for split brain situations..

Zach Heise (SSCC)  wrote on Wed, 9 Feb 2022, 22:02:

> Hello,
>
> ceph health detail says my 5-node cluster is healthy, yet when I ran
> ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go
> fine until we got to the OSD section, now for the past hour, every 15
> seconds a new log entry of  'Upgrade: unsafe to stop osd(s) at this time
> (1 PGs are or would become offline)' appears in the logs.
>
> ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything
> too. Yet somehow 1 PG is (apparently) holding up all the OSD upgrades
> and not letting the process finish. Should I stop the upgrade and try it
> again? (I haven't done that before so was just nervous to try it). Any
> other ideas?
>
>cluster:
>  id: 9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
>  health: HEALTH_OK
>
>services:
>  mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
>  mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
>  mds: 1/1 daemons up, 1 hot standby
>  osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
>
>data:
>  volumes: 1/1 healthy
>  pools:   7 pools, 193 pgs
>  objects: 3.72k objects, 14 GiB
>  usage:   43 GiB used, 64 TiB / 64 TiB avail
>  pgs: 231/11170 objects misplaced (2.068%)
>   185 active+clean
>   8   active+clean+remapped
>
>io:
>  client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
>
>progress:
>  Upgrade to 16.2.7 (5m)
>[=...] (remaining: 24m)
>
> --
> Zach
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io