[ceph-users] Re: pg repair doesn't start

2022-10-13 Thread Frank Schilder
Hi Eugen,

thanks for your answer. I gave a search another try and did indeed find 
something: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TN6WJVCHTVJ4YIA4JH2D2WYYZFZRMSXI/

Quote: " ... And I've also observed that the repair req isn't queued up -- if 
the OSDs are busy with other scrubs, the repair req is forgotten. ..."

I'm biting my tongue really really hard right now. @Dan (if you read this), 
thanks for the script: 
https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh
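
The workaround, as I understand it, is basically to keep nudging the repair until an OSD actually picks it up. A minimal sketch of that idea (not Dan's script verbatim; the osd_max_scrubs hint is my own assumption):

# re-issue the repair until the PG actually reports the repair state
PG=11.1ba
until ceph pg "$PG" query | grep -q '"state".*repair'; do
    ceph pg repair "$PG"
    sleep 60
done
# optionally give the OSDs more scrub slots so the repair isn't starved out:
# ceph config set osd osd_max_scrubs 2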

New status:

# ceph status
  cluster:
id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
health: HEALTH_ERR
1 scrub errors
Possible data damage: 1 pg inconsistent

  services:
mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
mds: con-fs2:8 4 up:standby 8 up:active
osd: 1086 osds: 1071 up (since 14h), 1070 in (since 4d); 542 remapped pgs

  task status:

  data:
pools:   14 pools, 17185 pgs
objects: 1.39G objects, 2.5 PiB
usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
pgs: 301878494/11947144857 objects misplaced (2.527%)
 16634 active+clean
 513   active+remapped+backfill_wait
 19    active+remapped+backfilling
 10    active+remapped+backfill_wait+forced_backfill
 6     active+clean+scrubbing+deep
 2     active+clean+scrubbing
 1     active+clean+scrubbing+deep+inconsistent+repair

  io:
client:   444 MiB/s rd, 446 MiB/s wr, 2.19k op/s rd, 2.34k op/s wr
recovery: 0 B/s, 223 objects/s

Yay!

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 13 October 2022 23:23:10
To: ceph-users@ceph.io
Subject: [ceph-users] Re: pg repair doesn't start

Hi,

I’m not sure if I remember correctly, but I believe the backfill is
preventing the repair from happening. I think it has been discussed a
couple of times on this list, but I don’t know right now if you can
tweak anything to prioritize the repair; I believe there is, but I'm not
sure. It looks like your backfill could take quite some time…

Zitat von Frank Schilder :

> Hi all,
>
> we have an inconsistent PG for a couple of days now (octopus latest):
>
> # ceph status
>   cluster:
> id:
> health: HEALTH_ERR
> 1 scrub errors
> Possible data damage: 1 pg inconsistent
>
>   services:
> mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
> mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03,
> ceph-02, ceph-01
> mds: con-fs2:8 4 up:standby 8 up:active
> osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs
>
>   task status:
>
>   data:
> pools:   14 pools, 17185 pgs
> objects: 1.39G objects, 2.5 PiB
> usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
> pgs: 305530535/11943726075 objects misplaced (2.558%)
>  16614 active+clean
>  516   active+remapped+backfill_wait
>  23    active+clean+scrubbing+deep
>  21    active+remapped+backfilling
>  10    active+remapped+backfill_wait+forced_backfill
>  1     active+clean+inconsistent
>
>   io:
> client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
> recovery: 0 B/s, 224 objects/s
>
> I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it
> never got executed (checked the logs for repair state). The usual
> wait time we had on our cluster so far was 2-6 hours. 36 hours is
> unusually long. The pool in question is moderately busy and has no
> misplaced objects. Its only unhealthy PG is the inconsistent one.
>
> Are there situations in which ceph cancels/ignores a pg repair?
> Is there any way to check if it is actually still scheduled to happen?
> Is there a way to force it a bit more urgently?
>
> The error was caused by a read error, the drive is healthy:
>
> 2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR]
> 11.1ba shard 294(6) soid
> 11:5df75341:::rbd_data.1.b688997dc79def.0005d530:head :
> candidate had a read error
> 2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR]
> 11.1bas0 deep-scrub 0 missing, 1 inconsistent objects
> 2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR]
> 11.1ba deep-scrub 1 errors
> 2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 :
> cluster [DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent,
> 2 active+clean+scrubbing, 26 active+remapped+backfill_wait, 13
> active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273
> active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail;
> 193 MiB/s rd, 181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects
> misplaced (0.136%); 0 B/s, 513 objects/s recovering
> 

[ceph-users] Re: pg repair doesn't start

2022-10-13 Thread Eugen Block

Hi,

I’m not sure if I remember correctly, but I believe the backfill is
preventing the repair from happening. I think it has been discussed a
couple of times on this list, but I don’t know right now if you can
tweak anything to prioritize the repair; I believe there is, but I'm not
sure. It looks like your backfill could take quite some time…


Zitat von Frank Schilder :


Hi all,

we have an inconsistent PG for a couple of days now (octopus latest):

# ceph status
  cluster:
id:
health: HEALTH_ERR
1 scrub errors
Possible data damage: 1 pg inconsistent

  services:
mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03,  
ceph-02, ceph-01

mds: con-fs2:8 4 up:standby 8 up:active
osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs

  task status:

  data:
pools:   14 pools, 17185 pgs
objects: 1.39G objects, 2.5 PiB
usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
pgs: 305530535/11943726075 objects misplaced (2.558%)
 16614 active+clean
 516   active+remapped+backfill_wait
 23    active+clean+scrubbing+deep
 21    active+remapped+backfilling
 10    active+remapped+backfill_wait+forced_backfill
 1     active+clean+inconsistent

  io:
client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
recovery: 0 B/s, 224 objects/s

I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it  
never got executed (checked the logs for repair state). The usual  
wait time we had on our cluster so far was 2-6 hours. 36 hours is  
unusually long. The pool in question is moderately busy and has no  
misplaced objects. Its only unhealthy PG is the inconsistent one.


Are there situations in which ceph cancels/ignores a pg repair?
Is there any way to check if it is actually still scheduled to happen?
Is there a way to force it a bit more urgently?

The error was caused by a read error, the drive is healthy:

2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR]  
11.1ba shard 294(6) soid  
11:5df75341:::rbd_data.1.b688997dc79def.0005d530:head :  
candidate had a read error
2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR]  
11.1bas0 deep-scrub 0 missing, 1 inconsistent objects
2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR]  
11.1ba deep-scrub 1 errors
2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 :  
cluster [DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent,  
2 active+clean+scrubbing, 26 active+remapped+backfill_wait, 13  
active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273  
active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail;  
193 MiB/s rd, 181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects  
misplaced (0.136%); 0 B/s, 513 objects/s recovering
2022-10-11T19:26:24.246194+0200 mon.ceph-01 (mon.0) 633486 : cluster  
[ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2022-10-11T19:26:24.246215+0200 mon.ceph-01 (mon.0) 633487 : cluster  
[ERR] Health check failed: Possible data damage: 1 pg inconsistent  
(PG_DAMAGED)


Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pg repair doesn't start

2022-10-13 Thread Frank Schilder
Hi all,

we have an inconsistent PG for a couple of days now (octopus latest):

# ceph status
  cluster:
id: 
health: HEALTH_ERR
1 scrub errors
Possible data damage: 1 pg inconsistent
 
  services:
mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
mds: con-fs2:8 4 up:standby 8 up:active
osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs
 
  task status:
 
  data:
pools:   14 pools, 17185 pgs
objects: 1.39G objects, 2.5 PiB
usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
pgs: 305530535/11943726075 objects misplaced (2.558%)
 16614 active+clean
 516   active+remapped+backfill_wait
 23    active+clean+scrubbing+deep
 21    active+remapped+backfilling
 10    active+remapped+backfill_wait+forced_backfill
 1     active+clean+inconsistent
 
  io:
client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
recovery: 0 B/s, 224 objects/s

I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it never got 
executed (checked the logs for repair state). The usual wait time we had on our 
cluster so far was 2-6 hours. 36 hours is unusually long. The pool in question 
is moderately busy and has no misplaced objects. Its only unhealthy PG is the 
inconsistent one.

Are there situations in which ceph cancels/ignores a pg repair?
Is there any way to check if it is actually still scheduled to happen?
Is there a way to force it a bit more urgently?

The error was caused by a read error, the drive is healthy:

2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR] 11.1ba 
shard 294(6) soid 11:5df75341:::rbd_data.1.b688997dc79def.0005d530:head 
: candidate had a read error
2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR] 11.1bas0 
deep-scrub 0 missing, 1 inconsistent objects
2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR] 11.1ba 
deep-scrub 1 errors
2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 : cluster 
[DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent, 2 
active+clean+scrubbing, 26 active+remapped+backfill_wait, 13 
active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273 
active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail; 193 MiB/s rd, 
181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects misplaced (0.136%); 0 
B/s, 513 objects/s recovering
2022-10-11T19:26:24.246194+0200 mon.ceph-01 (mon.0) 633486 : cluster [ERR] 
Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2022-10-11T19:26:24.246215+0200 mon.ceph-01 (mon.0) 633487 : cluster [ERR] 
Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] monitoring drives

2022-10-13 Thread Marc
I was wondering what the best practice is for monitoring drives. I am 
transitioning from SATA to SAS drives, which expose less smartctl information, 
not even power-on hours.

E.g. does ceph record somewhere when an OSD has been created?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster crashing when stopping some host

2022-10-13 Thread Murilo Morais
Unfortunately I can't verify whether ceph reports any inactive PGs. As soon as
the second host disconnects practically everything locks up; nothing shows up
even with "ceph -w". The OSDs only appear as offline once dcs2 returns.

Note: apparently there was a new update recently. When I was in the test
environment this behavior was not happening: dcs1 stayed UP with all services,
reading and writing without crashing even with dcs2 DOWN, and even without
dcs3 added.

### COMMANDS ###
[ceph: root@dcs1 /]# ceph osd tree
ID  CLASS  WEIGHTTYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 65.49570  root default
-3 32.74785  host dcs1
 0hdd   2.72899  osd.0   up   1.0  1.0
 1hdd   2.72899  osd.1   up   1.0  1.0
 2hdd   2.72899  osd.2   up   1.0  1.0
 3hdd   2.72899  osd.3   up   1.0  1.0
 4hdd   2.72899  osd.4   up   1.0  1.0
 5hdd   2.72899  osd.5   up   1.0  1.0
 6hdd   2.72899  osd.6   up   1.0  1.0
 7hdd   2.72899  osd.7   up   1.0  1.0
 8hdd   2.72899  osd.8   up   1.0  1.0
 9hdd   2.72899  osd.9   up   1.0  1.0
10hdd   2.72899  osd.10  up   1.0  1.0
11hdd   2.72899  osd.11  up   1.0  1.0
-5 32.74785  host dcs2
12hdd   2.72899  osd.12  up   1.0  1.0
13hdd   2.72899  osd.13  up   1.0  1.0
14hdd   2.72899  osd.14  up   1.0  1.0
15hdd   2.72899  osd.15  up   1.0  1.0
16hdd   2.72899  osd.16  up   1.0  1.0
17hdd   2.72899  osd.17  up   1.0  1.0
18hdd   2.72899  osd.18  up   1.0  1.0
19hdd   2.72899  osd.19  up   1.0  1.0
20hdd   2.72899  osd.20  up   1.0  1.0
21hdd   2.72899  osd.21  up   1.0  1.0
22hdd   2.72899  osd.22  up   1.0  1.0
23hdd   2.72899  osd.23  up   1.0  1.0


[ceph: root@dcs1 /]# ceph osd pool ls detail
pool 1 '.mgr' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 26 flags
hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.ovirt_hosted_engine.meta' replicated size 2 min_size 1
crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on
last_change 77 lfor 0/0/47 flags hashpspool stripe_width 0
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.ovirt_hosted_engine.data' replicated size 2 min_size 1
crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
last_change 179 lfor 0/0/47 flags hashpspool max_bytes 107374182400
stripe_width 0 application cephfs
pool 6 '.nfs' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 254 lfor
0/0/252 flags hashpspool stripe_width 0 application nfs
pool 7 'cephfs.ovirt_storage_sas.meta' replicated size 2 min_size 1
crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on
last_change 322 lfor 0/0/287 flags hashpspool stripe_width 0
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 8 'cephfs.ovirt_storage_sas.data' replicated size 2 min_size 1
crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
last_change 291 lfor 0/0/289 flags hashpspool stripe_width 0 application
cephfs
pool 9 'cephfs.ovirt_storage_iso.meta' replicated size 2 min_size 1
crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on
last_change 356 lfor 0/0/325 flags hashpspool stripe_width 0
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 10 'cephfs.ovirt_storage_iso.data' replicated size 2 min_size 1
crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
last_change 329 lfor 0/0/327 flags hashpspool stripe_width 0 application
cephfs


[ceph: root@dcs1 /]# ceph osd crush rule dump replicated_rule
{
"rule_id": 0,
"rule_name": "replicated_rule",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}


[ceph: root@dcs1 /]# ceph pg ls-by-pool cephfs.ovirt_hosted_engine.data
PG   OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES      OMAP_BYTES*  OMAP_KEYS*  LOG    STATE         SINCE  VERSION  REPORTED  UP  ACTING  SCRUB_STAMP  DEEP_SCRUB_STAMP  LAST_SCRUB_DURATION  SCRUB_SCHEDULING
3.0  69       0         0          0        285213095  0            0           10057  active+clean

[ceph-users] Re: crush hierarchy backwards and upmaps ...

2022-10-13 Thread Christopher Durham

Dan,
Again, I am using 16.2.10 on Rocky 8.

I decided to take a step back and check a variety of options before I do 
anything. Here are my results.
If I use this rule:
 rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step choose indep 2 type chassis
    step chooseleaf indep 1 type host
    step emit
 }

This is changing the pod definitions all to type chassis. I get NO moves when
running osdmaptool --test-pg-upmap-items and comparing to the current. But
--upmap-cleanup gives:

check_pg_upmaps verify upmap of pool.pgid returning -22
verify_upmap number of buckets 8 exceeds desired 2

for each of my existing upmaps. And it wants to remove them all.

If I use the rule:

 rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step chooseleaf indep 2 type chassis
    step emit
 }

I get almost 1/2 my data moving as per osdmaptool --test-pg-upmap-items.
With --upmap-cleanup I get:

verify_upmap multiple osds N,M come from the same failure domain -382
check_pg_upmap verify upmap of pg poolid.pgid returning -22.

For about 1/8 of my upmaps. And it wants to remove these and add about 100 more.
Although I suspect that this will be rectified after things are moved and such.
Am I correct?

If I use the rule (after changing my rack definition to only contain hosts
that were previously a part of the pods or chassis):

 rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step chooseleaf indep 2 type host
    step emit
 }

I get almost all my data moving as per osdmaptool --test-pg-upmap-items.
With --upmap-cleanup, I get only 10 of these:

verify_upmap multiple osds N,M come from the same failure domain -382
check_pg_upmap verify upmap of pg poolid.pgid returning -22.

But upmap-cleanup wants to remove all my upmaps, which may actually make sense
if we redo the entire map this way.

I am curious, for the first rule, where I am getting the "number of buckets 8
exceeds desired 2" errors, whether I am hitting this bug, which seems to suggest
that I am having a problem because I have a multi-level (>2 levels) rule for an
EC pool:

 https://tracker.ceph.com/issues/51729

This bug appears to be on 14.x, but perhaps it exists on pacific as well. It
would be great if I could use the first rule, except for this bug. Perhaps the
second rule is best at this point.
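
For reference, the comparisons above were produced with a workflow roughly like this (a sketch; the file names are placeholders and the pool id has to be filled in):

POOLID=0                              # placeholder: the EC pool's id
ceph osd getmap -o osdmap.orig
cp osdmap.orig osdmap.new
crushtool -c crushmap-edited.txt -o crushmap-edited.bin
osdmaptool osdmap.new --import-crush crushmap-edited.bin
osdmaptool osdmap.orig --test-map-pgs-dump --pool "$POOLID" > pgs.orig
osdmaptool osdmap.new --test-map-pgs-dump --pool "$POOLID" > pgs.new
diff pgs.orig pgs.new                 # which PG mappings would change
osdmaptool osdmap.new --upmap-cleanup cleanup.txt   # what the mons would do to existing upmaps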

Any other thoughts would be appreciated.
-Chris

 
-Original Message-
From: Dan van der Ster 
To: Christopher Durham 
Cc: Ceph Users 
Sent: Tue, Oct 11, 2022 11:39 am
Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ...

Hi Chris,

Just curious, does this rule make sense and help with the multi level crush
map issue?
(Maybe it also results in zero movement, or at least less than the
alternative you proposed?)

    step choose indep 4 type rack
    step chooseleaf indep 2 type chassis

Cheers, Dan




On Tue, Oct 11, 2022, 19:29 Christopher Durham  wrote:

> Dan,
>
> Thank you.
>
> I did what you said regarding --test-map-pgs-dump and it wants to move 3
> OSDs in every PG. Yuk.
>
> So before I do that, I tried this rule, after changing all my 'pod' bucket
> definitions to 'chassis', and compiling and
> injecting the new crushmap to an osdmap:
>
>
> rule mypoolname {
>    id -5
>    type erasure
>    step take myroot
>    step choose indep 4 type rack
>    step choose indep 2 type chassis
>    step chooseleaf indep 1 type host
>    step emit
>
> }
>
> --test-pg-upmap-entries shows there were NO changes to be done after
> comparing it with the original!!!
>
> However, --upmap-cleanup says:
>
> verify_upmap number of buckets 8 exceeds desired number of 2
> check_pg_upmaps verify_upmap of poolid.pgid returning -22
>
> This is output for every current upmap, but I really do want 8 total
> buckets per PG, as my pool is a 6+2.
>
> The upmap-cleanup output wants me to remove all of my upmaps.
>
> This seems consistent with a bug report that says that there is a problem
> with the balancer on a
> multi-level rule such as the above, albeit on 14.2.x. Any thoughts?
>
> https://tracker.ceph.com/issues/51729
>
> I am leaning towards just eliminating the middle rule and go directly from
> rack to host, even though
> it wants to move a LARGE amount of data according to  a diff before and
> after of --test-pg-upmap-entries.
> In this scenario, I dont see any unexpected errors with --upmap-cleanup
> and I do not want to get stuck
>
> rule mypoolname {
>    id -5
>    type erasure
>    step take myroot
>    step choose indep 4 type rack
>    step chooseleaf indep 2 type host
>    step emit
> }
>
> -Chris
>
>
> -Original Message-
> From: Dan van der Ster 
> To: Christopher Durham 
> Cc: Ceph Users 
> Sent: Mon, Oct 10, 2022 12:22 pm
> Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ...
>
> Hi,
>
> Here's a similar bug: https://tracker.ceph.com/issues/47361
>
> Back then, upmap would generate mappings that invalidate the crush 

[ceph-users] Re: Cluster crashing when stopping some host

2022-10-13 Thread Marc
If you do not mind data loss, why do you care about needing to have 2x?
An alternative would be to change the replication so it is not across hosts but
just across OSDs, which can reside on one host.
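
A sketch of what that would look like; the rule name is made up, the pool names are just the ones from this thread, and note that with failure domain osd a single host can still end up holding both copies:

ceph osd crush rule create-replicated replicated_osd default osd
ceph osd pool set cephfs.ovirt_hosted_engine.data crush_rule replicated_osd
ceph osd pool set cephfs.ovirt_hosted_engine.meta crush_rule replicated_osd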

> Marc, but there is no mechanism to prevent IO pause? At the moment I
> don't worry about data loss.
> I understand that putting it as replica x1 can work, but I need it to be
> x2.
> 
 
> 
>   >
>   > I'm having strange behavior on a new cluster.
> 
>   Not strange, by design
> 
>   > I have 3 machines, two of them have the disks. We can name them
> like
>   > this:
>   > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.
>   >
>   > I started bootstrapping through dcs1, added the other hosts and
> left mgr
>   > on
>   > dcs3 only.
>   >
>   > What is happening is that if I take down dcs2 everything hangs
> and
>   > becomes
>   > irresponsible, including the mount points that were pointed to
> dcs1.
> 
>   You have to have disks in 3 machines. (Or set the replication to
> 1x)
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster crashing when stopping some host

2022-10-13 Thread Eugen Block
Could you share more details? Does ceph report inactive PGs when one  
node is down? Please share:

ceph osd tree
ceph osd pool ls detail
ceph osd crush rule dump 
ceph pg ls-by-pool 
ceph -s

Zitat von Murilo Morais :


Thanks for answering.
Marc, but there is no mechanism to prevent IO pause? At the moment I don't
worry about data loss.
I understand that putting it as replica x1 can work, but I need it to be x2.

On Thu, 13 Oct 2022 at 12:26, Marc wrote:



>
> I'm having strange behavior on a new cluster.

Not strange, by design

> I have 3 machines, two of them have the disks. We can name them like
> this:
> dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.
>
> I started bootstrapping through dcs1, added the other hosts and left mgr
> on
> dcs3 only.
>
> What is happening is that if I take down dcs2 everything hangs and
> becomes
> irresponsible, including the mount points that were pointed to dcs1.

You have to have disks in 3 machines. (Or set the replication to 1x)


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster crashing when stopping some host

2022-10-13 Thread Murilo Morais
Thanks for answering.
Marc, but there is no mechanism to prevent IO pause? At the moment I don't
worry about data loss.
I understand that putting it as replica x1 can work, but I need it to be x2.

On Thu, 13 Oct 2022 at 12:26, Marc wrote:

>
> >
> > I'm having strange behavior on a new cluster.
>
> Not strange, by design
>
> > I have 3 machines, two of them have the disks. We can name them like
> > this:
> > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.
> >
> > I started bootstrapping through dcs1, added the other hosts and left mgr
> > on
> > dcs3 only.
> >
> > What is happening is that if I take down dcs2 everything hangs and
> > becomes
> > irresponsible, including the mount points that were pointed to dcs1.
>
> You have to have disks in 3 machines. (Or set the replication to 1x)
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: why rgw generates large quantities orphan objects?

2022-10-13 Thread Haas, Josh
Hi Liang,

My guess would be this bug:

https://tracker.ceph.com/issues/44660
https://www.spinics.net/lists/ceph-users/msg30151.html


It's actually existed for at least 6 years:
https://tracker.ceph.com/issues/16767


Which occurs any time you reupload the same *part* in a single Multipart Upload 
multiple times. For example, if my Multipart upload consists of 3 parts, if I 
upload part #2 twice, then the first upload of part #2 becomes orphaned.


If this was indeed the cause, you should have multiple "_multipart_" rados 
objects for the same part in "rados ls". For example, here's all the rados 
objects associated with a bugged bucket before I deleted it:

cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__multipart_file.txt.4vkWzU4C5XLd2R6unFgbQ6aZM26vPuq8.1
cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__multipart_file.txt.2~4zogSe4Ep0xvSC8j6aX71x_96cOgvQN.1
cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__shadow_file.txt.4vkWzU4C5XLd2R6unFgbQ6aZM26vPuq8.1_1
cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__shadow_file.txt.2~4zogSe4Ep0xvSC8j6aX71x_96cOgvQN.1_1


If we look at just these two:

cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__multipart_file.txt.4vkWzU4C5XLd2R6unFgbQ6aZM26vPuq8.1
cc79b188-89d1-4f47-acb1-ab90513e9bc9.23325574.228__multipart_file.txt.2~4zogSe4Ep0xvSC8j6aX71x_96cOgvQN.1


They are in the format:

$BUCKETID__multipart_$S3KEY.$PARTUID.$PARTNUM

Because everything matches ($BUCKETID, $S3KEY, $PARTNUM) except for $PARTUID, 
this S3 object has been affected by the bug. If you find instances of rados 
keys that match on everything except $PARTUID, then this bug is probably the 
cause.
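
A quick way to scan for that pattern is something like the following; the pool name is the one from this thread, and the awk grouping assumes the part UID itself never contains a dot:

rados -p os-test.rgw.buckets.data ls > rados-objects.txt
awk -F. '/__multipart_/ {
    key = ""
    for (i = 1; i <= NF; i++)
        if (i != NF - 1)                     # drop the $PARTUID field
            key = key (key == "" ? "" : ".") $i
    count[key]++
} END {
    for (k in count)
        if (count[k] > 1)                    # same bucket/key/partnum, different part UID
            print count[k], k
}' rados-objects.txt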


Josh


From: 郑亮 
Sent: Wednesday, October 12, 2022 1:34:31 AM
To: ceph-users@ceph.io
Subject: [ceph-users] why rgw generates large quantities orphan objects?

Hi all,
Description of problem: [RGW] bucket/object deletion is causing large
quantities of orphan rados objects.

The cluster was running a cosbench workload; we then removed part of the data
by deleting objects from the cosbench client, and afterwards deleted all the
buckets with `s3cmd rb --recursive --force`. That removed all the buckets, but
it did not help with space reclamation.

```
[root@node01 /]# rgw-orphan-list

Available pools:

device_health_metrics

.rgw.root

os-test.rgw.buckets.non-ec

os-test.rgw.log

os-test.rgw.control

os-test.rgw.buckets.index

os-test.rgw.meta

os-test.rgw.buckets.data

deeproute-replica-hdd-pool

deeproute-replica-ssd-pool

cephfs-metadata

cephfs-replicated-pool

.nfs

Which pool do you want to search for orphans (for multiple, use
space-separated list)? os-test.rgw.buckets.data

Pool is "os-test.rgw.buckets.data".

Note: output files produced will be tagged with the current timestamp --
20221008062356.
running 'rados ls' at Sat Oct  8 06:24:05 UTC 2022

running 'rados ls' on pool os-test.rgw.buckets.data.



running 'radosgw-admin bucket radoslist' at Sat Oct  8 06:43:21 UTC 2022
computing delta at Sat Oct  8 06:47:17 UTC 2022

39662551 potential orphans found out of a possible 39844453 (99%).
The results can be found in './orphan-list-20221008062356.out'.

Intermediate files are './rados-20221008062356.intermediate' and
'./radosgw-admin-20221008062356.intermediate'.
***

*** WARNING: This is EXPERIMENTAL code and the results should be used
***  only with CAUTION!
***
Done at Sat Oct  8 06:48:07 UTC 2022.

[root@node01 /]# radosgw-admin gc list
[]

[root@node01 /]# cat orphan-list-20221008062356.out | wc -l
39662551

[root@node01 /]# rados df
POOL_NAME                   USED      OBJECTS   CLONES  COPIES     MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS    RD       WR_OPS     WR       USED COMPR  UNDER COMPR
.nfs                        4.3 MiB   4         0       12         0                   0        0         77398     76 MiB   146        79 KiB   0 B         0 B
.rgw.root                   180 KiB   16        0       48         0                   0        0         28749     28 MiB   0          0 B      0 B         0 B
cephfs-metadata             932 MiB   14772     0       44316      0                   0        0         1569690   3.8 GiB  1258651    3.4 GiB  0 B         0 B
cephfs-replicated-pool      738 GiB   300962    0       902886     0                   0        0         794612    470 GiB  770689     245 GiB  0 B         0 B
deeproute-replica-hdd-pool  1016 GiB  104276    0       312828     0                   0        0         18176216  298 GiB  441783780  6.7 TiB  0 B         0 B
deeproute-replica-ssd-pool  30 GiB    3691      0       11073      0                   0        0         2466079   2.1 GiB  8416232    221 GiB  0 B         0 B
device_health_metrics       50 MiB    108       0       324        0                   0        0         1836      1.8 MiB  1944       18 MiB   0 B         0 B
os-test.rgw.buckets.data    5.6 TiB   39844453  0       239066718  0                   0

[ceph-users] Re: Cluster crashing when stopping some host

2022-10-13 Thread Marc


> 
> I'm having strange behavior on a new cluster.

Not strange, by design

> I have 3 machines, two of them have the disks. We can name them like
> this:
> dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.
> 
> I started bootstrapping through dcs1, added the other hosts and left mgr
> on
> dcs3 only.
> 
> What is happening is that if I take down dcs2 everything hangs and
> becomes
> irresponsible, including the mount points that were pointed to dcs1.

You have to have disks in 3 machines. (Or set the replication to 1x)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster crashing when stopping some host

2022-10-13 Thread Murilo Morais
I'm using Host as Failure Domain.

On Thu, 13 Oct 2022 at 11:41, Eugen Block wrote:

> What is your failure domain? If it's osd you'd have both PGs on the
> same host and then no replica is available.
>
> Zitat von Murilo Morais :
>
> > Eugen, thanks for responding.
> >
> > In the current scenario there is no way to insert disks into dcs3.
> >
> > My pools are size 2, at the moment we can't add more machines with disks,
> > so it was sized in this proportion.
> >
> > Even with min_size=1, if dcs2 stops the IO also stops.
> >
> > On Thu, 13 Oct 2022 at 11:19, Eugen Block wrote:
> >
> >> Hi,
> >>
> >> if your pools have a size 2 (don't do that except in test
> >> environments) and host is your failure domain then all IO is paused if
> >> one osd host goes down, depending on your min_size. Can you move some
> >> disks to dcs3 so you can have size 3 pools with min_size 2?
> >>
> >> Zitat von Murilo Morais :
> >>
> >> > Good morning everyone.
> >> >
> >> > I'm having strange behavior on a new cluster.
> >> >
> >> > I have 3 machines, two of them have the disks. We can name them like
> >> this:
> >> > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.
> >> >
> >> > I started bootstrapping through dcs1, added the other hosts and left
> mgr
> >> on
> >> > dcs3 only.
> >> >
> >> > What is happening is that if I take down dcs2 everything hangs and
> >> becomes
> >> > irresponsible, including the mount points that were pointed to dcs1.
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster crashing when stopping some host

2022-10-13 Thread Eugen Block
What is your failure domain? If it's osd you'd have both PGs on the  
same host and then no replica is available.


Zitat von Murilo Morais :


Eugen, thanks for responding.

In the current scenario there is no way to insert disks into dcs3.

My pools are size 2, at the moment we can't add more machines with disks,
so it was sized in this proportion.

Even with min_size=1, if dcs2 stops the IO also stops.

On Thu, 13 Oct 2022 at 11:19, Eugen Block wrote:


Hi,

if your pools have a size 2 (don't do that except in test
environments) and host is your failure domain then all IO is paused if
one osd host goes down, depending on your min_size. Can you move some
disks to dcs3 so you can have size 3 pools with min_size 2?

Zitat von Murilo Morais :

> Good morning everyone.
>
> I'm having strange behavior on a new cluster.
>
> I have 3 machines, two of them have the disks. We can name them like
this:
> dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.
>
> I started bootstrapping through dcs1, added the other hosts and left mgr
on
> dcs3 only.
>
> What is happening is that if I take down dcs2 everything hangs and
becomes
> irresponsible, including the mount points that were pointed to dcs1.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster crashing when stopping some host

2022-10-13 Thread Murilo Morais
Eugen, thanks for responding.

In the current scenario there is no way to insert disks into dcs3.

My pools are size 2, at the moment we can't add more machines with disks,
so it was sized in this proportion.

Even with min_size=1, if dcs2 stops the IO also stops.

On Thu, 13 Oct 2022 at 11:19, Eugen Block wrote:

> Hi,
>
> if your pools have a size 2 (don't do that except in test
> environments) and host is your failure domain then all IO is paused if
> one osd host goes down, depending on your min_size. Can you move some
> disks to dcs3 so you can have size 3 pools with min_size 2?
>
> Zitat von Murilo Morais :
>
> > Good morning everyone.
> >
> > I'm having strange behavior on a new cluster.
> >
> > I have 3 machines, two of them have the disks. We can name them like
> this:
> > dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.
> >
> > I started bootstrapping through dcs1, added the other hosts and left mgr
> on
> > dcs3 only.
> >
> > What is happening is that if I take down dcs2 everything hangs and
> becomes
> > irresponsible, including the mount points that were pointed to dcs1.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster crashing when stopping some host

2022-10-13 Thread Eugen Block

Hi,

if your pools have a size 2 (don't do that except in test  
environments) and host is your failure domain then all IO is paused if  
one osd host goes down, depending on your min_size. Can you move some  
disks to dcs3 so you can have size 3 pools with min_size 2?


Zitat von Murilo Morais :


Good morning everyone.

I'm having strange behavior on a new cluster.

I have 3 machines, two of them have the disks. We can name them like this:
dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.

I started bootstrapping through dcs1, added the other hosts and left mgr on
dcs3 only.

What is happening is that if I take down dcs2 everything hangs and becomes
irresponsible, including the mount points that were pointed to dcs1.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cluster crashing when stopping some host

2022-10-13 Thread Murilo Morais
Good morning everyone.

I'm having strange behavior on a new cluster.

I have 3 machines, two of them have the disks. We can name them like this:
dcs1 to dcs3. The dcs1 and dcs2 machines contain the disks.

I started bootstrapping through dcs1, added the other hosts and left mgr on
dcs3 only.

What is happening is that if I take down dcs2 everything hangs and becomes
unresponsive, including the mount points that were pointed at dcs1.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Performance and PG/PGP value

2022-10-13 Thread Frank Schilder
Hi Yoann,

I'm not using pacific yet, but this here looks very strange to me:

  cephfs_data  data 243T  19.7T
usage:   245 TiB used, 89 TiB / 334 TiB avail

I'm not sure if there is a mix of raw vs. stored here. Assuming the cephfs_data 
allocation is right, I'm wondering what your osd [near] full ratios are. The PG 
counts look very good. The slow ops can have 2 reasons: a bad disk or full 
OSDs. Looking at 19.7/(243+16.7)=6.4% free I wonder why there are no osd [near] 
full warnings all over the place. Even if its still 20% free performance can 
degrade dramatically according to benchmarks we made on octopus.

I think you need to provide a lot more details here. Of interest are:

ceph df detail
ceph osd df tree

and possibly a few others. I don't think the multi-MDS mode is bugging you, but 
you should check. We have seen degraded performance on mimic caused by 
excessive export_dir operations between the MDSes. However, I can't see such 
operations reported as stuck. You might want to check on your MDSes with ceph 
daemon mds.xzy ops | grep -e dirfrag -e export and/or similar commands. You 
should report a bit what kind of operations tend to be stuck longest.
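
For example, something along these lines summarizes what the in-flight ops are waiting on; the daemon name is a placeholder and the JSON field names are from memory of octopus-era ops dumps, so adjust as needed:

# run on the host of each active MDS
ceph daemon mds.ceph-XX ops | jq -r '.ops[].type_data.flag_point' | sort | uniq -c | sort -rn
ceph daemon mds.ceph-XX ops | jq -r '.ops[] | [(.duration|tostring), .description] | @tsv' | sort -rn | head -20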

I also remember that there used to be problems having a kclient ceph fs mount 
on OSD nodes. Not sure if this could play a role here.

You have basically zero IO going on:

client:   6.2 MiB/s rd, 12 MiB/s wr, 10 op/s rd, 366 op/s wr

yet, PGs are laggy. The problem could sit on a non-ceph component.

With the hardware you have, there is something very weird going on. You might 
also want to check that you have the correct MTU on all devices on every single 
host and that the speed negotiated is the same. Problems like these I have seen 
with a single host having a wrong MTU and with LACP bonds with a broken 
transceiver.
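
A quick survey along these lines can help; the host and interface names are placeholders:

for h in ceph-01 ceph-02 ceph-03; do
    echo "== $h"
    ssh "$h" "ip -o link show up | awk '{print \$2, \$4, \$5}'"   # interface, 'mtu', value
    ssh "$h" "ethtool bond0 2>/dev/null | grep -i speed"          # negotiated link speed
done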

Something else to check is flaky controller/PCIe connections. We had a case 
where a controller was behaving odd and we had a huge amount of device resets 
in the logs. On the host with the broken controller, IO wait was way above 
average (shown by top). Something similar might happen with NVMes. A painful 
procedure to locate a bad host could be to out OSDs manually on a single host 
and wait for PGs to peer and become active. If you have a bad host, in this 
moment IO should recover to good levels. Do this host by host. I know, it will 
be a day or two but, well, it might locate something.
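
Per host, that would look roughly like this; the host name is a placeholder and the norebalance flag is my own addition to limit data movement during the test:

H=ceph-XX
ceph osd set norebalance
for osd in $(ceph osd ls-tree "$H"); do ceph osd out "$osd"; done
# wait for peering, watch 'ceph -s' and client IO, then revert:
for osd in $(ceph osd ls-tree "$H"); do ceph osd in "$osd"; done
ceph osd unset norebalance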

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 13 October 2022 13:56:45
To: Yoann Moulin; Patrick Donnelly
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: MDS Performance and PG/PGP value

On 10/13/22 13:47, Yoann Moulin wrote:
>> Also, you mentioned you're using 7 active MDS. How's that working out
>> for you? Do you use pinning?
>
> I don't really know how to do that, I have 55 worker nodes in my K8s
> cluster, each one can run pods that have access to a cephfs pvc. we have
> 28 cephfs persistent volumes. Pods are ML/DL/AI workload, each can be
> start and stop whenever our researchers need it. The workloads are
> unpredictable.

See [1] and [2].

Gr. Stefan

[1]:
https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2]:
https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Performance and PG/PGP value

2022-10-13 Thread Stefan Kooman

On 10/13/22 13:47, Yoann Moulin wrote:

Also, you mentioned you're using 7 active MDS. How's that working out
for you? Do you use pinning?


I don't really know how to do that, I have 55 worker nodes in my K8s 
cluster, each one can run pods that have access to a cephfs pvc. we have 
28 cephfs persistent volumes. Pods are ML/DL/AI workload, each can be 
start and stop whenever our researchers need it. The workloads are 
unpredictable.


See [1] and [2].

Gr. Stefan

[1]: 
https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2]: 
https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies
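
A rough sketch of what those policies look like in practice; the mount point, paths and rank numbers are hypothetical:

setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/volumes/pvc-0001        # static pin of a tree to mds rank 1
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes     # spread immediate children across ranks
setfattr -n ceph.dir.pin.random -v 0.01 /mnt/cephfs/scratch       # ephemerally pin ~1% of new subdirs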


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Performance and PG/PGP value

2022-10-13 Thread Yoann Moulin

Hello Patrick,

Unfortunately, increasing the number of PGs did not help much in the end; my
cluster is still in trouble...

Here is the current state of my cluster: https://pastebin.com/Avw5ybgd


Is 256 a good value in our case? We have 80 TB of data with more than 300M files.


You want at least as many PGs that each of the OSDs host a portion of the OMAP 
data. You want to spread out OMAP to as many _fast_ OSDs as possible.

I have tried to find an answer to your question: are more metadata PGs better? 
I haven't found a definitive answer. This would ideally be tested in a non-prod 
/ pre-prod environment and tuned
to individual requirements (type of workload). For now, I would not blindly 
trust the PG autoscaler. I have seen it advise settings that would definitely 
not be OK. You can skew things in the
autoscaler with the "bias" parameter, to compensate for this. But as far as I 
know the current heuristics to determine a good value do not take into account the 
importance of OMAP (RocksDB)
spread across OSDs. See a blog post about autoscaler tuning [1].
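
For reference, skewing it by hand looks roughly like this; the pool name is assumed and the numbers are only illustrative:

ceph osd pool set cephfs_metadata pg_autoscale_bias 4    # bias the autoscaler towards more PGs
ceph osd pool set cephfs_metadata pg_num 256             # or simply set pg_num yourself
ceph osd pool autoscale-status                           # review what the autoscaler would do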

It would be great if tuning metadata PGs for CephFS / RGW could be performed during the 
"large scale tests" the devs are planning to perform in the future. With use 
cases that take into
consideration "a lot of small files / objects" versus "loads of large files / 
objects" to get a feeling how tuning this impacts performance for different work loads.

Gr. Stefan

[1]: https://ceph.io/en/news/blog/2022/autoscaler_tuning/


Thanks for the information, I agree that autoscaler seem to not be able to 
handle my use case.
(thanks to icepic...@gmail.com too)

By the way, since I set pg_num=256 I have had far fewer SLOW requests than
before; I still get some, but the impact on my users has been reduced a lot.


# zgrep -c -E 'WRN.*(SLOW_OPS|SLOW_REQUEST|MDS_SLOW_METADATA_IO)' 
floki.log.4.gz floki.log.3.gz floki.log.2.gz floki.log.1.gz floki.log
floki.log.4.gz:6883
floki.log.3.gz:11794
floki.log.2.gz:3391
floki.log.1.gz:1180
floki.log:122


If I have the opportunity, I will try to run some benchmarks with multiple
pg_num values on the cephfs_metadata pool.


256 sounds like a good number to me. Maybe even 128. If you do some
experiments, please do share the results.


Yes, of course.


Also, you mentioned you're using 7 active MDS. How's that working out
for you? Do you use pinning?


I don't really know how to do that. I have 55 worker nodes in my K8s cluster, each one can run pods that have access to a CephFS PVC. We have 28 CephFS persistent volumes. Pods are ML/DL/AI
workloads; each can be started and stopped whenever our researchers need it. The workloads are unpredictable.


Thanks for your help.

Best regards,

--
Yoann Moulin
EPFL IC-IT

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-13 Thread Boris
Hi Christian,
resharding is not an issue, because we only sync the metadata. Like aws s3. 

But this looks very broken to me. Does anyone have an idea how to fix this?

> Am 13.10.2022 um 11:58 schrieb Christian Rohmann 
> :
> 
> Hey Boris,
> 
>> On 07/10/2022 11:30, Boris Behrens wrote:
>> I just wanted to reshard a bucket but mistyped the amount of shards. In a
>> reflex I hit ctrl-c and waited. It looked like the resharding did not
>> finish so I canceled it, and now the bucket is in this state.
>> How can I fix it. It does not show up in the stale-instace list. It's also
>> a multisite environment (we only sync metadata).
> I believe resharding is not supported with rgw multisite 
> (https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#multisite)
> but is being worked on / implemented for the Quincy release, see 
> https://tracker.ceph.com/projects/rgw/issues?query_id=247
> 
> But you are not syncing the data in your deployment? Maybe that's a different 
> case then?
> 
> 
> 
> Regards
> 
> Christian
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-13 Thread Christian Rohmann

Hey Boris,

On 07/10/2022 11:30, Boris Behrens wrote:

I just wanted to reshard a bucket but mistyped the amount of shards. In a
reflex I hit ctrl-c and waited. It looked like the resharding did not
finish so I canceled it, and now the bucket is in this state.
How can I fix it. It does not show up in the stale-instace list. It's also
a multisite environment (we only sync metadata).
I believe resharding is not supported with rgw multisite 
(https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#multisite)
but is being worked on / implemented for the Quincy release, see 
https://tracker.ceph.com/projects/rgw/issues?query_id=247


But you are not syncing the data in your deployment? Maybe that's a 
different case then?




Regards

Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Understanding the total space in CephFS

2022-10-13 Thread Nicola Mori

Hi Stefan,

the cluster is built of several old machines, with different numbers of 
disks (from 8 to 16) and disk sizes (from 500 GB to 4 TB). After the PG 
increase it is still recovering: pgp_num is at 213 and has to grow to 256.
The balancer status gives:


{
    "active": true,
    "last_optimize_duration": "0:00:00.000347",
    "last_optimize_started": "Thu Oct 13 08:59:22 2022",
    "mode": "upmap",
    "optimize_result": "Too many objects (0.051218 > 0.05) are misplaced; try again later",
    "plans": []
}

and I guess that this means that optimization is ongoing, right?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS constant high write I/O to the metadata pool

2022-10-13 Thread Olli Rajala
Hi,

I'm seeing constant 25-50 MB/s writes to the metadata pool even when all
clients and the cluster are idle and in a clean state. Surely this can't be
normal?

There's no apparent issues with the performance of the cluster but this
write rate seems excessive and I don't know where to look for the culprit.

The setup is Ceph 16.2.9 running in hyperconverged 3 node core cluster and
6 hdd osd nodes.

Here's typical status when pretty much all clients are idling. Most of that
write bandwidth and maybe fifth of the write iops is hitting the
metadata pool.

---
root@pve-core-1:~# ceph -s
  cluster:
id: 2088b4b1-8de1-44d4-956e-aa3d3afff77f
health: HEALTH_OK

  services:
mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3 (age 2w)
mgr: pve-core-1(active, since 4w), standbys: pve-core-2, pve-core-3
mds: 1/1 daemons up, 2 standby
osd: 48 osds: 48 up (since 5h), 48 in (since 4M)

  data:
volumes: 1/1 healthy
pools:   10 pools, 625 pgs
objects: 70.06M objects, 46 TiB
usage:   95 TiB used, 182 TiB / 278 TiB avail
pgs: 625 active+clean

  io:
client:   45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr
---

Here's some daemonperf dump:

---
root@pve-core-1:~# ceph daemonperf mds.`hostname -s`
mds-
--mds_cache--- --mds_log-- -mds_mem- ---mds_server--- mds_
-objecter-- purg
req  rlat fwd  inos caps exi  imi  hifc crev cgra ctru cfsa cfa  hcc  hccd
hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr  hcs  hsr  cre
 cat |sess|actv rd   wr   rdwr|purg|
 4000  767k  78k   0001610055
 37 |1.1k   00 | 17  3.7k 1340 |767k 767k| 40500
 0 |110 |  42   210 |  2
 5720  767k  78k   0003   16300   11   11
 0   17 |1.1k   00 | 45  3.7k 1370 |767k 767k| 57800
 0 |110 |  02   280 |  4
 5740  767k  78k   0004   34400   34   33
 2   26 |1.0k   00 |134  3.9k 1390 |767k 767k| 57   1300
 0 |110 |  02  1120 | 19
 6730  767k  78k   0006   32600   22   22
 0   32 |1.1k   00 | 78  3.9k 1410 |767k 768k| 67400
 0 |110 |  02   560 |  2
---
Any ideas where to look at?

Tnx!
o.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Understanding the total space in CephFS

2022-10-13 Thread Stefan Kooman

On 10/13/22 09:32, Nicola Mori wrote:

Dear Ceph users,

I'd need some help in understanding the total space in a CephFS. My 
cluster is currently built of 8 machines, the one with the smallest 
capacity has 8 TB of total disk space, and the total available raw space 
is 153 TB. I set up a 3x replicated metadata pool and a 6+2 erasure 
coded data pool with host failure domain for my CephFS. In this 
configuration every host holds a data chunk, so I would expect a total 
of about 48 TB of total storage space. I computed this value by noting 
that (roughly speaking and neglecting the metadata) 48 TB of data will 
need 48 TB of data chunks and 16 TB of coding chunks, for a total of 64 
TB that evenly divided into my 8 machines gives an occupancy of 8 TB per 
host, which exactly saturates the smallest one.


Assuming that the above is correct then I would expect that a df -h on a 
machine mounting the CephFS would report 48 TB of total space. Instead 
it started with something around 75 TB at the beginning, and it's slowly 
decreasing while I'm transferring data to the CephFS, being now at 62 TB.


I cannot understand this behavior, nor if my assumptions about the total 
space are correct, so I'd need some help with this.


The amount of space available depends on how well the cluster is 
balanced. And the fullest OSD is used to calculate amount of space 
available. IIRC you have recently increased PGs. Do you use the Ceph 
balancer to achieve optimal data placement (ceph balancer status)?


ceph osd df will show in what shape your cluster is with respect to 
balancing.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to remove remaining bucket index shard objects

2022-10-13 Thread 伊藤 祐司
Hi,

Unfortunately, the "large omap objects" message recurred last weekend. So I ran 
the script you showed to check the situation. `used_.*` is small, but `omap_.*` 
is large, which is strange. Do you have any idea what it is?

id    used_mbytes  used_objects  omap_used_mbytes    omap_used_keys
--    -----------  ------------  ----------------    --------------
6.0   0            0             0                   0
6.1   0            0             0                   0
6.2   0            0             86.14682674407959   298586
6.3   0            0             93.08089542388916   323902
6.4   0            1             0                   0
6.5   0            1             2.124929428100586   7744
6.6   0            0             0                   0
6.7   0            0             2.2477407455444336  8192
6.8   0            0             0                   0
6.9   0            0             439.5090618133545   1524746
6.a   0            0             0                   0
6.b   0            0             3.4069366455078125  12416
6.c   0            0             0                   0
6.d   0            0             0                   0
6.e   0            0             0                   0
6.f   0            1             0                   0
6.10  0            1             2.177792549133301   7936
6.11  0            0             3.9340572357177734  14336
6.12  0            0             7.727175712585449   28160
6.13  0            0             114.01904964447021  394996
6.14  0            0             0                   0
6.15  0            0             88.56490707397461   307353
6.16  0            0             0                   0
6.17  0            0             7.6217451095581055  27776
6.18  0            0             3.933901786804199   14336
6.19  0            1             0                   0
6.1a  0            1             0                   0
6.1b  0            0             0                   0
6.1c  0            0             88.36568355560303   306677
6.1d  0            0             0                   0
6.1e  0            1             0                   0
6.1f  0            0             92.21501541137695   320707
6.20  0            1             2.1074790954589844  7680
6.21  0            0             0                   0
6.22  0            0             0                   0
6.23  0            0             8.605427742004395   31360
6.24  0            0             7.938144683837891   28928
6.25  0            0             0                   0
6.26  0            0             0                   0
6.27  0            1             2.10748291015625    7680
6.28  0            0             0                   0
6.29  0            0             2.1601409912109375  7872
6.2a  0            1             0                   0
6.2b  0            0             0                   0
6.2c  0            0             5.479369163513184   19968
6.2d  0            0             0                   0
6.2e  0            0             0                   0
6.2f  0            0             0                   0
6.30  0            0             117.55222415924072  407521
6.31  0            1             0                   0
6.32  0            1             0                   0
6.33  0            0             5.812973976135254   21184
6.34  0            0             0                   0
6.35  0            0             0                   0
6.36  0            0             5.865510940551758   21376
6.37  0            0             86.26362419128418   298993
6.38  0            0             93.97305393218994   327089
6.39  0            0             15.493829727172852  71787
6.3a  0            0             0                   0
6.3b  0            0             4.056745529174805   14784
6.3c  0            0             4.039289474487305   14720
6.3d  0            0             0                   0
6.3e  0            0             0                   0
6.3f  0            0             0                   0
6.40  0            0             2.1073970794677734  7680
6.41  0            1             4.004250526428223   14592
6.42  0            0             3.9866724014282227  14528
6.43  0            0             345.3690414428711   1197068
6.44  0            0             0                   0
6.45  0            1             0                   0
6.46  0            0             3.968973159790039   14464
6.47  0            0             0                   0
6.48  0            0             0                   0
6.49  0            0             263.9479990005493   914805
6.4a  0            0             94.751708984375     336275
6.4b  0            0             0                   0
6.4c  0            0             0                   0
6.4d  0            0             270.53627490997314  937581
6.4e  0            1             0                   0
6.4f  0            0             0                   0
6.50  0            0             1.8790569305419922  6848
6.51  0            

[ceph-users] Understanding the total space in CephFS

2022-10-13 Thread Nicola Mori

Dear Ceph users,

I'd need some help in understanding the total space in a CephFS. My 
cluster is currently built of 8 machines, the one with the smallest 
capacity has 8 TB of total disk space, and the total available raw space 
is 153 TB. I set up a 3x replicated metadata pool and a 6+2 erasure 
coded data pool with host failure domain for my CephFS. In this 
configuration every host holds a data chunk, so I would expect a total 
of about 48 TB of total storage space. I computed this value by noting 
that (roughly speaking and neglecting the metadata) 48 TB of data will 
need 48 TB of data chunks and 16 TB of coding chunks, for a total of 64 
TB that evenly divided into my 8 machines gives an occupancy of 8 TB per 
host, which exactly saturates the smallest one.
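
Just to spell the arithmetic out (a sanity check of my own reasoning, taking the smallest host as the limit):

awk 'BEGIN {
    smallest_host_tb = 8; hosts = 8; k = 6; m = 2
    raw  = smallest_host_tb * hosts     # 64 TB of chunks before the smallest host is full
    data = raw * k / (k + m)            # 48 TB of usable data with a 6+2 profile
    printf "raw chunk capacity %.0f TB -> usable data %.0f TB\n", raw, data
}'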


Assuming that the above is correct then I would expect that a df -h on a 
machine mounting the CephFS would report 48 TB of total space. Instead 
it started with something around 75 TB at the beginning, and it's slowly 
decreasing while I'm transferring data to the CephFS, being now at 62 TB.


I cannot understand this behavior, nor if my assumptions about the total 
space are correct, so I'd need some help with this.

Thanks,

Nicola
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Iinfinite backfill loop + number of pgp groups stuck at wrong value

2022-10-13 Thread Nicola Mori
Thank you Frank for the insight. I'd need to study a bit more the 
details of all of this, but for sure now I understand it a bit better.


Nicola
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rbd: Snapshot Only Permissions

2022-10-13 Thread Dan Poltawski
Hi All,

Is there any way to configure capabilities for a user to allow the client to 
*only* create/delete snapshots? I can't find anything which suggests this is 
possible on https://docs.ceph.com/en/latest/rados/operations/user-management/.

Context: I'm writing a script to automatically create and delete snapshots. 
Ideally i'd like to restrict the permissions for this user so it can't do 
anything else with rbd images and give it the least privileges possible.

thanks,
Dan





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io