[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-06-22 Thread Eugen Block

Hi,

have you tried restarting the primary OSD (currently 343)? It looks
like this PG is part of an EC pool; are there enough hosts available,
assuming your failure domain is host? I assume that Ceph isn't able to
recreate the shard on a different OSD. Could you share your osd tree,
the crush rule and the erasure-code profile, so we can get a better picture?
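
For reference, these are the checks I have in mind (just a sketch; the
names in angle brackets are placeholders for your pool, rule and profile):

ceph osd tree                                  # how many hosts are up in the failure domain
ceph pg 15.28f0 query                          # full peering state of the stuck PG
ceph osd pool get <ec-pool> crush_rule         # which rule the pool uses
ceph osd crush rule dump <rule-name>           # the rule itself
ceph osd erasure-code-profile get <profile>    # k, m and crush-failure-domain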


Thanks,
Eugen

Zitat von siddhit.ren...@nxtgen.com:


Hello All,

Ceph version: 14.2.5-382-g8881d33957  
(8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)


Issue:
1 PG stuck in "active+undersized+degraded" for a long time
Degraded data redundancy: 44800/8717052637 objects degraded (0.001%), 1 pg degraded, 1 pg undersized


#ceph pg dump_stuck
PG_STAT  STATE                       UP                                                    UP_PRIMARY  ACTING                                                ACTING_PRIMARY
15.28f0  active+undersized+degraded  [2147483647,343,355,415,426,640,302,392,78,202,607]  343         [2147483647,343,355,415,426,640,302,392,78,202,607]  343


PG Query:
#ceph pg 15.28f0 query

{
"state": "active+undersized+degraded",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 303362,
"up": [
2147483647,
343,
355,
415,
426,
640,
302,
392,
78,
202,
607
],
"acting": [
2147483647,
343,
355,
415,
426,
640,
302,
392,
78,
202,
607
],
"acting_recovery_backfill": [
"78(8)",
"202(9)",
"302(6)",
"343(1)",
"355(2)",
"392(7)",
"415(3)",
"426(4)",
"607(10)",
"640(5)"
],
"info": {
"pgid": "15.28f0s1",
"last_update": "303161'598853",
"last_complete": "303161'598853",
"log_tail": "261289'595825",
"last_user_version": 598853,
"last_backfill": "MAX",
"last_backfill_bitwise": 1,
"purged_snaps": [],
"history": {
"epoch_created": 19841,
"epoch_pool_created": 16141,
"last_epoch_started": 303017,
"last_interval_started": 303016,
"last_epoch_clean": 250583,
"last_interval_clean": 250582,
"last_epoch_split": 19841,
"last_epoch_marked_full": 0,
"same_up_since": 303016,
"same_interval_since": 303016,
"same_primary_since": 256311,
"last_scrub": "255277'537760",
"last_scrub_stamp": "2021-04-11 03:18:39.164439",
"last_deep_scrub": "255277'537756",
"last_deep_scrub_stamp": "2021-04-10 01:42:16.182528",
"last_clean_scrub_stamp": "2021-04-11 03:18:39.164439"
},
"stats": {
"version": "303161'598853",
"reported_seq": "3594551",
"reported_epoch": "303362",
"state": "active+undersized+degraded",
"last_fresh": "2023-06-20 19:03:59.135295",
"last_change": "2023-06-20 15:11:12.569114",
"last_active": "2023-06-20 19:03:59.135295",
"last_peered": "2023-06-20 19:03:59.135295",
"last_clean": "2021-04-11 15:21:44.271834",
"last_became_active": "2023-06-20 15:11:12.569114",
"last_became_peered": "2023-06-20 15:11:12.569114",
"last_unstale": "2023-06-20 19:03:59.135295",
"last_undegraded": "2023-06-20 15:11:10.430426",
"last_fullsized": "2023-06-20 15:11:10.430154",
"mapping_epoch": 303016,
"log_start": "261289'595825",
"ondisk_log_start": "261289'595825",
"created": 19841,
"last_epoch_clean": 250583,
"parent": "0.0",
"parent_split_bits": 14,
"last_scrub": "255277'537760",
"last_scrub_stamp": "2021-04-11 03:18:39.164439",
"last_deep_scrub": "255277'537756",
"last_deep_scrub_stamp": "2021-04-10 01:42:16.182528",
"last_clean_scrub_stamp": "2021-04-11 03:18:39.164439",
"log_size": 3028,
"ondisk_log_size": 3028,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 54989065178,
"num_objects": 44800,
"num_object_clones": 0,
"num_object_copies": 492800,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 44800,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,

[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-06-22 Thread Damian

Hi Siddhit

You need more OSDs. Please read:

https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#erasure-coded-pgs-are-not-active-clean
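
As a quick sanity check against your own cluster (a rough sketch; with
k=8 and m=3 every PG of such a pool needs k+m = 11 distinct hosts):

ceph osd erasure-code-profile get <profile>   # note k and m; the PG needs k+m failure domains
ceph osd tree | grep -c host                  # rough count of host buckets in the CRUSH tree
ceph osd tree | grep down                     # hosts/OSDs currently down shrink the usable set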

Greetings

Damian

On 2023-06-20 15:53, siddhit.ren...@nxtgen.com wrote:

Hello All,

Ceph version: 14.2.5-382-g8881d33957 
(8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)


Issue:
1 PG stuck in "active+undersized+degraded" for a long time
Degraded data redundancy: 44800/8717052637 objects degraded (0.001%), 1 pg degraded, 1 pg undersized



[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-07-20 Thread siddhit . renake
Hello Eugen,

Requested details are as below.

PG ID: 15.28f0
Pool ID: 15
Pool: default.rgw.buckets.data
Pool EC Ratio (k:m): 8:3
Number of Hosts: 12

## crush dump for rule ##
#ceph osd crush rule dump data_ec_rule
{
    "rule_id": 1,
    "rule_name": "data_ec_rule",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 11,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -50,
            "item_name": "root_data~hdd"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

## From Crushmap dump ##
rule data_ec_rule {
    id 1
    type erasure
    min_size 3
    max_size 11
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take root_data class hdd
    step chooseleaf indep 0 type host
    step emit
}

## EC Profile ##
ceph osd erasure-code-profile get data
crush-device-class=hdd
crush-failure-domain=host
crush-root=root_data
jerasure-per-chunk-alignment=false
k=8
m=3
plugin=jerasure
technique=reed_sol_van
w=8
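
Side note: with k=8 and m=3 the rule has to pick 11 distinct hosts out of
only 12, which leaves CRUSH very little slack if anything is down. A rough
way to check whether the installed map can still produce a complete mapping
for this rule (file names below are just placeholders):

ceph osd getcrushmap -o crushmap.bin
crushtool --test -i crushmap.bin --rule 1 --num-rep 11 --show-bad-mappings   # rule 1 = data_ec_rule; no output means every tested input mapped to 11 OSDs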

OSD Tree:
https://pastebin.com/raw/q6u7aSeu


[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-07-20 Thread siddhit . renake
What would be the appropriate way to restart the primary OSD (343) in this case?


[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-07-20 Thread Matthew Leonard (BLOOMBERG/ 120 PARK)
Assuming your OSDs are managed by systemd, you can run the following command on the
host that OSD 343 resides on.

systemctl restart ceph-osd@343 
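
If it helps, a quick way to confirm the OSD came back and to see whether the
PG recovers afterwards (just a sketch):

systemctl status ceph-osd@343      # the daemon should be active again
ceph osd tree | grep -w osd.343    # the OSD should be reported as up
ceph pg ls degraded                # watch whether 15.28f0 clears the degraded state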

From: siddhit.ren...@nxtgen.com At: 07/20/23 13:44:36 UTC-4:00 To: ceph-users@ceph.io
Subject: [ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long 
time

What should be appropriate way to restart primary OSD in this case (343) ?


[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-07-20 Thread Anthony D'Atri
Sometimes one can even get away with "ceph osd down 343" which doesn't affect 
the process.  I have had occasions when this goosed peering in a less-intrusive 
way.  I believe it just marks the OSD down in the mons' map, and when that 
makes it to the OSD, the OSD responds with "I'm not dead yet" and gets marked 
up again.
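
Spelled out, that lighter-weight path would look roughly like this (a sketch;
"ceph pg repeer" should exist on Nautilus, but verify it on your release):

ceph osd down 343        # only flags the OSD down in the osdmap; the daemon keeps running
ceph pg ls degraded      # check whether 15.28f0 re-peers and recovers
ceph pg repeer 15.28f0   # alternative: nudge just this one PG to re-peer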

> On Jul 20, 2023, at 13:50, Matthew Leonard (BLOOMBERG/ 120 PARK) 
>  wrote:
> 
> Assuming you're running systemctl OSDs you can run the following command on 
> the host that OSD 343 resides on.
> 
> systemctl restart ceph-osd@343 
> 
> From: siddhit.ren...@nxtgen.com At: 07/20/23 13:44:36 UTC-4:00To:  
> ceph-users@ceph.io
> Subject: [ceph-users] Re: 1 PG stucked in "active+undersized+degraded for 
> long time
> 
> What should be appropriate way to restart primary OSD in this case (343) ?


[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-07-26 Thread Eugen Block
I can provide some more details. These are the recovery steps taken so far;
they started from here (I don't know the whole/exact story, though):


  70/868386704 objects unfound (0.000%)
  Reduced data availability: 8 pgs inactive, 8 pgs incomplete
  Possible data damage: 1 pg recovery_unfound
  Degraded data redundancy: 45558/8766139136 objects degraded (0.001%), 2 pgs degraded, 1 pg undersized


Reducing min_size for the EC pools cleaned up some of the inactive PGs. For
the remaining 4 incomplete PGs they got further by marking the unfound
objects as lost:


# ceph pg 15.f4f mark_unfound_lost delete
pg has 70 objects unfound and apparently lost marking

And now one PG is stuck degraded:

# ceph pg ls degraded
PG       OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES        OMAP_BYTES*  OMAP_KEYS*  LOG   STATE                       SINCE  VERSION        REPORTED        UP                                                        ACTING                                                    SCRUB_STAMP                 DEEP_SCRUB_STAMP
15.28f0  44994    44994     0          0        55288092914  0            0           3077  active+undersized+degraded  93s    310625'599302  310657:3603406  [2147483647,343,355,415,426,640,302,392,78,202,607]p343  [2147483647,343,355,415,426,640,302,392,78,202,607]p343  2021-04-11 03:18:39.164439  2021-04-10 01:42:16.182528


Setting osd.343 down didn't have any effect. I then suggested increasing
set_choose_tries from 100 to 150 for the respective crush rule (I found a
thread where that seemed to have helped), but I don't have a response to
that yet. If nothing else helps, would marking the PG as unfound_lost
(with data loss) help here?
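
If they do try the set_choose_tries change, it would look roughly like this
(a sketch only; file names are placeholders, rule id and chunk count are
taken from the rule dump earlier in the thread):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: in rule data_ec_rule change "step set_choose_tries 100" to 150
crushtool -c crushmap.txt -o crushmap.new
crushtool --test -i crushmap.new --rule 1 --num-rep 11 --show-bad-mappings   # should print nothing if every input maps to 11 OSDs
ceph osd setcrushmap -i crushmap.new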


Zitat von Anthony D'Atri :

Sometimes one can even get away with "ceph osd down 343" which  
doesn't affect the process.  I have had occasions when this goosed  
peering in a less-intrusive way.  I believe it just marks the OSD  
down in the mons' map, and when that makes it to the OSD, the OSD  
responds with "I'm not dead yet" and gets marked up again.





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io