Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-27 Thread Steve Anthony
One addendum for the sake of completeness. A few PGs still refused to
repair even after the clone object was gone. To resolve this I needed to
remove the clone metadata from the HEAD using ceph-objectstore-tool.
First, I found the problematic clone ID in the log on the primary replica:

ceph2:~# grep ERR /var/log/ceph/ceph-osd.229.log
2018-06-25 10:59:37.554924 7fbdd80d2700 -1 log_channel(cluster) log
[ERR] : repair 2.9a6
2:65942a51:::rb.0.2479b45.238e1f29.002d338d:head expected clone
2:65942a51:::rb.0.2479b45.238e1f29.002d338d:1320 1 missing

In this case the clone ID is 1320. Note that this is the hex value and
ceph-objectstore-tool will expect the decimal equivalent (4896 in this
case). Then on each host stop the OSD and remove the metadata. For
Bluestore this looks like:

ceph2:~# ceph-objectstore-tool --type bluestore --data-path
/var/lib/ceph/osd/ceph-229/ --pgid 2.9a6
'{"oid":"rb.0.2479b45.238e1f29","snapid":-2,"hash":2320771494,"max":0,"pool":2,"namespace":"","max":0}'
remove-clone-metadata 4896
Removal of clone 1320 complete
Use pg repair after OSD restarted to correct stat information

And if it's a Filestore OSD:

ceph15:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-122/
--pgid 2.9a6
'{"oid":"rb.0.2479b45.238e1f29.002d338d","key":"","snapid":-2,"hash":2320771494,"max":0,"pool":2,"namespace":"","max":0}'
remove-clone-metadata 4896
Removal of clone 1320 complete
Use pg repair after OSD restarted to correct stat information

Once that's done, starting the OSD and repairing the PG finally marked
it as clean.
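In case it helps anyone scripting this, the whole sequence boils down to
something like the sketch below, repeated on each host holding a replica.
The PG ID, clone ID and OSD number are just the values from this example;
the object JSON placeholder would come from a prior --op list on the same
OSD, and --type should match the objectstore (bluestore or filestore):

# placeholder values from the example above
PGID=2.9a6
CLONE_HEX=1320
CLONE_DEC=$(printf '%d' "0x${CLONE_HEX}")   # 4896
OSD=229
OBJ_JSON='<head object JSON from --op list>'

systemctl stop ceph-osd@${OSD}.service
ceph-objectstore-tool --type bluestore \
    --data-path /var/lib/ceph/osd/ceph-${OSD}/ \
    --pgid ${PGID} "${OBJ_JSON}" remove-clone-metadata ${CLONE_DEC}
systemctl start ceph-osd@${OSD}.service

# once all replicas are done, repair the PG to correct the stat information
ceph pg repair ${PGID}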

-Steve

Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-14 Thread Steve Anthony
For reference, building Luminous with the changes in the pull request
also fixed this issue for me. Some of my unexpected snapshots were on
Bluestore devices; here's how I used the objectstore tool to remove
them. In the example, the problematic placement group is 2.1c3f, and the
unexpected clone is identified in the OSD's log as
rb.0.2479b45.238e1f29.0df:1356.

ceph14:~# systemctl stop ceph-osd@34.service
ceph14:~# ceph-objectstore-tool --type bluestore --data-path 
/var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f --op list 
rb.0.2479b45.238e1f29.0ddf
Error getting attr on : 2.1c3f_head,#-4:fc38:::scrub_2.1c3f:head#, (61) No 
data available
["2.1c3f",{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}]
["2.1c3f",{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":-2,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}]
ceph14:~# ceph-objectstore-tool --dry-run --type bluestore --data-path 
/var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f 
'{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}'
 remove
remove #2:fc3af922:::rb.0.2479b45.238e1f29.0ddf:1356#
dry-run: Nothing changed
ceph14:~# ceph-objectstore-tool --type bluestore --data-path 
/var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f 
'{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}'
 remove
remove #2:fc3af922:::rb.0.2479b45.238e1f29.0ddf:1356#
ceph14:~# systemctl start ceph-osd@34.service
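Once the OSD is back up it's worth re-checking the PG; a minimal follow-up,
assuming the standard ceph/rados CLI, looks something like this:

ceph pg deep-scrub 2.1c3f
ceph health detail | grep 2.1c3f
rados list-inconsistent-obj 2.1c3f --format=json-pretty

(Re-running the earlier --op list against the stopped OSD should also show
only the head entry, snapid -2, remaining for that object.)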

-Steve


Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-14 Thread Nick Fisk
For completeness, in case anyone has this issue in the future and stumbles
across this thread:

If your OSD is crashing and you are still running on a Luminous build that
does not have the fix in the pull request below, you will need to compile
the ceph-osd binary and replace it on the affected OSD node. This will get
your OSDs/cluster back up and running.

In regards to the stray object/clone, I was unable to remove it using the
objectstore tool; I'm guessing this is because, as far as the OSD is
concerned, that clone should already have been deleted. I am still running
Filestore on this cluster, and simply removing the clone object from the
OSD's PG folder (note: the object won't have _head in its name) and then
running a deep scrub on the PG again fixed the issue for me.
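For anyone attempting the same manual cleanup, a rough sketch of what that
looks like (paths and IDs as in this thread; stopping the OSD before touching
its data directory is assumed here, and the clone file carries the hex snap
ID, e.g. __1c_, instead of __head_ in its name):

systemctl stop ceph-osd@46.service
find /var/lib/ceph/osd/ceph-46/current/1.2ca_head/ \
    -name '*0c4c14238e1f29*' ! -name '*__head_*'
# delete the stray clone file found above, then:
systemctl start ceph-osd@46.service
ceph pg deep-scrub 1.2ca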

Nick

Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-07 Thread Nick Fisk
So I've recompiled a 12.2.5 ceph-osd binary with the fix included in 
https://github.com/ceph/ceph/pull/22396

The OSD has restarted as expected and the PG is now active+clean... success
so far.

What's the best method to clean up the stray snapshot on OSD.46? I'm guessing
the objectstore tool, but I'm not sure whether I want to clean the clone
metadata or try to remove the actual snapshot object.
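Presumably it's one of these two invocations against the stopped OSD; the
object JSON is a placeholder that would come from a prior --op list, and the
clone ID for remove-clone-metadata has to be given in decimal:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46/ --pgid 1.2ca \
    '<clone object JSON from --op list>' remove

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46/ --pgid 1.2ca \
    '<head object JSON from --op list>' remove-clone-metadata <clone id in decimal>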

Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-05 Thread Sage Weil
On Tue, 5 Jun 2018, Paul Emmerich wrote:
> 2018-06-05 17:42 GMT+02:00 Nick Fisk :
> 
> > Hi,
> >
> > After a RBD snapshot was removed, I seem to be having OSD's assert when
> > they try and recover pg 1.2ca. The issue seems to follow the
> > PG around as OSD's fail. I've seen this bug tracker and associated mailing
> > list post, but would appreciate if anyone can give any
> > pointers. https://tracker.ceph.com/issues/23030,
> >
> 
> I've originally reported that issue. Our cluster that was seeing this
> somehow got magically better without us intervening.
> We might have deleted the offending snapshot or RBD, not sure.

I believe we resolved one variant of this crash, at least.  See

https://tracker.ceph.com/issues/24396

In this case there was both a stray clone *and* the PG was undergoing
recovery (due to an attempted repair).

sage


Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-05 Thread Nick Fisk
So, from what I can see, I believe this issue is being caused by one of the
remaining OSDs acting for this PG containing a snapshot file of the object:

/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__head_F930D2CA_
_1
/var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.000bf479__1c_F930D2CA__1

Both the OSD which crashed and the other acting OSD don't have this "1c"
snapshot file. Is the solution to use the objectstore tool to remove this
"1c" snapshot object and then allow things to backfill?



Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-05 Thread Nick Fisk
From: ceph-users  On Behalf Of Paul Emmerich
Sent: 05 June 2018 17:02
To: n...@fisk.me.uk
Cc: ceph-users 
Subject: Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-05 17:42 GMT+02:00 Nick Fisk <n...@fisk.me.uk>:

Hi,

After a RBD snapshot was removed, I seem to be having OSD's assert when they 
try and recover pg 1.2ca. The issue seems to follow the
PG around as OSD's fail. I've seen this bug tracker and associated mailing list 
post, but would appreciate if anyone can give any
pointers. https://tracker.ceph.com/issues/23030,

I've originally reported that issue. Our cluster that was seeing this somehow
got magically better without us intervening.
We might have deleted the offending snapshot or RBD, not sure.

Thanks Paul. It looks like the snapshot has gone, but there are obviously some
remnants still around.
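(For anyone checking the same thing, whether the snapshot itself is gone can
be confirmed with something along the lines of the command below; pool and
image names are placeholders since they aren't given in the thread. The
leftover here is at the object level on one OSD, not in the RBD snapshot
list.)

rbd snap ls <pool>/<image>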


Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-05 Thread Paul Emmerich
2018-06-05 17:42 GMT+02:00 Nick Fisk :

> Hi,
>
> After a RBD snapshot was removed, I seem to be having OSD's assert when
> they try and recover pg 1.2ca. The issue seems to follow the
> PG around as OSD's fail. I've seen this bug tracker and associated mailing
> list post, but would appreciate if anyone can give any
> pointers. https://tracker.ceph.com/issues/23030,
>

I've originally reported that issue. Our cluster that was seeing this
somehow got magically better without us intervening.
We might have deleted the offending snapshot or RBD, not sure.


Paul


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


[ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())

2018-06-05 Thread Nick Fisk
Hi,

After an RBD snapshot was removed, I seem to be having OSDs assert when they
try and recover pg 1.2ca. The issue seems to follow the PG around as OSDs
fail. I've seen this bug tracker and the associated mailing list post, but
would appreciate it if anyone can give any pointers:
https://tracker.ceph.com/issues/23030

Cluster is 12.2.5 with Filestore and 3x replica pools

I noticed after the snapshot was removed that there were 2 inconsistent PGs.
I ran a repair on both, and one of them (1.2ca) seems to have triggered this
issue.
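(For reference, finding and repairing them was just the usual inconsistency
workflow, roughly along these lines; the pool name is a placeholder since it
isn't given here.)

ceph health detail
rados list-inconsistent-pg <pool>
rados list-inconsistent-obj 1.2ca --format=json-pretty
ceph pg repair 1.2ca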


Snippets of the logs from two different OSDs. The first is from the original
OSD holding the PG; the second is from the OSD the PG was being recovered to
after the original was marked out:


-4> 2018-06-05 15:15:45.997225 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6819/3544315 --> [2a03:25e0:253:5::14]:0/3307345 --
osd_ping(ping_reply e2196479 stamp 2018-06-05 15:15:45.994907) v4 -- 
0x557d3d72f800 con 0
-3> 2018-06-05 15:15:46.018088 7febb2954700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 --> [2a03:25e0:254:5::12]:6809/5784710 --
MOSDPGPull(1.2ca e2196479/2196477 cost 8389608) v3 -- 0x557d4d180b40 con 0
-2> 2018-06-05 15:15:46.018412 7febce7a5700  5 -- 
[2a03:25e0:254:5::112]:6817/3544315 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557d4b1a9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=13470 
cs=1 l=0).
rx osd.46 seq 13 0x557d4d180b40 MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c,
version: 2195927'1249660, data_included: [], data_size: 0, omap_header_size: 0, 
omap_ent
ries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c@2195927'1249660,
 size: 0, copy_subset: [], clone_subset:
{}, snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap
_recovered_to:, omap_complete:false, error:false))]) v3
-1> 2018-06-05 15:15:46.018425 7febce7a5700  1 -- 
[2a03:25e0:254:5::112]:6817/3544315 <== osd.46
[2a03:25e0:254:5::12]:6809/5784710 13  MOSDPGPush(1.2ca 2196479/2196477 
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f
29.000bf479:1c, version: 2195927'1249660, data_included: [], data_size: 
0, omap_header_size: 0, omap_entries_size: 0,
attrset_size: 1, recovery_info: 
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.0
00bf479:1c@2195927'1249660, size: 0, copy_subset: [], clone_subset: {}, 
snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
omap_recovered_to:, omap_complete:t
rue, error:false), before_progress: ObjectRecoveryProgress(first, 
data_recovered_to:0, data_complete:false, omap_recovered_to:,
omap_complete:false, error:false))]) v3  885+0+0 (695790480 0 0) 
0x557d4d180b40 con 0x5
57d4b1a9000
 0> 2018-06-05 15:15:46.022099 7febb2954700 -1 
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, 
ObjectContextRef, bool
, ObjectStore::Transaction*)' thread 7febb2954700 time 2018-06-05 
15:15:46.019130
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != 
recovery_info.ss.clone_snaps.end())







-4> 2018-06-05 16:28:59.560140 7fcd7b655700  5 -- 
[2a03:25e0:254:5::113]:6829/525383 >> [2a03:25e0:254:5::12]:6809/5784710
conn(0x557447510800 :6829 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
pgs=13524 cs
=1 l=0). rx osd.46 seq 6 0x5574480d0d80 MOSDPGPush(1.2ca 2196813/2196812
[PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c, version: 
2195927'1249660, data_included: [], data_size: 0,
omap_header_s
ize: 0, omap_entries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.000bf479:1c@2195927'1249660,
 size: 4194304, copy_subset: [],
clone_subset: {}, snapse
t: 1c=[]:{}), after_progress: ObjectRecoveryProgress(!first, 
data_recovered_to:0, data_complete:true, omap_recovered_to:,
omap_complete:true, error:false), before_progress: 
ObjectRecoveryProgress(first, data_rec
overed_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, 
error:false))]) v3
-3> 2018-06-05 16:28:59.560155 7fcd7b655700  1 -- 
[2a03:25e0:254:5::113]:6829/525383 <== osd.46
[2a03:25e0:254:5::12]:6809/5784710 6  MOSDPGPush(1.2ca 2196813/2196812 
[PushOp(1:534b0c9f:::rbd_data.0c4c14
238e1f29.000bf479:1c, version: 2195927'1249660, data_included: [], 
data_size: 0, omap_header_size: 0, omap_entries_size: 0,
attrset_size: 1, recovery_info: ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14
238e1f29.000bf479:1c@2195927'1249660, size: 4194304, copy_subset: [], 
clone_subset: {}, snapset: 1c=[]:{}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_re