For reference, building Luminous with the changes in the pull request
also fixed this issue for me. Some of my unexpected snapshots were on
Bluestore devices; here's how I used the objectstore tool to remove
them. In the example below, the problematic placement group is 2.1c3f,
and the unexpected clone is identified in the OSD's log as
rb.0.2479b45.238e1f29.000000000ddf:1356.

ceph14:~# systemctl stop ceph-osd@34.service
ceph14:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f --op list rb.0.2479b45.238e1f29.000000000ddf
Error getting attr on : 2.1c3f_head,#-4:fc380000:::scrub_2.1c3f:head#, (61) No data available
["2.1c3f",{"oid":"rb.0.2479b45.238e1f29.000000000ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}]
["2.1c3f",{"oid":"rb.0.2479b45.238e1f29.000000000ddf","key":"","snapid":-2,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}]
ceph14:~# ceph-objectstore-tool --dry-run --type bluestore --data-path /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f '{"oid":"rb.0.2479b45.238e1f29.000000000ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}' remove
remove #2:fc3af922:::rb.0.2479b45.238e1f29.000000000ddf:1356#
dry-run: Nothing changed
ceph14:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f '{"oid":"rb.0.2479b45.238e1f29.000000000ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}' remove
remove #2:fc3af922:::rb.0.2479b45.238e1f29.000000000ddf:1356#
ceph14:~# systemctl start ceph-osd@34.service
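
A couple of optional sanity checks (my own additions, nothing the tool requires): before that final start, re-running the same --op list should return only the head entry (snapid -2); and once the OSD is back up, another deep scrub of the PG should come back clean. Roughly:

ceph14:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f --op list rb.0.2479b45.238e1f29.000000000ddf
ceph14:~# ceph pg deep-scrub 2.1c3f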

-Steve

On 06/14/2018 04:59 PM, Nick Fisk wrote:
> For completeness, in case anyone has this issue in the future and stumbles
> across this thread:
>
> If your OSD is crashing and you are still running a Luminous build that
> does not have the fix from the pull request below, you will need to compile
> the ceph-osd binary and replace it on the affected OSD node. This will get
> your OSDs/cluster back up and running.
>
> In regards to the stray object/clone, I was unable to remove it using the
> objectstore tool; I'm guessing this is because, as far as the OSD is
> concerned, that clone should have already been deleted. I am still running
> Filestore on this cluster, and simply removing the clone object from the
> OSD PG folder (note: the object won't have _head in its name) and then
> running a deep scrub on the PG again fixed the issue for me.
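
For anyone on Filestore wanting to try Nick's approach above, here is a rough sketch of that manual cleanup. This is only my illustration: the OSD id, PG and object name are taken from the paths quoted further down this thread, the backup directory is arbitrary, and I'd move the clone file aside rather than delete it outright.

systemctl stop ceph-osd@46.service
# the stray clone file carries the snap id (e.g. __1c_...) instead of __head_ in its name
find /var/lib/ceph/osd/ceph-46/current/1.2ca_head/ -name '*0c4c14238e1f29.00000000000bf479*' ! -name '*__head_*'
mkdir -p /root/pg-1.2ca-clone-backup
mv '<path printed by find>' /root/pg-1.2ca-clone-backup/
systemctl start ceph-osd@46.service
ceph pg deep-scrub 1.2ca
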
>
> Nick
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick 
> Fisk
> Sent: 07 June 2018 14:01
> To: 'ceph-users' <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] FAILED assert(p != 
> recovery_info.ss.clone_snaps.end())
>
> So I've recompiled a 12.2.5 ceph-osd binary with the fix included in 
> https://github.com/ceph/ceph/pull/22396
>
> The OSD has restarted as expected and the PG is now active+clean 
> ......success so far.
>
> What's the best method to clean up the stray snapshot on OSD.46? I'm guessing 
> using the object-store-tool, but not sure if I want to
> clean the clone metadata or try and remove the actual snapshot object.
>
> -----Original Message-----
> From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Nick Fisk
> Sent: 05 June 2018 17:22
> To: 'ceph-users' <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] FAILED assert(p != 
> recovery_info.ss.clone_snaps.end())
>
> So, from what I can see, I believe this issue is being caused by one of the
> remaining OSDs acting for this PG containing a snapshot file of the object:
>
> /var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.00000000000bf479__head_F930D2CA__1
> /var/lib/ceph/osd/ceph-46/current/1.2ca_head/DIR_A/DIR_C/DIR_2/DIR_D/DIR_0/rbd\udata.0c4c14238e1f29.00000000000bf479__1c_F930D2CA__1
>
> Both the OSD which crashed and the other acting OSD don't have this "1c"
> snapshot file. Is the solution to use the objectstore tool to remove this
> "1c" snapshot object and then allow things to backfill?
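
As an aside: before touching anything, the objectstore tool can at least show what it thinks exists for that object on a stopped Filestore OSD. Something like the following should do it (the journal path here is the default location, which is an assumption on my part):

systemctl stop ceph-osd@46.service
ceph-objectstore-tool --type filestore --data-path /var/lib/ceph/osd/ceph-46/ --journal-path /var/lib/ceph/osd/ceph-46/journal --pgid 1.2ca --op list rbd_data.0c4c14238e1f29.00000000000bf479
systemctl start ceph-osd@46.service
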
>
>
> -----Original Message-----
> From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Nick Fisk
> Sent: 05 June 2018 16:43
> To: 'ceph-users' <ceph-users@lists.ceph.com>
> Subject: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())
>
> Hi,
>
> After an RBD snapshot was removed, I seem to be having OSDs assert when
> they try to recover pg 1.2ca. The issue seems to follow the PG around as
> OSDs fail. I've seen the bug tracker issue and associated mailing list post
> below, but would appreciate it if anyone can give any pointers:
> https://tracker.ceph.com/issues/23030
>
> Cluster is 12.2.5 with Filestore and 3x replica pools
>
> I noticed after the snapshot was removed that there were 2 inconsistent
> PGs. I ran a repair on both, and one of them (1.2ca) seems to have
> triggered this issue.
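
For anyone retracing these steps, the inconsistent PGs and the objects they complain about can be listed before deciding on a repair; these are standard commands, shown here with the PG id from this thread:

ceph health detail | grep inconsistent
rados list-inconsistent-obj 1.2ca --format=json-pretty
ceph pg repair 1.2ca
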
>
>
> Snippet of log from two different OSDs. The first is from the original OSD
> holding the PG; the second is from the OSD the PG was trying to recover to
> after the original OSD was marked out:
>
>
>     -4> 2018-06-05 15:15:45.997225 7febce7a5700  1 -- 
> [2a03:25e0:254:5::112]:6819/3544315 --> [2a03:25e0:253:5::14]:0/3307345 --
> osd_ping(ping_reply e2196479 stamp 2018-06-05 15:15:45.994907) v4 -- 
> 0x557d3d72f800 con 0
>     -3> 2018-06-05 15:15:46.018088 7febb2954700  1 -- 
> [2a03:25e0:254:5::112]:6817/3544315 --> [2a03:25e0:254:5::12]:6809/5784710 --
> MOSDPGPull(1.2ca e2196479/2196477 cost 8389608) v3 -- 0x557d4d180b40 con 0
>     -2> 2018-06-05 15:15:46.018412 7febce7a5700  5 -- 
> [2a03:25e0:254:5::112]:6817/3544315 >> [2a03:25e0:254:5::12]:6809/5784710
> conn(0x557d4b1a9000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
> pgs=13470 cs=1 l=0).
> rx osd.46 seq 13 0x557d4d180b40 MOSDPGPush(1.2ca 2196479/2196477 
> [PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.00000000000bf479:1c,
> version: 2195927'1249660, data_included: [], data_size: 0, omap_header_size: 0,
> omap_entries_size: 0, attrset_size: 1, recovery_info:
> ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.00000000000bf479:1c@2195927'1249660,
>  size: 0, copy_subset: [], clone_subset:
> {}, snapset: 1c=[]:{}), after_progress:
> ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
> omap_recovered_to:, omap_complete:true, error:false),
> before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
> data_complete:false, omap_recovered_to:, omap_complete:false,
> error:false))]) v3
>     -1> 2018-06-05 15:15:46.018425 7febce7a5700  1 -- 
> [2a03:25e0:254:5::112]:6817/3544315 <== osd.46
> [2a03:25e0:254:5::12]:6809/5784710 13 ==== MOSDPGPush(1.2ca 2196479/2196477 
> [PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.00000000000bf479:1c, version: 2195927'1249660, data_included: [],
> data_size: 0, omap_header_size: 0, omap_entries_size: 0,
> attrset_size: 1, recovery_info:
> ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.00000000000bf479:1c@2195927'1249660, size: 0, copy_subset: [], clone_subset: {},
> snapset: 1c=[]:{}), after_progress:
> ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
> omap_recovered_to:, omap_complete:true, error:false),
> before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
> data_complete:false, omap_recovered_to:, omap_complete:false,
> error:false))]) v3 ==== 885+0+0 (695790480 0 0) 0x557d4d180b40 con 0x557d4b1a9000
>      0> 2018-06-05 15:15:46.022099 7febb2954700 -1 
> /build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void
> PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, 
> ObjectContextRef, bool, ObjectStore::Transaction*)'
> thread 7febb2954700 time 2018-06-05 15:15:46.019130
> /build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != 
> recovery_info.ss.clone_snaps.end())
>
>
>
>
>
>
>
>     -4> 2018-06-05 16:28:59.560140 7fcd7b655700  5 -- 
> [2a03:25e0:254:5::113]:6829/525383 >> [2a03:25e0:254:5::12]:6809/5784710
> conn(0x557447510800 :6829 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
> pgs=13524 cs=1 l=0). rx osd.46 seq 6 0x5574480d0d80 MOSDPGPush(1.2ca 2196813/2196812
> [PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.00000000000bf479:1c, version:
> 2195927'1249660, data_included: [], data_size: 0,
> omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info:
> ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.00000000000bf479:1c@2195927'1249660,
>  size: 4194304, copy_subset: [],
> clone_subset: {}, snapset: 1c=[]:{}), after_progress: ObjectRecoveryProgress(!first,
> data_recovered_to:0, data_complete:true, omap_recovered_to:,
> omap_complete:true, error:false), before_progress: 
> ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false,
> omap_recovered_to:, omap_complete:false, error:false))]) v3
>     -3> 2018-06-05 16:28:59.560155 7fcd7b655700  1 -- 
> [2a03:25e0:254:5::113]:6829/525383 <== osd.46
> [2a03:25e0:254:5::12]:6809/5784710 6 ==== MOSDPGPush(1.2ca 2196813/2196812 
> [PushOp(1:534b0c9f:::rbd_data.0c4c14238e1f29.00000000000bf479:1c, version: 2195927'1249660, data_included: [],
> data_size: 0, omap_header_size: 0, omap_entries_size: 0,
> attrset_size: 1, recovery_info:
> ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14238e1f29.00000000000bf479:1c@2195927'1249660, size: 4194304, copy_subset: [],
> clone_subset: {}, snapset: 1c=[]:{}), after_progress:
> ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, 
> omap_recovered_to:, omap_complete:true, error:false),
> before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, 
> data_complete:false, omap_recovered_to:, omap_complete:false,
> error:false))]) v3 ==== 909+0+0 (722394556 0 0) 0x5574480d0d80 con 0x557447510800
>     -2> 2018-06-05 16:28:59.560183 7fcd7b655700  5 -- 
> [2a03:25e0:254:5::113]:6829/525383 >> [2a03:25e0:254:5::12]:6809/5784710
> conn(0x557447510800 :6829 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH 
> pgs=13524 cs=1 l=0). rx osd.46 seq 7 0x55744783f900 pg_backfill(progress 1.2ca e
> 2196813/2196813 lb
> 1:534b0b88:::rbd_data.f870ac238e1f29.00000000000ff145:head) v3
>     -1> 2018-06-05 16:28:59.560189 7fcd7b655700  1 -- 
> [2a03:25e0:254:5::113]:6829/525383 <== osd.46
> [2a03:25e0:254:5::12]:6809/5784710 7 ==== pg_backfill(progress 1.2ca e 
> 2196813/2196813 lb 1:534b0b88:::rbd_data.f870ac238e1f29.00000000000ff145:head) v3 ==== 946+0+0 (3865576583 0 0)
> 0x55744783f900 con 0x557447510800
>      0> 2018-06-05 16:28:59.564054 7fcd5f3ba700 -1 
> /build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void
> PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, 
> ObjectContextRef, bool, ObjectStore::Transaction*)'
> thread 7fcd5f3ba700 time 2018-06-05 16:28:59.561060
> /build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != 
> recovery_info.ss.clone_snaps.end())
>
>  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x102) [0x557424971a02]
>  2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo 
> const&, std::shared_ptr<ObjectContext>, bool,
> ObjectStore::Transaction*)+0xd63) [0x5574244df873]
>  3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, 
> ObjectStore::Transaction*)+0x2da) [0x5574246715ca]
>  4: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e) 
> [0x5574246717fe]
>  5: 
> (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2c1) 
> [0x557424680d71]
>  6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) 
> [0x55742458c440]
>  7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, 
> ThreadPool::TPHandle&)+0x543) [0x5574244f0853]
>  8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, 
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) 
> [0x557424367539]
>  9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> 
> const&)+0x57) [0x557424610f37]
>  10: (OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*)+0x1047) [0x557424395847]
>  11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) 
> [0x5574249767f4]
>  12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x557424979830]
>  13: (()+0x76ba) [0x7fcd7f1cb6ba]
>  14: (clone()+0x6d) [0x7fcd7e24241d]
>

-- 
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
