Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-06-05 Thread Stephen M. Anthony ( Faculty/Staff - Ctr for Innovation in Teach & )
Listing all images and their snapshots with rbd ls -l poolname, then
purging the snapshots from each image with rbd snap purge
poolname/imagename, and finally reweighting each flapping OSD to 0.0
resolved this issue.
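
For anyone hitting the same assert, that boils down to something like the
following (the pool name is just a placeholder; 126 and 223 were the OSDs
crash-looping in this thread):

    pool=rbd                       # replace with the affected pool
    rbd ls -l $pool                # long listing shows which images still have snapshots
    for img in $(rbd ls $pool); do
        # purge all snapshots of the image (protected snapshots would have
        # to be unprotected/flattened first)
        rbd snap purge $pool/$img
    done
    # then weight out each OSD that is crash-looping
    ceph osd reweight 126 0.0
    ceph osd reweight 223 0.0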


-Steve

On 2017-06-02 14:15, Steve Anthony wrote:

I'm seeing this again on two OSDs after adding another 20 disks to my
cluster. Is there some way I can determine which snapshots the
recovery process is looking for? Or maybe find and remove the objects
it's trying to recover, since there's apparently a problem with them?
Thanks! -Steve

On 05/18/2017 01:06 PM, Steve Anthony wrote:


Hmmm, after crashing every 30 seconds for a few days, it's apparently
running normally again. Weird. I was thinking since it's looking for
a snapshot object, maybe re-enabling snaptrimming and removing all
the snapshots in the pool would remove that object (and the
problem)? Never got to that point this time, but I'm going to need
to cycle more OSDs in and out of the cluster, so if it happens again
I might try that and update.

Thanks!

-Steve

On 05/17/2017 03:17 PM, Gregory Farnum wrote:

On Wed, May 17, 2017 at 10:51 AM Steve Anthony wrote:
Hello,

After starting a backup (create snap, export and import into a second
cluster - one RBD image still exporting/importing as of this message)
the other day while recovery operations on the primary cluster were
ongoing I noticed an OSD (osd.126) start to crash; I reweighted it to 0
to prepare to remove it. Shortly thereafter I noticed the problem seemed
to move to another OSD (osd.223). After looking at the logs, I noticed
they appeared to have the same problem. I'm running Ceph version 9.2.1
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.

Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe

Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA

May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15
10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors
{default=true}
May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
ReplicatedPG::on_local_recover(const hobject_t&, const
object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15
10:39:55.322306
May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED
assert(recovery_info.oi.snaps.size())

May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function
'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15
16:45:30.799839
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED
assert(recovery_info.oi.snaps.size())

I did some searching and thought it might be related to
http://tracker.ceph.com/issues/13837 aka
https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims
to 0 for all OSDs. No luck. I had changed the systemd service file to
automatically restart osd.223 while recovery was happening, but it
appears to have stalled; I suppose it needs to be up for the remaining
objects.

Yeah, these aren't really related that I can see — though I
haven't spent much time in this code that I can recall. The OSD is
receiving a "push" as part of log recovery and finds that the object
it's receiving is a snapshot object without having any information
about the snap IDs that exist, which is weird. I don't know of any
way a client could break it either, but maybe David or Jason know
something more.
-Greg

I didn't see anything else online, so I thought I'd see if anyone has seen
this before or has any other ideas. Thanks for taking the time.

-Steve

--
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-06-02 Thread Steve Anthony
I'm seeing this again on two OSDs after adding another 20 disks to my
cluster. Is there some way I can determine which snapshots the
recovery process is looking for? Or maybe find and remove the objects
it's trying to recover, since there's apparently a problem with them?
Thanks!

-Steve

On 05/18/2017 01:06 PM, Steve Anthony wrote:
>
> Hmmm, after crashing every 30 seconds for a few days, it's apparently
> running normally again. Weird. I was thinking since it's looking for a
> snapshot object, maybe re-enabling snaptrimming and removing all the
> snapshots in the pool would remove that object (and the problem)?
> Never got to that point this time, but I'm going to need to cycle more
> OSDs in and out of the cluster, so if it happens again I might try
> that and update.
>
> Thanks!
>
> -Steve
>
>
> On 05/17/2017 03:17 PM, Gregory Farnum wrote:
>>
>>
>> On Wed, May 17, 2017 at 10:51 AM Steve Anthony wrote:
>>
>> Hello,
>>
>> After starting a backup (create snap, export and import into a second
>> cluster - one RBD image still exporting/importing as of this message)
>> the other day while recovery operations on the primary cluster were
>> ongoing I noticed an OSD (osd.126) start to crash; I reweighted
>> it to 0
>> to prepare to remove it. Shortly thereafter I noticed the problem
>> seemed
>> to move to another OSD (osd.223). After looking at the logs, I
>> noticed
>> they appeared to have the same problem. I'm running Ceph version
>> 9.2.1
>> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
>>
>> Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe
>>
>> Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA
>>
>>
>> May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15
>> 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors
>> {default=true}
>> May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
>> 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
>> ReplicatedPG::on_local_recover(const hobject_t&, const
>> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
>> ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15
>> 10:39:55.322306
>> May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192:
>> FAILED
>> assert(recovery_info.oi.snaps.size())
>>
>> May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
>> 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
>> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In
>> function
>> 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
>> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
>> ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15
>> 16:45:30.799839
>> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192:
>> FAILED
>> assert(recovery_info.oi.snaps.size())
>>
>>
>> I did some searching and thought it might be related to
>> http://tracker.ceph.com/issues/13837 aka
>> https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
>> scrubbing and deep-scrubbing, and set
>> osd_pg_max_concurrent_snap_trims
>> to 0 for all OSDs. No luck. I had changed the systemd service file to
>> automatically restart osd.223 while recovery was happening, but it
>> appears to have stalled; I suppose it needs to be up for the
>> remaining objects.
>>
>>
>> Yeah, these aren't really related that I can see — though I haven't
>> spent much time in this code that I can recall. The OSD is receiving
>> a "push" as part of log recovery and finds that the object it's
>> receiving is a snapshot object without having any information about
>> the snap IDs that exist, which is weird. I don't know of any way a
>> client could break it either, but maybe David or Jason know something
>> more.
>> -Greg
>>  
>>
>>
>> I didn't see anything else online, so I thought I'd see if anyone
>> has seen
>> this before or has any other ideas. Thanks for taking the time.
>>
>> -Steve
>>
>>
>> --
>> Steve Anthony
>> LTS HPC Senior Analyst
>> Lehigh University
>> sma...@lehigh.edu 
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> -- 
> Steve Anthony
> LTS HPC Senior Analyst
> Lehigh University
> sma...@lehigh.edu
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu




Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-05-18 Thread Steve Anthony
Hmmm, after crashing every 30 seconds for a few days, it's apparently
running normally again. Weird. I was thinking since it's looking for a
snapshot object, maybe re-enabling snaptrimming and removing all the
snapshots in the pool would remove that object (and the problem)? Never
got to that point this time, but I'm going to need to cycle more OSDs in
and out of the cluster, so if it happens again I might try that and update.
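
Re-enabling snap trimming would just mean undoing the earlier workaround,
roughly this (assuming the scrubs were disabled with the cluster-wide
flags; 2 should be the default for the snap trim option):

    ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 2'
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub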

Thanks!

-Steve


On 05/17/2017 03:17 PM, Gregory Farnum wrote:
>
>
> On Wed, May 17, 2017 at 10:51 AM Steve Anthony wrote:
>
> Hello,
>
> After starting a backup (create snap, export and import into a second
> cluster - one RBD image still exporting/importing as of this message)
> the other day while recovery operations on the primary cluster were
> ongoing I noticed an OSD (osd.126) start to crash; I reweighted it
> to 0
> to prepare to remove it. Shortly thereafter I noticed the problem
> seemed
> to move to another OSD (osd.223). After looking at the logs, I noticed
> they appeared to have the same problem. I'm running Ceph version 9.2.1
> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
>
> Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe
>
> Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA
>
>
> May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15
> 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors
> {default=true}
> May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
> 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
> ReplicatedPG::on_local_recover(const hobject_t&, const
> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
> ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15
> 10:39:55.322306
> May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192:
> FAILED
> assert(recovery_info.oi.snaps.size())
>
> May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
> 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In
> function
> 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
> ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15
> 16:45:30.799839
> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192:
> FAILED
> assert(recovery_info.oi.snaps.size())
>
>
> I did some searching and thought it might be related to
> http://tracker.ceph.com/issues/13837 aka
> https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
> scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims
> to 0 for all OSDs. No luck. I had changed the systemd service file to
> automatically restart osd.223 while recovery was happening, but it
> appears to have stalled; I suppose it needs to be up for the
> remaining objects.
>
>
> Yeah, these aren't really related that I can see — though I haven't
> spent much time in this code that I can recall. The OSD is receiving a
> "push" as part of log recovery and finds that the object it's
> receiving is a snapshot object without having any information about
> the snap IDs that exist, which is weird. I don't know of any way a
> client could break it either, but maybe David or Jason know something
> more.
> -Greg
>  
>
>
> I didn't see anything else online, so I thought I'd see if anyone
> has seen
> this before or has any other ideas. Thanks for taking the time.
>
> -Steve
>
>
> --
> Steve Anthony
> LTS HPC Senior Analyst
> Lehigh University
> sma...@lehigh.edu 
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-05-17 Thread Gregory Farnum
On Wed, May 17, 2017 at 10:51 AM Steve Anthony  wrote:

> Hello,
>
> After starting a backup (create snap, export and import into a second
> cluster - one RBD image still exporting/importing as of this message)
> the other day while recovery operations on the primary cluster were
> ongoing I noticed an OSD (osd.126) start to crash; I reweighted it to 0
> to prepare to remove it. Shortly thereafter I noticed the problem seemed
> to move to another OSD (osd.223). After looking at the logs, I noticed
> they appeared to have the same problem. I'm running Ceph version 9.2.1
> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
>
> Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe
>
> Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA
>
>
> May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15
> 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors
> {default=true}
> May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
> 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
> ReplicatedPG::on_local_recover(const hobject_t&, const
> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
> ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15
> 10:39:55.322306
> May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED
> assert(recovery_info.oi.snaps.size())
>
> May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
> 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function
> 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
> ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15
> 16:45:30.799839
> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED
> assert(recovery_info.oi.snaps.size())
>
>
> I did some searching and thought it might be related to
> http://tracker.ceph.com/issues/13837 aka
> https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
> scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims
> to 0 for all OSDs. No luck. I had changed the systemd service file to
> automatically restart osd.223 while recovery was happening, but it
> appears to have stalled; I suppose it needs to be up for the remaining
> objects.
>

Yeah, these aren't really related that I can see — though I haven't spent
much time in this code that I can recall. The OSD is receiving a "push" as
part of log recovery and finds that the object it's receiving is a snapshot
object without having any information about the snap IDs that exist, which
is weird. I don't know of any way a client could break it either, but maybe
David or Jason know something more.
-Greg


>
> I didn't see anything else online, so I thought I'd see if anyone has seen
> this before or has any other ideas. Thanks for taking the time.
>
> -Steve
>
>
> --
> Steve Anthony
> LTS HPC Senior Analyst
> Lehigh University
> sma...@lehigh.edu
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-05-17 Thread Steve Anthony
Hello,

After starting a backup (create snap, export and import into a second
cluster - one RBD image still exporting/importing as of this message)
the other day while recovery operations on the primary cluster were
ongoing I noticed an OSD (osd.126) start to crash; I reweighted it to 0
to prepare to remove it. Shortly thereafter I noticed the problem seemed
to move to another OSD (osd.223). After looking at the logs, I noticed
they appeared to have the same problem. I'm running Ceph version 9.2.1
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
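
For reference, the per-image backup flow is basically: create a snapshot,
then pipe an export of that snapshot into the second cluster, along these
lines (the image name and the "backup" cluster/config name are just
placeholders):

    rbd snap create rbd/someimage@backup-20170515
    rbd export rbd/someimage@backup-20170515 - \
        | rbd --cluster backup import - rbd/someimage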

Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe

Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA


May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15
10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors
{default=true}
May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897
7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void
ReplicatedPG::on_local_recover(const hobject_t&, const
object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15
10:39:55.322306
May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED
assert(recovery_info.oi.snaps.size())

May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391
7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function
'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const
object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef,
ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15
16:45:30.799839
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED
assert(recovery_info.oi.snaps.size())


I did some searching and thought it might be related to
http://tracker.ceph.com/issues/13837 aka
https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled
scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims
to 0 for all OSDs. No luck. I had changed the systemd service file to
automatically restart osd.223 while recovery was happening, but it
appears to have stalled; I suppose it needs to be up for the remaining objects.
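
Concretely, the workaround amounted to something like this (shown here as
the usual cluster-wide flags plus a runtime injection):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 0'

and the auto-restart of osd.223 came from a small change to its unit,
roughly an override along these lines:

    # systemctl edit ceph-osd@223
    [Service]
    Restart=on-failure
    RestartSec=30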

I didn't see anything else online, so I thought I'd see if anyone has seen
this before or has any other ideas. Thanks for taking the time.

-Steve


-- 
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com