osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage and all,

I built Ceph from the wip-newstore branch on RHEL7 and am running performance
tests to compare it with filestore. After a few hours of running the tests
the OSD daemons started to crash. Here is the stack trace; the OSD crashes
immediately after a restart, so I could not get the OSD back up and running.

ceph version eb8e22893f44979613738dfcdd40dada2b513118
(eb8e22893f44979613738dfcdd40dada2b513118)
1: /usr/bin/ceph-osd() [0xb84652]
2: (()+0xf130) [0x7f915f84f130]
3: (gsignal()+0x39) [0x7f915e2695c9]
4: (abort()+0x148) [0x7f915e26acd8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
6: (()+0x5e946) [0x7f915eb6b946]
7: (()+0x5e973) [0x7f915eb6b973]
8: (()+0x5eb9f) [0x7f915eb6bb9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x1eb) [0x8cd06b]
14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
19: (()+0x7df3) [0x7f915f847df3]
20: (clone()+0x6d) [0x7f915e32a01d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know the cause of this crash. When it happens I notice that
two OSDs on separate machines are down. I can bring one OSD up, but
restarting the other causes both OSDs to crash. My understanding is that
the crash happens when the two OSDs try to communicate and replicate a
particular PG.

Regards
Srikanth


Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
> [...]

Can you include the log lines that precede the dump above?  In particular,
there should be a line that tells you what assertion failed, in what
function, and at what line number.  I haven't seen this crash so I'm not
sure offhand what it is.

Thanks!
sage


Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage,

The assertion failed at line 1639; here is the log message:


2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
function 'virtual int NewStore::collection_list_partial(coll_t,
ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


Just before the crash, here are the debug statements printed by the
method (collection_list_partial):

2015-05-30 22:49:23.607232 7f1681934700 15
newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
start -1/0//0/0 min/max 1024/1024 snap head
2015-05-30 22:49:23.607251 7f1681934700 20
newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
--.7fb4.. to --.7fb4.0800. and
--.804b.. to --.804b.0800. start
-1/0//0/0
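
For context on the failing check: the two debug lines above show the key
ranges that collection_list_partial computed for the scan, and the assert at
NewStore.cc:1639 requires every key the iterator hands back to fall inside
the current [start_key, end_key) bound. The following is a minimal,
self-contained sketch of that invariant only; the key strings, the
list_range name, and the std::map stand-in for rocksdb are illustrative
assumptions, not the actual NewStore code.

#include <cassert>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Enumerate keys in [start_key, end_key) and enforce the same invariant the
// failed assert checks: every key returned by the scan must lie inside the
// computed bounds.
std::vector<std::string> list_range(const std::map<std::string, int>& kv,
                                    const std::string& start_key,
                                    const std::string& end_key) {
  std::vector<std::string> out;
  for (auto it = kv.lower_bound(start_key); it != kv.end(); ++it) {
    const std::string& k = it->first;
    if (k >= end_key)
      break;  // normal termination of the range scan
    assert(k >= start_key && k < end_key);  // the check that fired here
    out.push_back(k);
  }
  return out;
}

int main() {
  // Stand-in for the ordered key space of one collection; the real keys are
  // rocksdb keys whose prefix encodes the object hash.
  std::map<std::string, int> kv = {
    {"--.7fb4.0000.objA", 1},
    {"--.7fb4.0400.objB", 2},
    {"--.804b.0100.objC", 3},
  };
  // Bounds shaped like the first range in the debug output above.
  for (const auto& k : list_range(kv, "--.7fb4.", "--.7fb4.0800."))
    std::cout << k << "\n";  // prints the two objects inside the 7fb4 range
  return 0;
}

The crash therefore means the scan produced a key that does not fall inside
the bounds it computed for this PG.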


Regards
Srikanth

On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil wrote:
> [...]


Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
I pushed a commit to wip-newstore-debuglist... can you reproduce the crash
with that branch with 'debug newstore = 20' and send us the log?
(You can just do 'ceph-post-file <filename>'.)

Thanks!
sage

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

> [...]


Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage,

Unfortunately I purged the cluster yesterday and restarted the backfill
tool. I have not seen the OSD crash on the cluster yet. I am monitoring
the OSDs and will update you once I see the crash.

With the new backfill run I have reduced the rps by half; I am not sure
whether that is why the crash has not reappeared yet.

Regards
Srikanth


On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil wrote:
> [...]


Re: osd crash with object store set to newstore

2015-06-03 Thread Srikanth Madugundi
Hi Sage,

I saw the crash again. Here is the output after adding the debug
message from wip-newstore-debuglist:


   -31> 2015-06-03 20:28:18.864496 7fd95976b700 -1
newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is
--.7fff..!!!.
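
For what it is worth, plain lexicographic comparison already shows why a key
with a 7fff hash prefix would trip assert(k >= start_key && k < end_key):
checked against the two sub-ranges from the earlier log excerpt (a different
crash instance, osd.7 / 75.0_head, used here purely for illustration), it
sorts after the end of the first range and before the start of the second.
A small hypothetical check:

#include <iostream>
#include <string>

int main() {
  // Key reported by the wip-newstore-debuglist output (abbreviated as logged).
  const std::string k = "--.7fff..!!!.";
  // Sub-ranges printed by collection_list_partial in the earlier osd.7 log;
  // they come from a different crash instance and are used only to
  // illustrate the comparison.
  const std::string r1_start = "--.7fb4..", r1_end = "--.7fb4.0800.";
  const std::string r2_start = "--.804b..", r2_end = "--.804b.0800.";

  auto in_range = [](const std::string& key,
                     const std::string& s, const std::string& e) {
    return key >= s && key < e;  // the per-range condition the OSD asserts
  };

  std::cout << std::boolalpha
            << "in first range:  " << in_range(k, r1_start, r1_end)   // false: k sorts past r1_end
            << "\nin second range: " << in_range(k, r2_start, r2_end) // false: k sorts before r2_start
            << "\n";
  return 0;
}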


Here is the ID of the file I posted:

ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804

Let me know if you need anything else.

Regards
Srikanth


On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi wrote:
> [...]

Re: osd crash with object store set to newstore

2015-06-05 Thread Srikanth Madugundi
Hi Sage,

Did you get a chance to look at the crash?

Regards
Srikanth

On Wed, Jun 3, 2015 at 1:38 PM, Srikanth Madugundi wrote:
> [...]

Re: osd crash with object store set to newstore

2015-06-05 Thread Sage Weil
On Fri, 5 Jun 2015, Srikanth Madugundi wrote:
> Hi Sage,
> 
> Did you get a chance to look at the crash?

Not yet--I am still focusing on getting wip-temp (and other newstore 
prerequisite code) working before turning back to newstore.  I'll look at 
this once I get back to newstore... hopefully in the next week or so!

sage

