Re: osd crash with object store set to newstore
Hi Sage,

Did you get a chance to look at the crash?

Regards,
Srikanth

On Wed, Jun 3, 2015 at 1:38 PM, Srikanth Madugundi srikanth.madugu...@gmail.com wrote:
> Hi Sage,
>
> I saw the crash again; here is the output after adding the debug message from wip-newstore-debuglist:
>
>   -31> 2015-06-03 20:28:18.864496 7fd95976b700 -1 newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is --.7fff..!!!.
>
> Here is the id of the file I posted: ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804
>
> Let me know if you need anything else.
>
> Regards,
> Srikanth
>
> [...]
Re: osd crash with object store set to newstore
On Fri, 5 Jun 2015, Srikanth Madugundi wrote:
> Hi Sage,
>
> Did you get a chance to look at the crash?

Not yet--I am still focusing on getting wip-temp (and other newstore prerequisite code) working before turning back to newstore. I'll look at this once I get back to newstore... hopefully in the next week or so!

sage

> [...]
Re: osd crash with object store set to newstore
Hi Sage,

I saw the crash again; here is the output after adding the debug message from wip-newstore-debuglist:

  -31> 2015-06-03 20:28:18.864496 7fd95976b700 -1 newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is --.7fff..!!!.

Here is the id of the file I posted: ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804

Let me know if you need anything else.

Regards,
Srikanth

On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi srikanth.madugu...@gmail.com wrote:
> Hi Sage,
>
> Unfortunately I purged the cluster yesterday and restarted the backfill tool. I have not seen the osd crash on the cluster yet.
>
> [...]
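A side note for readers trying to interpret output like the line above: the keys newstore walks during a listing are plain strings that sort lexicographically, so whether a dumped key could ever satisfy the range the iterator was asked to scan comes down to two string comparisons. The small program below is only an illustration of that check; the key is the one from the log line above, the two ranges are borrowed from the level-20 output quoted elsewhere in this thread (which came from a different OSD's crash), and the in_range helper is hypothetical, not a Ceph function.

    #include <iostream>
    #include <string>

    // Illustration only: list keys sort lexicographically, so "is this key
    // inside the queried half-open range?" is just two string comparisons,
    // the same comparison the failed assert performs.
    static bool in_range(const std::string &k,
                         const std::string &lo,
                         const std::string &hi) {
      return k >= lo && k < hi;
    }

    int main() {
      const std::string k = "--.7fff..";   // key reported by the debug output above
      std::cout << std::boolalpha
                << in_range(k, "--.7fb4..", "--.7fb4.0800.") << "\n"   // prints false
                << in_range(k, "--.804b..", "--.804b.0800.") << "\n";  // prints false
      return 0;
    }

Both comparisons come out false; a key shaped like the one in the dump does not belong to either of the ranges that kind of listing asks for, which is consistent with the assert firing.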
Re: osd crash with object store set to newstore
I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash with that branch with 'debug newstore = 20' and send us the log? (You can just do 'ceph-post-file filename'.)

Thanks!
sage

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
> Hi Sage,
>
> The assertion failed at line 1639; here is the log message:
>
> 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In function 'virtual int NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*, ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
> os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)
>
> [...]
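For what it's worth, the value of a debug branch like this is that it prints the state the assertion is about to check before it trips, so the crash log shows the offending key instead of just the failure. The sketch below is only a self-contained illustration of that idea using std::cerr and invented names; it is not the actual wip-newstore-debuglist commit, which would use Ceph's own dout/derr logging inside NewStore::collection_list_partial.

    #include <cassert>
    #include <iostream>
    #include <string>

    // Sketch only: log the values involved before asserting, so a failure
    // leaves behind a line like "start is -1/0//0/0 ... k is --.7fff..".
    // All names and values here are hypothetical.
    static void check_key_in_range(const std::string &start_repr,
                                   const std::string &k,
                                   const std::string &start_key,
                                   const std::string &end_key) {
      if (!(k >= start_key && k < end_key)) {
        std::cerr << "start is " << start_repr
                  << " ... k is " << k
                  << " (expected [" << start_key << ", " << end_key << "))"
                  << std::endl;
      }
      assert(k >= start_key && k < end_key);
    }

    int main() {
      // Values echoing the log excerpts in this thread.
      check_key_in_range("-1/0//0/0", "--.7fff..", "--.7fb4..", "--.7fb4.0800.");
      return 0;
    }

Run as-is, this prints the diagnostic line and then aborts on the assert, which matches the shape of the "start is ... k is ..." line Srikanth posts elsewhere in the thread.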
osd crash with object store set to newstore
Hi Sage and all,

I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare it with filestore. After a few hours of running the tests the osd daemons started to crash. Here is the stack trace; the osd crashes immediately after a restart, so I could not get the osd up and running.

ceph version eb8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know the cause of this crash. When the crash happens I notice that two osds on separate machines are down. I can bring one osd up, but restarting the other osd causes both OSDs to crash. My understanding is that the crash happens when the two OSDs try to communicate and replicate a particular PG.

Regards,
Srikanth
Re: osd crash with object store set to newstore
Hi Sage,

The assertion failed at line 1639; here is the log message:

2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In function 'virtual int NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*, ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)

Just before the crash, here are the debug statements printed by the method (collection_list_partial):

2015-05-30 22:49:23.607232 7f1681934700 15 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head start -1/0//0/0 min/max 1024/1024 snap head
2015-05-30 22:49:23.607251 7f1681934700 20 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range --.7fb4.. to --.7fb4.0800. and --.804b.. to --.804b.0800. start -1/0//0/0

Regards,
Srikanth

On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
> On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
> > [...]
>
> Can you include the log lines that precede the dump above? In particular, there should be a line that tells you what assertion failed, in what function, and at what line number. I haven't seen this crash so I'm not sure offhand what it is.
>
> Thanks!
> sage
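To make the failure mode concrete: the assert at os/newstore/NewStore.cc:1639 says that every key the iterator visits while listing a collection must fall inside the half-open key range computed for that collection. Below is a minimal, self-contained sketch of that invariant using plain std::string keys; the bounds are borrowed from the level-20 output above, the visited keys are made up, and none of this is the actual NewStore implementation.

    #include <cassert>
    #include <string>
    #include <vector>

    // Minimal sketch of the invariant behind
    //   FAILED assert(k >= start_key && k < end_key)
    // A listing scans keys in sorted order and expects every visited key to
    // stay inside [start_key, end_key); a key from outside that range means
    // the iterator ended up in the wrong part of the keyspace.
    int main() {
      const std::string start_key = "--.7fb4..";       // lower bound, from the output above
      const std::string end_key   = "--.7fb4.0800.";   // upper bound, from the output above

      // Hypothetical keys coming back from the key/value iterator.
      const std::vector<std::string> keys = {
        "--.7fb4..obj1",
        "--.7fb4..obj2",
        "--.7fff..stray",   // out of range: this is the kind of key that trips the assert
      };

      for (const std::string &k : keys) {
        assert(k >= start_key && k < end_key);
      }
      return 0;
    }

Compiled and run, this aborts on the third key; the crash reported in this thread has the same shape, where the listing is fine until the iterator hands back a key that does not belong to the collection's range.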
Re: osd crash with object store set to newstore
On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
> Hi Sage and all,
>
> I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare it with filestore. After a few hours of running the tests the osd daemons started to crash. Here is the stack trace; the osd crashes immediately after a restart, so I could not get the osd up and running.
>
> [stack trace trimmed; see the original report above]
>
> Please let me know the cause of this crash. When the crash happens I notice that two osds on separate machines are down. I can bring one osd up, but restarting the other osd causes both OSDs to crash. My understanding is that the crash happens when the two OSDs try to communicate and replicate a particular PG.

Can you include the log lines that precede the dump above? In particular, there should be a line that tells you what assertion failed, in what function, and at what line number. I haven't seen this crash so I'm not sure offhand what it is.

Thanks!
sage
Re: osd crash with object store set to newstore
Hi Sage,

Unfortunately I purged the cluster yesterday and restarted the backfill tool. I have not seen the osd crash on the cluster yet. I am monitoring the OSDs and will update you once I see the crash. With the new backfill run I have reduced the rps by half; I am not sure if that is the reason the crash has not reappeared yet.

Regards,
Srikanth

On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
> I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash with that branch with 'debug newstore = 20' and send us the log? (You can just do 'ceph-post-file filename'.)
>
> Thanks!
> sage
>
> [...]