Re: osd crash with object store set to newstore

2015-06-05 Thread Srikanth Madugundi
Hi Sage,

Did you get a chance to look at the crash?

Regards
Srikanth

On Wed, Jun 3, 2015 at 1:38 PM, Srikanth Madugundi
srikanth.madugu...@gmail.com wrote:
 Hi Sage,

 I saw the crash again; here is the output after adding the debug
 message from wip-newstore-debuglist:


-31 2015-06-03 20:28:18.864496 7fd95976b700 -1
 newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is
 --.7fff..!!!.


 Here is the id of the file I posted.

 ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804

 Let me know if you need anything else.

 Regards
 Srikanth


 On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
 srikanth.madugu...@gmail.com wrote:
 Hi Sage,

 Unfortunately I purged the cluster yesterday and restarted the
 backfill tool. I have not seen the OSD crash on the cluster yet. I am
 monitoring the OSDs and will update you once I see it.

 With the new backfill run I have reduced the rps by half; I am not sure
 whether that is why the crash has not appeared yet.

 Regards
 Srikanth


 On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
 I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
 with that branch with 'debug newstore = 20' and send us the log?
 (You can just do 'ceph-post-file filename'.)

 Thanks!
 sage

 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,

 The assertion failed at line 1639; here is the log message:


 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

 os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


 Just before the crash, here are the debug statements printed by the
 method (collection_list_partial):

 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0


 Regards
 Srikanth

 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I built Ceph from the wip-newstore branch on RHEL7 and am running
  performance tests to compare it with filestore. After a few hours of
  running the tests, the OSD daemons started to crash. Here is the stack
  trace; the OSD crashes immediately after a restart, so I could not get
  it back up and running.
 
  ceph version eb8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
  snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
  ThreadPool::TPHandle&)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
  std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
  needed to interpret this.
 
  Please let me know the cause of this crash. When it happens, I notice
  that two OSDs on separate machines are down. I can bring one OSD up,
  but restarting the other causes both OSDs to crash. My understanding is
  that the crash happens when the two OSDs try to communicate and
  replicate a particular PG.

  Can you include the log lines that precede the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.

Re: osd crash with object store set to newstore

2015-06-05 Thread Sage Weil
On Fri, 5 Jun 2015, Srikanth Madugundi wrote:
 Hi Sage,
 
 Did you get a chance to look at the crash?

Not yet--I am still focusing on getting wip-temp (and other newstore 
prerequisite code) working before turning back to newstore.  I'll look at 
this once I get back to newstore... hopefully in the next week or so!

sage


 
 Regards
 Srikanth
 
 On Wed, Jun 3, 2015 at 1:38 PM, Srikanth Madugundi
 srikanth.madugu...@gmail.com wrote:
  Hi Sage,
 
  I saw the crash again; here is the output after adding the debug
  message from wip-newstore-debuglist:
 
 
 -31 2015-06-03 20:28:18.864496 7fd95976b700 -1
  newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is
  --.7fff..!!!.
 
 
  Here is the id of the file I posted.
 
  ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804
 
  Let me know if you need anything else.
 
  Regards
  Srikanth
 
 
  On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
  srikanth.madugu...@gmail.com wrote:
  Hi Sage,
 
  Unfortunately I purged the cluster yesterday and restarted the
  backfill tool. I have not seen the OSD crash on the cluster yet. I am
  monitoring the OSDs and will update you once I see it.

  With the new backfill run I have reduced the rps by half; I am not sure
  whether that is why the crash has not appeared yet.
 
  Regards
  Srikanth
 
 
  On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
  I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
  with that branch with 'debug newstore = 20' and send us the log?
  (You can just do 'ceph-post-file filename'.)
 
  Thanks!
  sage
 
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
 
  Hi Sage,
 
  The assertion failed at line 1639; here is the log message:


  2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
  function 'virtual int NewStore::collection_list_partial(coll_t,
  ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
  ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

  os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


  Just before the crash, here are the debug statements printed by the
  method (collection_list_partial):
 
  2015-05-30 22:49:23.607232 7f1681934700 15
  newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
  start -1/0//0/0 min/max 1024/1024 snap head
  2015-05-30 22:49:23.607251 7f1681934700 20
  newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
  --.7fb4.. to --.7fb4.0800. and
  --.804b.. to --.804b.0800. start
  -1/0//0/0
 
 
  Regards
  Srikanth
 
  On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
   On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
   Hi Sage and all,
  
   I built Ceph from the wip-newstore branch on RHEL7 and am running
   performance tests to compare it with filestore. After a few hours of
   running the tests, the OSD daemons started to crash. Here is the stack
   trace; the OSD crashes immediately after a restart, so I could not get
   it back up and running.
  
   ceph version eb8e22893f44979613738dfcdd40dada2b513118
   (eb8e22893f44979613738dfcdd40dada2b513118)
   1: /usr/bin/ceph-osd() [0xb84652]
   2: (()+0xf130) [0x7f915f84f130]
   3: (gsignal()+0x39) [0x7f915e2695c9]
   4: (abort()+0x148) [0x7f915e26acd8]
   5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
   6: (()+0x5e946) [0x7f915eb6b946]
   7: (()+0x5e973) [0x7f915eb6b973]
   8: (()+0x5eb9f) [0x7f915eb6bb9f]
   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
   const*)+0x27a) [0xc84c5a]
   10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
   snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
   ghobject_t*)+0x13c9) [0xa08639]
   11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
   snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
   hobject_t*)+0x352) [0x918a02]
   12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
   13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
   14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
   ThreadPool::TPHandle&)+0x68a) [0x85dbea]
   15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
   std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
   [0x6c3f5d]
   16: (OSD::ShardedOpWQ::_process(unsigned int,
   ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
   17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
   18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
   19: (()+0x7df3) [0x7f915f847df3]
   20: (clone()+0x6d) [0x7f915e32a01d]
   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
   needed to interpret this.
  
   Please let me know the cause of this crash. When it happens, I notice
   that two OSDs on separate machines are down. I can bring one OSD up,
   but restarting the other causes both OSDs to crash.

Re: osd crash with object store set to newstore

2015-06-03 Thread Srikanth Madugundi
Hi Sage,

I saw the crash again; here is the output after adding the debug
message from wip-newstore-debuglist:


   -31 2015-06-03 20:28:18.864496 7fd95976b700 -1
newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is
--.7fff..!!!.


Here is the id of the file I posted.

ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804

Let me know if you need anything else.

Regards
Srikanth


On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
srikanth.madugu...@gmail.com wrote:
 Hi Sage,

 Unfortunately I purged the cluster yesterday and restarted the
 backfill tool. I have not seen the OSD crash on the cluster yet. I am
 monitoring the OSDs and will update you once I see it.

 With the new backfill run I have reduced the rps by half; I am not sure
 whether that is why the crash has not appeared yet.

 Regards
 Srikanth


 On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
 I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
 with that branch with 'debug newstore = 20' and send us the log?
 (You can just do 'ceph-post-file filename'.)

 Thanks!
 sage

 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,

 The assertion failed at line 1639; here is the log message:


 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

 os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


 Just before the crash, here are the debug statements printed by the
 method (collection_list_partial):

 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0


 Regards
 Srikanth

 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I built Ceph from the wip-newstore branch on RHEL7 and am running
  performance tests to compare it with filestore. After a few hours of
  running the tests, the OSD daemons started to crash. Here is the stack
  trace; the OSD crashes immediately after a restart, so I could not get
  it back up and running.
 
  ceph version eb8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
  snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
  ThreadPool::TPHandle&)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
  std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
  needed to interpret this.
 
  Please let me know the cause of this crash. When it happens, I notice
  that two OSDs on separate machines are down. I can bring one OSD up,
  but restarting the other causes both OSDs to crash. My understanding is
  that the crash happens when the two OSDs try to communicate and
  replicate a particular PG.

  Can you include the log lines that precede the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.
 
  Thanks!
  sage

Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash 
with that branch with 'debug newstore = 20' and send us the log?  
(You can just do 'ceph-post-file filename'.)

Thanks!
sage
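
For anyone following along, a concrete way to apply this suggestion (a sketch only; the [osd] section placement, the osd id, and the log path are assumptions, not taken from this thread) is to set the debug level in ceph.conf on the affected host, restart the OSD to reproduce the crash, and then upload the resulting log with ceph-post-file:

    # ceph.conf on the host running the affected OSD (assumed placement)
    [osd]
        debug newstore = 20

    # after reproducing the crash, upload that OSD's log (path is an example)
    ceph-post-file /var/log/ceph/ceph-osd.19.log

ceph-post-file prints an id for the upload, which is the id quoted back later in the thread.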

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,
 
 The assertion failed at line 1639; here is the log message:


 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

 os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


 Just before the crash, here are the debug statements printed by the
 method (collection_list_partial):
 
 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0
 
 
 Regards
 Srikanth
 
 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I built Ceph from the wip-newstore branch on RHEL7 and am running
  performance tests to compare it with filestore. After a few hours of
  running the tests, the OSD daemons started to crash. Here is the stack
  trace; the OSD crashes immediately after a restart, so I could not get
  it back up and running.
 
  ceph version eb8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
  snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
  ThreadPool::TPHandle&)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
  std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
  needed to interpret this.
 
  Please let me know the cause of this crash. When it happens, I notice
  that two OSDs on separate machines are down. I can bring one OSD up,
  but restarting the other causes both OSDs to crash. My understanding is
  that the crash happens when the two OSDs try to communicate and
  replicate a particular PG.

  Can you include the log lines that precede the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.
 
  Thanks!
  sage


osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage and all,

I built Ceph from the wip-newstore branch on RHEL7 and am running
performance tests to compare it with filestore. After a few hours of
running the tests, the OSD daemons started to crash. Here is the stack
trace; the OSD crashes immediately after a restart, so I could not get
it back up and running.

ceph version eb8e22893f44979613738dfcdd40dada2b513118
(eb8e22893f44979613738dfcdd40dada2b513118)
1: /usr/bin/ceph-osd() [0xb84652]
2: (()+0xf130) [0x7f915f84f130]
3: (gsignal()+0x39) [0x7f915e2695c9]
4: (abort()+0x148) [0x7f915e26acd8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
6: (()+0x5e946) [0x7f915eb6b946]
7: (()+0x5e973) [0x7f915eb6b973]
8: (()+0x5eb9f) [0x7f915eb6bb9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27a) [0xc84c5a]
10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
ghobject_t*)+0x13c9) [0xa08639]
11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
hobject_t*)+0x352) [0x918a02]
12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x68a) [0x85dbea]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
[0x6c3f5d]
16: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
19: (()+0x7df3) [0x7f915f847df3]
20: (clone()+0x6d) [0x7f915e32a01d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Please let me know the cause of this crash. When it happens, I notice
that two OSDs on separate machines are down. I can bring one OSD up,
but restarting the other causes both OSDs to crash. My understanding is
that the crash happens when the two OSDs try to communicate and
replicate a particular PG.

Regards
Srikanth
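
The backtrace passes through ReplicatedPG::do_pg_op and PGBackend::objects_list_partial before reaching NewStore::collection_list_partial, i.e. the OSD is servicing a PG object-listing operation when it hits the assert. As a rough illustration of the client side of that path (a sketch assuming a hammer-era librados C++ API; the client id "admin" and pool name "testpool" are placeholders), listing a pool drives exactly this kind of request:

    #include <rados/librados.hpp>
    #include <iostream>

    int main() {
      librados::Rados cluster;
      // client id and default ceph.conf/keyring locations are assumptions
      if (cluster.init("admin") < 0 ||
          cluster.conf_read_file(NULL) < 0 ||
          cluster.connect() < 0) {
        std::cerr << "failed to connect to cluster" << std::endl;
        return 1;
      }

      librados::IoCtx io;
      if (cluster.ioctx_create("testpool", io) < 0) {  // placeholder pool name
        std::cerr << "failed to open pool" << std::endl;
        cluster.shutdown();
        return 1;
      }

      // Listing sends pgls ops to the PGs; on the OSD these are handled by
      // do_pg_op -> objects_list_partial -> collection_list_partial.
      for (librados::ObjectIterator it = io.objects_begin();
           it != io.objects_end(); ++it) {
        std::cout << (*it).first << std::endl;  // object name
      }

      io.close();
      cluster.shutdown();
      return 0;
    }

Anything that lists objects in the affected pool would exercise the same OSD-side path, which may explain why the OSD crashes again immediately after a restart.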


Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage,

The assertion failed at line 1639; here is the log message:


2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
function 'virtual int NewStore::collection_list_partial(coll_t,
ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


Just before the crash, here are the debug statements printed by the
method (collection_list_partial):

2015-05-30 22:49:23.607232 7f1681934700 15
newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
start -1/0//0/0 min/max 1024/1024 snap head
2015-05-30 22:49:23.607251 7f1681934700 20
newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
--.7fb4.. to --.7fb4.0800. and
--.804b.. to --.804b.0800. start
-1/0//0/0


Regards
Srikanth
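
For context on the assertion itself: collection_list_partial computes two candidate key ranges for the collection (the "range ... to ... and ... to ..." line in the debug output above), seeks the key/value iterator to a start key, and expects every key it then visits to fall inside [start_key, end_key). A simplified sketch of that invariant (illustrative only; list_range and the std::map stand in for the real rocksdb iteration and NewStore key encoding) looks like this:

    #include <cassert>
    #include <map>
    #include <string>

    // Sketch of the scan in collection_list_partial: seek to start_key, walk
    // forward, and require every visited key to lie in [start_key, end_key).
    void list_range(const std::map<std::string, std::string>& kv,
                    const std::string& start_key,
                    const std::string& end_key)
    {
      std::map<std::string, std::string>::const_iterator it =
          kv.lower_bound(start_key);            // "seek" to the start cursor
      while (it != kv.end()) {
        const std::string& k = it->first;
        if (k >= end_key)
          break;                                // past the end of this range
        // This is the invariant asserted at os/newstore/NewStore.cc:1639.
        // In the reported crash the key that comes back does not fall inside
        // either computed range, so the assert fires.
        assert(k >= start_key && k < end_key);
        // ... decode k back into a ghobject_t and append it to the result ...
        ++it;
      }
    }

The wip-newstore-debuglist change mentioned above apparently prints the start cursor and the offending key (the "start is -1/0//0/0 ... k is --.7fff..." line that appears in the 2015-06-03 follow-up above), which is the information needed to see which side of the range check the key falls on.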

On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
 Hi Sage and all,

 I built Ceph from the wip-newstore branch on RHEL7 and am running
 performance tests to compare it with filestore. After a few hours of
 running the tests, the OSD daemons started to crash. Here is the stack
 trace; the OSD crashes immediately after a restart, so I could not get
 it back up and running.

 ceph version eb8e22893f44979613738dfcdd40dada2b513118
 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
 snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
 ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
 snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
 hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
 ThreadPool::TPHandle&)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
 std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
 [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int,
 ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
 needed to interpret this.

 Please let me know the cause of this crash. When it happens, I notice
 that two OSDs on separate machines are down. I can bring one OSD up,
 but restarting the other causes both OSDs to crash. My understanding is
 that the crash happens when the two OSDs try to communicate and
 replicate a particular PG.

 Can you include the log lines that precede the dump above?  In particular,
 there should be a line that tells you what assertion failed in what
 function and at what line number.  I haven't seen this crash so I'm not
 sure offhand what it is.

 Thanks!
 sage


Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
 Hi Sage and all,
 
 I built Ceph from the wip-newstore branch on RHEL7 and am running
 performance tests to compare it with filestore. After a few hours of
 running the tests, the OSD daemons started to crash. Here is the stack
 trace; the OSD crashes immediately after a restart, so I could not get
 it back up and running.
 
 ceph version eb8e22893f44979613738dfcdd40dada2b513118
 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
 snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
 ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
 snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
 hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
 ThreadPool::TPHandle&)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
 std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
 [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int,
 ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
 needed to interpret this.
 
 Please let me know the cause of this crash. When it happens, I notice
 that two OSDs on separate machines are down. I can bring one OSD up,
 but restarting the other causes both OSDs to crash. My understanding is
 that the crash happens when the two OSDs try to communicate and
 replicate a particular PG.

Can you include the log lines that precede the dump above?  In particular,
there should be a line that tells you what assertion failed in what
function and at what line number.  I haven't seen this crash so I'm not
sure offhand what it is.

Thanks!
sage


Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage,

Unfortunately I purged the cluster yesterday and restarted the
backfill tool. I have not seen the OSD crash on the cluster yet. I am
monitoring the OSDs and will update you once I see it.

With the new backfill run I have reduced the rps by half; I am not sure
whether that is why the crash has not appeared yet.

Regards
Srikanth


On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
 I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
 with that branch with 'debug newstore = 20' and send us the log?
 (You can just do 'ceph-post-file filename'.)

 Thanks!
 sage

 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,

 The assertion failed at line 1639; here is the log message:


 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

 os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


 Just before the crash, here are the debug statements printed by the
 method (collection_list_partial):

 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0


 Regards
 Srikanth

 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I built Ceph from the wip-newstore branch on RHEL7 and am running
  performance tests to compare it with filestore. After a few hours of
  running the tests, the OSD daemons started to crash. Here is the stack
  trace; the OSD crashes immediately after a restart, so I could not get
  it back up and running.
 
  ceph version eb8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
  snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
  ThreadPool::TPHandle&)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
  std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
  needed to interpret this.
 
  Please let me know the cause of this crash. When it happens, I notice
  that two OSDs on separate machines are down. I can bring one OSD up,
  but restarting the other causes both OSDs to crash. My understanding is
  that the crash happens when the two OSDs try to communicate and
  replicate a particular PG.

  Can you include the log lines that precede the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.
 
  Thanks!
  sage