RE: debugging librbd async

James Harper Fri, 16 Aug 2013 17:10:22 -0700

> > I'm also testing valgrind at the moment, just basic memtest, but suddenly
> > everything is quite stable even though it's under reasonable load right now.
> > Stupid heisenbugs.
> 
> Valgrind makes things go very slow (~10x?), which can have a huge effect
> on timing. Sometimes that reveals new races, other times it hides others.
> :/
>


That's what I figured. It's run overnight with no hiccups so far under valgrind 
(helgrind - but no crashes under memtest either). The helgrind process on one 
tapdisk instance is running pretty hard, but performance is tolerable. I'll 
leave it go and poke at it occasionally to see if I can reproduce an actual 
crash.

So far helgrind has identified a bunch of possible races like:

==12478== Possible data race during write of size 4 at 0x913F250 by thread #1
==12478== Locks held: 2, at addresses 0x9119B80 0x913EF88
==12478==    at 0x5821D57: Pipe::Pipe(SimpleMessenger*, int, Connection*) (in 
/usr/lib/librados.so.2.0.0)
==12478==    by 0x5838FC7: SimpleMessenger::connect_rank(entity_addr_t const&, 
int, Connection*, Message*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x5839414: SimpleMessenger::get_connection(entity_inst_t 
const&) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x56D062F: Objecter::get_session(int) (in 
/usr/lib/librados.so.2.0.0)
==12478==    by 0x56D0FAA: Objecter::recalc_op_target(Objecter::Op*) (in 
/usr/lib/librados.so.2.0.0)
==12478==    by 0x56D6140: Objecter::_op_submit(Objecter::Op*) (in 
/usr/lib/librados.so.2.0.0)
==12478==    by 0x56C4640: librados::IoCtxImpl::operate_read(object_t const&, 
ObjectOperation*, ceph::buffer::list*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x56C53D1: librados::IoCtxImpl::stat(object_t const&, unsigned 
long*, long*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x569B83E: librados::IoCtx::stat(std::string const&, unsigned 
long*, long*) (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x63E45BF: librbd::detect_format(librados::IoCtx&, std::string 
const&, bool*, unsigned long*) (in /usr/lib/librbd.so.1.0.0)
==12478==    by 0x63DCF6E: librbd::ImageCtx::init() (in 
/usr/lib/librbd.so.1.0.0)
==12478==    by 0x63F45C7: librbd::open_image(librbd::ImageCtx*) (in 
/usr/lib/librbd.so.1.0.0)
==12478==
==12478== This conflicts with a previous read of size 4 by thread #9
==12478== Locks held: none
==12478==    at 0x581E6A1: Pipe::tcp_read_wait() (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x581E96F: Pipe::tcp_read(char*, int) (in 
/usr/lib/librados.so.2.0.0)
==12478==    by 0x581F965: Pipe::read_message(Message**) (in 
/usr/lib/librados.so.2.0.0)
==12478==    by 0x583138B: Pipe::reader() (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x5834D5C: Pipe::Reader::entry() (in /usr/lib/librados.so.2.0.0)
==12478==    by 0x4C2B5AD: mythread_wrapper (hg_intercepts.c:219)
==12478==    by 0x6650B4F: start_thread (pthread_create.c:304)
==12478==    by 0x6940A7C: clone (clone.S:112)

But I don't think they are correct. There are also some races that appear to 
relate to memcpy.

Also I believe tapdisk serialises all callbacks and I don't know that helgrind 
is aware of that completely as it identifies some cases that just can't happen.

The last thing logged is:

==12478== More than 10000000 total errors detected.  I'm not reporting any more.
==12478== Final error counts will be inaccurate.  Go fix your program!
==12478== Rerun with --error-limit=no to disable this cutoff.  Note
==12478== that errors may occur in your program without prior warning from
==12478== Valgrind, because errors are no longer being displayed.

So I think maybe it's stopped being useful in that instance. I can post the 
full output if you want. It's only about 40kb

James
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: debugging librbd async

Reply via email to