On Fri, 16 Aug 2013, James Harper wrote:
> > > > Of course, the old standby is to just crank up the logging detail and
> > > > try to narrow down where the crash happens.  Have you tried that yet?
> > >
> > > I haven't touched the rbd code. Is increased logging a compile-time
> > > option or a config option?
> > 
> > That is probably the first thing you should try then.  In the [client] section
> > of ceph.conf on the node where tapdisk is running add something like
> > 
> >  [client]
> >   debug rbd = 20
> >   debug rados = 20
> >   debug ms = 1
> >   log file = /var/log/ceph/client.$name.$pid.log
> > 
> > and make sure the log directory is writeable.
> > 
> 
> Excellent. How noisy are those levels likely to be?
> 
> Is it the consumer of librbd that reads those values? I mean all I need 
> to do is restart tapdisk process and the logging should happen right?

That should do it, yeah.
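
Before restarting tapdisk it can help to confirm that the log directory
really is writeable and, afterwards, that log files are showing up.  A
minimal sketch, assuming the /var/log/ceph directory from the config above;
the helper names and the "client.*.log" glob are made up for illustration
and are not part of Ceph:

```python
import glob
import os
import tempfile


def check_log_dir(log_dir):
    """Return True if the current user can create files in log_dir."""
    if not os.path.isdir(log_dir):
        return False
    try:
        # Create a real (auto-deleted) file rather than trusting os.access(),
        # which can be misleading with ACLs or unusual mounts.
        with tempfile.NamedTemporaryFile(dir=log_dir):
            pass
    except OSError:
        return False
    return True


def newest_logs(log_dir, pattern="client.*.log", limit=5):
    """Return up to `limit` matching log paths, most recently modified first."""
    # The pattern is an assumption based on the "log file" setting above.
    paths = glob.glob(os.path.join(log_dir, pattern))
    return sorted(paths, key=os.path.getmtime, reverse=True)[:limit]


if __name__ == "__main__":
    log_dir = "/var/log/ceph"
    print("writable:", check_log_dir(log_dir))
    for path in newest_logs(log_dir):
        print(path)
```

Run it as the same user that tapdisk runs as, otherwise the writeability
check tells you nothing useful.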

> > > > There is a probable issue with aio_flush and caching enabled that Mike
> > > > Dawson is trying to reproduce.  Are you running with caching on or off?
> > >
> > > I have not enabled caching, and I believe it's disabled by default.
> > 
> > There is a fix for an aio hang that just hit the cuttlefish branch today
> > that could conceivably be the issue.  It causes a hang on qemu but maybe
> > tapdisk is more sensitive?  I'd make sure you're running with that in any
> > case to rule it out.
> > 
> 
> I switched to dumpling in the last few days to see if the problem existed 
> there. Is the fix you mention in dumpling? I'm not yet running mission 
> critical production code on ceph, just a secondary windows domain controller, 
> secondary spam filter, and a few other machines that don't affect production 
> if they crash.

The fix is in the dumpling branch, but not in v0.67.  It will be in 
v0.67.1.
 
> I'm also testing valgrind at the moment, just basic memtest, but suddenly 
> everything is quite stable even though it's under reasonable load right now. 
> Stupid heisenbugs.

Valgrind makes things go very slow (~10x?), which can have a huge effect 
on timing. Sometimes that reveals new races; other times it hides existing 
ones.  :/

sage