> > > Of course, the old standby is to just crank up the logging detail and try
> > > to narrow down where the crash happens.  Have you tried that yet?
> >
> > I haven't touched the rbd code. Is increased logging a compile-time
> > option or a config option?
> 
> That is probably the first thing you should try then.  In the [client] section
> of ceph.conf on the node where tapdisk is running add something like
> 
>  [client]
>   debug rbd = 20
>   debug rados = 20
>   debug ms = 1
>   log file = /var/log/ceph/client.$name.$pid.log
> 
> and make sure the log directory is writeable.
> 

Excellent. How noisy are those levels likely to be?

Is it the consumer of librbd that reads those values? I mean, all I need to do 
is restart the tapdisk process and the logging should start, right?
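
Before I restart anything, here is a rough sanity check I could run as the
same user that tapdisk runs as. It is only a sketch: it mimics the $name and
$pid expansion with plain string replacement (it does not call into Ceph at
all), and the client name "client.admin" is my assumption, not something
confirmed for this setup. The helper names are made up for illustration.

```python
import os
import tempfile


def expand_log_path(template, name, pid):
    """Approximate the $name/$pid expansion with plain string replacement.

    This is NOT Ceph's own expansion code, just an illustration.
    """
    return template.replace("$name", name).replace("$pid", str(pid))


def dir_is_writable(path):
    """Return True if a file can actually be created in the directory."""
    try:
        # Creating (and auto-deleting) a temp file is a more direct check
        # than inspecting permission bits.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Template copied from the suggested ceph.conf snippet.
    template = "/var/log/ceph/client.$name.$pid.log"
    # "client.admin" is an assumed client name.
    log_path = expand_log_path(template, "client.admin", os.getpid())
    log_dir = os.path.dirname(log_path)
    print("example log path:", log_path)
    print("log dir writable:", dir_is_writable(log_dir))
```

If that prints False for the log directory, fixing ownership or permissions
on it first would save a restart that produces no log.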

> > > There is a probable issue with aio_flush and caching enabled that Mike
> > > Dawson is trying to reproduce.  Are you running with caching on or off?
> >
> > I have not enabled caching, and I believe it's disabled by default.
> 
> There is a fix for an aio hang that just hit the cuttlefish branch today
> that could conceivably be the issue.  It causes a hang on qemu but maybe
> tapdisk is more sensitive?  I'd make sure you're running with that in any
> case to rule it out.
> 

I switched to dumpling in the last few days to see if the problem existed 
there. Is the fix you mention in dumpling? I'm not yet running mission-critical 
production code on Ceph, just a secondary Windows domain controller, a secondary 
spam filter, and a few other machines that don't affect production if they 
crash.

I'm also testing with valgrind at the moment, just basic memtest, but suddenly 
everything is quite stable even though it's under reasonable load right now. 
Stupid heisenbugs.

Thanks

James


