On Wed, Jul 19, 2017 at 4:04 PM, Enrico Olivelli <[email protected]> wrote:
> Hi,
> in some internal benchmarks we are experiencing openLedgerNoRecovery calls
> which remain hung.
> I see that basically that function calls ZooKeeper#getData.
>
> Does anyone have an idea of how it can happen?
>

What version are you testing? Is it related to your recent change bumping the ZooKeeper version? If that's the case, we should consider rolling back the ZooKeeper version.

>
> Is there any implicit timeout on ZK.getData()? I did not find any way and
> personally I never got into this problem.
>

As far as I know, there is no timeout on ZooKeeper requests. It would be a good question for the ZooKeeper community.

>
> Maybe there is space for an improvement to add a timeout on openLedgerXXX
> operations, but anyway it is strange that the callback is never called.
>
> Unfortunately the problem happens only in integration tests, maybe I can
> work to reproduce it on a BK only test case.
>
> The case is simple: start ZK + 1 Bookie + 1 BookKeeper, create
> concurrently many ledgers, write and concurrently open them with
> openLedgerNoRecovery from other threads.
> The fact is that no error is on ZK logs and BK logs
>

Can you turn on debug logging for the BookKeeper client and also for ZooKeeper? There might be something in the logs worth checking. Another option is to take a TCP dump to trace the ZooKeeper calls and see whether the getData request and response are received on both sides.

>
> Any suggestion?
>
> Thanks
>
> -- Enrico
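
On the idea of adding a timeout to the openLedgerXXX operations: until something like that exists, a caller-side guard might look roughly like the sketch below. This is only an illustration, not BookKeeper or ZooKeeper API. `CallbackTimeout`, `Callback`, `AsyncOp` and `callWithTimeout` are hypothetical names, and the callback shape (a return code plus a result, with 0 meaning success) is an assumption made for the example. The sketch bridges a callback-style call into a `CompletableFuture` and bounds the wait, so a callback that never fires surfaces as a `TimeoutException` instead of a hung thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CallbackTimeout {

    /** Hypothetical callback shape: a return code plus a result (0 is assumed to mean success). */
    public interface Callback<T> {
        void complete(int rc, T result);
    }

    /** Hypothetical async operation that reports through a Callback, or never does if it hangs. */
    public interface AsyncOp<T> {
        void start(Callback<T> cb);
    }

    /**
     * Starts the operation and waits at most the given time for its callback.
     * Throws TimeoutException if the callback is not invoked in time, and
     * ExecutionException if the callback reports a non-zero return code.
     */
    public static <T> T callWithTimeout(AsyncOp<T> op, long timeout, TimeUnit unit)
            throws Exception {
        CompletableFuture<T> future = new CompletableFuture<>();
        op.start((rc, result) -> {
            if (rc == 0) {
                future.complete(result);
            } else {
                future.completeExceptionally(new IllegalStateException("rc=" + rc));
            }
        });
        // Bounded wait: this is the only place the caller can block.
        return future.get(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        // An operation whose callback fires right away.
        String id = callWithTimeout(cb -> cb.complete(0, "ledger-1"), 1, TimeUnit.SECONDS);
        System.out.println("opened " + id);

        // An operation whose callback is never invoked, as in the hang described above.
        try {
            String never = callWithTimeout(cb -> { }, 100, TimeUnit.MILLISECONDS);
            System.out.println("unexpected result " + never);
        } catch (TimeoutException e) {
            System.out.println("timed out waiting for callback");
        }
    }
}
```

Note that this only unblocks the caller. It does not cancel the underlying request or explain why the callback was lost, so the debug logs and the TCP dump are still needed to find the root cause.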
