ZK lost connectivity issue on large cluster

François Méthot Wed, 14 Sep 2016 11:44:30 -0700

Hi,

  We are trying to find a solution or workaround for the following issue:


2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query.  Identified nodes were [atsqa4-133.qa.lab:31010].
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query.  Identified nodes were [atsqa4-133.qa.lab:31010].
        at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
        at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
        at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
        at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
        at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
        at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
        at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]


DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>: ForemanException: One or more nodes lost connectivity during query



Has anyone experienced this issue?

It happens when running a query involving many Parquet files on a cluster of
200 nodes. The same query on a smaller cluster of 12 nodes runs fine.

It is not caused by garbage collection (we checked on both the ZK node and
the Drillbit involved).

Negotiated max session timeout is 40 seconds.
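
For context, a ZooKeeper server clamps the session timeout requested by the client into a range that defaults to 2 x tickTime (minimum) and 20 x tickTime (maximum). The sketch below illustrates that clamping. The tickTime of 2000 ms is an assumption (it is the value in ZooKeeper's sample configuration and would be consistent with a 40 second cap), and the class and method names are hypothetical, not Drill or ZooKeeper code.

```java
public final class SessionTimeoutSketch {

    // Clamp the timeout requested by the client into the range the server allows.
    // The 2x and 20x multipliers are ZooKeeper's defaults for minSessionTimeout
    // and maxSessionTimeout when they are not set explicitly.
    public static int negotiate(int requestedMs, int tickTimeMs) {
        int minMs = 2 * tickTimeMs;
        int maxMs = 20 * tickTimeMs;
        return Math.max(minMs, Math.min(requestedMs, maxMs));
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000; // assumption: not confirmed for this cluster

        // A client asking for 60 s is capped at 20 * 2000 ms.
        System.out.println(negotiate(60000, tickTimeMs)); // prints 40000

        // A request already inside the range is granted unchanged.
        System.out.println(negotiate(5000, tickTimeMs));  // prints 5000
    }
}
```

If this assumption holds, requesting a longer timeout on the client side alone would not raise the negotiated value; the server-side cap (tickTime or maxSessionTimeout) would have to change as well.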

The sequence seems to be:
- The Drill query begins, using an existing ZK session.
- The Drill ZK session times out,
      - perhaps because it was writing something that took too long.
- Drill attempts to renew the session.
      - Drill believes that the write operation failed, so it attempts to
re-create the ZK node, which triggers another exception.

 We are open to any suggestions and will report any findings.

Thanks
Francois
