rymarm opened a new pull request, #2796:
URL: https://github.com/apache/drill/pull/2796

   # [DRILL-8426](https://issues.apache.org/jira/browse/DRILL-8426): Fix endless retrying zk set data for a large query
   
   ## Description
   ZooKeeper closes the connection to a client that tries to set data larger than `jute.maxbuffer`, which defaults to 1 MB.
   
   Drill persists its running queries in ZooKeeper. If you issue a large query (bigger than the value of `jute.maxbuffer` on the ZooKeeper server), Drill tries to persist it and [gets a ConnectionLoss exception](https://github.com/mapr/private-zookeeper/blob/1071ebf7f2936414443fab95055775643c6988db/zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java#L533). Curator (the client library that Drill uses to communicate with ZooKeeper) then retries the set command according to its [RetryPolicy](https://github.com/apache/curator/blob/34055bbaeda55f06b8cd47b99c08d69c4edde72e/curator-client/src/main/java/org/apache/curator/RetryPolicy.java#L53). Drill uses the [RetryNTimes](https://github.com/apache/drill/blob/2204d5f51ed33befe234019e0faa321f02cfc61e/exec/java-exec/src/main/java/org/apache/drill/exec/coord/zk/ZKClusterCoordinator.java#L112) policy, which in Drill is configured to retry [7200 times](https://github.com/apache/drill/blob/2204d5f51ed33befe234019e0faa321f02cfc61e/exec/java-exec/src/main/resources/drill-module.conf#L91). While Drill keeps retrying to persist the large query, every attempt loses the connection again (the server cuts it off because the data is too big), and this goes on for around 1 hour. After that, the client that issued the big query receives neither an error nor a result, because the final exception is not processed properly.
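
To see how a fixed-delay retry policy can add up to roughly an hour, here is a back-of-the-envelope sketch. It assumes a sleep of 500 ms between retries; the description above states only the retry count (7200) and the observed duration, so the delay value, like the `RetryBudget` class and its method, is an illustrative assumption. The time spent on each attempt itself is ignored.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not part of Drill or Curator.
final class RetryBudget {

  // Lower bound on the time spent sleeping between retries
  // when every retry waits the same fixed delay.
  static Duration totalSleep(int retryCount, long delayMs) {
    return Duration.ofMillis(Math.multiplyExact((long) retryCount, delayMs));
  }

  public static void main(String[] args) {
    long assumedDelayMs = 500; // assumption, not taken from the PR text
    System.out.println("7200 retries sleep for "
        + totalSleep(7200, assumedDelayMs).toMinutes() + " minutes");
    System.out.println("15 retries sleep for "
        + totalSleep(15, assumedDelayMs).toMillis() + " ms");
  }
}
```

Under that assumption, 7200 retries spend about 60 minutes sleeping, while a count of 15 fails within seconds.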
   
   What I changed:
   1. Drill now compares the size of the data with the client's `jute.maxbuffer` value and, if the data is bigger, throws an IllegalArgumentException that is wrapped into `UserException.executionError`. This still doesn't save Drill from trying to persist too much data into ZooKeeper, because a user can manually change `jute.maxbuffer` on the client or the server side and end up with inconsistent values (the client `jute.maxbuffer` not equal to the server `jute.maxbuffer`). [But as the ZooKeeper documentation says](https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html), a user who changes `jute.maxbuffer` should change it on all the ZooKeeper servers and clients. So in the general case, this check will be enough.
   2. Make Foreman properly process an exception that may be raised from `queryStateProcessor.moveToState`.
   3. Reduce `drill.exec.zk.retry.count` from 7200 to 15.
   4. Add info logs for the case where the ZooKeeper client raises an exception during a set operation, so that the user knows what the data size was and what value of `jute.maxbuffer` Drill has.
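
The size check from item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `JuteBufferCheck` class, its method names, and the example path are hypothetical, and the fallback of `0xfffff` bytes is an assumption chosen to match the 1 MB default mentioned above. The only real API used is `Integer.getInteger`, which reads a JVM system property.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; class and method names are not Drill's.
final class JuteBufferCheck {

  // Assumed fallback, chosen to match the 1 MB default described above.
  static final int ASSUMED_DEFAULT_MAX_BUFFER = 0xfffff;

  // Reads the client-side limit from the JVM system property, if set.
  static int clientMaxBuffer() {
    return Integer.getInteger("jute.maxbuffer", ASSUMED_DEFAULT_MAX_BUFFER);
  }

  // Fails fast instead of letting the server drop the connection.
  // The message carries both sizes, in the spirit of item 4.
  static void checkSize(String path, byte[] data, int maxBuffer) {
    if (data.length > maxBuffer) {
      throw new IllegalArgumentException(String.format(
          "Data for %s is %d bytes, which exceeds jute.maxbuffer (%d bytes)",
          path, data.length, maxBuffer));
    }
  }

  public static void main(String[] args) {
    byte[] small = "select 1".getBytes(StandardCharsets.UTF_8);
    checkSize("/drill/running/example", small, clientMaxBuffer()); // passes

    try {
      checkSize("/drill/running/example", new byte[2 * 1024 * 1024], clientMaxBuffer());
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected before contacting ZooKeeper: " + e.getMessage());
    }
  }
}
```

A request sent to ZooKeeper carries more than the payload (the path and protocol headers, for example), so a real check may want some margin below the limit; the sketch compares the payload size only.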
   
   
   ## Documentation
   Add some information to the [troubleshooting page](https://drill.apache.org/docs/troubleshooting/) on what to do if you caught an exception like the one below and Drill was not responding for a long time:
   ```
   Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /drill/running/1bb91a06-3afe-8152-f3ce-048dd3bef992
           at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
           at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
           at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1672)
           at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1216)
           at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193)
           at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
           at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190)
           at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605)
           at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595)
           at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48)
           at org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:294)
           ... 10 common frames omitted
   ```
   
   
   ## Testing
   Tested manually by executing a huge query like this:
   ```
   select full_name from cp.`employee.json` where full_name in ('Sheri Nowmer', 'Sheri Nowmer', ........)
   ```
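
A query of this shape can be generated rather than typed by hand. The following is a minimal sketch; `LargeQueryBuilder` and the target size are hypothetical, chosen only so that the query text alone exceeds the 1 MB default discussed above. What Drill actually persists for a running query may contain more than the query text.

```java
// Hypothetical test helper for illustration; not part of Drill.
final class LargeQueryBuilder {

  // Builds an IN-list query whose text is at least minBytes long.
  // The text is pure ASCII, so its length in chars equals its size in UTF-8 bytes.
  static String build(int minBytes) {
    StringBuilder sb = new StringBuilder(
        "select full_name from cp.`employee.json` where full_name in ('Sheri Nowmer'");
    while (sb.length() < minBytes) {
      sb.append(", 'Sheri Nowmer'");
    }
    return sb.append(')').toString();
  }

  public static void main(String[] args) {
    String query = build(0xfffff + 1); // just over the assumed 1 MB limit
    System.out.println("Generated query of " + query.length() + " bytes");
  }
}
```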
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

