[
https://issues.apache.org/jira/browse/DRILL-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717382#comment-17717382
]
ASF GitHub Bot commented on DRILL-8426:
---------------------------------------
rymarm opened a new pull request, #2796:
URL: https://github.com/apache/drill/pull/2796
# [DRILL-8426](https://issues.apache.org/jira/browse/DRILL-8426): Fix
endless retrying zk set data for a large query
## Description
ZooKeeper closes the connection to a client that tries to set data larger than
`jute.maxbuffer`. By default, this limit is 1 MB.
Drill persists its running queries in ZooKeeper. If you issue a large query
(bigger than the value of `jute.maxbuffer` on the ZooKeeper server), Drill will
try to persist it and [get a ConnectionLoss
exception](https://github.com/mapr/private-zookeeper/blob/1071ebf7f2936414443fab95055775643c6988db/zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java#L533).
Curator (the client library that Drill uses to communicate with ZooKeeper) will
then retry the set command according to its
[RetryPolicy](https://github.com/apache/curator/blob/34055bbaeda55f06b8cd47b99c08d69c4edde72e/curator-client/src/main/java/org/apache/curator/RetryPolicy.java#L53).
Drill uses the
[RetryNTimes](https://github.com/apache/drill/blob/2204d5f51ed33befe234019e0faa321f02cfc61e/exec/java-exec/src/main/java/org/apache/drill/exec/coord/zk/ZKClusterCoordinator.java#L112)
policy, which in Drill is configured to retry [7200
times](https://github.com/apache/drill/blob/2204d5f51ed33befe234019e0faa321f02cfc61e/exec/java-exec/src/main/resources/drill-module.conf#L91).
While Drill keeps retrying to persist the large query to ZooKeeper, every
attempt loses the connection again (the server cuts it off because the data is
too large), and this goes on for around 1 hour. After that, the client that
issued the big query receives neither an error nor a result, because the final
exception is not handled properly.
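To make the retry arithmetic concrete, here is a minimal sketch of a retry-N-times loop run against an operation that always fails. It is a simplified stand-in, not Curator's actual `RetryNTimes` implementation, and the class and method names are hypothetical. The 500 ms pause between retries is an assumption chosen because it is consistent with the "around 1 hour" observed above; the real delay comes from Drill's configuration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class RetryBudgetSketch {

    /** Worst-case time spent pausing between retries, in milliseconds. */
    static long worstCaseSleepMillis(int retryCount, long sleepBetweenRetriesMillis) {
        return (long) retryCount * sleepBetweenRetriesMillis;
    }

    /**
     * Runs the operation once and then retries it up to retryCount times.
     * Returns the total number of attempts made before success or giving up.
     */
    static int attemptsUntilGiveUp(Callable<Void> operation, int retryCount) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                operation.call();
                return attempts; // succeeded
            } catch (Exception e) {
                if (attempts > retryCount) {
                    return attempts; // retry budget exhausted
                }
                // A real client would sleep here before the next attempt.
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a write that the server rejects every time.
        Callable<Void> alwaysRejected = () -> {
            throw new IOException("connection lost");
        };

        // One initial attempt plus 15 retries.
        System.out.println(attemptsUntilGiveUp(alwaysRejected, 15) + " attempts");

        // 7200 retries with an ASSUMED 500 ms pause between them.
        System.out.println(worstCaseSleepMillis(7200, 500) / 60_000 + " minutes");
    }
}
```

Because an oversized payload is rejected on every attempt, retrying cannot succeed here; the retry count only determines how long the client appears to hang.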
What I changed:
1. Drill will compare the size of the data with the client `jute.maxbuffer`
value and, if it is bigger, throw an `IllegalArgumentException` that will be
wrapped into `UserException.executionError`. This still doesn't fully save Drill
from trying to persist oversized data into ZooKeeper, because a user can manually
change the value of `jute.maxbuffer` on the client or the server side and end up
with inconsistent values (a client `jute.maxbuffer` value that is not equal to
the server `jute.maxbuffer`). [But as the ZooKeeper
documentation says](https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html),
a user who changes the `jute.maxbuffer` value should change it on
all the ZooKeeper servers and clients. So in the general case, this check will
be enough.
2. Make `Foreman` properly handle exceptions that may be raised by
`queryStateProcessor.moveToState`.
3. Reduce `drill.exec.zk.retry.count` from 7200 to 15.
4. Add info logs for the case where the ZooKeeper client raises an exception
during a set operation, so that the user knows what the data size was and what
value of `jute.maxbuffer` Drill has.
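The size check and logging from items 1 and 4 can be sketched roughly as follows. This is not the code from the pull request: the class and method names are hypothetical, `java.util.logging` stands in for Drill's logger, and the fallback value `0xfffff` (just under 1 MB) is an assumption used when the `jute.maxbuffer` system property is not set on the client JVM.

```java
import java.util.logging.Logger;

final class JuteBufferGuardSketch {
    private static final Logger LOG = Logger.getLogger("JuteBufferGuardSketch");

    // ASSUMPTION: fallback limit (just under 1 MB) when the property is unset.
    static final int ASSUMED_DEFAULT_MAX_BUFFER = 0xfffff;

    /** Reads the client-side limit from the jute.maxbuffer system property. */
    static int clientMaxBuffer() {
        return Integer.getInteger("jute.maxbuffer", ASSUMED_DEFAULT_MAX_BUFFER);
    }

    /** Fails fast instead of sending a payload the server would reject. */
    static void ensureFits(String path, byte[] data, int maxBuffer) {
        if (data.length > maxBuffer) {
            throw new IllegalArgumentException(String.format(
                "Data for %s is %d bytes, which exceeds jute.maxbuffer (%d bytes)",
                path, data.length, maxBuffer));
        }
    }

    /**
     * Hypothetical caller: returns an error message to report to the user,
     * or null if the payload passed the check.
     */
    static String tryPersist(String path, byte[] data, int maxBuffer) {
        try {
            ensureFits(path, data, maxBuffer);
            // The actual ZooKeeper write would happen here.
            return null;
        } catch (IllegalArgumentException e) {
            // Log the data size and the limit so the user can see both.
            LOG.info("Rejected write: size=" + data.length + ", jute.maxbuffer=" + maxBuffer);
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        int limit = clientMaxBuffer();
        System.out.println(tryPersist("/drill/running/example", new byte[limit + 1], limit));
    }
}
```

The check runs before any network call, so an oversized query fails once with a readable message instead of consuming the whole retry budget.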
## Documentation
Add some information to the [troubleshooting
page](https://drill.apache.org/docs/troubleshooting/) on what to do if you
catch an exception like the one below and Drill stops responding for a long time:
```
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /drill/running/1bb91a06-3afe-8152-f3ce-048dd3bef992
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1672)
    at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1216)
    at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
    at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190)
    at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605)
    at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595)
    at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48)
    at org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:294)
    ... 10 common frames omitted
```
## Testing
Manual test: I executed a huge query like this:
```
select full_name from cp.`employee.json` where full_name in ('Sheri Nowmer',
'Sheri Nowmer', ........)
```
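For reproducing the manual test, a query of this shape can be generated rather than typed. The sketch below only sizes the SQL text itself; how much larger the data that Drill actually writes to ZooKeeper is remains unspecified here, and the class name and the 1 MB target are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

final class LargeQuerySketch {

    /** Builds the test query, repeating the IN-list value until the text reaches minBytes. */
    static String buildQuery(int minBytes) {
        StringBuilder sql = new StringBuilder(
            "select full_name from cp.`employee.json` where full_name in ('Sheri Nowmer'");
        // The text is ASCII-only, so character count equals byte count in UTF-8.
        while (sql.length() < minBytes) {
            sql.append(", 'Sheri Nowmer'");
        }
        return sql.append(')').toString();
    }

    public static void main(String[] args) {
        // ASSUMED target: just over 1 MB, to exceed a default-sized jute.maxbuffer.
        String query = buildQuery(1024 * 1024 + 1);
        System.out.println(query.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```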
> Endless retrying zk set data for a large query
> ----------------------------------------------
>
> Key: DRILL-8426
> URL: https://issues.apache.org/jira/browse/DRILL-8426
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Reporter: Maksym Rymar
> Assignee: Maksym Rymar
> Priority: Major
>
> If you issue a large query (bigger than 1MB), Drill can fall into an infinite
> loop of retries to set data in ZooKeeper.
> In the ZooKeeper logs you will see repeating errors like this:
> {code:java}
> java.io.IOException: Len error 112569{code}
> In drillbit logs you may see errors like this:
> {code:java}
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /drill/magic-drillbits/ea569524-abaa-41e2-9f69-7857f3a04b6c
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2419)
>     at org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:291)
>     at org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:287)
>     at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:67)
>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:81)
>     at org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:284)
>     at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:270)
>     at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:33)
>     at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.updateService(ServiceDiscoveryImpl.java:208){code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)