Rahul Challapalli created DRILL-5289:
----------------------------------------

Summary: Drill should handle OOM due to insufficient heap type of errors more gracefully
Key: DRILL-5289
URL: https://issues.apache.org/jira/browse/DRILL-5289
Project: Apache Drill
Issue Type: Bug
Components: Execution - Flow, Execution - RPC
Affects Versions: 1.10.0
Reporter: Rahul Challapalli

[Git Commit ID will be updated soon]

The query below, which uses the managed sort, causes an OOM error due to insufficient heap. That is a bug in itself.

{code}
ALTER SESSION SET `exec.sort.disable_managed` = false;
+-------+-------------------------------------+
|  ok   |               summary               |
+-------+-------------------------------------+
| true  | exec.sort.disable_managed updated.  |
+-------+-------------------------------------+
1 row selected (1.096 seconds)
0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.memory.max_query_memory_per_node` = 14106127360;
+-------+----------------------------------------------------+
|  ok   |                      summary                       |
+-------+----------------------------------------------------+
| true  | planner.memory.max_query_memory_per_node updated.  |
+-------+----------------------------------------------------+
1 row selected (0.253 seconds)
0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.width.max_per_node` = 1;
+-------+--------------------------------------+
|  ok   |               summary                |
+-------+--------------------------------------+
| true  | planner.width.max_per_node updated.  |
+-------+--------------------------------------+
1 row selected (0.184 seconds)
0: jdbc:drill:zk=10.10.100.183:5181> select * from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf';
{code}

Once the OOM happens, chaos follows:

{code}
1. Dangling fragments are left behind
2. Query fails but ZooKeeper thinks it's still running
3. Client connections time out
4. Profile page shows the same query as both running and failed
{code}

We should handle this situation more gracefully, as it could be perceived as a Drillbit stability issue.

I attached the jstack. The logs and data set used are too big to upload here. Reach out to me if you need more information.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)