Rahul Challapalli created DRILL-5289:
----------------------------------------

             Summary: Drill should handle OOM due to insufficient heap type of errors more gracefully
                 Key: DRILL-5289
                 URL: https://issues.apache.org/jira/browse/DRILL-5289
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Flow, Execution - RPC
    Affects Versions: 1.10.0
            Reporter: Rahul Challapalli


[Git Commit ID will be updated soon]

The query below, which uses the managed sort, causes an OOM error due to 
insufficient heap. That is a bug in itself. 
{code}
ALTER SESSION SET `exec.sort.disable_managed` = false;
+-------+-------------------------------------+
|  ok   |               summary               |
+-------+-------------------------------------+
| true  | exec.sort.disable_managed updated.  |
+-------+-------------------------------------+
1 row selected (1.096 seconds)
0: jdbc:drill:zk=10.10.100.183:5181> alter session set 
`planner.memory.max_query_memory_per_node` = 14106127360;
+-------+----------------------------------------------------+
|  ok   |                      summary                       |
+-------+----------------------------------------------------+
| true  | planner.memory.max_query_memory_per_node updated.  |
+-------+----------------------------------------------------+
1 row selected (0.253 seconds)
0: jdbc:drill:zk=10.10.100.183:5181> alter session set 
`planner.width.max_per_node` = 1;
+-------+--------------------------------------+
|  ok   |               summary                |
+-------+--------------------------------------+
| true  | planner.width.max_per_node updated.  |
+-------+--------------------------------------+
1 row selected (0.184 seconds)
0: jdbc:drill:zk=10.10.100.183:5181> select * from (select * from 
dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d where 
d.columns[0] = 'ljdfhwuehnoiueyf';
{code}
Once the OOM happens, chaos follows:
{code}
1. Dangling fragments are left behind
2. Query fails but ZooKeeper thinks it's still running
3. Client connections time out
4. Profile page shows the same query as both running and failed.
{code}

We should handle this situation more gracefully, as it could be perceived 
as a drillbit stability issue. I have attached the jstack. The logs and data set 
used are too big to upload here. Reach out to me if you need more information.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
