[jira] [Commented] (DRILL-4595) FragmentExecutor.fail() should interrupt the fragment thread to avoid possible query hangs

Khurram Faraaz (JIRA) Thu, 14 Sep 2017 14:44:25 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167006#comment-16167006
 ]


Khurram Faraaz commented on DRILL-4595:
---------------------------------------

Verified on Drill 1.12.0 commit id  aaff1b35b7339fb4e6ab480dd517994ff9f0a5c5

lowered the memory
{noformat}
export DRILL_HEAP=${DRILL_HEAP:-"1G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"1G"}
{noformat}

Ran a long running CTAS
{noformat}
0: jdbc:drill:schema=dfs.tmp> CREATE TABLE tbl_4595 PARTITION BY (key2) AS 
SELECT * FROM `twoKeyJsn.json` t;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 26212355                   |
+-----------+----------------------------+
1 row selected (511.547 seconds)
{noformat}

> FragmentExecutor.fail() should interrupt the fragment thread to avoid 
> possible query hangs
> ------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4595
>                 URL: https://issues.apache.org/jira/browse/DRILL-4595
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: Future
>
>
> When a fragment fails it's assumed it will be able to close itself and send 
> it's FAILED state to the foreman which will cancel any running fragments. 
> FragmentExecutor.cancel() will interrupt the thread making sure those 
> fragment don't stay blocked.
> However, if a fragment is already blocked when it's fail method is called the 
> foreman may never be notified about this and the query will hang forever. One 
> such scenario is the following:
> - generally it's a CTAS running on a large cluster (lot's of writers running 
> in parallel)
> - logs show that the user channel was closed and UserServer caused the root 
> fragment to move to a FAILED state
> - jstack shows that the root fragment is blocked in it's receiver waiting for 
> data
> - jstack also shows that ALL other fragments are no longer running, and the 
> logs show that all of them succeeded
> - the foreman waits *forever* for the root fragment to finish



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (DRILL-4595) FragmentExecutor.fail() should interrupt the fragment thread to avoid possible query hangs

Reply via email to