[ https://issues.apache.org/jira/browse/DRILL-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167006#comment-16167006 ]
Khurram Faraaz commented on DRILL-4595:
---------------------------------------

Verified on Drill 1.12.0, commit id aaff1b35b7339fb4e6ab480dd517994ff9f0a5c5.

Lowered the memory:

{noformat}
export DRILL_HEAP=${DRILL_HEAP:-"1G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"1G"}
{noformat}

Ran a long-running CTAS:

{noformat}
0: jdbc:drill:schema=dfs.tmp> CREATE TABLE tbl_4595 PARTITION BY (key2) AS SELECT * FROM `twoKeyJsn.json` t;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 26212355                   |
+-----------+----------------------------+
1 row selected (511.547 seconds)
{noformat}

> FragmentExecutor.fail() should interrupt the fragment thread to avoid
> possible query hangs
> ------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4595
>                 URL: https://issues.apache.org/jira/browse/DRILL-4595
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: Future
>
>
> When a fragment fails, it's assumed it will be able to close itself and send
> its FAILED state to the foreman, which will cancel any running fragments.
> FragmentExecutor.cancel() will interrupt the thread, making sure those
> fragments don't stay blocked.
> However, if a fragment is already blocked when its fail method is called, the
> foreman may never be notified about this and the query will hang forever. One
> such scenario is the following:
> - generally it's a CTAS running on a large cluster (lots of writers running
> in parallel)
> - logs show that the user channel was closed and UserServer caused the root
> fragment to move to a FAILED state
> - jstack shows that the root fragment is blocked in its receiver waiting for
> data
> - jstack also shows that ALL other fragments are no longer running, and the
> logs show that all of them succeeded
> - the foreman waits *forever* for the root fragment to finish

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
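
For readers who want to see the hang in isolation, here is a minimal, self-contained Java sketch of the behaviour described in the issue. It is not Drill code: the class `FailInterruptSketch`, the nested `Fragment` class, the `reportedToForeman` flag and the `failUnblocksFragment` helper are all hypothetical names made up for illustration, and a `LinkedBlockingQueue` stands in for the receiver the root fragment is blocked on. The only thing it tries to show is that a thread already blocked when `fail()` is called wakes up and reports its state only if `fail()` also interrupts it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not Drill's FragmentExecutor.
final class FailInterruptSketch {

    /** Stand-in for a fragment whose thread blocks in its receiver waiting for data. */
    static final class Fragment implements Runnable {
        private final BlockingQueue<String> receiver = new LinkedBlockingQueue<>();
        private final boolean interruptOnFail;
        private volatile Thread runner;
        private volatile boolean failed;
        volatile boolean reportedToForeman;

        Fragment(boolean interruptOnFail) {
            this.interruptOnFail = interruptOnFail;
        }

        @Override
        public void run() {
            runner = Thread.currentThread();
            try {
                receiver.take(); // no data ever arrives, so this blocks indefinitely
            } catch (InterruptedException e) {
                // Woken up by fail(); fall through to the reporting step.
            }
            // Only reached once the thread is unblocked.
            reportedToForeman = failed;
        }

        /** Marks the fragment as failed and, optionally, interrupts its thread. */
        void fail() {
            failed = true;
            Thread t = runner;
            if (interruptOnFail && t != null) {
                t.interrupt();
            }
        }
    }

    /** Returns true if the blocked fragment finished and reported its failure after fail(). */
    static boolean failUnblocksFragment(boolean interruptOnFail) {
        Fragment fragment = new Fragment(interruptOnFail);
        Thread thread = new Thread(fragment, "fragment");
        thread.setDaemon(true); // a stuck fragment must not keep the JVM alive
        thread.start();
        try {
            // Wait until the fragment thread is parked inside take().
            while (thread.getState() != Thread.State.WAITING) {
                Thread.sleep(10);
            }
            fragment.fail();
            thread.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !thread.isAlive() && fragment.reportedToForeman;
    }

    public static void main(String[] args) {
        System.out.println("fail() with interrupt:    reported = " + failUnblocksFragment(true));
        System.out.println("fail() without interrupt: reported = " + failUnblocksFragment(false));
    }
}
```

In this sketch the variant without the interrupt stays parked in `take()` after `fail()` returns, which mirrors the foreman waiting forever for the root fragment; the variant with the interrupt leaves `take()` and reaches the reporting step.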