[jira] [Commented] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown
[ https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594400#comment-16594400 ]

Gunther Hagleitner commented on TEZ-3980:

+1

> ShuffleRunner: the wake loop needs to check for shutdown
>
>                 Key: TEZ-3980
>                 URL: https://issues.apache.org/jira/browse/TEZ-3980
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Gopal V
>            Assignee: Gopal V
>            Priority: Major
>         Attachments: TEZ-3980.1.patch
>
>
> In the ShuffleRunner threads, there's a loop which does not terminate if the
> task threads get killed.
> {code}
> while ((runningFetchers.size() >= numFetchers ||
>     pendingHosts.isEmpty())
>     && numCompletedInputs.get() < numInputs) {
>   inputContext.notifyProgress();
>   boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
> }
> {code}
> The wakeLoop signal does not exit this out of the loop and is missing a break
> for shut-down.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
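For readers following the issue description, here is a minimal sketch of the kind of shutdown check the loop is said to be missing. It is not the attached TEZ-3980.1.patch and not the actual Tez code. It assumes that wakeLoop is a java.util.concurrent.locks.Condition (which the await(long, TimeUnit) call suggests). The class WakeLoopSketch, the isShutdown flag, and the methods waitForWork, shutdown and runnerExitsOnShutdown are hypothetical names introduced only for illustration, and the loop condition is simplified to the pendingHosts and numCompletedInputs checks.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the runner loop; NOT the actual Tez code or the attached patch.
class WakeLoopSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition wakeLoop = lock.newCondition();
  private final AtomicBoolean isShutdown = new AtomicBoolean(false); // assumed flag name
  private final AtomicInteger numCompletedInputs = new AtomicInteger(0);
  private final Queue<String> pendingHosts = new ConcurrentLinkedQueue<>();
  private final int numInputs;

  WakeLoopSketch(int numInputs) {
    this.numInputs = numInputs;
  }

  /**
   * Waits until there is a pending host, all inputs are complete, or shutdown is requested.
   * Returns false if the wait ended because of shutdown, true otherwise.
   */
  boolean waitForWork(long pollMillis) throws InterruptedException {
    lock.lock();
    try {
      while (pendingHosts.isEmpty() && numCompletedInputs.get() < numInputs) {
        if (isShutdown.get()) {
          return false; // the exit path the loop in the description lacks
        }
        // Timed wait: even if a signal is missed, the flag is re-checked every pollMillis.
        wakeLoop.await(pollMillis, TimeUnit.MILLISECONDS);
      }
      return !isShutdown.get();
    } finally {
      lock.unlock();
    }
  }

  /** Sets the flag first, then signals so a waiting runner re-evaluates the loop at once. */
  void shutdown() {
    isShutdown.set(true);
    lock.lock();
    try {
      wakeLoop.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /** Demo: start a runner with no work, shut down, and report whether the thread exited. */
  static boolean runnerExitsOnShutdown() {
    WakeLoopSketch sketch = new WakeLoopSketch(3);
    Thread runner = new Thread(() -> {
      try {
        sketch.waitForWork(1000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    runner.setDaemon(true);
    try {
      runner.start();
      Thread.sleep(100); // give the runner time to enter the wait
      sketch.shutdown();
      runner.join(5000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !runner.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("runner exited after shutdown: " + runnerExitsOnShutdown());
  }
}
```

Checking the flag inside the loop, rather than relying on the signal alone, means a runner that is woken (or that times out of the wait) always has a way out even when no host is pending and the inputs never complete.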
[jira] [Commented] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown
[ https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589304#comment-16589304 ]

Sergey Shelukhin commented on TEZ-3980:

+1 non-binding
[jira] [Commented] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown
[ https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582757#comment-16582757 ]

Gopal V commented on TEZ-3980:

I ran into this while testing LLAP task pre-emption.
When reducers doing the unsorted shuffle join (or the bloom filter semi-join) are pre-empted, they leave behind a shuffle runner thread.
After 32k threads have leaked, this fails with a "cannot create thread" error in some other, unrelated IPC thread.
[jira] [Commented] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown
[ https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582752#comment-16582752 ]

Gopal V commented on TEZ-3980:

The ShuffleScheduler already has a check for shutdown.get() plus a break inside the loop (it also uses a thread wait).
Right now this is a ShuffleManager-only bug.
[jira] [Commented] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown
[ https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582709#comment-16582709 ]

Kuhu Shukla commented on TEZ-3980:

[~gopalv], just curious: how did you encounter this issue? Did it cause a hang? Any details would be valuable, as we are investigating some other bugs in and around that code base at the moment.
[jira] [Commented] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown
[ https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582598#comment-16582598 ]

Kuhu Shukla commented on TEZ-3980:

Good catch, [~gopalv]. Do we need an equivalent change in ShuffleScheduler as well (the ordered case)?
[jira] [Commented] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown
[ https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582037#comment-16582037 ]

TezQA commented on TEZ-3980:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12935798/TEZ-3980.1.patch
  against master revision 90c8195.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2892//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2892//console

This message is automatically generated.