[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221149#comment-15221149 ]

Xuefu Zhang commented on HIVE-12650:
------------------------------------

+1. Patch looks good to me. Thanks for working on this, Rui!

> Spark-submit is killed when Hive times out. Killing spark-submit doesn't
> cancel AM request. When AM is finally launched, it tries to connect back to
> Hive and gets refused.
> ---
>
> Key: HIVE-12650
> URL: https://issues.apache.org/jira/browse/HIVE-12650
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.1, 1.2.1
> Reporter: JoneZhang
> Assignee: Rui Li
> Attachments: HIVE-12650.1.patch, HIVE-12650.2.patch
>
> I think hive.spark.client.server.connect.timeout should be set greater than
> spark.yarn.am.waitTime. The default value for spark.yarn.am.waitTime is 100s,
> and the default value for hive.spark.client.server.connect.timeout is 90s,
> which is not good. We can increase it to a larger value such as 120s.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
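The relationship the issue description asks for can be sketched as a configuration fragment. The two property names come from the issue itself; the exact value syntax (in particular whether Hive expects milliseconds or a unit suffix here) is an assumption and should be checked against the HiveConf documentation for your version:

```properties
# spark-defaults.conf: how long the YARN AM waits for the driver
# (Spark default per the issue: 100s)
spark.yarn.am.waitTime=100s

# hive-site.xml (or a session-level SET): how long Hive waits for the
# remote driver to connect back. Keep it ABOVE spark.yarn.am.waitTime,
# e.g. 120s; value assumed to be in milliseconds.
hive.spark.client.server.connect.timeout=120000
```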
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221049#comment-15221049 ]

Rui Li commented on HIVE-12650:
-------------------------------

I ran several of the failed tests locally and could not reproduce the failures. [~xuefuz], would you mind taking a look at the patch when you have time? Thanks.
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216086#comment-15216086 ]

Hive QA commented on HIVE-12650:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12795584/HIVE-12650.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 9888 tests executed

*Failed tests:*
{noformat}
TestSparkCliDriver-groupby3_map.q-sample2.q-auto_join14.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-groupby_map_ppr_multi_distinct.q-table_access_keys_stats.q-groupby4_noskew.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-join_rc.q-insert1.q-vectorized_rcfile_columnar.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-ppd_join4.q-join9.q-ppd_join3.q-and-12-more - did not produce a TEST-*.xml file
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7402/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7402/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-7402/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12795584 - PreCommit-HIVE-TRUNK-Build
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211391#comment-15211391 ]

Xuefu Zhang commented on HIVE-12650:
------------------------------------

Thanks, Rui. I think it's fine to list all the possible causes in an error message when we don't actually know the exact one. We can also suggest to the user where to look further (such as the YARN logs). I understand that pre-warming containers complicates things a bit, but I'm not sure about your proposal. Could you provide a patch showing the changes you have in mind?
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211335#comment-15211335 ]

Rui Li commented on HIVE-12650:
-------------------------------

I think the difficult part is that we really don't know the possible reasons. All we get is a timeout; it could be due to a network issue, an exception, or the RSC simply being busy. Another possible refinement is to make the behavior more consistent. As I said, there are now two paths that can lead to a timeout/failure, and the user sees a different error message for each. How about removing the timeout at {{RemoteHiveSparkClient#createRemoteClient#getExecutorCount}}? I mean, after a certain amount of time we can give up on the pre-warm and eventually fail the job in the job monitor.
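The proposal above can be illustrated with a small sketch. This is not Hive's actual code: the function and parameter names are made up, and {{get_executor_count}} stands in for whatever call polls the remote driver. The point is only the control flow, i.e. a pre-warm timeout gives up the wait rather than failing the job, leaving the final decision to the job monitor:

```python
import concurrent.futures
import time

def wait_for_executors(get_executor_count, prewarm_timeout_s):
    """Illustrative only: poll for executors during pre-warm, but on
    timeout give up the pre-warm instead of failing the job. The job
    monitor makes the final pass/fail decision later."""
    deadline = time.monotonic() + prewarm_timeout_s
    while time.monotonic() < deadline:
        try:
            if get_executor_count(timeout=1.0) > 0:
                return True   # executors are up; pre-warm succeeded
        except concurrent.futures.TimeoutError:
            pass              # driver not responsive yet; keep waiting
        time.sleep(0.1)
    return False              # give up pre-warm; do NOT fail the job here
```

With this shape, both the pre-warm-on and pre-warm-off paths end up funneling any failure through the same place, which is the consistency Rui is asking for.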
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208398#comment-15208398 ]

Xuefu Zhang commented on HIVE-12650:
------------------------------------

I was thinking of just improving the current message, maybe by naming the different possibilities when a timeout occurs. The new timeout you mentioned doesn't seem very helpful, as yarn-cluster is what we recommend.
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205603#comment-15205603 ]

Rui Li commented on HIVE-12650:
-------------------------------

Regarding a better error message, do you think we can throw a timeout exception if the SparkContext is not up after a certain amount of time? Otherwise the user only gets a timeout on the future and doesn't know the cause. On the other hand, this means adding another property, and I think it only works for yarn-client mode.
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201029#comment-15201029 ]

Rui Li commented on HIVE-12650:
-------------------------------

Here are my findings so far (for yarn-client mode).
# If the cluster has no resources available, the {{RemoteDriver}} is blocked creating the SparkContext. The RPC channel has already been set up at this point, so {{SparkClientImpl}} believes the driver is up.
# There are two points where Hive can time out. If container pre-warm is enabled, we time out at {{RemoteHiveSparkClient#createRemoteClient#getExecutorCount}}. If pre-warm is off, we time out at {{RemoteSparkJobMonitor#startMonitor}}. Both happen because the SparkContext has not been created and the RemoteDriver can't respond to requests. In the latter case, the job status remains {{SENT}} until the timeout; ideally it should be {{QUEUE}} instead. My understanding is that the RPC handler in the RemoteDriver is blocked by the {{ADD JAR/FILE}} calls we submit before the real job.
# Currently, YARN doesn't time out a starving application, which means the app will eventually get served. Refer to YARN-3813 and YARN-2266.

Based on these findings, I think we have to decide whether Hive should time out in this situation. Waiting is reasonable on a busy cluster. On the other hand, it seems difficult to tell whether we're blocked for lack of resources; I'm not sure whether Spark provides facilities for that. For yarn-cluster mode this may be even more difficult, because the RemoteDriver is not running in that case and we have less information. What do you guys think?
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193174#comment-15193174 ]

Rui Li commented on HIVE-12650:
-------------------------------

The timeout is necessary in case the RSC crashes due to some error. But the issue here shows that a timeout could also happen because the RSC is simply waiting for resources on a busy cluster. I think we need a way to distinguish these two scenarios and not time out in the latter case.