[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.

2016-03-31 Thread Xuefu Zhang (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221149#comment-15221149 ]

Xuefu Zhang commented on HIVE-12650:


+1. Patch looks good to me. Thanks for working on this, Rui!

> Spark-submit is killed when Hive times out. Killing spark-submit doesn't 
> cancel AM request. When AM is finally launched, it tries to connect back to 
> Hive and gets refused.
> ---
>
> Key: HIVE-12650
> URL: https://issues.apache.org/jira/browse/HIVE-12650
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.2.1
>Reporter: JoneZhang
>Assignee: Rui Li
> Attachments: HIVE-12650.1.patch, HIVE-12650.2.patch
>
>
> I think hive.spark.client.server.connect.timeout should be set greater than 
> spark.yarn.am.waitTime. The default value for 
> spark.yarn.am.waitTime is 100s, and the default value for 
> hive.spark.client.server.connect.timeout is 90s, which is not good. We can 
> increase it to a larger value such as 120s.
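
For illustration, a minimal sketch of the tuning the description suggests, assuming the documented defaults (90000ms for the Hive property, 100s for the Spark one); the class name and values are only an example, not a recommendation from the ticket:

{code:java}
// Sketch only: raise the Hive-side connect timeout above spark.yarn.am.waitTime,
// so Hive keeps waiting while YARN is still launching the application master.
import org.apache.hadoop.hive.conf.HiveConf;

public class ConnectTimeoutTuningSketch {
  public static HiveConf tunedConf() {
    HiveConf conf = new HiveConf();
    // Hive default is 90000ms; bump it past spark.yarn.am.waitTime (Spark default 100s).
    conf.set("hive.spark.client.server.connect.timeout", "120000ms");
    conf.set("spark.yarn.am.waitTime", "100s");
    return conf;
  }
}
{code}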



[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.

2016-03-31 Thread Rui Li (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221049#comment-15221049 ]

Rui Li commented on HIVE-12650:
---

I tried several of the failed tests locally and could not reproduce the failures.
[~xuefuz], would you mind taking a look at the patch when you have time? Thanks.



[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.

2016-03-29 Thread Hive QA (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216086#comment-15216086 ]

Hive QA commented on HIVE-12650:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12795584/HIVE-12650.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 9888 tests executed
*Failed tests:*
{noformat}
TestSparkCliDriver-groupby3_map.q-sample2.q-auto_join14.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-groupby_map_ppr_multi_distinct.q-table_access_keys_stats.q-groupby4_noskew.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-join_rc.q-insert1.q-vectorized_rcfile_columnar.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-ppd_join4.q-join9.q-ppd_join3.q-and-12-more - did not produce a TEST-*.xml file
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7402/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7402/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-7402/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12795584 - PreCommit-HIVE-TRUNK-Build



[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.

2016-03-24 Thread Xuefu Zhang (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211391#comment-15211391 ]

Xuefu Zhang commented on HIVE-12650:


Thanks, Rui. I think it's fine to list all the possible causes in an error message when we don't actually know the exact one. We can also suggest to the user where to look further (such as the YARN logs).

I understand that pre-warming containers complicates things a bit, but I'm not sure about your proposal. Could you provide a patch showing the changes you have in mind?
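
For illustration, one hypothetical shape such a cause-listing message could take; the class, method, and parameters below are made up for the sketch, not actual Hive code:

{code:java}
// Hypothetical sketch of a timeout error that names the likely causes and
// points the user at the YARN logs; timeoutSecs and cause come from the caller.
public class ConnectTimeoutMessageSketch {
  static RuntimeException connectTimeout(long timeoutSecs, Throwable cause) {
    String msg = String.format(
        "Timed out waiting for the remote Spark driver to connect back within %d seconds. "
            + "Possible causes: the cluster has no free resources yet, the driver failed "
            + "to start, or there is a network problem. Check the YARN application logs "
            + "for details.",
        timeoutSecs);
    return new RuntimeException(msg, cause);
  }
}
{code}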



[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.

2016-03-24 Thread Rui Li (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211335#comment-15211335 ]

Rui Li commented on HIVE-12650:
---

I think the difficult part is that we really don't know the possible reasons. All we get is a timeout; it could be due to a network issue, an exception, or the RSC simply being busy.

Another possible refinement is to make the behavior more consistent. Like I said, there are now two paths that can lead to a timeout/failure, and the user will see different error messages. How about removing the timeout at {{RemoteHiveSparkClient#createRemoteClient#getExecutorCount}}? I mean, after a certain amount of time we can give up the pre-warm and eventually fail the job at the job monitor.
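
As a rough illustration of this refinement, a sketch under assumptions: {{getExecutorCountFuture}} stands in for the future the real client waits on, and the class is hypothetical, not the actual {{RemoteHiveSparkClient}} code:

{code:java}
// Sketch: on pre-warm timeout, log and continue instead of failing the session;
// if the cluster never frees up, the job still fails later at the job monitor.
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PrewarmSketch {
  static void prewarm(Future<Integer> getExecutorCountFuture, long timeoutMs) {
    try {
      int executors = getExecutorCountFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
      System.out.println("Pre-warmed with " + executors + " executors");
    } catch (TimeoutException e) {
      // Give up pre-warming rather than throwing here; the monitor path
      // decides later whether the job ultimately fails.
      System.err.println("Timed out waiting for executors; skipping pre-warm");
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    }
  }
}
{code}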



[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.

2016-03-23 Thread Xuefu Zhang (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208398#comment-15208398 ]

Xuefu Zhang commented on HIVE-12650:


I was thinking of just improving the current message, maybe by naming the different possibilities when a timeout occurs. The new timeout you mentioned doesn't seem very helpful, as yarn-cluster is what we recommend.



[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.

2016-03-21 Thread Rui Li (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205603#comment-15205603 ]

Rui Li commented on HIVE-12650:
---

Regarding a better error message, do you think we could throw a timeout exception if the SparkContext is not up after a certain amount of time? Otherwise the user only gets a timeout on the future and doesn't know the cause. On the other hand, this means adding another property, and I think it only works for yarn-client.
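
As a sketch of that idea (the {{contextUp}} flag and {{onTimeout}} callback are assumptions for illustration, not existing code):

{code:java}
// Sketch: a watchdog that surfaces an explicit cause if SparkContext creation
// is still blocked after a configurable wait (illustrative, yarn-client only).
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class ContextWatchdogSketch {
  // contextUp would be flipped to true right after SparkContext creation succeeds;
  // onTimeout is whatever tears the driver down with the given error.
  static void watch(AtomicBoolean contextUp, long waitSecs, Consumer<Throwable> onTimeout) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.schedule(() -> {
      if (!contextUp.get()) {
        onTimeout.accept(new TimeoutException(
            "SparkContext not up after " + waitSecs + "s (cluster may lack resources)"));
      }
      scheduler.shutdown();
    }, waitSecs, TimeUnit.SECONDS);
  }
}
{code}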



[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.

2016-03-19 Thread Rui Li (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201029#comment-15201029 ]

Rui Li commented on HIVE-12650:
---

Here are my findings so far (for yarn-client mode).

# If the cluster has no resources available, the {{RemoteDriver}} is blocked creating the SparkContext. The RPC has already been set up at this point, so {{SparkClientImpl}} believes the driver is up.
# There are two points at which Hive can time out. If container pre-warm is enabled, we time out at {{RemoteHiveSparkClient#createRemoteClient#getExecutorCount}}; if pre-warm is off, we time out at {{RemoteSparkJobMonitor#startMonitor}}. Both happen because the SparkContext is not created and {{RemoteDriver}} can't respond to requests. In the latter case, the job status remains {{SENT}} until the timeout, when ideally it should be {{QUEUE}} instead. My understanding is that the RPC handler in {{RemoteDriver}} is blocked by the {{ADD JAR/FILE}} calls we submit before the real job (a rough sketch of this path follows the list).
# Currently YARN doesn't time out a starving application, which means the app will eventually get served. Refer to YARN-3813 and YARN-2266.
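
As a rough sketch of the second path (not the actual {{RemoteSparkJobMonitor}} code; the {{Job}} and {{State}} types below are simplified stand-ins for illustration):

{code:java}
// Simplified sketch: the driver is blocked creating the SparkContext, so the
// job never leaves SENT and the monitor loop eventually gives up.
public class MonitorSketch {
  enum State { SENT, QUEUE, STARTED } // simplified stand-ins for the real job states

  interface Job { State getState(); }

  static void monitor(Job job, long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (job.getState() == State.SENT) {
      if (System.currentTimeMillis() > deadline) {
        throw new RuntimeException("Job hasn't been submitted after " + timeoutMs + " ms");
      }
      Thread.sleep(1000); // poll until the driver responds or we time out
    }
  }
}
{code}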

Based on these findings, I think we have to decide whether Hive should time out in this situation. Waiting is reasonable on a busy cluster, but on the other hand it seems difficult to tell whether we're blocked for lack of resources, and I'm not sure whether Spark has facilities for that. For yarn-cluster mode this may be even more difficult, because {{RemoteDriver}} is not running in that case and we'll have less information.
What do you guys think?



[jira] [Commented] (HIVE-12650) Spark-submit is killed when Hive times out. Killing spark-submit doesn't cancel AM request. When AM is finally launched, it tries to connect back to Hive and gets refused.

2016-03-14 Thread Rui Li (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193174#comment-15193174 ]

Rui Li commented on HIVE-12650:
---

The timeout is necessary in case the RSC crashes due to some error. But this issue shows that the timeout could also fire because the RSC is simply waiting for resources on a busy cluster. I think we need a way to distinguish these two scenarios and not time out in the latter.
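
For what it's worth, one assumed approach (not existing Hive code) would be to ask YARN whether the application is still queued; a starving app stays in {{ACCEPTED}} or an earlier state, while a launched or crashed AM moves it on:

{code:java}
// Sketch: distinguish "still waiting for resources" from "RSC crashed" via YARN.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class AmStateProbeSketch {
  static boolean waitingForResources(Configuration conf, ApplicationId appId) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    try {
      yarn.init(conf);
      yarn.start();
      YarnApplicationState s = yarn.getApplicationReport(appId).getYarnApplicationState();
      // NEW/NEW_SAVING/SUBMITTED/ACCEPTED: still queued, so don't time out yet;
      // any other state means the AM launched (or died) and the timeout should apply.
      return s == YarnApplicationState.NEW
          || s == YarnApplicationState.NEW_SAVING
          || s == YarnApplicationState.SUBMITTED
          || s == YarnApplicationState.ACCEPTED;
    } finally {
      yarn.stop();
    }
  }
}
{code}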




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)