[ https://issues.apache.org/jira/browse/HIVE-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270719#comment-14270719 ]
Chengxiang Li commented on HIVE-9323:
-------------------------------------

[~Szehon], I took a look at the Hive log. The cause of the failure is quite strange and a little different from HIVE-9094. HIVE-9094 failed because getting the executor count timed out: the Spark cluster took longer to launch than the Spark client's future timeout interval (5s, raised to 30s by HIVE-9094). This failure is a timeout because RemoteDriver did not respond in time (the Spark client waits 10s for RemoteDriver to register).

From hive.log, the RemoteDriver process is launched at 2015-01-08 18:43:03,938:
{noformat}
2015-01-08 18:43:03,938 DEBUG [main]: client.SparkClientImpl (SparkClientImpl.java:startDriver(298)) - Running client driver with argv: /home/hiveptest/54.177.142.77-hiveptest-1/apache-svn-spark-source/itests/qtest-spark/../../itests/qtest-spark/target/spark/bin/spark-submit --properties-file /home/hiveptest/54.177.142.77-hiveptest-1/apache-svn-spark-source/itests/qtest-spark/target/tmp/spark-submit.1097041260552550316.properties --class org.apache.hive.spark.client.RemoteDriver /home/hiveptest/54.177.142.77-hiveptest-1/maven/org/apache/hive/hive-exec/0.15.0-SNAPSHOT/hive-exec-0.15.0-SNAPSHOT.jar --remote-host ip-10-228-130-250.us-west-1.compute.internal --remote-port 40406
{noformat}
In spark.log, RemoteDriver registers back to SparkClient at 2015-01-08 18:43:13,891, which should be just past the 10s timeout interval.
{noformat}
2015-01-08 18:43:13,891 DEBUG [Driver-RPC-Handler-0]: rpc.RpcDispatcher (RpcDispatcher.java:registerRpc(185)) - [DriverProtocol] Registered outstanding rpc 0 (org.apache.hive.spark.client.rpc.Rpc$Hello).
{noformat}
The strange thing is that the RemoteDriver process is unusually slow: it is launched at 2015-01-08 18:43:03,938, but its first log message only appears at 2015-01-08 18:43:13,161, and RemoteDriver does hardly anything before that message.
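For reference, the gaps between the three timestamps mentioned in this comment can be worked out with a short Python sketch. The variable names are illustrative and do not come from Hive, and the sketch assumes that hive.log and spark.log were written against the same clock. The log lines do not show when the client's 10s wait actually started, so the launch-to-register figure is only an approximation of what the client measured.

```python
from datetime import datetime

# Timestamp layout used by the log lines quoted in this comment
# (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(text):
    """Parse a log timestamp such as '2015-01-08 18:43:03,938'."""
    return datetime.strptime(text, LOG_TS_FORMAT)


# Values copied from the log lines in this comment.
launched = parse_ts("2015-01-08 18:43:03,938")    # spark-submit started (hive.log)
first_log = parse_ts("2015-01-08 18:43:13,161")   # first RemoteDriver message
registered = parse_ts("2015-01-08 18:43:13,891")  # Hello rpc registered (spark.log)

# Gaps in seconds between the events.
startup_gap = (first_log - launched).total_seconds()
launch_to_register = (registered - launched).total_seconds()
connect_to_register = (registered - first_log).total_seconds()

print(f"launch -> first RemoteDriver log: {startup_gap:.3f}s")
print(f"launch -> registration:           {launch_to_register:.3f}s")
print(f"first log -> registration:        {connect_to_register:.3f}s")
```

Almost all of the time falls before RemoteDriver's first log message, and registration follows in well under a second once it starts connecting. The RemoteDriver message used for `first_log` is the one quoted next.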
{noformat}
2015-01-08 18:43:13,161 INFO [main]: client.RemoteDriver (RemoteDriver.java:<init>(118)) - Connecting to: ip-10-228-130-250.us-west-1.compute.internal:40406
{noformat}
I am not sure why this happens, but it should be quite rare. We can check whether it happens again. Besides increasing the timeout interval, I don't have a good solution for this issue right now.

> Merge from trunk to spark 1/8/2015
> ----------------------------------
>
>                 Key: HIVE-9323
>                 URL: https://issues.apache.org/jira/browse/HIVE-9323
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Szehon Ho
>            Assignee: Szehon Ho
>             Fix For: spark-branch
>
>         Attachments: HIVE-9323-spark.patch, HIVE-9323.2-spark.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)