[jira] [Commented] (HIVE-16071) Spark remote driver misuses the timeout in RPC handshake

Chaoyu Tang (JIRA) Thu, 02 Mar 2017 19:45:07 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893684#comment-15893684
 ]


Chaoyu Tang commented on HIVE-16071:
------------------------------------

For the error I mentioned in this JIRA description, it was caused by the 
[timeout|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L176]
 set in HS2 registerClient. It happened when SaslServerHandler and 
SaslClientHandler were undergoing handshaking. The registerClient in 
SparkClientImpl threw out an error which interrupted the process calling 
spark-submit, therefore ending the channel between HS2 and RemoteClient. The 
channel termination was detected by SaslClientHandler.channelInactive at 
RemoteDriver, which in term invoked SaslClientHandler.dispose(), therefore we 
saw the SASLException at RemoteDriver side with msg "SaslException: Client 
closed before SASL negotiation finished." I have managed to reproduce this 
error by adjusting the hive.spark.client.server.connect.timeout value to make 
the HS2 timeout happen during SASL negotiation but RemoteDriver has not reached 
its own timeout. 
Looking more into the code, I think that the 
[cancelTask|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L100]
 never has a chance to effect. It has the same timeout value as that used for 
[registerClient|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L170],
 and the latter one always kicks in before it.
Timeout at RemoteDriver side could happen at two places. If it happens when 
driver 
[connects|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/Rpc.java#L110]
 back to HS2, HS2 could not detect this timeout error and has to wait until its 
own timeout set in  
[registerClient|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L170]
 effects. If the 
[timeout|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/Rpc.java#L122]
 happens during SASL handshaking, the RemoteDriver main will exit abnormally. 
The SaslServerHandler.channelInactive at HS2 side could detect this channel 
termination and invokes the SaslServerHandler.dispose, which in term cancels 
this cancelTask. Depending on the stage where HS2 is at (see following code 
snippet) 
{code}
    protected void onError(Throwable error) {
      cancelTask.cancel(true);
      if (client != null) {
        client.timeoutFuture.cancel(true);
        if (!client.promise.isDone()) {
          client.promise.setFailure(error);
        }
      }
    }
{code}
HS2 has to wait until its hive.spark.client.server.connect.timeout if the 
clientInfo is null, or the process could terminate immediately.
So the 
[cancelTask|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L100]
 is not used in the code unless its timeout is set with a different value 
shorter than that set for registerClient 
(hive.spark.client.server.connect.timeout).


> Spark remote driver misuses the timeout in RPC handshake
> --------------------------------------------------------
>
>                 Key: HIVE-16071
>                 URL: https://issues.apache.org/jira/browse/HIVE-16071
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Chaoyu Tang
>            Assignee: Chaoyu Tang
>         Attachments: HIVE-16071.patch
>
>
> Based on its property description in HiveConf and the comments in HIVE-12650 
> (https://issues.apache.org/jira/browse/HIVE-12650?focusedCommentId=15128979&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15128979),
>  hive.spark.client.connect.timeout is the timeout when the spark remote 
> driver makes a socket connection (channel) to RPC server. But currently it is 
> also used by the remote driver for RPC client/server handshaking, which is 
> not right. Instead, hive.spark.client.server.connect.timeout should be used 
> and it has already been used by the RPCServer in the handshaking.
> The error like following is usually caused by this issue, since the default 
> hive.spark.client.connect.timeout value (1000ms) used by remote driver for 
> handshaking is a little too short.
> {code}
> 17/02/20 08:46:08 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
> Client closed before SASL negotiation finished.
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
> Client closed before SASL negotiation finished.
>         at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>         at 
> org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156)
>         at 
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
> Caused by: javax.security.sasl.SaslException: Client closed before SASL 
> negotiation finished.
>         at 
> org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:453)
>         at 
> org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (HIVE-16071) Spark remote driver misuses the timeout in RPC handshake

Reply via email to