[jira] [Comment Edited] (HIVE-16071) Spark remote driver misuses the timeout in RPC handshake

2017-03-01 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891686#comment-15891686
 ] 

Rui Li edited comment on HIVE-16071 at 3/2/17 6:50 AM:
---

Hi [~ctang.ma], for the issue mentioned in this JIRA, I think you can try 
increasing the value of the client timeout from its default. I also suggest you 
check whether HS2 itself has any problems.

Regarding the timeout in the RpcServer constructor, I tried making it use the 
client-side timeout and successfully ran some simple queries. I'm not sure if we 
can remove it. Looking at the code, that timeout task is responsible for closing 
the RPC, and thus the underlying channel. We might as well keep it, but drive it 
with the client timeout.
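To illustrate what I mean, here is a rough sketch (not the actual RpcServer code; 
the names and the completion signal are made up) of keeping the timeout task that 
closes the RPC/channel, but arming it with the client-side connect timeout:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class HandshakeTimeoutSketch {
  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    long clientConnectTimeoutMs = 1000;  // stands in for hive.spark.client.connect.timeout

    // If the handshake has not finished by the deadline, close the RPC and its channel.
    ScheduledFuture<?> cancelTask = scheduler.schedule(
        () -> System.out.println("handshake deadline hit; closing the RPC channel"),
        clientConnectTimeoutMs, TimeUnit.MILLISECONDS);

    boolean handshakeCompleted = true;  // placeholder for the real SASL completion signal
    if (handshakeCompleted) {
      cancelTask.cancel(true);  // successful handshake: the channel stays open
    }
    scheduler.shutdown();
  }
}
{code}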


was (Author: lirui):
Hi [~ctang.ma], for the issue mentioned in this JIRA, I think you can try 
increasing the value of the client timeout from its default. I also suggest you 
check whether HS2 itself has any problems.

Regarding the timeout in the RpcServer constructor, I tried making it use the 
client-side timeout and successfully ran some simple queries. I'm not sure if we 
can remove it. Looking at the code, that timeout task is responsible for closing 
the RPC, and thus the underlying channel. We might as well keep it.

> Spark remote driver misuses the timeout in RPC handshake
> 
>
> Key: HIVE-16071
> URL: https://issues.apache.org/jira/browse/HIVE-16071
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-16071.patch
>
>
> Based on its property description in HiveConf and the comments in HIVE-12650 
> (https://issues.apache.org/jira/browse/HIVE-12650?focusedCommentId=15128979&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15128979),
>  hive.spark.client.connect.timeout is the timeout for the Spark remote driver 
> to make a socket connection (channel) to the RPC server. But currently it is 
> also used by the remote driver for the RPC client/server handshake, which is 
> not right. Instead, hive.spark.client.server.connect.timeout should be used; 
> it is already used by the RpcServer for the handshake.
> An error like the following is usually caused by this issue, since the default 
> hive.spark.client.connect.timeout value (1000ms) used by the remote driver for 
> the handshake is a little too short.
> {code}
> 17/02/20 08:46:08 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
> Client closed before SASL negotiation finished.
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
> Client closed before SASL negotiation finished.
> at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> at 
> org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156)
> at 
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
> Caused by: javax.security.sasl.SaslException: Client closed before SASL 
> negotiation finished.
> at 
> org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:453)
> at 
> org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (HIVE-16071) Spark remote driver misuses the timeout in RPC handshake

2017-03-02 Thread Chaoyu Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893684#comment-15893684
 ] 

Chaoyu Tang edited comment on HIVE-16071 at 3/3/17 4:04 AM:


For the error I mentioned in the JIRA description, it was caused by the 
[timeout|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L176]
 set in HS2's registerClient. It fired while SaslServerHandler and 
SaslClientHandler were still handshaking. The timeout caused registerClient in 
SparkClientImpl to throw an error, which interrupted the process calling 
spark-submit and therefore closed the channel between HS2 and the RemoteDriver. 
The channel termination was detected by SaslClientHandler.channelInactive on the 
RemoteDriver side, which in turn invoked SaslClientHandler.dispose(). That is why 
we saw the SaslException with the message "SaslException: Client closed before 
SASL negotiation finished." I have managed to reproduce this error by adjusting 
the hive.spark.client.server.connect.timeout value so that the HS2 timeout fires 
during SASL negotiation, before the RemoteDriver reaches its own timeout.
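For reference, the reproduction only needs the server-side timeout lowered so 
that it expires right around the time the driver connects back and SASL starts; 
a hypothetical session could look like this (the value is illustrative and 
depends on how long spark-submit takes in the environment):
{code}
set hive.execution.engine=spark;
-- Illustrative value: just above the time it normally takes the remote driver to
-- connect back, so the HS2 timer expires while SASL negotiation is in progress.
set hive.spark.client.server.connect.timeout=15000ms;
-- Any query that launches a new remote driver can then hit the SaslException above.
{code}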

Looking more into the code, I think the 
[cancelTask|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L100]
 is effectively unused. It never has a chance to take effect because it uses the 
same timeout value as 
[registerClient|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L170],
 and the latter always kicks in first.
A timeout on the RemoteDriver side can happen at two places. If it happens when 
the driver 
[connects|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/Rpc.java#L110]
 back to HS2, HS2 cannot detect this timeout on the driver side and has to wait 
until its own timeout set in 
[registerClient|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L170]
 fires. If the 
[timeout|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/Rpc.java#L122]
 happens during the SASL handshake, the RemoteDriver main exits abnormally. 
SaslServerHandler.channelInactive on the HS2 side detects this channel 
termination and invokes SaslServerHandler.dispose, which in turn cancels the 
cancelTask (so it is never used). Depending on the stage HS2 is at (see the 
following code snippet),
{code}
protected void onError(Throwable error) {
  // Stop the pending cancelTask so it will not fire after this error.
  cancelTask.cancel(true);
  if (client != null) {
    // A driver has already registered: cancel its timeout and fail its promise.
    client.timeoutFuture.cancel(true);
    if (!client.promise.isDone()) {
      client.promise.setFailure(error);
    }
  }
  // If no driver has registered yet (client is null), nothing fails here, so HS2
  // keeps waiting until its own hive.spark.client.server.connect.timeout expires.
}
{code}
HS2 should either wait until its hive.spark.client.server.connect.timeout when 
clientInfo is null, or terminate the process immediately.
So the 
[cancelTask|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L100]
 is currently useless unless its timeout is set to a different (shorter) value 
than the one used for registerClient (hive.spark.client.server.connect.timeout). 
Alternatively, we could consider removing it, though keeping it in the current 
code does no harm either.
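To make the point concrete, here is a small standalone sketch (made-up names and 
scaled-down values, not the Hive code) of the two server-side deadlines; the 
cancelTask deadline can only ever fire first if it is given a distinct, shorter 
value:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ServerDeadlinesSketch {
  public static void main(String[] args) throws Exception {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    long registerClientMs = 900;  // stands in for hive.spark.client.server.connect.timeout
    long cancelTaskMs = 300;      // hypothetical shorter handshake deadline

    // Armed first, when HS2 starts waiting for the driver to register back.
    scheduler.schedule(() -> System.out.println("registerClient timeout: fail the promise"),
        registerClientMs, TimeUnit.MILLISECONDS);
    // Armed later, once a channel connects and the SASL handshake begins.
    scheduler.schedule(() -> System.out.println("cancelTask: close the stalled handshake"),
        cancelTaskMs, TimeUnit.MILLISECONDS);

    Thread.sleep(registerClientMs + 200);  // let both deadlines pass for the demo
    scheduler.shutdown();
    // Only because cancelTaskMs < registerClientMs does the cancelTask line print first;
    // with equal values the earlier-armed registerClient timeout always wins.
  }
}
{code}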



was (Author: ctang.ma):
For the error I mentioned in the JIRA description, it was caused by the 
[timeout|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L176]
 set in HS2's registerClient. It fired while SaslServerHandler and 
SaslClientHandler were still handshaking. The registerClient in SparkClientImpl 
threw an error which interrupted the process calling spark-submit, therefore 
closing the channel between HS2 and the RemoteDriver. The channel termination 
was detected by SaslClientHandler.channelInactive at the RemoteDriver, which in 
turn invoked SaslClientHandler.dispose(), so we saw the SaslException at the 
RemoteDriver side with the message "SaslException: Client closed before SASL 
negotiation finished." I have managed to reproduce this error by adjusting the 
hive.spark.client.server.connect.timeout value so that the HS2 timeout fires 
during SASL negotiation before the RemoteDriver reaches its own timeout. 
Looking more into the code, I think that the 
[cancelTask|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L100]
 never has a chance to take effect. It has the same timeout value as that used for 
[registerClient|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L170],
 and the latter one always kick

[jira] [Comment Edited] (HIVE-16071) Spark remote driver misuses the timeout in RPC handshake

2017-03-09 Thread Chaoyu Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903422#comment-15903422
 ] 

Chaoyu Tang edited comment on HIVE-16071 at 3/9/17 5:24 PM:


So we have reached a consensus that hive.spark.client.server.connect.timeout 
should not be used for the cancelTask on the RpcServer side. The proposed value 
is hive.spark.client.connect.timeout.
[~xuefuz] The reason I previously suggested we could consider another timeout 
for the cancelTask (a little longer than hive.spark.client.connect.timeout) is 
to give the RemoteDriver a little more time than the RpcServer to time out the 
handshake. If the timeouts on both sides are set to exactly the same value, we 
might quite often see SASL handshakes terminated by the cancelTask on the 
RpcServer side, because the timeout on the RemoteDriver side might fire slightly 
later for whatever reason. During that short window, the handshake could still 
have a chance to succeed if it is not terminated by the cancelTask.
To my understanding, shortening the cancelTask timeout is mainly so the RpcServer 
detects a handshake timeout (fired by the RemoteDriver) sooner; we still want the 
RemoteDriver to mainly control the SASL handshake timeout, and most handshake 
timeouts should be fired from the RemoteDriver, right?
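Just to illustrate the margin idea with made-up numbers (this is only a sketch, 
not part of the patch, and the variable names are not from the Hive code):
{code}
public class TimeoutMarginSketch {
  public static void main(String[] args) {
    long driverHandshakeTimeoutMs = 60_000L;                       // what the driver waits for SASL
    long serverCancelTaskMs = driverHandshakeTimeoutMs + 5_000L;   // small margin on top

    // With the margin, the driver's own timeout fires first for a stalled handshake,
    // and the server-side cancelTask only cleans up if the driver never reacts at all.
    System.out.println("driver times out first: " + (driverHandshakeTimeoutMs < serverCancelTaskMs));
  }
}
{code}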


was (Author: ctang.ma):
So we have reached a consensus that hive.spark.client.server.connect.timeout 
should not be used for the cancelTask on the RpcServer side. The proposed value 
is hive.spark.client.connect.timeout.
[~xuefuz] The reason I previously suggested we could consider another timeout 
for the cancelTask (a little longer than hive.spark.client.connect.timeout) is 
to give the RemoteDriver a little more time than the RpcServer to time out the 
handshake. If the timeouts on both sides are set to exactly the same value, we 
might quite often see SASL handshakes terminated by the cancelTask on the 
RpcServer side, because the timeout on the RemoteDriver side might fire slightly 
later for whatever reason. During that short window, the handshake could still 
have a chance to succeed if it is not terminated by the cancelTask.
To my understanding, shortening the cancelTask timeout is mainly so the RpcServer 
detects a handshake timeout (fired by the RemoteDriver) sooner; we still want the 
RemoteDriver to mainly control the SASL handshake timeout, and most handshake 
timeouts should be fired from the RemoteDriver, right?
In addition, I think we should

> Spark remote driver misuses the timeout in RPC handshake
> 
>
> Key: HIVE-16071
> URL: https://issues.apache.org/jira/browse/HIVE-16071
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-16071.patch
>
>
> Based on its property description in HiveConf and the comments in HIVE-12650 
> (https://issues.apache.org/jira/browse/HIVE-12650?focusedCommentId=15128979&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15128979),
>  hive.spark.client.connect.timeout is the timeout for the Spark remote driver 
> to make a socket connection (channel) to the RPC server. But currently it is 
> also used by the remote driver for the RPC client/server handshake, which is 
> not right. Instead, hive.spark.client.server.connect.timeout should be used; 
> it is already used by the RpcServer for the handshake.
> An error like the following is usually caused by this issue, since the default 
> hive.spark.client.connect.timeout value (1000ms) used by the remote driver for 
> the handshake is a little too short.
> {code}
> 17/02/20 08:46:08 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
> Client closed before SASL negotiation finished.
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
> Client closed before SASL negotiation finished.
> at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> at 
> org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156)
> at 
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
> Caused by: javax.security.sasl.SaslException: Client closed before SASL 
> negotiation finished.
> at 
> org.apache.hive.s

[jira] [Comment Edited] (HIVE-16071) Spark remote driver misuses the timeout in RPC handshake

2017-03-09 Thread Chaoyu Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903422#comment-15903422
 ] 

Chaoyu Tang edited comment on HIVE-16071 at 3/9/17 5:24 PM:


So we have reached a consensus that hive.spark.client.server.connect.timeout 
should not be used for the cancelTask on the RpcServer side. The proposed value 
is hive.spark.client.connect.timeout.
[~xuefuz] The reason I previously suggested we could consider another timeout 
for the cancelTask (a little longer than hive.spark.client.connect.timeout) is 
to give the RemoteDriver a little more time than the RpcServer to time out the 
handshake. If the timeouts on both sides are set to exactly the same value, we 
might quite often see SASL handshakes terminated by the cancelTask on the 
RpcServer side, because the timeout on the RemoteDriver side might fire slightly 
later for whatever reason. During that short window, the handshake could still 
have a chance to succeed if it is not terminated by the cancelTask.
To my understanding, shortening the cancelTask timeout is mainly so the RpcServer 
detects a handshake timeout (fired by the RemoteDriver) sooner; we still want the 
RemoteDriver to mainly control the SASL handshake timeout, and most handshake 
timeouts should be fired from the RemoteDriver, right?
In addition, I think we should


was (Author: ctang.ma):
So we have reached a consensus that hive.spark.client.server.connect.timeout 
should not be used for the cancelTask on the RpcServer side. The proposed value 
is hive.spark.client.connect.timeout.
[~xuefuz] The reason I previously suggested we could consider another timeout 
for the cancelTask (a little longer than hive.spark.client.connect.timeout) is 
to give the RemoteDriver a little more time than the RpcServer to time out the 
handshake. If the timeouts on both sides are set to exactly the same value, we 
might quite often see SASL handshakes terminated by the cancelTask on the 
RpcServer side, because the timeout on the RemoteDriver side might fire slightly 
later for whatever reason. During that short window, the handshake could still 
succeed if it is not terminated by the cancelTask.
To my understanding, we still want the RemoteDriver to mainly control the SASL 
handshake timeout; shortening the cancelTask timeout is mainly so the RpcServer 
detects the timeout (fired by the RemoteDriver) sooner, right?

> Spark remote driver misuses the timeout in RPC handshake
> 
>
> Key: HIVE-16071
> URL: https://issues.apache.org/jira/browse/HIVE-16071
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-16071.patch
>
>
> Based on its property description in HiveConf and the comments in HIVE-12650 
> (https://issues.apache.org/jira/browse/HIVE-12650?focusedCommentId=15128979&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15128979),
>  hive.spark.client.connect.timeout is the timeout for the Spark remote driver 
> to make a socket connection (channel) to the RPC server. But currently it is 
> also used by the remote driver for the RPC client/server handshake, which is 
> not right. Instead, hive.spark.client.server.connect.timeout should be used; 
> it is already used by the RpcServer for the handshake.
> An error like the following is usually caused by this issue, since the default 
> hive.spark.client.connect.timeout value (1000ms) used by the remote driver for 
> the handshake is a little too short.
> {code}
> 17/02/20 08:46:08 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
> Client closed before SASL negotiation finished.
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
> Client closed before SASL negotiation finished.
> at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> at 
> org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156)
> at 
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
> Caused by: javax.security.sasl.SaslException: Client closed before SASL 
> negotiation finished.
> at 
> org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:453)
> at 
> org.apache.hi