Marcelo Vanzin created SPARK-27219:
--------------------------------------

             Summary: Misleading exceptions in transport code's SASL fallback path
                 Key: SPARK-27219
                 URL: https://issues.apache.org/jira/browse/SPARK-27219
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Marcelo Vanzin

There are a couple of code paths in the SASL fallback handling that result in misleading exceptions being printed to the logs.

One of them is when a timeout occurs during authentication; for example:

{noformat}
19/03/15 11:21:37 WARN crypto.AuthClientBootstrap: New auth protocol failed, trying SASL.
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:258)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:105)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:79)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:262)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:192)
        at org.apache.spark.network.shuffle.ExternalShuffleClient.lambda$fetchBlocks$0(ExternalShuffleClient.java:100)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
        ...
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
        at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:254)
        ... 38 more
19/03/15 11:21:38 WARN server.TransportChannelHandler: Exception in connection from vc1033.halxg.cloudera.com/10.17.216.43:7337
java.lang.IllegalArgumentException: Frame length should be positive: -3702202170875367528
        at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
{noformat}

The IllegalArgumentException shouldn't happen; it only occurs because the code ignores the timeout and retries, at which point the remote side is in a different state and thus doesn't expect the message.

The same line that prints that exception can also produce a noisy log message when the remote side (e.g. an old shuffle service) does not understand the new auth protocol. Since it's logged as a warning, it looks like something is wrong, when the code is just doing what's expected.
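
As a rough illustration of the behavior described above, here is a minimal, self-contained sketch of a fallback path that treats a timeout as fatal instead of retrying with SASL, and logs the expected "old peer" case without a stack trace. This is not Spark's code: `AuthFallbackSketch`, `Handshake`, `shouldFallBack`, `bootstrap` and the `fallbackEnabled` flag are all hypothetical names made up for the example. It also assumes the timeout arrives wrapped as the cause of a RuntimeException, as in the trace above.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names exist in Spark.
final class AuthFallbackSketch {

  /** Which handshake ended up authenticating the connection. */
  enum Outcome { NEW_AUTH, SASL_FALLBACK }

  /** Stand-in for one authentication handshake over an open channel. */
  interface Handshake {
    void run();
  }

  /**
   * Decides whether a failure of the new protocol may be followed by SASL.
   * A wrapped TimeoutException means the peer's state is unknown, so a
   * second handshake on the same channel is not attempted.
   */
  static boolean shouldFallBack(RuntimeException e, boolean fallbackEnabled) {
    return fallbackEnabled && !(e.getCause() instanceof TimeoutException);
  }

  static Outcome bootstrap(Handshake newAuth, Handshake sasl, boolean fallbackEnabled) {
    try {
      newAuth.run();
      return Outcome.NEW_AUTH;
    } catch (RuntimeException e) {
      if (!shouldFallBack(e, fallbackEnabled)) {
        // Timeout, or fallback disabled: surface the original error.
        throw e;
      }
      // Expected when talking to an old peer, so no stack trace is printed.
      System.out.println("INFO New auth protocol failed, trying SASL.");
      sasl.run();
      return Outcome.SASL_FALLBACK;
    }
  }

  public static void main(String[] args) {
    // Old peer: the new protocol fails for a non-timeout reason, SASL is tried.
    Outcome outcome = bootstrap(
        () -> { throw new RuntimeException(new IllegalStateException("unknown message")); },
        () -> { },
        true);
    System.out.println(outcome);

    // Timeout: the error is rethrown and the SASL handshake never runs.
    try {
      bootstrap(
          () -> { throw new RuntimeException(new TimeoutException("Timeout waiting for task.")); },
          () -> { throw new AssertionError("SASL must not run after a timeout"); },
          true);
    } catch (RuntimeException e) {
      System.out.println("rethrown: " + e.getCause());
    }
  }
}
```

Whether a timeout is the only failure that should skip the fallback is a judgment call; the sketch only shows the shape of the check.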