Marcelo Vanzin created SPARK-27219:
--------------------------------------

             Summary: Misleading exceptions in transport code's SASL fallback path
                 Key: SPARK-27219
                 URL: https://issues.apache.org/jira/browse/SPARK-27219
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Marcelo Vanzin


A couple of code paths in the SASL fallback handling print misleading 
exceptions to the logs. One of them is a timeout during authentication; 
for example:

{noformat}
19/03/15 11:21:37 WARN crypto.AuthClientBootstrap: New auth protocol failed, trying SASL.
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:258)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:105)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:79)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:262)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:192)
        at org.apache.spark.network.shuffle.ExternalShuffleClient.lambda$fetchBlocks$0(ExternalShuffleClient.java:100)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
...
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
        at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:254)
        ... 38 more
19/03/15 11:21:38 WARN server.TransportChannelHandler: Exception in connection from vc1033.halxg.cloudera.com/10.17.216.43:7337
java.lang.IllegalArgumentException: Frame length should be positive: -3702202170875367528
        at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
{noformat}

The IllegalArgumentException shouldn't happen. It only occurs because the code 
ignores the timeout and retries with SASL, at which point the remote side is 
in a different state and doesn't expect the message.

The same line that prints that exception can also produce a noisy log message 
when the remote side (e.g. an old shuffle service) does not understand the new 
auth protocol. Because it is logged as a warning, it looks like something is 
wrong when the code is just doing what's expected.
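
One possible way to handle both cases is sketched below. It is only an illustration of the idea, not the actual Spark code: the class name, the authenticate method, and the use of Runnable for the two auth attempts are all hypothetical stand-ins. The sketch assumes the timeout arrives wrapped in a RuntimeException, as in the stack trace above, and that any other RuntimeException may mean the peer does not speak the new protocol.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names come from the Spark code base.
public class AuthFallbackSketch {

    /**
     * Tries the new auth protocol first and falls back to SASL only when
     * that looks safe. Returns "new" or "sasl" to say which one succeeded.
     */
    public static String authenticate(Runnable newAuth, Runnable saslAuth,
                                      boolean saslFallbackEnabled) {
        try {
            newAuth.run();
            return "new";
        } catch (RuntimeException e) {
            // After a timeout the remote side may be in the middle of the
            // new handshake, so a SASL message on the same channel would be
            // unexpected there. Treat the timeout as a failure instead.
            if (!saslFallbackEnabled || e.getCause() instanceof TimeoutException) {
                throw e;
            }
            // Assumption: any other failure may just be an old peer that
            // does not understand the new protocol, which is expected, so
            // report it below warning level and without a stack trace.
            System.out.println("INFO New auth protocol failed, trying SASL.");
            saslAuth.run();
            return "sasl";
        }
    }

    public static void main(String[] args) {
        Runnable unsupported = () -> {
            throw new RuntimeException("peer rejected the new auth message");
        };
        Runnable timedOut = () -> {
            throw new RuntimeException(new TimeoutException("Timeout waiting for task."));
        };
        Runnable sasl = () -> { };

        // Old peer: falls back to SASL quietly.
        System.out.println(authenticate(unsupported, sasl, true));

        // Timeout: the original exception is rethrown and SASL is not tried.
        try {
            authenticate(timedOut, sasl, true);
        } catch (RuntimeException e) {
            System.out.println("failed without fallback: " + e.getCause());
        }
    }
}
```

With this shape, a timeout surfaces as a single error on the client, and the follow-up "Frame length should be positive" exception on the remote side should no longer be triggered by a retry.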


