Hi everybody, I am currently testing Spark 2.4.4 with the following new settings:
spark.authenticate true
spark.io.encryption.enabled true
spark.io.encryption.keySizeBits 256
spark.io.encryption.keygen.algorithm HmacSHA256
spark.network.crypto.enabled true
spark.network.crypto.keyFactoryAlgorithm PBKDF2WithHmacSHA256
spark.network.crypto.keyLength 256
spark.network.crypto.saslFallback false

I use dynamic allocation, and the Spark shuffle service is configured correctly in YARN. I added the following two options to yarn-site.xml:

<property>
  <name>spark.authenticate</name>
  <value>true</value>
</property>
<property>
  <name>spark.network.crypto.enabled</name>
  <value>true</value>
</property>

This works very well with all the Scala-based entry points (spark2-shell, spark-submit, etc.), but it doesn't with PySpark: I see the following warnings repeating over and over:

20/01/14 10:23:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
20/01/14 10:23:50 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 1 total executors!

The culprit seems to be "spark.io.encryption.enabled=true", since without it everything works fine. At first I thought it was a YARN resource allocation problem, but I checked and the cluster has plenty of capacity. After digging a bit more into YARN's container logs, I discovered that the problem seems to be the ApplicationMaster not being able to contact the driver in time (client mode, of course):

20/01/14 09:45:21 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1576771377404_19608_000001
20/01/14 09:45:21 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 09:45:52 ERROR TransportClientFactory: Exception while bootstrapping client after 30120 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
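For completeness, this is roughly how I launch the failing PySpark session (a sketch of my invocation, not a recommended setup; the master/deploy-mode flags and the dynamic-allocation confs are just restating what I described above):

```shell
# PySpark in YARN client mode, with the same settings passed
# explicitly via --conf (values mirror the list above).
pyspark \
  --master yarn \
  --deploy-mode client \
  --conf spark.authenticate=true \
  --conf spark.io.encryption.enabled=true \
  --conf spark.io.encryption.keySizeBits=256 \
  --conf spark.io.encryption.keygen.algorithm=HmacSHA256 \
  --conf spark.network.crypto.enabled=true \
  --conf spark.network.crypto.keyFactoryAlgorithm=PBKDF2WithHmacSHA256 \
  --conf spark.network.crypto.keyLength=256 \
  --conf spark.network.crypto.saslFallback=false \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true
```

Dropping the single line with spark.io.encryption.enabled from this invocation is enough to make the warnings go away.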
        at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:263)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:105)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:79)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:257)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
        at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:259)
        ... 11 more

The strange part is that the timeout is intermittent: sometimes it occurs, sometimes it doesn't.
I checked the code related to the above stack trace and ended up at:

https://github.com/apache/spark/blob/branch-2.4/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java#L106
https://github.com/apache/spark/blob/branch-2.4/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L129-L133

The "spark.network.auth.rpcTimeout" option seems to help, even though it is not advertised in the docs as far as I can see (the 30s mentioned in the exception is definitely triggered by this setting, though).

What I am wondering is where/what I should check to debug this further, since it seems to be a Python-only problem that doesn't affect Scala. I didn't find any outstanding bugs, and given that 2.4.4 is very recent, I thought I'd report it here and ask for advice :)

Thanks in advance!

Luca

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
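P.S. In case it helps anyone hitting the same error, this is the workaround I am currently testing (a sketch only: the 120s value is an arbitrary choice of mine, raised from the 30s seen in the exception above, not a tuned or recommended setting):

```
# spark-defaults.conf fragment: raise the auth bootstrap RPC timeout
# (undocumented in 2.4.4 as far as I can tell) above the observed 30s.
spark.network.auth.rpcTimeout  120s
```

With this in place, the intermittent AuthClientBootstrap timeouts have not reappeared so far, but it feels like it only papers over whatever makes the PySpark driver slow to answer the auth handshake.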