Greetings,

Is it possible to limit the number of times the Hadoop IPC client retries during
a spark-submit invocation?  For context, see this StackOverflow post
<https://stackoverflow.com/questions/59863850/how-to-control-the-number-of-hadoop-ipc-retry-attempts-for-a-spark-job-submissio>.
In essence, I am calling spark-submit against a Kerberized cluster without any
valid Kerberos tickets available.  This is deliberate, and I'm not actually
facing a Kerberos issue; it is simply the easiest reproducible case of a
"long IPC retry" I have been able to trigger.
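
For reference, the reproduction is roughly the following (the example jar path
is just a placeholder for the spark-examples jar shipped with the distribution):

# deliberately drop any Kerberos tickets, then submit the stock SparkPi example to YARN
kdestroy
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar \
  100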

In this particular case, the following errors are printed (presumably by
the launcher):

20/01/22 15:49:32 INFO retry.RetryInvocationHandler:
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN, KERBEROS]; Host Details : local host is:
"node-1.cluster/172.18.0.2"; destination host is:
"node-1.cluster":8032; , while invoking
ApplicationClientProtocolPBClientImpl.getClusterMetrics over null
after 1 failover attempts. Trying to failover after sleeping for
35160ms.

This repeats 30 times before the launcher finally gives up.

As indicated in the answer on that StackOverflow post, the relevant Hadoop
properties should be ipc.client.connect.max.retries and/or
ipc.client.connect.max.retries.on.sasl.  However, in testing on Spark 2.4.0
(on CDH 6.1), I have not been able to get either of them to take effect: the
launcher still retries 30 times regardless.  I am running the SparkPi example
and passing the properties with --conf spark.hadoop.ipc.client.connect.max.retries
and/or --conf spark.hadoop.ipc.client.connect.max.retries.on.sasl.
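
Concretely, the submission looks roughly like this (the values shown are just
what I have been experimenting with, and the jar path is again a placeholder):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.hadoop.ipc.client.connect.max.retries=3 \
  --conf spark.hadoop.ipc.client.connect.max.retries.on.sasl=3 \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar \
  100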

Any ideas on what I could be doing wrong, or why I can't get these
properties to take effect?
