Hello,

It looks like the local conf archive always gets copied to the target
filesystem (HDFS) every time a job is submitted:
<https://github.com/apache/spark/blob/fd009d652f7922254ccc7cc631b8df3a6b821532/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L813>

   1. Other files/archives don't get sent if they are local
   <https://github.com/apache/spark/blob/fd009d652f7922254ccc7cc631b8df3a6b821532/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L567>
   - would it make sense to allow skipping the upload of the local conf
   archive as well? (see the rough sketch after this list)
   2. The archive seems to get copied on every 'distribute' call, which can
   happen multiple times per spark-submit job (at least that's what I got
   from reading the code) - is that the intention?
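
To make question 1 more concrete, here is a rough, simplified sketch of the
kind of opt-out I had in mind. The config key
'spark.yarn.config.skipConfArchiveUpload' and the helper below are made up
purely for illustration; this is not the actual Client.scala code.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkConf

    object ConfArchiveUploadSketch {
      // Hypothetical flag (not an existing Spark config) that would let
      // spark-submit skip uploading the locally built conf archive,
      // similar to how "local:" resources are skipped in distribute().
      val SkipConfUpload = "spark.yarn.config.skipConfArchiveUpload"

      def maybeUploadConfArchive(
          sparkConf: SparkConf,
          fs: FileSystem,
          localConfArchive: Path,
          destDir: Path): Unit = {
        if (sparkConf.getBoolean(SkipConfUpload, defaultValue = false)) {
          // Assume the relevant conf already exists on the cluster nodes,
          // so nothing needs to be staged on HDFS.
        } else {
          // Current behaviour: always copy the archive to the staging dir.
          fs.copyFromLocalFile(false, true, localConfArchive, destDir)
        }
      }
    }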


The motivation for my questions is:

   1. In some cases, spark-submit may not have direct access to HDFS, and
   hence cannot upload the files.
   2. What would be the use case for distributing a custom config to the
   YARN cluster? The cluster already has all the relevant YARN, Hadoop and
   Spark config. If anything, letting the end user override the configs
   seems dangerous (e.g. if they override resource limits, etc.)

Cheers,
Eugene
