I think with addJar() there is no caching, in the sense that files are copied every time, for each job. With the Hadoop distributed cache, by contrast, files are copied to a node only once, and a symlink to the cached file is created for subsequent runs: https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/filecache/DistributedCache.html
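For comparison, here is a minimal sketch of the closest built-in Spark mechanism I'm aware of, SparkContext.addFile together with SparkFiles.get (the HDFS path is made up for illustration; this is a sketch, not a tested program):

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("addFile-sketch"))

    // Ship a file to every executor for this application. Note it is
    // re-copied per application run, unlike Hadoop's DistributedCache,
    // which reuses a node-local copy across jobs.
    sc.addFile("hdfs:///data/lookup.txt")  // hypothetical path

    sc.parallelize(1 to 4).foreach { _ =>
      // SparkFiles.get resolves the executor-local path of the shipped file
      val localPath = SparkFiles.get("lookup.txt")
      println(localPath)
    }
    sc.stop()
  }
}
```

As far as I can tell, this covers distribution but not the cross-job caching or automatic archive extraction described below.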
Also, the Hadoop distributed cache can copy an archive file to the node and unzip it automatically into the current working directory. The advantage is that the copy is very fast. I am still looking for a similar mechanism in Spark.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-distributed-caching-with-Spark-and-YARN-tp1074p3566.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.