On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:
>Hi,
>
>I'm trying to submit a job to a small Spark cluster running in standalone
>mode, however it seems like the jar file I'm submitting to the cluster is
>"not found" by the worker nodes.
>
>I might have understood wrong, but I thought the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I. The problem is that --deploy-mode cluster runs the Driver on the cluster as well, and you don't know in advance which node it's going to run on, so every node needs access to the JAR. spark-submit does not ship the JAR to the Driver, but the Driver will ship it to the executors. I ended up putting the JAR in HDFS and passing an hdfs:// path to spark-submit (rough sketch at the end of this mail).

This is a subtle difference from Spark on YARN, which does ship the JAR to the Driver automatically, and IMO it should be fixed in spark-submit; it's really confusing for newcomers.

Another problem I ran into, and you might too, is that --packages doesn't work with --deploy-mode cluster. It downloads the packages to a temporary location on the node running spark-submit, then passes those local paths along to the node that runs the Driver; since that isn't the same machine, the Driver can't find anything and fails. The Driver process *should* be the one doing the downloading, but it isn't. I ended up building a fat JAR with all of the dependencies to get around that one (sketch below as well).

Greg
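
P.S. For anyone who finds this thread later, here is a minimal sketch of the HDFS workaround. The master URL, paths, and class name are placeholders, not anything from Daniel's actual setup:

    # put the application JAR somewhere every node can read it
    hdfs dfs -mkdir -p /user/spark/jars
    hdfs dfs -put target/my-app.jar /user/spark/jars/

    # submit with an hdfs:// path so whichever node is chosen to
    # run the Driver can fetch the JAR itself
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --class com.example.MyApp \
      hdfs:///user/spark/jars/my-app.jar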
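
And a sketch of the fat-JAR workaround for the --packages problem, assuming an sbt build with the sbt-assembly plugin (the plugin coordinates are real; the version numbers are only illustrative for the Spark 1.5.x era):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

    // build.sbt -- mark Spark itself "provided" so it stays out of
    // the fat JAR; declare whatever you were pulling in with
    // --packages as ordinary dependencies instead
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
      "com.databricks"   %% "spark-csv"  % "1.3.0"
    )

    // then build with `sbt assembly` and pass the resulting
    // target/scala-2.10/my-app-assembly-*.jar to spark-submit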