I am trying to figure out Python library management. My question is: where do third-party Python libraries (e.g. numpy, scipy, etc.) need to exist if I'm running a Spark job via spark-submit against my cluster in yarn-client mode? Do the libraries only need to exist on the client (i.e. the server executing the driver code), or do they also need to exist on the datanode/worker nodes where the tasks are executed? The documentation seems to indicate that under yarn-client mode the libraries are only needed on the client machine, not across the entire cluster. If the libraries are in fact needed on all cluster machines, any suggestions on a deployment strategy or dependency management model that works well?
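
For context, my submit command looks roughly like the sketch below (the script name, zip file, and paths are placeholders for illustration; --master yarn-client and --py-files are the standard spark-submit options):

    spark-submit \
      --master yarn-client \
      --py-files /path/to/pure_python_deps.zip \
      my_job.py

My understanding is that --py-files only ships pure-Python modules to the executors, which is part of why I'm unsure how to handle packages with compiled extensions like numpy and scipy.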
Thanks