Hey there,

I have a CDH cluster where the default Python installed on those Red Hat Linux machines is Python 2.6. I am thinking about developing a Spark application with PySpark, and I want to be able to use the pandas and scikit-learn packages. The Anaconda Python interpreter has the most functionality out of the box; however, when I try to use Anaconda Python 2.7, the Spark job won't run properly and fails because the Python interpreter is not consistent across the cluster.
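For reference, this is roughly the kind of job I am trying to run and how I submit it; the script name and the Anaconda path are only placeholders, not my real setup:

# minimal_job.py -- a rough sketch of the kind of job I am trying to run;
# the file name and the Anaconda path below are just placeholders.
from pyspark import SparkContext

import pandas as pd                                  # needed on the driver...
from sklearn.linear_model import LinearRegression    # ...and, I assume, on every executor

def fit_partition(iterator):
    # This function runs inside a Python worker process on each executor,
    # so pandas and scikit-learn must be importable there as well.
    df = pd.DataFrame(list(iterator), columns=["x", "y"])
    model = LinearRegression().fit(df[["x"]], df["y"])
    yield (len(df), float(model.coef_[0]))

sc = SparkContext(appName="pandas-sklearn-test")
rdd = sc.parallelize([(i, 2 * i) for i in range(100)], 4)
print(rdd.mapPartitions(fit_partition).collect())
sc.stop()

# Submitted roughly like this (the Anaconda path is illustrative):
#   PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit minimal_job.py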
Here are my questions:

(1) I took a quick look at the PySpark source code, and it looks like in the end it calls spark-submit. Doesn't that mean all the work will eventually be translated into Scala code and the workload distributed to the whole cluster? In that case, I should not need to worry about the Python interpreter beyond the master node, right?

(2) If the Spark job needs a consistent set of Python libraries installed on every node, should I install Anaconda Python on all of them? If so, what is the modern way of managing the Python ecosystem on the cluster?

I am a big fan of Python, so please guide me.

Best regards,
Bin