Hey there,

I have a CDH cluster where the default Python installed on those Red Hat Linux machines is Python 2.6. I am thinking about developing a Spark application with PySpark, and I want to be able to use the pandas and scikit-learn packages. The Anaconda Python interpreter has the most functionality out of the box; however, when I try to use Anaconda Python 2.7, the Spark job won't run properly and fails because the Python interpreter is not consistent across the cluster.
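For reference, this is roughly the kind of job I am trying to run and how I submit it; the script name and the Anaconda path are only placeholders, not my real setup:

# minimal_job.py -- a rough sketch of the kind of job I am trying to run;
# the file name and the Anaconda path below are just placeholders.
from pyspark import SparkContext

import pandas as pd                                  # needed on the driver...
from sklearn.linear_model import LinearRegression    # ...and, I assume, on every executor

def fit_partition(iterator):
    # This function runs inside a Python worker process on each executor,
    # so pandas and scikit-learn must be importable there as well.
    df = pd.DataFrame(list(iterator), columns=["x", "y"])
    model = LinearRegression().fit(df[["x"]], df["y"])
    yield (len(df), float(model.coef_[0]))

sc = SparkContext(appName="pandas-sklearn-test")
rdd = sc.parallelize([(i, 2 * i) for i in range(100)], 4)
print(rdd.mapPartitions(fit_partition).collect())
sc.stop()

# Submitted roughly like this (the Anaconda path is illustrative):
#   PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit minimal_job.py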
Here are my questions:

(1) I took a quick look at the PySpark source code, and it looks like in the end it calls spark-submit. Doesn't that mean all the work will eventually be translated into Scala code and the workload distributed to the whole cluster? In that case, I should not need to worry about the Python interpreter beyond the master node, right?

(2) If the Spark job needs a consistent set of Python libraries installed on every node, should I install Anaconda Python on all of them? If so, what is the modern way of managing the Python ecosystem on the cluster?

I am a big fan of Python, so please guide me.

Best regards,
Bin