Also take a look at this:

On Fri, Jul 11, 2014 at 10:29 AM, Andrew Or <> wrote:

> Hi Egor,
> Here are a few answers to your questions:
> 1) Python needs to be installed on all machines, but not pyspark. The way
> the executors get the pyspark code depends on which cluster manager you
> use. In standalone mode, your executors need to have the actual python
> files in their working directory. In yarn mode, python files are included
> in the assembly jar, which is then shipped to your executor containers
> through a distributed cache.
> 2) Pyspark is just a thin wrapper around Spark. When you write a closure in
> python, it is shipped to the executors within the task itself the same way
> scala closures are shipped. If you use a special library, then all of the
> nodes will need to have that library pre-installed.
> 3) Are you trying to run your c++ code inside the "map" function? If so,
> you need to make sure the compiled code is present in the working directory
> on all the executors beforehand for python to "exec" it. I haven't done
> this before, but maybe there are a few gotchas in doing this.
> Maybe others can add more information?
> Andrew
> 2014-07-11 5:50 GMT-07:00 Egor Pahomov <>:
> > Hi, I want to use pySpark, but can't understand how it works. Documentation
> > doesn't provide enough information.
> >
> > 1) How is Python shipped to the cluster? Should the machines in the cluster
> > already have Python installed?
> > 2) What happens when I write some Python code in a "map" function? Is it
> > shipped to the cluster and just executed there? How does it work out all the
> > dependencies my code needs and ship them? If I use Math in my code in
> > "map", does that mean I would ship the Math class, or would some Python
> > Math on the cluster be used?
> > 3) I have compiled C++ code. Can I ship this executable with "addPyFile"
> > and just use the "exec" function from Python? Would it work?
> >
> > --
> >
> >
> >
> > *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
> >
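
On question 2 (is Math shipped, or is the cluster's copy used?), here is a small sketch that may help. It uses the standard pickle module as a stand-in; PySpark's closure serializer is, as far as I know, built on cloudpickle, which I am assuming treats importable modules the same way. The point is only that a function from an installed module is recorded by module and name, so the loading side needs that module installed.

```python
import math
import pickle

# Pickling a function from an importable module stores a reference
# (module name + function name), not the function's implementation.
payload = pickle.dumps(math.sqrt)
refers_by_name = b"math" in payload and b"sqrt" in payload

# Unpickling looks the name up in whatever math module the loading
# interpreter has installed. On a cluster that would be the executor's copy.
restored = pickle.loads(payload)
result = restored(16.0)
```

This matches Andrew's answer 2: the closure travels with the task, but libraries it imports have to exist on every node already.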

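On question 3, note that Python's built-in exec runs Python source, not native executables, so a compiled program would more likely be started as a child process. Below is a minimal sketch of that idea, not a tested Spark recipe. The helper name pipe_partition and the stand-in command are hypothetical; the stand-in is a child Python process so the example runs anywhere.

```python
import subprocess
import sys

def pipe_partition(records, command):
    """Feed records to an external program on stdin and yield its stdout lines."""
    text = "".join(f"{record}\n" for record in records)
    completed = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    for line in completed.stdout.splitlines():
        yield line

# Hypothetical stand-in for the compiled C++ program: a child Python process
# that upper-cases each input line. Swap in the path to the real binary.
stand_in = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())",
]

output = list(pipe_partition(["a", "b"], stand_in))
```

In a job this might be wired up as rdd.mapPartitions(lambda part: pipe_partition(part, ["./my_binary"])), where my_binary is a made-up name and the file must already be present on every executor, as Andrew says. RDD.pipe is another option worth checking for the same purpose.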