Hi Egor,

Here are a few answers to your questions:

1) Python needs to be installed on all machines, but PySpark does not. How
the executors get the PySpark code depends on which cluster manager you
use. In standalone mode, your executors need to have the actual Python
files in their working directory. In YARN mode, the Python files are
included in the assembly jar, which is then shipped to your executor
containers through the distributed cache.
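
If you also need to ship additional Python files of your own (as opposed to
PySpark itself), a minimal sketch might look like the following; the file path,
module name, and "transform" function are made up for illustration:

from pyspark import SparkContext

sc = SparkContext(appName="ship-python-files")

# Ship a local .py (or .zip/.egg) to every executor; the file then becomes
# importable inside closures that run on the workers.
sc.addPyFile("/path/to/my_helpers.py")    # hypothetical path

def apply_helper(x):
    import my_helpers                     # resolved from the shipped file
    return my_helpers.transform(x)        # hypothetical function

print(sc.parallelize(range(10)).map(apply_helper).collect())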

2) PySpark is just a thin wrapper around Spark. When you write a closure in
Python, it is shipped to the executors within the task itself, the same way
Scala closures are shipped. If your closure uses a third-party library, every
node will need to have that library pre-installed; only the closure travels,
not the packages it imports.
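
As a rough illustration (not from the original thread): in the snippet below
the lambda is pickled and shipped with each task, but NumPy itself is only
imported by name on the workers, so it has to be installed there already.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="closure-shipping")

# The closure (including its reference to np.sqrt) travels with the task;
# the numpy package does not -- every worker needs its own installation.
rdd = sc.parallelize([1.0, 4.0, 9.0])
print(rdd.map(lambda x: float(np.sqrt(x))).collect())   # [1.0, 2.0, 3.0]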

3) Are you trying to run your C++ code inside the "map" function? If so,
you need to make sure the compiled binary is present in the working directory
of every executor beforehand so that Python can invoke it. I haven't done
this myself, and there may be a few gotchas.
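
One caveat on the original question: Python's built-in "exec" runs Python
source, not native binaries, so a subprocess call (or RDD.pipe) is probably
closer to what you would need. An untested sketch, with the binary name and
its command line made up, might look like this:

import os
import stat
import subprocess
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="run-cpp-binary")
sc.addFile("/path/to/my_tool")                # hypothetical compiled binary

def run_tool(x):
    path = SparkFiles.get("my_tool")          # local copy on this executor
    os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)  # ensure executable
    out = subprocess.check_output([path, str(x)])
    return out.decode().strip()

print(sc.parallelize(range(4)).map(run_tool).collect())

Alternatively, rdd.pipe("./my_tool") streams each partition's records through
the binary's stdin/stdout, which avoids starting one process per record.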

Maybe others can add more information?

Andrew


2014-07-11 5:50 GMT-07:00 Egor Pahomov <pahomov.e...@gmail.com>:

> Hi, I want to use PySpark, but I can't understand how it works. The
> documentation doesn't provide enough information.
>
> 1) How is Python shipped to the cluster? Should the machines in the cluster
> already have Python installed?
> 2) What happens when I write some Python code in a "map" function: is it
> shipped to the cluster and just executed there? How does it figure out all
> the dependencies my code needs and ship them there? If I use Math in my
> code in "map", does that mean I would ship the Math class, or would some
> Python Math on the cluster be used?
> 3) I have compiled C++ code. Can I ship this executable with "addPyFile"
> and just use the "exec" function from Python? Would it work?
>
> --
>
> Sincerely yours,
> Egor Pakhomov
> Scala Developer, Yandex
>
