Hi Egor,

Here are a few answers to your questions:

1) Python needs to be installed on all machines, but PySpark does not. How the executors get the PySpark code depends on which cluster manager you use. In standalone mode, your executors need to have the actual Python files in their working directory. In YARN mode, the Python files are included in the assembly jar, which is then shipped to your executor containers through a distributed cache.

2) PySpark is just a thin wrapper around Spark. When you write a closure in Python, it is shipped to the executors within the task itself, the same way Scala closures are shipped. If you use a special library, then all of the nodes will need to have that library pre-installed.

3) Are you trying to run your C++ code inside the "map" function? If so, you need to make sure the compiled code is present in the working directory on all the executors beforehand for Python to "exec" it. I haven't done this before, so there may be a few gotchas in doing this. Maybe others can add more information?

Andrew

2014-07-11 5:50 GMT-07:00 Egor Pahomov <pahomov.e...@gmail.com>:

> Hi, I want to use pySpark, but can't understand how it works. Documentation
> doesn't provide enough information.
>
> 1) How python shipped to cluster? Should machines in cluster already have
> python?
> 2) What happens when I write some python code in "map" function - is it
> shipped to cluster and just executed on it? How it understand all
> dependencies, which my code need and ship it there? If I use Math in my
> code in "map" does it mean, that I would ship Math class or some python
> Math on cluster would be used?
> 3) I have c++ compiled code. Can I ship this executable with "addPyFile"
> and just use "exec" function from python? Would it work?
>
> --
>
> *Sincerely yours*
> *Egor Pakhomov*
> *Scala Developer, Yandex*
>
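
P.S. For point 3, here is a rough sketch of how calling a compiled program from inside "map" might look. I have not run this on a cluster, so treat it as an assumption-laden starting point rather than a recipe. The names run_external and map_with_binary are made up for illustration. I am also assuming the binary reads one record on stdin and writes its result to stdout, and that it is shipped with SparkContext.addFile and located with SparkFiles.get rather than with addPyFile.

```python
import os
import subprocess
import sys


def run_external(record, command):
    """Send one record to an external program on stdin and return its stdout.

    `command` is an argument list such as ["/path/to/my_tool"]. The
    stdin/stdout protocol is an assumption about how the binary behaves.
    """
    result = subprocess.run(
        command,
        input=record,
        capture_output=True,
        text=True,
        check=True,  # raise if the program exits with a non-zero status
    )
    return result.stdout.strip()


def map_with_binary(sc, rdd, local_binary_path):
    """Hypothetical helper: ship a binary to the executors and map over it.

    pyspark is imported lazily so that run_external can be tried without
    Spark. The binary must be built for the executors' platform, and you
    may need to check that it is still executable after being shipped.
    """
    from pyspark import SparkFiles

    sc.addFile(local_binary_path)  # distribute the file to every node
    name = os.path.basename(local_binary_path)
    # SparkFiles.get resolves the shipped file's path on each executor.
    return rdd.map(lambda rec: run_external(rec, [SparkFiles.get(name)]))


if __name__ == "__main__":
    # Local check without Spark: the Python interpreter stands in for the binary.
    stand_in = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
    print(run_external("hello", stand_in))
```

Starting one process per record can be slow. If that becomes a problem, mapPartitions would let you launch the program once per partition instead.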