Hi Oleg,

There isn't much you need to do to set up a YARN cluster to run PySpark. You
need to make sure all machines have Python installed, and... that's about
it. Your assembly jar will be shipped to all containers along with all the
PySpark and Py4J files needed. One caveat, however, is that the jar needs
to be built with Maven, and not on a Red Hat-based OS:

http://spark.apache.org/docs/latest/building-with-maven.html#building-for-pyspark-on-yarn
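
For reference, the Maven command on that page looks something like this (the
Hadoop version here is just an example; use whichever matches your cluster):

  mvn -Pyarn -Dhadoop.version=2.4.0 -DskipTests clean package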

In addition, it should be built with Java 6, because of a known issue with
building jars that include Python files using Java 7 (
https://issues.apache.org/jira/browse/SPARK-1718). Lastly, if you have
trouble getting it to work, you can follow the steps I listed in a
different thread to figure out what's wrong:

http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3ccamjob8mr1+ias-sldz_rfrke_na2uubnmhrac4nukqyqnun...@mail.gmail.com%3e
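
Once the assembly is built, submitting a Python app to YARN is a one-liner
(the conf path and script name below are just placeholders for your own):

  export HADOOP_CONF_DIR=/etc/hadoop/conf
  bin/spark-submit --master yarn-client your_script.py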

Let me know if you can get it working,
-Andrew

2014-09-03 5:03 GMT-07:00 Oleg Ruchovets <oruchov...@gmail.com>:

> Hi all.
>    I have been trying to run PySpark on YARN for a couple of days now:
>
> http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/
>
> I posted the exception in my previous posts. It looks like I didn't do the
> configuration correctly.
>   I have googled quite a lot, and I can't find the steps that should be
> done to configure PySpark to run on YARN.
>
> Can you please share the steps (critical points) that should be configured
> to use PySpark on YARN (Hortonworks distribution):
>   Environment variables.
>   Classpath.
>   Copying jars to all machines.
>   Other configuration.
>
> Thanks
> Oleg.
>
>
