Dear Sasha,

What I did was install the parcels on all the nodes of the cluster. Typically the location was /opt/cloudera/parcels/CDH5.4.2-1.cdh5.4.2.p0.2

Hope this helps you.

With regards,
Ashish
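For reference, a minimal sketch (not from the thread itself) of pointing a yarn-client job at a parcel-based Spark install. The parcel path is the one mentioned above; the interpreter path and app name are placeholders, so adjust them to your own cluster.

    import os
    from pyspark import SparkConf, SparkContext

    # Parcel location from the reply above; check `ls /opt/cloudera/parcels`
    # on your nodes to confirm the exact directory name.
    parcel = "/opt/cloudera/parcels/CDH5.4.2-1.cdh5.4.2.p0.2"

    # CDH parcels keep Spark under lib/spark. SPARK_HOME must be set before
    # the SparkContext is created so the driver picks up the bundled
    # pyspark/py4j libraries from the parcel.
    os.environ["SPARK_HOME"] = os.path.join(parcel, "lib", "spark")
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"  # placeholder interpreter path

    conf = (SparkConf()
            .setMaster("yarn-client")
            .setAppName("parcel-smoke-test")
            # Ask the YARN executors to use the same interpreter as the driver.
            .setExecutorEnv("PYSPARK_PYTHON", os.environ["PYSPARK_PYTHON"]))

    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(100)).sum())
    sc.stop()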
On Tue, Sep 8, 2015 at 10:18 PM, Sasha Kacanski <skacan...@gmail.com> wrote:

> Hi Ashish,
> Thanks for the update.
> I tried all of it, but what I don't get is that I run a cluster with one
> node, so presumably I should have the pyspark binaries there, as I am
> developing on the same host.
> Could you tell me where you placed the parcels or whatever cloudera is using?
> My understanding of yarn and spark is that these binaries get compressed
> and packaged with Java to be pushed to the worker node.
> Regards,
>
> On Sep 7, 2015 9:00 PM, "Ashish Dutt" <ashish.du...@gmail.com> wrote:
>
>> Hello Sasha,
>>
>> I have no answer for debian. My cluster is on Linux and I'm using CDH 5.4.
>> Your question: "Error from python worker:
>> /cube/PY/Python27/bin/python: No module named pyspark"
>>
>> On a single node (i.e. one server/machine/computer) I installed the pyspark
>> binaries and it worked. Connected it to pycharm and it worked too.
>>
>> Next I tried executing the pyspark command on another node (say the worker)
>> in the cluster and I got this error message: "Error from python worker:
>> PATH: No module named pyspark".
>>
>> My first guess was that the worker was not picking up the path of the
>> pyspark binaries installed on the server. (I tried many things, like
>> hard-coding the pyspark path in the config.sh file on the worker - NO LUCK;
>> tried a dynamic path from the code in pycharm - NO LUCK; searched the web
>> and asked the question in almost every online forum - NO LUCK; banged my
>> head several times against pyspark/hadoop books - NO LUCK. Finally, one
>> fine day a 'watermelon' dropped while brooding on this problem and I
>> installed the pyspark binaries on all the worker machines.) Now when I
>> tried executing just the command pyspark on the workers it worked. Tried
>> some simple program snippets on each worker, and they worked too.
>>
>> I am not sure if this will help or not for your use-case.
>>
>> Sincerely,
>> Ashish
>>
>> On Mon, Sep 7, 2015 at 11:04 PM, Sasha Kacanski <skacan...@gmail.com>
>> wrote:
>>
>>> Thanks Ashish,
>>> Nice blog, but it does not cover my issue. Actually I have pycharm running
>>> and loading pyspark and the rest of the libraries perfectly fine.
>>> My issue is that I am not sure what is triggering
>>>
>>> Error from python worker:
>>> /cube/PY/Python27/bin/python: No module named pyspark
>>> PYTHONPATH was:
>>> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar
>>>
>>> The question is: why is yarn not getting the python package to run on the
>>> single node via YARN?
>>> Some people say to run with Java 6 due to zip library changes between
>>> 6/7/8, some identified a bug with RH (I am on debian), and some point to
>>> documentation errors, but nothing is really clear.
>>>
>>> I have binaries for spark and hadoop, and I did just fine with the spark
>>> sql module, hive, python, pandas and yarn.
>>> Locally, as I said, the app is working fine (pandas to spark df to parquet),
>>> but as soon as I move to yarn-client mode, yarn is not getting the packages
>>> required to run the app.
>>>
>>> If someone confirms that I need to build everything from source with a
>>> specific version of the software I will do that, but at this point I am
>>> not sure what to do to remedy this situation...
>>>
>>> --sasha
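A side note on the "zip library changes between 6/7/8" remark above: assembly jars built with Java 7+ can use the zip64 format, which Python's zipimport cannot read, so importing pyspark straight out of spark-assembly-1.4.1-hadoop2.6.0.jar may fail even though the jar is on PYTHONPATH. One workaround worth trying on a one-node cluster is to point the executors' PYTHONPATH back at the local Spark python sources instead of the jar. This is only a sketch under assumptions, not a confirmed fix; it assumes SPARK_HOME is set and that the install has the stock Spark 1.4 layout (pyspark sources under python/, the py4j zip under python/lib/).

    import glob
    import os
    from pyspark import SparkConf, SparkContext

    # Assumed layout: SPARK_HOME/python holds the pyspark sources and
    # SPARK_HOME/python/lib holds the py4j zip (true for stock Spark 1.4.x).
    spark_home = os.environ["SPARK_HOME"]
    py4j_zip = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))[0]
    python_path = os.pathsep.join([os.path.join(spark_home, "python"), py4j_zip])

    conf = (SparkConf()
            .setMaster("yarn-client")
            .setAppName("executor-pythonpath-check")
            # spark.executorEnv.<NAME> exports <NAME> into the executors'
            # environment; on a one-node cluster these local paths also
            # exist on the host where the executor runs.
            .set("spark.executorEnv.PYTHONPATH", python_path))

    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).map(lambda x: x * x).collect())
    sc.stop()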
>>>
>>> On Sun, Sep 6, 2015 at 8:27 PM, Ashish Dutt <ashish.du...@gmail.com>
>>> wrote:
>>>
>>>> Hi Aleksandar,
>>>> Quite some time ago I faced the same problem and I found a solution,
>>>> which I have posted here on my blog
>>>> <https://edumine.wordpress.com/category/apache-spark/>.
>>>> See if that can help you, and if it does not, then you can check out
>>>> these questions & solutions on the stackoverflow
>>>> <http://stackoverflow.com/search?q=no+module+named+pyspark> website.
>>>>
>>>> Sincerely,
>>>> Ashish Dutt
>>>>
>>>> On Mon, Sep 7, 2015 at 7:17 AM, Sasha Kacanski <skacan...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I am successfully running a python app via pyCharm in local mode with
>>>>> setMaster("local[*]").
>>>>>
>>>>> When I turn on SparkConf().setMaster("yarn-client")
>>>>>
>>>>> and run via
>>>>>
>>>>> spark-submit PysparkPandas.py
>>>>>
>>>>> I run into this issue:
>>>>> Error from python worker:
>>>>> /cube/PY/Python27/bin/python: No module named pyspark
>>>>> PYTHONPATH was:
>>>>> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar
>>>>>
>>>>> I am running java:
>>>>> hadoop@pluto:~/pySpark$ /opt/java/jdk/bin/java -version
>>>>> java version "1.8.0_31"
>>>>> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
>>>>> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>>>>>
>>>>> Should I try the same thing with Java 6/7?
>>>>>
>>>>> Is this a packaging issue, or do I have something wrong in the
>>>>> configuration?...
>>>>>
>>>>> Regards,
>>>>>
>>>>> --
>>>>> Aleksandar Kacanski
>>>>
>>>
>>> --
>>> Aleksandar Kacanski
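For completeness, a minimal sketch of the local[*] vs yarn-client switch described in the original question, with the master left out of the code so the same script can be launched from pyCharm or from spark-submit. The job body is a placeholder, not the actual pandas-to-parquet app.

    from pyspark import SparkConf, SparkContext

    # Leave the master out of the code so the same script runs both ways:
    #   local debugging:     spark-submit --master "local[*]" PysparkPandas.py
    #   against the cluster: spark-submit --master yarn-client PysparkPandas.py
    # (PysparkPandas.py is the script name used in the question above.)
    conf = SparkConf().setAppName("PysparkPandas")
    sc = SparkContext(conf=conf)

    try:
        # Placeholder work, standing in for the real pandas -> Spark DF -> parquet flow.
        counts = sc.parallelize(range(1000)).map(lambda x: x % 7).countByValue()
        print(counts)
    finally:
        sc.stop()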