Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-09 Thread Ashish Dutt
Dear Sasha,

What I did was install the parcels on all the nodes of the cluster. The
typical location was
/opt/cloudera/parcels/CDH5.4.2-1.cdh5.4.2.p0.2
Hope this helps you.

With regards,
Ashish
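
For reference, a minimal sketch of pointing a yarn-client PySpark job at the
parcel location mentioned above and at a single Python interpreter on every
node. Only the parcel path comes from this thread; the interpreter path and
app name are made up, and the environment settings are the commonly suggested
ones rather than anything confirmed here:

    import os
    from pyspark import SparkConf, SparkContext

    # The parcel path is the one from the message above; the rest is illustrative.
    PARCEL = "/opt/cloudera/parcels/CDH5.4.2-1.cdh5.4.2.p0.2"
    PYTHON = "/usr/bin/python2.7"  # hypothetical: a Python present on every node

    # Spark inside a CDH parcel typically lives under <parcel>/lib/spark (assumption).
    os.environ.setdefault("SPARK_HOME", os.path.join(PARCEL, "lib", "spark"))
    os.environ["PYSPARK_PYTHON"] = PYTHON  # interpreter used on the driver side

    conf = (SparkConf()
            .setMaster("yarn-client")
            .setAppName("parcel-env-check")
            # Ask the application master and the executors to use the same Python.
            .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON", PYTHON)
            .set("spark.executorEnv.PYSPARK_PYTHON", PYTHON))

    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(100)).sum())  # trivial job to exercise the Python workers
    sc.stop()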



Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-08 Thread Sasha Kacanski
Hi Ashish,
Thanks for the update.
I tried all of it, but what I don't get is that I run the cluster with one
node, so presumably I should have the PySpark binaries there, since I am
developing on the same host.
Could you tell me where you placed the parcels, or whatever it is Cloudera is using?
My understanding of YARN and Spark is that these binaries get compressed
and packaged with Java to be pushed to the worker node.
Regards,
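
For what it is worth, one way to make that packaging explicit, instead of
relying on whatever YARN ships, is to hand Spark the pyspark and py4j zip
archives yourself. A minimal sketch, assuming a stock Spark 1.4.1 layout under
SPARK_HOME (paths and app name are assumptions); whether this resolves the
worker-side import depends on how the containers end up building their PYTHONPATH:

    import glob
    import os
    from pyspark import SparkConf, SparkContext

    # Assumed stock Spark 1.4.1 layout; point SPARK_HOME at the real install.
    spark_home = os.environ.get("SPARK_HOME", "/opt/spark-1.4.1-bin-hadoop2.6")
    # In 1.4.x this usually matches pyspark.zip and py4j-0.8.2.1-src.zip.
    py_archives = glob.glob(os.path.join(spark_home, "python", "lib", "*.zip"))

    conf = SparkConf().setMaster("yarn-client").setAppName("ship-pyspark-zips")
    # pyFiles are shipped to the cluster and added to the workers' PYTHONPATH,
    # so executors do not have to rely on a node-local pyspark install.
    sc = SparkContext(conf=conf, pyFiles=py_archives)
    print(sc.parallelize([1, 2, 3]).count())
    sc.stop()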


Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-07 Thread Ashish Dutt
Hello Sasha,

I have no answer for Debian. My cluster is on Linux and I'm using CDH 5.4.
Your question: "Error from python worker:
  /cube/PY/Python27/bin/python: No module named pyspark"

On a single node (i.e. one server/machine/computer) I installed the pyspark
binaries and it worked. I connected it to PyCharm and that worked too.

Next I tried executing the pyspark command on another node (say, a worker) in
the cluster and got the error message "Error from python worker: PATH:
No module named pyspark".

My first guess was that the worker was not picking up the path of the pyspark
binaries installed on the server. I tried many things: hard-coding the pyspark
path in the config.sh file on the worker - no luck; setting the path
dynamically from the code in PyCharm - no luck; searching the web and asking
the question in almost every online forum - no luck; banging my head against
pyspark/hadoop books - no luck. Finally, one fine day a 'watermelon' dropped
while I was brooding on this problem, and I installed the pyspark binaries on
all the worker machines. Now when I execute just the pyspark command on the
workers it works, and simple program snippets run on each worker too.

I am not sure if this will help your use case or not.



Sincerely,
Ashish
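
A quick way to confirm the same thing per node, without going through YARN at
all, is to run a small check on each worker host with the exact Python the
workers use. The interpreter path below is the one from the error message; the
script itself is only an illustrative sketch:

    # check_pyspark_node.py -- run on each worker host as, for example:
    #   /cube/PY/Python27/bin/python check_pyspark_node.py
    import os
    import sys

    print("python     : %s" % sys.executable)
    print("PYTHONPATH : %s" % os.environ.get("PYTHONPATH", "<not set>"))
    print("SPARK_HOME : %s" % os.environ.get("SPARK_HOME", "<not set>"))

    # The module list is illustrative; add whatever the jobs actually need.
    for name in ("pyspark", "py4j", "pandas"):
        try:
            __import__(name)
            print("OK   import %s" % name)
        except ImportError as exc:
            print("FAIL import %s (%s)" % (name, exc))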

On Mon, Sep 7, 2015 at 11:04 PM, Sasha Kacanski  wrote:

> Thanks Ashish,
> Nice blog, but it does not cover my issue. Actually, I have PyCharm running
> and loading pyspark and the rest of the libraries perfectly fine.
> My issue is that I am not sure what is triggering:
>
> Error from python worker:
>   /cube/PY/Python27/bin/python: No module named pyspark
> PYTHONPATH was:
> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar
>
> The question is: why is YARN not getting the Python package to run on the
> single node?
> Some people say to run with Java 6 due to zip library changes between 6/7/8,
> some identified a bug on Red Hat (I am on Debian), and some point to
> documentation errors, but nothing is really clear.
>
> I have the binaries for Spark and Hadoop, and I did just fine with the Spark
> SQL module, Hive, Python, pandas and YARN.
> Locally, as I said, the app is working fine (pandas to Spark DataFrame to
> Parquet), but as soon as I move to yarn-client mode YARN is not getting the
> packages required to run the app.
>
> If someone confirms that I need to build everything from source against a
> specific version of the software I will do that, but at this point I am not
> sure what to do to remedy this situation...
>
> --sasha
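
On the Java 6/7/8 zip question above: assembly jars built with newer JDKs can
end up in a zip format that older Python zipimport cannot read, which would
explain a worker whose only PYTHONPATH entry is that jar failing to import
pyspark. A hedged check, run with the worker's Python; only the jar path is
taken from the quoted error:

    # check_assembly_zipimport.py -- run as, for example:
    #   /cube/PY/Python27/bin/python check_assembly_zipimport.py
    import sys

    # Jar path copied from the PYTHONPATH in the error above; the filecache id (18)
    # changes between runs, so adjust it to whatever is actually on disk.
    JAR = ("/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/"
           "spark-assembly-1.4.1-hadoop2.6.0.jar")

    sys.path.insert(0, JAR)
    try:
        import pyspark
        print("pyspark loaded from %s" % pyspark.__file__)
    except ImportError as exc:
        # Failure here means either the assembly does not contain the pyspark
        # modules at all, or it is a zip64-style jar that this Python's zipimport
        # cannot read (the Java 6 vs 7/8 build issue mentioned in the thread).
        print("could not import pyspark from the jar: %s" % exc)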


Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-06 Thread Ashish Dutt
Hi Aleksandar,
Quite some time ago I faced the same problem and found a solution, which I
have posted here on my blog.
See if that helps you; if it does not, you can check out these questions &
solutions on the stackoverflow website.


Sincerely,
Ashish Dutt




hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-06 Thread Sasha Kacanski
Hi,
I am successfully running a Python app via PyCharm in local mode with
setMaster("local[*]")

When I switch to SparkConf().setMaster("yarn-client")

and run via

spark-submit PysparkPandas.py

I run into this issue:
Error from python worker:
  /cube/PY/Python27/bin/python: No module named pyspark
PYTHONPATH was:
/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar

I am running Java:
hadoop@pluto:~/pySpark$ /opt/java/jdk/bin/java -version
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

Should I try the same thing with Java 6/7?

Is this a packaging issue, or do I have something wrong in my configuration ...

Regards,

-- 
Aleksandar Kacanski
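
For context, a minimal sketch of the kind of driver described above, with the
master switchable between the working local[*] setup and the failing
yarn-client one. The real PysparkPandas.py is not shown in this thread, so the
file name, data and column names here are made up:

    # minimal_pyspark_pandas.py -- a stand-in for PysparkPandas.py; submit with e.g.
    #   spark-submit minimal_pyspark_pandas.py yarn-client
    import sys

    import pandas as pd
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    master = sys.argv[1] if len(sys.argv) > 1 else "local[*]"

    conf = SparkConf().setMaster(master).setAppName("pandas-to-parquet")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Made-up data; the real app reads its own source.
    pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
    sdf = sqlContext.createDataFrame(pdf)        # pandas -> Spark DataFrame
    sdf.write.parquet("/tmp/pandas_to_parquet")  # Spark DataFrame -> Parquet

    sc.stop()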