Hi Ashish,

Cool, glad it worked out. I have only used Spark clusters on EC2, which I
spin up using the spark-ec2 scripts (part of the Spark downloads), so I don't
have any experience setting up in-house clusters like the one you want to
build. But I found some documentation here that may be helpful:
https://docs.sigmoidanalytics.com/index.php/Installing_Spark_and_Setting_Up_Your_Cluster#Deploying_set_of_machines_over_SSH

There are other options in this document as well (in the previous sections),
but they require familiarity with additional tools like Chef.

Good luck,
Sujit


On Thu, Jul 9, 2015 at 10:25 PM, Ashish Dutt <ashish.du...@gmail.com> wrote:

> Hi Sujit,
> Thank you for taking the time to help me out, and a special thank you for
> your elaborate steps.
> 1) I corrected SPARK_HOME to be C:\spark-1.3.0
> 2) I installed py4j from the Anaconda command prompt, and the command you
> gave executed successfully.
> 3) I replaced python27 with python in the 00-setup script.
> These are the path variables as now defined, along with the PATH:
>
> SPARK_HOME    C:\Spark-1.3.0
> JAVA_HOME       C:\Program Files\Java\jdk1.7.0_79
> PYTHONPATH     C:\Users\Ashish Dutt\Anaconda
> MAVEN_HOME    C:\Maven\bin
> SBT_HOME         C:\SBT
> PATH                   %JAVA_HOME%\BIN; %PYTHON_PATH%; %HADOOP_HOME%\BIN;
> %SPARK_HOME%; %M2_HOME%\BIN %MAVEN_HOME%\BIN;%SBT_HOME%;
>
> 4) This time I grabbed my baseball bat (you do know why..), invoked the
> ipython notebook again, and with the other free hand slowly typed the
> command print SPARK_HOME -- it worked. Then another command, from pyspark
> import SparkContext, and it worked too!!!
> The baseball bat dropped to the ground and I quickly jabbed in the other
> commands given in the post. Attached is the screenshot, and it all
> worked... EUREKA...
>
> Sujit, a quintal of thanks for your persistence in helping me resolve this
> problem. You have been very helpful and I wish you luck and success in all
> your endeavors.
> The next milestone is to get this to work in a cluster environment.
>
> I am confused: do I need to install spark-1.3.0 on all 4 of the Linux
> machines that make up my cluster?
> The goal is to use my laptop as a client (from where I will submit Spark
> commands to the master server). The master can then distribute the job to
> the three nodes and return the end result to the client.
> Am I correct in this visualization?
>
> Once again, thank you for your efforts.
>
>
> Sincerely,
> Ashish Dutt
> PhD Candidate
> Department of Information Systems
> University of Malaya, Lembah Pantai,
> 50603 Kuala Lumpur, Malaysia
>
> On Fri, Jul 10, 2015 at 11:48 AM, Sujit Pal <sujitatgt...@gmail.com>
> wrote:
>
>> Hi Ashish,
>>
>> Julian's approach is probably better, but a few observations:
>>
>> 1) Your SPARK_HOME should be C:\spark-1.3.0 (not C:\spark-1.3.0\bin).
>>
>> 2) If you have Anaconda Python installed (I saw that you had set this up
>> in a separate thread), py4j should be part of the package - at least I
>> think so. To test this, try the following in your Python repl (see also
>> the sketch after this list):
>> >>> from py4j.java_gateway import JavaGateway
>> If it succeeds, you already have it.
>>
>> 3) In case Py4J is not installed, the best way to install a new package
>> is with easy_install or pip. Make sure your PATH is set up so that calling
>> python invokes the Anaconda version (in case you have multiple Python
>> versions installed), then run "easy_install py4j" - this will install
>> py4j correctly without any messing around on your part. Install
>> instructions for py4j are available on their site:
>> http://py4j.sourceforge.net/install.html
>>
>> 4) You should replace the "python2" in your 00-setup-script with
>> "python", so you point to the $SPARK_HOME/python directory
>> (C:\spark-1.3.0\python).
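>>
>> To tie 2) and 4) together, here is a minimal sketch of what I would run in
>> a notebook cell to sanity-check the setup (the C:\spark-1.3.0 location and
>> the py4j-0.8.2.1-src.zip file name are assumptions based on your mails,
>> not something I have verified on Windows):
>>
>> import os
>> import sys
>>
>> # Assumed install location - adjust if your SPARK_HOME differs
>> SPARK_HOME = os.environ.get("SPARK_HOME", r"C:\spark-1.3.0")
>>
>> # Point Python at Spark's bundled Python bindings
>> sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
>> # If py4j is not installed via easy_install/pip, Spark's bundled zip works too
>> sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib",
>>                                 "py4j-0.8.2.1-src.zip"))
>>
>> try:
>>     from py4j.java_gateway import JavaGateway  # just checking importability
>>     print("py4j is importable")
>> except ImportError:
>>     print("py4j missing - run 'easy_install py4j' or 'pip install py4j'")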
>>
>> -sujit
>>
>>
>> On Thu, Jul 9, 2015 at 8:26 PM, Ashish Dutt <ashish.du...@gmail.com>
>> wrote:
>>
>>> Hello Sujit,
>>> Many thanks for your response.
>>> To answer your questions;
>>> Q1) Do you have SPARK_HOME set up in your environment? - Yes, I do. It
>>> is SPARK_HOME="C:/spark-1.3.0/bin"
>>> Q2) Is there a python2 or python subdirectory under the root of your
>>> Spark installation? - Yes, I have that too; it is called python. To fix
>>> this problem, this is what I did:
>>> I downloaded py4j-0.8.2.1-src from here
>>> <https://pypi.python.org/pypi/py4j>, which was not there initially when
>>> I downloaded the Spark package from the official repository. I then put
>>> it in the lib directory as C:\spark-1.3.0\python\lib. Note that I did not
>>> extract the zip file; I put it in as it is, in the pyspark folder of the
>>> spark-1.3.0 root folder. What I did next was copy this file and put it on
>>> the PYTHONPATH, so my PYTHONPATH now reads PYTHONPATH="C:/Python27/"
>>>
>>> I then rebooted the computer and said a silent prayer :-) Then I opened
>>> the command prompt, invoked the pyspark command from the bin directory of
>>> Spark, and EUREKA, it worked :-)  Attached is the screenshot for the same.
>>> Now the problem is with the IPython notebook: I cannot get it to work
>>> with pyspark.
>>> I have a cluster with 4 nodes running CDH5.4.
>>>
>>> I was able to resolve the problem. The next challenge was to configure
>>> it with IPython. I followed the steps as documented in the blog, but I
>>> get errors; attached is the screenshot.
>>>
>>> @Julian, I tried your method too. Attached is a screenshot of the error
>>> message, 7.png.
>>>
>>> Hope you can help me out to fix this problem.
>>> Thank you for your time.
>>>
>>> Sincerely,
>>> Ashish Dutt
>>> PhD Candidate
>>> Department of Information Systems
>>> University of Malaya, Lembah Pantai,
>>> 50603 Kuala Lumpur, Malaysia
>>>
>>> On Fri, Jul 10, 2015 at 12:02 AM, Sujit Pal <sujitatgt...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ashish,
>>>>
>>>> Your 00-pyspark-setup file looks very different from mine (and from the
>>>> one described in the blog post). Questions:
>>>>
>>>> 1) Do you have SPARK_HOME set up in your environment? Because if not,
>>>> your code sets it to None. You should provide the path to your Spark
>>>> installation. In my case I have spark-1.3.1 installed under
>>>> $HOME/Software, and the code block under "# Configure the environment"
>>>> (the yellow highlight in the code below) reflects that.
>>>> 2) Is there a python2 or python subdirectory under the root of your
>>>> Spark installation? In my case it's "python", not "python2". This
>>>> contains the Python bindings for Spark, so the block under "# Add the
>>>> PySpark/py4j to the Python path" (the green highlight in the code below)
>>>> adds it to the Python sys.path so that things like pyspark.SparkContext
>>>> are accessible in your Python environment.
>>>>
>>>> import os
>>>> import sys
>>>>
>>>> # Configure the environment
>>>> if 'SPARK_HOME' not in os.environ:
>>>>     os.environ['SPARK_HOME'] = "/Users/palsujit/Software/spark-1.3.1"
>>>>
>>>> # Create a variable for our root path
>>>> SPARK_HOME = os.environ['SPARK_HOME']
>>>>
>>>> # Add the PySpark/py4j to the Python Path
>>>> sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
>>>> sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
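>>>>
>>>> Since you are on Windows, the only line I would expect to differ is the
>>>> SPARK_HOME fallback - roughly the sketch below, where C:\spark-1.3.0 is
>>>> just my guess at your install location:
>>>>
>>>> if 'SPARK_HOME' not in os.environ:
>>>>     os.environ['SPARK_HOME'] = r"C:\spark-1.3.0"  # assumed Windows location
>>>>
>>>> The two sys.path.insert lines stay the same, since os.path.join picks
>>>> the right separator for the platform.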
>>>>
>>>> Hope this fixes things for you.
>>>>
>>>> -sujit
>>>>
>>>>
>>>> On Wed, Jul 8, 2015 at 9:52 PM, Ashish Dutt <ashish.du...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Sujit,
>>>>> Thanks for your response.
>>>>>
>>>>> So I opened a new notebook using the command ipython notebook
>>>>> --profile spark and tried the sequence of commands, but I am getting
>>>>> errors. Attached is the screenshot of the same.
>>>>> I am also attaching the 00-pyspark-setup.py for your reference. It
>>>>> looks like I have written something wrong here, but I cannot seem to
>>>>> figure out what it is.
>>>>>
>>>>> Thank you for your help
>>>>>
>>>>>
>>>>> Sincerely,
>>>>> Ashish Dutt
>>>>>
>>>>> On Thu, Jul 9, 2015 at 11:53 AM, Sujit Pal <sujitatgt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ashish,
>>>>>>
>>>>>> >> Nice post.
>>>>>> Agreed, kudos to the author of the post, Benjamin Bengfort of District
>>>>>> Data Labs.
>>>>>>
>>>>>> >> Following your post, I get this problem;
>>>>>> Again, not my post.
>>>>>>
>>>>>> I did try setting up IPython with the Spark profile for the edX Intro
>>>>>> to Spark course (because I didn't want to use the Vagrant container) and 
>>>>>> it
>>>>>> worked flawlessly with the instructions provided (on OSX). I haven't used
>>>>>> the IPython/PySpark environment beyond very basic tasks since then 
>>>>>> though,
>>>>>> because my employer has a Databricks license which we were already using
>>>>>> for other stuff and we ended up doing the labs on Databricks.
>>>>>>
>>>>>> Looking at your screenshot though, I don't see why you think it's
>>>>>> picking up the default profile. One simple way to check whether things
>>>>>> are working is to open a new notebook and try this sequence of
>>>>>> commands:
>>>>>>
>>>>>> from pyspark import SparkContext
>>>>>> sc = SparkContext("local", "pyspark")
>>>>>> sc
>>>>>>
>>>>>> You should see something like this after a little while:
>>>>>> <pyspark.context.SparkContext at 0x1093c9b10>
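>>>>>>
>>>>>> If you want one more sanity check (a minimal sketch of my own, not
>>>>>> from the blog post), run a trivial job against that context:
>>>>>>
>>>>>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
>>>>>>
>>>>>> which should come back with 50 after a short pause.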
>>>>>>
>>>>>> While the context is being instantiated, you should also see lots of
>>>>>> log lines scroll by on the terminal where you started the "ipython 
>>>>>> notebook
>>>>>> --profile spark" command - these log lines are from Spark.
>>>>>>
>>>>>> Hope this helps,
>>>>>> Sujit
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 8, 2015 at 6:04 PM, Ashish Dutt <ashish.du...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Sujit,
>>>>>>> Nice post. Exactly what I had been looking for.
>>>>>>> I am relatively new to Spark and real-time data processing.
>>>>>>> We have a server running CDH5.4 with 4 nodes; the Spark version on
>>>>>>> our server is 1.3.0.
>>>>>>> On my laptop I have Spark 1.3.0 too, in a Windows 7 environment. As
>>>>>>> per point 5 of your post, I am able to invoke pyspark locally in
>>>>>>> standalone mode.
>>>>>>>
>>>>>>> Following your post, I get this problem;
>>>>>>>
>>>>>>> 1. In the section "Using Ipython notebook with spark" I cannot
>>>>>>> understand why it is picking up the default profile and not the
>>>>>>> pyspark profile. I am sure it is because of the path variables.
>>>>>>> Attached is the screenshot. Can you suggest how to solve this?
>>>>>>>
>>>>>>> Currently, the path variables on my laptop are:
>>>>>>> SPARK_HOME="C:\SPARK-1.3.0\BIN", JAVA_HOME="C:\PROGRAM
>>>>>>> FILES\JAVA\JDK1.7.0_79", HADOOP_HOME="D:\WINUTILS", 
>>>>>>> M2_HOME="D:\MAVEN\BIN",
>>>>>>> MAVEN_HOME="D:\MAVEN\BIN", PYTHON_HOME="C:\PYTHON27\", 
>>>>>>> SBT_HOME="C:\SBT\"
>>>>>>>
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Ashish Dutt
>>>>>>> PhD Candidate
>>>>>>> Department of Information Systems
>>>>>>> University of Malaya, Lembah Pantai,
>>>>>>> 50603 Kuala Lumpur, Malaysia
>>>>>>>
>>>>>>> On Thu, Jul 9, 2015 at 4:56 AM, Sujit Pal <sujitatgt...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You are welcome, Davies. Just to clarify, I didn't write the post
>>>>>>>> (not sure if my earlier post gave that impression; apologies if so),
>>>>>>>> although I agree it's great :-).
>>>>>>>>
>>>>>>>> -sujit
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu <dav...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Great post, thanks for sharing with us!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal <sujitatgt...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> > Hi Julian,
>>>>>>>>> >
>>>>>>>>> > I recently built a Python+Spark application to do search
>>>>>>>>> relevance
>>>>>>>>> > analytics. I use spark-submit to submit PySpark jobs to a Spark
>>>>>>>>> cluster on
>>>>>>>>> > EC2 (so I don't use the PySpark shell; hopefully that's what you
>>>>>>>>> are looking
>>>>>>>>> > for). Can't share the code, but the basic approach is covered in
>>>>>>>>> this blog
>>>>>>>>> > post - scroll down to the section "Writing a Spark Application".
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
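>>>>>>>>> >
>>>>>>>>> > The skeleton of such an application (a rough sketch of the
>>>>>>>>> > general pattern, not code from the post or from my project; the
>>>>>>>>> > file names are made up) looks something like:
>>>>>>>>> >
>>>>>>>>> > from pyspark import SparkConf, SparkContext
>>>>>>>>> >
>>>>>>>>> > conf = SparkConf().setAppName("wordcount")
>>>>>>>>> > sc = SparkContext(conf=conf)
>>>>>>>>> > counts = (sc.textFile("input.txt")
>>>>>>>>> >           .flatMap(lambda line: line.split())
>>>>>>>>> >           .map(lambda word: (word, 1))
>>>>>>>>> >           .reduceByKey(lambda a, b: a + b))
>>>>>>>>> > counts.saveAsTextFile("word_counts")
>>>>>>>>> > sc.stop()
>>>>>>>>> >
>>>>>>>>> > You then launch it with spark-submit (pointing --master at your
>>>>>>>>> > cluster) instead of typing it into the pyspark shell.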
>>>>>>>>> >
>>>>>>>>> > Hope this helps,
>>>>>>>>> >
>>>>>>>>> > -sujit
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > On Wed, Jul 8, 2015 at 7:46 AM, Julian <
>>>>>>>>> julian+sp...@magnetic.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Hey.
>>>>>>>>> >>
>>>>>>>>> >> Is there a resource that has written up what the necessary
>>>>>>>>> steps are for
>>>>>>>>> >> running PySpark without using the PySpark shell?
>>>>>>>>> >>
>>>>>>>>> >> I can reverse engineer (by following the tracebacks and reading
>>>>>>>>> the shell
>>>>>>>>> >> source) what the relevant Java imports are, but I would
>>>>>>>>> assume
>>>>>>>>> >> someone has attempted this before and just published something
>>>>>>>>> I can
>>>>>>>>> >> either
>>>>>>>>> >> follow or install? If not, I have something that pretty much
>>>>>>>>> works and can
>>>>>>>>> >> publish it, but I'm not a heavy Spark user, so there may be
>>>>>>>>> some things
>>>>>>>>> >> I've
>>>>>>>>> >> left out that I haven't hit because of how little of pyspark
>>>>>>>>> I'm playing
>>>>>>>>> >> with.
>>>>>>>>> >>
>>>>>>>>> >> Thanks,
>>>>>>>>> >> Julian
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
