Hi Pseudo

I do not know much about Zeppelin. What languages are you using?

I have been doing my data exploration and graphing mostly in Python, because
early on Spark had good support for Python. It is easy to collect()
data as a local pandas object. I think at this point R should work well too: you
should be able to easily collect() your data as an R data frame. I have not
tried RStudio.
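As a minimal sketch of the idea (the commented-out `df` is a placeholder for any Spark DataFrame; the pandas half runs stand-alone on the driver):

```python
# Hedged sketch: bring a small, pre-aggregated result to the driver as a
# local pandas object. `df` is a placeholder for any Spark DataFrame.
import pandas as pd

# local_df = df.toPandas()   # pulls all rows of the Spark DataFrame to the driver

# pandas itself is just a local library, so the driver-side half looks like:
local_df = pd.DataFrame({"x": [1, 2, 3], "y": [2.0, 4.0, 6.0]})
summary = local_df.describe()
print(summary.loc["mean"])
```

The key point is to aggregate or sample in Spark first, so what reaches the driver is small.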

I typically run the Jupyter notebook server in my data center. I find the
notebooks really nice. I usually use matplotlib to generate my graphs.
There are a lot of graphing packages.
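As a minimal sketch (the data and file name here are made up), plotting driver-local summary data with matplotlib can look like:

```python
# Hedged sketch: plot driver-local summary stats with matplotlib.
# The Agg backend renders straight to a file, handy on a headless server.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

xs = [1, 2, 3, 4]
ys = [x * x for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o")       # one line with point markers
ax.set_xlabel("x")
ax.set_ylabel("x squared")
fig.savefig("plot.png")           # write the chart next to the notebook
```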

Attached is the script I use to start the notebook server. This script and
process work, but are a little hacky. You call it as follows:


#
# on a machine in your cluster
#
$ cd dirWithNotebooks

# all the logs will be in startIPythonNotebook.sh.out
# nohup allows you to log in, start your notebook server, and log out.
$ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out &

#
# on your local machine
#

# because of firewalls I need to open an ssh tunnel
$ ssh -o ServerAliveInterval=120 -N -f -L localhost:8889:localhost:7000 myCluster

# connect to the notebook server using the browser of your choice
http://localhost:8889




#
# If you need to stop your notebook server, you may have to kill the server
# process; there is probably a cleaner way to do this
# $ ps -el | head -1; ps -efl | grep python
#
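One slightly cleaner option (a sketch only; `sleep 60` stands in for startIPythonNotebook.sh here) is to record the background PID at start-up and kill that directly instead of grepping ps:

```shell
# Hedged sketch: remember the server PID when you start it, so stopping it
# later does not require ps | grep. `sleep 60` is a stand-in process.
nohup sleep 60 > /dev/null 2>&1 &
echo $! > notebook.pid            # $! is the PID of the last background job

# later, to stop the server:
kill "$(cat notebook.pid)"
rm notebook.pid
```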

http://jupyter.org/


P.S. Jupyter is in the process of being released. The new JupyterLab alpha
was just announced; it looks really sweet.



From:  pseudo oduesp <pseudo20...@gmail.com>
Date:  Friday, July 22, 2016 at 2:08 AM
To:  Andrew Davidson <a...@santacruzintegration.com>
Subject:  Re: spark and plot data

> Hi Andy,
> thanks for the reply.
> What I meant is that it is hard to switch each time between the local and
> distributed concepts. For example, Zeppelin gives an easy way to interact
> with data, but it is hard to configure on a huge cluster with a lot of nodes.
> In my case I have a cluster with 69 nodes and I process a huge volume of data
> with PySpark, and it is cool, but when I want to plot a chart it is a hard job.
> 
> I sample or aggregate my results. Take for example the random forest
> algorithm in machine learning: I want to retrieve the most important features,
> but with the version already installed in our cluster (1.5.0) I can't get this.
> 
> Do you have a solution?
> 
> Thanks 
> 
> 2016-07-21 18:44 GMT+02:00 Andy Davidson <a...@santacruzintegration.com>:
>> Hi Pseudo
>> 
>> Plotting, graphing, data visualization, report generation are common needs in
>> scientific and enterprise computing.
>> 
>> Can you tell me more about your use case? What is it about the current
>> process / workflow that you think could be improved by pushing plotting (I
>> assume you mean plotting and graphing) into Spark?
>> 
>> 
>> In my personal work all the graphing is done in the driver on summary stats
>> calculated using Spark. So for me, using standard Python libs has not been a
>> problem.
>> 
>> Andy
>> 
>> From:  pseudo oduesp <pseudo20...@gmail.com>
>> Date:  Thursday, July 21, 2016 at 8:30 AM
>> To:  "user @spark" <user@spark.apache.org>
>> Subject:  spark and plot data
>> 
>>> Hi,
>>> I know Spark is an engine to compute large data sets, but I work with
>>> PySpark and it is a very wonderful machine.
>>> 
>>> My question: we don't have tools for plotting data; each time we have to
>>> switch and go back to Python to plot.
>>> But when you have a large result, a scatter plot, or a ROC curve, you can't
>>> use collect to take the data.
>>> 
>>> Does someone have a proposition for plotting?
>>> 
>>> thanks 
> 


Attachment: startIPythonNotebook.sh
Description: Binary data

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
