pyspark read json file with high dimensional sparse data

2016-03-30 Thread Yavuz Nuzumlalı
Hi all,

I'm trying to read data from a JSON file using the `SQLContext.read.json()`
method.

However, the read operation does not finish. My data has 29x3100
dimensions, but it is actually very sparse, so if there is a way to
read JSON directly into a sparse DataFrame, it would work perfectly for me.

What are the alternatives for reading such data into Spark?

P.S.: When I try to load only the first 5 rows, the read operation completes
in ~2 minutes.
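
One possible alternative, as a minimal sketch only: read the file as plain
text and build sparse vectors directly, instead of letting Spark create one
DataFrame column per field. This assumes the file is in JSON Lines format (one
object per line), that all field values are numeric, and that the full list of
field names is known in advance. The names `parse_sparse_record`,
`load_sparse_rdd` and `column_names` are hypothetical and not part of any
Spark API.

```python
import json


def parse_sparse_record(line, column_index):
    """Turn one JSON record into (indices, values), keeping non-zero fields only.

    `column_index` maps a field name to its position in the vector.
    Assumption: every field value is numeric or null.
    """
    record = json.loads(line)
    pairs = sorted(
        (column_index[name], float(value))
        for name, value in record.items()
        if name in column_index and value not in (None, 0, 0.0)
    )
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return indices, values


def load_sparse_rdd(sc, path, column_names):
    """Read a JSON Lines file into an RDD of SparseVector objects."""
    # Imported lazily so the parsing helper above works without Spark installed.
    from pyspark.mllib.linalg import Vectors

    column_index = {name: i for i, name in enumerate(column_names)}
    size = len(column_names)

    def to_vector(line):
        indices, values = parse_sparse_record(line, column_index)
        return Vectors.sparse(size, indices, values)

    return sc.textFile(path).map(to_vector)
```

If a regular DataFrame is still wanted, passing an explicit `schema` to
`read.json()` skips the schema inference pass, which may also help; whether
inference is the actual bottleneck here is an assumption.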


Re: Plot DataFrame with matplotlib

2016-03-30 Thread Yavuz Nuzumlalı
Hi Teng,

Thanks for the answer. I switched to pandas during the proof-of-concept
phase so that I could plot graphs easily.

In fact, the pandas DataFrame object has its own `plot` methods, so in most
cases these objects can plot themselves easily (pandas uses matplotlib
internally).

I wonder whether the Spark DataFrame API would consider moving in that
direction, because plotting is really important during analysis, and
converting a data frame with the `toPandas()` method fails for data that
does not fit in memory.
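
A workaround for the memory limit is to downsample on the Spark side and
convert only the sample. The sketch below assumes a uniform random sample is
representative enough for the plot; `sample_fraction`, `to_plottable_pandas`
and the `max_points` parameter are hypothetical names chosen for illustration.

```python
def sample_fraction(total_rows, max_points):
    """Fraction of rows to keep so that roughly `max_points` rows remain."""
    if total_rows <= 0:
        return 1.0
    return min(1.0, float(max_points) / total_rows)


def to_plottable_pandas(df, max_points=10000, seed=42):
    """Sample a Spark DataFrame down to about `max_points` rows, then convert."""
    fraction = sample_fraction(df.count(), max_points)
    if fraction < 1.0:
        # sample(withReplacement, fraction, seed) returns an approximate sample,
        # so the result may hold slightly more or fewer rows than max_points.
        df = df.sample(False, fraction, seed)
    return df.toPandas()
```

The returned pandas DataFrame can then use its own `plot` methods as usual.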

Although I'm not very familiar with the internals, I would be glad to help
if the team considers adding such a feature.

On Wed, Mar 23, 2016 at 2:16 PM Teng Qiu  wrote:

> Hmm, then this sounds like a feature request for matplotlib: you would
> need matplotlib's APIs to support RDD or Spark DataFrame objects.
> I checked the API of mplot3d
> (http://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html#mpl_toolkits.mplot3d.Axes3D.scatter);
> it only supports "array-like" input data.
>
> So yes, to use matplotlib, you need to take the elements out of the RDD
> and send them to the plot API as a list object.
>
> 2016-03-23 12:20 GMT+01:00 Yavuz Nuzumlalı :
> > Thanks for the help, but the example you referenced collects the values
> > from the RDD as a list and plots that list.
> >
> > What I was specifically asking is whether there is a convenient way to
> > plot a DataFrame object directly (like pandas DataFrame objects).
> >
> >
> > On Wed, Mar 23, 2016 at 11:47 AM Teng Qiu  wrote:
> >>
> >> not sure about 3d plot, but there is a nice example:
> >>
> >> https://github.com/zalando/spark-appliance/blob/master/examples/notebooks/PySpark_sklearn_matplotlib.ipynb
> >>
> >> for plotting rdd or dataframe using matplotlib.
> >>
> >> On Wednesday, 23 March 2016, Yavuz Nuzumlalı wrote:
> >> > Hi all,
> >> > I'm trying to plot the result of a simple PCA operation, but couldn't
> >> > find clear documentation about plotting data frames.
> >> > Here is the output of my data frame:
> >> > +-------------------------------------------------------------+
> >> > |pca_features                                                 |
> >> > +-------------------------------------------------------------+
> >> > |[-255.4681508918886,2.9340031372956155,-0.5357914079267039]  |
> >> > |[-477.03566189308367,-6.170290817861212,-5.280827588464785]  |
> >> > |[-163.13388125540507,-4.571443623272966,-1.2349427928939671] |
> >> > |[-53.721252166903255,0.6162589419996329,-0.39569546286098245]|
> >> > |[-27.97717473880869,0.30883567826481106,-0.11159555340377557]|
> >> > |[-118.27508063853554,1.3484584740407748,-0.8088790388907207] |
> >> > The values of the `pca_features` column are DenseVectors created using
> >> > VectorAssembler.
> >> > How can I draw a simple 3D scatter plot from this data frame?
> >> > Thanks
>


Re: Plot DataFrame with matplotlib

2016-03-23 Thread Yavuz Nuzumlalı
Thanks for the help, but the example you referenced collects the values from
the RDD as a list and plots that list.

What I was specifically asking is whether there is a convenient way to plot
a DataFrame object directly (like pandas DataFrame objects).


On Wed, Mar 23, 2016 at 11:47 AM Teng Qiu  wrote:

> not sure about 3d plot, but there is a nice example:
>
> https://github.com/zalando/spark-appliance/blob/master/examples/notebooks/PySpark_sklearn_matplotlib.ipynb
>
> for plotting rdd or dataframe using matplotlib.
>
> On Wednesday, 23 March 2016, Yavuz Nuzumlalı wrote:
> > Hi all,
> > I'm trying to plot the result of a simple PCA operation, but couldn't
> > find clear documentation about plotting data frames.
> > Here is the output of my data frame:
> > +-------------------------------------------------------------+
> > |pca_features                                                 |
> > +-------------------------------------------------------------+
> > |[-255.4681508918886,2.9340031372956155,-0.5357914079267039]  |
> > |[-477.03566189308367,-6.170290817861212,-5.280827588464785]  |
> > |[-163.13388125540507,-4.571443623272966,-1.2349427928939671] |
> > |[-53.721252166903255,0.6162589419996329,-0.39569546286098245]|
> > |[-27.97717473880869,0.30883567826481106,-0.11159555340377557]|
> > |[-118.27508063853554,1.3484584740407748,-0.8088790388907207] |
> > The values of the `pca_features` column are DenseVectors created using
> > VectorAssembler.
> > How can I draw a simple 3D scatter plot from this data frame?
> > Thanks


Plot DataFrame with matplotlib

2016-03-23 Thread Yavuz Nuzumlalı
Hi all,

I'm trying to plot the result of a simple PCA operation, but couldn't find
clear documentation about plotting data frames.

Here is the output of my data frame:

+-------------------------------------------------------------+
|pca_features                                                 |
+-------------------------------------------------------------+
|[-255.4681508918886,2.9340031372956155,-0.5357914079267039]  |
|[-477.03566189308367,-6.170290817861212,-5.280827588464785]  |
|[-163.13388125540507,-4.571443623272966,-1.2349427928939671] |
|[-53.721252166903255,0.6162589419996329,-0.39569546286098245]|
|[-27.97717473880869,0.30883567826481106,-0.11159555340377557]|
|[-118.27508063853554,1.3484584740407748,-0.8088790388907207] |

The values of the `pca_features` column are DenseVectors created using
VectorAssembler.

How can I draw a simple 3D scatter plot from this data frame?

Thanks
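
For reference, a minimal sketch of the approach described in the replies:
collect the vectors to the driver and pass plain lists to matplotlib's
`Axes3D.scatter`. It assumes the collected data fits in driver memory and
that each vector has exactly three components; `split_components` and
`scatter_pca_3d` are hypothetical helper names, not Spark or matplotlib APIs.

```python
def split_components(vectors):
    """Split a sequence of 3-component vectors into separate x, y, z lists."""
    xs, ys, zs = [], [], []
    for v in vectors:
        # Plain indexing works for lists, tuples and Spark DenseVector alike.
        xs.append(float(v[0]))
        ys.append(float(v[1]))
        zs.append(float(v[2]))
    return xs, ys, zs


def scatter_pca_3d(df, column="pca_features"):
    """Collect one vector column of a Spark DataFrame and draw a 3D scatter plot."""
    # Imported lazily so split_components can be used without matplotlib.
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection)

    vectors = [row[0] for row in df.select(column).collect()]
    xs, ys, zs = split_components(vectors)

    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(xs, ys, zs)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    return fig
```

Calling `scatter_pca_3d(df)` followed by `plt.show()` (or `fig.savefig(...)`)
would display the plot; for large data frames, sampling before collecting is
advisable.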