pyspark read json file with high dimensional sparse data
Hi all, I'm trying to read data from a JSON file using the `SQLContext.read.json()` method, but the read operation never finishes. My data has 29x3100 dimensions, but it is actually very sparse, so if there is a way to read JSON directly into a sparse DataFrame, that would work perfectly for me. What are the alternatives for reading such data into Spark? P.S.: When I try to load only the first 5 rows, the read completes in ~2 minutes.
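One possible workaround, sketched here under assumptions: if the file is line-delimited JSON and each record is a flat object mapping feature names to numbers (the post does not say this), each line can be parsed by hand into a sparse `(size, indices, values)` triple instead of a wide row. The names `build_feature_index`, `to_sparse`, and the sample records are hypothetical and only illustrate the idea.

```python
import json


def build_feature_index(lines):
    """Map every feature name seen in the records to a stable column index."""
    names = set()
    for line in lines:
        names.update(json.loads(line).keys())
    return {name: i for i, name in enumerate(sorted(names))}


def to_sparse(line, feature_index):
    """Return (size, indices, values) for one JSON record, dropping zeros and nulls."""
    record = json.loads(line)
    pairs = sorted(
        (feature_index[name], float(value))
        for name, value in record.items()
        if value  # assumption: values are numeric; 0 and null are treated as absent
    )
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return len(feature_index), indices, values


# Hypothetical sample records standing in for lines of the real file.
sample_lines = [
    '{"f10": 1.5, "f3": 2.0}',
    '{"f7": 4.0}',
    '{"f3": 0, "f10": 3.0}',
]

feature_index = build_feature_index(sample_lines)
rows = [to_sparse(line, feature_index) for line in sample_lines]
print(rows)
```

In Spark, the same functions could be mapped over `sc.textFile(...)`, and each triple passed to `Vectors.sparse(size, indices, values)` from `pyspark.mllib.linalg`. Whether this is faster depends on what is actually slow here; if schema inference is the bottleneck (a guess), passing an explicit `schema` to `read.json()` may be another option worth trying.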
Re: Plot DataFrame with matplotlib
Hi Teng,

Thanks for the answer. I switched to pandas during the proof-of-concept phase so that I could plot graphs easily. A pandas DataFrame object has its own `plot` methods, so in most cases these objects can plot themselves easily (pandas uses matplotlib internally). I wonder whether the Spark DataFrame API would consider moving in that direction, because plotting is really important during analysis, and converting a data frame with the `toPandas()` method fails for data that does not fit in memory.

Although I'm not very familiar with the internals, I would be happy to help if the team considers adding such a feature.

On Wed, Mar 23, 2016 at 2:16 PM Teng Qiu wrote:
> e... then this sounds like a feature request for matplotlib: you
> need to make matplotlib's APIs support RDD or Spark DataFrame objects.
> I checked the API of mplot3d
> (http://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html#mpl_toolkits.mplot3d.Axes3D.scatter),
> and it only supports "array-like" input data.
>
> So yes, to use matplotlib, you need to take the elements out of the RDD
> and send them to the plot API as a list object.
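The approach Teng describes, taking the elements out of the DataFrame and handing them to matplotlib as plain lists, can be sketched roughly as follows. This is only an illustration: `split_columns` and `scatter_3d` are hypothetical helper names, the sample vectors are the first three rows from the original post, and in a real job they would come from something like `[row.pca_features.toArray() for row in df.select("pca_features").collect()]`, where `df` is an assumed name for the data frame. Collecting only works when the result fits in driver memory.

```python
def split_columns(vectors):
    """Turn an iterable of 3-element vectors into three coordinate lists."""
    xs, ys, zs = [], [], []
    for vector in vectors:
        # Raises ValueError if a vector does not have exactly 3 components.
        x, y, z = (float(c) for c in vector)
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def scatter_3d(xs, ys, zs, path):
    """Draw a 3D scatter plot with matplotlib and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, e.g. on a cluster node
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the 3d projection

    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(xs, ys, zs)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_zlabel("PC 3")
    fig.savefig(path)
    plt.close(fig)


# First three rows of the pca_features column from the original post.
vectors = [
    [-255.4681508918886, 2.9340031372956155, -0.5357914079267039],
    [-477.03566189308367, -6.170290817861212, -5.280827588464785],
    [-163.13388125540507, -4.571443623272966, -1.2349427928939671],
]

xs, ys, zs = split_columns(vectors)
```

Calling `scatter_3d(xs, ys, zs, "pca_scatter.png")` would then write the plot to a file (the file name is arbitrary). For larger data, sampling the data frame before collecting is one way to keep the driver from running out of memory.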
Re: Plot DataFrame with matplotlib
Thanks for the help, but the example you referenced takes the values out of the RDD as a list and plots that list.

What I was specifically asking is whether there is a convenient way to plot a DataFrame object directly (like pandas DataFrame objects).

On Wed, Mar 23, 2016 at 11:47 AM Teng Qiu wrote:
> not sure about 3D plots, but there is a nice example:
>
> https://github.com/zalando/spark-appliance/blob/master/examples/notebooks/PySpark_sklearn_matplotlib.ipynb
>
> for plotting an RDD or DataFrame using matplotlib.
Plot DataFrame with matplotlib
Hi all,

I'm trying to plot the result of a simple PCA operation, but I couldn't find clear documentation about plotting data frames.

Here is the output of my data frame:

+-------------------------------------------------------------+
|pca_features                                                 |
+-------------------------------------------------------------+
|[-255.4681508918886,2.9340031372956155,-0.5357914079267039]  |
|[-477.03566189308367,-6.170290817861212,-5.280827588464785]  |
|[-163.13388125540507,-4.571443623272966,-1.2349427928939671] |
|[-53.721252166903255,0.6162589419996329,-0.39569546286098245]|
|[-27.97717473880869,0.30883567826481106,-0.11159555340377557]|
|[-118.27508063853554,1.3484584740407748,-0.8088790388907207] |

The values of the `pca_features` column are `DenseVector`s created using `VectorAssembler`.

How can I draw a simple 3D scatter plot from this data frame?

Thanks