Out of curiosity, is there a way to pull all the data back to the driver to
save it without collect()? That is, stream the data back to the driver in
chunks so that the maximum memory used is comparable to a single node's data,
but all the data ends up saved on one node.
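
Something along these lines is what I have in mind (rough, untested sketch;
as I understand it, RDD.toLocalIterator pulls one partition at a time, so
driver memory stays bounded by the largest partition; the path is made up):

  import java.io.PrintWriter

  val out = new PrintWriter("/tmp/joined.csv")  // hypothetical local path on the driver
  joineddataframe.rdd.toLocalIterator.foreach { row =>
    out.println(row.mkString(","))  // naive CSV, no quoting/escaping
  }
  out.close()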

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 14, 2016 at 6:02:12 PM, Jacek Laskowski (ja...@japila.pl) wrote:

Hi,

Please reconsider: collect() will move the entire distributed dataset to
the single driver machine and may lead to an OOME. The better practice is
to save your result to HDFS, S3, or any other distributed filesystem that
is accessible by both the driver and the executors.
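
For example, something like this (untested; the output path is made up and
assumes HDFS is reachable from the cluster):

  joineddataframe.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("hdfs://namenode:8020/output/joined")  // hypothetical HDFS path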

If you insist...  

Use collect() after select() and work with the resulting Array[Row].
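
A rough, untested sketch (the column names and the local path are made up
for illustration):

  // Pulls the whole result into driver memory -- only safe for small results.
  val rows = joineddataframe.select("col1", "col2").collect()  // Array[Row]
  val out = new java.io.PrintWriter("/path/on/driver/result.csv")  // hypothetical local path
  out.println("col1,col2")  // header
  rows.foreach(row => out.println(row.mkString(",")))  // naive CSV, no quoting/escaping
  out.close()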

Pozdrawiam,  
Jacek Laskowski  
----  
https://medium.com/@jaceklaskowski/  
Mastering Apache Spark http://bit.ly/mastering-apache-spark  
Follow me at https://twitter.com/jaceklaskowski  


On Fri, Jul 15, 2016 at 12:15 AM, vr.n. nachiappan  
<nachiappan_...@yahoo.com.invalid> wrote:  
> Hello,  
>  
> I am using data frames to join two Cassandra tables.
>  
> Currently, when I invoke save on the data frame as shown below, it saves the
> join results on the executor nodes.
>  
> joineddataframe.select(<col1>, <col2>, ...)
>   .write
>   .format("com.databricks.spark.csv")
>   .option("header", "true")
>   .save(<path>)
>  
> I would like to persist the results of the join on the Spark master/driver
> node. Is it possible to save the results there, and if so, how?
>  
> I appreciate your help.  
>  
> Nachi  
>  

