Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Sun Rui
You can simply save the join result in a distributed manner, for example as an HDFS file, and then copy the HDFS file to a local file.
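For the copy-to-local step: once the part files are on the driver's local filesystem (e.g., via hdfs dfs -getmerge or hdfs dfs -copyToLocal), they can be concatenated into one file. A minimal pure-Python sketch, with hypothetical paths:

```python
import glob
import os
import shutil

def merge_part_files(parts_dir: str, dest_path: str) -> None:
    """Concatenate Spark-style part files (part-00000, part-00001, ...)
    from a local directory into a single output file, in partition order."""
    with open(dest_path, "wb") as dest:
        for part in sorted(glob.glob(os.path.join(parts_dir, "part-*"))):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dest)

# merge_part_files("/local/copy/of/join_result", "/local/join_result.csv")
```

One caveat: with option("header", "true") each part file carries its own header line, so a naive concatenation repeats the header; you may need to drop all but the first.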

There is a memory-efficient alternative to collect() for bringing distributed data back to the driver: toLocalIterator(). The iterator consumes only as much memory as the largest partition in your dataset.

You can use DataFrame.rdd.toLocalIterator() with Spark versions prior to 2.0. 
You can use Dataset.toLocalIterator() with Spark 2.0. 

For details, refer to https://issues.apache.org/jira/browse/SPARK-14334 
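To illustrate the memory pattern (a pure-Python sketch, not the Spark API): toLocalIterator fetches one partition at a time, so the driver holds at most one partition in memory while streaming rows out, e.g., to a local CSV file.

```python
import csv
from typing import Iterator, List, Tuple

# Stand-in for a distributed dataset: a list of partitions,
# each partition being a list of rows.
Row = Tuple[int, str]

def to_local_iterator(partitions: List[List[Row]]) -> Iterator[Row]:
    """Yield rows one partition at a time, mimicking RDD.toLocalIterator():
    peak driver memory ~ one partition, not the whole dataset."""
    for partition in partitions:  # in Spark, each partition is fetched on demand
        for row in partition:
            yield row

def save_locally(partitions: List[List[Row]], path: str) -> None:
    """Stream rows through the iterator into a single local CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        for row in to_local_iterator(partitions):
            writer.writerow(row)
```

In real Spark code the equivalent loop would iterate over df.rdd.toLocalIterator() (pre-2.0) or Dataset.toLocalIterator() (2.0), writing each row to the local file as it arrives.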





Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Taotao.Li
Hi, consider converting the DataFrame to an RDD and then using rdd.toLocalIterator to
collect the data on the driver node.



-- 
___
Quant | Engineer | Boy
___
blog: http://litaotao.github.io
github: www.github.com/litaotao


Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Pedro Rodriguez
Out of curiosity, is there a way to pull all the data back to the driver to
save without collect()? That is, stream the data in chunks back to the driver
so that the maximum memory used is comparable to a single node's data, but all
the data is saved on one node.

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn




Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Jacek Laskowski
Hi,

Please reconsider: this will move the entire distributed dataset to the
single driver machine and may cause an OutOfMemoryError (OOME). It is
better practice to save your result to HDFS, S3, or another distributed
filesystem that is accessible by both the driver and the executors.

If you insist...

Use collect() after select() and work with Array[T].
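For contrast with the streaming approaches elsewhere in this thread, collect() materializes every partition on the driver at once, so peak driver memory scales with the full result. A pure-Python sketch of that behavior (hypothetical partition data, not the Spark API):

```python
from typing import List, Tuple

Row = Tuple[int, str]

def collect(partitions: List[List[Row]]) -> List[Row]:
    """Mimics RDD.collect(): every partition's rows land in one
    driver-side list, so peak driver memory is proportional to the
    whole dataset rather than to a single partition."""
    result: List[Row] = []
    for partition in partitions:
        result.extend(partition)
    return result
```

This is fine when select() has projected the result down to something small; for large joins, a distributed save is the safer default.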

Best regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 15, 2016 at 12:15 AM, vr.n. nachiappan
 wrote:
> Hello,
>
> I am using data frames to join two cassandra tables.
>
> Currently when i invoke save on data frames as shown below it is saving the
> join results on executor nodes.
>
> joineddataframe.select(, 
> ...).format("com.databricks.spark.csv").option("header",
> "true").save()
>
> I would like to persist the results of the join on Spark Master/Driver node.
> Is it possible to save the results on Spark Master/Driver and how to do it.
>
> I appreciate your help.
>
> Nachi
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Saving data frames on Spark Master/Driver

2016-07-14 Thread vr.n. nachiappan
Hello,
I am using data frames to join two Cassandra tables.
Currently, when I invoke save on the data frame as shown below, it saves the
join results on the executor nodes:
joineddataframe.select(<column1>, <column2> ...).format("com.databricks.spark.csv").option("header", "true").save()
I would like to persist the results of the join on the Spark Master/Driver node.
Is it possible to save the results on the Spark Master/Driver, and if so, how?
I appreciate your help.
Nachi