Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Sun Rui
You can simply save the join result in a distributed fashion, for example as an HDFS file, and then copy the HDFS file to a local file. There is an alternative, memory-efficient way to collect distributed data back to the driver other than collect(): toLocalIterator. The iterator will consume as much
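The first suggestion might be sketched as follows; the SparkSession, the `joined` DataFrame, and all paths are placeholders, not taken from the thread:

```scala
// Sketch: write the join result in parallel from the executors to HDFS,
// then merge the part files onto the driver's local disk afterwards.
// Assumes an existing DataFrame `joined` and the spark-csv package
// (as used elsewhere in this thread).
joined.write
  .option("header", "true")
  .format("com.databricks.spark.csv")
  .save("hdfs:///tmp/join_result")   // hypothetical output path

// Then, outside Spark, merge the part files to a single local file:
//   hdfs dfs -getmerge /tmp/join_result /local/path/join_result.csv
```

Because the write happens on the executors, no single node ever has to hold the full result in memory.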

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Taotao.Li
Hi, consider converting the DataFrame to an RDD and then using rdd.toLocalIterator to collect data on the driver node. On Fri, Jul 15, 2016 at 9:05 AM, Pedro Rodriguez wrote: > Out of curiosity, is there a way to pull all the data back to the driver > to save without collect()?
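A minimal sketch of this approach (the DataFrame name and output path are assumptions): rdd.toLocalIterator ships partitions to the driver one at a time, so peak driver memory is bounded by the largest partition rather than the whole dataset.

```scala
import java.io.PrintWriter

// Assumes an existing DataFrame `joined`. toLocalIterator runs one job
// per partition and streams each partition to the driver sequentially.
val out = new PrintWriter("/local/path/join_result.csv")  // hypothetical
try {
  joined.rdd.toLocalIterator.foreach { row =>
    out.println(row.mkString(","))  // naive CSV: no quoting or escaping
  }
} finally {
  out.close()
}
```

Note the trade-off: this serializes the fetch (one partition at a time), so it is slower than a distributed write, but it answers Pedro's question of saving everything on one node without collect().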

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Pedro Rodriguez
Out of curiosity, is there a way to pull all the data back to the driver to save without collect()? That is, stream the data in chunks back to the driver so that the maximum memory used is comparable to a single node's data, but all the data is saved on one node. — Pedro Rodriguez PhD Student in

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Jacek Laskowski
Hi, Please reconsider your wish, since it is going to move the entire distributed dataset to the single machine of the driver and may lead to an OOME. It's more robust to save your result to HDFS, S3, or any other distributed filesystem (one that is accessible by both the driver and the executors). If you insist...
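The recommended pattern might look like the sketch below; the `result` DataFrame, the bucket name, and the availability of an S3 connector on the classpath are all assumptions:

```scala
// Sketch: write directly to a distributed filesystem reachable by both
// the driver and every executor, avoiding any driver-side bottleneck.
result.write
  .mode("overwrite")
  .option("header", "true")
  .format("com.databricks.spark.csv")
  .save("s3a://my-bucket/join_result")  // hypothetical bucket/path
```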

Saving data frames on Spark Master/Driver

2016-07-14 Thread vr.n. nachiappan
Hello, I am using data frames to join two Cassandra tables. Currently, when I invoke save on the data frames as shown below, it saves the join results on the executor nodes.  joineddataframe.select(, ...).format("com.databricks.spark.csv").option("header", "true").save() I would like to persist the