Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Pooja Jain
The join is happening successfully, as I am able to do count() after the join. The error comes only while trying to write in parquet format on HDFS. Thanks, Pooja. On Wed, Jul 1, 2015 at 1:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It says: Caused by: java.net.ConnectException:

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Raghavendra Pandey
By any chance, are you using a time field in your df? Time fields are known to be notorious in RDD conversion. On Jul 1, 2015 6:13 PM, Pooja Jain pooja.ja...@gmail.com wrote: The join is happening successfully, as I am able to do count() after the join. The error comes only while trying to write in
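
If a time field does turn out to be the culprit, one common workaround is to cast it away before writing. A minimal sketch in Spark 1.4 Scala, assuming a hypothetical TimestampType column named event_time (the thread does not show the schema or the output path):

    // Assumes df: DataFrame is the joined result, with a hypothetical
    // TimestampType column "event_time". Write a string copy of it instead,
    // so the parquet writer never sees the timestamp type.
    val safe = df
      .withColumn("event_time_str", df("event_time").cast("string"))
      .drop("event_time")
    safe.write.parquet("hdfs:///data/merged_no_ts")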

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Michael Armbrust
I would still look at your executor logs. A count() is rewritten by the optimizer to be much more efficient because you don't actually need any of the columns. Also, writing parquet allocates quite a few large buffers. On Wed, Jul 1, 2015 at 5:42 AM, Pooja Jain pooja.ja...@gmail.com wrote:
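
A passing count() therefore proves little about the write path: column pruning lets count() avoid materializing any columns, while the parquet writer must buffer a whole row group per open file in memory. If those buffers are the problem, shrinking the row-group size is one lever. A hedged sketch, assuming the standard parquet-mr Hadoop property parquet.block.size (the 64 MB value is illustrative, not from the thread):

    // Assumes df: DataFrame is the joined result and sc is the SparkContext.
    // parquet-mr buffers one row group per open file; lowering
    // "parquet.block.size" from its 128 MB default reduces that footprint
    // at the cost of more, smaller row groups.
    sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
    df.write.parquet("hdfs:///data/merged")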

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Akhil Das
It says: Caused by: java.net.ConnectException: Connection refused: slave2/...:54845 Could you look in the executor logs (stderr on slave2) and see what made it shut down? Since you are doing a join, there's a high possibility of an OOM etc. Thanks Best Regards On Wed, Jul 1, 2015 at 10:20 AM,
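
In yarn-cluster mode the container's stderr may only be reachable after the fact via yarn logs -applicationId <appId>. If the logs do show an OOM, raising executor memory and the YARN memory overhead is the usual first step. A sketch with placeholder sizes, using the Spark 1.4-era property names:

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder sizes, not from the thread: more executor heap plus extra
    // off-heap headroom for YARN, so a join followed by a parquet write is
    // less likely to get the container killed for exceeding memory limits.
    val conf = new SparkConf()
      .setAppName("join-then-parquet-write")
      .set("spark.executor.memory", "6g")                // executor heap
      .set("spark.yarn.executor.memoryOverhead", "1024") // off-heap headroom, MB
    val sc = new SparkContext(conf)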

Issue with parquet write after join (Spark 1.4.0)

2015-06-30 Thread Pooja Jain
Hi, We are using Spark 1.4.0 on Hadoop in yarn-cluster mode via spark-submit. We are facing a parquet write issue after doing dataframe joins. We have a full data set and then incremental data. We are reading them as dataframes, joining them, and then writing the data to the HDFS system in
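
The post is truncated here, but the pipeline it describes is roughly the following. A minimal Spark 1.4 Scala sketch; the paths, column names, and join key are hypothetical, since the actual code is not shown in the thread:

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext sc (launched in yarn-cluster mode).
    val sqlContext = new SQLContext(sc)

    val full = sqlContext.read.parquet("hdfs:///data/full")        // full data set
    val incr = sqlContext.read.parquet("hdfs:///data/incremental") // incremental data

    // Join key "id" is hypothetical.
    val joined = full.join(incr, full("id") === incr("id"))

    joined.count()                              // succeeds, per the thread
    joined.write.parquet("hdfs:///data/merged") // fails with ConnectException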