Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-20 Thread Everett Anderson
Closing the loop on this -- It appears we were just hitting some other problem related to S3A/S3, likely that the temporary directory used by the S3A Hadoop file system implementation for buffering data during upload either was full or had the wrong permissions. On Thu, Mar 16, 2017 at 6:03 PM
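A quick way to rule out the condition described above is to check the local buffer directory on each node for write access and free space. The following is a minimal sketch, not part of the original thread. It assumes the directory is the one configured through the Hadoop property `fs.s3a.buffer.dir`. The `S3A_BUFFER_DIR` environment variable, the `check_buffer_dir` helper, and the 1 GiB threshold are hypothetical names and values chosen for illustration.

```python
import os
import shutil
import tempfile


def check_buffer_dir(path, min_free_bytes=1 << 30):
    """Return a list of problems found with a local buffer directory.

    An empty list means the directory exists, is writable by the current
    user, and has at least `min_free_bytes` of free space.
    """
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]

    problems = []
    # Creating files in a directory needs both write and execute permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")

    # Free space on the file system that holds the directory.
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"{path} has only {free} bytes free")
    return problems


if __name__ == "__main__":
    # Hypothetical: read the directory from an environment variable,
    # falling back to the system temp directory.
    path = os.environ.get("S3A_BUFFER_DIR", tempfile.gettempdir())
    for problem in check_buffer_dir(path):
        print("WARNING:", problem)
```

The check has to run as the same user as the Spark executors on every worker node, since both permissions and free space can differ between nodes.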

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Everett Anderson
Hi! On Thu, Mar 16, 2017 at 5:20 PM, Burak Yavuz wrote: > Hi Everett, > > IIRC we added unionAll in Spark 2.0 which is the same implementation as > rdd union. The union in DataFrames with Spark 2.0 does deduplication, and > that's why you should be seeing the slowdown. > I thought it was the o

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Burak Yavuz
Hi Everett, IIRC we added unionAll in Spark 2.0 which is the same implementation as rdd union. The union in DataFrames with Spark 2.0 does deduplication, and that's why you should be seeing the slowdown. Best, Burak On Thu, Mar 16, 2017 at 4:14 PM, Everett Anderson wrote: > Looks like the Dat

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Everett Anderson
Looks like the Dataset version of union may also fail with the following on larger data sets, which again seems like it might be drawing everything into the driver for some reason -- 7/03/16 22:28:21 WARN TaskSetManager: Lost task 1.0 in stage 91.0 (TID 5760, ip-10-8-52-198.us-west-2.compute.inter

Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Everett Anderson
Hi, We're using Dataset union() in Spark 2.0.2 to concatenate a bunch of tables together and save as Parquet to S3, but it seems to take a long time. We're using the S3A FileSystem implementation under the covers, too, if that helps. Watching the Spark UI, the executors all eventually stop (we're