Closing the loop on this --
It appears we were actually hitting a different problem related to S3A/S3:
most likely, the temporary directory that the S3A Hadoop FileSystem
implementation uses to buffer data during upload was either full or had the
wrong permissions.
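For anyone hitting the same symptom: that buffer directory is controlled by
the Hadoop property fs.s3a.buffer.dir (by default it lives under
hadoop.tmp.dir). A sketch of pointing it at a known-good volume via
spark-defaults.conf — the /mnt/tmp/s3a path here is only an example, pick a
volume with enough free space that the Spark user can write to:

```
# spark-defaults.conf -- spark.hadoop.* properties are passed through to the
# Hadoop configuration. The path below is an example, not a recommendation.
spark.hadoop.fs.s3a.buffer.dir  /mnt/tmp/s3a
```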
On Thu, Mar 16, 2017 at 6:03 PM
Hi!
On Thu, Mar 16, 2017 at 5:20 PM, Burak Yavuz wrote:
> Hi Everett,
>
> IIRC we added unionAll in Spark 2.0 which is the same implementation as
> RDD union. The union in DataFrames with Spark 2.0 does deduplication, and
> that's why you should be seeing the slowdown.
>
I thought it was the o
Hi Everett,
IIRC we added unionAll in Spark 2.0 which is the same implementation as RDD
union. The union in DataFrames with Spark 2.0 does deduplication, and
that's why you should be seeing the slowdown.
Best,
Burak
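Not Spark code, but a rough analogy for the cost difference the message above
describes: plain concatenation keeps duplicates and does no comparisons, while
a deduplicating union has to hash and compare every row (in a distributed
engine, that implies a shuffle). A hypothetical Python sketch:

```python
a = [1, 2, 3]
b = [3, 4]

# Plain concatenation ("unionAll" semantics): duplicates survive, linear time,
# no row-to-row comparisons needed.
union_all = a + b                      # [1, 2, 3, 3, 4]

# Deduplicating union ("union" + distinct semantics): every element must be
# hashed and compared against the ones already seen; dict.fromkeys keeps the
# first occurrence and preserves order.
deduped = list(dict.fromkeys(a + b))   # [1, 2, 3, 4]
```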
On Thu, Mar 16, 2017 at 4:14 PM, Everett Anderson
wrote:
> Looks like the Dat
Looks like the Dataset version of union may also fail with the following on
larger data sets, which again seems like it might be drawing everything
into the driver for some reason --
17/03/16 22:28:21 WARN TaskSetManager: Lost task 1.0 in stage 91.0 (TID
5760, ip-10-8-52-198.us-west-2.compute.inter
Hi,
We're using Dataset union() in Spark 2.0.2 to concatenate a bunch of tables
together and save as Parquet to S3, but it seems to take a long time. We're
using the S3A FileSystem implementation under the covers, too, if that
helps.
Watching the Spark UI, the executors all eventually stop (we're