Spark 2.0.2 Dataset union() slowness vs RDD union?

Everett Anderson Thu, 16 Mar 2017 14:55:40 -0700

Hi,

We're using Dataset union() in Spark 2.0.2 to concatenate a bunch of tables
together and save as Parquet to S3, but it seems to take a long time. We're
using the S3A FileSystem implementation under the covers, too, if that
helps.


Watching the Spark UI, the executors all eventually stop (we're using
dynamic allocation) but under the SQL tab we can see a "save at
NativeMethodAccessorImpl.java:-2" in Running Queries. The driver is still
running of course, but it may take tens of minutes to finish. It makes me
wonder if our data is all being collected through the driver.

If we instead convert the Datasets to RDDs and call SparkContext.union() it
works quickly.
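
For reference, the two approaches look roughly like the sketch below. It is written for PySpark, which is an assumption based on the NativeMethodAccessorImpl call site; the function names `union_datasets` and `union_via_rdd` and the parameters `spark` and `dfs` are made up for illustration, and it assumes every table has the same schema in the same column order.

```python
from functools import reduce


def union_datasets(dfs):
    """Chain DataFrame.union() pairwise: the approach that is slow for us."""
    return reduce(lambda left, right: left.union(right), dfs)


def union_via_rdd(spark, dfs):
    """Union the underlying RDDs with one SparkContext.union() call,
    then rebuild a DataFrame using the first input's schema.

    Assumes all DataFrames in `dfs` share one schema (same columns,
    same order); nothing here checks that.
    """
    if not dfs:
        raise ValueError("need at least one DataFrame")
    # One flat union over all inputs instead of a chain of pairwise unions.
    rdd = spark.sparkContext.union([df.rdd for df in dfs])
    return spark.createDataFrame(rdd, schema=dfs[0].schema)
```

Here `spark` would be the active SparkSession and `dfs` the list of tables to concatenate; the result of either function can then be written out with `.write.parquet(...)` as before.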

Anyone know if this is a known issue?
