unintended consequence of using coalesce operation

2015-09-29 Thread Lan Jiang

Hi there, I ran into an issue when using Spark (v1.3) to load an Avro file through Spark SQL. The code sample is below:
val df = sqlContext.load("path-to-avro-file", "com.databricks.spark.avro")
val myrdd = df.select("Key", "Name", "binaryfield").rdd
val results = myrdd.map(...)
val finalResults =

Re: unintended consequence of using coalesce operation

2015-09-29 Thread Michael Armbrust

coalesce is generally used to avoid launching too many tasks on a bunch of small files. Its goal is therefore to reduce parallelism (when the overhead of that parallelism costs more than the gain). You are correct that in your case repartition sounds like a better choice. On Tue, Sep 29, 2015
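
The difference described in the reply can be sketched as follows. This is a minimal illustration, not the original poster's job: the local master, the partition counts (100 and 4), and the trivial map function are all assumptions chosen for the example. By default, coalesce(n) adds no shuffle, so the preceding map is pipelined into the same n tasks; repartition(n) is coalesce(n, shuffle = true), so the map keeps its original parallelism and only the data after the shuffle lands in n partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceVsRepartition {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 4 threads, for illustration only.
    val conf = new SparkConf()
      .setAppName("coalesce-vs-repartition")
      .setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input: 100 partitions standing in for many input splits.
      val input = sc.parallelize(1 to 1000, 100)

      // Stand-in for an expensive per-record transformation.
      val mapped = input.map(x => x * 2)

      // No shuffle: the map above is pipelined into the same 4 tasks.
      val coalesced = mapped.coalesce(4)

      // Shuffle: the map still runs as 100 tasks, then data moves to 4 partitions.
      val repartitioned = mapped.repartition(4)

      // Both RDDs end up with 4 partitions.
      println(coalesced.partitions.length)
      println(repartitioned.partitions.length)

      // The lineage shows a shuffle boundary only in the repartition case.
      println(coalesced.toDebugString)
      println(repartitioned.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Comparing the two lineage dumps should make the trade-off visible: the shuffle costs extra I/O, but it keeps the upstream stage at full parallelism, which is why repartition tends to fit better when the work before it is expensive.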