Hi there,
I ran into an issue when using Spark (v1.3) to load an Avro file through
Spark SQL. The code sample is below:
val df = sqlContext.load("path-to-avro-file", "com.databricks.spark.avro")
val myrdd = df.select("Key", "Name", "binaryfield").rdd
val results = myrdd.map(...)
val finalResults =
coalesce is generally used to avoid launching too many tasks on a bunch
of small files. Its goal is therefore to reduce parallelism (when the
overhead of that parallelism costs more than it gains). You are
correct that in your case repartition sounds like a better choice.
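
For illustration, here is a minimal sketch of the difference, assuming a
local SparkContext and a synthetic RDD (the object name, the partition
counts, and the "local[4]" master are made up for the example and are
not taken from the original job). Without a shuffle, coalesce can only
merge existing partitions, so it cannot raise the partition count;
repartition always shuffles and can go either way.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceVsRepartition {
  // Returns the partition counts of (original, coalesced down,
  // repartitioned up, coalesced "up" without a shuffle).
  def partitionCounts(sc: SparkContext): (Int, Int, Int, Int) = {
    // Hypothetical input: 1000 numbers spread over 100 small partitions.
    val many = sc.parallelize(1 to 1000, 100)

    // coalesce merges existing partitions without a shuffle,
    // which reduces the number of tasks.
    val fewer = many.coalesce(10)

    // repartition shuffles the data and can increase parallelism.
    val more = fewer.repartition(40)

    // coalesce without shuffle cannot increase the partition count,
    // so the RDD keeps its current number of partitions.
    val unchanged = fewer.coalesce(40)

    (many.partitions.length, fewer.partitions.length,
     more.partitions.length, unchanged.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("coalesce-vs-repartition")
      .setMaster("local[4]") // assumption: run locally for the demo
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc))
    } finally {
      sc.stop()
    }
  }
}
```

So if the aim is more parallelism for an expensive map step, repartition
(or coalesce with shuffle = true) is the call to reach for; plain
coalesce only helps when there are too many small partitions.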
On Tue, Sep 29, 2015