Hi spark-user, I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a DataFrame from a Parquet data source containing a single Parquet file, it yields a stage with lots of small tasks. The number of tasks seems to depend on how many executors I have rather than on how many Parquet files/partitions there are; in fact, it launches 5 tasks on each executor.
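For reference, here is a minimal sketch of what I'm doing (the path is a placeholder for my real data location; context setup follows the standard Spark 1.6 PySpark pattern):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="parquet-read-test")
    sqlCtx = SQLContext(sc)

    # Read a Parquet data source that contains a single file.
    # The path below is just an example, not my actual location.
    df = sqlCtx.read.parquet("hdfs:///data/single_file.parquet")

    # Even a trivial action triggers the "parquet" stage, and the
    # task count tracks (num executors * 5) instead of the number
    # of files/partitions in the source.
    print(df.count())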
This behavior seems strange, and it could cause problems if one executor is slow. What is this "parquet" stage for, and why does it launch 5 tasks on each executor? Thanks, J.