Re: sqlCtx.read.parquet yields lots of small tasks
Thanks, Ashish. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-15247

Best,
J.
Re: sqlCtx.read.parquet yields lots of small tasks
I see the behavior - it always goes with the minimum total number of tasks possible for your settings (num-executors * num-cores). However, if you use a huge amount of data you will see more tasks, so it seems to have some kind of lower bound on the number of tasks. It may require some digging; other formats did not seem to have this issue.
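In case it helps to confirm that pattern, here is a minimal sketch (Scala, Spark 1.6 API; the path, and the assumption that defaultParallelism lines up with the observed task count, are mine rather than from the thread):
--
// Assumes an existing SparkContext `sc` and SQLContext `sqlContext`.
// On standalone/YARN, defaultParallelism is typically the total core count,
// i.e. num-executors * executor-cores.
println(s"defaultParallelism = ${sc.defaultParallelism}")

// read.parquet launches the "parquet" stage immediately; its task count
// (visible in the web UI) can be compared against the value printed above.
val df = sqlContext.read.parquet("/path/to/single_file.parquet")  // placeholder path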
Re: sqlCtx.read.parquet yields lots of small tasks
The file size is very small (< 1M). The stage launches every time I call:
--
sqlContext.read.parquet(path_to_file)

These are the parquet-specific configurations I set:
--
spark.sql.parquet.filterPushdown: true
spark.sql.parquet.mergeSchema: true

Thanks,
J.
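Putting those pieces together, a minimal reproduction might look like the following (Scala, Spark 1.6 API; the SQLContext setup and the file path are assumptions for illustration):
--
import org.apache.spark.sql.SQLContext

// Assumed setup; in the real job the SQLContext already exists.
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")

// Reading a single, very small (< 1M) parquet file still launches a
// "parquet" stage with many small tasks on Spark 1.6.0.
val df = sqlContext.read.parquet("/path/to/small_file.parquet")  // placeholder path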
Re: sqlCtx.read.parquet yields lots of small tasks
How big is your file, and can you also share the code snippet?
sqlCtx.read.parquet yields lots of small tasks
Hi spark-user,

I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a dataframe from a parquet data source with a single parquet file, it yields a stage with lots of small tasks. It seems the number of tasks depends on how many executors I have rather than on how many parquet files/partitions I have. Actually, it launches 5 tasks on each executor.

This behavior is quite strange, and it may cause problems if there is a slow executor. What is this "parquet" stage for, and why does it launch 5 tasks on each executor?

Thanks,
J.
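For anyone trying to observe the same thing, one way to see per-stage task counts without opening the web UI is a developer-API listener along these lines (Scala; a sketch for illustration only, the path is a placeholder):
--
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}

// Prints the task count of every submitted stage, so the extra "parquet"
// stage and its task count are easy to spot.
sc.addSparkListener(new SparkListener {
  override def onStageSubmitted(stage: SparkListenerStageSubmitted): Unit = {
    val info = stage.stageInfo
    println(s"stage ${info.stageId} (${info.name}): ${info.numTasks} tasks")
  }
})

sqlCtx.read.parquet("/path/to/single_file.parquet")  // placeholder path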