I see the behavior: the read always launches the minimum total number of tasks your settings allow (num-executors * num-cores). However, if you read a huge amount of data you will see more tasks, which suggests that figure is a lower bound on the task count rather than a fixed number. It may require some digging; other formats did not seem to have this issue.
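A quick way to test that correlation (a minimal sketch, not your exact setup; the app name and file path are placeholders):

--
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Placeholder setup; in your app, reuse your existing SparkContext.
val sc = new SparkContext(new SparkConf().setAppName("parquet-task-check"))
val sqlContext = new SQLContext(sc)

// On YARN, defaultParallelism is roughly num-executors * executor-cores,
// i.e. the task total you are seeing for the "parquet" stage.
println(s"defaultParallelism = ${sc.defaultParallelism}")

// "/path/to/file.parquet" stands in for your single parquet file.
val df = sqlContext.read.parquet("/path/to/file.parquet")
println(s"scan partitions = ${df.rdd.partitions.length}")
--

If the stage's task count tracks defaultParallelism as you vary executors and cores, that points at a parallelism-based lower bound rather than anything about the file itself.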
On Sun, May 8, 2016 at 12:10 AM, Johnny W. <jzw.ser...@gmail.com> wrote:

> The file size is very small (< 1M). The stage launches every time I call:
> --
> sqlContext.read.parquet(path_to_file)
>
> These are the parquet-specific configurations I set:
> --
> spark.sql.parquet.filterPushdown: true
> spark.sql.parquet.mergeSchema: true
>
> Thanks,
> J.
>
> On Sat, May 7, 2016 at 4:20 PM, Ashish Dubey <ashish....@gmail.com> wrote:
>
>> How big is your file, and can you also share the code snippet?
>>
>> On Saturday, May 7, 2016, Johnny W. <jzw.ser...@gmail.com> wrote:
>>
>>> hi spark-user,
>>>
>>> I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
>>> dataframe from a parquet data source with a single parquet file, it
>>> yields a stage with lots of small tasks. It seems the number of tasks
>>> depends on how many executors I have instead of how many parquet
>>> files/partitions I have. Actually, it launches 5 tasks on each executor.
>>>
>>> This behavior is quite strange, and may cause a problem if there is a
>>> slow executor. What is this "parquet" stage for? And why does it launch
>>> 5 tasks on each executor?
>>>
>>> Thanks,
>>> J.
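One hypothesis worth testing, given the configurations quoted above: with spark.sql.parquet.mergeSchema set to true, Spark 1.6 runs a separate distributed job to read and merge Parquet footers at read time, and (as far as I can tell) its parallelism is driven by the cluster's default parallelism rather than by the number of files. If that is what this stage is, disabling merging for a single file should make it disappear. A sketch, reusing sqlContext from the snippet above (the path is again a placeholder):

--
// Per-read override: a single file has nothing to merge, so this should
// skip the distributed footer-merging job.
val noMerge = sqlContext.read
  .option("mergeSchema", "false")
  .parquet("/path/to/file.parquet")

// Or disable schema merging globally for the session:
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")
--

If the "parquet" stage no longer appears after this change, the footer-merging job was the culprit.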