Re: sqlCtx.read.parquet yields lots of small tasks

Johnny W. Tue, 10 May 2016 00:08:47 -0700

Thanks, Ashish. I've created a JIRA:
https://issues.apache.org/jira/browse/SPARK-15247


Best,
J.

On Sun, May 8, 2016 at 7:07 PM, Ashish Dubey <ashish....@gmail.com> wrote:

> I see the behavior. It always goes with the minimum total number of tasks
> possible for your settings (num-executors * num-cores). With a huge amount
> of data you will see more tasks, so that product appears to act as a lower
> bound on the number of tasks. It may require some digging. Other formats
> did not seem to have this issue.
>
> On Sun, May 8, 2016 at 12:10 AM, Johnny W. <jzw.ser...@gmail.com> wrote:
>
>> The file size is very small (< 1M). The stage launches every time I call:
>> --
>> sqlContext.read.parquet(path_to_file)
>>
>> These are the parquet specific configurations I set:
>> --
>> spark.sql.parquet.filterPushdown: true
>> spark.sql.parquet.mergeSchema: true
>>
>> Thanks,
>> J.
>>
>> On Sat, May 7, 2016 at 4:20 PM, Ashish Dubey <ashish....@gmail.com>
>> wrote:
>>
>>> How big is your file, and can you also share the code snippet?
>>>
>>>
>>> On Saturday, May 7, 2016, Johnny W. <jzw.ser...@gmail.com> wrote:
>>>
>>>> Hi spark-user,
>>>>
>>>> I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
>>>> dataframe from a parquet data source with a single parquet file, it yields
>>>> a stage with lots of small tasks. It seems the number of tasks depends on
>>>> how many executors I have instead of how many parquet files/partitions I
>>>> have. Actually, it launches 5 tasks on each executor.
>>>>
>>>> This behavior is quite strange, and it could become a problem if there
>>>> is a slow executor. What is this "parquet" stage for, and why does it
>>>> launch 5 tasks on each executor?
>>>>
>>>> Thanks,
>>>> J.
>>>>
>>>
>>
>
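
The lower bound Ashish describes (num-executors * num-cores) can be checked against a running cluster. The following is a minimal sketch, assuming PySpark with an already created SparkContext and SQLContext passed in by the caller. The helper names `expected_min_tasks` and `report_parallelism` are hypothetical and do not come from the thread or from Spark. Whether the "parquet" stage really sizes itself from the default parallelism is the hypothesis being tested here, not something the thread confirms; the task count of that stage still has to be read from the Spark UI.

```python
def expected_min_tasks(num_executors, cores_per_executor):
    """Lower bound suggested in the thread: num-executors * num-cores."""
    if num_executors < 1 or cores_per_executor < 1:
        raise ValueError("executor and core counts must be positive")
    return num_executors * cores_per_executor


def report_parallelism(sc, sql_context, path, num_executors, cores_per_executor):
    """Read one parquet file and collect the numbers worth comparing.

    `sc` and `sql_context` are an existing SparkContext and SQLContext
    (assumption: created elsewhere); nothing here starts Spark itself.
    """
    # This is the call that launches the stage discussed in the thread.
    df = sql_context.read.parquet(path)
    return {
        # Product of the cluster settings, as proposed by Ashish.
        "expected_min_tasks": expected_min_tasks(num_executors, cores_per_executor),
        # Spark's own default parallelism for this context.
        "default_parallelism": sc.defaultParallelism,
        # Partitions of the resulting DataFrame, for comparison with file count.
        "dataframe_partitions": df.rdd.getNumPartitions(),
    }
```

For example, with a hypothetical cluster of 4 executors and 5 cores each, `expected_min_tasks(4, 5)` gives 20. The "5 tasks on each executor" reported above would fit this pattern if each executor has 5 cores, which the thread does not state.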
