Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-10 Thread Johnny W.
Thanks, Ashish. I've created a JIRA:
https://issues.apache.org/jira/browse/SPARK-15247

Best,
J.

On Sun, May 8, 2016 at 7:07 PM, Ashish Dubey  wrote:

> I see the behavior: it always launches the minimum total number of tasks
> possible for your settings (num-executors * num-cores). However, if you
> read a huge amount of data you will see more tasks, which suggests there
> is some kind of lower bound on the number of tasks. It may require some
> digging; other formats did not seem to have this issue.
>
> On Sun, May 8, 2016 at 12:10 AM, Johnny W.  wrote:
>
>> The file size is very small (< 1 MB). The stage launches every time I call:
>> --
>> sqlContext.read.parquet(path_to_file)
>>
>> These are the Parquet-specific configurations I set:
>> --
>> spark.sql.parquet.filterPushdown: true
>> spark.sql.parquet.mergeSchema: true
>>
>> Thanks,
>> J.
>>
>> On Sat, May 7, 2016 at 4:20 PM, Ashish Dubey 
>> wrote:
>>
>>> How big is your file, and can you also share the code snippet?
>>>
>>>
>>> On Saturday, May 7, 2016, Johnny W.  wrote:
>>>
 Hi spark-user,

 I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
 DataFrame from a Parquet data source with a single Parquet file, it yields
 a stage with lots of small tasks. The number of tasks seems to depend on
 how many executors I have rather than on how many Parquet files/partitions
 I have. In fact, it launches 5 tasks on each executor.

 This behavior is quite strange, and it may cause problems if there
 is a slow executor. What is this "parquet" stage for, and why does it
 launch 5 tasks on each executor?

 Thanks,
 J.

>>>
>>
>


Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-08 Thread Ashish Dubey
I see the behavior: it always launches the minimum total number of tasks
possible for your settings (num-executors * num-cores). However, if you
read a huge amount of data you will see more tasks, which suggests there
is some kind of lower bound on the number of tasks. It may require some
digging; other formats did not seem to have this issue.
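
A quick way to confirm this on a given cluster is to compare the scheduler's
default parallelism with the partition count of the DataFrame itself. A
minimal sketch (Scala, Spark 1.6; path_to_file is the placeholder from your
snippet):
--
// Load the single small file and compare the two numbers.
val df = sqlContext.read.parquet(path_to_file)
println(s"defaultParallelism = ${sc.defaultParallelism}") // roughly num-executors * num-cores
println(s"df partitions = ${df.rdd.partitions.length}")   // partitions of the loaded data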

On Sun, May 8, 2016 at 12:10 AM, Johnny W.  wrote:

> The file size is very small (< 1 MB). The stage launches every time I call:
> --
> sqlContext.read.parquet(path_to_file)
>
> These are the Parquet-specific configurations I set:
> --
> spark.sql.parquet.filterPushdown: true
> spark.sql.parquet.mergeSchema: true
>
> Thanks,
> J.
>
> On Sat, May 7, 2016 at 4:20 PM, Ashish Dubey  wrote:
>
>> How big is your file, and can you also share the code snippet?
>>
>>
>> On Saturday, May 7, 2016, Johnny W.  wrote:
>>
>>> Hi spark-user,
>>>
>>> I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
>>> DataFrame from a Parquet data source with a single Parquet file, it yields
>>> a stage with lots of small tasks. The number of tasks seems to depend on
>>> how many executors I have rather than on how many Parquet files/partitions
>>> I have. In fact, it launches 5 tasks on each executor.
>>>
>>> This behavior is quite strange, and it may cause problems if there is
>>> a slow executor. What is this "parquet" stage for, and why does it
>>> launch 5 tasks on each executor?
>>>
>>> Thanks,
>>> J.
>>>
>>
>


Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-08 Thread Johnny W.
The file size is very small (< 1 MB). The stage launches every time I call:
--
sqlContext.read.parquet(path_to_file)

These are the Parquet-specific configurations I set:
--
spark.sql.parquet.filterPushdown: true
spark.sql.parquet.mergeSchema: true
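
For reference, the same settings can be applied programmatically (a minimal
sketch, Scala, Spark 1.6). Whether mergeSchema is what triggers the extra
stage is an assumption on my part, but with a single file there is nothing
to merge, so turning it off seems worth a try:
--
// Sketch: set the Parquet options on the SQLContext (Spark 1.6).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
// Assumption: schema merging may be what launches the distributed
// footer-reading stage; disable it for the single-file case.
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")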

Thanks,
J.

On Sat, May 7, 2016 at 4:20 PM, Ashish Dubey  wrote:

> How big is your file, and can you also share the code snippet?
>
>
> On Saturday, May 7, 2016, Johnny W.  wrote:
>
>> Hi spark-user,
>>
>> I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
>> DataFrame from a Parquet data source with a single Parquet file, it yields
>> a stage with lots of small tasks. The number of tasks seems to depend on
>> how many executors I have rather than on how many Parquet files/partitions
>> I have. In fact, it launches 5 tasks on each executor.
>>
>> This behavior is quite strange, and it may cause problems if there is
>> a slow executor. What is this "parquet" stage for, and why does it
>> launch 5 tasks on each executor?
>>
>> Thanks,
>> J.
>>
>


Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-07 Thread Ashish Dubey
How big is your file, and can you also share the code snippet?

On Saturday, May 7, 2016, Johnny W.  wrote:

> Hi spark-user,
>
> I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
> DataFrame from a Parquet data source with a single Parquet file, it yields
> a stage with lots of small tasks. The number of tasks seems to depend on
> how many executors I have rather than on how many Parquet files/partitions
> I have. In fact, it launches 5 tasks on each executor.
>
> This behavior is quite strange, and it may cause problems if there is a
> slow executor. What is this "parquet" stage for, and why does it launch 5
> tasks on each executor?
>
> Thanks,
> J.
>


sqlCtx.read.parquet yields lots of small tasks

2016-05-07 Thread Johnny W.
Hi spark-user,

I am using Spark 1.6.0. When I call sqlCtx.read.parquet to create a
DataFrame from a Parquet data source with a single Parquet file, it yields
a stage with lots of small tasks. The number of tasks seems to depend on
how many executors I have rather than on how many Parquet files/partitions
I have. In fact, it launches 5 tasks on each executor.

This behavior is quite strange, and it may cause problems if there is a
slow executor. What is this "parquet" stage for, and why does it launch 5
tasks on each executor?
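
A minimal sketch of the repro (Scala, Spark 1.6; the path and the submit
settings below are placeholders, not my exact values):
--
// Submitted with e.g.: spark-submit --num-executors 4 --executor-cores 5 ...
// "/tmp/one_file.parquet" stands in for the single small Parquet file.
// The read alone (no action called) launches the "parquet" stage, and its
// task count tracks the cluster size rather than the number of files.
val df = sqlContext.read.parquet("/tmp/one_file.parquet")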

Thanks,
J.