[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056118#comment-16056118
 ] 

Thomas Graves commented on SPARK-21082:
---

Yes executor should not oom if you are trying to cache to much data there.  OOM 
generally comes from the task side processing data.

Like mentioned there are a lot of reasons for cached data to be skewed.  How 
are you proposing spark to figure this out?  Most of the time spark is not 
going to know how much data a task is going to generate and cache.
Generally its just a users tasks that reads some data and then caching it. That 
could just be naturally skewed if the user doesn't have customer partitioner or 
handle that skew. 

I guess what you are saying is to try to distribute the tasks equally among the 
executors which COULD result in a more equal distribution of cached data.  I 
would normally expect this to happen if you already have the executors and you 
don't have locality wait on.  

What is the reason you are getting the skewed cached results?  

Are you really asking for spark to handle skewed data better? 

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage well enough(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16050016#comment-16050016
 ] 

Saisai Shao commented on SPARK-21082:
-

That's fine if the storage memory is not enough to cache all the data, Spark 
still could handle this scenario without OOM. Base on the free memory to 
schedule the task is too scenario specific from my understanding.

[~tgraves] [~irashid] [~mridulm80] may have more thoughts on it. 

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage well enough(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049965#comment-16049965
 ] 

DjvuLee commented on SPARK-21082:
-

Data locality, input size for task, scheduling order affect a lot, even all the 
nodes have the same computation capacity.


Suppose there are two Executors with same computation capacity and four tasks 
with input size: 10G, 3G, 10G, 20G. 
So there is a chance that one Executor will cache 30GB, one will  cache 13GB 
under current scheduling policy。 
If the Executor have only 25GB memory for storage, then not all the data can be 
cached in memory.

I will give a more detail description for the propose if it seems OK now.

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage well enough(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049940#comment-16049940
 ] 

Saisai Shao commented on SPARK-21082:
-

Fast node actually equals to idle node, since fast node execute tasks more 
efficiently, so that it has more idle time to accept more tasks. Scheduler may 
not know which node is fast node, but it will always schedule tasks on to idle 
node (regardless of locality waiting), so as a result fast node will execute 
more tasks. 

What I mean fast nodes not only means much stronger CPU, it may be fast IO. 
Normally tasks should be relatively equal distributed, if you saw one Node has 
much more tasks compared to other nodes, you'd better find out the difference 
of that node from different aspects. Changing scheduler is not the first choice 
after all.

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage well enough(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049933#comment-16049933
 ] 

DjvuLee commented on SPARK-21082:
-

Not a really fast node and slow node problem.

Even all the nodes have equal computation power, but there are lots of factor 
affect the data cached by Executors. Such as the data locality for the task's 
input, the network, and scheduling order etc.

`it is reasonable to schedule more tasks on to fast node.`
but the fact is schedule more tasks to ideal Executors. Scheduler has no 
meaning of fast or slow for each Executor, it considers more about locality and 
idle.

I agree that it is better not to change the code, but I can not find any 
configuration to solve the problem.
Is there any good solution to keep the used memory balanced across the 
Executors? 

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage well enough(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049906#comment-16049906
 ] 

Saisai Shao commented on SPARK-21082:
-

Is it due to fast node and slow node problem? Ideally if all the nodes have 
equal computation power, then the cached memory usage should be even. Here 
according to your description, it is more like a fast node and slow node 
problem, fast node will process and cache more data, it is reasonable to 
schedule more tasks on to fast node.

Based on free memory and OOM problem to schedule the tasks is quite scenario 
dependent AFAIK, actually we may have other solutions to tune the cluster 
instead of changing the code, also this scenario is not generic enough to 
change the scheduler. I would suggest to do a careful and generic design if you 
want to improve the scheduler. 

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage well enough(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048835#comment-16048835
 ] 

DjvuLee commented on SPARK-21082:
-

Yes, one of the reason why Spark do not balance tasks well enough is affected 
by data locality.

Consider data locality is good in most case, but when we want to cache the RDD 
and analysis many times on this RDD,
memory balance is more important than keep data locality when load the data。
If we can not guarantee all the consideration well enough, offer a 
configuration to users is valuable when dealing with memory.

I will give a pull request soon if this suggestion is not defective at first 
sight.

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage enough well(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048821#comment-16048821
 ] 

Sean Owen commented on SPARK-21082:
---

This doesn't address my question about locality. I think this is a non-starter 
until you have a more comprehensive suggestion for how this does or doesn't 
interact with considerations like data locality, and available CPU. Spark 
already tries to balance tasks, and it's never going to perfectly balance them.

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage enough well(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-13 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048654#comment-16048654
 ] 

DjvuLee commented on SPARK-21082:
-

My idea is try to consider the BlockManger information when scheduling if user 
specify the configuration.

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage enough well(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-13 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048653#comment-16048653
 ] 

DjvuLee commented on SPARK-21082:
-

[~srowen] This situation occurred when the partition number is larger than the 
CPU core. 

Consider there are 1000 partition and 100 CPU core, we want cache RDD among all 
the Executors.
If  one  Executor executes tasks fast at first time, then the scheduler will 
dispatch more tasks to it, 
so after all the tasks is scheduled, some Executors will used all the storage 
memory, but some Executors just use few memory,
Executors which used more memory may not cache all the RDD partition scheduled 
on it, because there is no more memory for some tasks.
Under this situation, we can not cache all the partition even we have enough 
memory.

What's more, if some Executors occurred OOM during following compute, the 
scheduler may dispatch tasks to Executor which have no more storage memory, and 
sometimes can lead to more and more OOM if Spark can not estimate the memory.

But if the scheduler try to schedule tasks to Executors which own more free 
memory can ease this situation.


Maybe we can use the `coalesce` to decrease the partition number, but this is 
not good enough for speculating.

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage enough well(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048270#comment-16048270
 ] 

Sean Owen commented on SPARK-21082:
---

I don't see how this would interact with, for example, data locality 
considerations. You can't actually estimate how much memory a task would take 
anyway.

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage enough well(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-13 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048146#comment-16048146
 ] 

DjvuLee commented on SPARK-21082:
-

If this feature is a good suggestion(we encounter this problem in fact), I will 
give a pull request.

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage enough well(especially when the RDD type is not 
> flatten), scheduler may dispatch so many task on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org