[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056118#comment-16056118 ] Thomas Graves commented on SPARK-21082:
---
Yes, the executor should not OOM if you are trying to cache too much data there; OOM generally comes from the task side processing data. As mentioned, there are a lot of reasons for cached data to be skewed. How are you proposing Spark figure this out? Most of the time Spark is not going to know how much data a task is going to generate and cache. Generally it's just a user's task that reads some data and then caches it. That could be naturally skewed if the user doesn't have a custom partitioner or doesn't handle the skew. I guess what you are saying is to try to distribute the tasks equally among the executors, which COULD result in a more equal distribution of cached data. I would normally expect this to happen if you already have the executors and you don't have locality wait on. What is the reason you are getting the skewed cached results? Are you really asking for Spark to handle skewed data better?

> Consider Executor's memory usage when scheduling task
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 2.3.0
> Reporter: DjvuLee
>
> The Spark scheduler does not consider memory usage when dispatching tasks. This can sometimes lead to executor OOM when an RDD is cached, because Spark cannot estimate memory usage well enough (especially when the RDD type is not flat), so the scheduler may dispatch too many tasks to one executor.
> We can offer a configuration for the user to decide whether the scheduler will consider memory usage.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16050016#comment-16050016 ] Saisai Shao commented on SPARK-21082:
-
That's fine. If the storage memory is not enough to cache all the data, Spark can still handle this scenario without OOM. Scheduling tasks based on free memory is too scenario-specific, in my understanding. [~tgraves] [~irashid] [~mridulm80] may have more thoughts on it.
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049965#comment-16049965 ] DjvuLee commented on SPARK-21082:
-
Data locality, the input size of each task, and the scheduling order all matter a lot, even when all the nodes have the same computation capacity. Suppose there are two executors with the same computation capacity and four tasks with input sizes 10GB, 3GB, 10GB, and 20GB. Under the current scheduling policy there is a chance that one executor will cache 30GB while the other caches 13GB. If each executor has only 25GB of memory for storage, then not all of the data can be cached in memory. I will give a more detailed description of the proposal if it seems OK so far.
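The 30GB/13GB split in the example above can be reproduced with a small plain-Python simulation (not Spark code), assuming task duration is proportional to input size and the scheduler always hands the next task to the executor that frees up first:

```python
# Simulate a greedy "first idle executor takes the next task" scheduler,
# assuming each task's duration is proportional to its input size.
import heapq

def greedy_assign(task_sizes, num_executors):
    """Return GB cached per executor after greedy earliest-idle dispatch."""
    heap = [(0.0, e) for e in range(num_executors)]  # (time_free, executor)
    cached = [0.0] * num_executors
    for size in task_sizes:
        t_free, e = heapq.heappop(heap)   # executor that becomes idle first
        cached[e] += size                 # its task caches `size` GB locally
        heapq.heappush(heap, (t_free + size, e))
    return cached

# The four tasks from the comment: the small 3GB task frees executor 1
# early, so it also grabs the third task, while executor 0 ends up with
# the 10GB and 20GB tasks.
print(greedy_assign([10, 3, 10, 20], 2))  # -> [30.0, 13.0]
```

With a 25GB storage cap per executor, the first executor's 30GB of cache requests cannot all fit, even though 43GB total would fit in 50GB of cluster-wide storage.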
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049940#comment-16049940 ] Saisai Shao commented on SPARK-21082:
-
A fast node is effectively an idle node: since a fast node executes tasks more efficiently, it has more idle time in which to accept more tasks. The scheduler may not know which node is fast, but it will always schedule tasks onto idle nodes (locality waiting aside), so as a result fast nodes will execute more tasks. By "fast nodes" I don't just mean much stronger CPUs; it may be faster IO. Normally tasks should be distributed relatively equally; if you see one node with many more tasks than the other nodes, you'd better find out what is different about that node from several aspects. Changing the scheduler is not the first choice, after all.
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049933#comment-16049933 ] DjvuLee commented on SPARK-21082:
-
It is not really a fast-node/slow-node problem. Even when all the nodes have equal computation power, there are many factors that affect how much data each executor caches, such as data locality for the task's input, the network, and the scheduling order. You said "it is reasonable to schedule more tasks on to fast node", but in fact the scheduler dispatches more tasks to idle executors. The scheduler has no notion of fast or slow for each executor; it considers locality and idleness. I agree that it is better not to change the code, but I cannot find any configuration that solves this problem. Is there any good way to keep the used memory balanced across the executors?
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049906#comment-16049906 ] Saisai Shao commented on SPARK-21082:
-
Is it due to a fast-node/slow-node problem? Ideally, if all the nodes have equal computation power, the cached memory usage should be even. From your description this looks more like a fast-node/slow-node problem: a fast node will process and cache more data, and it is reasonable to schedule more tasks onto a fast node. Scheduling tasks based on free memory and OOM risk is quite scenario-dependent AFAIK; we may have other ways to tune the cluster instead of changing the code, and this scenario is not generic enough to justify changing the scheduler. I would suggest a careful and generic design if you want to improve the scheduler.
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048835#comment-16048835 ] DjvuLee commented on SPARK-21082:
-
Yes, one of the reasons Spark does not balance tasks well enough is data locality. Considering data locality is good in most cases, but when we want to cache an RDD and analyze it many times, memory balance is more important than preserving data locality while loading the data. If we cannot get every consideration right, offering a configuration to users is valuable when dealing with memory. I will open a pull request soon if this suggestion is not obviously flawed.
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048821#comment-16048821 ] Sean Owen commented on SPARK-21082:
---
This doesn't address my question about locality. I think this is a non-starter until you have a more comprehensive suggestion for how this does or doesn't interact with considerations like data locality and available CPU. Spark already tries to balance tasks, and it's never going to balance them perfectly.
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048654#comment-16048654 ] DjvuLee commented on SPARK-21082:
-
My idea is to consider the BlockManager information during scheduling if the user enables the configuration.
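A minimal sketch of what that could mean, in plain Python rather than Spark's scheduler code. The names here (`free_storage_bytes`, a plain list of idle executor IDs) are purely illustrative stand-ins for whatever the TaskScheduler would actually read from the BlockManagerMaster; Spark exposes no such hook today:

```python
# Hypothetical memory-aware tie-breaker: among currently idle executors,
# prefer the one whose BlockManager reports the most free storage memory,
# instead of plain first-come-first-served offer handling.
def pick_executor(idle_executors, free_storage_bytes):
    """Return the idle executor id with the most free storage memory."""
    return max(idle_executors, key=lambda e: free_storage_bytes[e])

# Example: exec-2 has 20GB free vs. 5GB on exec-1, so it gets the task.
free = {"exec-1": 5 * 2**30, "exec-2": 20 * 2**30}
print(pick_executor(["exec-1", "exec-2"], free))  # -> exec-2
```

This is where the objections in the thread bite: such a tie-breaker competes directly with locality preferences and knows nothing about how much memory the task will actually consume.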
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048653#comment-16048653 ] DjvuLee commented on SPARK-21082:
-
[~srowen] This situation occurs when the partition count is larger than the CPU core count. Consider 1000 partitions and 100 CPU cores, where we want to cache the RDD across all the executors. If one executor executes its tasks fastest at first, the scheduler will dispatch more tasks to it, so after all the tasks are scheduled, some executors will have used all their storage memory while others use only a little. The executors that used more memory may not be able to cache all the RDD partitions scheduled onto them, because there is no more memory left for some tasks. In this situation we cannot cache all the partitions even though there is enough memory overall.
What's more, if some executors hit OOM during subsequent computation, the scheduler may dispatch tasks to executors that have no storage memory left, which can lead to more and more OOMs when Spark cannot estimate the memory. If the scheduler instead tried to schedule tasks onto executors with more free memory, that would ease the situation.
Maybe we can use `coalesce` to decrease the partition count, but that is not good enough for speculation.
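The 1000-partition/100-core drift can also be illustrated with a rough simulation (plain Python, with an assumed per-task speed jitter standing in for IO and locality effects; the 0.5x-1.5x range is arbitrary):

```python
# Rough simulation: 1000 equal-size cached partitions dispatched greedily
# to 100 single-core executors. Per-task duration jitter means the cores
# that happen to run fast early keep winning the "first idle" race and
# accumulate more cached partitions than the slow ones.
import heapq
import random

random.seed(42)
PARTITIONS, CORES, SIZE_GB = 1000, 100, 1.0

heap = [(0.0, c) for c in range(CORES)]  # (time_free, core)
cached = [0.0] * CORES
for _ in range(PARTITIONS):
    t_free, c = heapq.heappop(heap)      # earliest-idle core takes the task
    cached[c] += SIZE_GB                 # and caches its partition locally
    duration = SIZE_GB * random.uniform(0.5, 1.5)  # jittered task time
    heapq.heappush(heap, (t_free + duration, c))

# All cores would hold 10GB each under a perfectly even split; the
# min/max spread shows how far greedy dispatch drifts from that.
print(min(cached), max(cached))
```

With a fixed per-executor storage cap, every gigabyte above the cap on the over-loaded cores is a partition that falls out of (or never enters) the memory cache, which is the effect described above.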
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048270#comment-16048270 ] Sean Owen commented on SPARK-21082:
---
I don't see how this would interact with, for example, data locality considerations. You can't actually estimate how much memory a task will take anyway.
[ https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048146#comment-16048146 ] DjvuLee commented on SPARK-21082:
-
If this feature sounds like a good suggestion (we have in fact encountered this problem), I will open a pull request.