[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-520799822 Thank you @squito This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-519802295 @squito Yeah it saves us much, from a TPC-DS 1T benchmark, 30% queries get 1.1x+ performance boost, 13% get 1.2x + performance boost. There's still remote read, but only once(if index files are not swapped out because of insufficient cache space), and this feature can take advantage of internal network bandwith inside computing cluster, releasing the compute-storage network, which may be the bottleneck of the workload. By 'reasonable balance', did you mean not considering complex conditions? I think probably it's beneficial to make it clear through discussion. Making this work in a long term is also fine by me. I tried to make a point that the current solution to not resubmitting map tasks by modifying `MapStatus` is not enough, due to it only cares about what Executors tell the Driver about the map outputs'(tasks') location. However, we should also grant Driver the right(for example, by add a config like this PR did) to not resubmit the map tasks even if it knows (not empty MapStatus location) which one to resubmit. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-519354047 @squito Index and data files are both stored on DFS, the difference is that: data files are directly read from DFS, however, for index files, a reducer fetches them from the executors('s cache) who wrote them, if there aren't required index files in cache, they will be loaded from DFS. This approach simulates the external shuffle service's cache, but instead of in another Java process, it's in Executor. This approach needs a reasonable place(and it's the coordinated map executor) to cache index files. Returning a `None` location for mapper task will make - The no resubmit tasks' need satisfied - But the cache feature not satisfied : ( This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-514078853 @squito I met with a condition that cannot be satisfied without this PR: - On map side, all shuffle files are written to remote Hadoop filesystem, there isn't any shuffle files managed by `BlockManager`s. So I simply should make `ShuffleWriters` return `MapStatus.location == null`? No, because it cannot fulfil the need during shuffle write. - On reduce side, I want to read the shuffle index files from the cache on the executors who wrote them, so I need the `BlockManagerId` in the `MapStatus` to tell each reducer which executor to find. This is what I mentioned: > what Driver decides to do when invalidating an executor(what this PR works on) and how the Executors tell Driver the MapStatus(with or without a location) are two different things. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-510766013 @yifeih Thank you, I understand now. But can your way (making `MapStatus` able to contain an empty location in order to not resubmit map stage tasks) deals with this condition: the `MapStatus` returned contains a valid location, at the same time, we don't want the Driver to unregister this shuffle output when executors lost(Maybe due to the map output is also backed up in DFS)? In other words, what Driver decides to do when invalidating an executor(what this PR works on) and how the Executors tell Driver the `MapStatus`(with or without a location) are two different things. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-509537346 @yifeih It's very interesting, but I didn't 100% get it. The new `ShuffleIO` API will not influence the existed `ShuffleManager` API(which means the developers who have implemented a custom ShuffleManager don't need to modify their code) right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-509251934 @squito I agree with you, but still want to make sure I understand it right: The function https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit?disco=DN6g3wY resembles this PR's work. However, it just works when users leverage the new `ShuffleIO` pluggable API. But since the lower-level `ShuffleIO` & upper-level `ShuffleManager` API will both exist in the future Spark. We still need this PR's work for a custom `ShuffleManager` implementation to avoid resubmitting map tasks(for example when shuffle files are persisted in DFS). So actually we have plans to push this PR forward? @bsidhom Could we create a field in `ShuffleManager` like most people agreed on? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-487843910 @bsidhom I agree that it would be ideal if there's a field in `ShuffleManager` indicating 'whether it can serve blocks without an executor'. I have also implemented a Hadoop-compatible file system shuffle manager. : P What do you mean by your implementation 'could live outside of Spark itself'? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-487484432 @liupc Thanks for the explanation. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-487336835 @liupc Under remote shuffle, if certain executors are lost, we can still fetch the shuffle data from remote storage(i.e. HDFS), we don't need executors to `server` the shuffle data. Also I don't understand why you mentioned this is too special... 'shuffle data server external to Spark' may be the new common... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost
gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost URL: https://github.com/apache/spark/pull/24462#issuecomment-486900975 @liupc Since shuffle manager is pluggable in Spark, this 'resubmit switch' in scheduler should also be configurable. @bsidhom This looks good! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org