[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-08-13 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-520799822
 
 
   Thank you @squito 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-08-09 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-519802295
 
 
   @squito Yeah, it saves us a lot. In a TPC-DS 1T benchmark, 30% of queries get a 1.1x+ performance boost, and 13% get a 1.2x+ boost. There is still a remote read, but only one (provided the index files are not evicted due to insufficient cache space), and this feature can take advantage of the internal network bandwidth inside the compute cluster, relieving the compute-storage network, which may be the bottleneck of the workload.
   
   By 'reasonable balance', did you mean not considering complex conditions? I think it's probably beneficial to clarify this through discussion. Making this work over the long term is also fine by me. My point was that the current approach of avoiding map-task resubmission by modifying `MapStatus` is not enough, because it only covers what the Executors tell the Driver about the map outputs' (tasks') locations. We should also grant the Driver the right (for example, via a config, as this PR does) not to resubmit map tasks even when it knows (from a non-empty `MapStatus` location) which ones to resubmit.
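   The split argued here can be modeled in a few lines of plain Python (purely illustrative; the class, method, and config names below are made up and are not Spark's real internals): the driver's resubmission policy is a switch of its own, independent of whether a `MapStatus` carries a location.
   
   ```python
   # Toy model of the decoupling discussed above (NOT actual Spark code):
   # whether the driver resubmits map tasks on executor loss is a
   # scheduler-level decision, separate from whether MapStatus has a location.
   
   class MapStatus:
       def __init__(self, location):
           self.location = location  # writer executor id, or None
   
   class Driver:
       def __init__(self, resubmit_on_executor_loss=True):
           # Hypothetical switch, analogous to the config this PR proposes.
           self.resubmit_on_executor_loss = resubmit_on_executor_loss
           self.map_statuses = {}  # map task id -> MapStatus
   
       def on_executor_lost(self, executor_id):
           """Return the map task ids the driver chooses to resubmit."""
           if not self.resubmit_on_executor_loss:
               # Shuffle files live on the DFS: outputs survive the executor,
               # so nothing needs to be recomputed.
               return []
           return [tid for tid, s in self.map_statuses.items()
                   if s.location == executor_id]
   ```
   
   With the switch off, the driver resubmits nothing even though the `MapStatus` still names the lost executor, which is exactly the behavior the config is meant to allow.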






[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-08-07 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-519354047
 
 
   @squito Index and data files are both stored on the DFS. The difference is that data files are read directly from the DFS, whereas index files are fetched by a reducer from the cache of the executors that wrote them; if the required index files are not in the cache, they are loaded from the DFS. This approach simulates the external shuffle service's cache, except that it lives inside the Executor rather than in a separate Java process.
   
   This approach needs a reasonable place (namely, the corresponding map executor) to cache index files. Returning a `None` location for a map task would:
   - satisfy the need to not resubmit tasks,
   - but break the cache feature : (
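   The read path described above, fetch from the writer executor's cache and fall back to the DFS on a miss, can be sketched as follows (illustrative plain Python; all names are hypothetical, not Spark's internals):
   
   ```python
   # Sketch of the index-file read path: a reducer first asks the cache of
   # the executor that wrote the file, and falls back to the DFS on a miss.
   
   def read_index_file(map_id, executor_cache, dfs):
       """Return index-file bytes, preferring the writer executor's cache."""
       data = executor_cache.get(map_id)
       if data is not None:
           return data                # served from the in-executor cache
       data = dfs[map_id]             # cache miss (e.g. evicted): one remote read
       executor_cache[map_id] = data  # repopulate the cache for later reducers
       return data
   ```
   
   This is why only one remote read per index file is needed in the common case: after the first miss, subsequent reducers hit the cache.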





[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-07-23 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-514078853
 
 
   @squito I ran into a scenario that cannot be handled without this PR:
   - On the map side, all shuffle files are written to a remote Hadoop filesystem; there are no shuffle files managed by the `BlockManager`s. So should I simply make the `ShuffleWriter`s return `MapStatus.location == null`? No, because that cannot fulfil the need during shuffle write.
   - On the reduce side, I want to read the shuffle index files from the cache on the executors that wrote them, so I need the `BlockManagerId` in the `MapStatus` to tell each reducer which executor to contact.
   
   This is what I mentioned:
   
   > what the Driver decides to do when invalidating an executor (what this PR works on) and how the Executors tell the Driver the `MapStatus` (with or without a location) are two different things.
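   A toy model of why a `null` location is not enough in this setup (illustrative Python; the names are hypothetical, not Spark's real API): the reduce side still needs the writer's `BlockManagerId` to route index-file reads, even though the data itself survives on the DFS.
   
   ```python
   # Sketch: a MapStatus without a location forces reducers straight to the
   # DFS, losing the writer executor's index-file cache entirely.
   
   class MapStatus:
       def __init__(self, block_manager_id):
           self.block_manager_id = block_manager_id  # None, or writer executor id
   
   def pick_index_source(status):
       """Decide where a reducer should fetch the shuffle index file from."""
       if status.block_manager_id is None:
           # No location: the only option left is the DFS.
           return "dfs"
       # With a location, the reducer can hit the writer's cache first.
       return status.block_manager_id
   ```
   
   So the location is needed for cache routing, while resubmission policy on executor loss is a separate, driver-side concern.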





[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-07-12 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-510766013
 
 
   @yifeih Thank you, I understand now. But can your approach (making `MapStatus` able to contain an empty location so that map-stage tasks are not resubmitted) handle this case: the returned `MapStatus` contains a valid location, yet at the same time we don't want the Driver to unregister this shuffle output when executors are lost (perhaps because the map output is also backed up on a DFS)?
   
   In other words, what the Driver decides to do when invalidating an executor (what this PR works on) and how the Executors tell the Driver the `MapStatus` (with or without a location) are two different things.





[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-07-09 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-509537346
 
 
   @yifeih It's very interesting, but I didn't 100% get it. The new `ShuffleIO` API will not affect the existing `ShuffleManager` API (meaning developers who have implemented a custom `ShuffleManager` don't need to modify their code), right?





[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-07-08 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-509251934
 
 
   @squito I agree with you, but I still want to make sure I understand it right: the functionality in https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit?disco=DN6g3wY resembles this PR's work. However, it only applies when users leverage the new pluggable `ShuffleIO` API. Since the lower-level `ShuffleIO` API and the upper-level `ShuffleManager` API will both exist in future Spark, we still need this PR's work so that a custom `ShuffleManager` implementation can avoid resubmitting map tasks (for example, when shuffle files are persisted on a DFS).
   
   So do we actually have plans to push this PR forward?
   
   @bsidhom Could we add a field to `ShuffleManager`, as most people agreed on?





[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-04-30 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-487843910
 
 
   @bsidhom I agree that it would be ideal to have a field in `ShuffleManager` indicating 'whether it can serve blocks without an executor'.
   I have also implemented a Hadoop-compatible filesystem shuffle manager. :P What do you mean by saying your implementation 'could live outside of Spark itself'?
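   Such a field might look roughly like this (an illustrative Python sketch; the names are invented and this is not Spark's actual `ShuffleManager` API): the manager advertises whether its blocks survive executor loss, and the scheduler consults that flag instead of hardcoding resubmission.
   
   ```python
   # Sketch of the proposed capability flag on a pluggable shuffle manager.
   
   class ShuffleManager:
       # Hypothetical attribute: True if shuffle blocks can be served without
       # the writing executor, e.g. because they live on a Hadoop-compatible
       # filesystem rather than on executor-local disk.
       serves_blocks_without_executor = False
   
   class DfsShuffleManager(ShuffleManager):
       serves_blocks_without_executor = True
   
   def should_resubmit_map_tasks(shuffle_manager):
       """Scheduler-side decision on executor loss, driven by the flag."""
       return not shuffle_manager.serves_blocks_without_executor
   ```
   
   The attraction of a capability flag over a bare config is that each pluggable implementation declares its own behavior, so users cannot misconfigure the pair.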





[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-04-29 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-487484432
 
 
   @liupc Thanks for the explanation.





[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-04-27 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-487336835
 
 
   @liupc Under remote shuffle, if certain executors are lost, we can still fetch the shuffle data from remote storage (e.g. HDFS); we don't need executors to serve the shuffle data. Also, I don't understand why you say this is too special... a 'shuffle data server external to Spark' may become the new norm...





[GitHub] [spark] gczsjdy commented on issue #24462: [SPARK-26268][CORE] Do not resubmit tasks when executors are lost

2019-04-25 Thread GitBox
URL: https://github.com/apache/spark/pull/24462#issuecomment-486900975
 
 
   @liupc Since the shuffle manager is pluggable in Spark, this 'resubmit switch' in the scheduler should also be configurable.
   @bsidhom This looks good!
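   Concretely, such a switch could be an ordinary Spark conf entry; the property name below is invented for illustration and is not necessarily what this PR uses:
   
   ```
   # spark-defaults.conf (sketch; this property name is hypothetical)
   spark.shuffle.resubmitOnExecutorLoss  false
   ```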

