[GitHub] [spark] mridulm edited a comment on pull request #30876: [SPARK-33870][CORE] Enable spark.storage.replication.proactive by default
mridulm edited a comment on pull request #30876:
URL: https://github.com/apache/spark/pull/30876#issuecomment-751979314

Before answering the specific queries below, I want to set the context.

a) Enabling proactive replication could reduce recomputation cost when executors fail.
b) Enabling it will increase transfers when executor(s) are lost.

(Ignoring other minor impacts.) I was trying to understand what the impact and the tradeoffs involved would be when we enable it by default:

1) Are the replication costs (b) lower now? How do we estimate that cost? (There was non-trivial impact when I last ran some experiments.)
2) Are we (the community) running into cases where we benefit from (a) but are not (very) negatively impacted by (b)? Is there any commonality when this happens? (Application types/characteristics? Resource manager? Almost all usage?)
3) What is the impact on the application (and cluster) when we have nontrivial executor loss? Executor release in DRA is one example of this; preemption is another.
4) Anything else to watch out for?

As I mentioned earlier, I am fine with collecting data by enabling this flag by default. I am hoping this and other discussions will help us understand what questions to better evaluate before we release 3.2.

> 1. For this question, I answered at the beginning that this is a kind of self-healing feature [here](https://github.com/apache/spark/pull/30876#discussion_r547031257)
>
> > Making it default will impact all applications which have replication > 1: given this PR is proposing to make it the default, I would like to know if there was any motivating reason to make this change?

Spark is self-healing via lineage :-) Having said that, as mentioned above, I want to understand what the tradeoffs of enabling this flag are.

> 1. For the following question, I asked for your evidence first because I'm not aware of any. :)
>
> > If the cost of proactive replication is close to zero now (my experiments were from a while back), of course the discussion is moot - did we have any results for this?

I am not proposing to change the default behavior, you are ... hence my query :-) As I mentioned above, when I had looked at this in the past it was very helpful for some applications, but not others: it depended on the application and its requirements - `replication > 1` itself was not very commonly used then.

> 1. For the following question, it seems that you assume that the current Spark behavior is the best. I don't think this question justifies that the loss of data inside Spark is good.
>
> > What is the ongoing cost when the application holds RDD references, but they are not in active use for the rest of the application (not all references can be cleared by GC) - resulting in replication of blocks for an RDD which is legitimately not going to be used again?

A couple of points here:
a) There is no data loss - Spark recomputes when a lost block is required (but at some recomputation cost).
b) My query was specifically about the cost of replication, given that what I described is a common pattern in user applications: I was not saying this is a desired code pattern, but it is commonly observed behavior.

> 1. For the following, yes, but `exacerbates` doesn't look like a proper term here because we had better make Spark smarter to handle those cases, as I replied [here](https://github.com/apache/spark/pull/30876#discussion_r547421217) already.
>
> > Note that the above is orthogonal to DRA evicting an executor via storage timeout configuration. That just exacerbates the problem: since a larger number of executors could be lost.

If we can do better on this, I am definitely very keen on it! Until that happens, we need to continue supporting existing scenarios where DRA impacts use of this flag.

> 1. For the following, I didn't make this PR for that specific use case. I made this PR to improve this feature in various environments in the Apache Spark 3.2.0 timeframe [here](https://github.com/apache/spark/pull/30876#issuecomment-749953223).
>
> > Specifically for this use case, we don't need to make it a spark default, right? ...

This was in response to the [scenario](https://github.com/apache/spark/pull/30876#issuecomment-750471287) described. Let us decouple discussion of that scenario from our discussion here, and focus on what we need to evaluate for enabling this by default.

> 1. For the following, I replied that the YARN environment can also suffer from disk loss or executor loss [here](https://github.com/apache/spark/pull/30876#issuecomment-751060200) because you insisted that YARN doesn't need this feature from the beginning. I'm still not sure that YARN environment is so i
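The (a)-vs-(b) tradeoff framed in the comment above can be sketched as back-of-envelope arithmetic. All numbers and function names below are hypothetical, purely to illustrate why the answer depends on the application rather than admitting a single global default:

```python
# Sketch of the tradeoff: proactive replication pays extra network transfer
# on every executor loss (cost b) in exchange for avoiding lineage
# recomputation when a lost block is actually needed again (benefit a).

def replication_transfer_cost(blocks_lost: int, block_size_mb: float,
                              network_mb_per_s: float) -> float:
    """Seconds spent re-replicating lost blocks to restore replication > 1."""
    return blocks_lost * block_size_mb / network_mb_per_s

def recomputation_cost(blocks_reused: int, recompute_s_per_block: float) -> float:
    """Seconds spent recomputing needed blocks via lineage if replicas are gone."""
    return blocks_reused * recompute_s_per_block

# Hypothetical scenario: one lost executor held 100 cached 64 MB blocks,
# effective network throughput is 1 GB/s, each block takes 30 s to
# recompute, and only 20% of the lost blocks are ever needed again
# (the "stale RDD reference" pattern discussed above).
transfer = replication_transfer_cost(100, 64.0, 1024.0)  # 6.25 s
recompute = recomputation_cost(int(100 * 0.2), 30.0)     # 600.0 s

# With these made-up numbers proactive replication wins easily; shrink the
# reuse fraction toward zero (blocks never touched again) and the transfer
# cost becomes pure overhead - which is the crux of the default question.
print(transfer, recompute)
```

Shifting any one input (reuse fraction, block size, recompute cost) flips the balance, which is why question (2) above asks whether there is commonality across workloads.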
[GitHub] [spark] mridulm edited a comment on pull request #30876: [SPARK-33870][CORE] Enable spark.storage.replication.proactive by default
mridulm edited a comment on pull request #30876:
URL: https://github.com/apache/spark/pull/30876#issuecomment-751934621

Specifically for this [use case](https://github.com/apache/spark/pull/30876#issuecomment-750471287), we don't need to make it a spark default, right? If I understood correctly, the following conditions are being met:

a) The application is using an RDD with replication > 1 (pre-req).
b) The application is fine with the proactive replication cost, and is coded such that out-of-scope RDDs are ensured to be GC'ed.
c) k8s is (aggressively) configured such that executor decommission is unable to do a sufficiently good job.
d) The recomputation cost is high enough to be offset by proactive replication (either due to a latency SLA or high cost of computation).

If true, then sure, for those applications/cluster environments, making proactive replication an application default might make sense. But this feels sufficiently narrow not to require a global default, right? It feels more like a deployment/application default and not a platform-level default.

In the scenario above though, how do we handle everything else? Shuffle? Cached RDD blocks where replication == 1? Perhaps better tuning for (c) might help more holistically?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
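For the scenario above, the knobs for tuning (c) rather than flipping a global default would look roughly like the following `spark-defaults.conf` fragment. The keys are the ones I believe Spark 3.1 ships for decommission and dynamic allocation (worth double-checking against the configuration docs for the exact release); the values are hypothetical and per-deployment:

```
# Per-application opt-in instead of a platform-wide default
spark.storage.replication.proactive                true

# Give block-level decommission a chance to migrate cached RDD blocks
# before the executor pod is torn down (Spark 3.1+ knobs)
spark.decommission.enabled                         true
spark.storage.decommission.enabled                 true
spark.storage.decommission.rddBlocks.enabled       true

# DRA: keep executors holding cached blocks around longer, so fewer
# replicas are lost to idle-timeout eviction in the first place
spark.dynamicAllocation.cachedExecutorIdleTimeout  30min
```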
[GitHub] [spark] mridulm edited a comment on pull request #30876: [SPARK-33870][CORE] Enable spark.storage.replication.proactive by default
mridulm edited a comment on pull request #30876:
URL: https://github.com/apache/spark/pull/30876#issuecomment-751186157

@dongjoon-hyun proactive replication only applies to persisted RDD blocks, not shuffle blocks - not sure if I am missing something here. Even for persisted RDD blocks, it specifically applies when the RDD is persisted with storage levels where `replication > 1` [1]. I view loss of all replicas of an RDD blockId similarly, whether replication is 1 or higher.

Having said that, specifically for use cases where the Spark cluster might be the source of truth (or the cost of recomputation is prohibitive), applications can of course enable proactive replication via this flag. I am not sure I am seeing a concrete reason to turn this on for all applications. Please let me know if I am missing something in my understanding.

[1] ESS serving disk-backed blocks might have some corner cases in this flow which I have not thought through.
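The eligibility rule described above (persisted RDD blocks only, and only with `replication > 1`) can be sketched as a simplified model. The class and field names here are hypothetical; the real selection happens in Spark's Scala code (around `BlockManagerMasterEndpoint`), not in Python:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CachedBlock:
    """Toy stand-in for a block's metadata on the driver."""
    block_id: str
    is_rdd_block: bool    # shuffle/broadcast blocks are not eligible
    max_replication: int  # from the StorageLevel the user persisted with

def blocks_to_proactively_replicate(blocks_on_lost_executor):
    """Blocks Spark would try to re-replicate after an executor is lost:
    persisted RDD blocks whose storage level requested replication > 1."""
    return [b for b in blocks_on_lost_executor
            if b.is_rdd_block and b.max_replication > 1]

lost = [
    CachedBlock("rdd_0_0", True, 2),         # e.g. MEMORY_ONLY_2 -> eligible
    CachedBlock("rdd_1_0", True, 1),         # e.g. MEMORY_ONLY -> lineage recompute
    CachedBlock("shuffle_0_0_0", False, 1),  # shuffle block -> never covered
]
eligible = blocks_to_proactively_replicate(lost)
print([b.block_id for b in eligible])  # ['rdd_0_0']
```

This is why the comment argues the flag is moot for applications that never persist with `replication > 1`: the filter above selects nothing for them.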