[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22392


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-18 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r218616946
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

Yep, I can merge this later today if I don't see more comments.


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-18 Thread ssaavedra
Github user ssaavedra commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r218616371
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

Should we mark this as resolved and continue the discussion on spark-dev about 
adding the rest of the kubernetes.* variables?


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-17 Thread ssaavedra
Github user ssaavedra commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r218231743
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

I'll try to get this updated by tomorrow, sorry for the delay today


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-17 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r218178934
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

I'm fine deferring consideration of the other properties to a later time. 
But given that the core limit ones are included, it makes no sense to 
exclude the core request ones. If the purpose is to make it work first for 2.4, 
I would suggest keeping just the ones that are not deterministic, e.g., the 
driver pod name and the executor pod name prefix. 
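A minimal sketch of that suggestion (hypothetical; the exact key names would come from the k8s Config.scala):

```scala
object MinimalReloadSketch {
  // Hypothetical minimal reload set per the suggestion above: only the values
  // the submission process generates non-deterministically, leaving resource
  // requests/limits to whatever the checkpoint recorded. Key names are
  // illustrative, not an authoritative list.
  val nonDeterministicKeys: Set[String] = Set(
    "spark.kubernetes.driver.pod.name",
    "spark.kubernetes.executor.podNamePrefix"
  )

  // True if a property should be re-read from the new submission rather than
  // restored from the checkpoint.
  def shouldReload(key: String): Boolean = nonDeterministicKeys.contains(key)
}
```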


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-17 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r218056207
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

I agree with getting it to work first. 


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-17 Thread ssaavedra
Github user ssaavedra commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r218048682
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

I'm not sure about the use-case there. I think I agree in general with 
adding all of the request/limit knobs, because the cluster may have changed 
(e.g., a "core" is now a latest-gen Xeon instead of a prev-gen Celeron), or the 
job might have starved under the old resource limits. However, I'm not sure that 
general job settings should need tweaking when reloading a job from a 
checkpoint.

My rationale is that a restore should only happen when a job would otherwise 
just have kept running, with no opportunity for you to tweak such settings. I'd 
argue that if you need that kind of fine-tuning, you should perform a new 
deployment of your job instead of re-launching the previous one. You do need to 
fix the variables that refer to the non-deterministic choices the deployment 
process makes (such as choosing Pod names, IP addresses and so on), but I'd say 
the rest of the config flags should be unaffected.

In particular, none of the Python- and R-related configurations should need to 
change in this case, since changing them would mean you are actually changing 
the operations you are going to perform, or the Python version.

I can add the missing resource request/limit flags or remove them 
completely; I'm not sure which is the better approach. In any case, I think a 
further discussion on spark-dev should settle this, but if you want 
checkpointing to be ready for 2.4.0 (we are already late), I'd go with what's 
needed for it to work and follow up later with more advanced use cases, though 
I'm open to alternatives. Also, for the flags that are not really 
cluster-configuration-related, being able to change them falls out of scope 
here; that should instead lead to a discussion about whether restoring a job 
from a checkpoint should allow the job to carry different run-time semantics.

Looking forward to a solution :)


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-17 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r218002553
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

Are there any other properties that need to be restored beyond these?
How about `spark.kubernetes.allocation.batch.size`,
`spark.kubernetes.executor.lostCheck.maxAttempts`,
`spark.kubernetes.report.interval`, and
`spark.kubernetes.executor.eventProcessingInterval`?
There are also several other properties for Python and R, listed in the 
[config](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala).
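If the enumeration keeps growing, one alternative worth weighing (a sketch only, not what the PR does) is selecting reloadable keys by prefix:

```scala
object KubernetesPrefixSketch {
  // Hypothetical alternative to listing each key by hand: treat every
  // spark.kubernetes.* property as reloadable. The PR instead enumerates keys
  // explicitly, since some kubernetes.* settings express user intent rather
  // than submission-generated values, and blanket-resetting those would be
  // surprising.
  def kubernetesKeys(conf: Map[String, String]): Set[String] =
    conf.keySet.filter(_.startsWith("spark.kubernetes."))
}
```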


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-14 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r217838764
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

If that's the reasoning, you probably also want to restore all the resource 
request/limit settings.


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-14 Thread ssaavedra
Github user ssaavedra commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r217821786
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

Because, in my opinion, the previous checkpointed execution should not 
override a newer choice of limit cores, nor should it impose its old defaults 
on the newly-spawned executor pods.


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-14 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22392#discussion_r217792152
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   "spark.driver.bindAddress",
   "spark.driver.port",
   "spark.master",
+  "spark.kubernetes.driver.pod.name",
+  "spark.kubernetes.driver.limit.cores",
+  "spark.kubernetes.executor.limit.cores",
--- End diff --

Why include the `*.limit.cores` ones?


---




[GitHub] spark pull request #22392: [SPARK-23200] Reset Kubernetes-specific config on...

2018-09-11 Thread ssaavedra
GitHub user ssaavedra opened a pull request:

https://github.com/apache/spark/pull/22392

[SPARK-23200] Reset Kubernetes-specific config on Checkpoint restore

Several configuration parameters related to Kubernetes need to be
reset, as they change with each invocation of spark-submit and
thus prevent recovery of Spark Streaming tasks.

## What changes were proposed in this pull request?

When using the Kubernetes cluster manager and spawning a Streaming 
workload, it is important to reset many spark.kubernetes.* properties that 
spark-submit generates, instead of restoring their stale values from a 
Checkpoint. This is because the spark-submit codepath creates Kubernetes 
resources, such as a ConfigMap, a Secret and other objects, with 
autogenerated names, and the previous names will not resolve anymore.

In short, this change enables checkpoint restoration for streaming 
workloads, and thus enables Spark Streaming workloads in Kubernetes, which were 
not possible to restore from a checkpoint before if the workload went down.
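The overlay the PR relies on can be sketched as follows. This is a hypothetical, self-contained model of the behavior, not the actual Checkpoint.scala code: properties on the reload list take their value from the new spark-submit environment, and a stale checkpointed value is dropped if the new environment does not provide one.

```scala
object CheckpointRestoreSketch {
  // Keys re-read from the new submission instead of the checkpoint. The names
  // here mirror the diff context above; the real list in Checkpoint.scala
  // contains further entries.
  val propertiesToReload: Seq[String] = Seq(
    "spark.driver.bindAddress",
    "spark.driver.port",
    "spark.master",
    "spark.kubernetes.driver.pod.name",
    "spark.kubernetes.driver.limit.cores",
    "spark.kubernetes.executor.limit.cores"
  )

  // Overlay fresh values for reloadable keys on top of the checkpointed conf,
  // modeling both confs as plain key-value maps for illustration.
  def restoredConf(
      checkpointed: Map[String, String],
      current: Map[String, String]): Map[String, String] =
    propertiesToReload.foldLeft(checkpointed) { (conf, key) =>
      current.get(key) match {
        case Some(v) => conf + (key -> v) // take the new submission's value
        case None    => conf - key        // stale value would not resolve; drop it
      }
    }
}
```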

## How was this patch tested?

This patch would benefit from testing in different k8s clusters.

This is similar to the YARN-related code for resetting a Spark Streaming 
workload, but for the Kubernetes scheduler. This PR removes the init-container 
properties that existed before, because they have been removed in master.

For a previous discussion, see the non-rebased work at: 
apache-spark-on-k8s#516

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ssaavedra/spark fix-checkpointing-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22392


commit f457d023e8e488b89b97f5b3b9936831d7fc9bb6
Author: Santiago Saavedra 
Date:   2018-09-11T07:58:40Z

Reset Kubernetes-specific config on Checkpoint restore

Several configuration parameters related to Kubernetes need to be
reset, as they change with each invocation of spark-submit and
thus prevent recovery of Spark Streaming tasks.




---
