[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-27 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 Thanks for the responses, I learned a lot from this :) I am going to close this PR for now, and maybe collaborate on the Kubernetes ticket raised by this PR. Thanks.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-19 Thread liyinan926
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21067 +1 on what @foxish said. If using a Job is the right way to go ultimately, it's good to open discussion with sig-apps on adding an option to the Job API & controller to use deterministic pod

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-19 Thread foxish
Github user foxish commented on the issue: https://github.com/apache/spark/pull/21067 > ReadWriteOnce storage can only be attached to one node. This is well known. Using the RWO volume for fencing here would work - but this is not representative of all users. This breaks down

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-19 Thread skonto
Github user skonto commented on the issue: https://github.com/apache/spark/pull/21067 @baluchicken yeah I thought of that but I was hoping for more automation.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-19 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 @skonto if the node never becomes available again, the new driver will stay in the Pending state until, as @foxish said, "the user explicitly force-kills the old driver".

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-19 Thread skonto
Github user skonto commented on the issue: https://github.com/apache/spark/pull/21067 > Once the partitioned node became available again, the old driver pod in the Unknown state was terminated, and the volume was detached and then reattached to the new driver pod, whose state then changed from pending

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-17 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 I ran some more tests on this. I think we can say that this change adds resiliency to Spark batch jobs: just as in the case of YARN, Spark will retry the job from the beginning if an

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93201/ Test PASSed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Merged build finished. Test PASSed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #93201 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93201/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #93201 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93201/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-12 Thread foxish
Github user foxish commented on the issue: https://github.com/apache/spark/pull/21067 > After a short/configurable delay the driver pod state changed to Unknown and the Job controller initiated a new spark driver. This is dangerous behavior. The old spark driver can still

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-12 Thread promiseofcake
Github user promiseofcake commented on the issue: https://github.com/apache/spark/pull/21067 @baluchicken, did that test involve using checkpointing in a shared location?

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-12 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 @foxish I just checked on a Google Kubernetes Cluster with Kubernetes version 1.10.4-gke.2. I created a two-node cluster and emulated a "network partition" with iptables rules (node running the

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-06 Thread liyinan926
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21067 +1 on what @foxish said. I would also like to see a detailed discussion of the semantic differences this brings to the table before committing to this approach.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-06 Thread foxish
Github user foxish commented on the issue: https://github.com/apache/spark/pull/21067 I don't think this current approach will suffice. Correctness is important here, especially for folks using spark streaming. I understand that we're proposing the use of backoff limits but there is

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Merged build finished. Test PASSed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92685/ Test PASSed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #92685 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92685/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #92685 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92685/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-06 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 @skonto thanks, I am going to check it.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-06 Thread skonto
Github user skonto commented on the issue: https://github.com/apache/spark/pull/21067 @baluchicken probably this is covered here: https://github.com/apache/spark/pull/21260. I kind of missed that, as I thought it was only for hostpaths but it also covers PVs.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92650/ Test FAILed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Merged build finished. Test FAILed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-05 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #92650 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92650/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-05 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 @mccheah rebased to master and updated the PR; the KubernetesDriverBuilder now creates the driver Job instead of the configuration steps.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-05 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #92650 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92650/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-04 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 @skonto sorry, I have a couple of other things to do, but I am trying to update this as my time allows. Yes, we are planning to create a PR for the PV-related work as soon as this one

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-03 Thread skonto
Github user skonto commented on the issue: https://github.com/apache/spark/pull/21067 @baluchicken @foxish any update on this? HA story is pretty critical for production in many cases.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-06-11 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 @felixcheung rebased to master and fixed failing unit tests

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Merged build finished. Test PASSed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91660/ Test PASSed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-06-11 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #91660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91660/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-06-11 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #91660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91660/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-06-10 Thread felixcheung
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21067 any update?

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Can one of the admins verify this patch?

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread liyinan926
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/21067 @foxish on concerns of the lack of exactly-one semantics.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90898/ Test FAILed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Merged build finished. Test FAILed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #90898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90898/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #90898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90898/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #90895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90895/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90895/ Test FAILed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Merged build finished. Test FAILed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #90895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90895/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-21 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 @felixcheung fixed the Scala style validations, sorry.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90868/ Test FAILed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #90868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90868/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Merged build finished. Test FAILed.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21067 **[Test build #90868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90868/testReport)** for PR 21067 at commit

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-20 Thread felixcheung
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21067 Jenkins, ok to test

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-13 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 Rebased again to master.

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-05-07 Thread baluchicken
Github user baluchicken commented on the issue: https://github.com/apache/spark/pull/21067 @mccheah Rebased to master, and added support for a configurable backoffLimit.
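
For readers unfamiliar with the Job field mentioned in this message, the following is a minimal sketch of a Kubernetes `batch/v1` Job manifest that wraps a single driver pod and sets `spec.backoffLimit`. It is not the PR's code: the function name `driver_job_manifest`, the `-driver` name suffix, the container name, and the image string are all hypothetical and chosen only for illustration. The default of 6 mirrors the Kubernetes default for `backoffLimit`.

```python
import json


def driver_job_manifest(app_name, image, backoff_limit=6):
    """Build a batch/v1 Job manifest (as a plain dict) wrapping one driver pod.

    All names here are illustrative assumptions, not taken from the PR.
    """
    if backoff_limit < 0:
        raise ValueError("backoff_limit must be non-negative")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{app_name}-driver"},  # hypothetical naming scheme
        "spec": {
            # Number of retries before the Job controller marks the Job failed.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # With "Never", a failed pod is replaced by a new pod
                    # rather than restarted in place.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "spark-driver", "image": image}  # hypothetical
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = driver_job_manifest("spark-pi", "example.invalid/spark:latest", 3)
    print(json.dumps(manifest, indent=2))
```

Note that a Job retries by creating a replacement pod, which is the behavior the later comments in this thread question when the original driver may still be running on a partitioned node.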

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-04-14 Thread stoader
Github user stoader commented on the issue: https://github.com/apache/spark/pull/21067 @mccheah > But whether or not the driver should be relaunchable should be determined by the application submitter, and not necessarily done all the time. Can we make this behavior

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-04-13 Thread mccheah
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21067 > We don't have a solid story for checkpointing streaming computation right now, and even if we did, you'll certainly lose all progress from batch jobs. Should probably clarify re:

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-04-13 Thread mccheah
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21067 Looks like there are a lot of conflicts from the refactor that was just merged. In general though I don't think this buys us too much. The problem is that when the driver fails, you'll lose

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-04-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Can one of the admins verify this patch?

[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-04-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21067 Can one of the admins verify this patch?