[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-20 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-660965811 Thanks to everyone who took the time to review. I know the discussion thread here got a bit longer than usual. I’m really excited to get back to reviewing the PRs that build on top

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-19 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-660796492 Merged to dev branch This is an automated message from the Apache Git Service. To respond to the message, please

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-18 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-660582133 The Python packaging tests are failing on Jenkins post-upgrade, and this passes all of the GH Actions checks, so unless there is any more discussion I intend to merge this tomorrow.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-17 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-660381704 It looks like the R test is failing even with the upgrade. I'm going to disable it and file a blocker to re-enable it unless folks object to that approach.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-17 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-660364460 Sounds good, I'll work on resolving the test issues. @jiangxb1987, if you want to file the follow-up issues under the decommissioning umbrella issue, it'll make tracking them easier

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-17 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-660340513 Hey folks, it's the end of the workweek. I just want to check in and see if people believe this PR is still under active discussion, or if once it resolves in Jenkins it's ok to

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-17 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-660308966 Looks like my CR follow-up caused an issue in the shuffle part; I think it's where I was unifying the types. I ran the storage tests locally but forgot to also check all of the sh

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-16 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-659592920 So @attilapiros is pretty familiar with this area of the code and has been reviewing and collaborating on it for an extended period of time.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-15 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658955592 It seems like Shane is working on fixing the R issue in Jenkins today, so I'll wait until that is resolved to give folks an opportunity to veto if they believe that is the corre

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-15 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658954395 The SPIP has been voted on, this has been reviewed extensively, and the original design is from 2017. I'm not waiting unless someone wishes to -1 for a valid technical reason.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-15 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658944693 All checks pass, I'm going to merge this to our current development branch.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-14 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658255227 The test failure reported from the K8s PRB is about the R version, which is unrelated. I’m going to do a last read-through of this PR and if it looks good I intend to merge it after that

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-14 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658161149 Both failing GitHub Actions tests fail because R is not installed properly in the test environment. Given that there are no R changes, I'm not planning on waiting on that.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-13 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-657985918 Jenkins retest this please

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-07-13 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-657927249 Looking at this I believe all of the changes requested have been addressed. I'm going to get this PR up to date with the current development now that the SPIP has passed and if

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-26 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-650477676 Just an FYI to folks: I'm not as active on this PR as I would normally be, as I'm waiting to see where the SPIP discussions go. I'll circle back to this next week.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-18 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-646399281 I'm going to hold off on merging this for a little longer, actually; it seems like there are some other folks who are interested in the space (cc @HyukjinKwon who's recently sho

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-18 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-646209154 I will merge early next week unless anyone has any outstanding issues.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-16 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-644950938 Build dependency failure is likely unrelated given the lack of any pom changes.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-15 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-644468103 I've updated the description @gatorsmile, let me know if there are any particular points you would like clarified.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-15 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-644376619 Hey @gatorsmile, I noticed you marked this as "request changes". Can you clarify what changes you are requesting?

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-15 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-644367309 Sure, I'll share the design doc here as well: https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-13 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-643660509 jenkins retest this please

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-12 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-643575581 Yeah so the plan is to trigger an exit as soon as migrations are completed. I think a good follow up to https://issues.apache.org/jira/browse/SPARK-31197 would be adding a time

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-12 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-643559589 Although I'm a little fuzzy on what you mean by "eager" (if you mean as soon as the migrations are completed, then yes)

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-12 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-643553994 Yeah I think supporting multiple ways of reducing the number of fetch failures makes sense here. I think migration is certainly a "best-case" scenario and we can't count on in m

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-09 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-640736229

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-05 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639800859 > grep "THREAD LEAK" So I'm not sure why it's not showing up in Jenkins: > holden@hkdesktop:~/repos/spark-website$ wget https://amplab.cs.berkeley.edu/jenkins/job/S

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-05 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639663862 > > Oh also I think I understand some of our disagreement over the threads. I thought you were asking me to stop the Spark executor because I’ve started doing some separate work

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-05 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639628299 Oh also I think I understand some of our disagreement over the threads. I thought you were asking me to stop the Spark executor because I’ve started doing some separate work on

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-05 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639625978 Thanks for the PR :)

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-04 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639222935 The K8s test failure appears unrelated (`- Run in client mode. *** FAILED ***`); we don't do anything with the tokens. I'll investigate more tomorrow.

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-04 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639201319 > > > So @attilapiros looking at the Jenkins console logs we aren't leaking any threads during testing (nor would I expect us to). But I'll add something to more aggressively st

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-04 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639013472 > > Hey @attilapiros can you explain to me why you think we need to test the different kinds of block fetches? When we migrate we're always migrating to disk so I'm not seeing h

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-04 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639007168 > > So @attilapiros looking at the Jenkins console logs we aren't leaking any threads during testing (nor would I expect us to). But I'll add something to more aggressively stop

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-04 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-638998676 So @attilapiros looking at the Jenkins console logs we aren't leaking any threads during testing (nor would I expect us to). But I'll add something to more aggressively stop the

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-03 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-638529780 Hey @attilapiros can you explain to me why you think we need to test the different kinds of block fetches? When we migrate we're always migrating to disk so I'm not seeing how i

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-03 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-638500521 I like the idea of adding a more specific unit test for the streaming upload so we can save the (slower) more integration style test for the other components, thanks for writing

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-03 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-638496741 So I don't want to stop the executor directly once the block migration is done. Instead, I have a follow-up JIRA which I've started working on that shuts down the executor once t

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-02 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-637903387 I think the k8s failure is unrelated (e.g. ` java.lang.RuntimeException: Unable to load a Suite class that was discovered in the runpath: org.apache.spark.deploy.master.ui.Mast

[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-02 Thread GitBox
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-637764304 cc reviewers from the WIP PR: @attilapiros , @dongjoon-hyun , @prakharjain09 , @viirya