Re: Should spark-ec2 get its own repo?
I sent a note to the Mesos developers and created https://github.com/apache/spark/pull/7899 to change the repository pointer. There are 3-4 open PRs in the mesos/spark-ec2 repository right now, and I'll work on migrating them to amplab/spark-ec2 later today.

My thought on moving the Python script is that we should have a wrapper shell script that just fetches the latest version of spark_ec2.py for the corresponding Spark branch. We already have separate branches in our spark-ec2 repository for different Spark versions, so it can just be a call to `wget https://raw.githubusercontent.com/amplab/spark-ec2/spark-version/driver/spark_ec2.py` (the raw-file URL; the `github.com/.../tree/...` form returns an HTML page rather than the script itself).

Thanks
Shivaram
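The wrapper Shivaram describes could be only a few lines. A hypothetical sketch follows; the branch naming scheme (`branch-1.4`), the `SPARK_EC2_BRANCH` variable, and the raw URL layout are illustrative assumptions, not the actual shim that shipped:

```shell
#!/usr/bin/env bash
# Hypothetical ec2/spark-ec2 shim: fetch the driver script matching this
# checkout's Spark branch, then hand off all arguments to it.
# Branch name and env var are assumptions for illustration.
set -euo pipefail

SPARK_EC2_BRANCH="${SPARK_EC2_BRANCH:-branch-1.4}"
SCRIPT_URL="https://raw.githubusercontent.com/amplab/spark-ec2/${SPARK_EC2_BRANCH}/driver/spark_ec2.py"

echo "Would fetch: ${SCRIPT_URL}"
# wget -q -O spark_ec2.py "${SCRIPT_URL}"   # uncomment to actually download
# exec python spark_ec2.py "$@"             # then run with the user's args
```

Pinning one branch of the spark-ec2 repo per Spark release line is what keeps old Spark versions working even if the script moves again later.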
Re: Should spark-ec2 get its own repo?
On Sat, Aug 1, 2015 at 1:09 PM Matt Goodman meawo...@gmail.com wrote:
> I am considering porting some of this to a more general spark-cloud launcher, including google/aliyun/rackspace. It shouldn't be hard at all given the current approach for setup/install.

FWIW, there are already some tools for launching Spark clusters on GCE and Azure: http://spark-packages.org/?q=tags%3A%22Deployment%22

Nick
Re: Should spark-ec2 get its own repo?
I think that is a good idea, and slated to happen. At the very least a README or some such. Is this a use case for git submodules?

I am considering porting some of this to a more general spark-cloud launcher, including google/aliyun/rackspace. It shouldn't be hard at all given the current approach for setup/install.

--Matthew Goodman
=====
Check Out My Website: http://craneium.net
Find me on LinkedIn: http://tinyurl.com/d6wlch
Re: Should spark-ec2 get its own repo?
I don't think that using git submodules is a good idea here:

- The extra `git submodule init && git submodule update` step can lead to confusing problems in certain workflows.
- We'd wind up with many commits that serve only to bump the submodule SHA; these commits will be hard to review since they won't contain line diffs (the author will have to manually provide a link to the diff of the code changes).

On Sat, Aug 1, 2015 at 10:08 AM, Matt Goodman meawo...@gmail.com wrote:
> I think that is a good idea, and slated to happen. At the very least a README or some such. Is this a use case for git submodules?
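The "SHA bump" problem described above is easy to see with a throwaway pair of repos. A sketch (all paths, names, and commit messages are made up for illustration; this is not part of the Spark workflow):

```shell
# Show why submodule-bump commits are hard to review: the superproject
# diff records only the new submodule commit hash, never a line diff.
set -e
export GIT_AUTHOR_NAME=Dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=Dev GIT_COMMITTER_EMAIL=dev@example.com
tmp=$(mktemp -d)

# Stand-in for a repo like spark-ec2 that would become the submodule.
git init -q "$tmp/sub"
git -C "$tmp/sub" commit -q --allow-empty -m "initial"

# The superproject, with sub added as a submodule.
git init -q "$tmp/super"
cd "$tmp/super"
git -c protocol.file.allow=always submodule --quiet add "$tmp/sub" sub
git commit -q -m "add submodule"

# A real code change lands in the submodule repo...
git -C "$tmp/sub" commit -q --allow-empty -m "real change"

# ...and the superproject's "bump" commit is just a hash update:
git -C sub pull -q "$tmp/sub"
git add sub
git commit -q -m "bump submodule SHA"
git show HEAD -- sub | grep "Subproject commit"
```

The final `git show` prints only the old and new `Subproject commit` hashes, which is why a reviewer has to chase the linked repo to see what actually changed.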
Re: Should spark-ec2 get its own repo?
Hey All,

I've mostly kept quiet since I am not very active in maintaining this code anymore. However, it is a bit odd that the project is split-brained, with a lot of the code being on github and some in the Spark repo. If the consensus is to migrate everything to github, that seems okay with me. I would vouch for having user continuity, for instance still having a shim ec2/spark-ec2 script that could perhaps just download and unpack the real script from github.

- Patrick
Re: Should spark-ec2 get its own repo?
Yes - It is still in progress, but I have just not gotten time to get to this. I think getting the repo moved from mesos to amplab in the codebase by 1.5 should be possible.

Thanks
Shivaram
Re: Should spark-ec2 get its own repo?
PS is this still in progress? It feels like something that would be good to do before 1.5.0, if it's going to happen soon.
Re: Should spark-ec2 get its own repo?
Yeah, I'll send a note to the mesos dev list just to make sure they are informed.

Shivaram

On Tue, Jul 21, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote:
> I agree it's worth informing Mesos devs and checking that there are no big objections. I presume Shivaram is plugged in enough to Mesos that there won't be any surprises there, and that the project would also agree with moving this Spark-specific bit out. They may also want to leave a pointer to the new location in the mesos repo, of course.
>
> I don't think it is something that requires a formal vote. It's not a question of ownership -- neither Apache nor the project PMC owns the code. I don't think it's different from retiring or removing any other code.
Re: Should spark-ec2 get its own repo?
There is technically no PMC for the spark-ec2 project (I guess we are kind of establishing one right now). I haven't heard anything from the Spark PMC on the dev list that might suggest a need for a vote so far. I will send another round of email notification to the dev list when we have a JIRA / PR that actually moves the scripts (right now the only thing that changed is the location of some scripts in mesos/ to amplab/).

Thanks
Shivaram
Re: Should spark-ec2 get its own repo?
If I am not wrong, since the code was hosted within the mesos project repo, I assume (at least part of it) is owned by the mesos project and so its PMC?

- Mridul
Re: Should spark-ec2 get its own repo?
That's part of the confusion we are trying to fix here -- the repository used to live in the mesos github account but was never a part of the Apache Mesos project. It was a remnant part of Spark from when Spark used to live at github.com/mesos/spark.

Shivaram
Re: Should spark-ec2 get its own repo?
That sounds good. Thanks for clarifying!

Regards,
Mridul
Re: Should spark-ec2 get its own repo?
I've created https://github.com/amplab/spark-ec2 and added an initial set of committers. Note that this is not a fork of the existing github.com/mesos/spark-ec2, and users will need to fork from here. This is mostly to avoid the base fork in pull requests being set incorrectly, etc. I'll be migrating some PRs / closing them in the old repo and will also update the README in that repo.

Thanks
Shivaram

On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com wrote:
> On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
>> I am not sure why the ASF JIRA can only be used to track one set of artifacts that are packaged and released together. I agree that marking a fix version as 1.5 for a change in another repo doesn't make a lot of sense, but we could just not use fix versions for the EC2 issues?
>
> *shrug* it just seems harder and less natural to use ASF JIRA. What's the benefit? I agree it's not a big deal either way, but it's a small part of the problem we're solving in the first place. I suspect that one way or the other, there would be issues filed in both places, so this probably isn't worth debating.
>
>> My concerns are less about it being pushed out etc. For better or worse, we have had the EC2 scripts be a part of the Spark distribution from a very early stage (from version 0.5.0, if my reading of the git history is correct). So users will assume that any error with the EC2 scripts belongs to the Spark project. In addition, almost all of the contributions to the EC2 scripts come from Spark developers, so keeping the issues in the same mailing list / JIRA seems natural. This, I guess, again relates to the question of managing issues for code that isn't part of the Spark release artifact.
>
> Yeah, good question -- Github doesn't give you a mailing list. I think dev@ would still be where it's discussed, which is ... again 'part of the problem', but as you say, probably beneficial. It's a pretty low-traffic topic anyway.
>
>> I'll create the amplab/spark-ec2 repo over the next couple of days unless there are more comments on this thread. This will at least alleviate some of the naming confusion over using a repository in mesos, and I'll give Sean, Nick, and Matthew commit access to it. I am still not convinced about moving the issues over, though.
>
> I won't move the issues. Maybe time tells whether one approach is better, or that it just doesn't matter. However, it'd be a great opportunity to review and clear stale EC2 issues.
Re: Should spark-ec2 get its own repo?
Might be a good idea to get the PMC's of both projects to sign off to prevent future issues with apache. Regards, Mridul On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I've created https://github.com/amplab/spark-ec2 and added an initial set of committers. Note that this is not a fork of the existing github.com/mesos/spark-ec2 and users will need to fork from here. This is mostly to avoid the base-fork in pull requests being set incorrectly etc. I'll be migrating some PRs / closing them in the old repo and will also update the README in that repo. Thanks Shivaram On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com wrote: On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I am not sure why the ASF JIRA can be only used to track one set of artifacts that are packaged and released together. I agree that marking a fix version as 1.5 for a change in another repo doesn't make a lot of sense, but we could just not use fix versions for the EC2 issues ? *shrug* it just seems harder and less natural to use ASF JIRA. What's the benefit? I agree it's not a big deal either way but it's a small part of the problem we're solving in the first place. I suspect that one way or the other, there would be issues filed both places, so this probably isn't worth debating. My concerns are less about it being pushed out etc. For better or worse we have had EC2 scripts be a part of the Spark distribution from a very early stage (from version 0.5.0 if my git history reading is correct). So users will assume that any error with EC2 scripts belong to the Spark project. In addition almost all the contributions to the EC2 scripts come from Spark developers and so keeping the issues in the same mailing list / JIRA seems natural. This I guess again relates to the question of managing issues for code that isn't part of the Spark release artifact. Yeah good question -- Github doesn't give you a mailing list. 
I think dev@ would still be where it's discussed which is ... again 'part of the problem' but as you say, probably beneficial. It's a pretty low traffic topic anyway. I'll create the amplab/spark-ec2 repo over the next couple of days unless there are more comments on this thread. This will at least alleviate some of the naming confusion over using a repository in mesos and I'll give Sean, Nick, Matthew commit access to it. I am still not convinced about moving the issues over though. I won't move the issues. Maybe time tells whether one approach is better, or that it just doesn't matter. However it'd be a great opportunity to review and clear stale EC2 issues.
Re: Should spark-ec2 get its own repo?
Some replies inline. On Wed, Jul 15, 2015 at 1:08 AM, Sean Owen so...@cloudera.com wrote: The code can continue to be a good reference implementation, no matter where it lives. In fact, it can be a better, more complete one, and easier to update. I agree that ec2/ needs to retain some kind of pointer to the new location. Yes, maybe a script as well that does the checkout as you say. We have to be careful that the effect here isn't to make people think this code is still part of the blessed bits of a Spark release, since it isn't. But I suppose the point is that it isn't quite now either (isn't tested, isn't fully contained in apache/spark) and that's what we're fixing. I still don't like the idea of using the ASF JIRA for Spark to track issues in a separate project, as these kinds of splits are what we're trying to get rid of. I think it's a plus to be able to only bother with the Github PR/issue system, and not parallel JIRAs as well. I also worry that this blurs the line between code that is formally tested and blessed in a Spark release, and that which is not. You fix an issue in this separate repo and mark it fixed in Spark 1.5 -- what does that imply? I am not sure why the ASF JIRA can only be used to track one set of artifacts that are packaged and released together. I agree that marking a fix version as 1.5 for a change in another repo doesn't make a lot of sense, but we could just not use fix versions for the EC2 issues? I think the issue is people don't like the sense this is getting pushed outside the wall, or 'removed' from Spark. On the one hand I argue it hasn't really properly been part of Spark -- that's why we need this change to happen. But, I also think this is easy to resolve other ways: spark-packages.org, the pointer in the repo, prominent notes in the wiki, etc. My concerns are less about it being pushed out etc. 
For better or worse, we have had the EC2 scripts be a part of the Spark distribution from a very early stage (from version 0.5.0 if my git history reading is correct). So users will assume that any error with the EC2 scripts belongs to the Spark project. In addition, almost all the contributions to the EC2 scripts come from Spark developers and so keeping the issues in the same mailing list / JIRA seems natural. This I guess again relates to the question of managing issues for code that isn't part of the Spark release artifact. I suggest Shivaram owns this, and that amplab/spark-ec2 is used to host? I'm not qualified to help make the new copy or repo admin but would be happy to help with the rest, like triaging, if you can give me rights to open issues. I'll create the amplab/spark-ec2 repo over the next couple of days unless there are more comments on this thread. This will at least alleviate some of the naming confusion over using a repository in mesos and I'll give Sean, Nick, and Matthew commit access to it. I am still not convinced about moving the issues over though. Thanks Shivaram
Re: Should spark-ec2 get its own repo?
On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I am not sure why the ASF JIRA can be only used to track one set of artifacts that are packaged and released together. I agree that marking a fix version as 1.5 for a change in another repo doesn't make a lot of sense, but we could just not use fix versions for the EC2 issues ? *shrug* it just seems harder and less natural to use ASF JIRA. What's the benefit? I agree it's not a big deal either way but it's a small part of the problem we're solving in the first place. I suspect that one way or the other, there would be issues filed both places, so this probably isn't worth debating. My concerns are less about it being pushed out etc. For better or worse we have had EC2 scripts be a part of the Spark distribution from a very early stage (from version 0.5.0 if my git history reading is correct). So users will assume that any error with EC2 scripts belong to the Spark project. In addition almost all the contributions to the EC2 scripts come from Spark developers and so keeping the issues in the same mailing list / JIRA seems natural. This I guess again relates to the question of managing issues for code that isn't part of the Spark release artifact. Yeah good question -- Github doesn't give you a mailing list. I think dev@ would still be where it's discussed which is ... again 'part of the problem' but as you say, probably beneficial. It's a pretty low traffic topic anyway. I'll create the amplab/spark-ec2 repo over the next couple of days unless there are more comments on this thread. This will at least alleviate some of the naming confusion over using a repository in mesos and I'll give Sean, Nick, Matthew commit access to it. I am still not convinced about moving the issues over though. I won't move the issues. Maybe time tells whether one approach is better, or that it just doesn't matter. However it'd be a great opportunity to review and clear stale EC2 issues. 
Re: Should spark-ec2 get its own repo?
The code can continue to be a good reference implementation, no matter where it lives. In fact, it can be a better, more complete one, and easier to update. I agree that ec2/ needs to retain some kind of pointer to the new location. Yes, maybe a script as well that does the checkout as you say. We have to be careful that the effect here isn't to make people think this code is still part of the blessed bits of a Spark release, since it isn't. But I suppose the point is that it isn't quite now either (isn't tested, isn't fully contained in apache/spark) and that's what we're fixing. I still don't like the idea of using the ASF JIRA for Spark to track issues in a separate project, as these kinds of splits are what we're trying to get rid of. I think it's a plus to be able to only bother with the Github PR/issue system, and not parallel JIRAs as well. I also worry that this blurs the line between code that is formally tested and blessed in a Spark release, and that which is not. You fix an issue in this separate repo and mark it fixed in Spark 1.5 -- what does that imply? I think the issue is people don't like the sense this is getting pushed outside the wall, or 'removed' from Spark. On the one hand I argue it hasn't really properly been part of Spark -- that's why we need this change to happen. But, I also think this is easy to resolve other ways: spark-packages.org, the pointer in the repo, prominent notes in the wiki, etc. I suggest Shivaram owns this, and that amplab/spark-ec2 is used to host? I'm not qualified to help make the new copy or repo admin but would be happy to help with the rest, like triaging, if you can give me rights to open issues. On Wed, Jul 15, 2015 at 5:35 AM, Matt Goodman meawo...@gmail.com wrote: I concur with the things Sean said about keeping the same JIRA. Frankly, it's a pretty small part of Spark, and as mentioned by Nicholas, a reference implementation of getting Spark running on EC2. 
I can see wanting to grow it to a little more general tool that implements launchers for other compute platforms. Porting this over to Google/M$/rackspace offerings would be not too far out of reach. --Matthew Goodman = Check Out My Website: http://craneium.net Find me on LinkedIn: http://tinyurl.com/d6wlch On Mon, Jul 13, 2015 at 2:46 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: At a high level I see the spark-ec2 scripts as an effort to provide a reference implementation for launching EC2 clusters with Apache Spark On a side note, this is precisely how I used spark-ec2 for a personal project that does something similar: reference implementation. Nick On Mon, Jul 13, 2015 at 1:27 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I think moving the repo-location and re-organizing the python code to handle dependencies, testing etc. sounds good to me. However, I think there are a couple of things which I am not sure about: 1. I strongly believe that we should preserve the existing command-line interface in ec2/spark-ec2 (i.e. the shell script, not the python file). This could be a thin wrapper script that just checks out or downloads something (similar to, say, build/mvn). Mainly, I see no reason to break the workflow that users are used to right now. 2. I am also not sure that moving the issue tracker is necessarily a good idea. I don't think we get a large number of issues due to the EC2 stuff and if we do have a workflow for launching EC2 clusters, the Spark JIRA would still be the natural place to report issues related to this. At a high level I see the spark-ec2 scripts as an effort to provide a reference implementation for launching EC2 clusters with Apache Spark -- Given this view I am not sure it makes sense to completely decouple this from the Apache project. Thanks Shivaram On Sun, Jul 12, 2015 at 1:34 AM, Sean Owen so...@cloudera.com wrote: I agree with these points. 
The ec2 support is substantially a separate project, and would likely be better managed as one. People can much more rapidly iterate on it and release it. I suggest: 1. Pick a new repo location. amplab/spark-ec2 ? spark-ec2/spark-ec2 ? 2. Add interested parties as owners/contributors 3. Reassemble a working clone of the current code from spark/ec2 and mesos/spark-ec2 and check it in 4. Announce the new location on user@, dev@ 5. Triage open JIRAs to the new repo's issue tracker and close them elsewhere 6. Remove the old copies of the code and leave a pointer to the new location in their place I'd also like to hear a few more nods before pulling the trigger though. On Sat, Jul 11, 2015 at 7:07 PM, Matt Goodman meawo...@gmail.com wrote: I wanted to revive the conversation about the spark-ec2 tools, as it seems to have been lost in the 1.4.1 release voting spree. I think that splitting it into its own repository is a really good move, and I would also be happy to help with this transition, as well as help
Re: Should spark-ec2 get its own repo?
I concur with the things Sean said about keeping the same JIRA. Frankly, it's a pretty small part of Spark, and as mentioned by Nicholas, a reference implementation of getting Spark running on EC2. I can see wanting to grow it to a little more general tool that implements launchers for other compute platforms. Porting this over to Google/M$/rackspace offerings would be not too far out of reach. --Matthew Goodman = Check Out My Website: http://craneium.net Find me on LinkedIn: http://tinyurl.com/d6wlch On Mon, Jul 13, 2015 at 2:46 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: At a high level I see the spark-ec2 scripts as an effort to provide a reference implementation for launching EC2 clusters with Apache Spark On a side note, this is precisely how I used spark-ec2 for a personal project that does something similar: reference implementation. Nick On Mon, Jul 13, 2015 at 1:27 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I think moving the repo-location and re-organizing the python code to handle dependencies, testing etc. sounds good to me. However, I think there are a couple of things which I am not sure about: 1. I strongly believe that we should preserve the existing command-line interface in ec2/spark-ec2 (i.e. the shell script, not the python file). This could be a thin wrapper script that just checks out or downloads something (similar to, say, build/mvn). Mainly, I see no reason to break the workflow that users are used to right now. 2. I am also not sure that moving the issue tracker is necessarily a good idea. I don't think we get a large number of issues due to the EC2 stuff and if we do have a workflow for launching EC2 clusters, the Spark JIRA would still be the natural place to report issues related to this. At a high level I see the spark-ec2 scripts as an effort to provide a reference implementation for launching EC2 clusters with Apache Spark -- Given this view I am not sure it makes sense to completely decouple this from the Apache project. 
Thanks Shivaram On Sun, Jul 12, 2015 at 1:34 AM, Sean Owen so...@cloudera.com wrote: I agree with these points. The ec2 support is substantially a separate project, and would likely be better managed as one. People can much more rapidly iterate on it and release it. I suggest: 1. Pick a new repo location. amplab/spark-ec2 ? spark-ec2/spark-ec2 ? 2. Add interested parties as owners/contributors 3. Reassemble a working clone of the current code from spark/ec2 and mesos/spark-ec2 and check it in 4. Announce the new location on user@, dev@ 5. Triage open JIRAs to the new repo's issue tracker and close them elsewhere 6. Remove the old copies of the code and leave a pointer to the new location in their place I'd also like to hear a few more nods before pulling the trigger though. On Sat, Jul 11, 2015 at 7:07 PM, Matt Goodman meawo...@gmail.com wrote: I wanted to revive the conversation about the spark-ec2 tools, as it seems to have been lost in the 1.4.1 release voting spree. I think that splitting it into its own repository is a really good move, and I would also be happy to help with this transition, as well as help maintain the resulting repository. Here is my justification for why we ought to do this split. User Facing: The spark-ec2 launcher dosen't use anything in the parent spark repository spark-ec2 version is disjoint from the parent repo. I consider it confusing that the spark-ec2 script dosen't launch the version of spark it is checked-out with. Someone interested in setting up spark-ec2 with anything but the default configuration will have to clone at least 2 repositories at present, and probably fork and push changes to 1. spark-ec2 has mismatched dependencies wrt. to spark itself. This includes a confusing shim in the spark-ec2 script to install boto, which frankly should just be a dependency of the script Developer Facing: Support across 2 repos will be worse than across 1. 
Its unclear where to file issues/PRs, and requires extra communications for even fairly trivial stuff. Spark-ec2 also depends on a number binary blobs being in the right place, currently the responsibility for these is decentralized, and likely prone to various flavors of dumb. The current flow of booting a spark-ec2 cluster is _complicated_ I spent the better part of a couple days figuring out how to integrate our custom tools into this stack. This is very hard to fix when commits/PR's need to span groups/repositories/buckets-o-binary, I am sure there are several other problems that are languishing under similar roadblocks It makes testing possible. The spark-ec2 script is a great case for CI given the number of permutations of launch criteria there are. I suspect AWS would be happy to foot the bill on spark-ec2 testing (probably ~20 bucks a month based on some envelope sketches),
Re: Should spark-ec2 get its own repo?
I think moving the repo-location and re-organizing the python code to handle dependencies, testing etc. sounds good to me. However, I think there are a couple of things which I am not sure about: 1. I strongly believe that we should preserve the existing command-line interface in ec2/spark-ec2 (i.e. the shell script, not the python file). This could be a thin wrapper script that just checks out or downloads something (similar to, say, build/mvn). Mainly, I see no reason to break the workflow that users are used to right now. 2. I am also not sure that moving the issue tracker is necessarily a good idea. I don't think we get a large number of issues due to the EC2 stuff and if we do have a workflow for launching EC2 clusters, the Spark JIRA would still be the natural place to report issues related to this. At a high level I see the spark-ec2 scripts as an effort to provide a reference implementation for launching EC2 clusters with Apache Spark -- Given this view I am not sure it makes sense to completely decouple this from the Apache project. Thanks Shivaram On Sun, Jul 12, 2015 at 1:34 AM, Sean Owen so...@cloudera.com wrote: I agree with these points. The ec2 support is substantially a separate project, and would likely be better managed as one. People can much more rapidly iterate on it and release it. I suggest: 1. Pick a new repo location. amplab/spark-ec2? spark-ec2/spark-ec2? 2. Add interested parties as owners/contributors 3. Reassemble a working clone of the current code from spark/ec2 and mesos/spark-ec2 and check it in 4. Announce the new location on user@, dev@ 5. Triage open JIRAs to the new repo's issue tracker and close them elsewhere 6. Remove the old copies of the code and leave a pointer to the new location in their place I'd also like to hear a few more nods before pulling the trigger though. 
On Sat, Jul 11, 2015 at 7:07 PM, Matt Goodman meawo...@gmail.com wrote: I wanted to revive the conversation about the spark-ec2 tools, as it seems to have been lost in the 1.4.1 release voting spree. I think that splitting it into its own repository is a really good move, and I would also be happy to help with this transition, as well as help maintain the resulting repository. Here is my justification for why we ought to do this split. User Facing: The spark-ec2 launcher doesn't use anything in the parent Spark repository. The spark-ec2 version is disjoint from the parent repo; I consider it confusing that the spark-ec2 script doesn't launch the version of Spark it is checked out with. Someone interested in setting up spark-ec2 with anything but the default configuration will have to clone at least 2 repositories at present, and probably fork and push changes to 1. spark-ec2 has mismatched dependencies with respect to Spark itself. This includes a confusing shim in the spark-ec2 script to install boto, which frankly should just be a dependency of the script. Developer Facing: Support across 2 repos will be worse than across 1. It's unclear where to file issues/PRs, and requires extra communication for even fairly trivial stuff. spark-ec2 also depends on a number of binary blobs being in the right place; currently the responsibility for these is decentralized, and likely prone to various flavors of dumb. The current flow of booting a spark-ec2 cluster is _complicated_. I spent the better part of a couple days figuring out how to integrate our custom tools into this stack. This is very hard to fix when commits/PRs need to span groups/repositories/buckets-o-binary. I am sure there are several other problems that are languishing under similar roadblocks. It makes testing possible. The spark-ec2 script is a great case for CI given the number of permutations of launch criteria there are. 
I suspect AWS would be happy to foot the bill on spark-ec2 testing (probably ~20 bucks a month based on some envelope sketches), as it is a piece of software that directly impacts other people giving them money. I have some contacts there, and I am pretty sure this would be an easy conversation, particularly if the repo were directly concerned with EC2. Think also of being able to assemble the binary blobs into an S3 bucket dedicated to spark-ec2. Any other thoughts/voices appreciated here. spark-ec2 is a super-power tool and deserves a fair bit of attention! --Matthew Goodman = Check Out My Website: http://craneium.net Find me on LinkedIn: http://tinyurl.com/d6wlch
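[Editorial note] Shivaram's point 1 above — keep a thin ec2/spark-ec2 wrapper that fetches the real launcher on demand, similar in spirit to build/mvn — could be sketched roughly as follows. This is only an illustration: the branch name, cache location, and the raw-file path inside amplab/spark-ec2 are assumptions, not the repository's actual layout.

```shell
#!/usr/bin/env bash
# Sketch of a thin ec2/spark-ec2 wrapper: fetch spark_ec2.py for a
# matching spark-ec2 branch, cache it locally, and delegate all CLI
# arguments to it. The repo layout and raw-file path are ASSUMED here.
set -euo pipefail

# Compose the raw-file URL for a given spark-ec2 branch (path assumed).
spark_ec2_url() {
  echo "https://raw.githubusercontent.com/amplab/spark-ec2/$1/spark_ec2.py"
}

# Download the driver script once, then forward all arguments to it.
run_spark_ec2() {
  local branch="${SPARK_EC2_BRANCH:-branch-1.4}"
  local cache="${HOME}/.spark-ec2/spark_ec2-${branch}.py"
  mkdir -p "$(dirname "${cache}")"
  if [ ! -f "${cache}" ]; then
    curl -fsSL "$(spark_ec2_url "${branch}")" -o "${cache}"
  fi
  exec python "${cache}" "$@"
}

# A real wrapper would end with: run_spark_ec2 "$@"
```

This preserves the familiar ec2/spark-ec2 entry point while the launcher itself lives and evolves in the external repo, so existing user workflows would not break.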
Re: Should spark-ec2 get its own repo?
At a high level I see the spark-ec2 scripts as an effort to provide a reference implementation for launching EC2 clusters with Apache Spark On a side note, this is precisely how I used spark-ec2 for a personal project that does something similar: reference implementation. Nick On Mon, Jul 13, 2015 at 1:27 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I think moving the repo-location and re-organizing the python code to handle dependencies, testing etc. sounds good to me. However, I think there are a couple of things which I am not sure about: 1. I strongly believe that we should preserve the existing command-line interface in ec2/spark-ec2 (i.e. the shell script, not the python file). This could be a thin wrapper script that just checks out or downloads something (similar to, say, build/mvn). Mainly, I see no reason to break the workflow that users are used to right now. 2. I am also not sure that moving the issue tracker is necessarily a good idea. I don't think we get a large number of issues due to the EC2 stuff and if we do have a workflow for launching EC2 clusters, the Spark JIRA would still be the natural place to report issues related to this. At a high level I see the spark-ec2 scripts as an effort to provide a reference implementation for launching EC2 clusters with Apache Spark -- Given this view I am not sure it makes sense to completely decouple this from the Apache project. Thanks Shivaram On Sun, Jul 12, 2015 at 1:34 AM, Sean Owen so...@cloudera.com wrote: I agree with these points. The ec2 support is substantially a separate project, and would likely be better managed as one. People can much more rapidly iterate on it and release it. I suggest: 1. Pick a new repo location. amplab/spark-ec2? spark-ec2/spark-ec2? 2. Add interested parties as owners/contributors 3. Reassemble a working clone of the current code from spark/ec2 and mesos/spark-ec2 and check it in 4. Announce the new location on user@, dev@ 5. 
Triage open JIRAs to the new repo's issue tracker and close them elsewhere 6. Remove the old copies of the code and leave a pointer to the new location in their place I'd also like to hear a few more nods before pulling the trigger though. On Sat, Jul 11, 2015 at 7:07 PM, Matt Goodman meawo...@gmail.com wrote: I wanted to revive the conversation about the spark-ec2 tools, as it seems to have been lost in the 1.4.1 release voting spree. I think that splitting it into its own repository is a really good move, and I would also be happy to help with this transition, as well as help maintain the resulting repository. Here is my justification for why we ought to do this split. User Facing: The spark-ec2 launcher dosen't use anything in the parent spark repository spark-ec2 version is disjoint from the parent repo. I consider it confusing that the spark-ec2 script dosen't launch the version of spark it is checked-out with. Someone interested in setting up spark-ec2 with anything but the default configuration will have to clone at least 2 repositories at present, and probably fork and push changes to 1. spark-ec2 has mismatched dependencies wrt. to spark itself. This includes a confusing shim in the spark-ec2 script to install boto, which frankly should just be a dependency of the script Developer Facing: Support across 2 repos will be worse than across 1. Its unclear where to file issues/PRs, and requires extra communications for even fairly trivial stuff. Spark-ec2 also depends on a number binary blobs being in the right place, currently the responsibility for these is decentralized, and likely prone to various flavors of dumb. The current flow of booting a spark-ec2 cluster is _complicated_ I spent the better part of a couple days figuring out how to integrate our custom tools into this stack. 
This is very hard to fix when commits/PR's need to span groups/repositories/buckets-o-binary, I am sure there are several other problems that are languishing under similar roadblocks It makes testing possible. The spark-ec2 script is a great case for CI given the number of permutations of launch criteria there are. I suspect AWS would be happy to foot the bill on spark-ec2 testing (probably ~20 bucks a month based on some envelope sketches), as it is a piece of software that directly impacts other people giving them money. I have some contacts there, and I am pretty sure this would be an easy conversation, particularly if the repo directly concerned with ec2. Think also being able to assemble the binary blobs into s3 bucket dedicated to spark-ec2 Any other thoughts/voices appreciated here. spark-ec2 is a super-power tool and deserves a fair bit of attention! --Matthew Goodman = Check Out My Website: http://craneium.net Find me on LinkedIn: http://tinyurl.com/d6wlch
Re: Should spark-ec2 get its own repo?
I agree with these points. The ec2 support is substantially a separate project, and would likely be better managed as one. People can much more rapidly iterate on it and release it. I suggest: 1. Pick a new repo location. amplab/spark-ec2? spark-ec2/spark-ec2? 2. Add interested parties as owners/contributors 3. Reassemble a working clone of the current code from spark/ec2 and mesos/spark-ec2 and check it in 4. Announce the new location on user@, dev@ 5. Triage open JIRAs to the new repo's issue tracker and close them elsewhere 6. Remove the old copies of the code and leave a pointer to the new location in their place I'd also like to hear a few more nods before pulling the trigger though. On Sat, Jul 11, 2015 at 7:07 PM, Matt Goodman meawo...@gmail.com wrote: I wanted to revive the conversation about the spark-ec2 tools, as it seems to have been lost in the 1.4.1 release voting spree. I think that splitting it into its own repository is a really good move, and I would also be happy to help with this transition, as well as help maintain the resulting repository. Here is my justification for why we ought to do this split. User Facing: The spark-ec2 launcher doesn't use anything in the parent Spark repository. The spark-ec2 version is disjoint from the parent repo; I consider it confusing that the spark-ec2 script doesn't launch the version of Spark it is checked out with. Someone interested in setting up spark-ec2 with anything but the default configuration will have to clone at least 2 repositories at present, and probably fork and push changes to 1. spark-ec2 has mismatched dependencies with respect to Spark itself. This includes a confusing shim in the spark-ec2 script to install boto, which frankly should just be a dependency of the script. Developer Facing: Support across 2 repos will be worse than across 1. It's unclear where to file issues/PRs, and requires extra communication for even fairly trivial stuff. 
Spark-ec2 also depends on a number of binary blobs being in the right place; currently the responsibility for these is decentralized, and likely prone to various flavors of dumb. The current flow of booting a spark-ec2 cluster is _complicated_. I spent the better part of a couple days figuring out how to integrate our custom tools into this stack. This is very hard to fix when commits/PRs need to span groups/repositories/buckets-o-binary. I am sure there are several other problems that are languishing under similar roadblocks. It makes testing possible. The spark-ec2 script is a great case for CI given the number of permutations of launch criteria there are. I suspect AWS would be happy to foot the bill on spark-ec2 testing (probably ~20 bucks a month based on some envelope sketches), as it is a piece of software that directly impacts other people giving them money. I have some contacts there, and I am pretty sure this would be an easy conversation, particularly if the repo were directly concerned with EC2. Think also of being able to assemble the binary blobs into an S3 bucket dedicated to spark-ec2. Any other thoughts/voices appreciated here. spark-ec2 is a super-power tool and deserves a fair bit of attention! --Matthew Goodman = Check Out My Website: http://craneium.net Find me on LinkedIn: http://tinyurl.com/d6wlch
Re: Should spark-ec2 get its own repo?
I wanted to revive the conversation about the spark-ec2 tools, as it seems to have been lost in the 1.4.1 release voting spree. I think that splitting it into its own repository is a really good move, and I would also be happy to help with this transition, as well as help maintain the resulting repository. Here is my justification for why we ought to do this split.

User Facing:
- The spark-ec2 launcher doesn't use anything in the parent Spark repository.
- The spark-ec2 version is disjoint from the parent repo. I consider it confusing that the spark-ec2 script doesn't launch the version of Spark it is checked out with.
- Someone interested in setting up spark-ec2 with anything but the default configuration will have to clone at least 2 repositories at present, and probably fork and push changes to 1.
- spark-ec2 has mismatched dependencies with respect to Spark itself. This includes a confusing shim in the spark-ec2 script to install boto, which frankly should just be a dependency of the script.

Developer Facing:
- Support across 2 repos will be worse than across 1. It's unclear where to file issues/PRs, and requires extra communication for even fairly trivial stuff.
- spark-ec2 also depends on a number of binary blobs being in the right place; currently the responsibility for these is decentralized, and likely prone to various flavors of dumb.
- The current flow of booting a spark-ec2 cluster is _complicated_. I spent the better part of a couple days figuring out how to integrate our custom tools into this stack. This is very hard to fix when commits/PRs need to span groups/repositories/buckets-o-binary. I am sure there are several other problems that are languishing under similar roadblocks.
- It makes testing possible. The spark-ec2 script is a great case for CI given the number of permutations of launch criteria there are.
I suspect AWS would be happy to foot the bill for spark-ec2 testing (probably ~20 bucks a month based on some back-of-envelope sketches), as it is a piece of software that directly impacts people giving them money. I have some contacts there, and I am pretty sure this would be an easy conversation, particularly if the repo were directly concerned with EC2. Think also of being able to assemble the binary blobs into an S3 bucket dedicated to spark-ec2.

Any other thoughts/voices are appreciated here. spark-ec2 is a super-power tool and deserves a fair bit of attention!

--Matthew Goodman = Check Out My Website: http://craneium.net Find me on LinkedIn: http://tinyurl.com/d6wlch
Should spark-ec2 get its own repo?
spark-ec2 is kind of a mini project within a project. It’s composed of a set of EC2 AMIs https://github.com/mesos/spark-ec2/tree/branch-1.4/ami-list under someone’s account (maybe Patrick’s?) plus the following 2 code bases:
- Main command line tool: https://github.com/apache/spark/tree/master/ec2
- Scripts used to install stuff on launched instances: https://github.com/mesos/spark-ec2

You’ll notice that part of the code lives under the Mesos GitHub organization. This is an artifact of history, from when Spark itself kinda grew out of Mesos before becoming its own project.

There are a few issues with this state of affairs, none of which are major but which nonetheless merit some discussion:
- The spark-ec2 code is split across 2 repositories when it is not technically necessary.
- Some of that code is owned by an organization that should technically not be owning Spark stuff.
- Spark and spark-ec2 live in the same repo, but spark-ec2 issues are often completely disjoint from issues with Spark itself. This has led in some cases to new Spark RCs being cut because of minor issues with spark-ec2 (like version strings not being updated).

I wanted to put up for discussion a few suggestions and see what people agreed with.
1. The current state of affairs is fine and it is not worth moving stuff around.
2. spark-ec2 should get its own repo, and should be moved out of the main Spark repo. That means both of the code bases linked above would live in one place (maybe a spark-ec2/spark-ec2 repo).
3. spark-ec2 should stay in the Spark repo, but the stuff under the Mesos organization should be moved elsewhere (again, perhaps under a spark-ec2/spark-ec2 repo).

What do you think?

Nick
Re: Should spark-ec2 get its own repo?
I'll render an opinion although I'm only barely qualified, having just had a small discussion on this --

It does seem like mesos/spark-ec2 is in the wrong place, although really, that is at best an issue for Mesos. But it does highlight that the Spark EC2 support doesn't entirely live with and get distributed with apache/spark. It does feel like that should move and should not be separate from the other half of the EC2 support.

Why not put it in apache/spark? I think the problem is that the AMI process clones the repo, and the apache/spark repo is huge. One answer is just to fix that by arranging a different way of releasing the EC2 files, as a downloadable archive.

However, if it is true that the Spark EC2 support doesn't need to live with and get released with the rest of Spark, it might make more sense to merge both halves into a new separate repo and run it separately from apache/spark, like any other third-party repo. I think that's less radical than it sounds, and has some benefits. There is not quite the same argument for needing to build and maintain this together as there is with language bindings and subprojects. But is that something that the people who use and maintain it agree with or are advocating for?

On Fri, Jul 3, 2015 at 6:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

spark-ec2 is kind of a mini project within a project. It’s composed of a set of EC2 AMIs under someone’s account (maybe Patrick’s?) plus the following 2 code bases:
- Main command line tool: https://github.com/apache/spark/tree/master/ec2
- Scripts used to install stuff on launched instances: https://github.com/mesos/spark-ec2

You’ll notice that part of the code lives under the Mesos GitHub organization. This is an artifact of history, from when Spark itself kinda grew out of Mesos before becoming its own project. 
There are a few issues with this state of affairs, none of which are major but which nonetheless merit some discussion:
- The spark-ec2 code is split across 2 repositories when it is not technically necessary.
- Some of that code is owned by an organization that should technically not be owning Spark stuff.
- Spark and spark-ec2 live in the same repo, but spark-ec2 issues are often completely disjoint from issues with Spark itself. This has led in some cases to new Spark RCs being cut because of minor issues with spark-ec2 (like version strings not being updated).

I wanted to put up for discussion a few suggestions and see what people agreed with.
1. The current state of affairs is fine and it is not worth moving stuff around.
2. spark-ec2 should get its own repo, and should be moved out of the main Spark repo. That means both of the code bases linked above would live in one place (maybe a spark-ec2/spark-ec2 repo).
3. spark-ec2 should stay in the Spark repo, but the stuff under the Mesos organization should be moved elsewhere (again, perhaps under a spark-ec2/spark-ec2 repo).

What do you think?

Nick

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Should spark-ec2 get its own repo?
As the person maintaining the mesos/spark-ec2 repo, here are my 2 cents:
- I don't think it makes sense to put the scripts in the Spark repo itself. Cloning the scripts onto the EC2 instances is an intentional design that allows us to make minor config changes to EC2 launches without needing a new Spark release.
- I think having some script to launch EC2 clusters as part of mainline Spark is a nice feature to have. However, this could be a very thin wrapper instead of the big Python file we have right now.
- Moving the scripts from the Mesos organization to spark-ec2 or amplab is fine by me. In fact, one nice way to do this transition would be to move the existing spark-ec2 repo to a new organization and then move the logic from the launcher script out of Spark into the new repo.

Thanks
Shivaram

On Fri, Jul 3, 2015 at 10:36 AM, Sean Owen so...@cloudera.com wrote:

I'll render an opinion although I'm only barely qualified, having just had a small discussion on this --

It does seem like mesos/spark-ec2 is in the wrong place, although really, that is at best an issue for Mesos. But it does highlight that the Spark EC2 support doesn't entirely live with and get distributed with apache/spark. It does feel like that should move and should not be separate from the other half of the EC2 support.

Why not put it in apache/spark? I think the problem is that the AMI process clones the repo, and the apache/spark repo is huge. One answer is just to fix that by arranging a different way of releasing the EC2 files, as a downloadable archive.

However, if it is true that the Spark EC2 support doesn't need to live with and get released with the rest of Spark, it might make more sense to merge both halves into a new separate repo and run it separately from apache/spark, like any other third-party repo. I think that's less radical than it sounds, and has some benefits. 
There is not quite the same argument for needing to build and maintain this together as there is with language bindings and subprojects. But is that something that the people who use and maintain it agree with or are advocating for?

On Fri, Jul 3, 2015 at 6:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

spark-ec2 is kind of a mini project within a project. It’s composed of a set of EC2 AMIs under someone’s account (maybe Patrick’s?) plus the following 2 code bases:
- Main command line tool: https://github.com/apache/spark/tree/master/ec2
- Scripts used to install stuff on launched instances: https://github.com/mesos/spark-ec2

You’ll notice that part of the code lives under the Mesos GitHub organization. This is an artifact of history, from when Spark itself kinda grew out of Mesos before becoming its own project.

There are a few issues with this state of affairs, none of which are major but which nonetheless merit some discussion:
- The spark-ec2 code is split across 2 repositories when it is not technically necessary.
- Some of that code is owned by an organization that should technically not be owning Spark stuff.
- Spark and spark-ec2 live in the same repo, but spark-ec2 issues are often completely disjoint from issues with Spark itself. This has led in some cases to new Spark RCs being cut because of minor issues with spark-ec2 (like version strings not being updated).

I wanted to put up for discussion a few suggestions and see what people agreed with.
1. The current state of affairs is fine and it is not worth moving stuff around.
2. spark-ec2 should get its own repo, and should be moved out of the main Spark repo. That means both of the code bases linked above would live in one place (maybe a spark-ec2/spark-ec2 repo).
3. spark-ec2 should stay in the Spark repo, but the stuff under the Mesos organization should be moved elsewhere (again, perhaps under a spark-ec2/spark-ec2 repo).

What do you think? 
Nick