Re: Scalastyle improvements / large code reformatting
On Mon, Oct 13, 2014 at 11:57 AM, Patrick Wendell pwend...@gmail.com wrote: That would even work for imports as well, you'd just have a thing where if anyone modified some imports they would have to fix all the imports in that file. It's at least worth a try. OK, that sounds like a fair compromise. I've updated the description on SPARK-3849 https://issues.apache.org/jira/browse/SPARK-3849 accordingly. Nick
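The compromise above — anyone who touches imports in a file fixes all the imports in that file — implies a check that can flag out-of-order imports in changed files. Purely as an illustration (this is not the SPARK-3849 implementation, and real Scala import ordering has more grouping rules), a minimal sorted-imports checker might look like:

```python
import re

def check_imports_sorted(source):
    """Flag imports that appear out of alphabetical order within a
    contiguous block of import lines."""
    violations, block = [], []
    for no, line in enumerate(source.splitlines(), start=1):
        m = re.match(r"\s*import\s+(.+)", line)
        if m:
            block.append((no, m.group(1).strip()))
        else:
            # A non-import line ends the current block.
            violations.extend(_out_of_order(block))
            block = []
    violations.extend(_out_of_order(block))
    return violations

def _out_of_order(block):
    # Report each import that sorts before its predecessor.
    return [(no, cur) for (_, prev), (no, cur) in zip(block, block[1:])
            if cur < prev]
```

A PR check built on this would only need to run it against the files a patch touches, which is the scoping Patrick suggests.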
Re: new jenkins update + tentative release date
Thanks for doing this work Shane. So is Jenkins in the new datacenter now? Do you know if the problems with checking out patches from GitHub should be resolved now? Here's an example from the past hour https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21702/console . Nick

On Mon, Oct 13, 2014 at 1:35 PM, shane knapp skn...@berkeley.edu wrote: AND WE ARE LIIIVE! https://amplab.cs.berkeley.edu/jenkins/ have at it, folks!

On Mon, Oct 13, 2014 at 10:15 AM, shane knapp skn...@berkeley.edu wrote: quick update: we should be back up and running in the next ~60mins.

On Mon, Oct 13, 2014 at 7:54 AM, shane knapp skn...@berkeley.edu wrote: Jenkins is in quiet mode and the move will be starting after i have my coffee. :)

On Sun, Oct 12, 2014 at 11:26 PM, Josh Rosen rosenvi...@gmail.com wrote: Reminder: this Jenkins migration is happening tomorrow morning (Monday).

On Fri, Oct 10, 2014 at 1:01 PM, shane knapp skn...@berkeley.edu wrote: reminder: this IS happening, first thing monday morning PDT. :)

On Wed, Oct 8, 2014 at 3:01 PM, shane knapp skn...@berkeley.edu wrote: greetings! i've got some updates regarding our new jenkins infrastructure, as well as the initial date and plan for rolling things out:

*** current testing/build break whack-a-mole: a lot of out of date artifacts are cached in the current jenkins, which has caused a few builds during my testing to break due to dependency resolution failure[1][2]. bumping these versions can cause your builds to fail, due to public api changes and the like. consider yourself warned that some projects might require some debugging... :) tomorrow, i will be at databricks working w/@joshrosen to make sure that the spark builds have any bugs hammered out.

*** deployment plan: unless something completely horrible happens, THE NEW JENKINS WILL GO LIVE ON MONDAY (october 13th). all jenkins infrastructure will be DOWN for the entirety of the day (starting at ~8am). this means no builds, period. i'm hoping that the downtime will be much shorter than this, but we'll have to see how everything goes. all test/build history WILL BE PRESERVED. i will be rsyncing the jenkins jobs/ directory over, complete w/history as part of the deployment. once i'm feeling good about the state of things, i'll point the original url to the new instances and send out an all clear. if you are a student at UC berkeley, you can log in to jenkins using your LDAP login, and (by default) view but not change plans. if you do not have a UC berkeley LDAP login, you can still view plans anonymously. IF YOU ARE A PLAN ADMIN, THEN PLEASE REACH OUT, ASAP, PRIVATELY AND I WILL SET UP ADMIN ACCESS TO YOUR BUILDS.

*** post deployment plan: fix all of the things that break! i will be keeping a VERY close eye on the builds, checking for breaks, and helping out where i can. if the situation is dire, i can always roll back to the old jenkins infra... but i hope we never get to that point! :) i'm hoping that things will go smoothly, but please be patient as i'm certain we'll hit a few bumps in the road. please let me know if you guys have any comments/questions/concerns... :) shane

1 - https://github.com/bigdatagenomics/bdg-services/pull/18
2 - https://github.com/bigdatagenomics/avocado/pull/111
Re: new jenkins update + tentative release date
Ah, that sucks. Thank you for looking into this. On Mon, Oct 13, 2014 at 5:43 PM, shane knapp skn...@berkeley.edu wrote: On Mon, Oct 13, 2014 at 2:28 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for doing this work Shane. So is Jenkins in the new datacenter now? Do you know if the problems with checking out patches from GitHub should be resolved now? Here's an example from the past hour https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21702/console . yeah, i just noticed that we're still having the checkout issues. i was really hoping that the better network would just make this go away... guess i'll be doing a deeper dive now. i would just up the timeout, but that's not coming out for a little while yet: https://issues.jenkins-ci.org/browse/JENKINS-20387 (we are currently running the latest -- 2.2.7, and the timeout field is coming in 2.3, whenever that is) i'll try and strace/replicate it locally as well.
Re: new jenkins update + tentative release date
*fingers crossed* On Mon, Oct 13, 2014 at 5:54 PM, shane knapp skn...@berkeley.edu wrote: ok, i found something that may help: https://issues.jenkins-ci.org/browse/JENKINS-20445?focusedCommentId=195638&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-195638 i set this to 20 minutes... let's see if that helps. On Mon, Oct 13, 2014 at 2:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, that sucks. Thank you for looking into this. On Mon, Oct 13, 2014 at 5:43 PM, shane knapp skn...@berkeley.edu wrote: On Mon, Oct 13, 2014 at 2:28 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for doing this work Shane. So is Jenkins in the new datacenter now? Do you know if the problems with checking out patches from GitHub should be resolved now? Here's an example from the past hour https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21702/console . yeah, i just noticed that we're still having the checkout issues. i was really hoping that the better network would just make this go away... guess i'll be doing a deeper dive now. i would just up the timeout, but that's not coming out for a little while yet: https://issues.jenkins-ci.org/browse/JENKINS-20387 (we are currently running the latest -- 2.2.7, and the timeout field is coming in 2.3, whenever that is) i'll try and strace/replicate it locally as well.
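Since the Jenkins git plugin's configurable timeout field wasn't released yet at the time, the interim fix discussed here is essentially "wait longer and try again". The general retry-with-backoff pattern behind that workaround can be sketched as follows — this is a generic illustration, not how Jenkins itself implements it:

```python
import time

def retry(fn, attempts=3, base_delay=1.0, retry_on=(OSError,)):
    """Call fn(), retrying with exponential backoff when it raises one
    of the given exception types; re-raise after the final attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            # 1x, 2x, 4x, ... the base delay between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A flaky `git fetch` wrapped this way succeeds as long as one attempt within the budget completes before the network hiccup recurs.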
Re: Trouble running tests
Running dev/run-tests as-is should work and will test everything. That's what the contributing guide recommends, if I remember correctly. At some point we should make it easier to test individual components locally using the dev script, but calling sbt on the various test suites as Michael pointed out will always work. Nick On Friday, October 10, 2014, Yana Kadiyska yana.kadiy...@gmail.com wrote: Thanks Nicholas and Michael-- yes, I wanted to make sure all tests pass before I submitted a pull request. AMPLAB_JENKINS=true ./dev/run-tests fails for me in mllib and yarn suites (synced to 14f222f7f76cc93633aae27a94c0e556e289ec56). I was however able to run Michael's suggested tests and my changes affect the SQL project only, so I'll go ahead with the pull request... I'd like to know if people run the full suite locally though -- I can imagine cases where a change is not clearly isolated to a single module. thanks again On Thu, Oct 9, 2014 at 5:26 PM, Michael Armbrust mich...@databricks.com wrote: Also, in general for SQL only changes it is sufficient to run sbt/sbt catalyst/test sql/test hive/test. The hive/test part takes the longest, so I usually leave that out until just before submitting unless my changes are hive specific. On Thu, Oct 9, 2014 at 11:40 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: _RUN_SQL_TESTS needs to be true as well. Those two _... variables get set correctly when tests are run on Jenkins. They’re not meant to be manipulated directly by testers. Did you want to run SQL tests only locally? You can try faking being Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That should be simpler than futzing with the _... variables. Nick On Thu, Oct 9, 2014 at 10:10 AM, Yana yana.kadiy...@gmail.com wrote: Hi, apologies if I missed a FAQ somewhere.
I am trying to submit a bug fix for the very first time. Reading instructions, I forked the git repo (at c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests. I run this: ./dev/run-tests _SQL_TESTS_ONLY=true and after a while get the following error:
[info] ScalaTest
[info] Run completed in 3 minutes, 37 seconds.
[info] Total number of tests run: 224
[info] Suites: completed 19, aborted 0
[info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
[info] All tests passed.
[info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
[success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
[error] Expected ID character
[error] Not a valid command: hive-thriftserver
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: hive-thriftserver
[error] hive-thriftserver/test
[error] ^
(I am running this without my changes) I have 2 questions: 1. How to fix this 2. Is there a best practice on what to fork so you start off with a good state? I'm wondering if I should sync the latest changes or go back to a label? thanks in advance -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
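The gating described in this thread — Jenkins sets AMPLAB_JENKINS, and the internal _RUN_SQL_TESTS / _SQL_TESTS_ONLY flags get derived from what changed — can be illustrated with a small sketch. The module names and the "everything lives under sql/" heuristic below are simplified assumptions for illustration, not the actual dev/run-tests logic:

```python
def modules_to_test(env, changed_files):
    """Sketch: when running under Jenkins (AMPLAB_JENKINS=true) and every
    changed file lives under sql/, run only the SQL-related suites;
    otherwise run everything."""
    on_jenkins = env.get("AMPLAB_JENKINS") == "true"
    sql_only = bool(changed_files) and all(
        f.startswith("sql/") for f in changed_files)
    if on_jenkins and sql_only:
        # Mirrors the sbt invocation suggested earlier in the thread.
        return ["catalyst/test", "sql/test", "hive/test"]
    return ["all"]
```

The point of the sketch is why setting _SQL_TESTS_ONLY=true by hand misfires: the flag is an output of this decision, not an input a tester is meant to supply.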
Re: Trouble running tests
_RUN_SQL_TESTS needs to be true as well. Those two _... variables get set correctly when tests are run on Jenkins. They’re not meant to be manipulated directly by testers. Did you want to run SQL tests only locally? You can try faking being Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That should be simpler than futzing with the _... variables. Nick On Thu, Oct 9, 2014 at 10:10 AM, Yana yana.kadiy...@gmail.com wrote: Hi, apologies if I missed a FAQ somewhere. I am trying to submit a bug fix for the very first time. Reading instructions, I forked the git repo (at c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests. I run this: ./dev/run-tests _SQL_TESTS_ONLY=true and after a while get the following error:
[info] ScalaTest
[info] Run completed in 3 minutes, 37 seconds.
[info] Total number of tests run: 224
[info] Suites: completed 19, aborted 0
[info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
[info] All tests passed.
[info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
[success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
[error] Expected ID character
[error] Not a valid command: hive-thriftserver
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: hive-thriftserver
[error] hive-thriftserver/test
[error] ^
(I am running this without my changes) I have 2 questions: 1. How to fix this 2. Is there a best practice on what to fork so you start off with a good state? I'm wondering if I should sync the latest changes or go back to a label? thanks in advance
spark-prs and mesos/spark-ec2
Does it make sense to point the Spark PR review board to read from mesos/spark-ec2 as well? PRs submitted against that repo may reference Spark JIRAs and need review just like any other Spark PR. Nick
Re: Unneeded branches/tags
So: - tags: can delete - branches: stuck with ‘em Correct? Nick On Wed, Oct 8, 2014 at 1:52 AM, Patrick Wendell pwend...@gmail.com wrote: Actually - weirdly - we can delete old tags and it works with the mirroring. Nick if you put together a list of un-needed tags I can delete them. On Tue, Oct 7, 2014 at 6:27 PM, Reynold Xin r...@databricks.com wrote: Those branches are no longer active. However, I don't think we can delete branches from github due to the way ASF mirroring works. I might be wrong there. On Tue, Oct 7, 2014 at 6:25 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Just curious: Are there branches and/or tags on the repo that we don't need anymore? What are the scala-2.9 and streaming branches for, for example? And do we still need branches for older versions of Spark that we are not backporting stuff to, like branch-0.5? Nick
Re: Extending Scala style checks
I've created SPARK-3849: Automate remaining Scala style rules https://issues.apache.org/jira/browse/SPARK-3849. Please create sub-tasks on this issue for rules that we have not automated and let's work through them as possible. I went ahead and created the first sub-task, SPARK-3850: Scala style: Disallow trailing spaces https://issues.apache.org/jira/browse/SPARK-3850. Nick On Tue, Oct 7, 2014 at 4:45 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: For starters, do we have a list of all the Scala style rules that are currently not enforced automatically but are likely well-suited for automation? Let's put such a list together in a JIRA issue and work through implementing them. Nick On Thu, Oct 2, 2014 at 12:06 AM, Cheng Lian lian.cs@gmail.com wrote: Since we can easily catch the list of all changed files in a PR, I think we can start with adding the no trailing space check for newly changed files only? On 10/2/14 9:24 AM, Nicholas Chammas wrote: Yeah, I remember that hell when I added PEP 8 to the build checks and fixed all the outstanding Python style issues. I had to keep rebasing and resolving merge conflicts until the PR was merged. It's a rough process, but thankfully it's also a one-time process. I might be able to help with that in the next week or two if no-one else wants to pick it up. Nick On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust mich...@databricks.com wrote: The hard part here is updating the existing code base... which is going to create merge conflicts with like all of the open PRs... On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, since there appears to be a built-in rule for end-of-line whitespace, Michael and Cheng, y'all should be able to add this in pretty easily. Nick On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nick, We can always take built-in rules. 
Back when we added this Prashant Sharma actually did some great work that lets us write our own style rules in cases where rules don't exist. You can see some existing rules here: https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle Prashant has over time contributed a lot of our custom rules upstream to scalastyle, so now there are only a couple there. - Patrick On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at WhitespaceEndOfLineChecker under: http://www.scalastyle.org/rules-0.1.0.html Cheers On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: As discussed here https://github.com/apache/spark/pull/2619, it would be good to extend our Scala style checks to programmatically enforce as many of our style rules as possible. Does anyone know if it's relatively straightforward to enforce additional rules like the no trailing spaces rule mentioned in the linked PR? Nick
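Real scalastyle rules such as WhitespaceEndOfLineChecker are written in Scala against scalastyle's checker API; purely to illustrate the condition that rule enforces, here is a minimal stand-in:

```python
def trailing_whitespace_violations(source):
    """Return (line number, line) for every line that ends in spaces or
    tabs -- the condition an end-of-line whitespace rule rejects."""
    return [(no, line)
            for no, line in enumerate(source.splitlines(), start=1)
            if line != line.rstrip(" \t")]
```

Running a check like this over only the files changed in a PR is exactly the incremental-enforcement idea Cheng suggests earlier in the thread.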
Re: Extending Scala style checks
For starters, do we have a list of all the Scala style rules that are currently not enforced automatically but are likely well-suited for automation? Let's put such a list together in a JIRA issue and work through implementing them. Nick On Thu, Oct 2, 2014 at 12:06 AM, Cheng Lian lian.cs@gmail.com wrote: Since we can easily catch the list of all changed files in a PR, I think we can start with adding the no trailing space check for newly changed files only? On 10/2/14 9:24 AM, Nicholas Chammas wrote: Yeah, I remember that hell when I added PEP 8 to the build checks and fixed all the outstanding Python style issues. I had to keep rebasing and resolving merge conflicts until the PR was merged. It's a rough process, but thankfully it's also a one-time process. I might be able to help with that in the next week or two if no-one else wants to pick it up. Nick On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust mich...@databricks.com wrote: The hard part here is updating the existing code base... which is going to create merge conflicts with like all of the open PRs... On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, since there appears to be a built-in rule for end-of-line whitespace, Michael and Cheng, y'all should be able to add this in pretty easily. Nick On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nick, We can always take built-in rules. Back when we added this Prashant Sharma actually did some great work that lets us write our own style rules in cases where rules don't exist. You can see some existing rules here: https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle Prashant has over time contributed a lot of our custom rules upstream to scalastyle, so now there are only a couple there.
- Patrick On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at WhitespaceEndOfLineChecker under: http://www.scalastyle.org/rules-0.1.0.html Cheers On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: As discussed here https://github.com/apache/spark/pull/2619, it would be good to extend our Scala style checks to programmatically enforce as many of our style rules as possible. Does anyone know if it's relatively straightforward to enforce additional rules like the no trailing spaces rule mentioned in the linked PR? Nick
Unneeded branches/tags
Just curious: Are there branches and/or tags on the repo that we don’t need anymore? What are the scala-2.9 and streaming branches for, for example? And do we still need branches for older versions of Spark that we are not backporting stuff to, like branch-0.5? Nick
Re: EC2 clusters ready in launch time + 30 seconds
FYI: I've created SPARK-3821: Develop an automated way of creating Spark images (AMI, Docker, and others) https://issues.apache.org/jira/browse/SPARK-3821 On Mon, Oct 6, 2014 at 4:48 PM, Daniil Osipov daniil.osi...@shazam.com wrote: I've also been looking at this. Basically, the Spark EC2 script is excellent for small development clusters of several nodes, but isn't suitable for production. It handles instance setup in a single-threaded manner, while it can easily be parallelized. It also doesn't handle failure well, e.g. when an instance fails to start or is taking too long to respond. Our desire was to have an equivalent to Amazon EMR[1] API that would trigger Spark jobs, including specified cluster setup. I've done some work towards that end, and it would benefit from an updated AMI greatly. Dan [1] http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html On Sat, Oct 4, 2014 at 7:28 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for posting that script, Patrick. It looks like a good place to start. Regarding Docker vs. Packer, as I understand it you can use Packer to create Docker containers at the same time as AMIs and other image types. Nick On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just a couple notes. I recently posted a shell script for creating the AMI's from a clean Amazon Linux AMI. https://github.com/mesos/spark-ec2/blob/v3/create_image.sh I think I will update the AMI's soon to get the most recent security updates. For spark-ec2's purpose this is probably sufficient (we'll only need to re-create them every few months). However, it would be cool if someone wanted to tackle providing a more general mechanism for defining Spark-friendly images that can be used more generally. I had thought that docker might be a good way to go for something like this - but maybe this packer thing is good too.
For one thing, if we had a standard image we could use it to create containers for running Spark's unit test, which would be really cool. This would help a lot with random issues around port and filesystem contention we have for unit tests. I'm not sure if the long term place for this would be inside the spark codebase or a community library or what. But it would definitely be very valuable to have if someone wanted to take it on. - Patrick On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: FYI: There is an existing issue -- SPARK-3314 https://issues.apache.org/jira/browse/SPARK-3314 -- about scripting the creation of Spark AMIs. With Packer, it looks like we may be able to script the creation of multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from a single Packer template. That's very cool. I'll be looking into this. Nick On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for the update, Nate. I'm looking forward to seeing how these projects turn out. David, Packer looks very, very interesting. I'm gonna look into it more next week. Nick On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com wrote: Bit of progress on our end, bit of lagging as well. Our guy leading effort got little bogged down on client project to update hive/sql testbed to latest spark/sparkSQL, also launching public service so we have been bit scattered recently. Will have some more updates probably after next week. We are planning on taking our client work around hive/spark, plus taking over the bigtop automation work to modernize and get that fit for human consumption outside our org. All our work and puppet modules will be open sourced, documented, hopefully start to rally some other folks around effort that find it useful Side note, another effort we are looking into is gradle tests/support.
We have been leveraging serverspec for some basic infrastructure tests, but with bigtop switching over to gradle builds/testing setup in 0.8 we want to include support for that in our own efforts, probably some stuff that can be learned and leveraged in spark world for repeatable/tested infrastructure If anyone has any specific automation questions to your environment you can drop me a line directly.., will try to help out best I can. Else will post update to dev list once we get on top of our own product release and the bigtop work Nate -Original Message- From: David Rowe [mailto:davidr...@gmail.com] Sent: Thursday, October 02, 2014 4:44 PM To: Nicholas Chammas Cc: dev; Shivaram Venkataraman Subject: Re: EC2 clusters ready in launch time + 30 seconds I think this is exactly what packer is for. See e.g. http://www.packer.io/intro/getting-started/build
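Daniil's two complaints about spark-ec2 above — setup runs single-threaded, and one slow or failed instance stalls the whole launch — suggest fanning the per-host setup out to a thread pool and recording failures per host. A rough sketch of that shape (setup_host here is a hypothetical per-instance setup callable, not part of spark-ec2):

```python
from concurrent.futures import ThreadPoolExecutor

def setup_cluster(hosts, setup_host, max_workers=8):
    """Run setup_host(host) for all hosts in parallel, capturing per-host
    failures instead of letting one bad instance abort the whole launch."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Launch all setups at once, then collect results as each finishes.
        futures = {pool.submit(setup_host, h): h for h in hosts}
        for fut, host in futures.items():
            try:
                results[host] = ("ok", fut.result())
            except Exception as exc:
                results[host] = ("failed", str(exc))
    return results
```

With results tracked per host, the caller can retry or replace only the failed instances rather than tearing down the cluster.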
Re: EC2 clusters ready in launch time + 30 seconds
Thanks for posting that script, Patrick. It looks like a good place to start. Regarding Docker vs. Packer, as I understand it you can use Packer to create Docker containers at the same time as AMIs and other image types. Nick On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just a couple notes. I recently posted a shell script for creating the AMI's from a clean Amazon Linux AMI. https://github.com/mesos/spark-ec2/blob/v3/create_image.sh I think I will update the AMI's soon to get the most recent security updates. For spark-ec2's purpose this is probably sufficient (we'll only need to re-create them every few months). However, it would be cool if someone wanted to tackle providing a more general mechanism for defining Spark-friendly images that can be used more generally. I had thought that docker might be a good way to go for something like this - but maybe this packer thing is good too. For one thing, if we had a standard image we could use it to create containers for running Spark's unit test, which would be really cool. This would help a lot with random issues around port and filesystem contention we have for unit tests. I'm not sure if the long term place for this would be inside the spark codebase or a community library or what. But it would definitely be very valuable to have if someone wanted to take it on. - Patrick On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: FYI: There is an existing issue -- SPARK-3314 https://issues.apache.org/jira/browse/SPARK-3314 -- about scripting the creation of Spark AMIs. With Packer, it looks like we may be able to script the creation of multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from a single Packer template. That's very cool. I'll be looking into this. Nick On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for the update, Nate. I'm looking forward to seeing how these projects turn out. 
David, Packer looks very, very interesting. I'm gonna look into it more next week. Nick On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com wrote: Bit of progress on our end, bit of lagging as well. Our guy leading effort got little bogged down on client project to update hive/sql testbed to latest spark/sparkSQL, also launching public service so we have been bit scattered recently. Will have some more updates probably after next week. We are planning on taking our client work around hive/spark, plus taking over the bigtop automation work to modernize and get that fit for human consumption outside our org. All our work and puppet modules will be open sourced, documented, hopefully start to rally some other folks around effort that find it useful Side note, another effort we are looking into is gradle tests/support. We have been leveraging serverspec for some basic infrastructure tests, but with bigtop switching over to gradle builds/testing setup in 0.8 we want to include support for that in our own efforts, probably some stuff that can be learned and leveraged in spark world for repeatable/tested infrastructure If anyone has any specific automation questions to your environment you can drop me a line directly.., will try to help out best I can. Else will post update to dev list once we get on top of our own product release and the bigtop work Nate -Original Message- From: David Rowe [mailto:davidr...@gmail.com] Sent: Thursday, October 02, 2014 4:44 PM To: Nicholas Chammas Cc: dev; Shivaram Venkataraman Subject: Re: EC2 clusters ready in launch time + 30 seconds I think this is exactly what packer is for. See e.g. http://www.packer.io/intro/getting-started/build-image.html On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*) has a bad package for httpd, which causes ganglia not to start. For some reason I can't get access to the raw AMI to fix it.
On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Is there perhaps a way to define an AMI programmatically? Like, a collection of base AMI id + list of required stuff to be installed + list of required configuration changes. I'm guessing that's what people use things like Puppet, Ansible, or maybe also AWS CloudFormation for, right? If we could do something like that, then with every new release of Spark we could quickly and easily create new AMIs that have everything we need. spark-ec2 would only have to bring up the instances and do a minimal amount of configuration, and the only thing we'd need to track in the Spark repo is the code that defines what goes on the AMI, as well as a list of the AMI ids specific to each release. I'm just thinking out loud here. Does this make sense? Nate, Any
Re: EC2 clusters ready in launch time + 30 seconds
FYI: There is an existing issue -- SPARK-3314 https://issues.apache.org/jira/browse/SPARK-3314 -- about scripting the creation of Spark AMIs. With Packer, it looks like we may be able to script the creation of multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from a single Packer template. That's very cool. I'll be looking into this. Nick On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for the update, Nate. I'm looking forward to seeing how these projects turn out. David, Packer looks very, very interesting. I'm gonna look into it more next week. Nick On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com wrote: Bit of progress on our end, bit of lagging as well. Our guy leading effort got little bogged down on client project to update hive/sql testbed to latest spark/sparkSQL, also launching public service so we have been bit scattered recently. Will have some more updates probably after next week. We are planning on taking our client work around hive/spark, plus taking over the bigtop automation work to modernize and get that fit for human consumption outside our org. All our work and puppet modules will be open sourced, documented, hopefully start to rally some other folks around effort that find it useful Side note, another effort we are looking into is gradle tests/support. We have been leveraging serverspec for some basic infrastructure tests, but with bigtop switching over to gradle builds/testing setup in 0.8 we want to include support for that in our own efforts, probably some stuff that can be learned and leveraged in spark world for repeatable/tested infrastructure If anyone has any specific automation questions to your environment you can drop me a line directly.., will try to help out best I can.
Else will post update to dev list once we get on top of our own product release and the bigtop work Nate -Original Message- From: David Rowe [mailto:davidr...@gmail.com] Sent: Thursday, October 02, 2014 4:44 PM To: Nicholas Chammas Cc: dev; Shivaram Venkataraman Subject: Re: EC2 clusters ready in launch time + 30 seconds I think this is exactly what packer is for. See e.g. http://www.packer.io/intro/getting-started/build-image.html On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*) has a bad package for httpd, which causes ganglia not to start. For some reason I can't get access to the raw AMI to fix it. On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Is there perhaps a way to define an AMI programmatically? Like, a collection of base AMI id + list of required stuff to be installed + list of required configuration changes. I’m guessing that’s what people use things like Puppet, Ansible, or maybe also AWS CloudFormation for, right? If we could do something like that, then with every new release of Spark we could quickly and easily create new AMIs that have everything we need. spark-ec2 would only have to bring up the instances and do a minimal amount of configuration, and the only thing we’d need to track in the Spark repo is the code that defines what goes on the AMI, as well as a list of the AMI ids specific to each release. I’m just thinking out loud here. Does this make sense? Nate, Any progress on your end with this work? Nick On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: It should be possible to improve cluster launch time if we are careful about what commands we run during setup. One way to do this would be to walk down the list of things we do for cluster initialization and see if there is anything we can do to make things faster. Unfortunately this might be pretty time consuming, but I don't know of a better strategy.
The place to start would be the setup.sh file at https://github.com/mesos/spark-ec2/blob/v3/setup.sh Here are some things that take a lot of time and could be improved:

1. Creating swap partitions on all machines. We could check if there is a way to get EC2 to always mount a swap partition
2. Copying / syncing things across slaves. The copy-dir script is called too many times right now and each time it pauses for a few milliseconds between slaves [1]. This could be improved by removing unnecessary copies
3. We could make less frequently used modules like Tachyon, persistent hdfs not a part of the default setup.

[1] https://github.com/mesos/spark-ec2/blob/v3/copy-dir.sh#L42

Thanks Shivaram

On Sat, Jul 12, 2014 at 7:02 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico n...@reactor8.com wrote: Starting to work through some automation/config stuff for spark stack on EC2 with a project, will be focusing the work through
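Shivaram's second item — copy-dir invoked too many times, with a pause between slaves on each invocation — amounts to deduplicating the directories to ship and syncing each slave once, in parallel. A sketch of that restructuring (copy_fn stands in for whatever actually ships a directory to a slave, e.g. an rsync wrapper; it is a hypothetical hook, not part of spark-ec2):

```python
from concurrent.futures import ThreadPoolExecutor

def sync_dirs(dirs, slaves, copy_fn, max_workers=8):
    """Deduplicate the directories to ship, then sync each slave exactly
    once, in parallel, rather than one copy-dir invocation per directory
    per slave with a pause in between."""
    unique_dirs = sorted(set(dirs))

    def sync_slave(slave):
        # All directories go to this slave in a single pass.
        for d in unique_dirs:
            copy_fn(slave, d)
        return len(unique_dirs)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(slaves, pool.map(sync_slave, slaves)))
```

The win is twofold: repeated requests for the same directory collapse into one, and the per-slave work overlaps instead of serializing with sleeps.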
Re: EC2 clusters ready in launch time + 30 seconds
Thanks for the update, Nate. I'm looking forward to seeing how these projects turn out. David, Packer looks very, very interesting. I'm gonna look into it more next week. Nick On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com wrote: Bit of progress on our end, bit of lagging as well. Our guy leading effort got little bogged down on client project to update hive/sql testbed to latest spark/sparkSQL, also launching public service so we have been bit scattered recently. Will have some more updates probably after next week. We are planning on taking our client work around hive/spark, plus taking over the bigtop automation work to modernize and get that fit for human consumption outside our org. All our work and puppet modules will be open sourced, documented, hopefully start to rally some other folks around effort that find it useful Side note, another effort we are looking into is gradle tests/support. We have been leveraging serverspec for some basic infrastructure tests, but with bigtop switching over to gradle builds/testing setup in 0.8 we want to include support for that in our own efforts, probably some stuff that can be learned and leveraged in spark world for repeatable/tested infrastructure If anyone has any specific automation questions to your environment you can drop me a line directly... will try to help out best I can. Else will post update to dev list once we get on top of our own product release and the bigtop work Nate -Original Message- From: David Rowe [mailto:davidr...@gmail.com] Sent: Thursday, October 02, 2014 4:44 PM To: Nicholas Chammas Cc: dev; Shivaram Venkataraman Subject: Re: EC2 clusters ready in launch time + 30 seconds I think this is exactly what packer is for. See e.g. http://www.packer.io/intro/getting-started/build-image.html On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*) has a bad package for httpd, which causes ganglia not to start. For some reason I can't get access to the raw AMI to fix it. 
On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Is there perhaps a way to define an AMI programmatically? Like, a collection of base AMI id + list of required stuff to be installed + list of required configuration changes. I’m guessing that’s what people use things like Puppet, Ansible, or maybe also AWS CloudFormation for, right? If we could do something like that, then with every new release of Spark we could quickly and easily create new AMIs that have everything we need. spark-ec2 would only have to bring up the instances and do a minimal amount of configuration, and the only thing we’d need to track in the Spark repo is the code that defines what goes on the AMI, as well as a list of the AMI ids specific to each release. I’m just thinking out loud here. Does this make sense? Nate, Any progress on your end with this work? Nick On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: It should be possible to improve cluster launch time if we are careful about what commands we run during setup. One way to do this would be to walk down the list of things we do for cluster initialization and see if there is anything we can do make things faster. Unfortunately this might be pretty time consuming, but I don't know of a better strategy. The place to start would be the setup.sh file at https://github.com/mesos/spark-ec2/blob/v3/setup.sh Here are some things that take a lot of time and could be improved: 1. Creating swap partitions on all machines. We could check if there is a way to get EC2 to always mount a swap partition 2. Copying / syncing things across slaves. The copy-dir script is called too many times right now and each time it pauses for a few milliseconds between slaves [1]. This could be improved by removing unnecessary copies 3. We could make less frequently used modules like Tachyon, persistent hdfs not a part of the default setup. 
[1] https://github.com/mesos/spark-ec2/blob/v3/copy-dir.sh#L42 Thanks Shivaram On Sat, Jul 12, 2014 at 7:02 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico n...@reactor8.com wrote: Starting to work through some automation/config stuff for spark stack on EC2 with a project, will be focusing the work through the apache bigtop effort to start, can then share with spark community directly as things progress if people are interested Let us know how that goes. I'm definitely interested in hearing more. Nick
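The "base AMI id + list of required stuff to be installed + list of required configuration changes" idea described above maps almost directly onto a Packer template. As a rough sketch: the `amazon-ebs` builder and `shell` provisioner are real Packer concepts, but the `ami_spec` helper, the package names, and the field values below are illustrative, not spark-ec2's actual setup.

```python
# Hypothetical sketch of defining an AMI programmatically: a base AMI id
# plus lists of packages and extra setup commands, rendered into a
# Packer-style JSON template dict. Names and values are illustrative.
import json

def ami_spec(base_ami, region, packages, extra_cmds=()):
    """Build a minimal Packer-style template for baking a prebuilt AMI."""
    provision_cmds = ["yum install -y " + " ".join(packages)] + list(extra_cmds)
    return {
        "builders": [{
            "type": "amazon-ebs",
            "region": region,
            "source_ami": base_ami,
            "instance_type": "m3.large",
            "ami_name": "spark-prebaked-{{timestamp}}",
        }],
        "provisioners": [{
            "type": "shell",
            "inline": provision_cmds,
        }],
    }

# Per-release, only this spec would live in the Spark repo, plus the
# resulting AMI ids. The swap setup addresses item 1 from Shivaram's list.
spec = ami_spec("ami-12345678", "us-east-1",
                packages=["java-1.7.0-openjdk", "httpd", "ganglia"],
                extra_cmds=["mkswap /dev/xvdb && swapon /dev/xvdb"])
print(json.dumps(spec, indent=2))
```

With the slow installation steps baked into the image, spark-ec2 would be left with only instance launch and minimal per-cluster configuration, which is where the "launch time + 30 seconds" goal comes from.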
Re: amplab jenkins is down
On Thu, Sep 4, 2014 at 4:19 PM, shane knapp skn...@berkeley.edu wrote: on a side note, this incident will be accelerating our plan to move the entire jenkins infrastructure in to a managed datacenter environment. this will be our major push over the next couple of weeks. more details about this, also, as soon as i get them. Are there any updates on this move of the Jenkins infrastructure to a managed datacenter? I remember it being mentioned that another benefit of this move would be reduced flakiness when Jenkins tries to checkout patches for testing. For some reason, I'm getting a lot of those https://github.com/apache/spark/pull/2606#issuecomment-57514540 today. Nick
Re: do MIMA checking before all test cases start?
How early can MiMa checks be run? Before Spark is even built https://github.com/apache/spark/blob/8cc70e7e15fd800f31b94e9102069506360289db/dev/run-tests#L118? After the build but before the unit tests? On Thu, Sep 25, 2014 at 6:06 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah we can also move it first. Wouldn't hurt. On Thu, Sep 25, 2014 at 6:39 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: It might still make sense to make this change if MIMA checks are always relatively quick, for the same reason we do style checks first. On Thu, Sep 25, 2014 at 12:25 AM, Nan Zhu zhunanmcg...@gmail.com wrote: yeah, I tried that, but there is always an issue when I run dev/mima, it always gives me some binary compatibility error on the Java API part so I have to wait for Jenkins' result when fixing MIMA issues -- Nan Zhu On Thursday, September 25, 2014 at 12:04 AM, Patrick Wendell wrote: Have you considered running the mima checks locally? We prefer people not use Jenkins for very frequent checks since it takes resources away from other people trying to run tests. On Wed, Sep 24, 2014 at 6:44 PM, Nan Zhu zhunanmcg...@gmail.com (mailto:zhunanmcg...@gmail.com) wrote: Hi, all It seems that, currently, Jenkins runs the MIMA check after all test cases have finished. IIRC, during the first months after we introduced MIMA, we did the MIMA check before running test cases. What's the motivation for adjusting this behaviour? In my opinion, if you have some binary compatibility issues, you just need to make some minor changes, but in the current environment, you can only find out whether your change works after all test cases have finished (1 hour later...) Best, -- Nan Zhu
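The reordering being discussed is plain fail-fast sequencing: run the cheap static checks (style, MiMa) before the hour-long test suite and stop at the first failure. A minimal sketch, where the step names and outcomes are hypothetical rather than the actual dev/run-tests logic:

```python
# Fail-fast ordering for a test pipeline: cheap checks first, expensive
# tests last, and stop at the first failure so feedback on a binary
# compatibility break arrives in seconds instead of an hour.
def run_pipeline(steps):
    """Run (name, fn) steps in order; return the first failing step's name, or None."""
    for name, fn in steps:
        if not fn():
            print("FAILED: %s" % name)
            return name
        print("passed: %s" % name)
    return None

calls = []
steps = [
    ("scalastyle", lambda: calls.append("style") or True),
    ("mima",       lambda: calls.append("mima") or False),   # simulated compat break
    ("unit-tests", lambda: calls.append("tests") or True),   # never reached
]
failed = run_pipeline(steps)
```

Here `failed` is `"mima"` and the expensive unit-test step never runs, which is the behavior Nan is asking for.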
Re: Extending Scala style checks
Ah, since there appears to be a built-in rule for end-of-line whitespace, Michael and Cheng, y'all should be able to add this in pretty easily. Nick On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nick, We can always take built-in rules. Back when we added this Prashant Sharma actually did some great work that lets us write our own style rules in cases where rules don't exist. You can see some existing rules here: https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle Prashant has over time contributed a lot of our custom rules upstream to scalastyle, so now there are only a couple there. - Patrick On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at WhitespaceEndOfLineChecker under: http://www.scalastyle.org/rules-0.1.0.html Cheers On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: As discussed here https://github.com/apache/spark/pull/2619, it would be good to extend our Scala style checks to programmatically enforce as many of our style rules as possible. Does anyone know if it's relatively straightforward to enforce additional rules like the no trailing spaces rule mentioned in the linked PR? Nick
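For reference, enabling the built-in checker Ted points to would look something like the following in scalastyle-config.xml. The checker class name is taken from the scalastyle rules documentation linked above; exact configuration details may vary by scalastyle version.

```xml
<!-- Fail the build on trailing whitespace at end of line. -->
<check level="error"
       class="org.scalastyle.file.WhitespaceEndOfLineChecker"
       enabled="true"/>
```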
Re: Extending Scala style checks
Yeah, I remember that hell when I added PEP 8 to the build checks and fixed all the outstanding Python style issues. I had to keep rebasing and resolving merge conflicts until the PR was merged. It's a rough process, but thankfully it's also a one-time process. I might be able to help with that in the next week or two if no-one else wants to pick it up. Nick On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust mich...@databricks.com wrote: The hard part here is updating the existing code base... which is going to create merge conflicts with like all of the open PRs... On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, since there appears to be a built-in rule for end-of-line whitespace, Michael and Cheng, y'all should be able to add this in pretty easily. Nick On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nick, We can always take built-in rules. Back when we added this Prashant Sharma actually did some great work that lets us write our own style rules in cases where rules don't exist. You can see some existing rules here: https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle Prashant has over time contributed a lot of our custom rules upstream to scalastyle, so now there are only a couple there. - Patrick On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at WhitespaceEndOfLineChecker under: http://www.scalastyle.org/rules-0.1.0.html Cheers On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: As discussed here https://github.com/apache/spark/pull/2619, it would be good to extend our Scala style checks to programmatically enforce as many of our style rules as possible. Does anyone know if it's relatively straightforward to enforce additional rules like the no trailing spaces rule mentioned in the linked PR? Nick
Re: Extending Scala style checks
Does anyone know if Scala has something equivalent to autopep8 https://pypi.python.org/pypi/autopep8? It would help patch up the existing code base a lot quicker as we add in new style rules. On Wed, Oct 1, 2014 at 9:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Yeah, I remember that hell when I added PEP 8 to the build checks and fixed all the outstanding Python style issues. I had to keep rebasing and resolving merge conflicts until the PR was merged. It's a rough process, but thankfully it's also a one-time process. I might be able to help with that in the next week or two if no-one else wants to pick it up. Nick On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust mich...@databricks.com wrote: The hard part here is updating the existing code base... which is going to create merge conflicts with like all of the open PRs... On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, since there appears to be a built-in rule for end-of-line whitespace, Michael and Cheng, y'all should be able to add this in pretty easily. Nick On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nick, We can always take built-in rules. Back when we added this Prashant Sharma actually did some great work that lets us write our own style rules in cases where rules don't exist. You can see some existing rules here: https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle Prashant has over time contributed a lot of our custom rules upstream to scalastyle, so now there are only a couple there. 
- Patrick On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at WhitespaceEndOfLineChecker under: http://www.scalastyle.org/rules-0.1.0.html Cheers On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: As discussed here https://github.com/apache/spark/pull/2619, it would be good to extend our Scala style checks to programmatically enforce as many of our style rules as possible. Does anyone know if it's relatively straightforward to enforce additional rules like the no trailing spaces rule mentioned in the linked PR? Nick
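For the one rule under discussion (end-of-line whitespace), an auto-fixer is trivial regardless of language. The sketch below illustrates the transformation such a tool would apply across the existing code base; it is not a real autopep8 counterpart for Scala.

```python
# Minimal "auto-fix" for the trailing-whitespace rule: strip spaces and
# tabs that appear before each line break (and at end of file), leaving
# the line breaks themselves intact.
import re

def strip_trailing_whitespace(source):
    """Remove spaces/tabs before each newline, preserving line breaks."""
    return re.sub(r"[ \t]+(?=\r?\n)|[ \t]+$", "", source)

before = "val x = 1   \nval y = 2\t\n"
after = strip_trailing_whitespace(before)
```

Running a one-off script like this over the whole tree in a single commit is how the "one-time process" Nick describes would typically be done, minimizing the window for merge conflicts.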
thank you for reviewing our patches
I recently came across this mailing list post by Linus Torvalds https://lkml.org/lkml/2004/12/20/255 about the value of reviewing even “trivial” patches. The following passages stood out to me: I think that much more important than the patch is the fact that people get used to the notion that they can change the kernel … So please don’t stop. Yes, those trivial patches *are* a bother. Damn, they are *horrible*. But at the same time, the devil is in the detail, and they are needed in the long run. Both the patches themselves, and the people that grew up on them. Spark is the first (and currently only) open source project I contribute regularly to. My first several PRs against the project, as simple as they were, were definitely patches that I “grew up on”. I appreciate the time and effort all the reviewers I’ve interacted with have taken to work with me on my PRs, even when they are “trivial”. And I’m sure that as I continue to contribute to this project there will be many more patches that I will “grow up on”. Thank you Patrick, Reynold, Josh, Davies, Michael, and everyone else who’s taken time to review one of my patches. I appreciate it! Nick
Re: Spark SQL use of alias in where clause
That is correct. Aliases in the SELECT clause can only be referenced in the ORDER BY and HAVING clauses. Otherwise, you'll have to just repeat the statement, like concat() in this case. A more elegant alternative, which is probably not available in Spark SQL yet, is to use Common Table Expressions http://technet.microsoft.com/en-us/library/ms190766(v=sql.105).aspx. On Wed, Sep 24, 2014 at 11:32 PM, Yanbo Liang yanboha...@gmail.com wrote: Maybe it's the way SQL works. The SELECT part is evaluated after the WHERE filter is applied, so you cannot use an alias declared in the SELECT part in the WHERE clause. Hive and Oracle behave the same as Spark SQL. 2014-09-25 8:58 GMT+08:00 Du Li l...@yahoo-inc.com.invalid: Hi, The following query does not work in Shark nor in the new Spark SQLContext or HiveContext. SELECT key, value, concat(key, value) as combined from src where combined like ’11%’; The following tweak of syntax works fine although a bit ugly. SELECT key, value, concat(key, value) as combined from src where concat(key,value) like ’11%’ order by combined; Are you going to support alias in where clause soon? Thanks, Du
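For context, the Common Table Expression rewrite mentioned above would look roughly like this in standard SQL. This is a syntax sketch only; as noted in the thread, Spark SQL likely did not support WITH at this point, and engine support varies.

```sql
-- Define `combined` once in a CTE; by the time WHERE runs, it is a real
-- column of the derived table, so it can be filtered on by name.
WITH src_combined AS (
  SELECT key, value, concat(key, value) AS combined
  FROM src
)
SELECT key, value, combined
FROM src_combined
WHERE combined LIKE '11%'
ORDER BY combined;
```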
Re: Tests and Test Infrastructure
I fully support this. A smoothly running test infrastructure helps everybody’s work just flow better. The Jenkins Pull Request Builder is mostly functioning again. However, we are working on a simpler technical pipeline for testing patches, as this plug-in has been a constant source of downtime and issues for us, and is very hard to debug. Yep. One such issue that happens too often is that Jenkins simply fails to fetch from git. Hopefully a new pipeline will be able to fetch more reliably. flaky tests Dunno if these were some of the ones recently fixed, but the flakiest tests seem to be the Kafka and Flume tests in Spark Streaming, based purely on my subjective experience. It would be great if we could stabilize them! Time of tests PSA: Here are some related JIRA issues for those interested in working on our testing setup: - SPARK-3431: Parallelize execution of tests https://issues.apache.org/jira/browse/SPARK-3431 - SPARK-3432: Fix logging of unit test execution time https://issues.apache.org/jira/browse/SPARK-3432 Nick On Sun, Sep 14, 2014 at 2:20 AM, Josh Rosen rosenvi...@gmail.com wrote: Also, huge thanks to Cheng Lian, who tracked down and fixed the final issue that was causing the Maven master build’s Spark SQL tests to fail! On September 13, 2014 at 11:08:00 PM, Patrick Wendell (pwend...@gmail.com) wrote: Hey All, Wanted to send a quick update about test infrastructure. With the number of contributors we have and the rate of development, maintaining a well-oiled test infra is really important. Every time a flaky test fails a legitimate pull request, it wastes developer time and effort. 1. Master build: Spark's master builds are back to green again in Maven and SBT after a long time of instability. Big thanks to Josh Rosen, Andrew Or, Nick Chammas, Shane Knapp, Sean Owen, and many others who were involved in pinpointing and fixing fairly convoluted test failure issues. 2. Jenkins PRB: The Jenkins Pull Request Builder is mostly functioning again. 
However, we are working on a simpler technical pipeline for testing patches, as this plug-in has been a constant source of downtime and issues for us, and is very hard to debug. 3. Reverting flaky patches: Going forward - we may revert patches that seem to be the root cause of flaky or failing tests. This is necessary as these days, the test infra being down will block something like 10-30 in-flight patches on a given day. This puts the onus back on the test writer to try and figure out what's going on - we'll of course help debug the issue! 4. Time of tests: With hundreds (thousands?) of tests, we will have a very high bar for tests which take several seconds or longer. Things like Thread.sleep() bloat test time when proper synchronization mechanisms should be used. Expect reviewers to push back on any long-running tests, in many cases they can be re-written to be both shorter and better. Thanks again to everyone putting in effort on this, we've made a ton of progress in the last few weeks. A solid test infra will help us scale and move quickly as Spark development continues to accelerate. - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
don't trigger tests when only .md files are changed
Would it make sense to have Jenkins *not* trigger tests when the only files that have changed are .md files (example https://github.com/apache/spark/pull/2367)? Those don’t even need RAT checks, right? I can make this change if it makes sense. Nick
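The decision logic for such a skip is simple. A sketch follows; in Jenkins the changed-file list would presumably come from something like `git diff --name-only` against the target branch (an assumption about the setup), while here it is just a parameter.

```python
# Skip the full test cycle (and the RAT license check) when a pull
# request touches nothing but Markdown documentation files.
def only_docs_changed(changed_files):
    """True if the diff is non-empty and touches only .md files."""
    return bool(changed_files) and all(f.endswith(".md") for f in changed_files)

# A docs-only PR can be skipped; a mixed PR still runs everything.
skip = only_docs_changed(["docs/building-with-maven.md", "README.md"])
run = not only_docs_changed(["docs/index.md", "core/pom.xml"])
```

Treating an empty diff as "run the tests" is a deliberately conservative default in this sketch.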
Re: don't trigger tests when only .md files are changed
We could still have Jenkins post a message to the effect of “this patch only modifies .md files; no tests will be run”. On Fri, Sep 12, 2014 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Would it make sense to have Jenkins *not* trigger tests when the only files that have changed are .md files (example https://github.com/apache/spark/pull/2367)? Those don’t even need RAT checks, right? I can make this change if it makes sense. Nick
Re: Announcing Spark 1.1.0!
Nice work everybody! I'm looking forward to trying out this release! On Thu, Sep 11, 2014 at 8:12 PM, Patrick Wendell pwend...@gmail.com wrote: I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! This release brings operational and performance improvements in Spark core including a new implementation of the Spark shuffle designed for very large scale workloads. Spark 1.1 adds significant extensions to the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a JDBC server, byte code generation for fast expression evaluation, a public types API, JSON support, and other features and optimizations. MLlib introduces a new statistics library along with several new algorithms and optimizations. Spark 1.1 also builds out Spark's Python support and adds new components to the Spark Streaming module. Visit the release notes [1] to read about the new features, or download [2] the release today. [1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html [2] http://spark.eu.apache.org/downloads.html NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL HOURS. Please e-mail me directly for any typos in the release notes or name listing. Thanks, and congratulations! - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)
I'm looking forward to this. :) Looks like Jenkins is having trouble triggering builds for new commits or after user requests (e.g. https://github.com/apache/spark/pull/2339#issuecomment-55165937). Hopefully that will be resolved tomorrow. Nick On Tue, Sep 9, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu wrote: since the power incident last thursday, the github pull request builder plugin is still not really working 100%. i found an open issue w/jenkins[1] that could definitely be affecting us, i will be pausing builds early thursday morning and then restarting jenkins. i'll send out a reminder tomorrow, and if this causes any problems for you, please let me know and we can work out a better time. but, now for some good news! yesterday morning, we racked and stacked the systems for the new jenkins instance in the berkeley datacenter. tomorrow i should be able to log in to them and start getting them set up and configured. this is a major step in getting us in to a much more 'production' style environment! anyways: thanks for your patience, and i think we've all learned that hard powering down your build system is a definite recipe for disaster. :) shane [1] -- https://issues.jenkins-ci.org/browse/JENKINS-22509
Re: jenkins failed all tests?
Yeah, it feels like Jenkins has become a lot more flaky recently. Or maybe it’s just our tests. Here are some more examples: - https://github.com/apache/spark/pull/2310#issuecomment-54741169 - https://github.com/apache/spark/pull/2313#issuecomment-54752766 Nick On Sun, Sep 7, 2014 at 4:54 PM, Nan Zhu zhunanmcg...@gmail.com wrote: It seems that I’m not the only one https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/ Best, -- Nan Zhu On Sunday, September 7, 2014 at 4:52 PM, Nan Zhu wrote: Hi, Sean, Thanks for the reply Here are the updated files: https://github.com/apache/spark/pull/2312/files just two md files... Best, -- Nan Zhu On Sunday, September 7, 2014 at 4:30 PM, Sean Owen wrote: It would help to point to your change. Are you sure it was only docs and are you sure you're rebased, submitting against the right branch? Jenkins is saying you are changing public APIs; it's not reporting test failures. But it could well be a test/Jenkins problem. On Sun, Sep 7, 2014 at 8:39 PM, Nan Zhu zhunanmcg...@gmail.com (mailto:zhunanmcg...@gmail.com) wrote: Hi, all I just modified some document, but still failed to pass tests? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19950/consoleFull Anyone can look at the problem? Best, -- Nan Zhu
trimming unnecessary test output
Continuing the discussion started here https://github.com/apache/spark/pull/2279, I’m wondering if people already know that certain test output is unnecessary and should be trimmed. For example https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19917/consoleFull, I see a bunch of lines like this: 14/09/06 07:54:13 INFO GenerateProjection: Code generated expression List(IS NOT NULL 1) in 128.33733 ms Can/should this type of output be suppressed? Is there any other test output that is obviously more noise than signal? Nick
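If this is ordinary log4j output, the usual way to suppress it is to raise that logger's level in the test log4j.properties rather than touching the tests. A sketch, assuming GenerateProjection lives under the catalyst codegen package; the exact package path is a guess and would need checking against the class:

```properties
# Quiet the codegen logger that emits "Code generated expression ..."
# lines during SQL tests. Package path below is an assumption; adjust
# to GenerateProjection's actual location.
log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen=WARN
```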
Scala's Jenkins setup looks neat
After reading Erik's email, I found this Scala PR https://github.com/scala/scala/pull/3963 and immediately noticed a few cool things: - Jenkins is hooked directly into GitHub somehow, so you get the All is well message in the merge status window, presumably based on the last test status - Jenkins is also tagging the PR based on its test status or need for review - Jenkins is also tagging the PR for a specific milestone Do any of these things make sense to add to our setup? Or perhaps something inspired by these features? Nick
Re: Scala's Jenkins setup looks neat
Aww, that's a bummer... On Sat, Sep 6, 2014 at 1:10 PM, Reynold Xin r...@databricks.com wrote: that would require github hooks permission and unfortunately asf infra wouldn't allow that. Maybe they will change their mind one day, but so far we asked about this and the answer has been no for security reasons. On Saturday, September 6, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote: After reading Erik's email, I found this Scala PR https://github.com/scala/scala/pull/3963 and immediately noticed a few cool things: - Jenkins is hooked directly into GitHub somehow, so you get the All is well message in the merge status window, presumably based on the last test status - Jenkins is also tagging the PR based on its test status or need for review - Jenkins is also tagging the PR for a specific milestone Do any of these things make sense to add to our setup? Or perhaps something inspired by these features? Nick
Re: amplab jenkins is down
Hmm, looks like at least some builds https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19804/consoleFull are working now, though this last one was from ~5 hours ago. On Fri, Sep 5, 2014 at 1:02 AM, shane knapp skn...@berkeley.edu wrote: yep. that's exactly the behavior i saw earlier, and will be figuring out first thing tomorrow morning. i bet it's an environment issue on the slaves. On Thu, Sep 4, 2014 at 7:10 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Looks like during the last build https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19797/console Jenkins was unable to execute a git fetch? On Thu, Sep 4, 2014 at 7:58 PM, shane knapp skn...@berkeley.edu wrote: i'm going to restart jenkins and see if that fixes things. On Thu, Sep 4, 2014 at 4:56 PM, shane knapp skn...@berkeley.edu wrote: looking On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: It appears that our main man is having trouble https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/ hearing new requests https://github.com/apache/spark/pull/2277#issuecomment-54549106. Do we need some smelling salts? On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu wrote: i'd ping the Jenkinsmench... the master was completely offline, so any new jobs wouldn't have reached it. any jobs that were queued when power was lost probably started up, but jobs that were running would fail. On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Woohoo! Thanks Shane. Do you know if queued PR builds will automatically be picked up? Or do we have to ping the Jenkinmensch manually from each PR? Nick On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu wrote: AND WE'RE UP! sorry that this took so long... i'll send out a more detailed explanation of what happened soon. now, off to back up jenkins. 
shane On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu wrote: it's a faulty power switch on the firewall, which has been swapped out. we're about to reboot and be good to go. On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu wrote: looks like some hardware failed, and we're swapping in a replacement. i don't have more specific information yet -- including *what* failed, as our sysadmin is super busy ATM. the root cause was an incorrect circuit being switched off during building maintenance. on a side note, this incident will be accelerating our plan to move the entire jenkins infrastructure in to a managed datacenter environment. this will be our major push over the next couple of weeks. more details about this, also, as soon as i get them. i'm very sorry about the downtime, we'll get everything up and running ASAP. On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu wrote: looks like a power outage in soda hall. more updates as they happen. On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu wrote: i am trying to get things up and running, but it looks like either the firewall gateway or jenkins server itself is down. i'll update as soon as i know more. -- You received this message because you are subscribed to the Google Groups amp-infra group. To unsubscribe from this group and stop receiving emails from it, send an email to amp-infra+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: amplab jenkins is down
How's it going? It looks like during the last build https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/lastBuild/console from about 30 min ago Jenkins was still having trouble fetching from GitHub. It also looks like not all requests for testing are triggering builds. On Fri, Sep 5, 2014 at 1:23 PM, shane knapp skn...@berkeley.edu wrote: it's looking like everything except the pull request builders are working. i'm going to be working on getting this resolved today. On Fri, Sep 5, 2014 at 8:18 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Hmm, looks like at least some builds https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19804/consoleFull are working now, though this last one was from ~5 hours ago. On Fri, Sep 5, 2014 at 1:02 AM, shane knapp skn...@berkeley.edu wrote: yep. that's exactly the behavior i saw earlier, and will be figuring out first thing tomorrow morning. i bet it's an environment issue on the slaves. On Thu, Sep 4, 2014 at 7:10 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Looks like during the last build https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19797/console Jenkins was unable to execute a git fetch? On Thu, Sep 4, 2014 at 7:58 PM, shane knapp skn...@berkeley.edu wrote: i'm going to restart jenkins and see if that fixes things. On Thu, Sep 4, 2014 at 4:56 PM, shane knapp skn...@berkeley.edu wrote: looking On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: It appears that our main man is having trouble https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/ hearing new requests https://github.com/apache/spark/pull/2277#issuecomment-54549106. Do we need some smelling salts? On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu wrote: i'd ping the Jenkinsmench... 
the master was completely offline, so any new jobs wouldn't have reached it. any jobs that were queued when power was lost probably started up, but jobs that were running would fail. On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Woohoo! Thanks Shane. Do you know if queued PR builds will automatically be picked up? Or do we have to ping the Jenkinmensch manually from each PR? Nick On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu wrote: AND WE'RE UP! sorry that this took so long... i'll send out a more detailed explanation of what happened soon. now, off to back up jenkins. shane On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu wrote: it's a faulty power switch on the firewall, which has been swapped out. we're about to reboot and be good to go. On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu wrote: looks like some hardware failed, and we're swapping in a replacement. i don't have more specific information yet -- including *what* failed, as our sysadmin is super busy ATM. the root cause was an incorrect circuit being switched off during building maintenance. on a side note, this incident will be accelerating our plan to move the entire jenkins infrastructure in to a managed datacenter environment. this will be our major push over the next couple of weeks. more details about this, also, as soon as i get them. i'm very sorry about the downtime, we'll get everything up and running ASAP. On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu wrote: looks like a power outage in soda hall. more updates as they happen. On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu wrote: i am trying to get things up and running, but it looks like either the firewall gateway or jenkins server itself is down. i'll update as soon as i know more. -- You received this message because you are subscribed to the Google Groups amp-infra group. 
To unsubscribe from this group and stop receiving emails from it, send an email to amp-infra+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: [VOTE] Release Apache Spark 1.1.0 (RC4)
On Thu, Sep 4, 2014 at 1:50 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: There is a regression when using pyspark to read data from HDFS. Could you open a JIRA http://issues.apache.org/jira/ with a brief repro? We'll look into it. (You could also provide a repro in a separate thread.) Nick
Re: amplab jenkins is down
Woohoo! Thanks Shane. Do you know if queued PR builds will automatically be picked up? Or do we have to ping the Jenkinmensch manually from each PR? Nick On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu wrote: AND WE'RE UP! sorry that this took so long... i'll send out a more detailed explanation of what happened soon. now, off to back up jenkins. shane On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu wrote: it's a faulty power switch on the firewall, which has been swapped out. we're about to reboot and be good to go. On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu wrote: looks like some hardware failed, and we're swapping in a replacement. i don't have more specific information yet -- including *what* failed, as our sysadmin is super busy ATM. the root cause was an incorrect circuit being switched off during building maintenance. on a side note, this incident will be accelerating our plan to move the entire jenkins infrastructure in to a managed datacenter environment. this will be our major push over the next couple of weeks. more details about this, also, as soon as i get them. i'm very sorry about the downtime, we'll get everything up and running ASAP. On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu wrote: looks like a power outage in soda hall. more updates as they happen. On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu wrote: i am trying to get things up and running, but it looks like either the firewall gateway or jenkins server itself is down. i'll update as soon as i know more.
Re: amplab jenkins is down
It appears that our main man is having trouble https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/ hearing new requests https://github.com/apache/spark/pull/2277#issuecomment-54549106. Do we need some smelling salts? On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu wrote: i'd ping the Jenkinsmench... the master was completely offline, so any new jobs wouldn't have reached it. any jobs that were queued when power was lost probably started up, but jobs that were running would fail. On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Woohoo! Thanks Shane. Do you know if queued PR builds will automatically be picked up? Or do we have to ping the Jenkinmensch manually from each PR? Nick On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu wrote: AND WE'RE UP! sorry that this took so long... i'll send out a more detailed explanation of what happened soon. now, off to back up jenkins. shane On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu wrote: it's a faulty power switch on the firewall, which has been swapped out. we're about to reboot and be good to go. On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu wrote: looks like some hardware failed, and we're swapping in a replacement. i don't have more specific information yet -- including *what* failed, as our sysadmin is super busy ATM. the root cause was an incorrect circuit being switched off during building maintenance. on a side note, this incident will be accelerating our plan to move the entire jenkins infrastructure in to a managed datacenter environment. this will be our major push over the next couple of weeks. more details about this, also, as soon as i get them. i'm very sorry about the downtime, we'll get everything up and running ASAP. On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu wrote: looks like a power outage in soda hall. more updates as they happen. 
On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu wrote: i am trying to get things up and running, but it looks like either the firewall gateway or jenkins server itself is down. i'll update as soon as i know more.
Re: amplab jenkins is down
Looks like during the last build https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19797/console Jenkins was unable to execute a git fetch? On Thu, Sep 4, 2014 at 7:58 PM, shane knapp skn...@berkeley.edu wrote: i'm going to restart jenkins and see if that fixes things. On Thu, Sep 4, 2014 at 4:56 PM, shane knapp skn...@berkeley.edu wrote: looking On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: It appears that our main man is having trouble https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/ hearing new requests https://github.com/apache/spark/pull/2277#issuecomment-54549106. Do we need some smelling salts? On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu wrote: i'd ping the Jenkinsmench... the master was completely offline, so any new jobs wouldn't have reached it. any jobs that were queued when power was lost probably started up, but jobs that were running would fail. On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Woohoo! Thanks Shane. Do you know if queued PR builds will automatically be picked up? Or do we have to ping the Jenkinmensch manually from each PR? Nick On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu wrote: AND WE'RE UP! sorry that this took so long... i'll send out a more detailed explanation of what happened soon. now, off to back up jenkins. shane On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu wrote: it's a faulty power switch on the firewall, which has been swapped out. we're about to reboot and be good to go. On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu wrote: looks like some hardware failed, and we're swapping in a replacement. i don't have more specific information yet -- including *what* failed, as our sysadmin is super busy ATM. the root cause was an incorrect circuit being switched off during building maintenance. 
on a side note, this incident will be accelerating our plan to move the entire jenkins infrastructure in to a managed datacenter environment. this will be our major push over the next couple of weeks. more details about this, also, as soon as i get them. i'm very sorry about the downtime, we'll get everything up and running ASAP. On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu wrote: looks like a power outage in soda hall. more updates as they happen. On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu wrote: i am trying to get things up and running, but it looks like either the firewall gateway or jenkins server itself is down. i'll update as soon as i know more.
Re: [VOTE] Release Apache Spark 1.1.0 (RC4)
On Wed, Sep 3, 2014 at 3:24 AM, Patrick Wendell pwend...@gmail.com wrote: == What default changes should I be aware of? == 1. The default value of spark.io.compression.codec is now snappy -- Old behavior can be restored by switching to lzf 2. PySpark now performs external spilling during aggregations. -- Old behavior can be restored by setting spark.shuffle.spill to false. 3. PySpark uses a new heuristic for determining the parallelism of shuffle operations. -- Old behavior can be restored by setting spark.default.parallelism to the number of cores in the cluster. Will these changes be called out in the release notes or somewhere in the docs? That last one (which I believe is what we discovered as the result of SPARK- https://issues.apache.org/jira/browse/SPARK-) could have a large impact on PySpark users. Nick
spark-ec2 depends on stuff in the Mesos repo
Spawned by this discussion https://github.com/apache/spark/pull/1120#issuecomment-54305831. See these 2 lines in spark_ec2.py: - spark_ec2 L42 https://github.com/apache/spark/blob/6a72a36940311fcb3429bd34c8818bc7d513115c/ec2/spark_ec2.py#L42 - spark_ec2 L566 https://github.com/apache/spark/blob/6a72a36940311fcb3429bd34c8818bc7d513115c/ec2/spark_ec2.py#L566 Why does the spark-ec2 script depend on stuff in the Mesos repo? Should they be moved to the Spark repo? Nick
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Hi Shane! Thank you for doing the Jenkins upgrade last week. It's nice to know that infrastructure is gonna get some dedicated TLC going forward. Welcome aboard! Nick On Tue, Sep 2, 2014 at 1:35 PM, shane knapp skn...@berkeley.edu wrote: so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :) hi! i'm shane knapp, the new AMPLab devops engineer, and will be spending time getting the jenkins build infrastructure up to production quality. much of this will be 'under the covers' work, like better system level auth, backups, etc, but some will definitely be user facing: timely jenkins updates, debugging broken build infrastructure and some plugin support. i've been working in the bay area now since 1997 at many different companies, and my last 10 years has been split between google and palantir. i'm a huge proponent of OSS, and am really happy to be able to help with the work you guys are doing! if anyone has any requests/questions/comments, feel free to drop me a line! shane
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
In light of the discussion on SPARK-, I'll revoke my -1 vote. The issue does not appear to be serious. On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: -1: I believe I've found a regression from 1.0.2. The report is captured in SPARK- https://issues.apache.org/jira/browse/SPARK-. On Sat, Aug 30, 2014 at 6:07 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc3 (commit b2d0493b): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1030/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.1.0! The vote is open until Tuesday, September 02, at 23:07 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == Regressions fixed since RC1 == - Build issue for SQL support: https://issues.apache.org/jira/browse/SPARK-3234 - EC2 script version bump to 1.1.0. == What justifies a -1 vote for this release? == This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release. == What default changes should I be aware of? == 1. 
The default value of spark.io.compression.codec is now snappy -- Old behavior can be restored by switching to lzf 2. PySpark now performs external spilling during aggregations. -- Old behavior can be restored by setting spark.shuffle.spill to false. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Run the Big Data Benchmark for new releases
What do people think of running the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ (repo https://github.com/amplab/benchmark) as part of preparing every new release of Spark? We'd run it just for Spark and effectively use it as another type of test to track any performance progress or regressions from release to release. Would doing such a thing be valuable? Do we already have a way of benchmarking Spark performance that we use regularly? Nick
Re: Run the Big Data Benchmark for new releases
Oh, that's sweet. So, a related question then. Did those tests pick up the performance issue reported in SPARK- https://issues.apache.org/jira/browse/SPARK-? Does it make sense to add a new test to cover that case? On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Nicholas, At Databricks we already run https://github.com/databricks/spark-perf for each release, which is a more comprehensive performance test suite. Matei On September 1, 2014 at 8:22:05 PM, Nicholas Chammas ( nicholas.cham...@gmail.com) wrote: What do people think of running the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ (repo https://github.com/amplab/benchmark) as part of preparing every new release of Spark? We'd run it just for Spark and effectively use it as another type of test to track any performance progress or regressions from release to release. Would doing such a thing be valuable? Do we already have a way of benchmarking Spark performance that we use regularly? Nick
Re: Run the Big Data Benchmark for new releases
Alright, sounds good! I've created databricks/spark-perf/issues/9 https://github.com/databricks/spark-perf/issues/9 as a reminder for us to add a new test once we've root caused SPARK-. On Tue, Sep 2, 2014 at 1:07 AM, Patrick Wendell pwend...@gmail.com wrote: Yeah, this wasn't detected in our performance tests. We even have a test in PySpark that I would have though might catch this (it just schedules a bunch of really small tasks, similar to the regression case). https://github.com/databricks/spark-perf/blob/master/pyspark-tests/tests.py#L51 Anyways, Josh is trying to repro the regression to see if we can figure out what is going on. If we find something for sure we should add a test. On Mon, Sep 1, 2014 at 10:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Nope, actually, they didn't find that (they found some other things that were fixed, as well as some improvements). Feel free to send a PR, but it would be good to profile the issue first to understand what slowed down. (For example is the map phase taking longer or is it the reduce phase, is there some difference in lengths of specific tasks, etc). Matei On September 1, 2014 at 10:03:20 PM, Nicholas Chammas ( nicholas.cham...@gmail.com) wrote: Oh, that's sweet. So, a related question then. Did those tests pick up the performance issue reported in SPARK-? Does it make sense to add a new test to cover that case? On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Nicholas, At Databricks we already run https://github.com/databricks/spark-perf for each release, which is a more comprehensive performance test suite. Matei On September 1, 2014 at 8:22:05 PM, Nicholas Chammas ( nicholas.cham...@gmail.com) wrote: What do people think of running the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ (repo https://github.com/amplab/benchmark) as part of preparing every new release of Spark? 
We'd run it just for Spark and effectively use it as another type of test to track any performance progress or regressions from release to release. Would doing such a thing be valuable? Do we already have a way of benchmarking Spark performance that we use regularly? Nick
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
-1: I believe I've found a regression from 1.0.2. The report is captured in SPARK- https://issues.apache.org/jira/browse/SPARK-. On Sat, Aug 30, 2014 at 6:07 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be voted on is v1.1.0-rc3 (commit b2d0493b): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1030/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.1.0! The vote is open until Tuesday, September 02, at 23:07 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.1.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == Regressions fixed since RC1 == - Build issue for SQL support: https://issues.apache.org/jira/browse/SPARK-3234 - EC2 script version bump to 1.1.0. == What justifies a -1 vote for this release? == This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release. == What default changes should I be aware of? == 1. The default value of spark.io.compression.codec is now snappy -- Old behavior can be restored by switching to lzf 2. PySpark now performs external spilling during aggregations. -- Old behavior can be restored by setting spark.shuffle.spill to false. 
- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
On Sun, Aug 31, 2014 at 6:38 PM, chutium teng@gmail.com wrote: has anyone tried to build it on hadoop.version=2.0.0-mr1-cdh4.3.0 or hadoop.version=1.0.3-mapr-3.0.3 ? Is the behavior you're seeing a regression from 1.0.2, or does 1.0.2 have this same problem? Nick
Re: Handling stale PRs
On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote: it's actually procedurally difficult for us to close pull requests Just an FYI: Seems like the GitHub-sanctioned work-around to having issues-only permissions is to have a second, issues-only repository https://help.github.com/articles/issues-only-access-permissions. Not a very attractive work-around... Nick
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
There were several formatting and typographical errors in the SQL docs that I've fixed in this PR https://github.com/apache/spark/pull/2201. Dunno if we want to roll that into the release. On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote: Okay I'll plan to add cdh4 binary as well for the final release! --- sent from my phone On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote: We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. 
The vendor already manages their build, I think, and -- I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual.
FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0 release removes the CDH4-specific build, but adds two MapR-specific builds. Compare with https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I commented on the commit: https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc I'm in favor of removing all vendor-specific builds. This change *looks* a bit funny as there was no JIRA (?) and appears to swap one vendor for another. Of course there's nothing untoward going on, but what was the reasoning? It's best avoided, and MapR already distributes Spark just fine, no? This is a gray area with ASF projects. I mention it as well because it came up with Apache Flink recently
Re: [VOTE] Release Apache Spark 1.1.0 (RC2)
[Let me know if I should be posting these comments in a different thread.] Should the default Spark version in spark-ec2 https://github.com/apache/spark/blob/e1535ad3c6f7400f2b7915ea91da9c60510557ba/ec2/spark_ec2.py#L86 be updated for this release? Nick On Fri, Aug 29, 2014 at 12:55 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nicholas, Thanks for this, we can merge in doc changes outside of the actual release timeline, so we'll make sure to loop those changes in before we publish the final 1.1 docs. - Patrick On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: There were several formatting and typographical errors in the SQL docs that I've fixed in this PR. Dunno if we want to roll that into the release. On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote: Okay I'll plan to add cdh4 binary as well for the final release! --- sent from my phone On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote: We just used CDH 4.7 for our production cluster. And I believe we won't use CDH 5 in the next year. Sent from my iPhone On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com wrote: Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote: (Copying my reply since I don't know if it goes to the mailing list) Great, thanks for explaining the reasoning. You're saying these aren't going into the final release? I think that moots any issue surrounding distributing them then. This is all I know of from the ASF: https://community.apache.org/projectIndependence.html I don't read it as expressly forbidding this kind of thing although you can see how it bumps up against the spirit. 
There's not a bright line -- what about Tomcat providing binaries compiled for Windows for example? does that favor an OS vendor? From this technical ASF perspective only the releases matter -- do what you want with snapshots and RCs. The only issue there is maybe releasing something different than was in the RC; is that at all confusing? Just needs a note. I think this theoretical issue doesn't exist if these binaries aren't released, so I see no reason to not proceed. The rest is a different question about whether you want to spend time maintaining this profile and candidate. The vendor already manages their build I think and -- and I don't know -- may even prefer not to have a different special build floating around. There's also the theoretical argument that this turns off other vendors from adopting Spark if it's perceived to be too connected to other vendors. I'd like to maximize Spark's distribution and there's some argument you do this by not making vendor profiles. But as I say a different question to just think about over time... (oh and PS for my part I think it's a good thing that CDH4 binaries were removed. I wasn't arguing for resurrecting them) On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The reason there are no longer CDH-specific builds is that all newer versions of CDH and HDP work with builds for the upstream Hadoop projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and the Hadoop-without-Hive (also 2.4) build. For MapR - we can't officially post those artifacts on ASF web space when we make the final release, we can only link to them as being hosted by MapR specifically since they use non-compatible licenses. However, I felt that providing these during a testing period was alright, with the goal of increasing test coverage. I couldn't find any policy against posting these on personal web space during RC voting. However, we can remove them if there is one. 
Dropping CDH4 was more because it is now pretty old, but we can add it back if people want. The binary packaging is a slightly separate question from release votes, so I can always add more binary packages whenever. And on this, my main concern is covering the most popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I
Re: Handling stale PRs
On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote: Last weekend, I started hacking on a Google App Engine app for helping with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png). BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA issue, perhaps? Nick
Re: Handling stale PRs
Alright! That was quick. :) On Wed, Aug 27, 2014 at 6:48 PM, Josh Rosen rosenvi...@gmail.com wrote: I have a very simple dashboard running at http://spark-prs.appspot.com/. Currently, this mirrors the functionality of Patrick’s github-shim, but it should be very easy to extend with other features. The source is at https://github.com/databricks/spark-pr-dashboard (pull requests and issues welcome!) On August 27, 2014 at 2:11:41 PM, Nicholas Chammas ( nicholas.cham...@gmail.com) wrote: On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote: Last weekend, I started hacking on a Google App Engine app for helping with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png). BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA issue, perhaps? Nick
Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT
Looks like we're currently at 1.568 so we should be getting a nice slew of UI tweaks and bug fixes. Neat! On Wed, Aug 27, 2014 at 7:13 PM, shane knapp skn...@berkeley.edu wrote: tomorrow morning i will be upgrading jenkins to the latest/greatest (1.577). at 730am, i will put jenkins in to a quiet period, so no new builds will be accepted. once any running builds are finished, i will be taking jenkins down for the upgrade. depending on what and how many jobs are running, i'm expecting this to take, at most, an hour. i'll send out an update tomorrow morning right before i begin, and will send out updates and an all-clear once we're up and running again. 1.577 release notes: http://jenkins-ci.org/changelog please let me know if there are any questions/concerns. thanks in advance! shane
Re: Handling stale PRs
On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote: I'd prefer if we took the approach of politely explaining why in the current form the patch isn't acceptable and closing it (potentially w/ tips on how to improve it or narrow the scope). Amen to this. Aiming for such a culture would set Spark apart from other projects in a great way. I've proposed several different solutions to ASF infra to streamline the process, but thus far they haven't been open to any of my ideas: I've added myself as a watcher on those 2 INFRA issues. Sucks that the only solution on offer right now requires basically polluting the commit history. Short of moving Spark's repo to a non-ASF-managed GitHub account, do you think another bot could help us manage the number of stale PRs? I'm thinking a solution as follows might be very helpful: - Extend Spark QA / Jenkins to run on a weekly schedule and check for stale PRs. Let's say a stale PR is an open one that hasn't been updated in N months. - Spark QA maintains a list of known committers on its side. - During its weekly check of stale PRs, Spark QA takes the following action: - If the last person to comment on a PR was a committer, post to the PR asking for an update from the contributor. - If the last person to comment on a PR was a contributor, add the PR to a list. Email this list of *hanging PRs* out to the dev list on a weekly basis and ask committers to update them. - If the last person to comment on a PR was Spark QA asking the contributor to update it, then add the PR to a list. Email this list of *abandoned PRs* to the dev list for the record (or for closing, if that becomes possible in the future). This doesn't solve the problem of not being able to close PRs, but it does help make sure no PR is left hanging for long. What do you think? I'd be interested in implementing this solution if we like it. Nick
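[Editorial note: the triage rules in Nick's proposal can be sketched in a few lines. This is a hypothetical illustration only — the PR representation, the 3-month staleness window ("N months"), and the SparkQA bot login are assumptions, not an actual implementation.]

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=90)  # "N months" from the proposal; 3 months assumed


def triage(prs, committers, bot_login="SparkQA", now=None):
    """Classify stale PRs per the proposal: ping the contributor if a
    committer spoke last, flag as "hanging" if a contributor spoke last,
    and flag as "abandoned" if the bot itself already asked for an update.

    Each PR is a dict with 'number', 'updated_at' (datetime), and
    'last_commenter' (a GitHub login). Returns lists of PR numbers
    keyed by the action to take.
    """
    now = now or datetime.utcnow()
    result = {"ping_contributor": [], "hanging": [], "abandoned": []}
    for pr in prs:
        if now - pr["updated_at"] < STALE_AFTER:
            continue  # recently updated; not stale
        last = pr["last_commenter"]
        if last == bot_login:
            result["abandoned"].append(pr["number"])        # bot already pinged, no reply
        elif last in committers:
            result["ping_contributor"].append(pr["number"])  # ball is in contributor's court
        else:
            result["hanging"].append(pr["number"])           # awaiting a committer
    return result
```

The "hanging" and "abandoned" lists would then be emailed to the dev list on the weekly schedule described above.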
Re: Handling stale PRs
OK, that sounds pretty cool. Josh, Do you see this app as encompassing or supplanting the functionality I described as well? Nick

On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote: Last weekend, I started hacking on a Google App Engine app for helping with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png). Some of my basic goals (not all implemented yet):
- Users sign in using GitHub and can browse a list of pull requests, including links to associated JIRAs, Jenkins statuses, a quick preview of the last comment, etc.
- Pull requests are auto-classified based on which components they modify (by looking at the diff).
- From the app’s own internal database of PRs, we can build dashboards to find “abandoned” PRs, graph average time to first review, etc.
- Since we authenticate users with GitHub, we can enable administrative functions via this dashboard (e.g. “assign this PR to me”, “vote to close in the weekly auto-close commit”, etc.).

Right now, I’ve implemented GitHub OAuth support and code to update the issues database using the GitHub API. Because we have access to the full API, it’s pretty easy to do fancy things like parsing the reason for Jenkins failure, etc. You could even imagine some fancy mashup tools to pull up JIRAs and pull requests side-by-side in iframes. After I hack on this a bit more, I plan to release a public preview version; if we find this tool useful, I’ll clean it up and open-source the app so folks can contribute to it. - Josh

On August 26, 2014 at 8:16:46 AM, Nicholas Chammas ( nicholas.cham...@gmail.com) wrote: On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote: I'd prefer if we took the approach of politely explaining why in the current form the patch isn't acceptable and closing it (potentially w/ tips on how to improve it or narrow the scope). Amen to this. Aiming for such a culture would set Spark apart from other projects in a great way.
spark-ec2 1.0.2 creates EC2 cluster at wrong version
I downloaded the source code release for 1.0.2 from here http://spark.apache.org/downloads.html and launched an EC2 cluster using spark-ec2. After the cluster finishes launching, I fire up the shell and check the version:

scala> sc.version
res1: String = 1.0.1

The startup banner also shows the same thing. Hmm... So I dig around and find that the spark_ec2.py script has the default Spark version set to 1.0.1. Derp.

parser.add_option("-v", "--spark-version", default="1.0.1",
    help="Version of Spark to use: 'X.Y.Z' or a specific git hash")

Is there any way to fix the release? It’s a minor issue, but could be very confusing. And how can we prevent this from happening again? Nick
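One way to guard against a repeat (a sketch, not what spark_ec2.py actually does): hoist the default into a module-level constant, so release preparation has exactly one line to bump and a release-audit check can assert it matches the release being cut. The constant name and version value here are illustrative:

```python
from optparse import OptionParser  # spark_ec2.py used optparse at the time

# The single line a release script (or reviewer checklist) would need to bump.
DEFAULT_SPARK_VERSION = "1.0.2"

parser = OptionParser()
parser.add_option(
    "-v", "--spark-version", default=DEFAULT_SPARK_VERSION,
    help="Version of Spark to use: 'X.Y.Z' or a specific git hash")

# Empty argv: the option falls back to the module-level default.
opts, args = parser.parse_args([])
print(opts.spark_version)
```

A release-audit script could then import the module and compare `DEFAULT_SPARK_VERSION` against the version in the release tag, failing the build on a mismatch.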
Re: Handling stale PRs
By the way, as a reference point, I just stumbled across the Discourse GitHub project and their list of pull requests https://github.com/discourse/discourse/pulls looks pretty neat. ~2,200 closed PRs, 6 open. Least recently updated PR dates to 8 days ago. Project started ~1.5 years ago. Dunno how many committers Discourse has, but it looks like they've managed their PRs well. I hope we can do as well in this regard as they have. Nick On Tue, Aug 26, 2014 at 2:40 PM, Josh Rosen rosenvi...@gmail.com wrote: Sure; App Engine supports cron and sending emails. We can configure the app with Spark QA’s credentials in order to allow it to post comments on issues, etc. - Josh On August 26, 2014 at 11:38:08 AM, Nicholas Chammas ( nicholas.cham...@gmail.com) wrote: OK, that sounds pretty cool. Josh, Do you see this app as encompassing or supplanting the functionality I described as well? Nick On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote: Last weekend, I started hacking on a Google App Engine app for helping with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png). Some of my basic goals (not all implemented yet): - Users sign in using GitHub and can browse a list of pull requests, including links to associated JIRAs, Jenkins statuses, a quick preview of the last comment, etc. - Pull requests are auto-classified based on which components they modify (by looking at the diff). - From the app’s own internal database of PRs, we can build dashboards to find “abandoned” PRs, graph average time to first review, etc. - Since we authenticate users with GitHub, we can enable administrative functions via this dashboard (e.g. “assign this PR to me”, “vote to close in the weekly auto-close commit”, etc.). Right now, I’ve implemented GitHub OAuth support and code to update the issues database using the GitHub API. Because we have access to the full API, it’s pretty easy to do fancy things like parsing the reason for Jenkins failure, etc.
Re: Pull requests will be automatically linked to JIRA when submitted
FYI: Looks like the Mesos folk also have a bot to do automatic linking, but it appears to have been provided to them somehow by ASF. See this comment as an example: https://issues.apache.org/jira/browse/MESOS-1688?focusedCommentId=14109078&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14109078 Might be a small win to push this work to a bot ASF manages if we can get access to it (and if we have no concerns about depending on another external service). Nick On Mon, Aug 11, 2014 at 4:10 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for looking into this. I think little tools like this are super helpful. Would it hurt to open a request with INFRA to install/configure the JIRA-GitHub plugin while we continue to use the Python script we have? I wouldn't mind opening that JIRA issue with them. Nick On Mon, Aug 11, 2014 at 12:52 PM, Patrick Wendell pwend...@gmail.com wrote: I spent some time on this and I'm not sure either of these is an option, unfortunately. We typically can't use custom JIRA plug-ins because this JIRA is controlled by the ASF and we don't have rights to modify most things about how it works (it's a large shared JIRA instance used by more than 50 projects). It's worth looking into whether they can do something. In general we've tended to avoid going through ASF infra whenever possible, since they are generally overloaded and things move very slowly, even if there are outages. Here is the script we use to do the sync: https://github.com/apache/spark/blob/master/dev/github_jira_sync.py It might be possible to modify this to support post-hoc changes, but we'd need to think about how to do so while minimizing function calls to the ASF JIRA API, which I found to be very slow. - Patrick On Mon, Aug 11, 2014 at 7:51 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: It looks like this script doesn't catch PRs that are opened and *then* have the JIRA issue ID added to the name.
Would it be easy to somehow have the script trigger on PR name changes as well as PR creates? Alternately, is there a reason we can't or don't want to use the plugin mentioned below? (I'm assuming it covers cases like this, but I'm not sure.) Nick On Wed, Jul 23, 2014 at 12:52 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: By the way, it looks like there’s a JIRA plugin that integrates it with GitHub: - https://marketplace.atlassian.com/plugins/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin - https://confluence.atlassian.com/display/BITBUCKET/Linking+Bitbucket+and+GitHub+accounts+to+JIRA It does the automatic linking and shows some additional information https://marketplace-cdn.atlassian.com/files/images/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin/86ff1a21-44fb-4227-aa4f-44c77aec2c97.png that might be nice to have for heavy JIRA users. Nick On Sun, Jul 20, 2014 at 12:50 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah it needs to have SPARK-XXX in the title (this is the format we request already). It just works with a small synchronization script I wrote that we run every five minutes on Jenkins that uses the GitHub and Jenkins APIs: https://github.com/apache/spark/commit/49e472744951d875627d78b0d6e93cd139232929 - Patrick On Sun, Jul 20, 2014 at 8:06 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That's pretty neat. How does it work? Do we just need to put the issue ID (e.g. SPARK-1234) anywhere in the pull request? Nick On Sat, Jul 19, 2014 at 11:10 PM, Patrick Wendell pwend...@gmail.com wrote: Just a small note, today I committed a tool that will automatically mirror pull requests to JIRA issues, so contributors will no longer have to manually post a pull request on the JIRA when they make one. It will create a link on the JIRA and also make a comment to trigger an e-mail to people watching. This should make some things easier, such as avoiding accidental duplicate effort on the same JIRA. - Patrick
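Triggering on title edits mostly comes down to noticing when a rename introduces an issue ID that wasn't there before. A hedged sketch of that check (the function name is made up; the real dev/github_jira_sync.py may approach this differently):

```python
import re

# Same ID convention the sync script keys on: "SPARK-" followed by digits.
SPARK_ISSUE = re.compile(r"\bSPARK-\d+\b")

def newly_referenced_issues(old_title, new_title):
    """Issue IDs present after a rename that weren't in the old title --
    exactly the PRs the current create-time-only sync misses."""
    return set(SPARK_ISSUE.findall(new_title)) - set(SPARK_ISSUE.findall(old_title))
```

On each periodic run the script would compare the title it last saw against the current one and link any IDs this function returns.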
Handling stale PRs
Check this out: https://github.com/apache/spark/pulls?q=is%3Aopen+is%3Apr+sort%3Aupdated-asc We're hitting close to 300 open PRs. Those are the least recently updated ones. I think having a low number of stale (i.e. not recently updated) PRs is a good thing to shoot for. It doesn't leave contributors hanging (which feels bad for contributors), and reduces project clutter (which feels bad for maintainers/committers). What is our approach to tackling this problem? I think communicating and enforcing a clear policy on how stale PRs are handled might be a good way to reduce the number of stale PRs we have without making contributors feel rejected. I don't know what such a policy would look like, but it should be enforceable and lightweight--i.e. it shouldn't feel like a hammer used to reject people's work, but rather a necessary tool to keep the project's contributions relevant and manageable. Nick
Re: Spark Contribution
That sounds like a good idea. Continuing along those lines, what do people think of moving the contributing page entirely from the wiki to GitHub? It feels like the right place for it since GitHub is where we take contributions, and it also lets people make improvements to it. Nick On Saturday, August 23, 2014, Sean Owen so...@cloudera.com wrote: Can I ask a related question, since I have a PR open to touch up README.md as we speak (SPARK-3069)? If this text is in a file called CONTRIBUTING.md, then it will cause a link to appear on the pull request screen, inviting people to review the contribution guidelines: https://github.com/blog/1184-contributing-guidelines This is mildly important as the project wants to make it clear that you agree that your contribution is licensed under the AL2, since there is no formal ICLA. How about I propose moving the text to CONTRIBUTING.md with a pointer in README.md? Or keep it in both places? On Sat, Aug 23, 2014 at 1:08 AM, Reynold Xin r...@databricks.com wrote: Great idea. Added the link https://github.com/apache/spark/blob/master/README.md On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: We should add this link to the readme on GitHub btw. On Thursday, August 21, 2014, Henry Saputra henry.sapu...@gmail.com wrote: The Apache Spark wiki on how to contribute should be a great place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark - Henry On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote: Hi, Can someone help me with some links on how to contribute to Spark? Regards, mns - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Spark Contribution
We should add this link to the readme on GitHub btw. On Thursday, August 21, 2014, Henry Saputra henry.sapu...@gmail.com wrote: The Apache Spark wiki on how to contribute should be a great place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark - Henry On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote: Hi, Can someone help me with some links on how to contribute to Spark? Regards, mns
Re: -1s on pull requests?
On Sun, Aug 3, 2014 at 4:35 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Include the commit hash in the tests have started/completed messages, so that it's clear what code exactly is/has been tested for each test cycle. This is now captured in this JIRA issue https://issues.apache.org/jira/browse/SPARK-2912 and completed in this PR https://github.com/apache/spark/pull/1816 which has been merged in to master. Example of old style: tests starting https://github.com/apache/spark/pull/1819#issuecomment-51416510 / tests finished https://github.com/apache/spark/pull/1819#issuecomment-51417477 (with new classes) Example of new style: tests starting https://github.com/apache/spark/pull/1816#issuecomment-51855254 / tests finished https://github.com/apache/spark/pull/1816#issuecomment-51855255 (with new classes) Nick
Re: Tests failing
Shivaram, Can you point us to an example of that happening? The Jenkins console output, that is. Nick On Fri, Aug 15, 2014 at 2:28 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Also I think Jenkins doesn't post build timeouts to github. Is there any way we can fix that? On Aug 15, 2014 9:04 AM, Patrick Wendell pwend...@gmail.com wrote: Hi All, I noticed that all PR tests run overnight had failed due to timeouts. The patch that updates the netty shuffle I believe somehow inflated the build time significantly. That patch had been tested, but one change was made before it was merged that was not tested. I've reverted the patch for now to see if it brings the build times back down. - Patrick
Re: Tests failing
So 2 hours is a hard cap on how long a build can run. Okie doke. Perhaps then I'll wrap the run-tests step as you suggest and limit it to 100 minutes or something, and cleanly report if it times out. Sound good? On Fri, Aug 15, 2014 at 4:43 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nicholas, Yeah so Jenkins has its own timeout mechanism and it will just kill the entire build after 120 minutes. But since run-tests is sitting in the middle of the tests, it can't actually post a failure message. I think run-tests-jenkins should just wrap the call to run-tests in its own timeout. It might be possible to just use this: http://linux.die.net/man/1/timeout - Patrick On Fri, Aug 15, 2014 at 1:31 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: OK, I've captured this in SPARK-3076 https://issues.apache.org/jira/browse/SPARK-3076. Patrick, Is the problem that this run-tests https://github.com/apache/spark/blob/0afe5cb65a195d2f14e8dfcefdbec5dac023651f/dev/run-tests-jenkins#L151 step times out, and that is currently not handled gracefully? To be more specific, it hangs for 120 minutes, times out, but the parent script for some reason is also terminated. Does that sound right? Nick On Fri, Aug 15, 2014 at 3:33 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Jenkins runs for this PR https://github.com/apache/spark/pull/1960 timed out without notification. The relevant Jenkins logs are at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18588/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18592/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18597/consoleFull On Fri, Aug 15, 2014 at 11:44 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Shivaram, Can you point us to an example of that happening? The Jenkins console output, that is.
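The idea discussed in this thread, wrapping the inner run-tests call in its own shorter timeout so the wrapper can still post a failure message before Jenkins' 120-minute hard kill, can be sketched with Python's subprocess timeout, which behaves like the timeout(1) command Patrick links. Here `sleep 5` and a 1-second limit stand in for ./dev/run-tests and the ~100-minute budget:

```python
import subprocess

def run_with_timeout(cmd, seconds):
    """Run cmd, returning True if it was killed for exceeding the limit."""
    try:
        subprocess.run(cmd, timeout=seconds)
        return False
    except subprocess.TimeoutExpired:
        # The child was killed; the caller is still alive and can report.
        return True

# Stand-in for the real test run; the wrapper survives the timeout and
# can post a clean "timed out" message instead of being killed itself.
if run_with_timeout(["sleep", "5"], 1):
    print("Tests timed out; posting a timeout message to the PR")
```

Keeping the inner limit comfortably below the Jenkins cap (100 minutes vs. 120) is what gives the wrapper room to report before Jenkins kills everything.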
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
On a related note, I recently heard about Distributed R https://github.com/vertica/DistributedR, which is coming out of HP/Vertica and seems to be their proposition for machine learning at scale. It would be interesting to see some kind of comparison between that and MLlib (and perhaps also SparkR https://github.com/amplab-extras/SparkR-pkg?), especially since Distributed R has a concept of distributed arrays and works on data in-memory. Docs are here: https://github.com/vertica/DistributedR/tree/master/doc/platform Nick On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin r...@databricks.com wrote: They only compared their own implementations of a couple of algorithms on different platforms rather than comparing the different platforms themselves (in the case of Spark -- PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently. I have no doubt if you implement an ML algorithm in Python itself without any native libraries, the performance will be sub-optimal. What PySpark really provides is: - Using Spark transformations in Python - ML algorithms implemented in Scala (leveraging native numerical libraries for high performance), and callable in Python The paper claims Python is now one of the most popular languages for ML-oriented programming, and that's why they went ahead with Python. However, as I understand, very few people actually implement algorithms in Python directly because of the sub-optimal performance. Most people implement algorithms in other languages (e.g. C / Java), and expose APIs in Python for ease-of-use. This is what we are trying to do with PySpark as well. On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: Has anyone had a chance to look at this paper (with title in subject)? http://www.cs.rice.edu/~lp6/comparison.pdf Interesting that they chose to use Python alone. Do we know how much faster Scala is vs. Python in general, if at all?
As with any and all benchmarks, I'm sure there are caveats, but it'd be nice to have a response to the question above for starters. Thanks, Ignacio
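Reynold's point, a Python-facing API over a native implementation versus the algorithm written in interpreted Python, can be demonstrated without Spark at all. As a small self-contained analogy, CPython's built-in sum() is implemented in C, so comparing it against the same arithmetic written as a Python loop shows the gap he describes:

```python
import time

def sum_pure_python(xs):
    # The same arithmetic as the built-in sum(), but every addition runs
    # in the interpreter instead of compiled C code.
    total = 0.0
    for x in xs:
        total += x
    return total

data = [0.5] * 2_000_000

t0 = time.perf_counter()
fast = sum(data)              # C-implemented builtin behind a Python API
t_fast = time.perf_counter() - t0

t0 = time.perf_counter()
slow = sum_pure_python(data)  # pure interpreted Python
t_slow = time.perf_counter() - t0

assert fast == slow  # identical result; only the speed differs
print(f"C-implemented sum: {t_fast:.4f}s, pure-Python loop: {t_slow:.4f}s")
```

PySpark follows the same pattern at a much larger scale: the MLlib algorithms run in Scala with native numerical libraries, and Python is only the interface.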
Re: Pull requests will be automatically linked to JIRA when submitted
Thanks for looking into this. I think little tools like this are super helpful. Would it hurt to open a request with INFRA to install/configure the JIRA-GitHub plugin while we continue to use the Python script we have? I wouldn't mind opening that JIRA issue with them. Nick On Mon, Aug 11, 2014 at 12:52 PM, Patrick Wendell pwend...@gmail.com wrote: I spent some time on this and I'm not sure either of these is an option, unfortunately. We typically can't use custom JIRA plug-ins because this JIRA is controlled by the ASF and we don't have rights to modify most things about how it works (it's a large shared JIRA instance used by more than 50 projects). It's worth looking into whether they can do something. In general we've tended to avoid going through ASF infra whenever possible, since they are generally overloaded and things move very slowly, even if there are outages. Here is the script we use to do the sync: https://github.com/apache/spark/blob/master/dev/github_jira_sync.py It might be possible to modify this to support post-hoc changes, but we'd need to think about how to do so while minimizing function calls to the ASF JIRA API, which I found to be very slow. - Patrick On Mon, Aug 11, 2014 at 7:51 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: It looks like this script doesn't catch PRs that are opened and *then* have the JIRA issue ID added to the name. Would it be easy to somehow have the script trigger on PR name changes as well as PR creates? Alternately, is there a reason we can't or don't want to use the plugin mentioned below? (I'm assuming it covers cases like this, but I'm not sure.)
Nick On Wed, Jul 23, 2014 at 12:52 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: By the way, it looks like there’s a JIRA plugin that integrates it with GitHub: - https://marketplace.atlassian.com/plugins/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin - https://confluence.atlassian.com/display/BITBUCKET/Linking+Bitbucket+and+GitHub+accounts+to+JIRA It does the automatic linking and shows some additional information https://marketplace-cdn.atlassian.com/files/images/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin/86ff1a21-44fb-4227-aa4f-44c77aec2c97.png that might be nice to have for heavy JIRA users. Nick On Sun, Jul 20, 2014 at 12:50 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah it needs to have SPARK-XXX in the title (this is the format we request already). It just works with a small synchronization script I wrote that we run every five minutes on Jenkins that uses the GitHub and Jenkins APIs: https://github.com/apache/spark/commit/49e472744951d875627d78b0d6e93cd139232929 - Patrick On Sun, Jul 20, 2014 at 8:06 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That's pretty neat. How does it work? Do we just need to put the issue ID (e.g. SPARK-1234) anywhere in the pull request? Nick On Sat, Jul 19, 2014 at 11:10 PM, Patrick Wendell pwend...@gmail.com wrote: Just a small note, today I committed a tool that will automatically mirror pull requests to JIRA issues, so contributors will no longer have to manually post a pull request on the JIRA when they make one. It will create a link on the JIRA and also make a comment to trigger an e-mail to people watching. This should make some things easier, such as avoiding accidental duplicate effort on the same JIRA. - Patrick
Unit tests in 5 minutes
Howdy, Do we think it's both feasible and worthwhile to invest in getting our unit tests to finish in under 5 minutes (or something similarly brief) when run by Jenkins? Unit tests currently seem to take anywhere from 30 min to 2 hours. As people add more tests, I imagine this time will only grow. I think it would be better for both contributors and reviewers if they didn't have to wait so long for test results; PR reviews would be shorter, if nothing else. I don't know how this is normally done, but maybe it wouldn't be too much work to get a test cycle to feel lighter. Most unit tests are independent and can be run concurrently, right? Would it make sense to build a given patch on many servers at once and send disjoint sets of unit tests to each? I'd be interested in working on something like that if possible (and sensible). Nick
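The "disjoint sets of unit tests to each server" idea is, at its core, a partitioning problem. A minimal sketch (round-robin by suite name; a real scheduler would balance by historical runtime rather than count, and the suite names here are just examples):

```python
def partition_suites(suites, num_workers):
    """Deal independent test suites into num_workers disjoint buckets,
    so each build server runs a different subset concurrently."""
    buckets = [[] for _ in range(num_workers)]
    for i, suite in enumerate(sorted(suites)):  # sort for a stable assignment
        buckets[i % num_workers].append(suite)
    return buckets
```

With all suites independent, the wall-clock time then approaches the longest bucket rather than the sum of all suites.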
Re: -1s on pull requests?
1. Include the commit hash in the tests have started/completed

FYI: Looks like Xiangrui's already got a JIRA issue for this. SPARK-2622: Add Jenkins build numbers to SparkQA messages https://issues.apache.org/jira/browse/SPARK-2622

2. Pin a message to the start or end of the PR

Should new JIRA issues for this item fall under the following umbrella issue? SPARK-2230: Improvements to Jenkins QA Harness https://issues.apache.org/jira/browse/SPARK-2230 Nick
Re: -1s on pull requests?
On Mon, Jul 21, 2014 at 4:44 PM, Kay Ousterhout k...@eecs.berkeley.edu wrote: This also happens when something accidentally gets merged after the tests have started but before tests have passed.

Some improvements to SparkQA https://github.com/SparkQA could help with this. May I suggest:
1. Include the commit hash in the tests have started/completed messages, so that it's clear what code exactly is/has been tested for each test cycle.
2. Pin a message to the start or end of the PR that is updated with the status of the PR. Testing not complete; New commits since last test; Tests failed; etc. It should be easy for committers to get the status of the PR at a glance, without scrolling through the comment history.

Nick
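For item 2, the pinned comment mostly needs a small decision rule for which status text to show. A sketch of that rule, using the statuses named in the message above (the inputs are assumptions about what the QA harness would know, not a real SparkQA interface):

```python
def pr_status(last_tested_sha, head_sha, tests_passed):
    """Pick the pinned-comment text for a PR.

    tests_passed is None while a test run is still in flight."""
    if last_tested_sha != head_sha:
        # The PR moved on since we last tested it.
        return "New commits since last test"
    if tests_passed is None:
        return "Testing not complete"
    return "Tests passed" if tests_passed else "Tests failed"
```

The bot would then edit a single pinned comment in place with this text after each event, rather than appending a new comment, so a committer sees the current state at a glance.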
Re: -1s on pull requests?
On Sun, Aug 3, 2014 at 11:29 PM, Patrick Wendell pwend...@gmail.com wrote: Nick - Any interest in doing these? this is all doable from within the spark repo itself because our QA harness scripts are in there: https://github.com/apache/spark/blob/master/dev/run-tests-jenkins If not, could you make a JIRA for them and put it under Project Infra. I’ll make the JIRA and think about how to do this stuff. I’ll have to understand what that run-tests-jenkins script does and see how easy it is to extend. Nick
Re: ASF JIRA is down for maintenance
Seems to be back up now. On Sat, Aug 2, 2014 at 2:06 AM, Patrick Wendell pwend...@gmail.com wrote: Please don't let this prevent you from merging patches, just keep a list and we can update the JIRA later. - Patrick
Re: [VOTE] Release Apache Spark 1.0.2 (RC1)
- spun up an EC2 cluster successfully using spark-ec2 - tested S3 file access from that cluster successfully +1 On Tue, Jul 29, 2014 at 1:46 AM, Henry Saputra henry.sapu...@gmail.com wrote: NOTICE and LICENSE files look good Hashes and sigs look good No executable in the source distribution Compile source and run standalone +1 - Henry On Fri, Jul 25, 2014 at 4:08 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.2. This release fixes a number of bugs in Spark 1.0.1. Some of the notable ones are - SPARK-2452: Known issue in Spark 1.0.1 caused by attempted fix for SPARK-1199. The fix was reverted for 1.0.2. - SPARK-2576: NoClassDefFoundError when executing Spark QL query on HDFS CSV file. The full list is at http://s.apache.org/9NJ The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-1.0.2-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1024/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/ Please vote on releasing this package as Apache Spark 1.0.2! The vote is open until Tuesday, July 29, at 23:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.2 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/
Re: JIRA content request
+1 on using JIRA workflows to manage the backlog, and +9000 on having decent descriptions for all JIRA issues. On Tue, Jul 29, 2014 at 7:48 PM, Sean Owen so...@cloudera.com wrote: How about using a JIRA status like Documentation Required to mean the burden's on you to elaborate with a motivation and/or PR. This could both prompt people to do so, and also let one see when a JIRA has been waiting on the reporter for months, rather than simply never having been looked at, and should thus time out and be closed. Both of these would probably help the JIRA backlog. On Wed, Jul 30, 2014 at 12:34 AM, Mark Hamstra m...@clearstorydata.com wrote: Of late, I've been coming across quite a few pull requests and associated JIRA issues that contain nothing indicating their purpose beyond a pretty minimal description of what the pull request does. On the pull request itself, a reference to the corresponding JIRA in the title combined with a description that gives us a sketch of what the PR does is fine, but if there is no description in at least the JIRA of *why* you think some change to Spark would be good, then it often makes getting started on code reviews a little harder for those of us doing the reviews. So, I'm requesting that if you are submitting a JIRA or pull request for something that isn't obviously a bug or bug fix, you please include some sort of motivation in at least the JIRA body so that the reviewers can more easily get through the head-scratching phase of trying to figure out why Spark might be improved by merging a pull request.
Re: Fraud management system implementation
This sounds more like a user list https://spark.apache.org/community.html question. This is the dev list, where people discuss things related to contributing code and such to Spark. On Mon, Jul 28, 2014 at 10:15 AM, jitendra shelar jitendra.shelar...@gmail.com wrote: Hi, I am new to Spark. I am learning Spark and Scala. I had some queries. 1) Can somebody please tell me if it is possible to implement a credit card fraud management system using Spark? 2) If yes, can somebody please guide me how to proceed? 3) Shall I prefer Scala or Java for this implementation? 4) Please suggest me some pointers related to Hidden Markov Models (HMM) and anomaly detection in data mining (using Spark). Thanks, Jitendra
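As a pointer for question 4: before reaching for HMMs, the simplest form of anomaly detection is a z-score threshold on transaction amounts. A hedged, plain-Python sketch (not Spark code; all names here are made up for illustration, and an HMM-based approach would instead model sequences of transactions):

```python
import statistics

def zscore_outliers(amounts, threshold=2.0):
    """Flag transaction amounts more than `threshold` population standard
    deviations from the mean -- a common first baseline for fraud-style
    anomaly detection."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        # All amounts identical: nothing can be an outlier by this measure.
        return []
    return [a for a in amounts if abs(a - mean) / stdev > threshold]

# A run of small purchases followed by one large one:
zscore_outliers([20, 25, 22, 24, 21, 23, 500])  # flags [500]
```

In Spark, the mean and standard deviation would come from aggregates over the full RDD rather than a local list, with the filter then applied distributed; the thresholding logic itself is the same.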
Re: Pull requests will be automatically linked to JIRA when submitted
By the way, it looks like there’s a JIRA plugin that integrates it with GitHub: - https://marketplace.atlassian.com/plugins/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin - https://confluence.atlassian.com/display/BITBUCKET/Linking+Bitbucket+and+GitHub+accounts+to+JIRA It does the automatic linking and shows some additional information https://marketplace-cdn.atlassian.com/files/images/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin/86ff1a21-44fb-4227-aa4f-44c77aec2c97.png that might be nice to have for heavy JIRA users. Nick On Sun, Jul 20, 2014 at 12:50 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah it needs to have SPARK-XXX in the title (this is the format we request already). It just works with a small synchronization script I wrote that we run every five minutes on Jenkins that uses the GitHub and JIRA APIs: https://github.com/apache/spark/commit/49e472744951d875627d78b0d6e93cd139232929 - Patrick On Sun, Jul 20, 2014 at 8:06 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That's pretty neat. How does it work? Do we just need to put the issue ID (e.g. SPARK-1234) anywhere in the pull request? Nick On Sat, Jul 19, 2014 at 11:10 PM, Patrick Wendell pwend...@gmail.com wrote: Just a small note, today I committed a tool that will automatically mirror pull requests to JIRA issues, so contributors will no longer have to manually post a pull request on the JIRA when they make one. It will create a link on the JIRA and also make a comment to trigger an e-mail to people watching. This should make some things easier, such as avoiding accidental duplicate effort on the same JIRA. - Patrick
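The linking Patrick describes hinges on extracting the issue ID from the PR title. The real script lives at the commit linked above; as a rough illustration of just the extraction step (this regex is an assumption, not the script's actual code):

```python
import re

# Matches issue IDs like "SPARK-1234" anywhere in a pull request title.
ISSUE_ID = re.compile(r"\bSPARK-\d+\b")

def linked_issues(pr_title):
    """Return the JIRA issue IDs mentioned in a PR title,
    e.g. "[SPARK-1234] Fix shuffle spill" -> ["SPARK-1234"]."""
    return ISSUE_ID.findall(pr_title)
```

A title with no recognizable ID simply yields an empty list, which is presumably when no JIRA link gets created.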
Contributing to Spark needs PySpark build/test instructions
Contributing to Spark https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark needs a line or two about building and testing PySpark. A call out of run-tests, for example, would be helpful for new contributors to PySpark. Nick
Re: Contributing to Spark needs PySpark build/test instructions
For the record, the triggering discussion is here https://github.com/apache/spark/pull/1505#issuecomment-49671550. I assumed that sbt/sbt test covers all the tests required before submitting a patch, and it appears that it doesn’t. On Mon, Jul 21, 2014 at 6:42 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Contributing to Spark https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark needs a line or two about building and testing PySpark. A call out of run-tests, for example, would be helpful for new contributors to PySpark. Nick
Re: Contributing to Spark needs PySpark build/test instructions
Looks good. Does sbt/sbt test cover the same tests as dev/run-tests? I’m looking at step 5 under “Contributing Code”. Someone contributing to PySpark will want to be directed to run something in addition to (or instead of) sbt/sbt test, I believe. Nick On Mon, Jul 21, 2014 at 11:43 PM, Reynold Xin r...@databricks.com wrote: I added an automated testing section: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting Can you take a look to see if it is what you had in mind? On Mon, Jul 21, 2014 at 3:54 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: For the record, the triggering discussion is here https://github.com/apache/spark/pull/1505#issuecomment-49671550. I assumed that sbt/sbt test covers all the tests required before submitting a patch, and it appears that it doesn’t. On Mon, Jul 21, 2014 at 6:42 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Contributing to Spark https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark needs a line or two about building and testing PySpark. A call out of run-tests, for example, would be helpful for new contributors to PySpark. Nick
Re: Pull requests will be automatically linked to JIRA when submitted
That's pretty neat. How does it work? Do we just need to put the issue ID (e.g. SPARK-1234) anywhere in the pull request? Nick On Sat, Jul 19, 2014 at 11:10 PM, Patrick Wendell pwend...@gmail.com wrote: Just a small note, today I committed a tool that will automatically mirror pull requests to JIRA issues, so contributors will no longer have to manually post a pull request on the JIRA when they make one. It will create a link on the JIRA and also make a comment to trigger an e-mail to people watching. This should make some things easier, such as avoiding accidental duplicate effort on the same JIRA. - Patrick
Re: small (yet major) change going in: broadcasting RDD to reduce task size
On Thu, Jul 17, 2014 at 1:23 AM, Stephen Haberman stephen.haber...@gmail.com wrote: I'd be ecstatic if more major changes were this well/succinctly explained Ditto on that. The summary of user impact was very nice. It would be good to repeat that on the user list or release notes when this change goes out. Nick
ec2 clusters launched at 9fe693b5b6 are broken (?)
Just launched an EC2 cluster from git hash 9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD accessing data in S3 yields the following error output. I understand that NoClassDefFoundError errors may mean something in the deployment was messed up. Is that correct? When I launch a cluster using spark-ec2, I expect all critical deployment details to be taken care of by the script. So is something in the deployment executed by spark-ec2 borked? Nick
java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:71)
    at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
    at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
    at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
    at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
    at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
    at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator.<init>(CoalescedRDD.scala:185)
    at org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:236)
    at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
    at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1036)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
    at $iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC.<init>(<console>:33)
    at $iwC.<init>(<console>:35)
    at <init>(<console>:37)
    at .<init>(<console>:41)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at
Re: ec2 clusters launched at 9fe693b5b6 are broken (?)
Okie doke--added myself as a watcher on that issue. On a related note, what are the thoughts on automatically spinning up/down EC2 clusters and running tests against them? It would probably be way too cumbersome to do that for every build, but perhaps on some schedule it could help validate that we are still deploying EC2 clusters correctly. Would something like that be valuable? Nick On Tue, Jul 15, 2014 at 1:19 AM, Patrick Wendell pwend...@gmail.com wrote: Yeah - this is likely caused by SPARK-2471. On Mon, Jul 14, 2014 at 10:11 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: My guess is that this is related to https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library gets excluded from the SBT assembly jar. I am not sure if the assembly jar used in EC2 is generated using SBT though. Shivaram On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson ilike...@gmail.com wrote: This one is typically due to a mismatch between the Hadoop versions -- i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the classpath, or something like that. Not certain why you're seeing this with spark-ec2, but I'm assuming this is related to the issues you posted in a separate thread. On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Just launched an EC2 cluster from git hash 9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD accessing data in S3 yields the following error output. I understand that NoClassDefFoundError errors may mean something in the deployment was messed up. Is that correct? When I launch a cluster using spark-ec2, I expect all critical deployment details to be taken care of by the script. So is something in the deployment executed by spark-ec2 borked? 
Nick
java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
    at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:71)
    at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
    at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
    at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
    at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
    at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
    at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
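Shivaram's hypothesis above -- that the jets3t classes were excluded from the assembly jar (SPARK-2471) -- can be checked directly on the cluster. Since a jar is just a zip archive, a quick sketch of the check in Python (the jar path in the example is illustrative):

```python
import zipfile

def classes_matching(jar_path, fragment):
    """List entries in a jar whose path contains `fragment`.
    An empty result for "jets3t" would support the missing-dependency
    theory behind the NoClassDefFoundError above."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist() if fragment in name]

# e.g. on the cluster:
#   classes_matching("spark-assembly-1.0.1-hadoop1.0.4.jar",
#                    "org/jets3t/service/S3ServiceException")
```

The same result is available with `jar tf <assembly>.jar | grep jets3t` where the JDK tools are installed.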
Re: EC2 clusters ready in launch time + 30 seconds
On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico n...@reactor8.com wrote: Starting to work through some automation/config stuff for spark stack on EC2 with a project, will be focusing the work through the apache bigtop effort to start, can then share with spark community directly as things progress if people are interested Let us know how that goes. I'm definitely interested in hearing more. Nick
EC2 clusters ready in launch time + 30 seconds
Hi devs! Right now it takes a non-trivial amount of time to launch EC2 clusters. Part of this time is spent starting the EC2 instances, which is out of our control. Another part of this time is spent installing stuff on and configuring the instances. This, we can control. I’d like to explore approaches to upgrading spark-ec2 so that launching a cluster of any size generally takes only 30 seconds on top of the time to launch the base EC2 instances. Since Amazon can launch instances concurrently, I believe this means we should be able to launch a fully operational Spark cluster of any size in constant time. Is that correct? Do we already have an idea of what it would take to get to that point? Nick
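The "constant time for any cluster size" claim rests on doing the per-instance setup work concurrently rather than looping over instances serially. A toy illustration of the difference (sleep stands in for the real per-node install/config work; the timing constant is made up for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor

SETUP_SECONDS = 0.2  # stand-in for per-instance install/config work

def set_up_instance(instance_id):
    time.sleep(SETUP_SECONDS)  # e.g. copy packages, write config files
    return instance_id

def launch_cluster(num_instances):
    """Set up all instances concurrently: wall time stays near
    SETUP_SECONDS regardless of cluster size, instead of growing
    linearly as a serial loop over instances would."""
    with ThreadPoolExecutor(max_workers=num_instances) as pool:
        return list(pool.map(set_up_instance, range(num_instances)))

start = time.time()
launch_cluster(10)
elapsed = time.time() - start  # roughly SETUP_SECONDS, not 10x that
```

The same principle would apply to a real spark-ec2 rewrite: as long as no setup step requires serial coordination across nodes, total launch time is bounded by the slowest single instance plus the EC2 boot time.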