Re: Scalastyle improvements / large code reformatting

2014-10-13 Thread Nicholas Chammas
On Mon, Oct 13, 2014 at 11:57 AM, Patrick Wendell pwend...@gmail.com
wrote:

 That would even work for imports as well,
 you'd just have a thing where if anyone modified some imports they
 would have to fix all the imports in that file. It's at least worth a
 try.


OK, that sounds like a fair compromise. I've updated the description on
SPARK-3849 https://issues.apache.org/jira/browse/SPARK-3849 accordingly.

Nick


Re: new jenkins update + tentative release date

2014-10-13 Thread Nicholas Chammas
Thanks for doing this work Shane.

So is Jenkins in the new datacenter now? Do you know if the problems with
checking out patches from GitHub should be resolved now? Here's an example
from the past hour
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21702/console
.

Nick


On Mon, Oct 13, 2014 at 1:35 PM, shane knapp skn...@berkeley.edu wrote:

 AND WE ARE LIIIVE!

 https://amplab.cs.berkeley.edu/jenkins/

 have at it, folks!

 On Mon, Oct 13, 2014 at 10:15 AM, shane knapp skn...@berkeley.edu wrote:

  quick update:  we should be back up and running in the next ~60mins.
 
  On Mon, Oct 13, 2014 at 7:54 AM, shane knapp skn...@berkeley.edu
 wrote:
 
  Jenkins is in quiet mode and the move will be starting after i have my
  coffee.  :)
 
  On Sun, Oct 12, 2014 at 11:26 PM, Josh Rosen rosenvi...@gmail.com
  wrote:
 
  Reminder: this Jenkins migration is happening tomorrow morning
 (Monday).
 
  On Fri, Oct 10, 2014 at 1:01 PM, shane knapp skn...@berkeley.edu
  wrote:
 
  reminder:  this IS happening, first thing monday morning PDT.  :)
 
  On Wed, Oct 8, 2014 at 3:01 PM, shane knapp skn...@berkeley.edu
  wrote:
 
   greetings!
  
   i've got some updates regarding our new jenkins infrastructure, as
  well as
   the initial date and plan for rolling things out:
  
   *** current testing/build break whack-a-mole:
   a lot of out of date artifacts are cached in the current jenkins,
  which
   has caused a few builds during my testing to break due to dependency
   resolution failure[1][2].
  
   bumping these versions can cause your builds to fail, due to public
  api
   changes and the like.  consider yourself warned that some projects
  might
   require some debugging...  :)
  
   tomorrow, i will be at databricks working w/@joshrosen to make sure
  that
   the spark builds have any bugs hammered out.
  
   ***  deployment plan:
   unless something completely horrible happens, THE NEW JENKINS WILL
 GO
  LIVE
   ON MONDAY (october 13th).
  
   all jenkins infrastructure will be DOWN for the entirety of the day
   (starting at ~8am).  this means no builds, period.  i'm hoping that
  the
   downtime will be much shorter than this, but we'll have to see how
   everything goes.
  
   all test/build history WILL BE PRESERVED.  i will be rsyncing the
  jenkins
   jobs/ directory over, complete w/history as part of the deployment.
  
   once i'm feeling good about the state of things, i'll point the
  original
   url to the new instances and send out an all clear.
  
   if you are a student at UC berkeley, you can log in to jenkins using
  your
   LDAP login, and (by default) view but not change plans.  if you do
  not have
   a UC berkeley LDAP login, you can still view plans anonymously.
  
   IF YOU ARE A PLAN ADMIN, THEN PLEASE REACH OUT, ASAP, PRIVATELY AND
 I
  WILL
   SET UP ADMIN ACCESS TO YOUR BUILDS.
  
   ***  post deployment plan:
   fix all of the things that break!
  
   i will be keeping a VERY close eye on the builds, checking for
  breaks, and
   helping out where i can.  if the situation is dire, i can always
 roll
  back
   to the old jenkins infra...  but i hope we never get to that point!
  :)
  
   i'm hoping that things will go smoothly, but please be patient as
 i'm
   certain we'll hit a few bumps in the road.
  
   please let me know if you guys have any
  comments/questions/concerns...  :)
  
   shane
  
   1 - https://github.com/bigdatagenomics/bdg-services/pull/18
   2 - https://github.com/bigdatagenomics/avocado/pull/111
  
 
 
 
 
 



Re: new jenkins update + tentative release date

2014-10-13 Thread Nicholas Chammas
Ah, that sucks. Thank you for looking into this.

On Mon, Oct 13, 2014 at 5:43 PM, shane knapp skn...@berkeley.edu wrote:

 On Mon, Oct 13, 2014 at 2:28 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Thanks for doing this work Shane.

 So is Jenkins in the new datacenter now? Do you know if the problems with
 checking out patches from GitHub should be resolved now? Here's an
 example from the past hour
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21702/console
 .


 yeah, i just noticed that we're still having the checkout issues.  i was
 really hoping that the better network would just make this go away...
  guess i'll be doing a deeper dive now.

 i would just up the timeout, but that's not coming out for a little while
 yet:
 https://issues.jenkins-ci.org/browse/JENKINS-20387

 (we are currently running the latest -- 2.2.7, and the timeout field is
 coming in 2.3, whenever that is)

 i'll try and strace/replicate it locally as well.




Re: new jenkins update + tentative release date

2014-10-13 Thread Nicholas Chammas
*fingers crossed*

On Mon, Oct 13, 2014 at 5:54 PM, shane knapp skn...@berkeley.edu wrote:

 ok, i found something that may help:

 https://issues.jenkins-ci.org/browse/JENKINS-20445?focusedCommentId=195638page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-195638

 i set this to 20 minutes...  let's see if that helps.

 On Mon, Oct 13, 2014 at 2:48 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Ah, that sucks. Thank you for looking into this.

 On Mon, Oct 13, 2014 at 5:43 PM, shane knapp skn...@berkeley.edu wrote:

 On Mon, Oct 13, 2014 at 2:28 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Thanks for doing this work Shane.

 So is Jenkins in the new datacenter now? Do you know if the problems
 with checking out patches from GitHub should be resolved now? Here's an
 example from the past hour
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21702/console
 .


 yeah, i just noticed that we're still having the checkout issues.  i
 was really hoping that the better network would just make this go away...
  guess i'll be doing a deeper dive now.

 i would just up the timeout, but that's not coming out for a little
 while yet:
 https://issues.jenkins-ci.org/browse/JENKINS-20387

 (we are currently running the latest -- 2.2.7, and the timeout field is
 coming in 2.3, whenever that is)

 i'll try and strace/replicate it locally as well.







Re: Trouble running tests

2014-10-10 Thread Nicholas Chammas
Running dev/run-tests as-is should work and will test everything. That's
what the contributing guide recommends, if I remember correctly.

At some point we should make it easier to test individual components
locally using the dev script, but calling sbt on the various tests suites
as Michael pointed out will always work.

Nick

On Friday, October 10, 2014, Yana Kadiyska yana.kadiy...@gmail.com wrote:

 Thanks Nicholas and Michael-- yes, I wanted to make sure all tests pass
 before I submitted a pull request.

 AMPLAB_JENKINS=true ./dev/run-tests fails for me in mlib and yarn
 suites(synced to 14f222f7f76cc93633aae27a94c0e556e289ec56).

 I was however able to run Michael's suggested tests and my changes affect
 the SQL project only, so I'll go ahead with the pull request...

 I'd like to know if people run the full suite locally though -- I can
 imagine cases where a change is not clearly isolated to a single module.

 thanks again

 On Thu, Oct 9, 2014 at 5:26 PM, Michael Armbrust mich...@databricks.com
 javascript:_e(%7B%7D,'cvml','mich...@databricks.com'); wrote:

 Also, in general for SQL only changes it is sufficient to run sbt/sbt
 catatlyst/test sql/test hive/test.  The hive/test part takes the
 longest, so I usually leave that out until just before submitting unless my
 changes are hive specific.

 On Thu, Oct 9, 2014 at 11:40 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com
 javascript:_e(%7B%7D,'cvml','nicholas.cham...@gmail.com'); wrote:

 _RUN_SQL_TESTS needs to be true as well. Those two _... variables set get
 correctly when tests are run on Jenkins. They’re not meant to be
 manipulated directly by testers.

 Did you want to run SQL tests only locally? You can try faking being
 Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That
 should be simpler than futzing with the _... variables.

 Nick
 ​

 On Thu, Oct 9, 2014 at 10:10 AM, Yana yana.kadiy...@gmail.com
 javascript:_e(%7B%7D,'cvml','yana.kadiy...@gmail.com'); wrote:

  Hi, apologies if I missed a FAQ somewhere.
 
  I am trying to submit a bug fix for the very first time. Reading
  instructions, I forked the git repo (at
  c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.
 
  I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true
 
  and after a while get the following error:
 
  [info] ScalaTest
  [info] Run completed in 3 minutes, 37 seconds.
  [info] Total number of tests run: 224
  [info] Suites: completed 19, aborted 0
  [info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
  [info] All tests passed.
  [info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
  [success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
  [error] Expected ID character
  [error] Not a valid command: hive-thriftserver
  [error] Expected project ID
  [error] Expected configuration
  [error] Expected ':' (if selecting a configuration)
  [error] Expected key
  [error] Not a valid key: hive-thriftserver
  [error] hive-thriftserver/test
  [error]  ^
 
 
  (I am running this without my changes)
 
  I have 2 questions:
  1. How to fix this
  2. Is there a best practice on what to fork so you start off with a
 good
  state? I'm wondering if I should sync the latest changes or go back
 to a
  label?
 
  thanks in advance
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 javascript:_e(%7B%7D,'cvml','dev-unsubscr...@spark.apache.org');
  For additional commands, e-mail: dev-h...@spark.apache.org
 javascript:_e(%7B%7D,'cvml','dev-h...@spark.apache.org');
 
 






Re: Trouble running tests

2014-10-09 Thread Nicholas Chammas
_RUN_SQL_TESTS needs to be true as well. Those two _... variables set get
correctly when tests are run on Jenkins. They’re not meant to be
manipulated directly by testers.

Did you want to run SQL tests only locally? You can try faking being
Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That
should be simpler than futzing with the _... variables.

Nick
​

On Thu, Oct 9, 2014 at 10:10 AM, Yana yana.kadiy...@gmail.com wrote:

 Hi, apologies if I missed a FAQ somewhere.

 I am trying to submit a bug fix for the very first time. Reading
 instructions, I forked the git repo (at
 c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.

 I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true

 and after a while get the following error:

 [info] ScalaTest
 [info] Run completed in 3 minutes, 37 seconds.
 [info] Total number of tests run: 224
 [info] Suites: completed 19, aborted 0
 [info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
 [info] All tests passed.
 [info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
 [success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
 [error] Expected ID character
 [error] Not a valid command: hive-thriftserver
 [error] Expected project ID
 [error] Expected configuration
 [error] Expected ':' (if selecting a configuration)
 [error] Expected key
 [error] Not a valid key: hive-thriftserver
 [error] hive-thriftserver/test
 [error]  ^


 (I am running this without my changes)

 I have 2 questions:
 1. How to fix this
 2. Is there a best practice on what to fork so you start off with a good
 state? I'm wondering if I should sync the latest changes or go back to a
 label?

 thanks in advance




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




spark-prs and mesos/spark-ec2

2014-10-09 Thread Nicholas Chammas
Does it make sense to point the Spark PR review board to read from
mesos/spark-ec2 as well? PRs submitted against that repo may reference
Spark JIRAs and need review just like any other Spark PR.

Nick


Re: Unneeded branches/tags

2014-10-08 Thread Nicholas Chammas
So:

   - tags: can delete
   - branches: stuck with ‘em

Correct?

Nick
​

On Wed, Oct 8, 2014 at 1:52 AM, Patrick Wendell pwend...@gmail.com wrote:

 Actually - weirdly - we can delete old tags and it works with the
 mirroring. Nick if you put together a list of un-needed tags I can
 delete them.

 On Tue, Oct 7, 2014 at 6:27 PM, Reynold Xin r...@databricks.com wrote:
  Those branches are no longer active. However, I don't think we can delete
  branches from github due to the way ASF mirroring works. I might be wrong
  there.
 
 
 
  On Tue, Oct 7, 2014 at 6:25 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:
 
  Just curious: Are there branches and/or tags on the repo that we don't
 need
  anymore?
 
  What are the scala-2.9 and streaming branches for, for example? And do
 we
  still need branches for older versions of Spark that we are not
 backporting
  stuff to, like branch-0.5?
 
  Nick
 
 



Re: Extending Scala style checks

2014-10-08 Thread Nicholas Chammas
I've created SPARK-3849: Automate remaining Scala style rules
https://issues.apache.org/jira/browse/SPARK-3849.

Please create sub-tasks on this issue for rules that we have not automated
and let's work through them as possible.

I went ahead and created the first sub-task, SPARK-3850: Scala style:
Disallow trailing spaces https://issues.apache.org/jira/browse/SPARK-3850.

Nick

On Tue, Oct 7, 2014 at 4:45 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 For starters, do we have a list of all the Scala style rules that are
 currently not enforced automatically but are likely well-suited for
 automation?

 Let's put such a list together in a JIRA issue and work through
 implementing them.

 Nick

 On Thu, Oct 2, 2014 at 12:06 AM, Cheng Lian lian.cs@gmail.com wrote:

 Since we can easily catch the list of all changed files in a PR, I think
 we can start with adding the no trailing space check for newly changed
 files only?


 On 10/2/14 9:24 AM, Nicholas Chammas wrote:

 Yeah, I remember that hell when I added PEP 8 to the build checks and
 fixed
 all the outstanding Python style issues. I had to keep rebasing and
 resolving merge conflicts until the PR was merged.

 It's a rough process, but thankfully it's also a one-time process. I
 might
 be able to help with that in the next week or two if no-one else wants to
 pick it up.

 Nick

 On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust mich...@databricks.com
 
 wrote:

  The hard part here is updating the existing code base... which is going
 to
 create merge conflicts with like all of the open PRs...

 On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  Ah, since there appears to be a built-in rule for end-of-line
 whitespace,
 Michael and Cheng, y'all should be able to add this in pretty easily.

 Nick

 On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey Nick,

 We can always take built-in rules. Back when we added this Prashant
 Sharma actually did some great work that lets us write our own style
 rules in cases where rules don't exist.

 You can see some existing rules here:


  https://github.com/apache/spark/tree/master/project/
 spark-style/src/main/scala/org/apache/spark/scalastyle

 Prashant has over time contributed a lot of our custom rules upstream
 to stalastyle, so now there are only a couple there.

 - Patrick

 On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at WhitespaceEndOfLineChecker under:
 http://www.scalastyle.org/rules-0.1.0.html

 Cheers

 On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas 

 nicholas.cham...@gmail.com

 wrote:
 As discussed here https://github.com/apache/spark/pull/2619, it

 would be

 good to extend our Scala style checks to programmatically enforce as

 many

 of our style rules as possible.

 Does anyone know if it's relatively straightforward to enforce

 additional

 rules like the no trailing spaces rule mentioned in the linked PR?

 Nick







Re: Extending Scala style checks

2014-10-07 Thread Nicholas Chammas
For starters, do we have a list of all the Scala style rules that are
currently not enforced automatically but are likely well-suited for
automation?

Let's put such a list together in a JIRA issue and work through
implementing them.

Nick

On Thu, Oct 2, 2014 at 12:06 AM, Cheng Lian lian.cs@gmail.com wrote:

 Since we can easily catch the list of all changed files in a PR, I think
 we can start with adding the no trailing space check for newly changed
 files only?


 On 10/2/14 9:24 AM, Nicholas Chammas wrote:

 Yeah, I remember that hell when I added PEP 8 to the build checks and
 fixed
 all the outstanding Python style issues. I had to keep rebasing and
 resolving merge conflicts until the PR was merged.

 It's a rough process, but thankfully it's also a one-time process. I might
 be able to help with that in the next week or two if no-one else wants to
 pick it up.

 Nick

 On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust mich...@databricks.com
 wrote:

  The hard part here is updating the existing code base... which is going
 to
 create merge conflicts with like all of the open PRs...

 On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  Ah, since there appears to be a built-in rule for end-of-line
 whitespace,
 Michael and Cheng, y'all should be able to add this in pretty easily.

 Nick

 On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey Nick,

 We can always take built-in rules. Back when we added this Prashant
 Sharma actually did some great work that lets us write our own style
 rules in cases where rules don't exist.

 You can see some existing rules here:


  https://github.com/apache/spark/tree/master/project/
 spark-style/src/main/scala/org/apache/spark/scalastyle

 Prashant has over time contributed a lot of our custom rules upstream
 to stalastyle, so now there are only a couple there.

 - Patrick

 On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at WhitespaceEndOfLineChecker under:
 http://www.scalastyle.org/rules-0.1.0.html

 Cheers

 On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas 

 nicholas.cham...@gmail.com

 wrote:
 As discussed here https://github.com/apache/spark/pull/2619, it

 would be

 good to extend our Scala style checks to programmatically enforce as

 many

 of our style rules as possible.

 Does anyone know if it's relatively straightforward to enforce

 additional

 rules like the no trailing spaces rule mentioned in the linked PR?

 Nick






Unneeded branches/tags

2014-10-07 Thread Nicholas Chammas
Just curious: Are there branches and/or tags on the repo that we don’t need
anymore?

What are the scala-2.9 and streaming branches for, for example? And do we
still need branches for older versions of Spark that we are not backporting
stuff to, like branch-0.5?

Nick
​


Re: EC2 clusters ready in launch time + 30 seconds

2014-10-06 Thread Nicholas Chammas
FYI: I've created SPARK-3821: Develop an automated way of creating Spark
images (AMI, Docker, and others)
https://issues.apache.org/jira/browse/SPARK-3821

On Mon, Oct 6, 2014 at 4:48 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:

 I've also been looking at this. Basically, the Spark EC2 script is
 excellent for small development clusters of several nodes, but isn't
 suitable for production. It handles instance setup in a single threaded
 manner, while it can easily be parallelized. It also doesn't handle failure
 well, ex when an instance fails to start or is taking too long to respond.

 Our desire was to have an equivalent to Amazon EMR[1] API that would
 trigger Spark jobs, including specified cluster setup. I've done some work
 towards that end, and it would benefit from an updated AMI greatly.

 Dan

 [1]
 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html

 On Sat, Oct 4, 2014 at 7:28 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Thanks for posting that script, Patrick. It looks like a good place to
 start.

 Regarding Docker vs. Packer, as I understand it you can use Packer to
 create Docker containers at the same time as AMIs and other image types.

 Nick


 On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey All,
 
  Just a couple notes. I recently posted a shell script for creating the
  AMI's from a clean Amazon Linux AMI.
 
  https://github.com/mesos/spark-ec2/blob/v3/create_image.sh
 
  I think I will update the AMI's soon to get the most recent security
  updates. For spark-ec2's purpose this is probably sufficient (we'll
  only need to re-create them every few months).
 
  However, it would be cool if someone wanted to tackle providing a more
  general mechanism for defining Spark-friendly images that can be
  used more generally. I had thought that docker might be a good way to
  go for something like this - but maybe this packer thing is good too.
 
  For one thing, if we had a standard image we could use it to create
  containers for running Spark's unit test, which would be really cool.
  This would help a lot with random issues around port and filesystem
  contention we have for unit tests.
 
  I'm not sure if the long term place for this would be inside the spark
  codebase or a community library or what. But it would definitely be
  very valuable to have if someone wanted to take it on.
 
  - Patrick
 
  On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
   FYI: There is an existing issue -- SPARK-3314
   https://issues.apache.org/jira/browse/SPARK-3314 -- about scripting
  the
   creation of Spark AMIs.
  
   With Packer, it looks like we may be able to script the creation of
   multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from a
   single Packer template. That's very cool.
  
   I'll be looking into this.
  
   Nick
  
  
   On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com
   wrote:
  
   Thanks for the update, Nate. I'm looking forward to seeing how these
   projects turn out.
  
   David, Packer looks very, very interesting. I'm gonna look into it
 more
   next week.
  
   Nick
  
  
   On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com
 wrote:
  
   Bit of progress on our end, bit of lagging as well.  Our guy leading
   effort got little bogged down on client project to update hive/sql
  testbed
   to latest spark/sparkSQL, also launching public service so we have
  been bit
   scattered recently.
  
   Will have some more updates probably after next week.  We are
 planning
  on
   taking our client work around hive/spark, plus taking over the
 bigtop
   automation work to modernize and get that fit for human consumption
  outside
   or org.  All our work and puppet modules will be open sourced,
  documented,
   hopefully start to rally some other folks around effort that find it
  useful
  
   Side note, another effort we are looking into is gradle
 tests/support.
   We have been leveraging serverspec for some basic infrastructure
  tests, but
   with bigtop switching over to gradle builds/testing setup in 0.8 we
  want to
   include support for that in our own efforts, probably some stuff
 that
  can
   be learned and leveraged in spark world for repeatable/tested
  infrastructure
  
   If anyone has any specific automation questions to your environment
 you
   can drop me a line directly.., will try to help out best I can.
 Else
  will
   post update to dev list once we get on top of our own product
 release
  and
   the bigtop work
  
   Nate
  
  
   -Original Message-
   From: David Rowe [mailto:davidr...@gmail.com]
   Sent: Thursday, October 02, 2014 4:44 PM
   To: Nicholas Chammas
   Cc: dev; Shivaram Venkataraman
   Subject: Re: EC2 clusters ready in launch time + 30 seconds
  
   I think this is exactly what packer is for. See e.g.
   http://www.packer.io/intro/getting-started/build

Re: EC2 clusters ready in launch time + 30 seconds

2014-10-04 Thread Nicholas Chammas
Thanks for posting that script, Patrick. It looks like a good place to
start.

Regarding Docker vs. Packer, as I understand it you can use Packer to
create Docker containers at the same time as AMIs and other image types.

Nick


On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey All,

 Just a couple notes. I recently posted a shell script for creating the
 AMI's from a clean Amazon Linux AMI.

 https://github.com/mesos/spark-ec2/blob/v3/create_image.sh

 I think I will update the AMI's soon to get the most recent security
 updates. For spark-ec2's purpose this is probably sufficient (we'll
 only need to re-create them every few months).

 However, it would be cool if someone wanted to tackle providing a more
 general mechanism for defining Spark-friendly images that can be
 used more generally. I had thought that docker might be a good way to
 go for something like this - but maybe this packer thing is good too.

 For one thing, if we had a standard image we could use it to create
 containers for running Spark's unit test, which would be really cool.
 This would help a lot with random issues around port and filesystem
 contention we have for unit tests.

 I'm not sure if the long term place for this would be inside the spark
 codebase or a community library or what. But it would definitely be
 very valuable to have if someone wanted to take it on.

 - Patrick

 On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  FYI: There is an existing issue -- SPARK-3314
  https://issues.apache.org/jira/browse/SPARK-3314 -- about scripting
 the
  creation of Spark AMIs.
 
  With Packer, it looks like we may be able to script the creation of
  multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from a
  single Packer template. That's very cool.
 
  I'll be looking into this.
 
  Nick
 
 
  On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:
 
  Thanks for the update, Nate. I'm looking forward to seeing how these
  projects turn out.
 
  David, Packer looks very, very interesting. I'm gonna look into it more
  next week.
 
  Nick
 
 
  On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com wrote:
 
  Bit of progress on our end, bit of lagging as well.  Our guy leading
  effort got little bogged down on client project to update hive/sql
 testbed
  to latest spark/sparkSQL, also launching public service so we have
 been bit
  scattered recently.
 
  Will have some more updates probably after next week.  We are planning
 on
  taking our client work around hive/spark, plus taking over the bigtop
  automation work to modernize and get that fit for human consumption
 outside
  or org.  All our work and puppet modules will be open sourced,
 documented,
  hopefully start to rally some other folks around effort that find it
 useful
 
  Side note, another effort we are looking into is gradle tests/support.
  We have been leveraging serverspec for some basic infrastructure
 tests, but
  with bigtop switching over to gradle builds/testing setup in 0.8 we
 want to
  include support for that in our own efforts, probably some stuff that
 can
  be learned and leveraged in spark world for repeatable/tested
 infrastructure
 
  If anyone has any specific automation questions to your environment you
  can drop me a line directly.., will try to help out best I can.  Else
 will
  post update to dev list once we get on top of our own product release
 and
  the bigtop work
 
  Nate
 
 
  -Original Message-
  From: David Rowe [mailto:davidr...@gmail.com]
  Sent: Thursday, October 02, 2014 4:44 PM
  To: Nicholas Chammas
  Cc: dev; Shivaram Venkataraman
  Subject: Re: EC2 clusters ready in launch time + 30 seconds
 
  I think this is exactly what packer is for. See e.g.
  http://www.packer.io/intro/getting-started/build-image.html
 
  On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*)
 has
  a bad package for httpd, whcih causes ganglia not to start. For some
 reason
  I can't get access to the raw AMI to fix it.
 
  On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas 
  nicholas.cham...@gmail.com
   wrote:
 
   Is there perhaps a way to define an AMI programmatically? Like, a
   collection of base AMI id + list of required stuff to be installed +
   list of required configuration changes. I'm guessing that's what
   people use things like Puppet, Ansible, or maybe also AWS
  CloudFormation for, right?
  
   If we could do something like that, then with every new release of
   Spark we could quickly and easily create new AMIs that have
 everything
  we need.
   spark-ec2 would only have to bring up the instances and do a minimal
   amount of configuration, and the only thing we'd need to track in the
   Spark repo is the code that defines what goes on the AMI, as well as
 a
   list of the AMI ids specific to each release.
  
   I'm just thinking out loud here. Does this make sense?
  
   Nate,
  
   Any

Re: EC2 clusters ready in launch time + 30 seconds

2014-10-03 Thread Nicholas Chammas
FYI: There is an existing issue -- SPARK-3314
https://issues.apache.org/jira/browse/SPARK-3314 -- about scripting the
creation of Spark AMIs.

With Packer, it looks like we may be able to script the creation of
multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from a
single Packer template. That's very cool.

I'll be looking into this.

Nick


On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Thanks for the update, Nate. I'm looking forward to seeing how these
 projects turn out.

 David, Packer looks very, very interesting. I'm gonna look into it more
 next week.

 Nick


 On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com wrote:

 Bit of progress on our end, bit of lagging as well.  Our guy leading
 effort got little bogged down on client project to update hive/sql testbed
 to latest spark/sparkSQL, also launching public service so we have been bit
 scattered recently.

 Will have some more updates probably after next week.  We are planning on
 taking our client work around hive/spark, plus taking over the bigtop
 automation work to modernize and get that fit for human consumption outside
 or org.  All our work and puppet modules will be open sourced, documented,
 hopefully start to rally some other folks around effort that find it useful

 Side note, another effort we are looking into is gradle tests/support.
 We have been leveraging serverspec for some basic infrastructure tests, but
 with bigtop switching over to gradle builds/testing setup in 0.8 we want to
 include support for that in our own efforts, probably some stuff that can
 be learned and leveraged in spark world for repeatable/tested infrastructure

 If anyone has any specific automation questions to your environment you
 can drop me a line directly.., will try to help out best I can.  Else will
 post update to dev list once we get on top of our own product release and
 the bigtop work

 Nate


 -Original Message-
 From: David Rowe [mailto:davidr...@gmail.com]
 Sent: Thursday, October 02, 2014 4:44 PM
 To: Nicholas Chammas
 Cc: dev; Shivaram Venkataraman
 Subject: Re: EC2 clusters ready in launch time + 30 seconds

 I think this is exactly what packer is for. See e.g.
 http://www.packer.io/intro/getting-started/build-image.html

 On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*) has
 a bad package for httpd, whcih causes ganglia not to start. For some reason
 I can't get access to the raw AMI to fix it.

 On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Is there perhaps a way to define an AMI programmatically? Like, a
  collection of base AMI id + list of required stuff to be installed +
  list of required configuration changes. I’m guessing that’s what
  people use things like Puppet, Ansible, or maybe also AWS
 CloudFormation for, right?
 
  If we could do something like that, then with every new release of
  Spark we could quickly and easily create new AMIs that have everything
 we need.
  spark-ec2 would only have to bring up the instances and do a minimal
  amount of configuration, and the only thing we’d need to track in the
  Spark repo is the code that defines what goes on the AMI, as well as a
  list of the AMI ids specific to each release.
 
  I’m just thinking out loud here. Does this make sense?
 
  Nate,
 
  Any progress on your end with this work?
 
  Nick
  ​
 
  On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman 
  shiva...@eecs.berkeley.edu wrote:
 
   It should be possible to improve cluster launch time if we are
   careful about what commands we run during setup. One way to do this
   would be to walk down the list of things we do for cluster
   initialization and see if there is anything we can do make things
   faster. Unfortunately this might
  be
   pretty time consuming, but I don't know of a better strategy. The
   place
  to
   start would be the setup.sh file at
   https://github.com/mesos/spark-ec2/blob/v3/setup.sh
  
   Here are some things that take a lot of time and could be improved:
   1. Creating swap partitions on all machines. We could check if there
   is a way to get EC2 to always mount a swap partition 2. Copying /
   syncing things across slaves. The copy-dir script is called too many
   times right now and each time it pauses for a few milliseconds
   between slaves [1]. This could be improved by removing unnecessary
   copies 3. We could make less frequently used modules like Tachyon,
   persistent
  hdfs
   not a part of the default setup.
  
   [1] https://github.com/mesos/spark-ec2/blob/v3/copy-dir.sh#L42
  
   Thanks
   Shivaram
  
  
  
  
   On Sat, Jul 12, 2014 at 7:02 PM, Nicholas Chammas 
   nicholas.cham...@gmail.com wrote:
  
On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico n...@reactor8.com
  wrote:
   
 Starting to work through some automation/config stuff for spark
 stack
   on
 EC2 with a project, will be focusing the work through

Re: EC2 clusters ready in launch time + 30 seconds

2014-10-02 Thread Nicholas Chammas
Thanks for the update, Nate. I'm looking forward to seeing how these
projects turn out.

David, Packer looks very, very interesting. I'm gonna look into it more
next week.

Nick


On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com wrote:

 Bit of progress on our end, bit of lagging as well.  Our guy leading
 effort got little bogged down on client project to update hive/sql testbed
 to latest spark/sparkSQL, also launching public service so we have been bit
 scattered recently.

 Will have some more updates probably after next week.  We are planning on
 taking our client work around hive/spark, plus taking over the bigtop
 automation work to modernize and get that fit for human consumption outside
 or org.  All our work and puppet modules will be open sourced, documented,
 hopefully start to rally some other folks around effort that find it useful

 Side note, another effort we are looking into is gradle tests/support.  We
 have been leveraging serverspec for some basic infrastructure tests, but
 with bigtop switching over to gradle builds/testing setup in 0.8 we want to
 include support for that in our own efforts, probably some stuff that can
 be learned and leveraged in spark world for repeatable/tested infrastructure

 If anyone has any specific automation questions to your environment you
 can drop me a line directly.., will try to help out best I can.  Else will
 post update to dev list once we get on top of our own product release and
 the bigtop work

 Nate


 -Original Message-
 From: David Rowe [mailto:davidr...@gmail.com]
 Sent: Thursday, October 02, 2014 4:44 PM
 To: Nicholas Chammas
 Cc: dev; Shivaram Venkataraman
 Subject: Re: EC2 clusters ready in launch time + 30 seconds

 I think this is exactly what packer is for. See e.g.
 http://www.packer.io/intro/getting-started/build-image.html

 On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*) has a
 bad package for httpd, whcih causes ganglia not to start. For some reason I
 can't get access to the raw AMI to fix it.

 On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Is there perhaps a way to define an AMI programmatically? Like, a
  collection of base AMI id + list of required stuff to be installed +
  list of required configuration changes. I’m guessing that’s what
  people use things like Puppet, Ansible, or maybe also AWS CloudFormation
 for, right?
 
  If we could do something like that, then with every new release of
  Spark we could quickly and easily create new AMIs that have everything
 we need.
  spark-ec2 would only have to bring up the instances and do a minimal
  amount of configuration, and the only thing we’d need to track in the
  Spark repo is the code that defines what goes on the AMI, as well as a
  list of the AMI ids specific to each release.
 
  I’m just thinking out loud here. Does this make sense?
 
  Nate,
 
  Any progress on your end with this work?
 
  Nick
  ​
 
  On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman 
  shiva...@eecs.berkeley.edu wrote:
 
   It should be possible to improve cluster launch time if we are
   careful about what commands we run during setup. One way to do this
   would be to walk down the list of things we do for cluster
   initialization and see if there is anything we can do make things
   faster. Unfortunately this might
  be
   pretty time consuming, but I don't know of a better strategy. The
   place
  to
   start would be the setup.sh file at
   https://github.com/mesos/spark-ec2/blob/v3/setup.sh
  
   Here are some things that take a lot of time and could be improved:
   1. Creating swap partitions on all machines. We could check if there
   is a way to get EC2 to always mount a swap partition 2. Copying /
   syncing things across slaves. The copy-dir script is called too many
   times right now and each time it pauses for a few milliseconds
   between slaves [1]. This could be improved by removing unnecessary
   copies 3. We could make less frequently used modules like Tachyon,
   persistent
  hdfs
   not a part of the default setup.
  
   [1] https://github.com/mesos/spark-ec2/blob/v3/copy-dir.sh#L42
  
   Thanks
   Shivaram
  
  
  
  
   On Sat, Jul 12, 2014 at 7:02 PM, Nicholas Chammas 
   nicholas.cham...@gmail.com wrote:
  
On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico n...@reactor8.com
  wrote:
   
 Starting to work through some automation/config stuff for spark
 stack
   on
 EC2 with a project, will be focusing the work through the apache
  bigtop
 effort to start, can then share with spark community directly as
  things
 progress if people are interested
   
   
Let us know how that goes. I'm definitely interested in hearing more.
   
Nick
   
  
 




Re: amplab jenkins is down

2014-10-01 Thread Nicholas Chammas
On Thu, Sep 4, 2014 at 4:19 PM, shane knapp skn...@berkeley.edu wrote:

 on a side note, this incident will be accelerating our plan to move the
 entire jenkins infrastructure in to a managed datacenter environment.
 this
 will be our major push over the next couple of weeks.  more details about
 this, also, as soon as i get them.


Are there any updates on this move of the Jenkins infrastructure to a
managed datacenter?

I remember it being mentioned that another benefit of this move would be
reduced flakiness when Jenkins tries to checkout patches for testing. For
some reason, I'm getting a lot of those
https://github.com/apache/spark/pull/2606#issuecomment-57514540 today.

Nick


Re: do MIMA checking before all test cases start?

2014-10-01 Thread Nicholas Chammas
How early can MiMa checks be run? Before Spark is even built
https://github.com/apache/spark/blob/8cc70e7e15fd800f31b94e9102069506360289db/dev/run-tests#L118?
After the build but before the unit tests?

On Thu, Sep 25, 2014 at 6:06 PM, Patrick Wendell pwend...@gmail.com wrote:

 Yeah we can also move it first. Wouldn't hurt.

 On Thu, Sep 25, 2014 at 6:39 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  It might still make sense to make this change if MIMA checks are always
  relatively quick, for the same reason we do style checks first.
 
  On Thu, Sep 25, 2014 at 12:25 AM, Nan Zhu zhunanmcg...@gmail.com
 wrote:
 
  yeah, I tried that, but there is always an issue when I ran dev/mima,
 
  it always gives me some binary compatibility error on Java API part
 
  so I have to wait for Jenkins' result when fixing MIMA issues
 
  --
  Nan Zhu
 
 
  On Thursday, September 25, 2014 at 12:04 AM, Patrick Wendell wrote:
 
   Have you considered running the mima checks locally? We prefer people
   not use Jenkins for very frequent checks since it takes resources away
   from other people trying to run tests.
  
   On Wed, Sep 24, 2014 at 6:44 PM, Nan Zhu zhunanmcg...@gmail.com
   (mailto:zhunanmcg...@gmail.com) wrote:
Hi, all
   
It seems that, currently, Jenkins makes MIMA checking after all test
cases have finished, IIRC, during the first months we introduced
 MIMA, we do
the MIMA checking before running test cases
   
What's the motivation to adjust this behaviour?
   
In my opinion, if you have some binary compatibility issues, you
 just
need to do some minor changes, but in the current environment, you
 can only
get if your change works after all test cases finished (1 hour
 later...)
   
Best,
   
--
Nan Zhu
   
  
  
  
 
 
 



Re: Extending Scala style checks

2014-10-01 Thread Nicholas Chammas
Ah, since there appears to be a built-in rule for end-of-line whitespace,
Michael and Cheng, y'all should be able to add this in pretty easily.

Nick

On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Nick,

 We can always take built-in rules. Back when we added this Prashant
 Sharma actually did some great work that lets us write our own style
 rules in cases where rules don't exist.

 You can see some existing rules here:

 https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle

 Prashant has over time contributed a lot of our custom rules upstream
 to stalastyle, so now there are only a couple there.

 - Patrick

 On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote:
  Please take a look at WhitespaceEndOfLineChecker under:
  http://www.scalastyle.org/rules-0.1.0.html
 
  Cheers
 
  On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:
 
  As discussed here https://github.com/apache/spark/pull/2619, it
 would be
  good to extend our Scala style checks to programmatically enforce as
 many
  of our style rules as possible.
 
  Does anyone know if it's relatively straightforward to enforce
 additional
  rules like the no trailing spaces rule mentioned in the linked PR?
 
  Nick
 



Re: Extending Scala style checks

2014-10-01 Thread Nicholas Chammas
Yeah, I remember that hell when I added PEP 8 to the build checks and fixed
all the outstanding Python style issues. I had to keep rebasing and
resolving merge conflicts until the PR was merged.

It's a rough process, but thankfully it's also a one-time process. I might
be able to help with that in the next week or two if no-one else wants to
pick it up.

Nick

On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust mich...@databricks.com
wrote:

 The hard part here is updating the existing code base... which is going to
 create merge conflicts with like all of the open PRs...

 On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Ah, since there appears to be a built-in rule for end-of-line whitespace,
 Michael and Cheng, y'all should be able to add this in pretty easily.

 Nick

 On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey Nick,
 
  We can always take built-in rules. Back when we added this Prashant
  Sharma actually did some great work that lets us write our own style
  rules in cases where rules don't exist.
 
  You can see some existing rules here:
 
 
 https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle
 
  Prashant has over time contributed a lot of our custom rules upstream
  to stalastyle, so now there are only a couple there.
 
  - Patrick
 
  On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote:
   Please take a look at WhitespaceEndOfLineChecker under:
   http://www.scalastyle.org/rules-0.1.0.html
  
   Cheers
  
   On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com
   wrote:
  
   As discussed here https://github.com/apache/spark/pull/2619, it
  would be
   good to extend our Scala style checks to programmatically enforce as
  many
   of our style rules as possible.
  
   Does anyone know if it's relatively straightforward to enforce
  additional
   rules like the no trailing spaces rule mentioned in the linked PR?
  
   Nick
  
 





Re: Extending Scala style checks

2014-10-01 Thread Nicholas Chammas
Does anyone know if Scala has something equivalent to autopep8
https://pypi.python.org/pypi/autopep8? It would help patch up the
existing code base a lot quicker as we add in new style rules.
​

On Wed, Oct 1, 2014 at 9:24 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Yeah, I remember that hell when I added PEP 8 to the build checks and
 fixed all the outstanding Python style issues. I had to keep rebasing and
 resolving merge conflicts until the PR was merged.

 It's a rough process, but thankfully it's also a one-time process. I might
 be able to help with that in the next week or two if no-one else wants to
 pick it up.

 Nick

 On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust mich...@databricks.com
 wrote:

 The hard part here is updating the existing code base... which is going
 to create merge conflicts with like all of the open PRs...

 On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Ah, since there appears to be a built-in rule for end-of-line whitespace,
 Michael and Cheng, y'all should be able to add this in pretty easily.

 Nick

 On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey Nick,
 
  We can always take built-in rules. Back when we added this Prashant
  Sharma actually did some great work that lets us write our own style
  rules in cases where rules don't exist.
 
  You can see some existing rules here:
 
 
 https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle
 
  Prashant has over time contributed a lot of our custom rules upstream
  to stalastyle, so now there are only a couple there.
 
  - Patrick
 
  On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu yuzhih...@gmail.com wrote:
   Please take a look at WhitespaceEndOfLineChecker under:
   http://www.scalastyle.org/rules-0.1.0.html
  
   Cheers
  
   On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com
   wrote:
  
   As discussed here https://github.com/apache/spark/pull/2619, it
  would be
   good to extend our Scala style checks to programmatically enforce as
  many
   of our style rules as possible.
  
   Does anyone know if it's relatively straightforward to enforce
  additional
   rules like the no trailing spaces rule mentioned in the linked PR?
  
   Nick
  
 






thank you for reviewing our patches

2014-09-26 Thread Nicholas Chammas
I recently came across this mailing list post by Linus Torvalds
https://lkml.org/lkml/2004/12/20/255 about the value of reviewing even
“trivial” patches. The following passages stood out to me:

I think that much more important than the patch is the fact that people get
used to the notion that they can change the kernel

…

So please don’t stop. Yes, those trivial patches *are* a bother. Damn, they
are *horrible*. But at the same time, the devil is in the detail, and they
are needed in the long run. Both the patches themselves, and the people
that grew up on them.

Spark is the first (and currently only) open source project I contribute
regularly to. My first several PRs against the project, as simple as they
were, were definitely patches that I “grew up on”.

I appreciate the time and effort all the reviewers I’ve interacted with
have taken to work with me on my PRs, even when they are “trivial”. And I’m
sure that as I continue to contribute to this project there will be many
more patches that I will “grow up on”.

Thank you Patrick, Reynold, Josh, Davies, Michael, and everyone else who’s
taken time to review one of my patches. I appreciate it!

Nick
​


Re: Spark SQL use of alias in where clause

2014-09-25 Thread Nicholas Chammas
That is correct. Aliases in the SELECT clause can only be referenced in the
ORDER BY and HAVING clauses. Otherwise, you'll have to just repeat the
statement, like concat() in this case.

A more elegant alternative, which is probably not available in Spark SQL
yet, is to use Common Table Expressions
http://technet.microsoft.com/en-us/library/ms190766(v=sql.105).aspx.

On Wed, Sep 24, 2014 at 11:32 PM, Yanbo Liang yanboha...@gmail.com wrote:

 Maybe it's the way SQL works.
 The select part is executed after the where filter is applied, so you
 cannot use alias declared in select part in where clause.
 Hive and Oracle behavior the same as Spark SQL.

 2014-09-25 8:58 GMT+08:00 Du Li l...@yahoo-inc.com.invalid:

   Hi,

  The following query does not work in Shark nor in the new Spark
 SQLContext or HiveContext.
 SELECT key, value, concat(key, value) as combined from src where combined
 like ’11%’;

  The following tweak of syntax works fine although a bit ugly.
 SELECT key, value, concat(key, value) as combined from src where
 concat(key,value) like ’11%’ order by combined;

  Are you going to support alias in where clause soon?

  Thanks,
 Du





Re: Tests and Test Infrastructure

2014-09-14 Thread Nicholas Chammas
I fully support this. A smoothly running test infrastructure helps
everybody’s work just flow better.

The Jenkins Pull Request Builder is mostly functioning
again. However, we are working on a simpler technical pipeline for
testing patches, as this plug-in has been a constant source of
downtime and issues for us, and is very hard to debug.

Yep. One such issue that happens too often is that Jenkins simply fails to
fetch from git. Hopefully a new pipeline will be able to fetch more
reliably.

flaky tests

Dunno if these were some of the ones recently fixed, but the flakiest tests
seem to be the Kafka and Flume tests in Spark Streaming, based purely on my
subjective experience. It would be great if we could stabilize them!

Time of tests

PSA: Here are some related JIRA issues for those interested in working on
our testing setup:

   - SPARK-3431: Parallelize execution of tests
   https://issues.apache.org/jira/browse/SPARK-3431
   - SPARK-3432: Fix logging of unit test execution time
   https://issues.apache.org/jira/browse/SPARK-3432

Nick
​

On Sun, Sep 14, 2014 at 2:20 AM, Josh Rosen rosenvi...@gmail.com wrote:

 Also, huge thanks to Cheng Lian, who tracked down and fixed the final
 issue that was causing the Maven master build’s Spark SQL tests to fail!

 On September 13, 2014 at 11:08:00 PM, Patrick Wendell (pwend...@gmail.com)
 wrote:
 Hey All,

 Wanted to send a quick update about test infrastructure. With the
 number of contributors we have and the rate of development,
 maintaining a well-oiled test infra is really important.

 Every time a flaky test fails a legitimate pull request, it wastes
 developer time and effort.

 1. Master build: Spark's master builds are back to green again in
 Maven and SBT after a long time of instability. Big thanks to Josh
 Rosen, Andrew Or, Nick Chammas, Shane Knapp, Sean Owen, and many
 others who were involved in pinpointing and fixing fairly convoluted
 test failure issues.

 2. Jenkins PRB: The Jenkins Pull Request Builder is mostly functioning
 again. However, we are working on a simpler technical pipeline for
 testing patches, as this plug-in has been a constant source of
 downtime and issues for us, and is very hard to debug.

 3. Reverting flaky patches: Going forward - we may revert patches that
 seem to be the root cause of flaky or failing tests. This is necessary
 as these days, the test infra being down will block something like
 10-30 in-flight patches on a given day. This puts the onus back on the
 test writer to try and figure out what's going on - we'll of course
 help debug the issue!

 4. Time of tests: With hundreds (thousands?) of tests, we will have a
 very high bar for tests which take several seconds or longer. Things
 like Thread.sleep() bloat test time when proper synchronization
 mechanisms should be used. Expect reviewers to push back on any
 long-running tests, in many cases they can be re-written to be both
 shorter and better.

 Thanks again to everyone putting in effort on this, we've made a ton
 of progress in the last few weeks. A solid test infra will help us
 scale and move quickly as Spark development continues to accelerate.

 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




don't trigger tests when only .md files are changed

2014-09-12 Thread Nicholas Chammas
Would it make sense to have Jenkins *not* trigger tests when the only files
that have changed are .md files (example
https://github.com/apache/spark/pull/2367)? Those don’t even need RAT
checks, right?

I can make this change if it makes sense.

Nick
​


Re: don't trigger tests when only .md files are changed

2014-09-12 Thread Nicholas Chammas
We could still have Jenkins post a message to the effect of “this patch
only modifies .md files; no tests will be run”.
​

On Fri, Sep 12, 2014 at 3:48 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Would it make sense to have Jenkins *not* trigger tests when the only
 files that have changed are .md files (example
 https://github.com/apache/spark/pull/2367)? Those don’t even need RAT
 checks, right?

 I can make this change if it makes sense.

 Nick
 ​



Re: Announcing Spark 1.1.0!

2014-09-11 Thread Nicholas Chammas
Nice work everybody! I'm looking forward to trying out this release!

On Thu, Sep 11, 2014 at 8:12 PM, Patrick Wendell pwend...@gmail.com wrote:

 I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is
 the second release on the API-compatible 1.X line. It is Spark's
 largest release ever, with contributions from 171 developers!

 This release brings operational and performance improvements in Spark
 core including a new implementation of the Spark shuffle designed for
 very large scale workloads. Spark 1.1 adds significant extensions to
 the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a
 JDBC server, byte code generation for fast expression evaluation, a
 public types API, JSON support, and other features and optimizations.
 MLlib introduces a new statistics library along with several new
 algorithms and optimizations. Spark 1.1 also builds out Spark's Python
 support and adds new components to the Spark Streaming module.

 Visit the release notes [1] to read about the new features, or
 download [2] the release today.

 [1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html
 [2] http://spark.eu.apache.org/downloads.html

 NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL
 HOURS.

 Please e-mail me directly for any type-o's in the release notes or name
 listing.

 Thanks, and congratulations!
 - Patrick

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)

2014-09-10 Thread Nicholas Chammas
I'm looking forward to this. :)

Looks like Jenkins is having trouble triggering builds for new commits or
after user requests (e.g.
https://github.com/apache/spark/pull/2339#issuecomment-55165937).
Hopefully that will be resolved tomorrow.

Nick

On Tue, Sep 9, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu wrote:

 since the power incident last thursday, the github pull request builder
 plugin is still not really working 100%.  i found an open issue
 w/jenkins[1] that could definitely be affecting us, i will be pausing
 builds early thursday morning and then restarting jenkins.
 i'll send out a reminder tomorrow, and if this causes any problems for you,
 please let me know and we can work out a better time.

 but, now for some good news!  yesterday morning, we racked and stacked the
 systems for the new jenkins instance in the berkeley datacenter.  tomorrow
 i should be able to log in to them and start getting them set up and
 configured.  this is a major step in getting us in to a much more
 'production' style environment!

 anyways:  thanks for your patience, and i think we've all learned that hard
 powering down your build system is a definite recipe for disaster.  :)

 shane

 [1] -- https://issues.jenkins-ci.org/browse/JENKINS-22509



Re: jenkins failed all tests?

2014-09-07 Thread Nicholas Chammas
Yeah, it feels like Jenkins has become a lot more flaky recently. Or maybe
it’s just our tests.

Here are some more examples:

   - https://github.com/apache/spark/pull/2310#issuecomment-54741169
   - https://github.com/apache/spark/pull/2313#issuecomment-54752766

Nick
​

On Sun, Sep 7, 2014 at 4:54 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 It seems that I’m not the only one

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/

 Best,

 --
 Nan Zhu


 On Sunday, September 7, 2014 at 4:52 PM, Nan Zhu wrote:

  Hi, Sean,
 
  Thanks for the reply
 
  Here are the updated files:
 
  https://github.com/apache/spark/pull/2312/files
 
  just two md files...
 
  Best,
 
  --
  Nan Zhu
 
 
  On Sunday, September 7, 2014 at 4:30 PM, Sean Owen wrote:
 
   It would help to point to your change. Are you sure it was only docs
   and are you sure you're rebased, submitting against the right branch?
   Jenkins is saying you are changing public APIs; it's not reporting
   test failures. But it could well be a test/Jenkins problem.
  
   On Sun, Sep 7, 2014 at 8:39 PM, Nan Zhu zhunanmcg...@gmail.com
 (mailto:zhunanmcg...@gmail.com) wrote:
Hi, all
   
I just modified some document,
   
but still failed to pass tests?
   
   
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19950/consoleFull
   
Anyone can look at the problem?
   
Best,
   
--
Nan Zhu
   
  
  
  
  
 
 




trimming unnecessary test output

2014-09-06 Thread Nicholas Chammas
Continuing the discussion started here
https://github.com/apache/spark/pull/2279, I’m wondering if people
already know that certain test output is unnecessary and should be trimmed.

For example
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19917/consoleFull,
I see a bunch of lines like this:

14/09/06 07:54:13 INFO GenerateProjection: Code generated expression
List(IS NOT NULL 1) in 128.33733 ms

Can/should this type of output be suppressed? Is there any other test
output that is obviously more noise than signal?

Nick
​


Scala's Jenkins setup looks neat

2014-09-06 Thread Nicholas Chammas
After reading Erik's email, I found this Scala PR
https://github.com/scala/scala/pull/3963 and immediately noticed a few
cool things:

   - Jenkins is hooked directly into GitHub somehow, so you get the All is
   well message in the merge status window, presumably based on the last test
   status
   - Jenkins is also tagging the PR based on its test status or need for
   review
   - Jenkins is also tagging the PR for a specific milestone

Do any of these things make sense to add to our setup? Or perhaps something
inspired by these features?

Nick


Re: Scala's Jenkins setup looks neat

2014-09-06 Thread Nicholas Chammas
Aww, that's a bummer...


On Sat, Sep 6, 2014 at 1:10 PM, Reynold Xin r...@databricks.com wrote:

 that would require github hooks permission and unfortunately asf infra
 wouldn't allow that.

 Maybe they will change their mind one day, but so far we asked about this
 and the answer has been no for security reasons.

 On Saturday, September 6, 2014, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 After reading Erik's email, I found this Scala PR
 https://github.com/scala/scala/pull/3963 and immediately noticed a few
 cool things:

- Jenkins is hooked directly into GitHub somehow, so you get the All
 is
well message in the merge status window, presumably based on the last
 test
status
- Jenkins is also tagging the PR based on its test status or need for
review
- Jenkins is also tagging the PR for a specific milestone

 Do any of these things make sense to add to our setup? Or perhaps
 something
 inspired by these features?

 Nick




Re: amplab jenkins is down

2014-09-05 Thread Nicholas Chammas
Hmm, looks like at least some builds
https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19804/consoleFull
are working now, though this last one was from ~5 hours ago.


On Fri, Sep 5, 2014 at 1:02 AM, shane knapp skn...@berkeley.edu wrote:

 yep.  that's exactly the behavior i saw earlier, and will be figuring out
 first thing tomorrow morning.  i bet it's an environment issues on the
 slaves.


 On Thu, Sep 4, 2014 at 7:10 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Looks like during the last build
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19797/console
 Jenkins was unable to execute a git fetch?


 On Thu, Sep 4, 2014 at 7:58 PM, shane knapp skn...@berkeley.edu wrote:

 i'm going to restart jenkins and see if that fixes things.


 On Thu, Sep 4, 2014 at 4:56 PM, shane knapp skn...@berkeley.edu wrote:

 looking


 On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 It appears that our main man is having trouble
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/
  hearing new requests
 https://github.com/apache/spark/pull/2277#issuecomment-54549106.

 Do we need some smelling salts?


 On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu
 wrote:

 i'd ping the Jenkinsmench...  the master was completely offline, so
 any new
 jobs wouldn't have reached it.  any jobs that were queued when power
 was
 lost probably started up, but jobs that were running would fail.


 On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Woohoo! Thanks Shane.
 
  Do you know if queued PR builds will automatically be picked up? Or
 do we
  have to ping the Jenkinmensch manually from each PR?
 
  Nick
 
 
  On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  AND WE'RE UP!
 
  sorry that this took so long...  i'll send out a more detailed
 explanation
  of what happened soon.
 
  now, off to back up jenkins.
 
  shane
 
 
  On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu
 wrote:
 
   it's a faulty power switch on the firewall, which has been
 swapped out.
we're about to reboot and be good to go.
  
  
   On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu
 
  wrote:
  
   looks like some hardware failed, and we're swapping in a
 replacement.
  i
   don't have more specific information yet -- including *what*
 failed,
  as our
   sysadmin is super busy ATM.  the root cause was an incorrect
 circuit
  being
   switched off during building maintenance.
  
   on a side note, this incident will be accelerating our plan to
 move the
   entire jenkins infrastructure in to a managed datacenter
 environment.
  this
   will be our major push over the next couple of weeks.  more
 details
  about
   this, also, as soon as i get them.
  
   i'm very sorry about the downtime, we'll get everything up and
 running
   ASAP.
  
  
   On Thu, Sep 4, 2014 at 12:27 PM, shane knapp 
 skn...@berkeley.edu
  wrote:
  
   looks like a power outage in soda hall.  more updates as they
 happen.
  
  
   On Thu, Sep 4, 2014 at 12:25 PM, shane knapp 
 skn...@berkeley.edu
   wrote:
  
   i am trying to get things up and running, but it looks like
 either
  the
   firewall gateway or jenkins server itself is down.  i'll
 update as
  soon as
   i know more.
  
  
  
  
  
 
 
   --
  You received this message because you are subscribed to the Google
 Groups
  amp-infra group.
  To unsubscribe from this group and stop receiving emails from it,
 send an
  email to amp-infra+unsubscr...@googlegroups.com.
  For more options, visit https://groups.google.com/d/optout.
 









Re: amplab jenkins is down

2014-09-05 Thread Nicholas Chammas
How's it going?

It looks like during the last build
https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/lastBuild/console
from about 30 min ago Jenkins was still having trouble fetching from
GitHub. It also looks like not all requests for testing are triggering
builds.


On Fri, Sep 5, 2014 at 1:23 PM, shane knapp skn...@berkeley.edu wrote:

 it's looking like everything except the pull request builders are working.
  i'm going to be working on getting this resolved today.


 On Fri, Sep 5, 2014 at 8:18 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Hmm, looks like at least some builds
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19804/consoleFull
 are working now, though this last one was from ~5 hours ago.


 On Fri, Sep 5, 2014 at 1:02 AM, shane knapp skn...@berkeley.edu wrote:

 yep.  that's exactly the behavior i saw earlier, and will be figuring
 out first thing tomorrow morning.  i bet it's an environment issues on the
 slaves.


 On Thu, Sep 4, 2014 at 7:10 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Looks like during the last build
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19797/console
 Jenkins was unable to execute a git fetch?


 On Thu, Sep 4, 2014 at 7:58 PM, shane knapp skn...@berkeley.edu
 wrote:

 i'm going to restart jenkins and see if that fixes things.


 On Thu, Sep 4, 2014 at 4:56 PM, shane knapp skn...@berkeley.edu
 wrote:

 looking


 On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 It appears that our main man is having trouble
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/
  hearing new requests
 https://github.com/apache/spark/pull/2277#issuecomment-54549106.

 Do we need some smelling salts?


 On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu
 wrote:

 i'd ping the Jenkinsmench...  the master was completely offline, so
 any new
 jobs wouldn't have reached it.  any jobs that were queued when
 power was
 lost probably started up, but jobs that were running would fail.


 On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Woohoo! Thanks Shane.
 
  Do you know if queued PR builds will automatically be picked up?
 Or do we
  have to ping the Jenkinmensch manually from each PR?
 
  Nick
 
 
  On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  AND WE'RE UP!
 
  sorry that this took so long...  i'll send out a more detailed
 explanation
  of what happened soon.
 
  now, off to back up jenkins.
 
  shane
 
 
  On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu
 wrote:
 
   it's a faulty power switch on the firewall, which has been
 swapped out.
we're about to reboot and be good to go.
  
  
   On Thu, Sep 4, 2014 at 1:19 PM, shane knapp 
 skn...@berkeley.edu
  wrote:
  
   looks like some hardware failed, and we're swapping in a
 replacement.
  i
   don't have more specific information yet -- including *what*
 failed,
  as our
   sysadmin is super busy ATM.  the root cause was an incorrect
 circuit
  being
   switched off during building maintenance.
  
   on a side note, this incident will be accelerating our plan
 to move the
   entire jenkins infrastructure in to a managed datacenter
 environment.
  this
   will be our major push over the next couple of weeks.  more
 details
  about
   this, also, as soon as i get them.
  
   i'm very sorry about the downtime, we'll get everything up
 and running
   ASAP.
  
  
   On Thu, Sep 4, 2014 at 12:27 PM, shane knapp 
 skn...@berkeley.edu
  wrote:
  
   looks like a power outage in soda hall.  more updates as
 they happen.
  
  
   On Thu, Sep 4, 2014 at 12:25 PM, shane knapp 
 skn...@berkeley.edu
   wrote:
  
   i am trying to get things up and running, but it looks like
 either
  the
   firewall gateway or jenkins server itself is down.  i'll
 update as
  soon as
   i know more.
  
  
  
  
  
 
 
   --
  You received this message because you are subscribed to the
 Google Groups
  amp-infra group.
  To unsubscribe from this group and stop receiving emails from it,
 send an
  email to amp-infra+unsubscr...@googlegroups.com.
  For more options, visit https://groups.google.com/d/optout.
 











Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-04 Thread Nicholas Chammas
On Thu, Sep 4, 2014 at 1:50 PM, Gurvinder Singh gurvinder.si...@uninett.no
wrote:

 There is a regression when using pyspark to read data
 from HDFS.


Could you open a JIRA http://issues.apache.org/jira/ with a brief repro?
We'll look into it.

(You could also provide a repro in a separate thread.)

Nick


Re: amplab jenkins is down

2014-09-04 Thread Nicholas Chammas
Woohoo! Thanks Shane.

Do you know if queued PR builds will automatically be picked up? Or do we
have to ping the Jenkinmensch manually from each PR?

Nick


On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu wrote:

 AND WE'RE UP!

 sorry that this took so long...  i'll send out a more detailed explanation
 of what happened soon.

 now, off to back up jenkins.

 shane


 On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu wrote:

  it's a faulty power switch on the firewall, which has been swapped out.
   we're about to reboot and be good to go.
 
 
  On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu wrote:
 
  looks like some hardware failed, and we're swapping in a replacement.  i
  don't have more specific information yet -- including *what* failed, as
 our
  sysadmin is super busy ATM.  the root cause was an incorrect circuit
 being
  switched off during building maintenance.
 
  on a side note, this incident will be accelerating our plan to move the
  entire jenkins infrastructure in to a managed datacenter environment.
 this
  will be our major push over the next couple of weeks.  more details
 about
  this, also, as soon as i get them.
 
  i'm very sorry about the downtime, we'll get everything up and running
  ASAP.
 
 
  On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  looks like a power outage in soda hall.  more updates as they happen.
 
 
  On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu
  wrote:
 
  i am trying to get things up and running, but it looks like either the
  firewall gateway or jenkins server itself is down.  i'll update as
 soon as
  i know more.
 
 
 
 
 



Re: amplab jenkins is down

2014-09-04 Thread Nicholas Chammas
It appears that our main man is having trouble
https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/
 hearing new requests
https://github.com/apache/spark/pull/2277#issuecomment-54549106.

Do we need some smelling salts?


On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu wrote:

 i'd ping the Jenkinsmench...  the master was completely offline, so any new
 jobs wouldn't have reached it.  any jobs that were queued when power was
 lost probably started up, but jobs that were running would fail.


 On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Woohoo! Thanks Shane.
 
  Do you know if queued PR builds will automatically be picked up? Or do we
  have to ping the Jenkinmensch manually from each PR?
 
  Nick
 
 
  On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu wrote:
 
  AND WE'RE UP!
 
  sorry that this took so long...  i'll send out a more detailed
 explanation
  of what happened soon.
 
  now, off to back up jenkins.
 
  shane
 
 
  On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu
 wrote:
 
   it's a faulty power switch on the firewall, which has been swapped
 out.
we're about to reboot and be good to go.
  
  
   On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu
  wrote:
  
   looks like some hardware failed, and we're swapping in a replacement.
  i
   don't have more specific information yet -- including *what* failed,
  as our
   sysadmin is super busy ATM.  the root cause was an incorrect circuit
  being
   switched off during building maintenance.
  
   on a side note, this incident will be accelerating our plan to move
 the
   entire jenkins infrastructure in to a managed datacenter environment.
  this
   will be our major push over the next couple of weeks.  more details
  about
   this, also, as soon as i get them.
  
   i'm very sorry about the downtime, we'll get everything up and
 running
   ASAP.
  
  
   On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu
  wrote:
  
   looks like a power outage in soda hall.  more updates as they
 happen.
  
  
   On Thu, Sep 4, 2014 at 12:25 PM, shane knapp skn...@berkeley.edu
   wrote:
  
   i am trying to get things up and running, but it looks like either
  the
   firewall gateway or jenkins server itself is down.  i'll update as
  soon as
   i know more.
  
  
  
  
  
 
 
   --
  You received this message because you are subscribed to the Google Groups
  amp-infra group.
  To unsubscribe from this group and stop receiving emails from it, send an
  email to amp-infra+unsubscr...@googlegroups.com.
  For more options, visit https://groups.google.com/d/optout.
 



Re: amplab jenkins is down

2014-09-04 Thread Nicholas Chammas
Looks like during the last build
https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/19797/console
Jenkins was unable to execute a git fetch?


On Thu, Sep 4, 2014 at 7:58 PM, shane knapp skn...@berkeley.edu wrote:

 i'm going to restart jenkins and see if that fixes things.


 On Thu, Sep 4, 2014 at 4:56 PM, shane knapp skn...@berkeley.edu wrote:

 looking


 On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 It appears that our main man is having trouble
 https://amplab.cs.berkeley.edu/jenkins/view/Pull%20Request%20Builders/job/SparkPullRequestBuilder/
  hearing new requests
 https://github.com/apache/spark/pull/2277#issuecomment-54549106.

 Do we need some smelling salts?


 On Thu, Sep 4, 2014 at 5:49 PM, shane knapp skn...@berkeley.edu wrote:

 i'd ping the Jenkinsmench...  the master was completely offline, so any
 new
 jobs wouldn't have reached it.  any jobs that were queued when power was
 lost probably started up, but jobs that were running would fail.


 On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Woohoo! Thanks Shane.
 
  Do you know if queued PR builds will automatically be picked up? Or
 do we
  have to ping the Jenkinmensch manually from each PR?
 
  Nick
 
 
  On Thu, Sep 4, 2014 at 5:37 PM, shane knapp skn...@berkeley.edu
 wrote:
 
  AND WE'RE UP!
 
  sorry that this took so long...  i'll send out a more detailed
 explanation
  of what happened soon.
 
  now, off to back up jenkins.
 
  shane
 
 
  On Thu, Sep 4, 2014 at 1:27 PM, shane knapp skn...@berkeley.edu
 wrote:
 
   it's a faulty power switch on the firewall, which has been swapped
 out.
we're about to reboot and be good to go.
  
  
   On Thu, Sep 4, 2014 at 1:19 PM, shane knapp skn...@berkeley.edu
  wrote:
  
   looks like some hardware failed, and we're swapping in a
 replacement.
  i
   don't have more specific information yet -- including *what*
 failed,
  as our
   sysadmin is super busy ATM.  the root cause was an incorrect
 circuit
  being
   switched off during building maintenance.
  
   on a side note, this incident will be accelerating our plan to
 move the
   entire jenkins infrastructure in to a managed datacenter
 environment.
  this
   will be our major push over the next couple of weeks.  more
 details
  about
   this, also, as soon as i get them.
  
   i'm very sorry about the downtime, we'll get everything up and
 running
   ASAP.
  
  
   On Thu, Sep 4, 2014 at 12:27 PM, shane knapp skn...@berkeley.edu
 
  wrote:
  
   looks like a power outage in soda hall.  more updates as they
 happen.
  
  
   On Thu, Sep 4, 2014 at 12:25 PM, shane knapp 
 skn...@berkeley.edu
   wrote:
  
   i am trying to get things up and running, but it looks like
 either
  the
   firewall gateway or jenkins server itself is down.  i'll update
 as
  soon as
   i know more.
  
  
  
  
  
 
 
   --
  You received this message because you are subscribed to the Google
 Groups
  amp-infra group.
  To unsubscribe from this group and stop receiving emails from it,
 send an
  email to amp-infra+unsubscr...@googlegroups.com.
  For more options, visit https://groups.google.com/d/optout.
 







Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Nicholas Chammas
On Wed, Sep 3, 2014 at 3:24 AM, Patrick Wendell pwend...@gmail.com wrote:

 == What default changes should I be aware of? ==
 1. The default value of spark.io.compression.codec is now snappy
 -- Old behavior can be restored by switching to lzf

 2. PySpark now performs external spilling during aggregations.
 -- Old behavior can be restored by setting spark.shuffle.spill to
 false.

 3. PySpark uses a new heuristic for determining the parallelism of
 shuffle operations.
 -- Old behavior can be restored by setting
 spark.default.parallelism to the number of cores in the cluster.


Will these changes be called out in the release notes or somewhere in the
docs?

That last one (which I believe is what we discovered as the result of
SPARK- https://issues.apache.org/jira/browse/SPARK-) could have a
large impact on PySpark users.

Nick


spark-ec2 depends on stuff in the Mesos repo

2014-09-03 Thread Nicholas Chammas
Spawned by this discussion
https://github.com/apache/spark/pull/1120#issuecomment-54305831.

See these 2 lines in spark_ec2.py:

   - spark_ec2 L42
   
https://github.com/apache/spark/blob/6a72a36940311fcb3429bd34c8818bc7d513115c/ec2/spark_ec2.py#L42
   - spark_ec2 L566
   
https://github.com/apache/spark/blob/6a72a36940311fcb3429bd34c8818bc7d513115c/ec2/spark_ec2.py#L566

Why does the spark-ec2 script depend on stuff in the Mesos repo? Should
they be moved to the Spark repo?

Nick
​


Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Nicholas Chammas
Hi Shane!

Thank you for doing the Jenkins upgrade last week. It's nice to know that
infrastructure is gonna get some dedicated TLC going forward.

Welcome aboard!

Nick


On Tue, Sep 2, 2014 at 1:35 PM, shane knapp skn...@berkeley.edu wrote:

 so, i had a meeting w/the databricks guys on friday and they recommended i
 send an email out to the list to say 'hi' and give you guys a quick intro.
  :)

 hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
 time getting the jenkins build infrastructure up to production quality.
  much of this will be 'under the covers' work, like better system level
 auth, backups, etc, but some will definitely be user facing:  timely
 jenkins updates, debugging broken build infrastructure and some plugin
 support.

 i've been working in the bay area now since 1997 at many different
 companies, and my last 10 years has been split between google and palantir.
  i'm a huge proponent of OSS, and am really happy to be able to help with
 the work you guys are doing!

 if anyone has any requests/questions/comments, feel free to drop me a line!

 shane



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Nicholas Chammas
In light of the discussion on SPARK-, I'll revoke my -1 vote. The
issue does not appear to be serious.


On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 -1: I believe I've found a regression from 1.0.2. The report is captured
 in SPARK- https://issues.apache.org/jira/browse/SPARK-.


 On Sat, Aug 30, 2014 at 6:07 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.1.0!

 The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.1.0-rc3/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1030/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/

 Please vote on releasing this package as Apache Spark 1.1.0!

 The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.1.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == Regressions fixed since RC1 ==
 - Build issue for SQL support:
 https://issues.apache.org/jira/browse/SPARK-3234
 - EC2 script version bump to 1.1.0.

 == What justifies a -1 vote for this release? ==
 This vote is happening very late into the QA period compared with
 previous votes, so -1 votes should only occur for significant
 regressions from 1.0.2. Bugs already present in 1.0.X will not block
 this release.

 == What default changes should I be aware of? ==
 1. The default value of spark.io.compression.codec is now snappy
 -- Old behavior can be restored by switching to lzf

 2. PySpark now performs external spilling during aggregations.
 -- Old behavior can be restored by setting spark.shuffle.spill to
 false.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Run the Big Data Benchmark for new releases

2014-09-01 Thread Nicholas Chammas
What do people think of running the Big Data Benchmark
https://amplab.cs.berkeley.edu/benchmark/ (repo
https://github.com/amplab/benchmark) as part of preparing every new
release of Spark?

We'd run it just for Spark and effectively use it as another type of test
to track any performance progress or regressions from release to release.

Would doing such a thing be valuable? Do we already have a way of
benchmarking Spark performance that we use regularly?

Nick


Re: Run the Big Data Benchmark for new releases

2014-09-01 Thread Nicholas Chammas
Oh, that's sweet. So, a related question then.

Did those tests pick up the performance issue reported in SPARK-
https://issues.apache.org/jira/browse/SPARK-? Does it make sense to
add a new test to cover that case?


On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi Nicholas,

 At Databricks we already run https://github.com/databricks/spark-perf for
 each release, which is a more comprehensive performance test suite.

 Matei

 On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

 What do people think of running the Big Data Benchmark
 https://amplab.cs.berkeley.edu/benchmark/ (repo
 https://github.com/amplab/benchmark) as part of preparing every new
 release of Spark?

 We'd run it just for Spark and effectively use it as another type of test
 to track any performance progress or regressions from release to release.

 Would doing such a thing be valuable? Do we already have a way of
 benchmarking Spark performance that we use regularly?

 Nick




Re: Run the Big Data Benchmark for new releases

2014-09-01 Thread Nicholas Chammas
Alright, sounds good! I've created databricks/spark-perf/issues/9
https://github.com/databricks/spark-perf/issues/9 as a reminder for us to
add a new test once we've root caused SPARK-.


On Tue, Sep 2, 2014 at 1:07 AM, Patrick Wendell pwend...@gmail.com wrote:

 Yeah, this wasn't detected in our performance tests. We even have a
 test in PySpark that I would have though might catch this (it just
 schedules a bunch of really small tasks, similar to the regression
 case).


 https://github.com/databricks/spark-perf/blob/master/pyspark-tests/tests.py#L51

 Anyways, Josh is trying to repro the regression to see if we can
 figure out what is going on. If we find something for sure we should
 add a test.

 On Mon, Sep 1, 2014 at 10:04 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Nope, actually, they didn't find that (they found some other things that
 were fixed, as well as some improvements). Feel free to send a PR, but it
 would be good to profile the issue first to understand what slowed down.
 (For example is the map phase taking longer or is it the reduce phase, is
 there some difference in lengths of specific tasks, etc).
 
  Matei
 
  On September 1, 2014 at 10:03:20 PM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:
 
  Oh, that's sweet. So, a related question then.
 
  Did those tests pick up the performance issue reported in SPARK-?
 Does it make sense to add a new test to cover that case?
 
 
  On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Hi Nicholas,
 
  At Databricks we already run https://github.com/databricks/spark-perf
 for each release, which is a more comprehensive performance test suite.
 
  Matei
 
  On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:
 
  What do people think of running the Big Data Benchmark
  https://amplab.cs.berkeley.edu/benchmark/ (repo
  https://github.com/amplab/benchmark) as part of preparing every new
  release of Spark?
 
  We'd run it just for Spark and effectively use it as another type of test
  to track any performance progress or regressions from release to release.
 
  Would doing such a thing be valuable? Do we already have a way of
  benchmarking Spark performance that we use regularly?
 
  Nick
 



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-08-31 Thread Nicholas Chammas
-1: I believe I've found a regression from 1.0.2. The report is captured in
SPARK- https://issues.apache.org/jira/browse/SPARK-.


On Sat, Aug 30, 2014 at 6:07 PM, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.1.0!

 The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.1.0-rc3/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1030/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/

 Please vote on releasing this package as Apache Spark 1.1.0!

 The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.1.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == Regressions fixed since RC1 ==
 - Build issue for SQL support:
 https://issues.apache.org/jira/browse/SPARK-3234
 - EC2 script version bump to 1.1.0.

 == What justifies a -1 vote for this release? ==
 This vote is happening very late into the QA period compared with
 previous votes, so -1 votes should only occur for significant
 regressions from 1.0.2. Bugs already present in 1.0.X will not block
 this release.

 == What default changes should I be aware of? ==
 1. The default value of spark.io.compression.codec is now snappy
 -- Old behavior can be restored by switching to lzf

 2. PySpark now performs external spilling during aggregations.
 -- Old behavior can be restored by setting spark.shuffle.spill to
 false.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-08-31 Thread Nicholas Chammas
On Sun, Aug 31, 2014 at 6:38 PM, chutium teng@gmail.com wrote:

 has anyone tried to build it on hadoop.version=2.0.0-mr1-cdh4.3.0 or
 hadoop.version=1.0.3-mapr-3.0.3 ?


Is the behavior you're seeing a regression from 1.0.2, or does 1.0.2 have
this same problem?

Nick


Re: Handling stale PRs

2014-08-30 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote:

 it's actually precedurally difficult for us to close pull requests


Just an FYI: Seems like the GitHub-sanctioned work-around to having
issues-only permissions is to have a second, issues-only repository
https://help.github.com/articles/issues-only-access-permissions. Not a
very attractive work-around...

Nick


Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Nicholas Chammas
There were several formatting and typographical errors in the SQL docs that
I've fixed in this PR https://github.com/apache/spark/pull/2201. Dunno if
we want to roll that into the release.


On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Okay I'll plan to add cdh4 binary as well for the final release!

 ---
 sent from my phone
 On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote:

  We just used CDH 4.7 for our production cluster. And I believe we won't
  use CDH 5 in the next year.
 
  Sent from my iPhone
 
   On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  
   Personally I'd actually consider putting CDH4 back if there are still
  users on it. It's always better to be inclusive, and the convenience of a
  one-click download is high. Do we have a sense on what % of CDH users
 still
  use CDH4?
  
   Matei
  
   On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com)
 wrote:
  
   (Copying my reply since I don't know if it goes to the mailing list)
  
   Great, thanks for explaining the reasoning. You're saying these aren't
   going into the final release? I think that moots any issue surrounding
   distributing them then.
  
   This is all I know of from the ASF:
   https://community.apache.org/projectIndependence.html I don't read it
   as expressly forbidding this kind of thing although you can see how it
   bumps up against the spirit. There's not a bright line -- what about
   Tomcat providing binaries compiled for Windows for example? does that
   favor an OS vendor?
  
   From this technical ASF perspective only the releases matter -- do
   what you want with snapshots and RCs. The only issue there is maybe
   releasing something different than was in the RC; is that at all
   confusing? Just needs a note.
  
   I think this theoretical issue doesn't exist if these binaries aren't
   released, so I see no reason to not proceed.
  
   The rest is a different question about whether you want to spend time
   maintaining this profile and candidate. The vendor already manages
   their build I think and -- and I don't know -- may even prefer not to
   have a different special build floating around. There's also the
   theoretical argument that this turns off other vendors from adopting
   Spark if it's perceived to be too connected to other vendors. I'd like
   to maximize Spark's distribution and there's some argument you do this
   by not making vendor profiles. But as I say a different question to
   just think about over time...
  
   (oh and PS for my part I think it's a good thing that CDH4 binaries
   were removed. I wasn't arguing for resurrecting them)
  
   On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell pwend...@gmail.com
  wrote:
   Hey Sean,
  
   The reason there are no longer CDH-specific builds is that all newer
   versions of CDH and HDP work with builds for the upstream Hadoop
   projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and
   the Hadoop-without-Hive (also 2.4) build.
  
   For MapR - we can't officially post those artifacts on ASF web space
   when we make the final release, we can only link to them as being
   hosted by MapR specifically since they use non-compatible licenses.
   However, I felt that providing these during a testing period was
   alright, with the goal of increasing test coverage. I couldn't find
   any policy against posting these on personal web space during RC
   voting. However, we can remove them if there is one.
  
   Dropping CDH4 was more because it is now pretty old, but we can add it
   back if people want. The binary packaging is a slightly separate
   question from release votes, so I can always add more binary packages
   whenever. And on this, my main concern is covering the most popular
   Hadoop versions to lower the bar for users to build and test Spark.
  
   - Patrick
  
   On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com
  wrote:
   +1 I tested the source and Hadoop 2.4 release. Checksums and
   signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't
   fail any more than usual.
  
   FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another
   project and have encountered no problems.
  
  
   I notice that the 1.1.0 release removes the CDH4-specific build, but
   adds two MapR-specific builds. Compare with
   https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I
   commented on the commit:
  
 
 https://github.com/apache/spark/commit/ceb19830b88486faa87ff41e18d03ede713a73cc
  
   I'm in favor of removing all vendor-specific builds. This change
   *looks* a bit funny as there was no JIRA (?) and appears to swap one
   vendor for another. Of course there's nothing untoward going on, but
   what was the reasoning? It's best avoided, and MapR already
   distributes Spark just fine, no?
  
   This is a gray area with ASF projects. I mention it as well because
 it
   came up with Apache Flink recently
  

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Nicholas Chammas
[Let me know if I should be posting these comments in a different thread.]

Should the default Spark version in spark-ec2
https://github.com/apache/spark/blob/e1535ad3c6f7400f2b7915ea91da9c60510557ba/ec2/spark_ec2.py#L86
be updated for this release?

Nick
​


On Fri, Aug 29, 2014 at 12:55 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Hey Nicholas,

 Thanks for this, we can merge in doc changes outside of the actual
 release timeline, so we'll make sure to loop those changes in before
 we publish the final 1.1 docs.

 - Patrick

 On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  There were several formatting and typographical errors in the SQL docs
 that
  I've fixed in this PR. Dunno if we want to roll that into the release.
 
 
  On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Okay I'll plan to add cdh4 binary as well for the final release!
 
  ---
  sent from my phone
  On Aug 29, 2014 8:26 AM, Ye Xianjin advance...@gmail.com wrote:
 
   We just used CDH 4.7 for our production cluster. And I believe we
 won't
   use CDH 5 in the next year.
  
   Sent from my iPhone
  
On 2014年8月29日, at 14:39, Matei Zaharia matei.zaha...@gmail.com
wrote:
   
Personally I'd actually consider putting CDH4 back if there are
 still
   users on it. It's always better to be inclusive, and the convenience
 of
   a
   one-click download is high. Do we have a sense on what % of CDH users
   still
   use CDH4?
   
Matei
   
On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com)
wrote:
   
(Copying my reply since I don't know if it goes to the mailing list)
   
Great, thanks for explaining the reasoning. You're saying these
 aren't
going into the final release? I think that moots any issue
 surrounding
distributing them then.
   
This is all I know of from the ASF:
https://community.apache.org/projectIndependence.html I don't read
 it
as expressly forbidding this kind of thing although you can see how
 it
bumps up against the spirit. There's not a bright line -- what about
Tomcat providing binaries compiled for Windows for example? does
 that
favor an OS vendor?
   
From this technical ASF perspective only the releases matter -- do
what you want with snapshots and RCs. The only issue there is maybe
releasing something different than was in the RC; is that at all
confusing? Just needs a note.
   
I think this theoretical issue doesn't exist if these binaries
 aren't
released, so I see no reason to not proceed.
   
The rest is a different question about whether you want to spend
 time
maintaining this profile and candidate. The vendor already manages
their build I think and -- and I don't know -- may even prefer not
 to
have a different special build floating around. There's also the
theoretical argument that this turns off other vendors from adopting
Spark if it's perceived to be too connected to other vendors. I'd
 like
to maximize Spark's distribution and there's some argument you do
 this
by not making vendor profiles. But as I say a different question to
just think about over time...
   
(oh and PS for my part I think it's a good thing that CDH4 binaries
were removed. I wasn't arguing for resurrecting them)
   
On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell 
 pwend...@gmail.com
   wrote:
Hey Sean,
   
The reason there are no longer CDH-specific builds is that all
 newer
versions of CDH and HDP work with builds for the upstream Hadoop
projects. I dropped CDH4 in favor of a newer Hadoop version (2.4)
 and
the Hadoop-without-Hive (also 2.4) build.
   
For MapR - we can't officially post those artifacts on ASF web
 space
when we make the final release, we can only link to them as being
hosted by MapR specifically since they use non-compatible licenses.
However, I felt that providing these during a testing period was
alright, with the goal of increasing test coverage. I couldn't find
any policy against posting these on personal web space during RC
voting. However, we can remove them if there is one.
   
Dropping CDH4 was more because it is now pretty old, but we can add
it
back if people want. The binary packaging is a slightly separate
question from release votes, so I can always add more binary
 packages
whenever. And on this, my main concern is covering the most popular
Hadoop versions to lower the bar for users to build and test Spark.
   
- Patrick
   
On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com
   wrote:
+1 I tested the source and Hadoop 2.4 release. Checksums and
signatures are OK. Compiles fine with Java 8 on OS X. Tests...
 don't
fail any more than usual.
   
FWIW I've also been using the 1.1.0-SNAPSHOT for some time in
another
project and have encountered no problems.
   
   
I

Re: Handling stale PRs

2014-08-27 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Last weekend, I started hacking on a Google App Engine app for helping
 with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).


BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA
issue, perhaps?

Nick


Re: Handling stale PRs

2014-08-27 Thread Nicholas Chammas
Alright! That was quick. :)


On Wed, Aug 27, 2014 at 6:48 PM, Josh Rosen rosenvi...@gmail.com wrote:

 I have a very simple dashboard running at http://spark-prs.appspot.com/.
  Currently, this mirrors the functionality of Patrick’s github-shim, but it
 should be very easy to extend with other features.

 The source is at https://github.com/databricks/spark-pr-dashboard (pull
 requests and issues welcome!)

 On August 27, 2014 at 2:11:41 PM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

  On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

  Last weekend, I started hacking on a Google App Engine app for helping
 with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).


 BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA
 issue, perhaps?

 Nick




Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT

2014-08-27 Thread Nicholas Chammas
Looks like we're currently at 1.568 so we should be getting a nice slew of
UI tweaks and bug fixes. Neat!


On Wed, Aug 27, 2014 at 7:13 PM, shane knapp skn...@berkeley.edu wrote:

 tomorrow morning i will be upgrading jenkins to the latest/greatest
 (1.577).

 at 730am, i will put jenkins in to a quiet period, so no new builds will be
 accepted.  once any running builds are finished, i will be taking jenkins
 down for the upgrade.

 depending on what and how many jobs are running, i'm expecting this to
 take, at most, an hour.

 i'll send out an update tomorrow morning right before i begin, and will
 send out updates and an all-clear once we're up and running again.

 1.577 release notes:
 http://jenkins-ci.org/changelog

 please let me know if there are any questions/concerns.  thanks in advance!

 shane



Re: Handling stale PRs

2014-08-26 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote:

 I'd prefer if we took the approach of politely explaining why in the
 current form the patch isn't acceptable and closing it (potentially w/ tips
 on how to improve it or narrow the scope).


Amen to this. Aiming for such a culture would set Spark apart from other
projects in a great way.

I've proposed several different solutions to ASF infra to streamline the
 process, but thus far they haven't been open to any of my ideas:


I've added myself as a watcher on those 2 INFRA issues. Sucks that the only
solution on offer right now requires basically polluting the commit history.

Short of moving Spark's repo to a non-ASF-managed GitHub account, do you
think another bot could help us manage the number of stale PRs?

I'm thinking a solution as follows might be very helpful:

   - Extend Spark QA / Jenkins to run on a weekly schedule and check for
   stale PRs. Let's say a stale PR is an open one that hasn't been updated in
   N months.
   - Spark QA maintains a list of known committers on its side.
   - During its weekly check of stale PRs, Spark QA takes the following
   action:
  - If the last person to comment on a PR was a committer, post to the
  PR asking for an update from the contributor.
  - If the last person to comment on a PR was a contributor, add the PR
  to a list. Email this list of *hanging PRs* out to the dev list on a
  weekly basis and ask committers to update them.
  - If the last person to comment on a PR was Spark QA asking the
  contributor to update it, then add the PR to a list. Email this
list of *abandoned
  PRs* to the dev list for the record (or for closing, if that becomes
  possible in the future).

This doesn't solve the problem of not being able to close PRs, but it does
help make sure no PR is left hanging for long.

What do you think? I'd be interested in implementing this solution if we
like it.

Nick


Re: Handling stale PRs

2014-08-26 Thread Nicholas Chammas
OK, that sounds pretty cool.

Josh,

Do you see this app as encompassing or supplanting the functionality I
described as well?

Nick


On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Last weekend, I started hacking on a Google App Engine app for helping
 with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).
  Some of my basic goals (not all implemented yet):

 - Users sign in using GitHub and can browse a list of pull requests,
 including links to associated JIRAs, Jenkins statuses, a quick preview of
 the last comment, etc.

 - Pull requests are auto-classified based on which components they modify
 (by looking at the diff).

 - From the app’s own internal database of PRs, we can build dashboards to
 find “abandoned” PRs, graph average time to first review, etc.

 - Since we authenticate users with GitHub, we can enable administrative
 functions via this dashboard (e.g. “assign this PR to me”, “vote to close
 in the weekly auto-close commit”, etc.

 Right now, I’ve implemented GItHub OAuth support and code to update the
 issues database using the GitHub API.  Because we have access to the full
 API, it’s pretty easy to do fancy things like parsing the reason for
 Jenkins failure, etc.  You could even imagine some fancy mashup tools to
 pull up JIRAs and pull requests side-by in iframes.

 After I hack on this a bit more, I plan to release a public preview
 version; if we find this tool useful, I’ll clean it up and open-source the
 app so folks can contribute to it.

 - Josh

 On August 26, 2014 at 8:16:46 AM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

 On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  I'd prefer if we took the approach of politely explaining why in the
  current form the patch isn't acceptable and closing it (potentially w/
 tips
  on how to improve it or narrow the scope).


 Amen to this. Aiming for such a culture would set Spark apart from other
 projects in a great way.

 I've proposed several different solutions to ASF infra to streamline the
  process, but thus far they haven't been open to any of my ideas:


 I've added myself as a watcher on those 2 INFRA issues. Sucks that the
 only
 solution on offer right now requires basically polluting the commit
 history.

 Short of moving Spark's repo to a non-ASF-managed GitHub account, do you
 think another bot could help us manage the number of stale PRs?

 I'm thinking a solution as follows might be very helpful:

 - Extend Spark QA / Jenkins to run on a weekly schedule and check for
 stale PRs. Let's say a stale PR is an open one that hasn't been updated in
 N months.
 - Spark QA maintains a list of known committers on its side.
 - During its weekly check of stale PRs, Spark QA takes the following
 action:
 - If the last person to comment on a PR was a committer, post to the
 PR asking for an update from the contributor.
 - If the last person to comment on a PR was a contributor, add the PR
 to a list. Email this list of *hanging PRs* out to the dev list on a
 weekly basis and ask committers to update them.
 - If the last person to comment on a PR was Spark QA asking the
 contributor to update it, then add the PR to a list. Email this
 list of *abandoned
 PRs* to the dev list for the record (or for closing, if that becomes
 possible in the future).

 This doesn't solve the problem of not being able to close PRs, but it does
 help make sure no PR is left hanging for long.

 What do you think? I'd be interested in implementing this solution if we
 like it.

 Nick




spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Nicholas Chammas
I downloaded the source code release for 1.0.2 from here
http://spark.apache.org/downloads.html and launched an EC2 cluster using
spark-ec2.

After the cluster finishes launching, I fire up the shell and check the
version:

scala sc.version
res1: String = 1.0.1

The startup banner also shows the same thing. Hmm...

So I dig around and find that the spark_ec2.py script has the default Spark
version set to 1.0.1.

Derp.

  parser.add_option(-v, --spark-version, default=1.0.1,
  help=Version of Spark to use: 'X.Y.Z' or a specific git hash)

Is there any way to fix the release? It’s a minor issue, but could be very
confusing. And how can we prevent this from happening again?

Nick
​


Re: Handling stale PRs

2014-08-26 Thread Nicholas Chammas
By the way, as a reference point, I just stumbled across the Discourse
GitHub project and their list of pull requests
https://github.com/discourse/discourse/pulls looks pretty neat.

~2,200 closed PRs, 6 open. Least recently updated PR dates to 8 days ago.
Project started ~1.5 years ago.

Dunno how many committers Discourse has, but it looks like they've managed
their PRs well. I hope we can do as well in this regard as they have.

Nick


On Tue, Aug 26, 2014 at 2:40 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Sure; App Engine supports cron and sending emails.  We can configure the
 app with Spark QA’s credentials in order to allow it to post comments on
 issues, etc.

 - Josh

 On August 26, 2014 at 11:38:08 AM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

  OK, that sounds pretty cool.

 Josh,

 Do you see this app as encompassing or supplanting the functionality I
 described as well?

 Nick


 On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

  Last weekend, I started hacking on a Google App Engine app for helping
 with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).
  Some of my basic goals (not all implemented yet):

  - Users sign in using GitHub and can browse a list of pull requests,
 including links to associated JIRAs, Jenkins statuses, a quick preview of
 the last comment, etc.

  - Pull requests are auto-classified based on which components they
 modify (by looking at the diff).

  - From the app’s own internal database of PRs, we can build dashboards
 to find “abandoned” PRs, graph average time to first review, etc.

  - Since we authenticate users with GitHub, we can enable administrative
 functions via this dashboard (e.g. “assign this PR to me”, “vote to close
 in the weekly auto-close commit”, etc.

  Right now, I’ve implemented GItHub OAuth support and code to update the
 issues database using the GitHub API.  Because we have access to the full
 API, it’s pretty easy to do fancy things like parsing the reason for
 Jenkins failure, etc.  You could even imagine some fancy mashup tools to
 pull up JIRAs and pull requests side-by in iframes.

 After I hack on this a bit more, I plan to release a public preview
 version; if we find this tool useful, I’ll clean it up and open-source the
 app so folks can contribute to it.

 - Josh

 On August 26, 2014 at 8:16:46 AM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

  On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  I'd prefer if we took the approach of politely explaining why in the
  current form the patch isn't acceptable and closing it (potentially w/
 tips
  on how to improve it or narrow the scope).


 Amen to this. Aiming for such a culture would set Spark apart from other
 projects in a great way.

 I've proposed several different solutions to ASF infra to streamline the
  process, but thus far they haven't been open to any of my ideas:


 I've added myself as a watcher on those 2 INFRA issues. Sucks that the
 only
 solution on offer right now requires basically polluting the commit
 history.

 Short of moving Spark's repo to a non-ASF-managed GitHub account, do you
 think another bot could help us manage the number of stale PRs?

 I'm thinking a solution as follows might be very helpful:

 - Extend Spark QA / Jenkins to run on a weekly schedule and check for
 stale PRs. Let's say a stale PR is an open one that hasn't been updated in
 N months.
 - Spark QA maintains a list of known committers on its side.
 - During its weekly check of stale PRs, Spark QA takes the following
 action:
 - If the last person to comment on a PR was a committer, post to the
 PR asking for an update from the contributor.
 - If the last person to comment on a PR was a contributor, add the PR
 to a list. Email this list of *hanging PRs* out to the dev list on a
 weekly basis and ask committers to update them.
 - If the last person to comment on a PR was Spark QA asking the
 contributor to update it, then add the PR to a list. Email this
 list of *abandoned
 PRs* to the dev list for the record (or for closing, if that becomes
 possible in the future).

 This doesn't solve the problem of not being able to close PRs, but it does
 help make sure no PR is left hanging for long.

 What do you think? I'd be interested in implementing this solution if we
 like it.

 Nick





Re: Pull requests will be automatically linked to JIRA when submitted

2014-08-25 Thread Nicholas Chammas
FYI: Looks like the Mesos folk also have a bot to do automatic linking, but
it appears to have been provided to them somehow by ASF.

See this comment as an example:
https://issues.apache.org/jira/browse/MESOS-1688?focusedCommentId=14109078page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14109078

Might be a small win to push this work to a bot ASF manages if we can get
access to it (and if we have no concerns about depending on an another
external service).

Nick


On Mon, Aug 11, 2014 at 4:10 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Thanks for looking into this. I think little tools like this are super
 helpful.

 Would it hurt to open a request with INFRA to install/configure the
 JIRA-GitHub plugin while we continue to use the Python script we have? I
 wouldn't mind opening that JIRA issue with them.

 Nick


 On Mon, Aug 11, 2014 at 12:52 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 I spent some time on this and I'm not sure either of these is an option,
 unfortunately.

 We typically can't use custom JIRA plug-in's because this JIRA is
 controlled by the ASF and we don't have rights to modify most things about
 how it works (it's a large shared JIRA instance used by more than 50
 projects). It's worth looking into whether they can do something. In
 general we've tended to avoid going through ASF infra them whenever
 possible, since they are generally overloaded and things move very slowly,
 even if there are outages.

 Here is the script we use to do the sync:
 https://github.com/apache/spark/blob/master/dev/github_jira_sync.py

 It might be possible to modify this to support post-hoc changes, but we'd
 need to think about how to do so while minimizing function calls to the ASF
 JIRA API, which I found are very slow.

 - Patrick



 On Mon, Aug 11, 2014 at 7:51 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 It looks like this script doesn't catch PRs that are opened and *then*
 have

 the JIRA issue ID added to the name. Would it be easy to somehow have the
 script trigger on PR name changes as well as PR creates?

 Alternately, is there a reason we can't or don't want to use the plugin
 mentioned below? (I'm assuming it covers cases like this, but I'm not
 sure.)

 Nick



 On Wed, Jul 23, 2014 at 12:52 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  By the way, it looks like there’s a JIRA plugin that integrates it with
  GitHub:
 
 -
 
 https://marketplace.atlassian.com/plugins/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin

 -
 
 https://confluence.atlassian.com/display/BITBUCKET/Linking+Bitbucket+and+GitHub+accounts+to+JIRA
 
  It does the automatic linking and shows some additional information
  
 https://marketplace-cdn.atlassian.com/files/images/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin/86ff1a21-44fb-4227-aa4f-44c77aec2c97.png
 

  that might be nice to have for heavy JIRA users.
 
  Nick
 
 
 
  On Sun, Jul 20, 2014 at 12:50 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Yeah it needs to have SPARK-XXX in the title (this is the format we
  request already). It just works with small synchronization script I
  wrote that we run every five minutes on Jeknins that uses the Github
  and Jenkins API:
 
 
 
 https://github.com/apache/spark/commit/49e472744951d875627d78b0d6e93cd139232929
 
  - Patrick
 
  On Sun, Jul 20, 2014 at 8:06 AM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
   That's pretty neat.
  
   How does it work? Do we just need to put the issue ID (e.g.
 SPARK-1234)
   anywhere in the pull request?
  
   Nick
  
  
   On Sat, Jul 19, 2014 at 11:10 PM, Patrick Wendell 
 pwend...@gmail.com
   wrote:
  
   Just a small note, today I committed a tool that will automatically
   mirror pull requests to JIRA issues, so contributors will no longer
   have to manually post a pull request on the JIRA when they make
 one.
  
   It will create a link on the JIRA and also make a comment to
 trigger
   an e-mail to people watching.
  
   This should make some things easier, such as avoiding accidental
   duplicate effort on the same JIRA.
  
   - Patrick
  
 
 
 






Handling stale PRs

2014-08-25 Thread Nicholas Chammas
Check this out:
https://github.com/apache/spark/pulls?q=is%3Aopen+is%3Apr+sort%3Aupdated-asc

We're hitting close to 300 open PRs. Those are the least recently updated
ones.

I think having a low number of stale (i.e. not recently updated) PRs is a
good thing to shoot for. It doesn't leave contributors hanging (which feels
bad for contributors), and reduces project clutter (which feels bad for
maintainers/committers).

What is our approach to tackling this problem?

I think communicating and enforcing a clear policy on how stale PRs are
handled might be a good way to reduce the number of stale PRs we have
without making contributors feel rejected.

I don't know what such a policy would look like, but it should be
enforceable and lightweight--i.e. it shouldn't feel like a hammer used to
reject people's work, but rather a necessary tool to keep the project's
contributions relevant and manageable.

Nick


Re: Spark Contribution

2014-08-23 Thread Nicholas Chammas
That sounds like a good idea.

Continuing along those lines, what do people think of moving the
contributing page entirely from the wiki to GitHub? It feels like the right
place for it since GitHub is where we take contributions, and it also lets
people make improvements to it.

Nick


2014년 8월 23일 토요일, Sean Owenso...@cloudera.com님이 작성한 메시지:

 Can I ask a related question, since I have a PR open to touch up
 README.md as we speak (SPARK-3069)?

 If this text is in a file called CONTRIBUTING.md, then it will cause a
 link to appear on the pull request screen, inviting people to review
 the contribution guidelines:

 https://github.com/blog/1184-contributing-guidelines

 This is mildly important as the project wants to make it clear that
 you agree that your contribution is licensed under the AL2, since
 there is no formal ICLA.

 How about I propose moving the text to CONTRIBUTING.md with a pointer
 in README.md? or keep it both places?

 On Sat, Aug 23, 2014 at 1:08 AM, Reynold Xin r...@databricks.com
 javascript:; wrote:
  Great idea. Added the link
  https://github.com/apache/spark/blob/master/README.md
 
 
 
  On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com javascript:; wrote:
 
  We should add this link to the readme on GitHub btw.
 
  2014년 8월 21일 목요일, Henry Saputrahenry.sapu...@gmail.com javascript:;님이
 작성한 메시지:
 
   The Apache Spark wiki on how to contribute should be great place to
   start:
  
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  
   - Henry
  
   On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com
 javascript:;
   javascript:; wrote:
Hi,
   
Can someone help me with some links on how to contribute for Spark
   
Regards
mns
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 javascript:; javascript:;
   For additional commands, e-mail: dev-h...@spark.apache.org
 javascript:;
  javascript:;
  
  
 



Re: Spark Contribution

2014-08-21 Thread Nicholas Chammas
We should add this link to the readme on GitHub btw.

2014년 8월 21일 목요일, Henry Saputrahenry.sapu...@gmail.com님이 작성한 메시지:

 The Apache Spark wiki on how to contribute should be great place to
 start:
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 - Henry

 On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com
 javascript:; wrote:
  Hi,
 
  Can someone help me with some links on how to contribute for Spark
 
  Regards
  mns

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org javascript:;
 For additional commands, e-mail: dev-h...@spark.apache.org javascript:;




Re: -1s on pull requests?

2014-08-15 Thread Nicholas Chammas
On Sun, Aug 3, 2014 at 4:35 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Include the commit hash in the tests have started/completed messages, so
 that it's clear what code exactly is/has been tested for each test cycle.


This is now captured in this JIRA issue
https://issues.apache.org/jira/browse/SPARK-2912 and completed in this PR
https://github.com/apache/spark/pull/1816 which has been merged in to
master.

Example of old style: tests starting
https://github.com/apache/spark/pull/1819#issuecomment-51416510 / tests
finished https://github.com/apache/spark/pull/1819#issuecomment-51417477
(with
new classes)

Example of new style: tests starting
https://github.com/apache/spark/pull/1816#issuecomment-51855254 / tests
finished https://github.com/apache/spark/pull/1816#issuecomment-51855255
(with
new classes)

Nick


Re: Tests failing

2014-08-15 Thread Nicholas Chammas
Shivaram,

Can you point us to an example of that happening? The Jenkins console
output, that is.

Nick


On Fri, Aug 15, 2014 at 2:28 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 Also I think Jenkins doesn't post build timeouts to github. Is there anyway
 we can fix that ?
 On Aug 15, 2014 9:04 AM, Patrick Wendell pwend...@gmail.com wrote:

  Hi All,
 
  I noticed that all PR tests run overnight had failed due to timeouts. The
  patch that updates the netty shuffle I believe somehow inflated to the
  build time significantly. That patch had been tested, but one change was
  made before it was merged that was not tested.
 
  I've reverted the patch for now to see if it brings the build times back
  down.
 
  - Patrick
 



Re: Tests failing

2014-08-15 Thread Nicholas Chammas
So 2 hours is a hard cap on how long a build can run. Okie doke.

Perhaps then I'll wrap the run-tests step as you suggest and limit it to
100 minutes or something, and cleanly report if it times out.

Sound good?


On Fri, Aug 15, 2014 at 4:43 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Nicholas,

 Yeah so Jenkins has it's own timeout mechanism and it will just kill the
 entire build after 120 minutes. But since run-tests is sitting in the
 middle of the tests, it can't actually post a failure message.

 I think run-tests-jenkins should just wrap the call to run-tests in a call
 in its own timeout. It might be possible to just use this:

 http://linux.die.net/man/1/timeout

 - Patrick


 On Fri, Aug 15, 2014 at 1:31 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 OK, I've captured this in SPARK-3076
 https://issues.apache.org/jira/browse/SPARK-3076.

 Patrick,

 Is the problem that this run-tests
 https://github.com/apache/spark/blob/0afe5cb65a195d2f14e8dfcefdbec5dac023651f/dev/run-tests-jenkins#L151
  step
 times out, and that is currently not handled gracefully? To be more
 specific, it hangs for 120 minutes, times out, but the parent script for
 some reason is also terminated. Does that sound right?

 Nick


 On Fri, Aug 15, 2014 at 3:33 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Jenkins runs for this PR https://github.com/apache/spark/pull/1960
 timed out without notification. The relevant Jenkins logs are at


 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18588/consoleFull

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18592/consoleFull

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18597/consoleFull


 On Fri, Aug 15, 2014 at 11:44 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Shivaram,

 Can you point us to an example of that happening? The Jenkins console
 output, that is.

 Nick


 On Fri, Aug 15, 2014 at 2:28 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 Also I think Jenkins doesn't post build timeouts to github. Is there
 anyway
 we can fix that ?
 On Aug 15, 2014 9:04 AM, Patrick Wendell pwend...@gmail.com wrote:

  Hi All,
 
  I noticed that all PR tests run overnight had failed due to
 timeouts. The
  patch that updates the netty shuffle I believe somehow inflated to
 the
  build time significantly. That patch had been tested, but one change
 was
  made before it was merged that was not tested.
 
  I've reverted the patch for now to see if it brings the build times
 back
  down.
 
  - Patrick
 








Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Nicholas Chammas
On a related note, I recently heard about Distributed R
https://github.com/vertica/DistributedR, which is coming out of
HP/Vertica and seems to be their proposition for machine learning at scale.

It would be interesting to see some kind of comparison between that and
MLlib (and perhaps also SparkR https://github.com/amplab-extras/SparkR-pkg?),
especially since Distributed R has a concept of distributed arrays and
works on data in-memory. Docs are here.
https://github.com/vertica/DistributedR/tree/master/doc/platform

Nick


On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin r...@databricks.com wrote:

 They only compared their own implementations of couple algorithms on
 different platforms rather than comparing the different platforms
 themselves (in the case of Spark -- PySpark). I can write two variants of
 an algorithm on Spark and make them perform drastically differently.

 I have no doubt if you implement a ML algorithm in Python itself without
 any native libraries, the performance will be sub-optimal.

 What PySpark really provides is:

 - Using Spark transformations in Python
 - ML algorithms implemented in Scala (leveraging native numerical libraries
 for high performance), and callable in Python

 The paper claims Python is now one of the most popular languages for
 ML-oriented programming, and that's why they went ahead with Python.
 However, as I understand, very few people actually implement algorithms in
 Python directly because of the sub-optimal performance. Most people
 implement algorithms in other languages (e.g. C / Java), and expose APIs in
 Python for ease-of-use. This is what we are trying to do with PySpark as
 well.


 On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas 
 ignacio.zendejas...@gmail.com wrote:

  Has anyone had a chance to look at this paper (with title in subject)?
  http://www.cs.rice.edu/~lp6/comparison.pdf
 
  Interesting that they chose to use Python alone. Do we know how much
 faster
  Scala is vs. Python in general, if at all?
 
  As with any and all benchmarks, I'm sure there are caveats, but it'd be
  nice to have a response to the question above for starters.
 
  Thanks,
  Ignacio
 



Re: Pull requests will be automatically linked to JIRA when submitted

2014-08-11 Thread Nicholas Chammas
Thanks for looking into this. I think little tools like this are super
helpful.

Would it hurt to open a request with INFRA to install/configure the
JIRA-GitHub plugin while we continue to use the Python script we have? I
wouldn't mind opening that JIRA issue with them.

Nick


On Mon, Aug 11, 2014 at 12:52 PM, Patrick Wendell pwend...@gmail.com
wrote:

 I spent some time on this and I'm not sure either of these is an option,
 unfortunately.

 We typically can't use custom JIRA plug-in's because this JIRA is
 controlled by the ASF and we don't have rights to modify most things about
 how it works (it's a large shared JIRA instance used by more than 50
 projects). It's worth looking into whether they can do something. In
 general we've tended to avoid going through ASF infra them whenever
 possible, since they are generally overloaded and things move very slowly,
 even if there are outages.

 Here is the script we use to do the sync:
 https://github.com/apache/spark/blob/master/dev/github_jira_sync.py

 It might be possible to modify this to support post-hoc changes, but we'd
 need to think about how to do so while minimizing function calls to the ASF
 JIRA API, which I found are very slow.

 - Patrick



 On Mon, Aug 11, 2014 at 7:51 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 It looks like this script doesn't catch PRs that are opened and *then*
 have

 the JIRA issue ID added to the name. Would it be easy to somehow have the
 script trigger on PR name changes as well as PR creates?

 Alternately, is there a reason we can't or don't want to use the plugin
 mentioned below? (I'm assuming it covers cases like this, but I'm not
 sure.)

 Nick



 On Wed, Jul 23, 2014 at 12:52 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  By the way, it looks like there’s a JIRA plugin that integrates it with
  GitHub:
 
 -
 
 https://marketplace.atlassian.com/plugins/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin

 -
 
 https://confluence.atlassian.com/display/BITBUCKET/Linking+Bitbucket+and+GitHub+accounts+to+JIRA
 
  It does the automatic linking and shows some additional information
  
 https://marketplace-cdn.atlassian.com/files/images/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin/86ff1a21-44fb-4227-aa4f-44c77aec2c97.png
 

  that might be nice to have for heavy JIRA users.
 
  Nick
 
 
 
  On Sun, Jul 20, 2014 at 12:50 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Yeah it needs to have SPARK-XXX in the title (this is the format we
  request already). It just works with small synchronization script I
  wrote that we run every five minutes on Jeknins that uses the Github
  and Jenkins API:
 
 
 
 https://github.com/apache/spark/commit/49e472744951d875627d78b0d6e93cd139232929
 
  - Patrick
 
  On Sun, Jul 20, 2014 at 8:06 AM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
   That's pretty neat.
  
   How does it work? Do we just need to put the issue ID (e.g.
 SPARK-1234)
   anywhere in the pull request?
  
   Nick
  
  
   On Sat, Jul 19, 2014 at 11:10 PM, Patrick Wendell 
 pwend...@gmail.com
   wrote:
  
   Just a small note, today I committed a tool that will automatically
   mirror pull requests to JIRA issues, so contributors will no longer
   have to manually post a pull request on the JIRA when they make one.
  
   It will create a link on the JIRA and also make a comment to
 trigger
   an e-mail to people watching.
  
   This should make some things easier, such as avoiding accidental
   duplicate effort on the same JIRA.
  
   - Patrick
  
 
 
 





Unit tests in 5 minutes

2014-08-08 Thread Nicholas Chammas
Howdy,

Do we think it's both feasible and worthwhile to invest in getting our unit
tests to finish in under 5 minutes (or something similarly brief) when run
by Jenkins?

Unit tests currently seem to take anywhere from 30 min to 2 hours. As
people add more tests, I imagine this time will only grow. I think it would
be better for both contributors and reviewers if they didn't have to wait
so long for test results; PR reviews would be shorter, if nothing else.

I don't know how how this is normally done, but maybe it wouldn't be too
much work to get a test cycle to feel lighter.

Most unit tests are independent and can be run concurrently, right? Would
it make sense to build a given patch on many servers at once and send
disjoint sets of unit tests to each?

I'd be interested in working on something like that if possible (and
sensible).

Nick


Re: -1s on pull requests?

2014-08-05 Thread Nicholas Chammas

 1. Include the commit hash in the tests have started/completed


FYI: Looks like Xiangrui's already got a JIRA issue for this.

SPARK-2622: Add Jenkins build numbers to SparkQA messages
https://issues.apache.org/jira/browse/SPARK-2622

2. Pin a message to the start or end of the PR


Should new JIRA issues for this item fall under the following umbrella
issue?

SPARK-2230: Improvements to Jenkins QA Harness
https://issues.apache.org/jira/browse/SPARK-2230

Nick


Re: -1s on pull requests?

2014-08-03 Thread Nicholas Chammas
On Mon, Jul 21, 2014 at 4:44 PM, Kay Ousterhout k...@eecs.berkeley.edu
wrote:

 This also happens when something accidentally gets merged after the tests
 have started but before tests have passed.


Some improvements to SparkQA https://github.com/SparkQA could help with
this. May I suggest:

   1. Include the commit hash in the tests have started/completed
   messages, so that it's clear what code exactly is/has been tested for each
   test cycle.
   2. Pin a message to the start or end of the PR that is updated with
   the status of the PR. Testing not complete; New commits since last
   test; Tests failed; etc. It should be easy for committers to get the
   status of the PR at a glance, without scrolling through the comment history.

Nick


Re: -1s on pull requests?

2014-08-03 Thread Nicholas Chammas
On Sun, Aug 3, 2014 at 11:29 PM, Patrick Wendell pwend...@gmail.com wrote:

Nick - Any interest in doing these? this is all doable from within the
 spark repo itself because our QA harness scripts are in there:

 https://github.com/apache/spark/blob/master/dev/run-tests-jenkins

 If not, could you make a JIRA for them and put it under Project Infra.

I’ll make the JIRA and think about how to do this stuff. I’ll have to
understand what that run-tests-jenkins script does and see how easy it is
to extend.

Nick
​


Re: ASF JIRA is down for maintenance

2014-08-02 Thread Nicholas Chammas
Seems to be back up now.


On Sat, Aug 2, 2014 at 2:06 AM, Patrick Wendell pwend...@gmail.com wrote:

 Please don't let this prevent you from merging patches, just keep a list
 and we can update the JIRA later.

 - Patrick



Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-29 Thread Nicholas Chammas
   - spun up an EC2 cluster successfully using spark-ec2
   - tested S3 file access from that cluster successfully

+1
​


On Tue, Jul 29, 2014 at 1:46 AM, Henry Saputra henry.sapu...@gmail.com
wrote:

 NOTICE and LICENSE files look good
 Hashes and sigs look good
 No executable in the source distribution
 Compile source and run standalone

 +1

 - Henry

 On Fri, Jul 25, 2014 at 4:08 PM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
  Please vote on releasing the following candidate as Apache Spark version
 1.0.2.
 
  This release fixes a number of bugs in Spark 1.0.1.
  Some of the notable ones are
  - SPARK-2452: Known issue is Spark 1.0.1 caused by attempted fix for
  SPARK-1199. The fix was reverted for 1.0.2.
  - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
  HDFS CSV file.
  The full list is at http://s.apache.org/9NJ
 
  The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
 
  The release files, including signatures, digests, etc can be found at:
  http://people.apache.org/~tdas/spark-1.0.2-rc1/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/tdas.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1024/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.0.2!
 
  The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
  [ ] +1 Release this package as Apache Spark 1.0.2
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/



Re: JIRA content request

2014-07-29 Thread Nicholas Chammas
+1 on using JIRA workflows to manage the backlog, and +9000 on having
decent descriptions for all JIRA issues.


On Tue, Jul 29, 2014 at 7:48 PM, Sean Owen so...@cloudera.com wrote:

 How about using a JIRA status like Documentation Required to mean
 burden's on you to elaborate with a motivation and/or PR. This could
 both prompt people to do so, and also let one see when a JIRA has been
 waiting on the reporter for months, rather than simply never been
 looked at, and should thus time out and be closed. Both of these would
 probably help the JIRA backlog.

 On Wed, Jul 30, 2014 at 12:34 AM, Mark Hamstra m...@clearstorydata.com
 wrote:
  Of late, I've been coming across quite a few pull requests and associated
  JIRA issues that contain nothing indicating their purpose beyond a pretty
  minimal description of what the pull request does.  On the pull request
  itself, a reference to the corresponding JIRA in the title combined with
 a
  description that gives us a sketch of what the PR does is fine, but if
  there is no description in at least the JIRA of *why* you think some
 change
  to Spark would be good, then it often makes getting started on code
 reviews
  a little harder for those of us doing the reviews.  So, I'm requesting
 that
  if you are submitting a JIRA or pull request for something that isn't
  obviously a bug or bug fix, you please include some sort of motivation in
  at least the JIRA body so that the reviewers can more easily get through
  the head-scratching phase of trying to figure out why Spark might be
  improved by merging a pull request.



Re: Fraud management system implementation

2014-07-28 Thread Nicholas Chammas
This sounds more like a user list https://spark.apache.org/community.html
question. This is the dev list, where people discuss things related to
contributing code and such to Spark.


On Mon, Jul 28, 2014 at 10:15 AM, jitendra shelar 
jitendra.shelar...@gmail.com wrote:

 Hi,

 I am new to spark. I am learning spark and scala.

 I had some queries.

 1) Can somebody please tell me if it is possible to implement credit
 card fraud management system using spark?
 2) If yes, can somebody please guide me how to proceed.
 3) Shall I prefer Scala or Java for this implementation?

 4) Please suggest me some pointers related to Hidden Markonav Model
 (HMM) and anomaly detection in data mining (using spark).

 Thanks,
 Jitendra



Re: Pull requests will be automatically linked to JIRA when submitted

2014-07-23 Thread Nicholas Chammas
By the way, it looks like there’s a JIRA plugin that integrates it with
GitHub:

   -
   
https://marketplace.atlassian.com/plugins/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin
   -
   
https://confluence.atlassian.com/display/BITBUCKET/Linking+Bitbucket+and+GitHub+accounts+to+JIRA

It does the automatic linking and shows some additional information
https://marketplace-cdn.atlassian.com/files/images/com.atlassian.jira.plugins.jira-bitbucket-connector-plugin/86ff1a21-44fb-4227-aa4f-44c77aec2c97.png
that might be nice to have for heavy JIRA users.

Nick
​


On Sun, Jul 20, 2014 at 12:50 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Yeah it needs to have SPARK-XXX in the title (this is the format we
 request already). It just works with small synchronization script I
 wrote that we run every five minutes on Jeknins that uses the Github
 and Jenkins API:


 https://github.com/apache/spark/commit/49e472744951d875627d78b0d6e93cd139232929

 - Patrick

 On Sun, Jul 20, 2014 at 8:06 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  That's pretty neat.
 
  How does it work? Do we just need to put the issue ID (e.g. SPARK-1234)
  anywhere in the pull request?
 
  Nick
 
 
  On Sat, Jul 19, 2014 at 11:10 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Just a small note, today I committed a tool that will automatically
  mirror pull requests to JIRA issues, so contributors will no longer
  have to manually post a pull request on the JIRA when they make one.
 
  It will create a link on the JIRA and also make a comment to trigger
  an e-mail to people watching.
 
  This should make some things easier, such as avoiding accidental
  duplicate effort on the same JIRA.
 
  - Patrick
 



Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Nicholas Chammas
Contributing to Spark
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
needs a line or two about building and testing PySpark. A call out of
run-tests, for example, would be helpful for new contributors to PySpark.

Nick
​


Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Nicholas Chammas
For the record, the triggering discussion is here
https://github.com/apache/spark/pull/1505#issuecomment-49671550. I
assumed that sbt/sbt test covers all the tests required before submitting a
patch, and it appears that it doesn’t.
​


On Mon, Jul 21, 2014 at 6:42 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Contributing to Spark
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 needs a line or two about building and testing PySpark. A call out of
 run-tests, for example, would be helpful for new contributors to PySpark.

 Nick
 ​



Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Nicholas Chammas
Looks good. Does sbt/sbt test cover the same tests as /dev/run-tests?

I’m looking at step 5 under “Contributing Code”. Someone contributing to
PySpark will want to be directed to run something in addition to (or
instead of) sbt/sbt test, I believe.

Nick
​


On Mon, Jul 21, 2014 at 11:43 PM, Reynold Xin r...@databricks.com wrote:

 I added an automated testing section:

 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting

 Can you take a look to see if it is what you had in mind?



 On Mon, Jul 21, 2014 at 3:54 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  For the record, the triggering discussion is here
  https://github.com/apache/spark/pull/1505#issuecomment-49671550. I
  assumed that sbt/sbt test covers all the tests required before
 submitting a
  patch, and it appears that it doesn’t.
  ​
 
 
  On Mon, Jul 21, 2014 at 6:42 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
   Contributing to Spark
   
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  
   needs a line or two about building and testing PySpark. A call out of
   run-tests, for example, would be helpful for new contributors to
 PySpark.
  
   Nick
   ​
  
 



Re: Pull requests will be automatically linked to JIRA when submitted

2014-07-20 Thread Nicholas Chammas
That's pretty neat.

How does it work? Do we just need to put the issue ID (e.g. SPARK-1234)
anywhere in the pull request?

Nick


On Sat, Jul 19, 2014 at 11:10 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Just a small note, today I committed a tool that will automatically
 mirror pull requests to JIRA issues, so contributors will no longer
 have to manually post a pull request on the JIRA when they make one.

 It will create a link on the JIRA and also make a comment to trigger
 an e-mail to people watching.

 This should make some things easier, such as avoiding accidental
 duplicate effort on the same JIRA.

 - Patrick



Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-17 Thread Nicholas Chammas
On Thu, Jul 17, 2014 at 1:23 AM, Stephen Haberman 
stephen.haber...@gmail.com wrote:

 I'd be ecstatic if more major changes were this well/succinctly
 explained


Ditto on that. The summary of user impact was very nice. It would be good
to repeat that on the user list or release notes when this change goes out.

Nick


ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Just launched an EC2 cluster from git hash
9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
accessing data in S3 yields the following error output.

I understand that NoClassDefFoundError errors may mean something in the
deployment was messed up. Is that correct? When I launch a cluster using
spark-ec2, I expect all critical deployment details to be taken care of by
the script.

So is something in the deployment executed by spark-ec2 borked?

Nick

java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at org.apache.spark.ShuffleDependency.init(Dependency.scala:71)
at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
at 
org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
at 
org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
at 
org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
at 
org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at 
org.apache.spark.rdd.PartitionCoalescer$LocationIterator.init(CoalescedRDD.scala:185)
at 
org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:236)
at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at org.apache.spark.rdd.RDD.take(RDD.scala:1036)
at $iwC$$iwC$$iwC$$iwC.init(console:26)
at $iwC$$iwC$$iwC.init(console:31)
at $iwC$$iwC.init(console:33)
at $iwC.init(console:35)
at init(console:37)
at .init(console:41)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Okie doke--added myself as a watcher on that issue.

On a related note, what are the thoughts on automatically spinning up/down
EC2 clusters and running tests against them? It would probably be way too
cumbersome to do that for every build, but perhaps on some schedule it
could help validate that we are still deploying EC2 clusters correctly.

Would something like that be valuable?

Nick


On Tue, Jul 15, 2014 at 1:19 AM, Patrick Wendell pwend...@gmail.com wrote:

 Yeah - this is likely caused by SPARK-2471.

 On Mon, Jul 14, 2014 at 10:11 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  My guess is that this is related to
  https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library
 gets
  excluded from the SBT assembly jar. I am not sure if the assembly jar
 used
  in EC2 is generated using SBT though.
 
  Shivaram
 
 
  On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson ilike...@gmail.com
 wrote:
 
  This one is typically due to a mismatch between the Hadoop versions --
  i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
  classpath, or something like that. Not certain why you're seeing this
 with
  spark-ec2, but I'm assuming this is related to the issues you posted in
 a
  separate thread.
 
 
  On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
   Just launched an EC2 cluster from git hash
   9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
   accessing data in S3 yields the following error output.
  
   I understand that NoClassDefFoundError errors may mean something in
 the
   deployment was messed up. Is that correct? When I launch a cluster
 using
   spark-ec2, I expect all critical deployment details to be taken care
 of
  by
   the script.
  
   So is something in the deployment executed by spark-ec2 borked?
  
   Nick
  
   java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
   at
  
 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
   at
  
 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
   at
   org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
   at
  
 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
   at
  
 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
   at
 org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
   at
 org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
   at
  
 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
   at org.apache.spark.ShuffleDependency.init(Dependency.scala:71)
   at
   org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
   at
   org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
   at
   org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
   at
  
 
 org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
   at
   org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
   at
  
 
 org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
   at
  
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
   at
  
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350

Re: EC2 clusters ready in launch time + 30 seconds

2014-07-12 Thread Nicholas Chammas
On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico n...@reactor8.com wrote:

 Starting to work through some automation/config stuff for spark stack on
 EC2 with a project, will be focusing the work through the apache bigtop
 effort to start, can then share with spark community directly as things
 progress if people are interested


Let us know how that goes. I'm definitely interested in hearing more.

Nick


EC2 clusters ready in launch time + 30 seconds

2014-07-10 Thread Nicholas Chammas
Hi devs!

Right now it takes a non-trivial amount of time to launch EC2 clusters.
Part of this time is spent starting the EC2 instances, which is out of our
control. Another part of this time is spent installing stuff on and
configuring the instances. This, we can control.

I’d like to explore approaches to upgrading spark-ec2 so that launching a
cluster of any size generally takes only 30 seconds on top of the time to
launch the base EC2 instances. Since Amazon can launch instances
concurrently, I believe this means we should be able to launch a fully
operational Spark cluster of any size in constant time. Is that correct?

Do we already have an idea of what it would take to get to that point?

Nick
​


<    1   2   3   4   5