Re: Signal/Noise Ratio
Hey Chris,

Would the following be consistent with the Apache guidelines?

(a) We establish a culture of not having overall design discussions on github. Design discussions should occur on JIRA or on the dev list. IMO this is pretty much already true, but there are a few exceptions.

(b) We add a mailing list called github@s.a.o which receives the github traffic. This way everything is available in Apache infra.

(c) Because of our use of JIRA it might make sense to have an issues@s.a.o list as well, similar to what YARN and other projects use.

The github chatter is so noisy that I think, overall, it decreases engagement with the official developer list. This is the opposite of what we want.

- Patrick

On Sat, Feb 22, 2014 at 11:34 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Hi Everyone,

The biggest thing is simply making sure that the dev@project.a.o list is meaningful, and that meaningful development isn't going on elsewhere that constitutes decisions for the Apache project as reified in code contributions and overall stewardship of the effort. I noticed in a few emails from Github relating to comments on Github Pull Requests some conversation which I deemed to be relevant to the project, so I brought this up, and it came up during graduation.

Here's a general rule of thumb: it's fine if devs converse, e.g., on Github, etc., and even if it's project discussion, *so long as* that relevant project discussion makes its way in some form to the actual, bona fide project's dev@project.a.o list, giving others in the community (those not necessarily on Github, or watching Github, or part of that non-Apache conversation) the chance to comment and be part of the community-led decisions for the project there. Making its way to that bona fide Apache project dev list can happen in several ways:

1. By simply mapping, 1:1, the Github comments on which I see Apache-project-related dev discussion from time to time (and which I believe fit the criteria I'm describing above) to the project's dev@project.a.o list.

2. By not 1:1 mapping all Github conversation to the dev@project.a.o list, but to some other list, e.g., github@project.a.o (or any of the others being discussed), *so long as*, and this is key, those discussions on Github get summarized on the dev@project.a.o list, giving everyone an opportunity to participate in the development by being *here at Apache*.

3. By not worrying about Github at all and simply doing all the development here at the ASF.

4. Others...

My feeling is that some combination of #1 and #2 can pass muster, and the Apache Spark community can decide. That said, noise reduction can also lead to loss of precision and accuracy, so don't be surprised if, in reducing that noise, some key thing makes it onto a Github PR but doesn't make it onto the dev list b/c we are all human and forget to summarize it there. Even if that happens, we assume everyone has good intentions and we simply address those issues when/if they come up.

Cheers,
Chris

-----Original Message-----
From: Sandy Ryza sandy.r...@cloudera.com
Reply-To: dev@spark.incubator.apache.org
Date: Saturday, February 22, 2014 11:19 AM
To: dev@spark.incubator.apache.org
Subject: Re: Signal/Noise Ratio

Hadoop subprojects (MR, YARN, HDFS) each have a dev list that contains discussion as well as a single email whenever a JIRA is filed, and an issues list with all the JIRA activity. I think this works out pretty well. Subscribing just to the dev list, I can keep up with changes that are going to be made and follow the ones I care about. And the issues list is there if I want the firehose.

Is Apache actually prescriptive that a list with "dev" in its name needs to contain all discussion? If so, most projects I've followed are violating this.

On Fri, Feb 21, 2014 at 7:54 PM, Kay Ousterhout k...@eecs.berkeley.edu wrote:

It looks like there's at least one other Apache project, jclouds, that sends the github notifications to a separate notifications@ list (see http://mail-archives.apache.org/mod_mbox/incubator-general/201402.mbox/%3C1391721862.67613.YahooMailNeo%40web172602.mail.ir2.yahoo.com%3E). Given that many people are annoyed by getting the messages on this list, and that there is some precedent for sending them to a different list, I'd be in favor of doing that.

On Fri, Feb 21, 2014 at 6:18 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Sweet, great job Reynold.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-283, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Re: Signal/Noise Ratio
btw - I'd prefer reviews@s.a.o instead of github@ to remain more neutral and flexible.

On Sat, Feb 22, 2014 at 12:35 PM, Patrick Wendell pwend...@gmail.com wrote:

snip
Re: Signal/Noise Ratio
Hey All,

I created a JIRA to ask infra to create a dedicated reviews@ mailing list for this purpose: https://issues.apache.org/jira/browse/INFRA-7368

Hopefully they can migrate the github stream to this list so that people can distinguish it from developer discussions. In parallel, we are also trying to see if we can use the github status notifier rather than the constant comments from Jenkins.

- Patrick

On Sat, Feb 22, 2014 at 1:04 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Patrick, +1 to the below. Great summary, and yes I think that would work great.

Cheers,
Chris

snip
Re: Request to review PR #605
Hey Punya,

It's sufficient to just ping the request on github rather than e-mail the dev list. Sometimes it can take a few days for people to get to looking at patches...

- Patrick

On Sat, Feb 22, 2014 at 5:17 PM, Punya Biswal pbis...@palantir.com wrote:

Hi all,

Can someone review and/or merge PR #605 (convert or move Java code)? It's been sitting for four days. Thanks!

Punya
Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark
Hey Everyone,

We are going to publish artifacts to Maven central in the exact same format no matter which build system we use. For normal consumers of Spark, {maven vs sbt} won't make a difference. It will make a difference for people who are extending the Spark build to do their own packaging. This is what I'm trying to gauge: does anyone do this in a way where they feel only Maven or only sbt supports their particular use case?

- Patrick

On Fri, Feb 21, 2014 at 12:40 AM, Pascal Voitot Dev pascal.voitot@gmail.com wrote:

Hi,

My small contrib to the discussion: sbt is able to publish Maven artifacts, generating the POM and all the signed JAR files. So even if a POM is not in the project, one can be found somewhere.

Pascal

On Fri, Feb 21, 2014 at 9:28 AM, Paul Brown p...@mult.ifario.us wrote:

As a customer of the code, I don't care *how* the code gets built, but it is important to me that the Maven artifacts (POM files, binaries, sources, javadocs) are clean, accurate, up to date, and published on Maven Central. Some examples where structure/publishing failures have been bad for users:

- For a long time (and perhaps still), Solr and Lucene were built by an Ant build that produced incorrect POMs and required potential developers to manually configure their IDEs.
- For a long time (and perhaps still), Pig was built by Ant, published incorrect POMs, and failed to publish useful auxiliary artifacts like PigUnit and the PiggyBank as Maven-addressable artifacts. (That said, thanks to Spark, we no longer use Pig...)
- For a long time (and perhaps still), Cassandra depended on non-generally-available libraries (high-scale, etc.) that made it inconvenient to embed Cassandra in a larger system. Cassandra gets a little slack because the build/structure was almost too terrible to look at prior to incubation, and it's gotten better...

And those are just a few projects at Apache that come to mind; I could make a longish list of offenders.

btw, among other things that the Spark project probably *should* do would be to publish artifacts with a classifier to distinguish the Hadoop version linked against.

I'll be a happy user of sbt-built artifacts, or if the project goes/sticks with Maven I'm more than willing to help answer questions or provide PRs for stickier items around assemblies, multiple artifacts, etc.

-- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

On Thu, Feb 20, 2014 at 11:56 PM, Sean Owen so...@cloudera.com wrote:

Two builds is indeed a pain, since it's an ongoing chore to keep them in sync. For example, I am already seeing that the two do not quite declare the same dependencies (see recent patch).

I think publishing artifacts to Maven Central should be considered a hard requirement, if it isn't already one from the ASF, and it may be? Certainly most people out there would be shocked if you told them Spark is not in the repo at all. And that requires at least maintaining a POM that declares the structure of the project. This does not necessarily mean using Maven to build, but it is a reason that removing the POM is going to make this a lot harder for people to consume as a project.

Maven has its pros and cons, but there are plenty of people lurking around who know it quite well. Certainly it's easier for the Hadoop people to understand and work with. On the other hand, it supports Scala only via a plugin, which is weaker support. sbt seems like a fairly new, basic, ad-hoc tool. Is there an advantage to it, other than being Scala (which is an advantage)?
-- Sean Owen | Director, Data Science | London

On Fri, Feb 21, 2014 at 4:03 AM, Patrick Wendell pwend...@gmail.com wrote:

snip
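[Editorially, Pascal's point about sbt publishing Maven-style artifacts can be made concrete with a minimal build.sbt sketch. The coordinates and repository URL below are illustrative assumptions, not Spark's actual build definition:]

  // build.sbt sketch: sbt generating and deploying a Maven-style POM.
  // All coordinates and the staging URL here are illustrative assumptions.
  organization := "org.apache.spark"
  name := "spark-core"
  version := "1.0.0-SNAPSHOT"
  scalaVersion := "2.10.3"

  publishMavenStyle := true        // emit a POM alongside the jars
  publishArtifact in Test := false // don't publish test jars

  publishTo := Some("staging" at
    "https://repository.apache.org/service/local/staging/deploy/maven2")

With settings along these lines, "sbt publish" pushes the POM, main jar, sources, and javadoc artifacts, which is how an sbt-only project could still satisfy Sean's hard requirement of clean artifacts on Maven Central.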
Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark
Kos - thanks for chiming in. Could you be more specific about what is available in Maven and not in sbt for these issues? I took a look at the Bigtop code relating to Spark. As far as I could tell, [1] was the main point of integration with the build system (maybe there are other integration points)?

- In order to integrate Spark well into the existing Hadoop stack, it was necessary to have a way to avoid transitive dependency duplications and possible conflicts. E.g. the Maven assembly allows us to avoid adding _all_ Hadoop libs and later merely declare a Spark package dependency on the standard Bigtop Hadoop packages. And yes - Bigtop packaging means the naming and layout would be standard across all commercial Hadoop distributions that are worth mentioning: ASF Bigtop convenience binary packages, and Cloudera or Hortonworks packages. Hence, the downstream user doesn't need to spend any effort to make sure that Spark clicks in properly.

The sbt build also allows you to plug in a Hadoop version, similar to the Maven build.

- Maven provides a relatively easy way to deal with the jar-hell problem, although the original Maven build was just shading everything into a huge lump of class files, oftentimes ending up with classes slamming on top of each other from different transitive dependencies.

AFAIK we are only using the Shade plug-in to deal with conflict resolution in the assembly jar. These are dealt with in sbt via the sbt-assembly plug-in in an identical way. Is there a difference?

[1] https://git-wip-us.apache.org/repos/asf?p=bigtop.git;a=blob;f=bigtop-packages/src/common/spark/do-component-build;h=428540e0f6aa56cd7e78eb1c831aa7fe9496a08f;hb=master
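[For readers comparing the two mechanisms, a hedged sketch of the sbt-assembly side of that conflict resolution, using the sbt-assembly 0.x API of that era; the settings are illustrative, not Spark's actual build:]

  // build.sbt sketch: conflict resolution in the assembly jar via sbt-assembly.
  import sbtassembly.Plugin._
  import AssemblyKeys._

  assemblySettings

  // Mark Hadoop as "provided" so the fat jar doesn't bundle it; Bigtop-style
  // packages can then depend on the platform's own Hadoop packages.
  libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"

  // Decide what happens when multiple dependencies ship the same file path.
  mergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case "reference.conf"              => MergeStrategy.concat
    case _                             => MergeStrategy.first
  }

This plays the same role as the Maven Shade plug-in's filters and transformers, which is the equivalence Patrick is asking about.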
Re: Planned 0.9.1 release
We backport bug fixes into the 0.9 branch as they come in, so if there is a particular fix you want, you can always build from the head of branch-0.9 and expect only stability improvements compared with Spark 0.9.0.

The timing of the maintenance releases depends a bit on what bug fixes come in and their importance. I'm thinking we should propose a release pretty soon (on the order of weeks) since there are some valuable bug fixes that came in this week.

- Patrick

On Fri, Feb 21, 2014 at 2:22 PM, Gary Malouf malouf.g...@gmail.com wrote:

My team has avoided upgrading to 0.9 to this point because of the Mesos bug that has since been fixed in master. For ease of tracking, we are trying to only use tagged releases going forward, as long as they continue to be frequent or become more stable over time. Is there any timeline on cutting a tag for the 0.9.1 bug fix release?
[DISCUSS] Necessity of Maven *and* SBT Build in Spark
Hey All,

It's very high overhead having two build systems in Spark. Before getting into a long discussion about the merits of sbt vs maven, I wanted to pose a simple question to the dev list:

Is there anyone who feels that dropping either sbt or maven would have a major consequence for them?

And by "major consequence" I mean something becomes completely impossible now and can't be worked around. This is different from an inconvenience, i.e., something which can be worked around but will require some investment.

I'm posing the question in this way because, if there are features in either build system that are absolutely unavailable in the other, then we'll have to maintain both for the time being. I'm merely trying to see whether this is the case...

- Patrick
Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark
Hey Henry,

Yep, I wanted to reboot this since some time has passed and people may have new or changed ways of using the build. Maven makes the Apache publishing fairly seamless, but after the last two releases I believe we could make it work with sbt as well. sbt also supports publishing, and other Apache projects such as Kafka publish with sbt.

On Thu, Feb 20, 2014 at 8:50 PM, Henry Saputra henry.sapu...@gmail.com wrote:

Thanks for bringing back the build systems discussion, Patrick.

There was a long discussion way back before Spark joined the ASF, and as I remember there was no clear winner between using sbt or maven. Maven makes it easier to publish the artifacts to the Nexus repository (not sure if sbt can do the same), and as I remember one of the limitations or drawbacks of maven is the use of profiles. Matei had suggested using some kind of Hadoop client detection, as in the Parquet project, to manage the Hadoop versions and avoid profiles.

- Henry

On Thu, Feb 20, 2014 at 8:03 PM, Patrick Wendell pwend...@gmail.com wrote:

snip
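[On Henry's point about profiles, a minimal sketch of how an sbt build can pick the Hadoop version from a property rather than Maven-style profiles. The property name and default are illustrative assumptions, not Spark's actual mechanism:]

  // build.sbt sketch: select the Hadoop client version without profiles.
  // "hadoop.version" and the 1.0.4 default are illustrative assumptions.
  val hadoopVersion = sys.props.getOrElse("hadoop.version", "1.0.4")

  libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion

  // Usage: sbt -Dhadoop.version=2.2.0 assembly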
Re: coding style discussion: explicit return type in public APIs
+1 overall.

Christopher - I agree that once the number of rules becomes large it's more efficient to pursue a "use your judgement" approach. However, since this is only 3 cases, I'd prefer to wait to see if it grows. The concern with this approach is that for newer people, contributors, etc. it's hard for them to understand what good judgement is. Many are new to Scala, so explicit rules are generally better.

- Patrick

On Wed, Feb 19, 2014 at 12:19 AM, Reynold Xin r...@databricks.com wrote:

Yes, the case you brought up is not a matter of readability or style. If it returns a different type, it should be declared (otherwise it is just wrong).

On Wed, Feb 19, 2014 at 12:17 AM, Mridul Muralidharan mri...@gmail.com wrote:

You are right. A degenerate case would be:

def createFoo = new FooImpl()

vs

def createFoo: Foo = new FooImpl()

The former will cause API instability. Reynold, maybe this is already avoided - and I understood it wrong?

Thanks, Mridul

On Wed, Feb 19, 2014 at 12:44 PM, Christopher Nguyen c...@adatao.com wrote:

Mridul, IIUUC, what you've mentioned did come to mind, but I deemed it orthogonal to the stylistic issue Reynold is talking about.

I believe you're referring to the case where there is a specific desired return type by API design, but the implementation does not declare it, in which case, of course, one must define the return type. That's an API requirement and not just a matter of readability. We could add this as an NB in the proposed guideline.

-- Christopher T. Nguyen, Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen

On Tue, Feb 18, 2014 at 10:40 PM, Reynold Xin r...@databricks.com wrote:

+1 on Christopher's suggestion.

Mridul, how would that happen? Case 3 requires the method to be invoking the constructor directly. It was implicit in my email, but the return type should be the same as the class itself.

On Tue, Feb 18, 2014 at 10:37 PM, Mridul Muralidharan mri...@gmail.com wrote:

Case 3 can be a potential issue. The current implementation might be returning a concrete class which we might want to change later - making it a type change. The intention might be to return an RDD (for example), but the inferred type might be a subclass of RDD - and future changes will cause a signature change.

Regards, Mridul

On Wed, Feb 19, 2014 at 11:52 AM, Reynold Xin r...@databricks.com wrote:

Hi guys,

I want to bring this issue to the table to see what other members of the community think, and then we can codify it in the Spark coding style guide. The topic is declaring return types explicitly in public APIs.

In general I think we should favor explicit type declaration in public APIs. However, I do think there are 3 cases where we can avoid the public API definition, because in these 3 cases the types are self-evident or repetitive:

Case 1. toString

Case 2. A method returning a string, or a val defining a string:

def name = "abcd" // this is so obvious that it is a string
val name = "edfg" // this too

Case 3. The method or variable is invoking the constructor of a class and returning that immediately. For example:

val a = new SparkContext(...)
implicit def rddToAsyncRDDActions[T: ClassTag](rdd: RDD[T]) = new AsyncRDDActions(rdd)

Thoughts?
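[Putting Mridul's concern and Reynold's Case 3 side by side, a small hedged sketch; Foo, FooImpl, and Api are illustrative names, not Spark classes:]

  // Illustrative sketch of the return-type pitfall discussed above.
  trait Foo
  class FooImpl extends Foo

  class Api {
    // Case 2: self-evidently a String; an annotation adds nothing.
    def name = "abcd"

    // Case 3 gone wrong: the inferred public type is FooImpl, so swapping
    // the implementation class later silently changes the signature.
    def createFooRisky = new FooImpl

    // The stable form: declare the intended API type explicitly.
    def createFoo: Foo = new FooImpl
  }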
Re: Adding my wiki user id (hsaputra) as contributors in Apache Spark confluence wiki space
Hey Henry,

Ya, unfortunately I have no idea how to do this!

On Thu, Feb 13, 2014 at 9:54 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

I can help out here as well. I am trying to develop docs around setting up Spark, Streaming and Shark, currently doing it on my wiki (docs.sigmoidanalytics.com). Would love to contribute.

Regards,
Mayur

Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi

On Thu, Feb 13, 2014 at 8:28 AM, Henry Saputra henry.sapu...@gmail.com wrote:

Hi Andy,

Could you or someone with the space admin role in the Spark wiki [1] kindly help to add my userid hsaputra as a collaborator to edit/add new content in the Spark wiki space? I believe Andy's userid was granted space admin rights to the wiki.

Thank you,
- Henry

[1] https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
Re: [GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...
I think Aaron just meant 1.0.0 by "the next minor release."

On Tue, Feb 11, 2014 at 7:56 PM, Mark Hamstra m...@clearstorydata.com wrote:

"The situation sounds fine for the next minor release..."

I don't understand what you mean by this. According to my current understanding, the next release of Spark other than maintenance releases on 0.9.x is intended to be a major release, 1.0.0, and there are no plans for an intervening minor release, which would be 0.10.0. Thus the next minor release would be 1.1.0, and I fail to see why we would wait for that instead of putting the dependency change (assuming that it is something that we do, indeed, want) in 1.0.0.

On Tue, Feb 11, 2014 at 7:51 PM, aarondav g...@git.apache.org wrote:

Github user aarondav commented on the pull request: https://github.com/apache/incubator-spark/pull/582#issuecomment-34836430

Thanks for looking into it! The situation sounds fine for the next minor release, and I don't think this patch needs to be included in the next maintenance release anyway (following your very own [suggestion](http://mail-archives.apache.org/mod_mbox/spark-dev/201402.mbox/browser) on the dev list).

While this patch looks good to me, I am not sure I fully understand the need for it. I posted my question on the [dev list thread](http://mail-archives.apache.org/mod_mbox/spark-dev/201402.mbox/%3C945190638.685798.1391974088596.JavaMail.zimbra%40redhat.com%3E). Besides the dependency change, you also mention performance improvements. [This benchmark](http://engineering.ooyala.com/blog/comparing-scala-json-libraries) does show Jackson outperforming lift on a particular workload, but do you have another source showing how the relative performance changes with input size?
Re: Github merge script
Hey Andrew,

The intent was to be consistent with the way the merge messages looked before. But I agree it obfuscates the commit messages from the user and hides them further down. I think your proposal is good, but it might be better to use the title of their pull request rather than the first line of the most recent commit in their branch (not sure what you meant by "commit message"). Maybe you could submit a pull request for this? The script we use to merge things is in dev/merge_spark_pr.py.

Another nice thing is if people are formatting their titles with JIRAs, then it will all look nice and pretty... which is kind of the goal.

- Patrick

On Sun, Feb 9, 2014 at 11:55 PM, Andrew Ash and...@andrewash.com wrote:

The current script for merging a GitHub PR squashes the commits and sticks a "Merge pull request #123 from abc/def" at the top of the commit message. However this obscures the original commit message when doing a short git log (first line only), so the recent history is much less meaningful than before.

Compare recent history A:

* 919bd7f Prashant Sharma 86 minutes ago (origin/master, origin/HEAD) Merge pull request #567 from ScrapCodes/style2.
* 2182aa3 Martin Jaggi 8 hours ago Merge pull request #566 from martinjaggi/copy-MLlib-d.
* afc8f3c qqsun8819 10 hours ago Merge pull request #551 from qqsun8819/json-protocol.
* 94ccf86 Patrick Wendell 10 hours ago Merge pull request #569 from pwendell/merge-fixes.
* b69f8b2 Patrick Wendell 14 hours ago Merge pull request #557 from ScrapCodes/style. Closes #557.
* b6dba10 CodingCat 24 hours ago Merge pull request #556 from CodingCat/JettyUtil. Closes #556.
| * de22abc jyotiska 24 hours ago (origin/branch-0.9) Merge pull request #562 from jyotiska/master. Closes #562.
* | 2ef37c9 jyotiska 24 hours ago Merge pull request #562 from jyotiska/master. Closes #562.
| * 2e3d1c3 Patrick Wendell 24 hours ago Merge pull request #560 from pwendell/logging. Closes #560.
* | b6d40b7 Patrick Wendell 24 hours ago Merge pull request #560 from pwendell/logging. Closes #560.
* | f892da8 Patrick Wendell 25 hours ago Merge pull request #565 from pwendell/dev-scripts. Closes #565.
* | c2341c9 Mark Hamstra 32 hours ago Merge pull request #542 from markhamstra/versionBump. Closes #542.
| * 22e0a3b Qiuzhuang Lian 35 hours ago Merge pull request #561 from Qiuzhuang/master. Closes #561.
* | f0ce736 Qiuzhuang Lian 35 hours ago Merge pull request #561 from Qiuzhuang/master. Closes #561.
* | 7805080 Jey Kottalam 35 hours ago Merge pull request #454 from jey/atomic-sbt-download. Closes #454.
* | fabf174 Martin Jaggi 2 days ago Merge pull request #552 from martinjaggi/master. Closes #552.
* | 3a9d82c Andrew Ash 3 days ago Merge pull request #506 from ash211/intersection. Closes #506.
| * ce179f6 Andrew Or 3 days ago Merge pull request #533 from andrewor14/master. Closes #533.

To B: if you go back some time in history, you get a much more branched history, like this:

| * | | | | | | | | 0984647 Patrick Wendell 4 weeks ago Enable compression by default for spills
|/ / / / / / / / /
| * | | | | | | | 4e497db Tathagata Das 4 weeks ago Removed StreamingContext.registerInputStream and registerOutputStream - they were useless as InputDStream has been made to register itself. Also made DS
* | | | | | | | | fdaabdc Patrick Wendell 4 weeks ago Merge pull request #380 from mateiz/py-bayes
|\ \ \ \ \ \ \ \ \
| | | | | * | | | | c2852cf Frank Dai 4 weeks ago Indent two spaces
* | | | | | | | | | 4a805af Patrick Wendell 4 weeks ago Merge pull request #367 from ankurdave/graphx
|\ \ \ \ \ \ \ \ \ \
| * | | | | | | | | | 80e73ed Joseph E. Gonzalez 4 weeks ago Adding minimal additional functionality to EdgeRDD
* | | | | | | | | | | 945fe7a Patrick Wendell 4 weeks ago Merge pull request #408 from pwendell/external-serializers
|\ \ \ \ \ \ \ \ \ \ \
| | * | | | | | | | | | 4bafc4f Joseph E. Gonzalez 4 weeks ago adding documentation about EdgeRDD
* | | | | | | | | | | | 68641bc Patrick Wendell 4 weeks ago Merge pull request #413 from rxin/scaladoc
|\ \ \ \ \ \ \ \ \ \ \ \
| | | | | | | | * | | | | 12386b3 Frank Dai 4 weeks ago Since getLong() and getInt() have side effect, get back parentheses, and remove an empty line
| | | | | | | | * | | | | 0d94d74 Frank Dai 4 weeks ago Code clean up for mllib
* | | | | | | | | | | | | 0ca0d4d Patrick Wendell 4 weeks ago Merge pull request #401 from andrewor14/master
|\ \ \ \ \ \ \ \ \ \ \ \ \
| | | | * | | | | | | | | | af645be Ankur Dave 4 weeks ago Fix all code examples in guide
| | | | * | | | | | | | | | 2cd9358 Ankur Dave 4 weeks ago Finish 6f6f8c928ce493357d4d32e46971c5e401682ea8
* | | | | | | | | | | | | | 08b9fec Patrick Wendell 4 weeks ago Merge pull request #409 from tdas/unpersist

Ignoring the merge commits, the commit messages are much better here than in the current setup, because they're what the original author wrote. Not a pretty generic
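[A hedged sketch of the format Andrew is proposing; the helper and fields below are hypothetical, and the real logic lives in dev/merge_spark_pr.py, which is Python:]

  // Hypothetical sketch: lead the squashed commit with the PR title
  // (ideally carrying a JIRA id) instead of a generic merge line.
  case class PullRequest(number: Int, title: String, author: String, branch: String)

  def squashedCommitMessage(pr: PullRequest): String =
    s"""${pr.title}
       |
       |Author: ${pr.author}
       |
       |Closes #${pr.number} from ${pr.branch} and squashes commits.""".stripMargin

Under that scheme, a short git log shows "SPARK-1078: Replace lift-json with jackson" rather than "Merge pull request #582 from ...".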
Re: [VOTE] Graduation of Apache Spark from the Incubator
+1

To clarify to others, this is an IPMC vote, so only the IPMC votes are binding :)

On Mon, Feb 10, 2014 at 10:02 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

+1

On Mon, Feb 10, 2014 at 9:57 PM, Mark Hamstra m...@clearstorydata.com wrote:

+1

On Mon, Feb 10, 2014 at 8:27 PM, Chris Mattmann mattm...@apache.org wrote:

Hi Everyone,

This is a new VOTE to decide if Apache Spark should graduate from the Incubator. Please VOTE on the resolution pasted below the ballot. I'll leave this VOTE open for at least 72 hours.

Thanks!

[ ] +1 Graduate Apache Spark from the Incubator.
[ ] +0 Don't care.
[ ] -1 Don't graduate Apache Spark from the Incubator because...

Here is my +1 binding for graduation.

Cheers,
Chris

snip

WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software, for distribution at no charge to the public, related to fast and flexible large-scale data analysis on clusters.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the "Apache Spark Project", be and hereby is established pursuant to Bylaws of the Foundation; and be it further

RESOLVED, that the Apache Spark Project be and hereby is responsible for the creation and maintenance of software related to fast and flexible large-scale data analysis on clusters; and be it further

RESOLVED, that the office of "Vice President, Apache Spark" be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Spark Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Spark Project; and be it further

RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Spark Project:

* Mosharaf Chowdhury mosha...@apache.org
* Jason Dai jason...@apache.org
* Tathagata Das t...@apache.org
* Ankur Dave ankurd...@apache.org
* Aaron Davidson a...@apache.org
* Thomas Dudziak to...@apache.org
* Robert Evans bo...@apache.org
* Thomas Graves tgra...@apache.org
* Andy Konwinski and...@apache.org
* Stephen Haberman steph...@apache.org
* Mark Hamstra markhams...@apache.org
* Shane Huang shane_hu...@apache.org
* Ryan LeCompte ryanlecom...@apache.org
* Haoyuan Li haoy...@apache.org
* Sean McNamara mcnam...@apache.org
* Mridul Muralidharan mridul...@apache.org
* Kay Ousterhout kayousterh...@apache.org
* Nick Pentreath mln...@apache.org
* Imran Rashid iras...@apache.org
* Charles Reiss wog...@apache.org
* Josh Rosen joshro...@apache.org
* Prashant Sharma prash...@apache.org
* Ram Sriharsha har...@apache.org
* Shivaram Venkataraman shiva...@apache.org
* Patrick Wendell pwend...@apache.org
* Andrew Xia xiajunl...@apache.org
* Reynold Xin r...@apache.org
* Matei Zaharia ma...@apache.org

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Matei Zaharia be appointed to the office of Vice President, Apache Spark, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed; and be it further

RESOLVED, that the Apache Spark Project be and hereby is tasked with the migration and rationalization of the Apache Incubator Spark podling; and be it further

RESOLVED, that all responsibilities pertaining to the Apache Incubator Spark podling encumbered upon the Apache Incubator Project are hereafter discharged.
Re: [TODO] Document the release process for Apache Spark
Done, thanks. Feel free to edit it directly as well :)

On Sat, Feb 8, 2014 at 11:28 PM, Henry Saputra henry.sapu...@gmail.com wrote:

Cool! Thanks Patrick. Looks good to me.

Just one small recommendation about "Get Access to Apache Nexus for Publishing Artifacts": as I remember, you need to file an INFRA ticket [1] for your Apache id to get it? If so, it's probably a good idea to add that to the wiki.

- Henry

[1] https://issues.apache.org/jira

On Sat, Feb 8, 2014 at 9:42 PM, Patrick Wendell pwend...@gmail.com wrote:

I ported the release docs to the wiki today. Thanks for reminding me about this Henry: https://cwiki.apache.org/confluence/display/SPARK/Preparing+Spark+Releases

- Patrick

On Fri, Feb 7, 2014 at 11:51 AM, Henry Saputra henry.sapu...@gmail.com wrote:

Cool, thanks Patrick! Really appreciate it =)

- Henry

On Fri, Feb 7, 2014 at 11:46 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey Henry,

Let me document this on the wiki. I've already kept pretty thorough docs on this; I just need to migrate them to the wiki. I've created a JIRA here: https://spark-project.atlassian.net/browse/SPARK-1066

- Patrick

On Fri, Feb 7, 2014 at 11:35 AM, Henry Saputra henry.sapu...@gmail.com wrote:

Hi Patrick,

As part of the unofficial checklist for graduation, we need to have documented steps to make a release. As the first and so far only RE for Apache Spark, I would like to ask for your help to document the steps to release. This will help other members do the release and take turns, to make sure all future PMC members and committers know how to do an Apache Spark release.

Most of the steps are probably similar to other projects, but it is always useful for each podling to have its own documentation for releasing artifacts.

Really appreciate your help.

Thanks,
- Henry
Re: How to write test cases for the functionalities which involves actor communication
It's possible to mock out actors... we have a few examples in the code base. One is here: https://github.com/apache/incubator-spark/blob/master/core/src/test/scala/org/apache/spark/deploy/worker/WorkerWatcherSuite.scala

On Sun, Feb 9, 2014 at 6:21 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

Hi, all

I have a question when trying to write some test cases for my PR. The key functionality in the PR involves actor communication between master and worker: the worker does something and returns the result to the master via a message, and I want to test whether the master does the right thing according to the number of workers existing in the cluster and the result returned from the worker. Is there any way to test this via some test cases?

Thank you

Best,
-- Nan Zhu
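[For flavor, a hedged sketch of this kind of test with Akka's TestKit, as used in Spark's test suites of that era. The actors and messages below are illustrative stand-ins, not Spark's actual master/worker protocol; see WorkerWatcherSuite for a real example:]

  // Hypothetical master/worker protocol, for illustration only.
  import akka.actor.{Actor, ActorSystem, Props}
  import akka.testkit.{TestActorRef, TestProbe}

  case class RegisterWorker(id: String)
  case class WorkerRegistered(id: String)

  class MasterActor extends Actor {
    var workers = Set.empty[String]
    def receive = {
      case RegisterWorker(id) =>
        workers += id
        sender ! WorkerRegistered(id) // reply to the "worker"
    }
  }

  object MasterActorTest extends App {
    implicit val system = ActorSystem("test")
    val probe = TestProbe()                            // stands in for a worker
    val master = TestActorRef[MasterActor](Props[MasterActor])
    probe.send(master, RegisterWorker("worker-1"))
    probe.expectMsg(WorkerRegistered("worker-1"))      // assert the reply
    assert(master.underlyingActor.workers == Set("worker-1")) // assert state
    system.shutdown()
  }

TestActorRef makes message processing synchronous, so the master's internal state can be asserted right after the send.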
[SUMMARY] Proposal for Spark Release Strategy
Hey All,

Thanks to everyone who participated in this thread. I've distilled feedback based on the discussion and wanted to summarize the conclusions:

- People seem universally +1 on semantic versioning in general.
- People seem universally +1 on having public merge windows for releases.
- People seem universally +1 on a policy of having JIRAs associated with features.
- Everyone believes link-level compatibility should be the goal. Some people think we should outright promise it now. Others think we should either not promise it or promise it later.
-- Compromise: let's do one minor release (1.0 to 1.1) to convince ourselves this is possible (some issues with Scala traits will make this tricky). Then we can codify it in writing. I've created SPARK-1069 [1] to clearly establish that this is the goal for the 1.X family of releases.
- Some people think we should add particular features before having 1.0.
-- Version 1.X indicates API stability rather than a feature set; this was clarified.
-- That said, people still have several months to work on features if they really want to get them in for this release.

I'm going to integrate this feedback and post a tentative version of the release guidelines to the wiki.

With all this said, I would like to move the master version to 1.0.0-SNAPSHOT, as the main concerns with this have been addressed and clarified. This merely represents a tentative consensus, and the release is still subject to a formal vote amongst PMC members.

[1] https://spark-project.atlassian.net/browse/SPARK-1069

- Patrick
Re: [SUMMARY] Proposal for Spark Release Strategy
:P - I'm pretty sure this can be done but it will require some work - we already use the github API in our merge script, and we could hook something like that up with the Jenkins tests. Henry, maybe you could create a JIRA for this for Spark 1.0?

- Patrick

On Sat, Feb 8, 2014 at 3:20 PM, Mark Hamstra m...@clearstorydata.com wrote:

I know that it can be done -- which is different from saying that I know how to set it up.

On Feb 8, 2014, at 2:57 PM, Henry Saputra henry.sapu...@gmail.com wrote:

Patrick, do you know if there is a way to check whether a Github PR's subject/title contains a JIRA number, and have Jenkins raise a warning if it doesn't?

- Henry

On Sat, Feb 8, 2014 at 12:56 PM, Patrick Wendell pwend...@gmail.com wrote:

snip
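[The check Henry describes is a one-liner once the PR title is in hand; a hedged sketch follows, with the regex and wiring illustrative rather than an existing Jenkins hook:]

  // Sketch: flag pull requests whose title lacks a JIRA id.
  object PrTitleCheck extends App {
    val jiraId = """(?i)\bSPARK-\d+\b""".r
    def hasJiraId(title: String): Boolean = jiraId.findFirstIn(title).isDefined

    assert(hasJiraId("SPARK-1078: Replace lift-json with jackson"))
    assert(!hasJiraId("Fix typo in docs")) // this one would get a warning
  }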
Re: [TODO] Document the release process for Apache Spark
I ported the release docs to the wiki today. Thanks for reminding me about this Henry: https://cwiki.apache.org/confluence/display/SPARK/Preparing+Spark+Releases

- Patrick

On Fri, Feb 7, 2014 at 11:51 AM, Henry Saputra henry.sapu...@gmail.com wrote:

snip
Re: 0.9.0 forces log4j usage
Hey Paul,

Thanks for digging this up. I worked on this feature, and the intent was to give users good default behavior if they didn't include any logging configuration on the classpath. The problem with assuming that command-line tooling is going to handle it is that many people link against Spark as a library and run their application using their own scripts. In that case, the first thing people see when they run an application that links against Spark is a big ugly logging warning.

I'm not super familiar with log4j-over-slf4j, but this behavior of returning no appenders for the ROOT logger seems a little weird. What is the use case for using this and not just directly using slf4j-log4j12 like Spark itself does?

Did you have a more general fix for this in mind? Or was your plan to just revert the existing behavior... We might be able to add a configuration option to disable this logging default stuff. Or we could just rip it out - but I'd like to avoid that if possible.

- Patrick

On Thu, Feb 6, 2014 at 11:41 PM, Paul Brown p...@mult.ifario.us wrote:

We have a few applications that embed Spark, and in 0.8.0 and 0.8.1 we were able to use slf4j, but 0.9.0 broke that and unintentionally forces direct use of log4j as the logging backend.

The issue is here in the org.apache.spark.Logging trait: https://github.com/apache/incubator-spark/blame/master/core/src/main/scala/org/apache/spark/Logging.scala#L107

log4j-over-slf4j *always* returns an empty enumeration for appenders to the ROOT logger: https://github.com/qos-ch/slf4j/blob/master/log4j-over-slf4j/src/main/java/org/apache/log4j/Category.java?source=c#L81

And this causes an infinite loop and an eventual stack overflow.

I'm happy to submit a Jira and a patch, but it would be a significant enough reversal of recent changes that it's probably worth discussing before I sink a half hour into it. My suggestion would be that initialization (or not) should be left to the user, with reasonable default behavior supplied by the Spark command-line tooling and not forced on applications that incorporate Spark.

Thoughts/opinions?

-- Paul

-- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
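[A simplified illustration of the failure mode Paul describes; this is a sketch of the pattern, not Spark's exact Logging code. With log4j-over-slf4j on the classpath the root logger never reports appenders, so the initialization branch logs, which re-enters log, which initializes again, until the stack overflows:]

  // Sketch of the recursion; not the literal Logging.scala implementation.
  trait Logging {
    private var initialized = false
    protected def log: org.slf4j.Logger = {
      if (!initialized) initializeLogging()
      org.slf4j.LoggerFactory.getLogger(this.getClass)
    }
    private def initializeLogging(): Unit = {
      val root = org.apache.log4j.LogManager.getRootLogger
      // Under log4j-over-slf4j this enumeration is always empty...
      if (!root.getAllAppenders.hasMoreElements) {
        // ...so we log before the flag is set, re-entering log() above.
        log.info("Using Spark's default log4j profile")
      }
      initialized = true
    }
  }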
Re: 0.9.0 forces log4j usage
Koert - my suggestion was this. We let users use any slf4j backend they want. If we detect that they are using the log4j backend and *also* they didn't configure any log4j appenders, we set up some nice defaults for them. If they are using another backend, Spark doesn't try to modify the configuration at all.

On Fri, Feb 7, 2014 at 11:14 AM, Koert Kuipers ko...@tresata.com wrote:

well, "static binding" is probably the wrong terminology, but you get the idea. multiple backends are not allowed and cause an even uglier warning... see also here: https://github.com/twitter/scalding/pull/636 and here: https://groups.google.com/forum/#!topic/cascading-user/vYvnnN_15ls -- all me being annoying and complaining about slf4j-log4j12 dependencies (which did get removed).

On Fri, Feb 7, 2014 at 2:09 PM, Koert Kuipers ko...@tresata.com wrote:

the issue is that slf4j uses static binding. you can put only one slf4j backend on the classpath, and that's what it uses. more than one is not allowed. so you either keep the slf4j-log4j12 dependency for spark, and then you take away people's choice of slf4j backend (which is considered bad form for a library), or you do not include it, and then people will always get the big fat ugly warning and slf4j logging will not flow to log4j. including log4j itself is not necessarily a problem, i think?

On Fri, Feb 7, 2014 at 1:11 PM, Patrick Wendell pwend...@gmail.com wrote:

This also seems relevant - but not my area of expertise (whether this is a valid way to check this): http://stackoverflow.com/questions/10505418/how-to-find-which-library-slf4j-has-bound-itself-to

On Fri, Feb 7, 2014 at 10:08 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey Guys,

Thanks for explaining. Ya, this is a problem - we didn't really know that people are using other slf4j backends. slf4j is in there for historical reasons, but I think we may assume in a few places that log4j is being used, and we should minimize those. We should patch this and get a fix into 0.9.1.

So some solutions I see are:

(a) Add a SparkConf option to disable this. I'm fine with this one.

(b) Ask slf4j which backend is active and only try to enforce this default if we know slf4j is using log4j. Do either of you know if this is possible? Not sure if slf4j exposes this.

(c) Just remove this default stuff. We'd rather not do this. The goal of this thing is to provide good usability for people who have linked against Spark and haven't done anything to configure logging. For beginners we try to minimize the assumptions about what else they know about, and I've found log4j configuration is a huge mental barrier for people who are getting started.

Paul, if you submit a patch doing (a) we can merge it in. If you have any idea whether (b) is possible, I prefer that one, but it may not be possible or might be brittle.

- Patrick

On Fri, Feb 7, 2014 at 6:36 AM, Koert Kuipers ko...@tresata.com wrote:

Totally agree with Paul: a library should not pick the slf4j backend. It defeats the purpose of slf4j. That big ugly warning is there to alert people that it's their responsibility to pick the backend...

On Feb 7, 2014 3:55 AM, Paul Brown p...@mult.ifario.us wrote:

Hi, Patrick --

From slf4j, you can either backend it into log4j (which is the way that Spark is shipped) or you can route log4j through slf4j and then on to a different backend (e.g., logback). We're doing the latter and manipulating the dependencies in the build, because that's the way the enclosing application is set up.

The issue with the current situation is that there's no way for an end user to choose to *not* use the log4j backend. (My short-term solution was to use the Maven shade plugin to swap in a version of the Logging trait with the body of that method commented out.) In addition to the situation with log4j-over-slf4j and the empty enumeration of ROOT appenders, you might also run afoul of someone who intentionally configured log4j with an empty set of appenders at the time that Spark is initializing.

I'd be happy with any implementation that lets me choose my logging backend: override default behavior via system property, plug-in architecture, etc. I do think it's reasonable to expect someone digesting a substantial JDK-based system like Spark to understand how to initialize logging -- surely they're using logging of some kind elsewhere in their application -- but if you want the default behavior there as a courtesy, it might be worth putting an INFO (versus the glaring log4j WARN) message on the output that says something like "Initialized default logging via Log4J; pass -Dspark.logging.loadDefaultLogger=false to disable this behavior." so that it's both convenient and explicit.

Cheers.
-- Paul

-- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

On Fri, Feb 7, 2014 at 12:05 AM
Re: [TODO] Document the release process for Apache Spark
Hey Henry,

Let me document this on the wiki. I've already kept pretty thorough docs on this; I just need to migrate them to the wiki. I've created a JIRA here: https://spark-project.atlassian.net/browse/SPARK-1066

- Patrick

On Fri, Feb 7, 2014 at 11:35 AM, Henry Saputra henry.sapu...@gmail.com wrote:

snip
Re: 0.9.0 forces log4j usage
Ah okay sounds good. This is what I meant earlier by You have some other application that directly calls log4j i.e. you have for historical reasons installed the log4j-over-slf4j. Would you mind trying out this fix and seeing if it works? This is designed to be a hotfix for 0.9, not a general solution where we rip out log4j from our published dependencies: https://github.com/apache/incubator-spark/pull/560/files - Patrick On Fri, Feb 7, 2014 at 5:57 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- I forget which other component is responsible, but we're using the log4j-over-slf4j as part of an overall requirement to centralize logging, i.e., *someone* else is logging over log4j and we're pulling that in. (There's also some jul logging from Jersey, etc.) Goals: - Fully control/capture all possible logging. (God forbid we have to grab System.out/err, but we'd do it if needed.) - Use the backend we like best at the moment. (Happens to be logback.) Possible cases: - If Spark used Log4j at all, we would pull in that logging via log4j-over-slf4j. - If Spark used only slf4j and referenced no backend, we would use it as-is although we'd still have the log4j-over-slf4j because of other libraries. - If Spark used only slf4j and referenced the slf4j-log4j12 backend, we would exclude that one dependency (via our POM). Best. -- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Feb 7, 2014 at 5:38 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Paul, So if your goal is ultimately to output to logback. Then why don't you just use slf4j and logback-classic.jar as described here [1]. Why involve log4j-over-slf4j at all? Let's say we refactored the spark build so it didn't advertise slf4j-log4j12 as a dependency. Would you still be using log4j-over-slf4j... or is this just a fix to deal with the fact that Spark is somewhat log4j dependent at this point. [1] http://www.slf4j.org/manual.html - Patrick On Fri, Feb 7, 2014 at 5:14 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- That's close but not quite it. The issue that occurs is not the delegation loop mentioned in slf4j documentation. The stack overflow is entirely within the code in the Spark trait: at org.apache.spark.Logging$class.initializeLogging(Logging.scala:112) at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:97) at org.apache.spark.Logging$class.log(Logging.scala:36) at org.apache.spark.SparkEnv$.log(SparkEnv.scala:94) And then that repeats. As for our situation, we exclude the slf4j-log4j12 dependency when we import the Spark library (because we don't want to use log4j) and have log4j-over-slf4j already in place to ensure that all of the logging in the overall application runs through slf4j and then out through logback. (We also, as another poster already mentioned, also force jcl and jul through slf4j.) The zen of slf4j for libraries is that the library uses the slf4j API and then the enclosing application can route logging as it sees fit. Spark master CLI would log via slf4j and include the slf4j-log4j12 backend; same for Spark worker CLI. Spark as a library (versus as a container) would not include any backend to the slf4j API and leave this up to the application. (FWIW, this would also avoid your log4j warning message.) But as I was saying before, I'd be happy with a situation where I can avoid log4j being enabled or configured, and I think you'll find an existing choice of logging framework to be a common scenario for those embedding Spark in other systems. Best. 
-- Paul -- p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Feb 7, 2014 at 3:01 PM, Patrick Wendell pwend...@gmail.com wrote: Paul, Looking back at your problem, I think it's the one here: http://www.slf4j.org/codes.html#log4jDelegationLoop So let me just be clear about what you are doing so I understand. You have some other application that directly calls log4j. So you have to include log4j-over-slf4j to route those logs through slf4j to logback. At the same time you embed Spark in this application. In the past it was fine, but now that Spark programmatically initializes log4j, it screws up your application because log4j-over-slf4j doesn't work with applications that do this explicitly, as discussed here: http://www.slf4j.org/legacy.html Correct? - Patrick On Fri, Feb 7, 2014 at 2:02 PM, Koert Kuipers ko...@tresata.com wrote: got it. that sounds reasonable On Fri, Feb 7, 2014 at 2:31 PM, Patrick Wendell pwend...@gmail.com wrote: Koert - my suggestion was this. We let users use any slf4j backend they want. If we detect that they are using the log4j backend and *also* they didn't configure any log4j appenders, we set up some nice defaults for them. If they are using another backend
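To make the detect-and-default idea in that last message concrete, here is a minimal sketch, assuming log4j 1.x and the slf4j-log4j12 binder; this is an illustration of the approach, not Spark's actual Logging trait, and the pattern layout is an assumption:

import org.apache.log4j.{ConsoleAppender, Level, LogManager, PatternLayout}
import org.slf4j.LoggerFactory

object LoggingDefaults {
  def maybeInstallLog4jDefaults(): Unit = {
    // Detect whether slf4j is bound to the log4j backend at all.
    val binder = LoggerFactory.getILoggerFactory.getClass.getName
    val usingLog4j = binder == "org.slf4j.impl.Log4jLoggerFactory"
    if (usingLog4j) {
      val root = LogManager.getRootLogger
      // Only install defaults if the user configured no appenders themselves.
      if (!root.getAllAppenders.hasMoreElements) {
        root.setLevel(Level.INFO)
        root.addAppender(new ConsoleAppender(
          new PatternLayout("%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n")))
      }
    }
  }
}

With any other backend (e.g. logback behind log4j-over-slf4j), the binder check fails and the code never touches log4j, which is the property Paul and Koert are after.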
Re: Proposal for Spark Release Strategy
I like Heiko's proposal that requires every pull request to reference a JIRA. This is how things are done in Hadoop and it makes it much easier to, for example, find out whether an issue you came across when googling for an error is in a release. I think this is a good idea and something on which there is wide consensus. I separately was going to suggest this in a later e-mail (it's not directly tied to versioning). One of many reasons this is necessary is because it's becoming hard to track which features ended up in which releases. I agree with Mridul about binary compatibility. It can be a dealbreaker for organizations that are considering an upgrade. The two ways I'm aware of that break binary compatibility are Scala version upgrades and messing around with inheritance. Are these not avoidable at least for minor releases? This is clearly a goal but I'm hesitant to codify it until we understand all of the reasons why it might not work. I've heard in general with Scala there are many non-obvious things that can break binary compatibility and we need to understand what they are. I'd propose we add the migration tool [1] here to our build and use it for a few months and see what happens (hat tip to Michael Armbrust). It's easy to formalize this as a requirement later; it's impossible to go the other direction. For Scala major versions it's possible we can cross-build between 2.10 and 2.11 to retain link-level compatibility. It's just entirely uncharted territory and AFAIK no one who's suggesting this is speaking from experience maintaining this guarantee for a Scala project. That would be the strongest convincing reason for me - if someone has actually done this in the past in a Scala project and speaks from experience. Most of us are speaking from the perspective of Java projects where we understand well the trade-offs and costs of maintaining this guarantee. [1] https://github.com/typesafehub/migration-manager - Patrick
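For anyone who wants to try this, wiring the migration manager (MiMa) into an sbt build is small. The sketch below assumes the plugin coordinates and key names of the 0.1.x sbt-mima-plugin era, so treat them as assumptions and check the project's README:

// project/plugins.sbt
addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.6")

// build.sbt: compare the current build against the last released artifact
import com.typesafe.tools.mima.plugin.MimaPlugin.mimaDefaultSettings
import com.typesafe.tools.mima.plugin.MimaKeys.previousArtifact

mimaDefaultSettings

previousArtifact := Some("org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating")

// running `sbt mima-report-binary-issues` then lists every
// binary-incompatible change relative to the previous release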
Re: Proposal for Spark Release Strategy
and the vision and make adjustments accordingly. Releasing 1.0.0 is a huge milestone, and if we do need to break APIs somehow or modify internal behavior dramatically, we could take advantage of the 1.0.0 release as a good step at which to do so. - Henry On Wed, Feb 5, 2014 at 9:52 PM, Andrew Ash and...@andrewash.com wrote: Agree on timeboxed releases as well. Is there a vision for where we want to be as a project before declaring the first 1.0 release? While we're in the 0.x days per semver we can break backcompat at will (though we try to avoid it where possible), and that luxury goes away with 1.x. I just don't want to release a 1.0 simply because it seems to follow after 0.9, rather than making an intentional decision that we're at the point where we can stand by the current APIs and binary compatibility for the next year or so of the major release. Until that decision is made as a group, I'd rather we do an immediate version bump to 0.10.0-SNAPSHOT and then, if discussion warrants it later, replace that with 1.0.0-SNAPSHOT. It's very easy to go from 0.10 to 1.0 but not the other way around. https://github.com/apache/incubator-spark/pull/542 Cheers! Andrew On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com wrote: +1 on time boxed releases and compatibility guidelines On 06.02.2014 at 01:20, Patrick Wendell pwend...@gmail.com wrote: Hi Everyone, In an effort to coordinate development amongst the growing list of Spark contributors, I've taken some time to write up a proposal to formalize various pieces of the development process. The next release of Spark will likely be Spark 1.0.0, so this message is intended in part to coordinate the release plan for 1.0.0 and future releases. I'll post this on the wiki after discussing it on this thread as tentative project guidelines. == Spark Release Structure == Starting with Spark 1.0.0, the Spark project will follow the semantic versioning guidelines (http://semver.org/) with a few deviations. These small differences account for Spark's nature as a multi-module project. Each Spark release will be versioned: [MAJOR].[MINOR].[MAINTENANCE] All releases with the same major version number will have API compatibility, as defined in [1]. Major version numbers will remain stable over long periods of time. For instance, 1.X.Y may last 1 year or more. Minor releases will typically contain new features and improvements. The target frequency for minor releases is every 3-4 months. One change we'd like to make is to announce fixed release dates and merge windows for each release, to facilitate coordination. Each minor release will have a merge window where new patches can be merged, a QA window when only fixes can be merged, then a final period where voting occurs on release candidates. These windows will be announced immediately after the previous minor release to give people plenty of time, and over time, we might make the whole release process more regular (similar to Ubuntu). At the bottom of this document is an example window for the 1.0.0 release. Maintenance releases will occur more frequently and depend on specific patches introduced (e.g. bug fixes) and their urgency. In general these releases are designed to patch bugs. However, higher level libraries may introduce small features, such as a new algorithm, provided they are entirely additive and isolated from existing code paths. Spark core may not introduce any features. When new components are added to Spark, they may initially be marked as alpha.
Alpha components do not have to abide by the above guidelines; however, they should try to do so to the maximum extent possible. Once they are marked stable they have to follow these guidelines. At present, GraphX is the only alpha component of Spark. [1] API compatibility: An API is any public class or interface exposed in Spark that is not marked as semi-private or experimental. Release A is API compatible with release B if code compiled against release A *compiles cleanly* against B. This does not guarantee that a compiled application that is linked against version A will link cleanly against version B without re-compiling. Link-level compatibility is something we'll try to guarantee as well, and we might make it a requirement in the future, but challenges with things like Scala versions have made this difficult to guarantee in the past. == Merging Pull Requests == To merge pull requests, committers are encouraged to use this tool [2] to collapse the request into one commit rather than manually performing git merges. It will also format the commit message nicely in a way that can be easily parsed later when writing credits. Currently
Proposal for Spark Release Strategy
Hi Everyone, In an effort to coordinate development amongst the growing list of Spark contributors, I've taken some time to write up a proposal to formalize various pieces of the development process. The next release of Spark will likely be Spark 1.0.0, so this message is intended in part to coordinate the release plan for 1.0.0 and future releases. I'll post this on the wiki after discussing it on this thread as tentative project guidelines. == Spark Release Structure == Starting with Spark 1.0.0, the Spark project will follow the semantic versioning guidelines (http://semver.org/) with a few deviations. These small differences account for Spark's nature as a multi-module project. Each Spark release will be versioned: [MAJOR].[MINOR].[MAINTENANCE] All releases with the same major version number will have API compatibility, as defined in [1]. Major version numbers will remain stable over long periods of time. For instance, 1.X.Y may last 1 year or more. Minor releases will typically contain new features and improvements. The target frequency for minor releases is every 3-4 months. One change we'd like to make is to announce fixed release dates and merge windows for each release, to facilitate coordination. Each minor release will have a merge window where new patches can be merged, a QA window when only fixes can be merged, then a final period where voting occurs on release candidates. These windows will be announced immediately after the previous minor release to give people plenty of time, and over time, we might make the whole release process more regular (similar to Ubuntu). At the bottom of this document is an example window for the 1.0.0 release. Maintenance releases will occur more frequently and depend on specific patches introduced (e.g. bug fixes) and their urgency. In general these releases are designed to patch bugs. However, higher level libraries may introduce small features, such as a new algorithm, provided they are entirely additive and isolated from existing code paths. Spark core may not introduce any features. When new components are added to Spark, they may initially be marked as alpha. Alpha components do not have to abide by the above guidelines; however, they should try to do so to the maximum extent possible. Once they are marked stable they have to follow these guidelines. At present, GraphX is the only alpha component of Spark. [1] API compatibility: An API is any public class or interface exposed in Spark that is not marked as semi-private or experimental. Release A is API compatible with release B if code compiled against release A *compiles cleanly* against B. This does not guarantee that a compiled application that is linked against version A will link cleanly against version B without re-compiling. Link-level compatibility is something we'll try to guarantee as well, and we might make it a requirement in the future, but challenges with things like Scala versions have made this difficult to guarantee in the past. == Merging Pull Requests == To merge pull requests, committers are encouraged to use this tool [2] to collapse the request into one commit rather than manually performing git merges. It will also format the commit message nicely in a way that can be easily parsed later when writing credits. Currently it is maintained in a public utility repository, but we'll merge it into mainline Spark soon.
[2] https://github.com/pwendell/spark-utils/blob/master/apache_pr_merge.py == Tentative Release Window for 1.0.0 == Feb 1st - April 1st: General development April 1st: Code freeze for new features April 15th: RC1 == Deviations == For now, the proposal is to consider these tentative guidelines. We can vote to formalize these as project rules at a later time after some experience working with them. Once formalized, any deviation from these guidelines will be subject to a lazy majority vote. - Patrick
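As a small illustration of the compatibility rule defined in the proposal above (purely illustrative code, not part of the proposal itself):

// [MAJOR].[MINOR].[MAINTENANCE]: releases sharing a major version are API compatible.
object VersioningRules {
  case class SparkVersion(major: Int, minor: Int, maintenance: Int)

  def parse(v: String): SparkVersion = {
    val Array(major, minor, maintenance) = v.split("\\.").map(_.toInt)
    SparkVersion(major, minor, maintenance)
  }

  def apiCompatible(a: SparkVersion, b: SparkVersion): Boolean =
    a.major == b.major

  // apiCompatible(parse("1.0.0"), parse("1.3.2")) == true
  // apiCompatible(parse("1.3.2"), parse("2.0.0")) == false
}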
Re: Proposal for Spark Release Strategy
How are Alpha components and higher level libraries which may add small features within a maintenance release going to be marked with that status? Somehow/somewhere within the code itself, or just as some kind of external reference? I think we'd mark alpha features as such in the java/scaladoc. This is what Scala does with experimental features. Higher level libraries are anything that isn't Spark core. Maybe we can formalize this more somehow. We might be able to annotate the new features as experimental if they end up in a patch release. This could make it more clear. I would strongly encourage that developers submitting pull requests include within the description of that PR whether you intend the contribution to be mergeable at the maintenance level, minor level, or major level. That will help those of us doing code reviews and merges decide where the code should go and how closely to scrutinize the PR for changes that are not compatible with the intended release level. I'd say the default is the minor level. If contributors know it should be added in a maintenance release, it's great if they say so. However, I'd say this is also a responsibility of the committers, since individual contributors may not know. It will probably be a while before major level patches are being merged :P
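A sketch of what such a scaladoc-visible marker could look like; the annotation name below is hypothetical, since the thread had not settled on one:

import scala.annotation.StaticAnnotation

/** Marks an API that may change or disappear in minor releases (hypothetical name). */
class alphaComponent extends StaticAnnotation

@alphaComponent
class SomeNewGraphApi  // readers of the generated scaladoc see the alpha marker immediately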
Re: Proposal for Spark Release Strategy
If people feel that merging the intermediate SNAPSHOT number is significant, let's just defer merging that until this discussion concludes. That said - the decision to settle on 1.0 for the next release is not just because it happens to come after 0.9. It's a conscious decision based on the development of the project to this point. A major focus of the 0.9 release was tying off loose ends in terms of backwards compatibility (e.g. spark configuration). There was some discussion back then of maybe cutting a 1.0 release but the decision was deferred until after 0.9. @mridul - please see the original post for discussion about binary compatibility. On Wed, Feb 5, 2014 at 10:20 PM, Andy Konwinski andykonwin...@gmail.com wrote: +1 for 0.10.0 now with the option to switch to 1.0.0 after further discussion. On Feb 5, 2014 9:53 PM, Andrew Ash and...@andrewash.com wrote: Agree on timeboxed releases as well. Is there a vision for where we want to be as a project before declaring the first 1.0 release? While we're in the 0.x days per semver we can break backcompat at will (though we try to avoid it where possible), and that luxury goes away with 1.x. I just don't want to release a 1.0 simply because it seems to follow after 0.9, rather than making an intentional decision that we're at the point where we can stand by the current APIs and binary compatibility for the next year or so of the major release. Until that decision is made as a group, I'd rather we do an immediate version bump to 0.10.0-SNAPSHOT and then, if discussion warrants it later, replace that with 1.0.0-SNAPSHOT. It's very easy to go from 0.10 to 1.0 but not the other way around. https://github.com/apache/incubator-spark/pull/542 Cheers! Andrew On Wed, Feb 5, 2014 at 9:49 PM, Heiko Braun ike.br...@googlemail.com wrote: +1 on time boxed releases and compatibility guidelines On 06.02.2014 at 01:20, Patrick Wendell pwend...@gmail.com wrote: Hi Everyone, In an effort to coordinate development amongst the growing list of Spark contributors, I've taken some time to write up a proposal to formalize various pieces of the development process. The next release of Spark will likely be Spark 1.0.0, so this message is intended in part to coordinate the release plan for 1.0.0 and future releases. I'll post this on the wiki after discussing it on this thread as tentative project guidelines. == Spark Release Structure == Starting with Spark 1.0.0, the Spark project will follow the semantic versioning guidelines (http://semver.org/) with a few deviations. These small differences account for Spark's nature as a multi-module project. Each Spark release will be versioned: [MAJOR].[MINOR].[MAINTENANCE] All releases with the same major version number will have API compatibility, as defined in [1]. Major version numbers will remain stable over long periods of time. For instance, 1.X.Y may last 1 year or more. Minor releases will typically contain new features and improvements. The target frequency for minor releases is every 3-4 months. One change we'd like to make is to announce fixed release dates and merge windows for each release, to facilitate coordination. Each minor release will have a merge window where new patches can be merged, a QA window when only fixes can be merged, then a final period where voting occurs on release candidates. These windows will be announced immediately after the previous minor release to give people plenty of time, and over time, we might make the whole release process more regular (similar to Ubuntu).
At the bottom of this document is an example window for the 1.0.0 release. Maintenance releases will occur more frequently and depend on specific patches introduced (e.g. bug fixes) and their urgency. In general these releases are designed to patch bugs. However, higher level libraries may introduce small features, such as a new algorithm, provided they are entirely additive and isolated from existing code paths. Spark core may not introduce any features. When new components are added to Spark, they may initially be marked as alpha. Alpha components do not have to abide by the above guidelines; however, they should try to do so to the maximum extent possible. Once they are marked stable they have to follow these guidelines. At present, GraphX is the only alpha component of Spark. [1] API compatibility: An API is any public class or interface exposed in Spark that is not marked as semi-private or experimental. Release A is API compatible with release B if code compiled against release A *compiles cleanly* against B. This does not guarantee that a compiled application that is linked against version A will link cleanly against version B without re-compiling. Link-level compatibility is something we'll try to guarantee as well, and we
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc5)
It takes a day or two to package the release once it passes votes and cut it to Maven. Coming soon! On Sat, Feb 1, 2014 at 8:08 PM, Kapil Malik kma...@adobe.com wrote: Awesome ! Thanks everyone :) -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: 02 February 2014 08:09 To: dev@spark.incubator.apache.org; j...@cs.berkeley.edu Kottalam Subject: Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc5) Yup, we're still working on putting it on the website, but this is the final release. You can download the RC5 artifacts from http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-0-9-0-incubating-rc5-td318.html. Matei On Feb 1, 2014, at 12:51 PM, Jey Kottalam j...@cs.berkeley.edu wrote: Hi Kapil, It looks to me like the artifacts in Maven are the official 0.9.0 release, though the website has not yet been updated. The IPMC approved RC5 as of yesterday: https://mail-archives.apache.org/mod_mbox/incubator-general/201401.mbox/cabpqxstjm+po7_22bdybqxk90zsy3pnxppft87-9xdff98u...@mail.gmail.com -Jey On Sat, Feb 1, 2014 at 8:19 AM, Kapil Malik kma...@adobe.com wrote: Hi Stevo, Thanks for the link. Indeed, different versions are available on the Maven repository which I can clone/sync for development purposes. But I'm more confident about the official release version when deploying to a cluster which is used by multiple people. Hence curious about the date for the official 0.9 release. Thanks and regards, Kapil -Original Message- From: Stevo Slavić [mailto:ssla...@gmail.com] Sent: 01 February 2014 21:33 To: dev@spark.incubator.apache.org Subject: Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc5) Apache Spark 0.9.0 artifacts are on Maven central repo (see http://central.maven.org/maven2/org/apache/spark/spark-core_2.10/0.9.0-incubating/) Kind regards, Stevo Slavic On Sat, Feb 1, 2014 at 4:59 PM, Kapil Malik kma...@adobe.com wrote: Sent too early ... 1 week* (maybe I refreshed too fast) -Original Message- From: Kapil Malik Sent: 01 February 2014 21:27 To: dev@spark.incubator.apache.org Subject: RE: [VOTE] Release Apache Spark 0.9.0-incubating (rc5) +1 for Q ! Have been monitoring this thread from the past 3 weeks in anticipation :) Any tentative dates for the official 0.9 release ? Kapil Malik | kma...@adobe.com | 33430 / 8800836581 -Original Message- From: C. Ross Jam [mailto:cross...@crossjam.net] Sent: 01 February 2014 21:18 To: dev@spark.incubator.apache.org Subject: Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc5) Curious lurker here. Did this vote close successfully? Should I wait for an official 0.9 release? Cheers! On Friday, January 24, 2014, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0.
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc5)
I'll add my own +1. On Tue, Jan 28, 2014 at 12:45 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Stephen, Yes, this runs afoul of good practice in Maven, where a given version shouldn't be re-used. As far as I understand though, it is required by the way the Apache release process works. The artifacts and repository content that get voted on need to exactly match the final release. So we can't hold a vote on a version of the code where everything says -rcx, then go back and change the source code and do a second push to Maven with code that doesn't have an -rcx suffix. This would effectively change the code that is being released. I was thinking as a workaround that maybe we could publish a second set of staging artifacts that are versioned with -rcX for people to test against. I think as long as we make it clear that these are not the official artifacts being voted on it might be okay. I'm not totally sure if this is allowed though. - Patrick On Tue, Jan 28, 2014 at 9:01 AM, Stephen Haberman stephen.haber...@gmail.com wrote: Hi Patrick, The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1006/ I was going to import this rc5 release into our internal Maven repo to try it out, but noticed that the version doesn't have rc5 in it. This means that, if there is an rc6, I'll have to re-import over the same artifacts, which is generally not a good thing given Maven assumes artifacts never change. Is this restriction required by the blessing process, or would it be possible to sneak rc5 into the pre-final version number? For now, I'll just build a local version, at the same commit, but with the version set to 0.9.0-incubating-rc5. Apologies if this was discussed before and I just missed it. - Stephen
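For testing a staged RC without re-versioning it locally the way Stephen describes, one alternative is to resolve straight from the staging repository; a minimal sbt sketch, using the staging URL from this thread (the repo disappears once it is dropped or promoted):

resolvers += "Apache Spark staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1006/"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating"

Of course this has exactly the mutability problem Stephen raises: if rc6 is staged under the same version, locally cached artifacts go stale, so the local ivy/maven cache may need cleaning between RCs.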
Re: Moving to Typesafe Config?
Hey Heiko, Spark 0.9 introduced a common config class for Spark applications. It also (initially) supported loading config files in the nested typesafe format, but this was removed last minute due to a bug. In 1.0 we'll probably add support for config files, though it may not support typesafe's tree-style config files, because that conflicts with the naming style of several spark options (we have options where x.y and x.y.z are both named keys, and the typesafe parser doesn't allow that). - Patrick On Mon, Jan 27, 2014 at 8:59 AM, Heiko Braun ike.br...@googlemail.com wrote: Thanks. I found the discussion myself ;) /heiko On 27.01.2014 at 17:34, Mark Hamstra m...@clearstorydata.com wrote: And it would be more helpful if I gave you a usable link http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html Sent from my iPhone On Jan 27, 2014, at 8:13 AM, Heiko Braun ike.br...@googlemail.com wrote: Thanks Mark. On 27 Jan 2014, at 17:05, Mark Hamstra m...@clearstorydata.com wrote: Been done and undone, and will probably be redone for 1.0. See https://mail.google.com/mail/ca/u/0/#search/config/143a6c39e3995882 On Mon, Jan 27, 2014 at 7:58 AM, Heiko Braun ike.br...@googlemail.com wrote: Is there any interest in moving to a more structured approach for configuring spark components? I.e. moving to the typesafe config [1]. Since spark already leverages akka, this seems to be a reasonable choice IMO. [1] https://github.com/typesafehub/config Regards, Heiko
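The naming conflict Patrick mentions is easy to reproduce against the Typesafe Config API directly; a minimal sketch, with the behavior as reported in the "Config properties broken in master" thread later in this archive:

import java.util.Properties
import com.typesafe.config.ConfigFactory

object ConfigConflictDemo extends App {
  val props = new Properties()
  props.setProperty("spark.speculation", "true")            // a leaf value...
  props.setProperty("spark.speculation.multiplier", "0.95") // ...and a child under the same path

  val conf = ConfigFactory.parseProperties(props)
  // "spark.speculation" must become an object node to hold "multiplier",
  // so its flat value "true" is dropped per the parseProperties rules;
  // conf.getBoolean("spark.speculation") would now fail with a ConfigException.
  println(conf.root().render())
}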
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc5)
Hey Taka, If you build a second version you need to clean the existing assembly jar. The reference implementations of the tests are the ones on the U.C. Berkeley Jenkins. These are passing for Branch 0.9 for both Hadoop 1 and Hadoop 2 versions, so I'm inclined to think it's an issue with your test env or setup. https://amplab.cs.berkeley.edu/jenkins/view/Spark/ - Patrick On Sun, Jan 26, 2014 at 10:52 PM, Reynold Xin r...@databricks.com wrote: It is possible that you have generated the assembly jar using one version of Hadoop, and then another assembly jar with another version. Those tests that failed are all using a local cluster that sets up multiple processes, which would require launching Spark worker processes using the assembly jar. If that's indeed the problem, removing the extra assembly jars should fix them. On Sun, Jan 26, 2014 at 10:49 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: If I build Spark for Hadoop 1.0.4 (either SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly or sbt/sbt assembly) or use the binary distribution, 'sbt/sbt test' runs successfully. However, if I build Spark targeting any other Hadoop versions (e.g. SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly, SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly), I'm getting the following errors with 'sbt/sbt test': 1) type mismatch errors with JavaPairDStream.scala 2) the following test failures [error] Failed tests: [error] org.apache.spark.ShuffleNettySuite [error] org.apache.spark.ShuffleSuite [error] org.apache.spark.FileServerSuite [error] org.apache.spark.DistributedSuite I don't have Hadoop 1.0.4 installed on my test systems (but the test succeeds, and fails with the installed Hadoop versions). I'm seeing these sbt test errors with the previous 0.9.0 RCs and 0.8.1, too. I'm wondering if anyone else has seen this problem or I'm missing something to run the test correctly. Thanks, Taka On Sat, Jan 25, 2014 at 5:00 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: +1 On 1/25/14, 4:04 PM, Mark Hamstra m...@clearstorydata.com wrote: +1 On Sat, Jan 25, 2014 at 2:37 PM, Andy Konwinski andykonwin...@gmail.com wrote: +1 On Sat, Jan 25, 2014 at 2:27 PM, Reynold Xin r...@databricks.com wrote: +1 On Jan 25, 2014, at 12:07 PM, Hossein fal...@gmail.com wrote: +1 Compiled and tested on Mavericks. --Hossein On Sat, Jan 25, 2014 at 11:38 AM, Patrick Wendell pwend...@gmail.com wrote: I'll kick off the voting with a +1. On Thu, Jan 23, 2014 at 11:33 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail. The tag to be voted on is v0.9.0-incubating (commit 95d28ff3): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=95d28ff3d0d20d9c583e184f9e2c5ae842d8a4d9 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc5 Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1006/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc5-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Monday, January 27, at 07:30 UTC and passes if a majority of at least 3 +1 PPMC votes are cast.
[ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
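A practical footnote to the stale-assembly issue Patrick and Reynold diagnose at the top of this thread: cleaning before rebuilding for a different Hadoop version avoids leaving two assembly jars around, e.g. (following the command shape already used in the thread; `clean` simply removes previously built artifacts first):

SPARK_HADOOP_VERSION=2.2.0 sbt/sbt clean assembly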
[RESULT] [VOTE] Release Apache Spark 0.9.0-incubating (rc5)
Voting is now closed. This vote passes with 5 binding +1 votes and no 0 or -1 votes. This vote will now go to the IPMC list for a second 72-hour vote. Spark developers are encouraged to comment on the IPMC vote as well. The totals are: +1: Patrick Wendell* Hossein Falaki Reynold Xin* Andy Konwinski* Mark Hamstra* Sean McNamara* 0: (none) -1: (none) On Sun, Jan 26, 2014 at 10:58 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Taka, If you build a second version you need to clean the existing assembly jar. The reference implementations of the tests are the ones on the U.C. Berkeley Jenkins. These are passing for Branch 0.9 for both Hadoop 1 and Hadoop 2 versions, so I'm inclined to think it's an issue with your test env or setup. https://amplab.cs.berkeley.edu/jenkins/view/Spark/ - Patrick On Sun, Jan 26, 2014 at 10:52 PM, Reynold Xin r...@databricks.com wrote: It is possible that you have generated the assembly jar using one version of Hadoop, and then another assembly jar with another version. Those tests that failed are all using a local cluster that sets up multiple processes, which would require launching Spark worker processes using the assembly jar. If that's indeed the problem, removing the extra assembly jars should fix them. On Sun, Jan 26, 2014 at 10:49 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: If I build Spark for Hadoop 1.0.4 (either SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly or sbt/sbt assembly) or use the binary distribution, 'sbt/sbt test' runs successfully. However, if I build Spark targeting any other Hadoop versions (e.g. SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly, SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly), I'm getting the following errors with 'sbt/sbt test': 1) type mismatch errors with JavaPairDStream.scala 2) the following test failures [error] Failed tests: [error] org.apache.spark.ShuffleNettySuite [error] org.apache.spark.ShuffleSuite [error] org.apache.spark.FileServerSuite [error] org.apache.spark.DistributedSuite I don't have Hadoop 1.0.4 installed on my test systems (but the test succeeds, and fails with the installed Hadoop versions). I'm seeing these sbt test errors with the previous 0.9.0 RCs and 0.8.1, too. I'm wondering if anyone else has seen this problem or I'm missing something to run the test correctly. Thanks, Taka On Sat, Jan 25, 2014 at 5:00 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: +1 On 1/25/14, 4:04 PM, Mark Hamstra m...@clearstorydata.com wrote: +1 On Sat, Jan 25, 2014 at 2:37 PM, Andy Konwinski andykonwin...@gmail.com wrote: +1 On Sat, Jan 25, 2014 at 2:27 PM, Reynold Xin r...@databricks.com wrote: +1 On Jan 25, 2014, at 12:07 PM, Hossein fal...@gmail.com wrote: +1 Compiled and tested on Mavericks. --Hossein On Sat, Jan 25, 2014 at 11:38 AM, Patrick Wendell pwend...@gmail.com wrote: I'll kick off the voting with a +1. On Thu, Jan 23, 2014 at 11:33 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail.
The tag to be voted on is v0.9.0-incubating (commit 95d28ff3): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=95d28ff3d0d20d9c583e184f9e2c5ae842d8a4d9 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc5 Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1006/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc5-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Monday, January 27, at 07:30 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc4) [new thread]
Hey Tom, Matei had to remove this because it turns out that there was a fairly serious bug in the Typesafe config library we use for parsing conf files [1]. There wasn't an immediate solution to this so he just removed the capability for this release and we can revisit it in the next release. [1] http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html - Patrick On Wed, Jan 22, 2014 at 8:18 AM, Tom Graves tgraves...@yahoo.com wrote: It looks like the latest round of changes took out spark.conf. Are there plans to add this back in (jira)? Tom On Wednesday, January 22, 2014 3:46 AM, Henry Saputra henry.sapu...@gmail.com wrote: Would love to hear from Mridul to verify the fixes for problems he saw are in. On Tuesday, January 21, 2014, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail. The tag to be voted on is v0.9.0-incubating (commit 0771df67): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=0771df675363c69622404cb514bd751bc90526af The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc4 Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1005/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc4-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Friday, January 24, at 11:15 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc4) [new thread]
Btw - to be clear, this was an incompatibility between Spark's config names and constraints on names imposed by typesafe. So I didn't mean to imply there was something broken in their config library. On Wed, Jan 22, 2014 at 9:14 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Tom, Matei had to remove this because it turns out that there was a fairly serious bug in the Typesafe config library we use for parsing conf files [1]. There wasn't an immediate solution to this so he just removed the capability for this release and we can revisit it in the next release. [1] http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html - Patrick On Wed, Jan 22, 2014 at 8:18 AM, Tom Graves tgraves...@yahoo.com wrote: It looks like the latest round of changes took out spark.conf. Are there plans to add this back in (jira)? Tom On Wednesday, January 22, 2014 3:46 AM, Henry Saputra henry.sapu...@gmail.com wrote: Would love to hear from Mridul to verify the fixes for problems he saw are in. On Tuesday, January 21, 2014, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail. The tag to be voted on is v0.9.0-incubating (commit 0771df67): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=0771df675363c69622404cb514bd751bc90526af The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc4 Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1005/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc4-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Friday, January 24, at 11:15 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: Config properties broken in master
Hey Mridul, this was patched and we cut a new release candidate. There were several different config options which had a.b and a.b.c... they should all work in the new RC. On Sun, Jan 19, 2014 at 4:56 AM, Mridul Muralidharan mri...@gmail.com wrote: Chanced upon spill-related configs which exhibit the same pattern ... - Mridul On Sun, Jan 19, 2014 at 1:10 AM, Reynold Xin r...@databricks.com wrote: I also just went over the config options to see how pervasive this is. In addition to speculation, there is one more conflict of this kind: spark.locality.wait spark.locality.wait.node spark.locality.wait.process spark.locality.wait.rack spark.speculation spark.speculation.interval spark.speculation.multiplier spark.speculation.quantile On Sat, Jan 18, 2014 at 11:36 AM, Matei Zaharia matei.zaha...@gmail.com wrote: This is definitely an important issue to fix. Instead of renaming properties, one solution would be to replace Typesafe Config with just reading Java system properties, and disable config files for this release. I kind of like that over renaming. Matei On Jan 18, 2014, at 11:30 AM, Mridul Muralidharan mri...@gmail.com wrote: Hi, Speculation was an example; there are others in spark which are affected by this ... Some of them have been around for a while, so this will break existing code/scripts. Regards, Mridul On Sun, Jan 19, 2014 at 12:51 AM, Nan Zhu zhunanmcg...@gmail.com wrote: change spark.speculation to spark.speculation.switch? maybe we can require that all properties in Spark be three levels On Sat, Jan 18, 2014 at 2:10 PM, Mridul Muralidharan mri...@gmail.com wrote: Hi, Unless I am mistaken, the change to using typesafe ConfigFactory has broken some of the system properties we use in spark. For example: if we have both -Dspark.speculation=true -Dspark.speculation.multiplier=0.95 set, then the spark.speculation property is dropped. The rules of parseProperties actually document this clearly [1] I am not sure what the right fix here would be (other than replacing the use of config, that is). Any thoughts ? I would vote -1 for 0.9 to be released before this is fixed. Regards, Mridul [1] http://typesafehub.github.io/config/latest/api/com/typesafe/config/ConfigFactory.html#parseProperties%28java.util.Properties,%20com.typesafe.config.ConfigParseOptions%29
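A quick, purely illustrative way to enumerate the conflicts Reynold lists above (a key conflicts when it is also a proper prefix of another key); this is not code from the Spark tree:

object ConflictScan extends App {
  val keys = Seq(
    "spark.speculation", "spark.speculation.interval",
    "spark.speculation.multiplier", "spark.speculation.quantile",
    "spark.locality.wait", "spark.locality.wait.node",
    "spark.locality.wait.process", "spark.locality.wait.rack")

  // A key is conflicting if some other key treats it as a parent path.
  val conflicting = keys.filter(k => keys.exists(_.startsWith(k + ".")))
  println(conflicting)  // List(spark.speculation, spark.locality.wait)
}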
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc2)
This vote is cancelled in favor of rc3, which fixes the YARN issue Sandy ran into. @taka - thanks for reporting that bug. It's not enough to block this release, however. Once a fix exists we can merge it into the 0.9 branch and it will be in 0.9.1 On Sun, Jan 19, 2014 at 12:37 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: I've found a problem with the cartesian method on Pyspark and filed it as SPARK-1034 https://spark-project.atlassian.net/browse/SPARK-1034 0.8.1 doesn't have this problem. In Scala, the cartesian method works fine. It would also be nice if SPARK-978 could be fixed, too. https://spark-project.atlassian.net/browse/SPARK-978 Thanks, Taka On Sun, Jan 19, 2014 at 1:24 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Has anybody tested against YARN 2.2? I tried it out against a pseudo-distributed cluster and ran into an issue I just filed as SPARK-1031 https://spark-project.atlassian.net/browse/SPARK-1031 . thanks, Sandy On Sun, Jan 19, 2014 at 12:55 AM, Reynold Xin r...@databricks.com wrote: +1 On Sat, Jan 18, 2014 at 11:11 PM, Patrick Wendell pwend...@gmail.com wrote: I'll kick off the voting with a +1. On Sat, Jan 18, 2014 at 11:05 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail. The tag to be voted on is v0.9.0-incubating (commit 00c847a): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=00c847af1d4be2fe5fad887a57857eead1e517dc The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1003/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc2-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Wednesday, January 22, at 07:05 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc3)
Attempting to attach the release notes again (I think it may have been blocked previously due to not having an extension). On Sun, Jan 19, 2014 at 8:05 PM, Patrick Wendell pwend...@gmail.com wrote: I'll add my +1 as well On Sun, Jan 19, 2014 at 7:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Re-tested on Mac. Matei On Jan 19, 2014, at 7:09 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Starting off. +1 On Sun, Jan 19, 2014 at 2:15 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail. The tag to be voted on is v0.9.0-incubating (commit a7760eff): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=a7760eff4ea6a474cab68896a88550f63bae8b0d The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1004/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc3-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Wednesday, January 22, at 22:15 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/ Spark 0.9.0 is a major release that adds significant new features. It updates Spark to Scala 2.10, simplifies high availability, and updates numerous components of the project. This release includes a first version of GraphX, a powerful new framework for graph processing that comes with a library of standard algorithms. In addition, Spark Streaming is now out of alpha, and includes significant optimizations and simplified high availability deployment. ### Scala 2.10 Support Spark now runs on Scala 2.10, letting users benefit from the language and library improvements in this version. ### Configuration System The new [SparkConf] class is now the preferred way to configure advanced settings on your SparkContext, though the previous Java system property still works. SparkConf is especially useful in tests to make sure properties don't stay set across tests. ### Spark Streaming Improvements Spark Streaming is no longer alpha, and comes with simplified high availability and several optimizations. * When running on a Spark standalone cluster with the [standalone cluster high availability mode], you can submit a Spark Streaming driver application to the cluster and have it automatically recovered if either the driver or the cluster master crashes. * Windowed operators have been sped up by 30-50%. * Spark Streaming's input source plugins (e.g. for Twitter, Kafka and Flume) are now separate projects, making it easier to pull in only the dependencies you need. * A new StreamingListener interface has been added for monitoring statistics about the streaming computation. * A few aspects of the API have been improved: * `DStream` and `PairDStream` classes have been moved from `org.apache.spark.streaming` to `org.apache.spark.streaming.dstream` to keep it consistent with `org.apache.spark.rdd.RDD`.
* `DStream.foreach` -> `DStream.foreachRDD` to make it explicit that it works for every RDD, not every element * `StreamingContext.awaitTermination()` allows you to wait for context shutdown and catch any exception that occurs in the streaming computation. * `StreamingContext.stop()` now allows stopping of the StreamingContext without stopping the underlying SparkContext. ### GraphX Alpha GraphX is a new API for graph processing that uses recent advances in graph-parallel computation. It lets you build a graph within a Spark program using the standard Spark operators, then process it with new graph operators that are optimized for distributed computation. It includes basic transformations, a Pregel API for iterative computation, and a standard library of graph loaders and analytics algorithms. By offering these features within the Spark engine, GraphX can significantly speed up processing tasks compared to workflows that use different engines. GraphX features in this release include: * Building graphs from arbitrary Spark RDDs * Basic operations to transform graphs or extract subgraphs * An optimized Pregel API that takes advantage of graph partitioning and indexing * Standard algorithms including PageRank, connected components, strongly connected components, SVD++, and triangle counting * Interactive use from the Spark shell GraphX
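To make the new streaming shutdown semantics in those notes concrete, here is a minimal sketch against the 0.9 streaming API; the socket source, host, and port are placeholders:

import org.apache.spark.streaming.{Seconds, StreamingContext}

object ShutdownDemo {
  def main(args: Array[String]) {
    val ssc = new StreamingContext("local[2]", "ShutdownDemo", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder input source
    lines.foreachRDD(rdd => println("batch count: " + rdd.count()))  // note: foreachRDD, not foreach

    ssc.start()
    ssc.awaitTermination()  // blocks, and rethrows any error from the streaming computation
    // Alternatively: ssc.stop(false) stops the streaming computation but
    // leaves the underlying SparkContext running for further batch work.
  }
}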
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc3)
Eventually the notes get posted on the apache website. I attached them to this e-mail so that people can get a sense of what is in the release before they vote on it. On Sun, Jan 19, 2014 at 9:57 PM, Henry Saputra henry.sapu...@gmail.com wrote: Hi Patrick, quick question, where are you planning to add the release notes? I don't think it is part of the source, is it? - Henry On Sun, Jan 19, 2014 at 8:41 PM, Patrick Wendell pwend...@gmail.com wrote: Attempting to attach the release notes again (I think it may have been blocked previously due to not having an extension). On Sun, Jan 19, 2014 at 8:05 PM, Patrick Wendell pwend...@gmail.com wrote: I'll add my +1 as well On Sun, Jan 19, 2014 at 7:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Re-tested on Mac. Matei On Jan 19, 2014, at 7:09 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Starting off. +1 On Sun, Jan 19, 2014 at 2:15 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail. The tag to be voted on is v0.9.0-incubating (commit a7760eff): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=a7760eff4ea6a474cab68896a88550f63bae8b0d The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1004/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc3-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Wednesday, January 22, at 22:15 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc1)
Mridul, thanks a *lot* for pointing this out. This is indeed an issue and something which warrants cutting a new RC. - Patrick On Sat, Jan 18, 2014 at 11:14 AM, Mridul Muralidharan mri...@gmail.com wrote: I would vote -1 for this release until we resolve config property issue [1] : if there is a known resolution for this (which I could not find unfortunately, apologies if it exists !), then will change my vote. Thanks, Mridul [1] http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html On Thu, Jan 16, 2014 at 7:18 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail. The tag to be voted on is v0.9.0-incubating (commit 7348893): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=7348893f0edd96dacce2f00970db1976266f7008 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1001/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc1-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Sunday, January 19, at 02:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc1)
This vote is cancelled in favor of rc2 which I'll post shortly. On Sat, Jan 18, 2014 at 12:14 PM, Patrick Wendell pwend...@gmail.com wrote: Mridul, thanks a *lot* for pointing this out. This is indeed an issue and something which warrants cutting a new RC. - Patrick On Sat, Jan 18, 2014 at 11:14 AM, Mridul Muralidharan mri...@gmail.com wrote: I would vote -1 for this release until we resolve config property issue [1] : if there is a known resolution for this (which I could not find unfortunately, apologies if it exists !), then will change my vote. Thanks, Mridul [1] http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html On Thu, Jan 16, 2014 at 7:18 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail. The tag to be voted on is v0.9.0-incubating (commit 7348893): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=7348893f0edd96dacce2f00970db1976266f7008 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1001/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc1-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Sunday, January 19, at 02:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc2)
I'll kick off the voting with a +1. On Sat, Jan 18, 2014 at 11:05 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.9.0. A draft of the release notes along with the changes file is attached to this e-mail. The tag to be voted on is v0.9.0-incubating (commit 00c847a): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=commit;h=00c847af1d4be2fe5fad887a57857eead1e517dc The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1003/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.9.0-incubating-rc2-docs/ Please vote on releasing this package as Apache Spark 0.9.0-incubating! The vote is open until Wednesday, January 22, at 07:05 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.9.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc1)
I also ran your example locally and it worked with 0.8.1 and 0.9.0-rc1. So it's possible somehow you are pulling in an older version of Spark or an incompatible version of Hadoop. - Patrick On Thu, Jan 16, 2014 at 9:39 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Alex, Thanks for testing out this rc. Would you mind forking this into a different thread so we can discuss there? Also, does your application build and run correctly with spark 0.8.1? That would determine whether the problem is specifically with this rc... Patrick --- sent from my phone On Jan 15, 2014 11:44 PM, Alex Cozzi alexco...@gmail.com wrote: Oh, I forgot: I am using the “yarn” maven profile to target yarn 2.2 Alex Cozzi alexco...@gmail.com On Jan 15, 2014, at 11:41 PM, Alex Cozzi alexco...@gmail.com wrote: Just testing out the rc1. I created a dependent project (using maven) and I copied the HdfsTest.scala test, but I added a single line to save the file back to disk:

package org.apache.spark.examples

import org.apache.spark._

object HdfsTest {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "HdfsTest",
      System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
    val file = sc.textFile(args(1))
    val mapped = file.map(s => s.length).cache()
    for (iter <- 1 to 10) {
      val start = System.currentTimeMillis()
      for (x <- mapped) { x + 2 }
      // println("Processing: " + x)
      val end = System.currentTimeMillis()
      println("Iteration " + iter + " took " + (end - start) + " ms")
      mapped.saveAsTextFile("out")
    }
    System.exit(0)
  }
}

and this is my pom file:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>my.examples</groupId>
  <artifactId>spark-samples</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <inceptionYear>2014</inceptionYear>
  <properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.tools.version>2.10</scala.tools.version>
    <scala.version>2.10.0</scala.version>
  </properties>
  <repositories>
    <repository>
      <id>spark staging</id>
      <url>https://repository.apache.org/content/repositories/orgapachespark-1001</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.tools.version}</artifactId>
      <version>0.9.0-incubating</version>
    </dependency>
    <!-- Test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2_${scala.tools.version}</artifactId>
      <version>1.13</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.tools.version}</artifactId>
      <version>2.0.M6-SNAP8</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <!-- see http://davidb.github.com/scala-maven-plugin -->
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.1.6</version>
        <configuration>
          <scalaCompatVersion>2.10</scalaCompatVersion>
          <jvmArgs>
            <jvmArg>-Xms128m</jvmArg>
            <jvmArg>-Xmx2048m</jvmArg>
          </jvmArgs>
        </configuration>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
Re: [VOTE] Release Apache Spark 0.9.0-incubating (rc1)
I'll kick this vote off with a +1. On Thu, Jan 16, 2014 at 10:43 AM, Patrick Wendell pwend...@gmail.com wrote: I also ran your example locally and it worked with 0.8.1 and 0.9.0-rc1. So it's possible somehow you are pulling in an older version of Spark or an incompatible version of Hadoop. - Patrick On Thu, Jan 16, 2014 at 9:39 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Alex, Thanks for testing out this rc. Would you mind forking this into a different thread so we can discuss there? Also, does your application build and run correctly with spark 0.8.1? That would determine whether the problem is specifically with this rc... Patrick --- sent from my phone On Jan 15, 2014 11:44 PM, Alex Cozzi alexco...@gmail.com wrote: Oh, I forgot: I am using the “yarn” maven profile to target yarn 2.2 Alex Cozzi alexco...@gmail.com On Jan 15, 2014, at 11:41 PM, Alex Cozzi alexco...@gmail.com wrote: Just testing out the rc1. I created a dependent project (using maven) and copied the HdfsTest.scala test, but I added a single line to save the file back to disk:

  package org.apache.spark.examples

  import org.apache.spark._

  object HdfsTest {
    def main(args: Array[String]) {
      val sc = new SparkContext(args(0), "HdfsTest",
        System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
      val file = sc.textFile(args(1))
      val mapped = file.map(s => s.length).cache()
      for (iter <- 1 to 10) {
        val start = System.currentTimeMillis()
        for (x <- mapped) { x + 2 } // println("Processing: " + x)
        val end = System.currentTimeMillis()
        println("Iteration " + iter + " took " + (end - start) + " ms")
        mapped.saveAsTextFile("out")
      }
      System.exit(0)
    }
  }

and this is my pom file:

  <project xmlns="http://maven.apache.org/POM/4.0.0"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>my.examples</groupId>
    <artifactId>spark-samples</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <inceptionYear>2014</inceptionYear>
    <properties>
      <maven.compiler.source>1.6</maven.compiler.source>
      <maven.compiler.target>1.6</maven.compiler.target>
      <encoding>UTF-8</encoding>
      <scala.tools.version>2.10</scala.tools.version>
      <scala.version>2.10.0</scala.version>
    </properties>
    <repositories>
      <repository>
        <id>spark staging</id>
        <url>https://repository.apache.org/content/repositories/orgapachespark-1001</url>
      </repository>
    </repositories>
    <dependencies>
      <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.tools.version}</artifactId>
        <version>0.9.0-incubating</version>
      </dependency>
      <!-- Test -->
      <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
      </dependency>
      <dependency>
        <groupId>org.specs2</groupId>
        <artifactId>specs2_${scala.tools.version}</artifactId>
        <version>1.13</version>
        <scope>test</scope>
      </dependency>
      <dependency>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest_${scala.tools.version}</artifactId>
        <version>2.0.M6-SNAP8</version>
        <scope>test</scope>
      </dependency>
    </dependencies>
    <build>
      <sourceDirectory>src/main/scala</sourceDirectory>
      <testSourceDirectory>src/test/scala</testSourceDirectory>
      <plugins>
        <plugin>
          <!-- see http://davidb.github.com/scala-maven-plugin -->
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>3.1.6</version>
          <configuration>
            <scalaCompatVersion>2.10</scalaCompatVersion>
            <jvmArgs>
              <jvmArg>-Xms128m</jvmArg>
              <jvmArg>-Xmx2048m</jvmArg>
            </jvmArgs>
          </configuration>
          <executions>
            <execution>
              <goals>
Re: testing 0.9.0-incubating and maven
Hey Alex, Maven profiles only affect the Spark build itself. They do not transitively affect your own build. Check out the docs for how to deploy applications on yarn: http://spark.incubator.apache.org/docs/latest/running-on-yarn.html When compiling your application, you should just explicitly add the hadoop version you depend on to your own build (e.g. a hadoop-client dependency). Take a look at the example here where we show adding hadoop-client: http://spark.incubator.apache.org/docs/latest/quick-start.html When deploying Spark applications on YARN, you actually want to mark spark as a provided dependency in your application's Maven build and bundle your application as an assembly jar, then submit it with a Spark YARN bundle to a YARN cluster. The instructions are the same as they were in 0.8.1. For the spark jar you want to submit to YARN, you can download the precompiled Spark one. It might make sense to try this pipeline with 0.8.1 and get it working there. It sounds more like you are dealing with getting the build set up rather than a particular issue with the 0.9.0 RC. - Patrick On Thu, Jan 16, 2014 at 1:13 PM, Alex Cozzi alexco...@gmail.com wrote: Hi Patrick, thank you for testing. I think I found out what is wrong: I am trying to build my own examples that also depend on another library which in turn depends on hadoop 2.2. What was happening is that my library brings in hadoop 2.2, while spark depends on hadoop 1.0.4, and then I think I get conflicting versions of the classes. A couple of things are not clear to me: 1: do the published artifacts support YARN and hadoop 2.2 or will I need to make my own build? 2: if they do, how do I activate the profiles in my maven config? I tried mvn -Pyarn compile but it does not work (maven says “[WARNING] The requested profile yarn could not be activated because it does not exist.”) Essentially I would like to specify the spark dependencies as:

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.tools.version}</artifactId>
      <version>0.9.0-incubating</version>
    </dependency>
  </dependencies>

and tell maven to use the “yarn” profile for this dependency, but I do not seem to be able to make it work. Does anybody have any suggestions? Alex
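[A sketch of the Maven arrangement Patrick describes: Spark marked provided, so it is not bundled into the application's assembly jar, plus an explicit hadoop-client pinned to the cluster's version. The 2.2.0 version number is an assumption matching Alex's cluster, not a value from this thread:

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>0.9.0-incubating</version>
    <scope>provided</scope> <!-- supplied at runtime by the Spark jar you submit to YARN -->
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version> <!-- match the Hadoop/YARN version on the cluster -->
  </dependency>

With provided scope, the conflicting Hadoop 1.x classes never end up inside the assembly jar, which is the class-conflict failure mode Alex hit.]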
Re: spark code formatter?
I'm also very wary of using a code formatter for the reasons already mentioned by Reynold. Does Scalariform have a mode where it just provides style checks rather than reformatting the code? This is something we really need for, e.g., reviewing the many submissions to the project. - Patrick On Wed, Jan 8, 2014 at 11:51 PM, Reynold Xin r...@databricks.com wrote: Thanks for doing that, DB. Not sure about others, but I'm actually strongly against blanket automatic code formatters, given that they can be disruptive. Often humans would intentionally choose to style things in a certain way for clearer semantics and better readability. Code formatters don't capture these nuances. It is pretty dangerous to just auto-format everything. Maybe it'd be ok if we restrict the code formatters to a very limited set of things, such as indenting function parameters, etc. On Wed, Jan 8, 2014 at 10:28 PM, DB Tsai dbt...@alpinenow.com wrote: A pull request for scalariform. https://github.com/apache/incubator-spark/pull/365 Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Wed, Jan 8, 2014 at 10:09 PM, DB Tsai dbt...@alpinenow.com wrote: We use sbt-scalariform in our company, and it automatically formats the code when running `sbt compile`. https://github.com/sbt/sbt-scalariform We ask our developers to run `sbt compile` before committing, and it's really nice to see everyone using the same spacing and indentation. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Wed, Jan 8, 2014 at 9:50 PM, Reynold Xin r...@databricks.com wrote: We have a Scala style configuration file in Shark: https://github.com/amplab/shark/blob/master/scalastyle-config.xml However, the scalastyle project is still pretty primitive and doesn't cover most of the use cases. It is still great to include it to cover basic checks such as 100-char wide lines. On Wed, Jan 8, 2014 at 8:02 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Not that I know of. This would be very useful to add, especially if we can make SBT automatically check the code style (or we can somehow plug this into Jenkins). Matei On Jan 8, 2014, at 11:00 AM, Michael Allman m...@allman.ms wrote: Hi, I've read the spark code style guide for contributors here: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide For scala code, do you have a scalariform configuration that you use to format your code to these specs? Cheers, Michael
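[For the check-without-rewriting direction Patrick asks about, Scalastyle (linked above via Shark's scalastyle-config.xml) reports violations rather than editing files. A minimal configuration sketch with only the 100-character rule Reynold mentions; checker class names come from Scalastyle's standard set:

  <scalastyle>
    <name>Minimal style checks (sketch)</name>
    <!-- flag lines longer than 100 characters without modifying any code -->
    <check level="error" class="org.scalastyle.file.FileLineLengthChecker" enabled="true">
      <parameters>
        <parameter name="maxLineLength">100</parameter>
      </parameters>
    </check>
  </scalastyle>
]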
Re: Build Changes for SBT Users
Ya I was referring to already released versions. Of course we can update for subsequent releases... On Sun, Jan 5, 2014 at 4:24 PM, Reynold Xin r...@databricks.com wrote: Why is it not possible? You can always update the script; you just can't update scripts for released versions. On Sat, Jan 4, 2014 at 9:07 PM, Patrick Wendell pwend...@gmail.com wrote: I agree TD - I was just saying that Reynold's proposal that we could update the release post-hoc is unfortunately not possible. On Sat, Jan 4, 2014 at 7:13 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Patrick, that is right. All we are trying to do is make a best-effort attempt to make it smooth for a new user. The script will try its best to automatically install / download sbt for the user. The fallback will be that the user will have to install sbt on their own. If the URL happens to change and our script fails to automatically download, then we are *no worse* than not providing the script at all. TD On Sat, Jan 4, 2014 at 7:06 PM, Patrick Wendell pwend...@gmail.com wrote: Reynold, the issue is that releases are immutable and we expect them to be downloaded for several years after the release date. On Sat, Jan 4, 2014 at 5:57 PM, Xuefeng Wu ben...@gmail.com wrote: Sounds reasonable. But I think few people have sbt installed, even though it is easy to install. We could provide this script in the online documentation, so users could download it and install sbt independently. Sounds like yet another brew install sbt? :) Yours, Xuefeng Wu 吴雪峰 敬上 On January 5, 2014, at 2:56 AM, Patrick Wendell pwend...@gmail.com wrote: We thought about this but elected not to do this for a few reasons. 1. Some people build from machines that do not have internet access for security reasons and retrieve dependencies from internal Nexus repositories. So having a build dependency that relies on internet downloads is not desirable. 2. It's hard to ensure the stability of a particular URL in perpetuity. This is why maven central and other mirror networks exist. Keep in mind that we can't change the release code ever once we release it, and if something changed about the particular URL it could break the build. - Patrick On Sat, Jan 4, 2014 at 9:34 AM, Andrew Ash and...@andrewash.com wrote: +1 on bundling a script similar to that one On Sat, Jan 4, 2014 at 4:48 AM, Holden Karau hol...@pigscanfly.ca wrote: Could we ship a shell script which downloads the sbt jar if not present (like for example https://github.com/holdenk/slashem/blob/master/sbt)? On Sat, Jan 4, 2014 at 12:02 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Due to an ASF requirement, we recently merged a patch which removes the sbt jar from the build. This is necessary because we aren't allowed to distribute binary artifacts with our source packages. This means that instead of building Spark with sbt/sbt XXX, you'll need to have sbt yourself and just run sbt XXX from within the Spark directory. This is similar to the maven build, where we expect users to already have maven installed. You can download sbt at http://www.scala-sbt.org/. It's okay to just download the most recent version of sbt, since sbt knows how to fetch other versions of itself and will always use the one we specify in our build file to compile spark. - Patrick -- Cell : 425-233-8271
Build Changes for SBT Users
Hey All, Due to an ASF requirement, we recently merged a patch which removes the sbt jar from the build. This is necessary because we aren't allowed to distribute binary artifacts with our source packages. This means that instead of building Spark with sbt/sbt XXX, you'll need to have sbt yourself and just run sbt XXX from within the Spark directory. This is similar to the maven build, where we expect users to already have maven installed. You can download sbt at http://www.scala-sbt.org/. It's okay to just download the most recent version of sbt, since sbt knows how to fetch other versions of itself and will always use the one we specify in our build file to compile spark. - Patrick
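[In practice the change looks like this, as a sketch, assuming sbt installed from scala-sbt.org is on your PATH:

  # before this change: the repo shipped a launcher jar
  #   sbt/sbt assembly
  # after this change: use your own sbt installation from the Spark directory
  cd incubator-spark
  sbt assembly
]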
Re: Build Changes for SBT Users
We thought about this but elected not to do this for a few reasons. 1. Some people build from machines that do not have internet access for security reasons and retrieve dependencies from internal Nexus repositories. So having a build dependency that relies on internet downloads is not desirable. 2. It's hard to ensure the stability of a particular URL in perpetuity. This is why maven central and other mirror networks exist. Keep in mind that we can't change the release code ever once we release it, and if something changed about the particular URL it could break the build. - Patrick On Sat, Jan 4, 2014 at 9:34 AM, Andrew Ash and...@andrewash.com wrote: +1 on bundling a script similar to that one On Sat, Jan 4, 2014 at 4:48 AM, Holden Karau hol...@pigscanfly.ca wrote: Could we ship a shell script which downloads the sbt jar if not present (like for example https://github.com/holdenk/slashem/blob/master/sbt )? On Sat, Jan 4, 2014 at 12:02 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Due to an ASF requirement, we recently merged a patch which removes the sbt jar from the build. This is necessary because we aren't allowed to distribute binary artifacts with our source packages. This means that instead of building Spark with sbt/sbt XXX, you'll need to have sbt yourself and just run sbt XXX from within the Spark directory. This is similar to the maven build, where we expect users to already have maven installed. You can download sbt at http://www.scala-sbt.org/. It's okay to just download the most recent version of sbt, since sbt knows how to fetch other versions of itself and will always use the one we specify in our build file to compile spark. - Patrick -- Cell : 425-233-8271
Re: Build Changes for SBT Users
Hey Holden, That sounds reasonable to me. Where would we get a url we can control though? Right now the project's web space is at incubator.apache... but later this will change to a full apache domain. Is there somewhere in maven central these jars are hosted... that would be the nicest because things like repo1.maven.org basically never change. - Patrick On Sat, Jan 4, 2014 at 1:20 PM, Holden Karau hol...@pigscanfly.ca wrote: That makes sense, I think we could structure a script in such a way that it would overcome these problems though and probably provide a fair amount of benefit for people who just want to get started quickly. The easiest would be to have it use the system sbt if present and then fall back to downloading the sbt jar. As far as stability of the URL goes we could solve this by either having it point at a domain we control, or just with a clear error message indicating it failed to download sbt and the user needs to install sbt. If a restructured script in that manner would be useful I could whip up a pull request :) On Sat, Jan 4, 2014 at 10:56 AM, Patrick Wendell pwend...@gmail.com wrote: We thought about this but elected not to do this for a few reasons. 1. Some people build from machines that do not have internet access for security reasons and retrieve dependencies from internal Nexus repositories. So having a build dependency that relies on internet downloads is not desirable. 2. It's hard to ensure the stability of a particular URL in perpetuity. This is why maven central and other mirror networks exist. Keep in mind that we can't change the release code ever once we release it, and if something changed about the particular URL it could break the build. - Patrick On Sat, Jan 4, 2014 at 9:34 AM, Andrew Ash and...@andrewash.com wrote: +1 on bundling a script similar to that one On Sat, Jan 4, 2014 at 4:48 AM, Holden Karau hol...@pigscanfly.ca wrote: Could we ship a shell script which downloads the sbt jar if not present (like for example https://github.com/holdenk/slashem/blob/master/sbt )? On Sat, Jan 4, 2014 at 12:02 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Due to an ASF requirement, we recently merged a patch which removes the sbt jar from the build. This is necessary because we aren't allowed to distribute binary artifacts with our source packages. This means that instead of building Spark with sbt/sbt XXX, you'll need to have sbt yourself and just run sbt XXX from within the Spark directory. This is similar to the maven build, where we expect users to already have maven installed. You can download sbt at http://www.scala-sbt.org/. It's okay to just download the most recent version of sbt, since sbt knows how to fetch other versions of itself and will always use the one we specify in our build file to compile spark. - Patrick -- Cell : 425-233-8271 -- Cell : 425-233-8271
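[A sketch of the wrapper structure Holden describes: prefer a system sbt, fall back to downloading the launcher jar, and fail with a clear message otherwise. The download URL here is a placeholder only; the whole point of this thread is that no stable URL had been agreed on:

  #!/usr/bin/env bash
  # sbt wrapper sketch: use the system sbt if present, else fetch the launcher jar
  SBT_VERSION=0.12.4                          # illustrative version
  SBT_JAR="sbt-launch-${SBT_VERSION}.jar"
  SBT_URL="http://example.org/sbt/${SBT_JAR}" # placeholder, not a real mirror

  if command -v sbt >/dev/null 2>&1; then
    exec sbt "$@"                             # system sbt takes priority
  fi
  if [ ! -f "${SBT_JAR}" ]; then
    curl -fL -o "${SBT_JAR}" "${SBT_URL}" || {
      echo "Failed to download sbt; please install it from http://www.scala-sbt.org/" >&2
      exit 1
    }
  fi
  exec java -Xmx1200m -jar "${SBT_JAR}" "$@"
]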
Re: Build Changes for SBT Users
Reynold, the issue is that releases are immutable and we expect them to be downloaded for several years after the release date. On Sat, Jan 4, 2014 at 5:57 PM, Xuefeng Wu ben...@gmail.com wrote: Sounds reasonable. But I think few people have sbt installed, even though it is easy to install. We could provide this script in the online documentation, so users could download it and install sbt independently. Sounds like yet another brew install sbt? :) Yours, Xuefeng Wu 吴雪峰 敬上 On January 5, 2014, at 2:56 AM, Patrick Wendell pwend...@gmail.com wrote: We thought about this but elected not to do this for a few reasons. 1. Some people build from machines that do not have internet access for security reasons and retrieve dependencies from internal Nexus repositories. So having a build dependency that relies on internet downloads is not desirable. 2. It's hard to ensure the stability of a particular URL in perpetuity. This is why maven central and other mirror networks exist. Keep in mind that we can't change the release code ever once we release it, and if something changed about the particular URL it could break the build. - Patrick On Sat, Jan 4, 2014 at 9:34 AM, Andrew Ash and...@andrewash.com wrote: +1 on bundling a script similar to that one On Sat, Jan 4, 2014 at 4:48 AM, Holden Karau hol...@pigscanfly.ca wrote: Could we ship a shell script which downloads the sbt jar if not present (like for example https://github.com/holdenk/slashem/blob/master/sbt )? On Sat, Jan 4, 2014 at 12:02 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Due to an ASF requirement, we recently merged a patch which removes the sbt jar from the build. This is necessary because we aren't allowed to distribute binary artifacts with our source packages. This means that instead of building Spark with sbt/sbt XXX, you'll need to have sbt yourself and just run sbt XXX from within the Spark directory. This is similar to the maven build, where we expect users to already have maven installed. You can download sbt at http://www.scala-sbt.org/. It's okay to just download the most recent version of sbt, since sbt knows how to fetch other versions of itself and will always use the one we specify in our build file to compile spark. - Patrick -- Cell : 425-233-8271
Re: Build Changes for SBT Users
I agree TD - I was just saying that Reynold's proposal that we could update the release post-hoc is unfortunately not possible. On Sat, Jan 4, 2014 at 7:13 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Patrick, that is right. All we are trying to do is make a best-effort attempt to make it smooth for a new user. The script will try its best to automatically install / download sbt for the user. The fallback will be that the user will have to install sbt on their own. If the URL happens to change and our script fails to automatically download, then we are *no worse* than not providing the script at all. TD On Sat, Jan 4, 2014 at 7:06 PM, Patrick Wendell pwend...@gmail.com wrote: Reynold, the issue is that releases are immutable and we expect them to be downloaded for several years after the release date. On Sat, Jan 4, 2014 at 5:57 PM, Xuefeng Wu ben...@gmail.com wrote: Sounds reasonable. But I think few people have sbt installed, even though it is easy to install. We could provide this script in the online documentation, so users could download it and install sbt independently. Sounds like yet another brew install sbt? :) Yours, Xuefeng Wu 吴雪峰 敬上 On January 5, 2014, at 2:56 AM, Patrick Wendell pwend...@gmail.com wrote: We thought about this but elected not to do this for a few reasons. 1. Some people build from machines that do not have internet access for security reasons and retrieve dependencies from internal Nexus repositories. So having a build dependency that relies on internet downloads is not desirable. 2. It's hard to ensure the stability of a particular URL in perpetuity. This is why maven central and other mirror networks exist. Keep in mind that we can't change the release code ever once we release it, and if something changed about the particular URL it could break the build. - Patrick On Sat, Jan 4, 2014 at 9:34 AM, Andrew Ash and...@andrewash.com wrote: +1 on bundling a script similar to that one On Sat, Jan 4, 2014 at 4:48 AM, Holden Karau hol...@pigscanfly.ca wrote: Could we ship a shell script which downloads the sbt jar if not present (like for example https://github.com/holdenk/slashem/blob/master/sbt)? On Sat, Jan 4, 2014 at 12:02 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Due to an ASF requirement, we recently merged a patch which removes the sbt jar from the build. This is necessary because we aren't allowed to distribute binary artifacts with our source packages. This means that instead of building Spark with sbt/sbt XXX, you'll need to have sbt yourself and just run sbt XXX from within the Spark directory. This is similar to the maven build, where we expect users to already have maven installed. You can download sbt at http://www.scala-sbt.org/. It's okay to just download the most recent version of sbt, since sbt knows how to fetch other versions of itself and will always use the one we specify in our build file to compile spark. - Patrick -- Cell : 425-233-8271
Re: Changes that affect packaging and running Spark
Small correction: /sbin contains administrative scripts for launching the standalone cluster manager: /sbin/start-master.sh, /sbin/start-all.sh, etc.
Re: Terminology: worker vs slave
Ya we've been trying to standardize on the terminology here (see glossary): http://spark.incubator.apache.org/docs/latest/cluster-overview.html I think slave actually isn't mentioned here at all - but references to slave in the codebase are synonymous with worker. - Patrick On Thu, Jan 2, 2014 at 10:42 PM, Reynold Xin r...@databricks.com wrote: It is historic. I think we are converging towards:
worker: the slave daemon in the standalone cluster manager
executor: the jvm process that is launched by the worker that executes tasks
On Thu, Jan 2, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com wrote: The terms worker and slave seem to be used interchangeably. Are they the same? Worker is used more frequently in the codebase:
aash@aash-mbp ~/git/spark$ git grep -i worker | wc -l
981
aash@aash-mbp ~/git/spark$ git grep -i slave | wc -l
348
Does it make sense to unify on one or the other?
Disallowing null mergeCombiners
Hey All, There is a small API change that we are considering for the external sort patch. Previously we allowed mergeCombiners to be null when map-side aggregation was not enabled. This is because it wasn't necessary in that case, since mappers didn't ship pre-aggregated values to reducers. Because the external sort capability also relies on the mergeCombiners function to merge partially-aggregated on-disk segments, we now need it all the time, even if map-side aggregation is not enabled. This is a fairly esoteric thing that I'm not sure anyone other than Shark ever used, but I want to check in case anyone has feelings about this. The relevant code is here: https://github.com/apache/incubator-spark/pull/303/files#diff-f70e97c099b5eac05c75288cb215e080R72 - Patrick
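[For context, a sketch of the call site this change affects. combineByKey takes all three functions, and the third one can no longer be passed as null even when map-side aggregation is off. The SparkContext setup is a hypothetical local example:

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._  // implicit pair-RDD conversions

  val sc = new SparkContext("local", "mergeCombinersExample")
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
  val counts = pairs.combineByKey(
    (v: Int) => v,                  // createCombiner
    (c: Int, v: Int) => c + v,      // mergeValue
    (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: now always required,
  )                                 // since spilled on-disk segments must be merged
]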
Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st
Hey Andy - these Nabble groups look great! Thanks for setting them up. On Tue, Dec 24, 2013 at 10:49 AM, Evan Chan e...@ooyala.com wrote: Thanks Andy, at first glance Nabble seems great: it allows search plus posting new topics, so it appears to be bidirectional. Now I just have to register an account on there. On Sun, Dec 22, 2013 at 2:47 PM, Andy Konwinski andykonwin...@gmail.com wrote: Per Matei's suggestion, I've set up two nabble archive lists, one to archive the apache dev list and one to archive the apache user list. user list archive: http://apache-spark-user-list.1001560.n3.nabble.com dev list archive: http://apache-spark-developers-list.1001551.n3.nabble.com Between these and whatever solution we end up with for the google group mirrors, we should have decent enough alternatives to reading via the apache list archives going forward. On Thu, Dec 19, 2013 at 11:09 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yes, I agree that we should close down the existing Google group on Jan 1st. While it’s more convenient to use, it’s created confusion. I hope that we can get the ASF to support better search interfaces in the future too. I think we just have to drive this from within. The Google Group should be a nice way to make the content searchable from the web. We should also see what it takes to make it mirrored on Nabble (http://www.nabble.com). I’ve found a lot of information about other projects there, and other Apache projects do use it. Matei On Dec 19, 2013, at 10:49 PM, Andy Konwinski andykonwin...@gmail.com wrote: I've set up two new unofficial google groups to mirror the Apache Spark user and dev lists: https://groups.google.com/forum/#!forum/apache-spark-dev-mirror https://groups.google.com/forum/#!forum/apache-spark-user-mirror Basically these lists each subscribe to the corresponding Apache list. They do not allow folks to subscribe directly to them. Getting emails from the Google Group would offer no advantages that I can think of and we really want to encourage folks to sign up for the official mailing list instead. The lists do allow the public to send email to them, which I think might be necessary since the from: field for all emails that get distributed via the Apache mailing list is set to the author of the email. I think this might be a great compromise. At least we can try this out and see how it goes. Matei, can you confirm that Jan 1 is the date we want to turn off the existing spark-users google group? We could consider using the existing spark-developers and spark-users google groups instead of the two new ones I just created but I think that it is much more obvious to have the lists include the word mirror in their names. The dev list mirror seems to be working, because I see the last couple emails from this thread in it already. I'll confirm and ensure that the user list mirror is working too. Thoughts? Andy P.S. Thanks to Patrick for suggesting this to me originally. On Thu, Dec 19, 2013 at 8:46 PM, Aaron Davidson ilike...@gmail.com wrote: I'd be fine with one-way mirrors here (Apache threads being reflected in Google groups) -- I have no idea how one is supposed to navigate the Apache list to look for historic threads. On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts maspo...@gmail.com wrote: Thanks very much for the prompt and comprehensive reply! 
I appreciate the overarching desire to integrate with apache: I'm very happy to hear that there's a move to use the existing groups as mirrors: that will overcome all of my objections, particularly if it's bidirectional! :) On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote: Hey Mike, As you probably noticed when you CC'd spark-de...@googlegroups.com, that list has already been reconfigured so that it no longer allows posting (and bounces emails sent to it). We will be doing the same thing to the spark...@googlegroups.com list too (we'll announce a date for that soon). That may sound very frustrating, and you are *not* alone in feeling that way. We've had a long conversation with our mentors about this, and I've felt very similarly to you, so I'd like to give you some background. As I'm coming to see it, part of becoming an Apache project is moving the community *fully* over to Apache infrastructure, and more generally the Apache way of organizing the community. This applies in both the nuts-and-bolts sense of being on apache infra, but possibly more importantly, it is also a guiding principle and way of thinking. In various ways, moving to apache infra can be a painful process, and IMO the loss of all the great mailing list functionality that comes with using Google Groups is perhaps the most painful step. But basically, the de facto mailing lists need to be the Apache ones, and not Google Groups.
Re: Akka problem when using scala command to launch Spark applications in the current 0.9.0-SNAPSHOT
Evan, This problem also exists for people who write their own applications that depend on/include Spark. E.g. they bundle up their app and then launch the driver with scala -cp my-bundle.jar... I've seen this cause an issue in that setting. - Patrick On Tue, Dec 24, 2013 at 10:50 AM, Evan Chan e...@ooyala.com wrote: Hi Reynold, The default, documented methods of starting Spark all use the assembly jar, and thus java, right? -Evan On Fri, Dec 20, 2013 at 11:36 PM, Reynold Xin r...@databricks.com wrote: It took me hours to debug a problem yesterday on the latest master branch (0.9.0-SNAPSHOT), and I would like to share it with the dev list in case anybody runs into this Akka problem. A little background for those of you who haven't followed the development of Spark and YARN 2.2 closely: YARN 2.2 uses protobuf 2.5, and Akka uses an older version of protobuf that is not binary compatible. In order to have a single build that is compatible with both YARN 2.2 and pre-2.2 YARN/Hadoop, we published a special version of Akka that builds with protobuf shaded (i.e. using a different package name for the protobuf stuff). However, it turned out Scala 2.10 includes a version of the Akka jar in its default classpath (look at the lib folder in the Scala 2.10 binary distribution). If you use the scala command to launch any Spark application on the current master branch, there is a pretty high chance that you won't be able to create the SparkContext (stack trace at the end of the email). The problem is that the Akka packaged with Scala 2.10 takes precedence in the classloader over the special Akka version Spark includes. Before we have a good solution for this, the workaround is to use java to launch the application instead of scala. All you need to do is to include the right Scala jars (scala-library and scala-compiler) in the classpath. Note that the scala command is really just a simple script that calls java with the right classpath. Stack trace:

java.lang.NoSuchMethodException: akka.remote.RemoteActorRefProvider.<init>(java.lang.String, akka.actor.ActorSystem$Settings, akka.event.EventStream, akka.actor.Scheduler, akka.actor.DynamicAccess)
  at java.lang.Class.getConstructor0(Class.java:2763)
  at java.lang.Class.getDeclaredConstructor(Class.java:2021)
  at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:77)
  at scala.util.Try$.apply(Try.scala:161)
  at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:74)
  at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85)
  at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:85)
  at scala.util.Success.flatMap(Try.scala:200)
  at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:85)
  at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:546)
  at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
  at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
  at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:79)
  at org.apache.spark.SparkEnv$.createFromSystemProperties(SparkEnv.scala:120)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:106)

-- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyala http://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
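[A sketch of the workaround Reynold describes: launch the driver with java instead of the scala script, supplying only the Scala jars you need on the classpath so the Akka bundled in the Scala distribution never gets picked up. Paths, jar names, and the driver class are illustrative assumptions:

  # bypass the scala launcher so Scala 2.10's bundled Akka cannot shadow
  # Spark's protobuf-shaded Akka build
  SCALA_HOME=/opt/scala-2.10   # assumption: wherever your Scala distribution lives
  java -cp my-bundle.jar:spark-assembly.jar:$SCALA_HOME/lib/scala-library.jar:$SCALA_HOME/lib/scala-compiler.jar \
    com.example.MyDriver
]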
Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st
Andy and Mike, I'd also prefer to just convert the old groups into mirrors. That way people who are still subscribed to them will continue to get e-mails (and most people on the list are read-only users). Ideally we'd have the behavior that users who try to e-mail the google group get a bounce back saying this is now a read-only mirror. That said, I have *no idea* if this is possible to set up nicely within google groups. I defer to Andy! Having the new mirror groups also seems like a decent solution as well... - Patrick On Fri, Dec 20, 2013 at 8:35 AM, Mike Potts maspo...@gmail.com wrote: I actually prefer that, but I didn't want my preference to get in the way of creating mirror groups, one way or the other :) (My argument would be that since the old groups would be closing anyway, re-purposing them as mirrors is fair use: and less work/confusion than creating new *-mirror groups instead.) On Friday, December 20, 2013 8:29:40 AM UTC-8, Andy Konwinski wrote: That would be really awesome. I'm not familiar with any Google Groups functionality that supports that but I'll look. That's an argument for maybe just changing the names of the existing groups to something with mirror in them instead of using newly created ones.
Spark 0.8.1 Released
Hi everyone, We've just posted Spark 0.8.1, a new maintenance release that contains some bug fixes and improvements to the 0.8 branch. The full release notes are available at [1]. Apart from various bug fixes, 0.8.1 includes support for YARN 2.2, a high availability mode for the standalone scheduler, and optimizations to the shuffle. We recommend that current users update to this release. You can grab the release at [2]. [1] http://spark.incubator.apache.org/releases/spark-release-0-8-1.html [2] http://spark.incubator.apache.org/downloads Thanks to the following people who contributed to this release: Michael Armbrust, Pierre Borckmans, Evan Chan, Ewen Cheslack, Mosharaf Chowdhury, Frank Dai, Aaron Davidson, Tathagata Das, Ankur Dave, Harvey Feng, Ali Ghodsi, Thomas Graves, Li Guoqiang, Stephen Haberman, Haidar Hadi, Nathan Howell, Holden Karau, Du Li, Raymond Liu, Xi Liu, David McCauley, Michael (wannabeast), Fabrizio Milo, Mridul Muralidharan, Sundeep Narravula, Kay Ousterhout, Nick Pentreath, Imran Rashid, Ahir Reddy, Josh Rosen, Henry Saputra, Jerry Shao, Mingfei Shi, Andre Schumacher, Karthik Tunga, Patrick Wendell, Neal Wiggins, Andrew Xia, Reynold Xin, Matei Zaharia, and Wu Zeming - Patrick
[RESULT] [VOTE] Release Apache Spark 0.8.1-incubating (rc4)
The vote is now closed. This vote passes with 4 IPMC +1's and no 0 or -1 votes. +1 (4 Total) Marvin Humphrey Henry Saputra Chris Mattmann Roman Shaposhnik 0 (0 Total) -1 (0 Total) * = Binding Vote Thanks to everyone who helped vet this release. - Patrick
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)
You can check out the docs mentioned in the vote thread. There is also a pre-built binary for hadoop2 that is compiled for YARN 2.2 - Patrick On Sun, Dec 15, 2013 at 4:31 AM, Azuryy Yu azury...@gmail.com wrote: yarn 2.2, not yarn 0.22, I am so sorry. On Sun, Dec 15, 2013 at 8:31 PM, Azuryy Yu azury...@gmail.com wrote: Hi, Spark-0.8.1 supports yarn 0.22, right? Where can I find the release notes? Thanks. On Sun, Dec 15, 2013 at 3:20 AM, Henry Saputra henry.sapu...@gmail.com wrote: Yeah seems like it. He was ok with our prev release. Let's wait for his reply On Saturday, December 14, 2013, Patrick Wendell wrote: Henry - from that thread it looks like sebb's concern was something different than this. On Sat, Dec 14, 2013 at 11:08 AM, Henry Saputra henry.sapu...@gmail.com wrote: Hi Patrick, Yeap I agree, but technically the ASF VOTE release is on source only (there's even debate about it =) ), so putting it in the vote staging artifact could confuse people, because in our case we do package 3rd party libraries in the binary jars. I have sent an email to sebb asking for clarification about his concern on the general@ list. - Henry On Sat, Dec 14, 2013 at 10:56 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Henry, One thing a lot of people do during the vote is test the binaries and make sure they work. This is really valuable. If you'd like I could add a caveat to the vote thread explaining that we are only voting on the source. - Patrick On Sat, Dec 14, 2013 at 10:40 AM, Henry Saputra henry.sapu...@gmail.com wrote: Actually we should be fine putting the binaries there as long as the VOTE is for the source. Let's verify with sebb in the general@ list about his concern. - Henry On Sat, Dec 14, 2013 at 10:31 AM, Henry Saputra henry.sapu...@gmail.com wrote: Hi Patrick, as sebb has mentioned, let's move the binaries from the voting directory in your people.apache.org directory. ASF release voting is for source code and not binaries; technically we provide binaries for convenience. And add a link to the KEYS location in dist [1] so people can verify signatures. Sorry for the late response to the VOTE thread, guys. - Henry [1] https://dist.apache.org/repos/dist/release/incubator/spark/KEYS On Fri, Dec 13, 2013 at 6:37 PM, Patrick Wendell pwend...@gmail.com wrote: The vote is now closed. This vote passes with 5 PPMC +1's and no 0 or -1 votes. +1 (5 Total) Matei Zaharia* Nick Pentreath* Patrick Wendell* Prashant Sharma* Tom Graves* 0 (0 Total) -1 (0 Total) * = Binding Vote As per the incubator release guide [1] I'll be sending this to the general incubator list for a final vote from IPMC members. [1] http://incubator.apache.org/guides/releasemanagement.html#best-practice-incubator-release-vote On Thu, Dec 12, 2013 at 8:59 AM, Evan Chan e...@ooyala.com wrote: I'd be personally fine with a standard workflow of assemble-deps + packaging just the Spark files as separate packages, if it speeds up everyone's development time. On Wed, Dec 11, 2013 at 1:10 PM, Mark Hamstra m...@clearstorydata.com wrote: I don't know how to make sense of the numbers, but here's what I've got from a very small sample size.
Re: Scala 2.10 Merge
Alright I just merged this in - so Spark is officially Scala 2.10 from here forward. For reference I cut a new branch called scala-2.9 with the commit immediately prior to the merge: https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=shortlog;h=refs/heads/scala-2.9 - Patrick On Thu, Dec 12, 2013 at 8:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Raymond, Let's move this discussion out of this thread and into the associated JIRA. I'll write up our current approach over there. https://spark-project.atlassian.net/browse/SPARK-995 - Patrick On Thu, Dec 12, 2013 at 5:56 PM, Liu, Raymond raymond@intel.com wrote: Hi Patrick So what's the plan for supporting YARN 2.2 in 0.9? As far as I can see, if you want to support both 2.2 and 2.0, then due to the protobuf version incompatibility issue you need two versions of akka anyway. Akka 2.3-M1 looks like it has a few API changes; we probably could isolate the code like what we did on the yarn part of the API. I remember it was mentioned that using reflection for the different APIs is preferred. So the purpose of using reflection is to use one release bin jar to support both versions of Hadoop/Yarn at runtime, instead of building different bin jars at compile time? Then all code related to hadoop will also be built in separate modules for loading on demand? This sounds to me like it involves a lot of work. And you still need to have a shim layer and separate code for the different version APIs, and depend on different versions of Akka etc. Sounds like even stricter demands versus our current approach on master, with a dynamic class loader in addition. And the problems we are facing now are still there? Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 12, 2013 5:13 PM To: dev@spark.incubator.apache.org Subject: Re: Scala 2.10 Merge Also - the code is still there because of a recent merge that took in some newer changes... we'll be removing it for the final merge. On Thu, Dec 12, 2013 at 1:12 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Raymond, This won't work because AFAIK akka 2.3-M1 is not binary compatible with akka 2.2.3 (right?). For all of the non-yarn 2.2 versions we need to still use the older protobuf library, so we'd need to support both. I'd also be concerned about having a reference to a non-released version of akka. Akka is the source of our hardest-to-find bugs and simultaneously trying to support 2.2.3 and 2.3-M1 is a bit daunting. Of course, if you are building off of master you can maintain a fork that uses this. - Patrick On Thu, Dec 12, 2013 at 12:42 AM, Liu, Raymond raymond@intel.com wrote: Hi Patrick What does that mean for dropping YARN 2.2? It seems the code is still there. You mean if built upon 2.2 it will break and won't work, right? Since the home-made akka build on scala 2.10 isn't there. While, for this case, can we just use akka 2.3-M1, which runs on protobuf 2.5, as a replacement? Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 12, 2013 4:21 PM To: dev@spark.incubator.apache.org Subject: Scala 2.10 Merge Hi Developers, In the next few days we are planning to merge Scala 2.10 support into Spark. For those that haven't been following this, Prashant Sharma has been maintaining the scala-2.10 branch of Spark for several months. 
This branch is current with master and has been reviewed for merging: https://github.com/apache/incubator-spark/tree/scala-2.10 Scala 2.10 support is one of the most requested features for Spark - it will be great to get this into Spark 0.9! Please note that *Scala 2.10 is not binary compatible with Scala 2.9*. With that in mind, I wanted to give a few heads-up/requests to developers: If you are developing applications on top of Spark's master branch, those will need to migrate to Scala 2.10. You may want to download and test the current scala-2.10 branch in order to make sure you will be okay as Spark developments move forward. Of course, you can always stick with the current master commit and be fine (I'll cut a tag when we do the merge in order to delineate where the version changes). Please open new threads on the dev list to report and discuss any issues. This merge will temporarily drop support for YARN 2.2 on the master branch. This is because the workaround we used was only compiled for Scala 2.9. We are going to come up with a more robust solution to YARN 2.2 support before releasing 0.9. Going forward, we will continue to make maintenance releases on branch-0.8 which will remain compatible with Scala 2.9. For those interested, the primary code changes in this merge are upgrading the akka version, changing the use of Scala 2.9's ClassManifest construct to Scala 2.10's ClassTag, and updating the spark shell to work with Scala 2.10's repl. - Patrick
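[For application authors, the ClassManifest-to-ClassTag change mentioned above usually looks like this sketch; the generic helper is a hypothetical example, not Spark code:

  // Scala 2.9 spelling (deprecated in 2.10):
  //   def mkArray[T: ClassManifest](xs: T*): Array[T] = xs.toArray
  // Scala 2.10 spelling:
  import scala.reflect.ClassTag
  def mkArray[T: ClassTag](xs: T*): Array[T] = xs.toArray  // toArray needs the runtime class

The context bound still just asks the compiler to pass evidence of T's runtime class; only the name and package of that evidence type changed.]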
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)
The vote is now closed. This vote passes with 5 PPMC +1's and no 0 or -1 votes. +1 (5 Total) Matei Zaharia* Nick Pentreath* Patrick Wendell* Prashant Sharma* Tom Graves* 0 (0 Total) -1 (0 Total) * = Binding Vote As per the incubator release guide [1] I'll be sending this to the general incubator list for a final vote from IPMC members. [1] http://incubator.apache.org/guides/releasemanagement.html#best-practice-incubator-release-vote On Thu, Dec 12, 2013 at 8:59 AM, Evan Chan e...@ooyala.com wrote: I'd be personally fine with a standard workflow of assemble-deps + packaging just the Spark files as separate packages, if it speeds up everyone's development time. On Wed, Dec 11, 2013 at 1:10 PM, Mark Hamstra m...@clearstorydata.com wrote: I don't know how to make sense of the numbers, but here's what I've got from a very small sample size. For both v0.8.0-incubating and v0.8.1-incubating, building separate assemblies is faster than `./sbt/sbt assembly` and the times for building separate assemblies for 0.8.0 and 0.8.1 are about the same. For v0.8.0-incubating, `./sbt/sbt assembly` takes about 2.5x as long as the sum of the separate assemblies. For v0.8.1-incubating, `./sbt/sbt assembly` takes almost 8x as long as the sum of the separate assemblies. Weird. On Wed, Dec 11, 2013 at 11:49 AM, Patrick Wendell pwend...@gmail.com wrote: I'll +1 myself also. For anyone who has the slow build problem: does this issue happen when building v0.8.0-incubating also? Trying to figure out whether it's related to something we added in 0.8.1 or if it's a long-standing issue. - Patrick On Wed, Dec 11, 2013 at 10:39 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Woah, weird, but definitely good to know. If you’re doing Spark development, there’s also a more convenient option added by Shivaram in the master branch. You can do sbt assemble-deps to package *just* the dependencies of each project in a special assembly JAR, and then use sbt compile to update the code. This will use the classes directly out of the target/scala-2.9.3/classes directories. You have to redo assemble-deps only if your external dependencies change. Matei On Dec 11, 2013, at 1:04 AM, Prashant Sharma scrapco...@gmail.com wrote: I hope this PR https://github.com/apache/incubator-spark/pull/252 can help. Again this is not a blocker for the release from my side either. On Wed, Dec 11, 2013 at 2:14 PM, Mark Hamstra m...@clearstorydata.com wrote: Interesting, and confirmed: On my machine where `./sbt/sbt assembly` takes a long, long, long time to complete (a MBP, in my case), building three separate assemblies (`./sbt/sbt assembly/assembly`, `./sbt/sbt examples/assembly`, `./sbt/sbt tools/assembly`) takes much, much less time. On Wed, Dec 11, 2013 at 12:02 AM, Prashant Sharma scrapco...@gmail.com wrote: Forgot to mention: after running sbt/sbt assembly/assembly, running sbt/sbt examples/assembly takes just 37s. Not to mention my hardware is not really great. On Wed, Dec 11, 2013 at 1:28 PM, Prashant Sharma scrapco...@gmail.com wrote: Hi Patrick and Matei, I was trying this out and followed the quick start guide, which says to do sbt/sbt assembly; like a few others I was also stuck for a few minutes on linux. On the other hand, if I use sbt/sbt assembly/assembly it is much faster. Should we change the documentation to reflect this? It will not be great for first-time users to get stuck there. On Wed, Dec 11, 2013 at 9:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Built and tested it on Mac OS X. 
Matei On Dec 10, 2013, at 4:49 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. The tag to be voted on is v0.8.1-incubating (commit b87d31d): https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=commit;h=b87d31dd8eb4b4e47c0138e9242d0dd6922c8c4e The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-040/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4-docs/ For information about the contents of this release see: https://git-wip-us.apache.org/repos/asf
Re: Scala 2.10 Merge
Also - the code is still there because of a recent merge that took in some newer changes... we'll be removing it for the final merge. On Thu, Dec 12, 2013 at 1:12 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Raymond, This won't work because AFAIK akka 2.3-M1 is not binary compatible with akka 2.2.3 (right?). For all of the non-yarn 2.2 versions we need to still use the older protobuf library, so we'd need to support both. I'd also be concerned about having a reference to a non-released version of akka. Akka is the source of our hardest-to-find bugs and simultaneously trying to support 2.2.3 and 2.3-M1 is a bit daunting. Of course, if you are building off of master you can maintain a fork that uses this. - Patrick On Thu, Dec 12, 2013 at 12:42 AM, Liu, Raymond raymond@intel.com wrote: Hi Patrick What does that mean for dropping YARN 2.2? It seems the code is still there. You mean if built upon 2.2 it will break and won't work, right? Since the home-made akka build on scala 2.10 isn't there. While, for this case, can we just use akka 2.3-M1, which runs on protobuf 2.5, as a replacement? Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 12, 2013 4:21 PM To: dev@spark.incubator.apache.org Subject: Scala 2.10 Merge Hi Developers, In the next few days we are planning to merge Scala 2.10 support into Spark. For those that haven't been following this, Prashant Sharma has been maintaining the scala-2.10 branch of Spark for several months. This branch is current with master and has been reviewed for merging: https://github.com/apache/incubator-spark/tree/scala-2.10 Scala 2.10 support is one of the most requested features for Spark - it will be great to get this into Spark 0.9! Please note that *Scala 2.10 is not binary compatible with Scala 2.9*. With that in mind, I wanted to give a few heads-up/requests to developers: If you are developing applications on top of Spark's master branch, those will need to migrate to Scala 2.10. You may want to download and test the current scala-2.10 branch in order to make sure you will be okay as Spark developments move forward. Of course, you can always stick with the current master commit and be fine (I'll cut a tag when we do the merge in order to delineate where the version changes). Please open new threads on the dev list to report and discuss any issues. This merge will temporarily drop support for YARN 2.2 on the master branch. This is because the workaround we used was only compiled for Scala 2.9. We are going to come up with a more robust solution to YARN 2.2 support before releasing 0.9. Going forward, we will continue to make maintenance releases on branch-0.8 which will remain compatible with Scala 2.9. For those interested, the primary code changes in this merge are upgrading the akka version, changing the use of Scala 2.9's ClassManifest construct to Scala 2.10's ClassTag, and updating the spark shell to work with Scala 2.10's repl. - Patrick
Re: Scala 2.10 Merge
Hey Raymond, Let's move this discussion out of this thread and into the associated JIRA. I'll write up our current approach over there. https://spark-project.atlassian.net/browse/SPARK-995 - Patrick On Thu, Dec 12, 2013 at 5:56 PM, Liu, Raymond raymond@intel.com wrote: Hi Patrick So what's the plan for supporting YARN 2.2 in 0.9? As far as I can see, if you want to support both 2.2 and 2.0, then due to the protobuf version incompatibility issue you need two versions of akka anyway. Akka 2.3-M1 looks like it has a few API changes; we probably could isolate the code like what we did on the yarn part of the API. I remember it was mentioned that using reflection for the different APIs is preferred. So the purpose of using reflection is to use one release bin jar to support both versions of Hadoop/Yarn at runtime, instead of building different bin jars at compile time? Then all code related to hadoop will also be built in separate modules for loading on demand? This sounds to me like it involves a lot of work. And you still need to have a shim layer and separate code for the different version APIs, and depend on different versions of Akka etc. Sounds like even stricter demands versus our current approach on master, with a dynamic class loader in addition. And the problems we are facing now are still there? Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 12, 2013 5:13 PM To: dev@spark.incubator.apache.org Subject: Re: Scala 2.10 Merge Also - the code is still there because of a recent merge that took in some newer changes... we'll be removing it for the final merge. On Thu, Dec 12, 2013 at 1:12 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Raymond, This won't work because AFAIK akka 2.3-M1 is not binary compatible with akka 2.2.3 (right?). For all of the non-yarn 2.2 versions we need to still use the older protobuf library, so we'd need to support both. I'd also be concerned about having a reference to a non-released version of akka. Akka is the source of our hardest-to-find bugs and simultaneously trying to support 2.2.3 and 2.3-M1 is a bit daunting. Of course, if you are building off of master you can maintain a fork that uses this. - Patrick On Thu, Dec 12, 2013 at 12:42 AM, Liu, Raymond raymond@intel.com wrote: Hi Patrick What does that mean for dropping YARN 2.2? It seems the code is still there. You mean if built upon 2.2 it will break and won't work, right? Since the home-made akka build on scala 2.10 isn't there. While, for this case, can we just use akka 2.3-M1, which runs on protobuf 2.5, as a replacement? Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 12, 2013 4:21 PM To: dev@spark.incubator.apache.org Subject: Scala 2.10 Merge Hi Developers, In the next few days we are planning to merge Scala 2.10 support into Spark. For those that haven't been following this, Prashant Sharma has been maintaining the scala-2.10 branch of Spark for several months. This branch is current with master and has been reviewed for merging: https://github.com/apache/incubator-spark/tree/scala-2.10 Scala 2.10 support is one of the most requested features for Spark - it will be great to get this into Spark 0.9! Please note that *Scala 2.10 is not binary compatible with Scala 2.9*. With that in mind, I wanted to give a few heads-up/requests to developers: If you are developing applications on top of Spark's master branch, those will need to migrate to Scala 2.10. 
You may want to download and test the current scala-2.10 branch in order to make sure you will be okay as Spark developments move forward. Of course, you can always stick with the current master commit and be fine (I'll cut a tag when we do the merge in order to delineate where the version changes). Please open new threads on the dev list to report and discuss any issues. This merge will temporarily drop support for YARN 2.2 on the master branch. This is because the workaround we used was only compiled for Scala 2.9. We are going to come up with a more robust solution to YARN 2.2 support before releasing 0.9. Going forward, we will continue to make maintenance releases on branch-0.8 which will remain compatible with Scala 2.9. For those interested, the primary code changes in this merge are upgrading the akka version, changing the use of Scala 2.9's ClassManifest construct to Scala 2.10's ClassTag, and updating the spark shell to work with Scala 2.10's repl. - Patrick
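[As an aside on the reflection idea Raymond raises, the usual shape is to pick a shim implementation by class name at runtime, so one binary serves both YARN APIs and the unselected shim's dependencies never need to be on the classpath. A sketch with entirely hypothetical class names, not the approach the project actually settled on:

  trait YarnShim {
    def start(): Unit
  }

  def loadShim(yarnVersion: String): YarnShim = {
    // hypothetical shim classes; only the selected one is ever loaded
    val className =
      if (yarnVersion.startsWith("2.2")) "org.example.Yarn22Shim"
      else "org.example.Yarn20Shim"
    Class.forName(className).newInstance().asInstanceOf[YarnShim]
  }
]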
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)
I'll +1 myself also. For anyone who has the slow build problem: does this issue happen when building v0.8.0-incubating also? Trying to figure out whether it's related to something we added in 0.8.1 or if it's a long-standing issue. - Patrick On Wed, Dec 11, 2013 at 10:39 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Woah, weird, but definitely good to know. If you’re doing Spark development, there’s also a more convenient option added by Shivaram in the master branch. You can do sbt assemble-deps to package *just* the dependencies of each project in a special assembly JAR, and then use sbt compile to update the code. This will use the classes directly out of the target/scala-2.9.3/classes directories. You have to redo assemble-deps only if your external dependencies change. Matei On Dec 11, 2013, at 1:04 AM, Prashant Sharma scrapco...@gmail.com wrote: I hope this PR https://github.com/apache/incubator-spark/pull/252 can help. Again this is not a blocker for the release from my side either. On Wed, Dec 11, 2013 at 2:14 PM, Mark Hamstra m...@clearstorydata.com wrote: Interesting, and confirmed: On my machine where `./sbt/sbt assembly` takes a long, long, long time to complete (a MBP, in my case), building three separate assemblies (`./sbt/sbt assembly/assembly`, `./sbt/sbt examples/assembly`, `./sbt/sbt tools/assembly`) takes much, much less time. On Wed, Dec 11, 2013 at 12:02 AM, Prashant Sharma scrapco...@gmail.com wrote: Forgot to mention: after running sbt/sbt assembly/assembly, running sbt/sbt examples/assembly takes just 37s. Not to mention my hardware is not really great. On Wed, Dec 11, 2013 at 1:28 PM, Prashant Sharma scrapco...@gmail.com wrote: Hi Patrick and Matei, I was trying this out and followed the quick start guide, which says to do sbt/sbt assembly; like a few others I was also stuck for a few minutes on linux. On the other hand, if I use sbt/sbt assembly/assembly it is much faster. Should we change the documentation to reflect this? It will not be great for first-time users to get stuck there. On Wed, Dec 11, 2013 at 9:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Built and tested it on Mac OS X. Matei On Dec 10, 2013, at 4:49 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. The tag to be voted on is v0.8.1-incubating (commit b87d31d): https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=commit;h=b87d31dd8eb4b4e47c0138e9242d0dd6922c8c4e The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-040/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4-docs/ For information about the contents of this release see: https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=CHANGES.txt;h=ce0aeab524505b63c7999e0371157ac2def6fe1c;hb=branch-0.8 Please vote on releasing this package as Apache Spark 0.8.1-incubating! The vote is open until Saturday, December 14th at 01:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.1-incubating [ ] -1 Do not release this package because ... 
To learn more about Apache Spark, please see http://spark.incubator.apache.org/
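For anyone hitting the slow build, the two workarounds discussed in this thread boil down to the following (a sketch assuming the 0.8.x sbt layout; assemble-deps was only in the master branch at the time):

    # Build the three assemblies separately -- much faster on the affected machines:
    ./sbt/sbt assembly/assembly
    ./sbt/sbt examples/assembly
    ./sbt/sbt tools/assembly

    # Development loop using assemble-deps: package the dependencies once,
    # then just recompile Spark's own classes as you edit.
    ./sbt/sbt assemble-deps
    ./sbt/sbt compile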
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)
Hey Tom, I re-verified the signatures and got someone else to do it. It seemed fine. Here is what I did. gpg --recv-key 9E4FE3AF wget http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/spark-0.8.1-incubating.tgz.asc wget http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/spark-0.8.1-incubating.tgz gpg --verify spark-0.8.1-incubating.tgz.asc spark-0.8.1-incubating.tgz gpg: Signature made Tue 10 Dec 2013 02:53:15 PM PST using RSA key ID 9E4FE3AF gpg: Good signature from Patrick Wendell pwend...@gmail.com On Wed, Dec 11, 2013 at 1:10 PM, Mark Hamstra m...@clearstorydata.com wrote: I don't know how to make sense of the numbers, but here's what I've got from a very small sample size. For both v0.8.0-incubating and v0.8.1-incubating, building separate assemblies is faster than `./sbt/sbt assembly` and the times for building separate assemblies for 0.8.0 and 0.8.1 are about the same. For v0.8.0-incubating, `./sbt/sbt assembly` takes about 2.5x as long as the sum of the separate assemblies. For v0.8.1-incubating, `./sbt/sbt assembly` takes almost 8x as long as the sum of the separate assemblies. Weird. On Wed, Dec 11, 2013 at 11:49 AM, Patrick Wendell pwend...@gmail.comwrote: I'll +1 myself also. For anyone who has the slow build problem: does this issue happen when building v0.8.0-incubating also? Trying to figure out whether it's related to something we added in 0.8.1 or if it's a long standing issue. - Patrick On Wed, Dec 11, 2013 at 10:39 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Woah, weird, but definitely good to know. If you’re doing Spark development, there’s also a more convenient option added by Shivaram in the master branch. You can do sbt assemble-deps to package *just* the dependencies of each project in a special assembly JAR, and then use sbt compile to update the code. This will use the classes directly out of the target/scala-2.9.3/classes directories. You have to redo assemble-deps only if your external dependencies change. Matei On Dec 11, 2013, at 1:04 AM, Prashant Sharma scrapco...@gmail.com wrote: I hope this PR https://github.com/apache/incubator-spark/pull/252 can help. Again this is not a blocker for the release from my side either. On Wed, Dec 11, 2013 at 2:14 PM, Mark Hamstra m...@clearstorydata.com wrote: Interesting, and confirmed: On my machine where `./sbt/sbt assembly` takes a long, long, long time to complete (a MBP, in my case), building three separate assemblies (`./sbt/sbt assembly/assembly`, `./sbt/sbt examples/assembly`, `./sbt/sbt tools/assembly`) takes much, much less time. On Wed, Dec 11, 2013 at 12:02 AM, Prashant Sharma scrapco...@gmail.com wrote: forgot to mention, after running sbt/sbt assembly/assembly running sbt/sbt examples/assembly takes just 37s. Not to mention my hardware is not really great. On Wed, Dec 11, 2013 at 1:28 PM, Prashant Sharma scrapco...@gmail.com wrote: Hi Patrick and Matei, Was trying out this and followed the quick start guide which says do sbt/sbt assembly, like few others I was also stuck for few minutes on linux. On the other hand if I use sbt/sbt assembly/assembly it is much faster. Should we change the documentation to reflect this. It will not be great for first time users to get stuck there. On Wed, Dec 11, 2013 at 9:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Built and tested it on Mac OS X. Matei On Dec 10, 2013, at 4:49 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. 
The tag to be voted on is v0.8.1-incubating (commit b87d31d): https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=commit;h=b87d31dd8eb4b4e47c0138e9242d0dd6922c8c4e The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-040/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4-docs/ For information about the contents of this release see: https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=CHANGES.txt;h=ce0aeab524505b63c7999e0371157ac2def6fe1c;hb=branch-0.8 Please vote on releasing this package as Apache Spark 0.8.1-incubating! The vote is open until Saturday, December 14th at 01:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.1-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc4)
I also talked to a few people who got corrupted binaries when downloading from the people.apache HTTP. In that case the checksum failed but if they re-downloaded it worked. So maybe just re-download and try again? On Wed, Dec 11, 2013 at 3:15 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Tom, I re-verified the signatures and got someone else to do it. It seemed fine. Here is what I did. gpg --recv-key 9E4FE3AF wget http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/spark-0.8.1-incubating.tgz.asc wget http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/spark-0.8.1-incubating.tgz gpg --verify spark-0.8.1-incubating.tgz.asc spark-0.8.1-incubating.tgz gpg: Signature made Tue 10 Dec 2013 02:53:15 PM PST using RSA key ID 9E4FE3AF gpg: Good signature from Patrick Wendell pwend...@gmail.com On Wed, Dec 11, 2013 at 1:10 PM, Mark Hamstra m...@clearstorydata.com wrote: I don't know how to make sense of the numbers, but here's what I've got from a very small sample size. For both v0.8.0-incubating and v0.8.1-incubating, building separate assemblies is faster than `./sbt/sbt assembly` and the times for building separate assemblies for 0.8.0 and 0.8.1 are about the same. For v0.8.0-incubating, `./sbt/sbt assembly` takes about 2.5x as long as the sum of the separate assemblies. For v0.8.1-incubating, `./sbt/sbt assembly` takes almost 8x as long as the sum of the separate assemblies. Weird. On Wed, Dec 11, 2013 at 11:49 AM, Patrick Wendell pwend...@gmail.comwrote: I'll +1 myself also. For anyone who has the slow build problem: does this issue happen when building v0.8.0-incubating also? Trying to figure out whether it's related to something we added in 0.8.1 or if it's a long standing issue. - Patrick On Wed, Dec 11, 2013 at 10:39 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Woah, weird, but definitely good to know. If you’re doing Spark development, there’s also a more convenient option added by Shivaram in the master branch. You can do sbt assemble-deps to package *just* the dependencies of each project in a special assembly JAR, and then use sbt compile to update the code. This will use the classes directly out of the target/scala-2.9.3/classes directories. You have to redo assemble-deps only if your external dependencies change. Matei On Dec 11, 2013, at 1:04 AM, Prashant Sharma scrapco...@gmail.com wrote: I hope this PR https://github.com/apache/incubator-spark/pull/252 can help. Again this is not a blocker for the release from my side either. On Wed, Dec 11, 2013 at 2:14 PM, Mark Hamstra m...@clearstorydata.com wrote: Interesting, and confirmed: On my machine where `./sbt/sbt assembly` takes a long, long, long time to complete (a MBP, in my case), building three separate assemblies (`./sbt/sbt assembly/assembly`, `./sbt/sbt examples/assembly`, `./sbt/sbt tools/assembly`) takes much, much less time. On Wed, Dec 11, 2013 at 12:02 AM, Prashant Sharma scrapco...@gmail.com wrote: forgot to mention, after running sbt/sbt assembly/assembly running sbt/sbt examples/assembly takes just 37s. Not to mention my hardware is not really great. On Wed, Dec 11, 2013 at 1:28 PM, Prashant Sharma scrapco...@gmail.com wrote: Hi Patrick and Matei, Was trying out this and followed the quick start guide which says do sbt/sbt assembly, like few others I was also stuck for few minutes on linux. On the other hand if I use sbt/sbt assembly/assembly it is much faster. Should we change the documentation to reflect this. It will not be great for first time users to get stuck there. 
On Wed, Dec 11, 2013 at 9:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Built and tested it on Mac OS X. Matei On Dec 10, 2013, at 4:49 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. The tag to be voted on is v0.8.1-incubating (commit b87d31d): https://git-wip-us.apache.org/repos/asf/incubator-spark/repo?p=incubator-spark.git;a=commit;h=b87d31dd8eb4b4e47c0138e9242d0dd6922c8c4e The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-040/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4-docs/ For information about the contents of this release see: https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=blob;f=CHANGES.txt;h=ce0aeab524505b63c7999e0371157ac2def6fe1c;hb=branch-0.8 Please vote on releasing this package as Apache Spark 0.8.1-incubating!
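A quick way to rule out the corrupted-download case Patrick mentions before debugging anything else (a sketch; the .md5 file name is an assumption about what was published next to the artifacts, and Apache digest files are not always in md5sum -c format, so compare the two values by eye):

    # Fetch the artifact and its published digest:
    wget http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/spark-0.8.1-incubating.tgz
    wget http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc4/spark-0.8.1-incubating.tgz.md5

    # Compute a local digest and compare it with the published one:
    md5sum spark-0.8.1-incubating.tgz
    cat spark-0.8.1-incubating.tgz.md5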
[VOTE] Release Apache Spark 0.8.1-incubating (rc2)
Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. The tag to be voted on is v0.8.1-incubating (commit bf23794a): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=tag;h=e6ba91b5a7527316202797fc3dce469ff86cf203 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-024/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2-docs/ For information about the contents of this release see: attached draft of release notes attached draft of release credits https://github.com/apache/incubator-spark/blob/branch-0.8/CHANGES.txt Please vote on releasing this package as Apache Spark 0.8.1-incubating! The vote is open until Wednesday, December 11th at 21:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.1-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/ Michael Armbrust -- build fix Pierre Borckmans -- typo fix in documentation Evan Chan -- added `local://` scheme for dependency jars Ewen Cheslack-Postava -- `add` method for python accumulators, support for setting config properties in python Mosharaf Chowdhury -- optimized broadcast implementation Frank Dai -- documentation fix Aaron Davidson -- lead on shuffle file consolidation, lead on h/a mode for standalone scheduler, cleaned up representation of block ids, several small improvements and bug fixes Tathagata Das -- new streaming operators: `transformWith`, `leftOuterJoin`, and `rightOuterJoin`, fix for kafka concurrency bug Ankur Dave -- support for pausing spot clusters on EC2 Harvey Feng -- optimization to JobConf broadcasts, minor fixes, lead on YARN 2.2 build Ali Ghodsi -- scheduler support for SIMR, lead on YARN 2.2 build Thomas Graves -- lead on Spark YARN integration including secure HDFS access over YARN Li Guoqiang -- fix for maven build Stephen Haberman -- bug fix Haidar Hadi -- documentation fix Nathan Howell -- bug fix relating to YARN Holden Karau -- java version of `mapPartitionsWithIndex` Du Li -- bug fix in make-distribution.sh Xi Lui -- bug fix and code clean-up David McCauley -- bug fix in standalone mode JSON output Michael (wannabeast) -- bug fix in memory store Fabrizio Milo -- typos in documentation, minor clean-up in DAGScheduler, typo in scaladoc Mridul Muralidharan -- fixes to meta-data cleaner and speculative scheduler Sundeep Narravula -- build fix, bug fixes in scheduler and tests, minor code clean-up Kay Ousterhout -- optimization to task result fetching, extensive code clean-up and refactoring (task schedulers, thread pools), result-fetching state in UI, showing task and attempt id in UI, several bug fixes in scheduler, UI, and unit tests Nick Pentreath -- implicit feedback variant of ALS algorithm Imran Rashid -- small improvement to executor launch Ahir Reddy -- spark support for SIMR Josh Rosen -- reduced memory overhead for BlockInfo objects, clean up of BlockManager code, fix to java API auditor, code clean-up in java API, and bug fixes in python API Henry Saputra -- build fix Jerry Shao -- refactoring of fair scheduler, support
for running spark as a specific user, bug fix Mingfei Shi -- documentation for JobLogger Andre Schumacher -- sortByKey in pyspark and associated changes Karthik Tunga -- bug fix in launch script Patrick Wendell -- added `repartition` operator, logging improvements, instrumentation for shuffle write, documentation improvements, fix for streaming example, and release management Neal Wiggins -- minor import clean-up, documentation typo Andrew Xia -- bug fix in UI Reynold Xin -- optimized hash set and hash tables for primitive types, task killing, support for setting job properties in repl, logging improvements, Kryo improvements, several bug fixes, and general clean-up Matei Zaharia -- optimized hashmap for shuffle data, pyspark documentation, optimizations to kryo and chill serializers Wu Zeming -- bug fix in executors UI DRAFT OF RELEASE NOTES FOR SPARK 0.8.1 Apache Spark 0.8.1 is a maintenance release including several bug fixes and performance optimizations. It also includes a few new features. Contributions to 0.8.1 came from 40 developers. == High availability mode for standalone scheduler == The standalone scheduler now has a High Availability (H/A) mode which can tolerate master failures. This is particularly useful for long-running applications such as streaming jobs and the shark server, where the scheduler master previously represented a single point of failure.
Re: [DISCUSS] About the [VOTE] Release Apache Spark 0.8.1-incubating (rc1)
Hey Mark, One constructive action you and other people can take to help us assess the quality and completeness of this release is to download the release, run the tests, run the release in your dev environment, read through the documentation, etc. This is one of the main points of releasing an RC to the community... even if you disagree with some patches that were merged in, this is still a way you can help validate the release. - Patrick On Sun, Dec 8, 2013 at 1:30 PM, Mark Hamstra m...@clearstorydata.com wrote: I'm aware of the changes file, but it really doesn't address the issue that I am raising. The changes file just tells me what has gone into the release candidate. In general, it doesn't tell me why those changes went in or provide any rationale by which to judge whether that is the complete set of changes that should go in. I talked some with Matei about related versioning and release issues last week, and I've raised them in other contexts previously, but I'm taking the liberty to annoy people again because I really am not happy with our current versioning and release process, and I really am of the opinion that we've got to start doing much better before I can vote in favor of a 1.0 release. I fully realize that this is not a 1.0 release, and that because we are pre-1.0 we still have a lot of flexibility with releases that break backward or forward compatibility and with version numbers that have nothing like the semantic meaning that they will eventually need to have; but it is not going to be easy to change our process and culture so that we produce the kind of stability and reliability that Spark users need to be able to depend upon and version numbers that clearly communicate what those users expect them to mean. I think that we should start making those changes now. Just because we have flexibility pre-1.0, that doesn't mean that we shouldn't start training ourselves now to work within the constraints of post-1.0 Spark. If I'm to be happy voting for an eventual 1.0 release candidate, I'll need to have seen at least one full development cycle that already adheres to the post-1.0 constraints, demonstrating the maturity of our development process. That demonstration cycle is clearly not this one -- and I understand that there were some compelling reasons (particularly with regard to getting a full release of Spark based on Scala 2.9.3 before we make the jump to 2.10). This patch-level release breaks binary compatibility and contains a lot of code that isn't anywhere close to meeting the criterion for inclusion in a real, post-1.0 patch-level release: essentially changes that every, or nearly every, existing Spark user needs (not just wants), and that work with all existing and future binaries built with the prior patch-level version of Spark as a dependency. Like I said, we are clearly nowhere close to that with the move from 0.8.0 to 0.8.1; but I also haven't been able to recognize any alternative criterion by which to judge the quality and completeness of this release candidate. Maybe there just isn't one, and I'm just going to have to swallow my concerns while watching 0.8.1 go out the door; but if we don't start doing better on this kind of thing in the future, you are going to start hearing more complaining from me. I just hope that it doesn't get to the point where I feel compelled to actively oppose an eventual 1.0 release candidate.
On Sun, Dec 8, 2013 at 12:37 PM, Henry Saputra henry.sapu...@gmail.com wrote: Ah, sorry for the confusion Patrick, like you said I was just trying to let people be aware of this file and the purpose of it. On Sunday, December 8, 2013, Patrick Wendell wrote: Hey Henry, Are you suggesting we need to change something about our changes file? Or are you just pointing people to the file? - Patrick On Sun, Dec 8, 2013 at 11:37 AM, Henry Saputra henry.sapu...@gmail.com wrote: Hi Spark devs, I have modified the Subject to avoid polluting the VOTE thread, since it relates to how and which commits get merged back to the 0.8.* branch. Please respond to the previous question in this thread. Technically the CHANGES.txt [1] file should describe the changes in a particular release and it is the main requirement needed to cut an ASF release. - Henry [1] https://github.com/apache/incubator-spark/blob/branch-0.8/CHANGES.txt On Sun, Dec 8, 2013 at 12:03 AM, Josh Rosen rosenvi...@gmail.com wrote: We can use git log to figure out which changes haven't made it into branch-0.8. Here's a quick attempt, which lists only pull requests that were merged into just one of the branches. For completeness, this could be extended to find commits that weren't part of a merge and are only present in one branch. *Script:* MASTER_BRANCH=origin/master RELEASE_BRANCH=origin/branch-0.8 git log --oneline --grep Merge pull
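Josh's script is cut off above; the idea can be sketched with standard git log options (the exact flags from his original are not preserved, so treat this as a reconstruction):

    MASTER_BRANCH=origin/master
    RELEASE_BRANCH=origin/branch-0.8

    # Pull-request merges reachable from master but not from branch-0.8:
    git log --oneline --grep="Merge pull request" $RELEASE_BRANCH..$MASTER_BRANCH

    # And the reverse: merges only in the release branch:
    git log --oneline --grep="Merge pull request" $MASTER_BRANCH..$RELEASE_BRANCH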
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc2)
Hey Take, Could you start a separate thread to debug your build issue? In that thread, could you paste the exact build command and entire output? The log you posted here suggests the first build detected hadoop 1.0.4 not 2.2.0 based on the assembly file name it is logging. --- sent from my phone On Dec 8, 2013 4:13 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: With Hadoop 2.2.0 ( Java 1.7.0_45) installed, I'm having trouble completing the build process (sbt/sbt assembly) on Macbook. The sbt command hangs at the last step. ... ... [info] SHA-1: ce8275f5841002164c4305c912a2892ec7c1d395 [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/tools/target/scala-2.9.3/spark-tools-assembly-0.8.1-incubating.jar ... [info] SHA-1: 0657a347240266230247693f265a5797d40c326a [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop1.0.4.jar ... (hangs here) -- On another Macbook with Hadoop 1.1.1 ( Java 1.7.0_45) installed, I was able to build it successfully. .. .. [info] SHA-1: 77109cd085bd4f0d2b601b3451b35b961d357534 [info] Packaging /Users/tshinagawa/Documents/Spark/RCs/spark-0.8.1-incubating/examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar ... [info] Done packaging. [success] Total time: 266 s, completed Dec 8, 2013 3:03:10 PM -- On Sun, Dec 8, 2013 at 12:41 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. The tag to be voted on is v0.8.1-incubating (commit bf23794a): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=tag;h=e6ba91b5a7527316202797fc3dce469ff86cf203 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-024/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2-docs/ For information about the contents of this release see: attached draft of release notes attached draft of release credits https://github.com/apache/incubator-spark/blob/branch-0.8/CHANGES.txt Please vote on releasing this package as Apache Spark 0.8.1-incubating! The vote is open until Wednesday, December 11th at 21:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.1-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
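One thing worth checking, given that the log above shows a hadoop1.0.4 assembly being packaged: in the 0.8.x sbt build the target Hadoop version is selected with an environment variable, so the 2.2.0 build would look roughly like this (a sketch, assuming the stock 0.8.x build scripts):

    # Clean out assemblies from previous builds, then rebuild against Hadoop 2.2.0.
    # SPARK_YARN=true additionally enables the YARN support.
    sbt/sbt clean
    SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly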
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc2)
For my own part I'll give a +1 to this RC. On Sun, Dec 8, 2013 at 4:30 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: OK. I will post the entire output via separate email. I just upgraded Hadoop to 2.2.0 recently. So there might be something I need to remove/clean up. On Sun, Dec 8, 2013 at 4:24 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Take, Could you start a separate thread to debug your build issue? In that thread, could you paste the exact build command and entire output? The log you posted here suggests the first build detected hadoop 1.0.4 not 2.2.0 based on the assembly file name it is logging. --- sent from my phone On Dec 8, 2013 4:13 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: With Hadoop 2.2.0 ( Java 1.7.0_45) installed, I'm having trouble completing the build process (sbt/sbt assembly) on Macbook. The sbt command hangs at the last step. ... ... [info] SHA-1: ce8275f5841002164c4305c912a2892ec7c1d395 [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/tools/target/scala-2.9.3/spark-tools-assembly-0.8.1-incubating.jar ... [info] SHA-1: 0657a347240266230247693f265a5797d40c326a [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop1.0.4.jar ... (hangs here) -- On another Macbook with Hadoop 1.1.1 ( Java 1.7.0_45) installed, I was able to build it successfully. .. .. [info] SHA-1: 77109cd085bd4f0d2b601b3451b35b961d357534 [info] Packaging /Users/tshinagawa/Documents/Spark/RCs/spark-0.8.1-incubating/examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar ... [info] Done packaging. [success] Total time: 266 s, completed Dec 8, 2013 3:03:10 PM -- On Sun, Dec 8, 2013 at 12:41 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. The tag to be voted on is v0.8.1-incubating (commit bf23794a): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=tag;h=e6ba91b5a7527316202797fc3dce469ff86cf203 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-024/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2-docs/ For information about the contents of this release see: attached draft of release notes attached draft of release credits https://github.com/apache/incubator-spark/blob/branch-0.8/CHANGES.txt Please vote on releasing this package as Apache Spark 0.8.1-incubating! The vote is open until Wednesday, December 11th at 21:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.1-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc2)
Hey Mark - ya this would be good to get in. Does merging that particular PR put this in sufficient shape for the 0.8.1 release or are there other open patches we need to look at? - Patrick On Sun, Dec 8, 2013 at 6:05 PM, Mark Hamstra m...@clearstorydata.com wrote: SPARK-962 should be resolved before release. See also: https://github.com/apache/incubator-spark/pull/195 With the references to the way I changed Debian packaging for ClearStory, we should be at least 90% of the way toward doing it right for Apache. On Sun, Dec 8, 2013 at 5:29 PM, Patrick Wendell pwend...@gmail.com wrote: For my own part I'll give a +1 to this RC. On Sun, Dec 8, 2013 at 4:30 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: OK. I will post the entire output via separate email. I just upgraded Hadoop to 2.2.0 recently. So there might be something I need to remove/clean up. On Sun, Dec 8, 2013 at 4:24 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Take, Could you start a separate thread to debug your build issue? In that thread, could you paste the exact build command and entire output? The log you posted here suggests the first build detected hadoop 1.0.4 not 2.2.0 based on the assembly file name it is logging. --- sent from my phone On Dec 8, 2013 4:13 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: With Hadoop 2.2.0 ( Java 1.7.0_45) installed, I'm having trouble completing the build process (sbt/sbt assembly) on Macbook. The sbt command hangs at the last step. ... ... [info] SHA-1: ce8275f5841002164c4305c912a2892ec7c1d395 [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/tools/target/scala-2.9.3/spark-tools-assembly-0.8.1-incubating.jar ... [info] SHA-1: 0657a347240266230247693f265a5797d40c326a [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop1.0.4.jar ... (hangs here) -- On another Macbook with Hadoop 1.1.1 ( Java 1.7.0_45) installed, I was able to build it successfully. .. .. [info] SHA-1: 77109cd085bd4f0d2b601b3451b35b961d357534 [info] Packaging /Users/tshinagawa/Documents/Spark/RCs/spark-0.8.1-incubating/examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar ... [info] Done packaging. [success] Total time: 266 s, completed Dec 8, 2013 3:03:10 PM -- On Sun, Dec 8, 2013 at 12:41 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. The tag to be voted on is v0.8.1-incubating (commit bf23794a): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=tag;h=e6ba91b5a7527316202797fc3dce469ff86cf203 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-024/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2-docs/ For information about the contents of this release see: attached draft of release notes attached draft of release credits https://github.com/apache/incubator-spark/blob/branch-0.8/CHANGES.txt Please vote on releasing this package as Apache Spark 0.8.1-incubating! The vote is open until Wednesday, December 11th at 21:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. 
[ ] +1 Release this package as Apache Spark 0.8.1-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc2)
Looked into this a bit more - I think removing repl-bin is something we should wait until 0.9 to do, because we've published it to maven in 0.8.0 and people might expect it to be there in 0.8.1. Merging the directly referenced pull request (195) seems like a good idea though since it fixes a bug in the script. Is that what you are suggesting? - Patrick On Sun, Dec 8, 2013 at 7:30 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Mark - ya this would be good to get in. Does merging that particular PR put this in sufficient shape for the 0.8.1 release or are there other open patches we need to look at? - Patrick On Sun, Dec 8, 2013 at 6:05 PM, Mark Hamstra m...@clearstorydata.com wrote: SPARK-962 should be resolved before release. See also: https://github.com/apache/incubator-spark/pull/195 With the references to the way I changed Debian packaging for ClearStory, we should be at least 90% of the way toward doing it right for Apache. On Sun, Dec 8, 2013 at 5:29 PM, Patrick Wendell pwend...@gmail.com wrote: For my own part I'll give a +1 to this RC. On Sun, Dec 8, 2013 at 4:30 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: OK. I will post the entire output via separate email. I just upgraded Hadoop to 2.2.0 recently. So there might be something I need to remove/clean up. On Sun, Dec 8, 2013 at 4:24 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Take, Could you start a separate thread to debug your build issue? In that thread, could you paste the exact build command and entire output? The log you posted here suggests the first build detected hadoop 1.0.4 not 2.2.0 based on the assembly file name it is logging. --- sent from my phone On Dec 8, 2013 4:13 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: With Hadoop 2.2.0 ( Java 1.7.0_45) installed, I'm having trouble completing the build process (sbt/sbt assembly) on Macbook. The sbt command hangs at the last step. ... ... [info] SHA-1: ce8275f5841002164c4305c912a2892ec7c1d395 [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/tools/target/scala-2.9.3/spark-tools-assembly-0.8.1-incubating.jar ... [info] SHA-1: 0657a347240266230247693f265a5797d40c326a [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop1.0.4.jar ... (hangs here) -- On another Macbook with Hadoop 1.1.1 ( Java 1.7.0_45) installed, I was able to build it successfully. .. .. [info] SHA-1: 77109cd085bd4f0d2b601b3451b35b961d357534 [info] Packaging /Users/tshinagawa/Documents/Spark/RCs/spark-0.8.1-incubating/examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar ... [info] Done packaging. [success] Total time: 266 s, completed Dec 8, 2013 3:03:10 PM -- On Sun, Dec 8, 2013 at 12:41 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. 
The tag to be voted on is v0.8.1-incubating (commit bf23794a): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=tag;h=e6ba91b5a7527316202797fc3dce469ff86cf203 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-024/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2-docs/ For information about the contents of this release see: attached draft of release notes attached draft of release credits https://github.com/apache/incubator-spark/blob/branch-0.8/CHANGES.txt Please vote on releasing this package as Apache Spark 0.8.1-incubating! The vote is open until Wednesday, December 11th at 21:00 UTC and passes if a majority of at least 3 +1 PPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.1-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc2)
Hey Mark, What I'm asking is whether this patch is sufficient to have a working debian build in 0.8.1, or are there other outstanding issues to make it work? By working I mean, within the initial design that was contributed (with repl-bin) it works according to that approach. We can redesign this packaging in 0.9. That will require having a PR against Apache Spark, discussing, etc. But it doesn't need to be on the critical path for this release. - Patrick On Sun, Dec 8, 2013 at 7:54 PM, Mark Hamstra m...@clearstorydata.com wrote: Whatever Debian package gets built has to work, so that's the first requirement. I don't know how to decide whether a change is acceptable in 0.8 or has to wait until 0.9, but the 0.9 packaging should definitely leverage the assembly sub-project, making repl-bin unnecessary. On Sun, Dec 8, 2013 at 7:46 PM, Patrick Wendell pwend...@gmail.com wrote: Looked into this a bit more - I think removing repl-bin is something we should wait until 0.9 to do, because we've published it to maven in 0.8.0 and people might expect it to be there in 0.8.1. Merging the directly referenced pull request (195) seems like a good idea though since it fixes a bug in the script. Is that what you are suggesting? - Patrick On Sun, Dec 8, 2013 at 7:30 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Mark - ya this would be good to get in. Does merging that particular PR put this in sufficient shape for the 0.8.1 release or are there other open patches we need to look at? - Patrick On Sun, Dec 8, 2013 at 6:05 PM, Mark Hamstra m...@clearstorydata.com wrote: SPARK-962 should be resolved before release. See also: https://github.com/apache/incubator-spark/pull/195 With the references to the way I changed Debian packaging for ClearStory, we should be at least 90% of the way toward doing it right for Apache. On Sun, Dec 8, 2013 at 5:29 PM, Patrick Wendell pwend...@gmail.com wrote: For my own part I'll give a +1 to this RC. On Sun, Dec 8, 2013 at 4:30 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: OK. I will post the entire output via separate email. I just upgraded Hadoop to 2.2.0 recently. So there might be something I need to remove/clean up. On Sun, Dec 8, 2013 at 4:24 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Take, Could you start a separate thread to debug your build issue? In that thread, could you paste the exact build command and entire output? The log you posted here suggests the first build detected hadoop 1.0.4 not 2.2.0 based on the assembly file name it is logging. --- sent from my phone On Dec 8, 2013 4:13 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: With Hadoop 2.2.0 ( Java 1.7.0_45) installed, I'm having trouble completing the build process (sbt/sbt assembly) on Macbook. The sbt command hangs at the last step. ... ... [info] SHA-1: ce8275f5841002164c4305c912a2892ec7c1d395 [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/tools/target/scala-2.9.3/spark-tools-assembly-0.8.1-incubating.jar ... [info] SHA-1: 0657a347240266230247693f265a5797d40c326a [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop1.0.4.jar ... (hangs here) -- On another Macbook with Hadoop 1.1.1 ( Java 1.7.0_45) installed, I was able to build it successfully. .. .. [info] SHA-1: 77109cd085bd4f0d2b601b3451b35b961d357534 [info] Packaging /Users/tshinagawa/Documents/Spark/RCs/spark-0.8.1-incubating/examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar ... 
[info] Done packaging. [success] Total time: 266 s, completed Dec 8, 2013 3:03:10 PM -- On Sun, Dec 8, 2013 at 12:41 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.1. The tag to be voted on is v0.8.1-incubating (commit bf23794a): https://git-wip-us.apache.org/repos/asf?p=incubator-spark.git;a=tag;h=e6ba91b5a7527316202797fc3dce469ff86cf203 The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-024/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.1-incubating-rc2-docs/ For information about the contents
Re: [VOTE] Release Apache Spark 0.8.1-incubating (rc2)
Hey Mark, Okay if 195 gets this in working order in the branch 0.8 let's just merge that to keep it consistent with our docs and the way this is done in 0.8.0 We can do a broader refactoring in 0.9. Would be great if you could kick off a JIRA discussion or submit a PR relating to that. - Patrick On Sun, Dec 8, 2013 at 8:07 PM, Mark Hamstra m...@clearstorydata.com wrote: Well, 195 is sufficient to give you something that runs, but it doesn't run the same way as Spark built/distributed by other means -- e.g., after 195 the package still uses something equivalent to the old `run` script instead of the current `spark-class` way. On Sun, Dec 8, 2013 at 8:02 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Mark, What I'm asking is whether this patch is sufficient to have a working debian build in 0.8.1, or are there other outstanding issues to make it work? By working I mean, within the initial design that was contributed (with repl-bin) it works according to that approach. We can redesign this packaging in 0.9. That will require having a PR against Apache Spark, discussing, etc. But it doesn't need to be on the critical path for this release. - Patrick On Sun, Dec 8, 2013 at 7:54 PM, Mark Hamstra m...@clearstorydata.com wrote: Whatever Debian package gets built has to work, so that's the first requirement. I don't know how to decide whether a change is acceptable in 0.8 or has to wait until 0.9, but the 0.9 packaging should definitely leverage the assembly sub-project, making repl-bin unnecessary. On Sun, Dec 8, 2013 at 7:46 PM, Patrick Wendell pwend...@gmail.com wrote: Looked into this a bit more - I think removing repl-bin is something we should wait until 0.9 to do, because we've published it to maven in 0.8.0 and people might expect it to be there in 0.8.1. Merging the directly referenced pull request (195) seems like a good idea though since it fixes a bug in the script. Is that what you are suggesting? - Patrick On Sun, Dec 8, 2013 at 7:30 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Mark - ya this would be good to get in. Does merging that particular PR put this in sufficient shape for the 0.8.1 release or are there other open patches we need to look at? - Patrick On Sun, Dec 8, 2013 at 6:05 PM, Mark Hamstra m...@clearstorydata.com wrote: SPARK-962 should be resolved before release. See also: https://github.com/apache/incubator-spark/pull/195 With the references to the way I changed Debian packaging for ClearStory, we should be at least 90% of the way toward doing it right for Apache. On Sun, Dec 8, 2013 at 5:29 PM, Patrick Wendell pwend...@gmail.com wrote: For my own part I'll give a +1 to this RC. On Sun, Dec 8, 2013 at 4:30 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: OK. I will post the entire output via separate email. I just upgraded Hadoop to 2.2.0 recently. So there might be something I need to remove/clean up. On Sun, Dec 8, 2013 at 4:24 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Take, Could you start a separate thread to debug your build issue? In that thread, could you paste the exact build command and entire output? The log you posted here suggests the first build detected hadoop 1.0.4 not 2.2.0 based on the assembly file name it is logging. --- sent from my phone On Dec 8, 2013 4:13 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: With Hadoop 2.2.0 ( Java 1.7.0_45) installed, I'm having trouble completing the build process (sbt/sbt assembly) on Macbook. The sbt command hangs at the last step. ... ... 
[info] SHA-1: ce8275f5841002164c4305c912a2892ec7c1d395 [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/tools/target/scala-2.9.3/spark-tools-assembly-0.8.1-incubating.jar ... [info] SHA-1: 0657a347240266230247693f265a5797d40c326a [info] Packaging /Users/taka/Documents/Spark/Releases/spark-0.8.1-incubating-rc2/assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop1.0.4.jar ... (hangs here) -- On another Macbook with Hadoop 1.1.1 ( Java 1.7.0_45) installed, I was able to build it successfully. .. .. [info] SHA-1: 77109cd085bd4f0d2b601b3451b35b961d357534 [info] Packaging /Users/tshinagawa/Documents/Spark/RCs/spark-0.8.1-incubating/examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar ... [info] Done packaging. [success] Total time: 266 s, completed Dec 8, 2013 3:03:10 PM -- On Sun, Dec 8, 2013 at 12:41 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate
Re: difference between 'fetchWaitTime' and 'remoteFetchTime'
Hey Umar, I dug into this a bit today out of curiosity since I also wasn't sure. I updated the in-line documentation here: https://github.com/apache/incubator-spark/pull/209/files The more important metric is `fetchWaitTime` which indicates how much of the task runtime was spent waiting for input data. remoteFetchTime is an aggregation of all of the fetch delays for each block... this second metric is a bit more convoluted because those fetches can actually overlap, so if this is high it doesn't necessarily indicate any latency hit. - Patrick On Mon, Nov 25, 2013 at 1:23 PM, Umar Javed umarj.ja...@gmail.com wrote: Any clarification on this? thanks. On Wed, Nov 20, 2013 at 3:02 PM, Umar Javed umarj.ja...@gmail.com wrote: In the class ShuffleReadMetrics in executor/TaskMetrics.scala, there are two variables: 1) fetchWaitTime: /** * Total time that is spent blocked waiting for shuffle to fetch data */ 2) remoteFetchTime /** * The total amount of time for all the shuffle fetches. This adds up time from overlapping * shuffles, so can be longer than task time */ As I understand it, the difference between these two is that fetchWaitTime is remoteFetchTime without the overlapped time counted exactly once. Is that right? Can somebody explain the difference better? thanks!
Re: Documenting the release process for Apache Spark
Hey Henry, I did create release notes for this. However, I wanted to dogfood them for the 0.8.1 release before I push them publicly, just so I know the thing is actually comprehensive. It's quite complicated and I don't want to publish something that leads people down the wrong path. My thought was I would use these personally for the 0.8.1 release to verify them, then publish them and try to have someone else do the 0.9.0 release (perhaps wishful thinking!). - Patrick On Thu, Nov 7, 2013 at 12:09 PM, Henry Saputra henry.sapu...@gmail.com wrote: Hi Patrick, Did you end up writing up the steps you were taking to generate the Apache Spark release to provide help to the next Apache Spark RE? I remember you were trying to create one after we released 0.8 Thanks, - Henry
Re: Getting failures in FileServerSuite
This may have been caused by a recent merge since a bunch of people independently hit it in the last 48 hours. One debugging step would be to narrow it down to which merge caused it. I don't have time personally today, but just a suggestion for ppl for whom this is blocking progress. - Patrick On Wed, Oct 30, 2013 at 1:44 PM, Mark Hamstra m...@clearstorydata.com wrote: What JDK version are you using, Evan? I tried to reproduce your problem earlier today, but I wasn't even able to get through the assembly build -- kept hanging when trying to build the examples assembly. Foregoing the assembly and running the tests would hang on FileServerSuite "Dynamically adding JARS locally" -- no stack trace, just hung. And I was actually seeing a very similar stack trace to yours from a test suite of our own running against 0.8.1-SNAPSHOT -- not exactly the same because line numbers were different once it went into the java runtime, and it eventually ended up someplace a little different. That got me curious about differences in Java versions, so I updated to the latest Oracle release (1.7.0_45). Now it cruises right through the build and test of Spark master from before Matei merged your PR. Then I logged into a machine that has 1.7.0_15 (7u15-2.3.7-0ubuntu1~11.10.1, actually) installed, and I'm right back to the hanging during the examples assembly (but passes FileServerSuite, oddly enough.) Upgrading the JDK didn't improve the results of the ClearStory test suite I was looking at, so my misery isn't over; but yours might be with a newer JDK. On Wed, Oct 30, 2013 at 12:44 PM, Evan Chan e...@ooyala.com wrote: Must be a local environment thing, because AmpLab Jenkins can't reproduce it. :-p On Wed, Oct 30, 2013 at 11:10 AM, Josh Rosen rosenvi...@gmail.com wrote: Someone on the users list also encountered this exception: https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C64474308D680D540A4D8151B0F7C03F7025EF289%40SHSMSX104.ccr.corp.intel.com%3E On Wed, Oct 30, 2013 at 9:40 AM, Evan Chan e...@ooyala.com wrote: I'm at the latest commit f0e23a023ce1356bc0f04248605c48d4d08c2d05 Merge: aec9bf9 a197137 Author: Reynold Xin r...@apache.org Date: Tue Oct 29 01:41:44 2013 -0400 and seeing this when I do a test-only FileServerSuite: 13/10/30 09:35:04.300 INFO DAGScheduler: Completed ResultTask(0, 0) 13/10/30 09:35:04.307 INFO LocalTaskSetManager: Loss was due to java.io.StreamCorruptedException java.io.StreamCorruptedException: invalid type code: AC at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:101) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:26) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27) at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:53) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:95) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:94) at org.apache.spark.rdd.MapPartitionsWithContextRDD.compute(MapPartitionsWithContextRDD.scala:40) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237) at
org.apache.spark.rdd.RDD.iterator(RDD.scala:226) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:680) Anybody else seen this yet? I have a really simple PR and this fails without my change, so I may go ahead and submit it anyways. -- Evan Chan Staff Engineer e...@ooyala.com
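Patrick's suggestion of narrowing this down to the offending merge can be done mechanically with git bisect (a sketch; the known-good commit is a placeholder you would fill in):

    git bisect start
    git bisect bad HEAD
    git bisect good <last-known-good-commit>

    # At each step git checks out a candidate commit; run the failing suite,
    # report the result, and repeat until the culprit merge is identified:
    sbt/sbt "test-only org.apache.spark.FileServerSuite" && git bisect good || git bisect bad

    # When finished, return to where you started:
    git bisect reset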
Re: Are we moving too fast or too far on 0.8.1-SNAPSHOT?
Shark is not a great example in general because it uses semi-private internal interfaces that are not guaranteed to be compatible within minor releases. Spark's public, documented API has always (AFAIK) maintained compatibility within minor versions. In fact, we've been diligent to maintain compatibility with major versions as well and there have only been very minute changes in that API. Over time it would be good for Shark to migrate to using higher API's (and we may need to build these). But my point is that the public API has maintained compatibility consistent with the norms discussed here. - Patrick On Mon, Oct 28, 2013 at 3:50 PM, Jey Kottalam j...@cs.berkeley.edu wrote: I agree that we should strive to maintain full backward compatibility between patch releases (i.e. incrementing the z in version x.y.z). On Mon, Oct 28, 2013 at 3:22 PM, Mark Hamstra m...@clearstorydata.com wrote: Or more to the point: What is our commitment to backward compatibility in point releases? Many Java developers will come to a library or platform versioned as x.y.z with the expectation that if their own code worked well using x.y.(z-1) as a dependency, then moving up to x.y.z will be painless and trivial. That is not looking like it will be the case for Spark 0.8.0 and 0.8.1. We only need to look at Shark as an example of code built with a dependency on Spark to see the problem. Shark 0.8.0 works with Spark 0.8.0. Shark 0.8.0 does not build with Spark 0.8.1-SNAPSHOT. Presumably that lack of backwards compatibility will continue into the eventual release of Spark 0.8.1, and that makes life hard on developers using Spark and Shark. For example, a developer using the released version of Shark but wanting to pick up the bug fixes in Spark doesn't have a good option anymore since 0.8.1-SNAPSHOT (or the eventual 0.8.1 release) doesn't work, and moving to the wild and woolly development on the master branches of Spark and Shark is not a good idea for someone trying to develop production code. In other words, all of the bug fixes in Spark 0.8.1 are not accessible to this developer until such time as there are available 0.8.1-compatible versions of Shark and anything else built on Spark that this developer is using. The only other option is trying to cherry-pick commits from, e.g., Shark 0.9.0-SNAPSHOT into Shark 0.8.0 until Shark 0.8.0 has been brought up to a point where it works with Spark 0.8.1. But an application developer shouldn't need to do that just to get the bug fixes in Spark 0.8.1, and it is not immediately obvious just which Shark commits are necessary and sufficient to produce a correct, Spark-0.8.1-compatible version of Shark (indeed, there is no guarantee that such a thing is even possible.) Right now, I believe that 67626ae3eb6a23efc504edf5aedc417197f072cf, 488930f5187264d094810f06f33b5b5a2fde230a and bae19222b3b221946ff870e0cee4dba0371dea04 are necessary to get Shark to work with Spark 0.8.1-SNAPSHOT, but that those commits are not sufficient (Shark builds against Spark 0.8.1-SNAPSHOT with those cherry-picks, but I'm still seeing runtime errors.) In short, this is not a good situation, and we probably need a real 0.8 maintenance branch that maintains backward compatibility with 0.8.0, because (at least to me) the current branch-0.8 of Spark looks more like another active development branch (in addition to the master and scala-2.10 branches) than it does a maintenance branch.
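For reference, the stop-gap Mark describes would look roughly like this, using the commit hashes from his email (as he notes, Shark builds with these cherry-picks but still shows runtime errors, so they are necessary rather than sufficient):

    # In a Shark 0.8.0 checkout, cherry-pick the compatibility commits
    # Mark identifies from the Shark development branch:
    git cherry-pick 67626ae3eb6a23efc504edf5aedc417197f072cf
    git cherry-pick 488930f5187264d094810f06f33b5b5a2fde230a
    git cherry-pick bae19222b3b221946ff870e0cee4dba0371dea04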
Re: Suggestion/Recommendation for language bindings
I think Ruby integration via JRuby would be a great idea. On Tue, Oct 15, 2013 at 9:45 AM, Ryan Weald r...@weald.com wrote: Writing a JRuby wrapper around the existing Java bindings would be pretty cool. Could help to get some of the Ruby community to start using the Spark platform. -Ryan On Mon, Oct 14, 2013 at 12:07 PM, Aaron Babcock aaron.babc...@gmail.comwrote: Hey Laksh, Not sure if you are interested in groovy at all, but I've got the beginning of a project here: https://github.com/bunions1/groovy-spark-example The idea is to map groovy idioms: myRdd.collect{ row - newRow } to spark api calls myRdd.map( row = newRow) and support a good repl. Its not officially related to spark at all and is very early stage but maybe it will be a point of reference for you. On Mon, Oct 14, 2013 at 12:42 PM, Laksh Gupta glaks...@gmail.com wrote: Hi I am interested in contributing to the project and want to start with supporting a new programming language on Spark. I can see that Spark already support Java and Python. Would someone provide me some suggestion/references to start with? I think this would be a great learning experince for me. Thank you in advance. -- - Laksh Gupta
Re: Spark 0.8.0: bits need to come from ASF infrastructure
Yep, we definitely need to just directly point people to the location at apache.org where they can find the hashes. I just updated the release notes and downloads page to point to that site. I just wanted to point out that mirroring these through a CDN seems philosophically the same as mirroring through Apache, since in neither case do we expect the users to trust the artifact they download. We just need to be more explicit that we are, indeed, mirroring and explain that the trusted root is at apache.org - Patrick On Wed, Sep 25, 2013 at 3:56 PM, Roman Shaposhnik r...@apache.org wrote: On Wed, Sep 25, 2013 at 3:48 PM, Patrick Wendell pwend...@gmail.com wrote: Hey we've actually distributed our artifacts through amazon cloudfront in the past (and that is where the website links redirect to). Since the apache mirrors don't distribute signatures anyways, True, but apache dist does. IOW, it is not uncommon for those having an automated build/fetching system to get bits from one of the mirrors and then get the hashes directly from dist. In your current case, I don't think I know of a way to do that. Now, you may say that the current CDN you guys are using is functioning like a mirror -- well, I'd say that it needs to be called out like one then. Otherwise, as a naive user I *really* have to guess where to get the hashes. what is the difference between linking to an apache mirror vs using a more robust CDN? If people want to verify the downloads they need to go to the apache root in either case. Is this just a cultural thing or is there some security reason? A bit of both I guess. Thanks, Roman.
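The split Patrick and Roman are converging on -- bits from a mirror or CDN, hashes and signatures from the trusted apache.org root -- looks something like this in practice (a sketch; the mirror URL is illustrative and the dist path is an assumption):

    # Artifact from an (untrusted) mirror or CDN:
    wget http://some-mirror.example.org/spark/spark-0.8.0-incubating.tgz

    # Signature from the trusted ASF root, then verify the mirrored bits:
    wget https://www.apache.org/dist/incubator/spark/spark-0.8.0-incubating.tgz.asc
    gpg --verify spark-0.8.0-incubating.tgz.asc spark-0.8.0-incubating.tgz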
Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC6)
Henry - one thing is that, because the filenames are not included in the signatures, I could just alter the filenames now to not include -RCX... would that be preferable or would that necessitate another vote? - Patrick On Fri, Sep 20, 2013 at 6:39 AM, Henry Saputra henry.sapu...@gmail.com wrote: The RC should be just the directory where the artifacts live but the final name should omit the RCxx. Hmm not sure if IPMCs will be picky about this but it should not be a blocker to the release. - Henry On Thu, Sep 19, 2013 at 8:17 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Roman, We can do this in the future - I wasn't sure exactly what the right standard approach was. Just so I understand, the change you are proposing from what is there now is just to remove rcX from the file-names, correct? - Patrick On Thu, Sep 19, 2013 at 8:06 PM, Roman Shaposhnik r...@apache.org wrote: On Thu, Sep 19, 2013 at 5:56 PM, Patrick Wendell pwend...@gmail.com wrote: FYI this vote ends in 8 hours. I was going to test it on a fully distributed Bigtop cluster, but hit a few snags. That now will extend into the weekend. Of course, that's not that big of a deal -- I can always vote on incubator general once you guys move the vote over there. The only minor nit for the future I've noticed is that I would highly encourage you to follow the usual RC practices where you name all of your artifacts as final bits and have a subdirectory that reflects the RC name. E.g. here's how a very recent Hadoop RC looks like: http://people.apache.org/~acmurthy/hadoop-2.1.1-beta-rc0/ Thanks, Roman.
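Patrick's observation about the filenames holds because a detached GPG signature covers only the file contents, not the name, so renamed artifacts still verify (a sketch, assuming rc6 artifact names of this shape):

    # Drop the -rc6 suffix from the artifact and its signature:
    mv spark-0.8.0-incubating-rc6.tgz spark-0.8.0-incubating.tgz
    mv spark-0.8.0-incubating-rc6.tgz.asc spark-0.8.0-incubating.tgz.asc

    # The signature still verifies, since it signs content, not names:
    gpg --verify spark-0.8.0-incubating.tgz.asc spark-0.8.0-incubating.tgz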
[RESULT] [VOTE] Release Apache Spark 0.8.0-incubating (RC6)
The vote is now closed. Below are the vote totals. +1 (7 Total) Andy Konwinski Matei Zaharia Patrick Wendell Konstantin Boudnik Reynold Xin Chris Mattmann* Henry Saputra* 0 (1 Total) Mark Hamstra -1 (0 Total) * = Binding Vote As per the incubator release guide [1] I'll be sending this to the general incubator list for a final vote from IPMC members. [1] http://incubator.apache.org/guides/releasemanagement.html#best-practice-incubator-release-vote - Patrick -- Forwarded message -- From: Roman Shaposhnik r...@apache.org Date: Fri, Sep 20, 2013 at 8:10 AM Subject: Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC6) To: dev@spark.incubator.apache.org On Thu, Sep 19, 2013 at 8:17 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Roman, We can do this in the future - I wasn't sure exactly what the right standard approach was. Just so I understand, the change you are proposing from what is there now is just to remove rcX from the file-names, correct? Right. Basically your artifacts should look exactly like what is going to be released when the vote passes. Like I said -- it is a small nit, but it makes it easier for guys like me to test the RCs in an automated manner. Thanks, Roman.
Re: [RESULT] [VOTE] Release Apache Spark 0.8.0-incubating (RC6)
Hey Henry, Sounds good. I'll send an email to general@ shortly. I didn't realize that this vote technically counts as passing according to those rules (since plenty of PPMC members gave +1). On Fri, Sep 20, 2013 at 1:30 PM, Henry Saputra henry.sapu...@gmail.com wrote: Thanks to Patrick for driving the first Apache Spark release. Great job so far. A bit of clarification: the release VOTE passes with more than 3 +1 binding votes from the Apache Spark Podling Project Management Committee (PPMC): +1 (7 Total) Andy Konwinski Matei Zaharia Patrick Wendell Konstantin Boudnik Reynold Xin Chris Mattmann* Henry Saputra* (* indicates IPMC) Since Spark is under the ASF incubator we need to send another VOTE to the general@i.a.o list. From the ASF release management page: It is Apache policy that all releases be formally approved by the responsible PMC. In the case of the incubator, the IPMC must approve all releases. That means there is an additional bit of voting that the release manager must now oversee on general@incubator in order to gain that approval. The release manager must inform general@incubator that the vote has passed on the podling's development list, and should indicate any IPMC votes gained during that process. A new vote on the release candidate artifacts must now be held on general@incubator to seek majority consensus from the IPMC. Previous IPMC votes issued on the project's development list count towards that goal. Even if there are sufficient IPMC votes already, it is vital that the IPMC as a whole is informed via a VOTE e-mail on general@incubator. We have 2 IPMC votes already, so technically we need one more unless we get veto votes against the release. - Henry On Fri, Sep 20, 2013 at 11:43 AM, Patrick Wendell pwend...@gmail.com wrote: The vote is now closed. Below are the vote totals. +1 (7 Total) Andy Konwinski Matei Zaharia Patrick Wendell Konstantin Boudnik Reynold Xin Chris Mattmann* Henry Saputra* 0 (1 Total) Mark Hamstra -1 (0 Total) * = Binding Vote As per the incubator release guide [1] I'll be sending this to the general incubator list for a final vote from IPMC members. [1] http://incubator.apache.org/guides/releasemanagement.html#best-practice-incubator-release-vote - Patrick -- Forwarded message -- From: Roman Shaposhnik r...@apache.org Date: Fri, Sep 20, 2013 at 8:10 AM Subject: Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC6) To: dev@spark.incubator.apache.org On Thu, Sep 19, 2013 at 8:17 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Roman, We can do this in the future - I wasn't sure exactly what the right standard approach was. Just so I understand, the change you are proposing from what is there now is just to remove rcX from the file-names, correct? Right. Basically your artifacts should look exactly like what is going to be released when the vote passes. Like I said -- it is a small nit, but it makes it easier for guys like me to test the RCs in an automated manner. Thanks, Roman.
Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC6)
Hey Chris, the tag in github is 3b85a85, which I listed in the original vote next to the git URL. Is there another type of tag I should be adding? On Thu, Sep 19, 2013 at 7:20 PM, Chris Mattmann mattm...@apache.org wrote: I'm currently downloading the RC (all 127MB of the bin; then onto source). I have a generic set of Incubator scripts so it should go fine after that. I'm giving you a preview of my minor nit: We don't VOTE on github URLs -- we VOTE on ASF URLs (e.g., the tag). That should be corrected in future RC emails. If all checks out, should be +1 shortly. -Original Message- From: Patrick Wendell pwend...@gmail.com Reply-To: dev@spark.incubator.apache.org dev@spark.incubator.apache.org Date: Thursday, September 19, 2013 5:56 PM To: dev@spark.incubator.apache.org dev@spark.incubator.apache.org Subject: Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC6) FYI this vote ends in 8 hours. On Wed, Sep 18, 2013 at 8:56 PM, Reynold Xin r...@cs.berkeley.edu wrote: +1 -- Reynold Xin, AMPLab, UC Berkeley http://rxin.org On Wed, Sep 18, 2013 at 11:06 AM, Konstantin Boudnik c...@apache.org wrote: Maven package could be run with -DskipTests, which will simply build... well, the package. +1 on the RC. The nits are indeed minor. Cos On Tue, Sep 17, 2013 at 07:20PM, Matei Zaharia wrote: In Maven, mvn package should also create the assembly, but the non-obvious thing is that it needs to happen for all projects before mvn test for core works. Unfortunately I don't know any easy way around that. Matei On Sep 17, 2013, at 1:46 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Mark, Good catches here. Ya, the driver suite thing is sorta annoying - we should try to fix that in master. The audit script I wrote first does an sbt/sbt assembly to avoid this. I agree though these shouldn't block the release (if a blocker does come up we can revisit these potentially when cutting a release). - Patrick On Tue, Sep 17, 2013 at 1:26 PM, Mark Hamstra m...@clearstorydata.com wrote: There are a few nits left to pick: 'sbt/sbt publish-local' isn't generating correct POM files because of the way the exclusions are defined in SparkBuild.scala using wildcards; looks like there may be some broken doc links generated in that task, as well; DriverSuite doesn't like to run from the maven build, complaining that 'sbt/sbt assembly' needs to be run first. None of these is enough for me to give RC6 a -1. On Tue, Sep 17, 2013 at 11:28 AM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tried the new staging repo to make sure the issue with RC5 is fixed. Matei On Sep 17, 2013, at 2:03 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit 3b85a85): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc6/files/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-059/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc6/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! 
The vote is open until Friday, September 20th at 09:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
[VOTE] Release Apache Spark 0.8.0-incubating (RC6)
Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit 3b85a85): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc6/files/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-059/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc6/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! The vote is open until Friday, September 20th at 09:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC5)
Thanks for the feedback guys. I've changed the audit script to incorporate Andy's suggestion. I also added tests for building sbt and maven projects against the staged repository, to check that artifacts are set up correctly in maven. I've posted RC6, which adds a very small change to this RC. This vote is therefore cancelled in favor of RC6. - Patrick On Mon, Sep 16, 2013 at 9:47 PM, Andy Konwinski andykonwin...@gmail.com wrote: Patrick, I took a quick look over your release_auditor.py script and it's really great! Then I ran it (had to add --keyserver pgp.mit.edu to the gpg command) and everything passed on OS X! Great job and +1 from me whenever you resolve the kafka jar issue you mentioned. Andy On Mon, Sep 16, 2013 at 8:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: FWIW, I tested it otherwise and it seems good modulo this issue. Matei On Sep 16, 2013, at 6:39 PM, Patrick Wendell pwend...@gmail.com wrote: Hey folks, just FYI we found one minor issue with this RC (the kafka jar in the stream pom needs to be published as provided since it's not available in maven). Please still continue to test this and provide feedback here until the following RC is posted later. - Patrick On Mon, Sep 16, 2013 at 1:28 PM, Reynold Xin r...@cs.berkeley.edu wrote: +1 -- Reynold Xin, AMPLab, UC Berkeley http://rxin.org On Sun, Sep 15, 2013 at 11:09 PM, Patrick Wendell pwend...@gmail.com wrote: I also wrote an audit script [1] to verify various aspects of the release binaries and ran it on this RC. People are welcome to run this themselves, but I haven't tested it on other machines yet, and some of the Spark tests are very sensitive to the test environment :) Output is pasted below: [1] https://github.com/pwendell/spark-utils/blob/master/release_auditor.py - Verifying download integrity for artifact: spark-0.8.0-incubating-bin-cdh4-rc5.tgz [PASSED] Artifact signature verified. [PASSED] Artifact MD5 verified. [PASSED] Artifact SHA verified. [PASSED] Tarball contains CHANGES.txt file [PASSED] Tarball contains NOTICE file [PASSED] Tarball contains LICENSE file [PASSED] README file contains disclaimer Verifying download integrity for artifact: spark-0.8.0-incubating-bin-hadoop1-rc5.tgz [PASSED] Artifact signature verified. [PASSED] Artifact MD5 verified. [PASSED] Artifact SHA verified. [PASSED] Tarball contains CHANGES.txt file [PASSED] Tarball contains NOTICE file [PASSED] Tarball contains LICENSE file [PASSED] README file contains disclaimer Verifying download integrity for artifact: spark-0.8.0-incubating-rc5.tgz [PASSED] Artifact signature verified. [PASSED] Artifact MD5 verified. [PASSED] Artifact SHA verified. 
[PASSED] Tarball contains CHANGES.txt file [PASSED] Tarball contains NOTICE file [PASSED] Tarball contains LICENSE file [PASSED] README file contains disclaimer Verifying build and tests for artifact: spark-0.8.0-incubating-bin-cdh4-rc5.tgz == Running build [PASSED] sbt build successful [PASSED] Maven build successful == Performing unit tests [PASSED] Tests successful Verifying build and tests for artifact: spark-0.8.0-incubating-bin-hadoop1-rc5.tgz == Running build [PASSED] sbt build successful [PASSED] Maven build successful == Performing unit tests [PASSED] Tests successful Verifying build and tests for artifact: spark-0.8.0-incubating-rc5.tgz == Running build [PASSED] sbt build successful [PASSED] Maven build successful == Performing unit tests [PASSED] Tests successful - Patrick On Sun, Sep 15, 2013 at 9:48 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit d9e80d5): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc5/files/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-051/org/apache/spark/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc5/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! The vote is open until Thursday, September 19th at 05:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [ ] -1 Do not release this package because ... To learn
Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC5)
I also wrote an audit script [1] to verify various aspects of the release binaries and ran it on this RC. People are welcome to run this themselves, but I haven't tested it on other machines yet, and some of the Spark tests are very sensitive to the test environment :) Output is pasted below: [1] https://github.com/pwendell/spark-utils/blob/master/release_auditor.py - Verifying download integrity for artifact: spark-0.8.0-incubating-bin-cdh4-rc5.tgz [PASSED] Artifact signature verified. [PASSED] Artifact MD5 verified. [PASSED] Artifact SHA verified. [PASSED] Tarball contains CHANGES.txt file [PASSED] Tarball contains NOTICE file [PASSED] Tarball contains LICENSE file [PASSED] README file contains disclaimer Verifying download integrity for artifact: spark-0.8.0-incubating-bin-hadoop1-rc5.tgz [PASSED] Artifact signature verified. [PASSED] Artifact MD5 verified. [PASSED] Artifact SHA verified. [PASSED] Tarball contains CHANGES.txt file [PASSED] Tarball contains NOTICE file [PASSED] Tarball contains LICENSE file [PASSED] README file contains disclaimer Verifying download integrity for artifact: spark-0.8.0-incubating-rc5.tgz [PASSED] Artifact signature verified. [PASSED] Artifact MD5 verified. [PASSED] Artifact SHA verified. [PASSED] Tarball contains CHANGES.txt file [PASSED] Tarball contains NOTICE file [PASSED] Tarball contains LICENSE file [PASSED] README file contains disclaimer Verifying build and tests for artifact: spark-0.8.0-incubating-bin-cdh4-rc5.tgz == Running build [PASSED] sbt build successful [PASSED] Maven build successful == Performing unit tests [PASSED] Tests successful Verifying build and tests for artifact: spark-0.8.0-incubating-bin-hadoop1-rc5.tgz == Running build [PASSED] sbt build successful [PASSED] Maven build successful == Performing unit tests [PASSED] Tests successful Verifying build and tests for artifact: spark-0.8.0-incubating-rc5.tgz == Running build [PASSED] sbt build successful [PASSED] Maven build successful == Performing unit tests [PASSED] Tests successful - Patrick On Sun, Sep 15, 2013 at 9:48 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit d9e80d5): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc5/files/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-051/org/apache/spark/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc5/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! The vote is open until Thursday, September 19th at 05:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
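For anyone curious what those checks amount to, here is a condensed, independent illustration of the digest and tarball-contents checks reported above. It is not an excerpt from release_auditor.py, and the expected digest is assumed to come from the digest files published alongside the artifact.

```python
import hashlib
import tarfile

def check_artifact(tgz_path, expected_md5):
    # Digest check: recompute MD5 and compare against the published value.
    with open(tgz_path, "rb") as f:
        md5_ok = hashlib.md5(f.read()).hexdigest() == expected_md5

    # Contents check: the tarball must carry the required release files.
    with tarfile.open(tgz_path) as tar:
        names = tar.getnames()
        required = ("CHANGES.txt", "NOTICE", "LICENSE")
        files_ok = all(any(n.endswith("/" + req) for n in names)
                       for req in required)

    return md5_ok and files_ok
```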
Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC5)
Hey folks, just FYI we found one minor issue with this RC (the kafka jar in the stream pom needs to be published as provided since it's not available in maven). Please still continue to test this and provide feedback here until the following RC is posted later. - Patrick On Mon, Sep 16, 2013 at 1:28 PM, Reynold Xin r...@cs.berkeley.edu wrote: +1 -- Reynold Xin, AMPLab, UC Berkeley http://rxin.org On Sun, Sep 15, 2013 at 11:09 PM, Patrick Wendell pwend...@gmail.com wrote: I also wrote an audit script [1] to verify various aspects of the release binaries and ran it on this RC. People are welcome to run this themselves, but I haven't tested it on other machines yet, and some of the Spark tests are very sensitive to the test environment :) Output is pasted below: [1] https://github.com/pwendell/spark-utils/blob/master/release_auditor.py - Verifying download integrity for artifact: spark-0.8.0-incubating-bin-cdh4-rc5.tgz [PASSED] Artifact signature verified. [PASSED] Artifact MD5 verified. [PASSED] Artifact SHA verified. [PASSED] Tarball contains CHANGES.txt file [PASSED] Tarball contains NOTICE file [PASSED] Tarball contains LICENSE file [PASSED] README file contains disclaimer Verifying download integrity for artifact: spark-0.8.0-incubating-bin-hadoop1-rc5.tgz [PASSED] Artifact signature verified. [PASSED] Artifact MD5 verified. [PASSED] Artifact SHA verified. [PASSED] Tarball contains CHANGES.txt file [PASSED] Tarball contains NOTICE file [PASSED] Tarball contains LICENSE file [PASSED] README file contains disclaimer Verifying download integrity for artifact: spark-0.8.0-incubating-rc5.tgz [PASSED] Artifact signature verified. [PASSED] Artifact MD5 verified. [PASSED] Artifact SHA verified. [PASSED] Tarball contains CHANGES.txt file [PASSED] Tarball contains NOTICE file [PASSED] Tarball contains LICENSE file [PASSED] README file contains disclaimer Verifying build and tests for artifact: spark-0.8.0-incubating-bin-cdh4-rc5.tgz == Running build [PASSED] sbt build successful [PASSED] Maven build successful == Performing unit tests [PASSED] Tests successful Verifying build and tests for artifact: spark-0.8.0-incubating-bin-hadoop1-rc5.tgz == Running build [PASSED] sbt build successful [PASSED] Maven build successful == Performing unit tests [PASSED] Tests successful Verifying build and tests for artifact: spark-0.8.0-incubating-rc5.tgz == Running build [PASSED] sbt build successful [PASSED] Maven build successful == Performing unit tests [PASSED] Tests successful - Patrick On Sun, Sep 15, 2013 at 9:48 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit d9e80d5): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc5/files/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-051/org/apache/spark/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc5/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! 
The vote is open until Thursday, September 19th at 05:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: git commit: Hard code scala version in pom files.
So Mark, does that mean you'd be OK with us hard coding the scala version in the branch 0.8.0 build? It just seems like the overall simplest solution for now. Or would this cause a large problem for you guys? We can solve this on master for 0.9; I didn't touch master at all wrt the maven build. - Patrick On Sun, Sep 15, 2013 at 7:32 PM, Mark Hamstra m...@clearstorydata.com wrote: Yes, it looks like we need to do something to get 0.8.0 shipped and something to fix the problem longer term. I agree that those somethings don't have to be the same thing, and that we can take this up again once the 0.8.0 dust has settled. Give me a day and I'll probably have more to say about how I'd like things to look in the future. On Sun, Sep 15, 2013 at 7:06 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Mark, Thanks for providing the detailed explanation. My primary concern was just that this changes the published artifacts in a way that could break downstream consumers of these poms, which may assume that artifactIds are immutable within a pom.xml file. For now, let me revert my change and test that a few important things still work (e.g. IDEs, etc). At a minimum I just want to make sure things we are advising people to do don't break under this release. If this doesn't break those things we can move forward with the parameterized artifacts for 0.8.0. Just a word of caution though: there may be other downstream consumers of the pom files for whom this will cause a problem in the future. If someone presents a compelling reason, we'll have to think about whether we can keep publishing them like this, since this is not technically a valid maven format. - Patrick On Sun, Sep 15, 2013 at 6:46 PM, Mark Hamstra m...@clearstorydata.com wrote: Ah sorry, I've gotten so used to using ClearStory's poms (where we make quite a lot of use of such parameterization) that I lost track of exactly when Spark's maven build was changed to work in a similar way. This all revolves around a basic difference of opinion as to whether the thing that specifies how a project is built should be a fixed, static document, or is itself more of a program, a parameterized function that drives the build and results in an artifact. SBT is of the latter opinion, while Maven (at least with Maven 3) is going the other way. That means that building idiomatic Scala artifacts (which expect things like cross-versioning support and artifactIds that include the Scala binary version that was used to create them) is somewhat at odds with the Maven philosophy. Hard-coding artifactIds, versions, and whatever else Maven now requires to guarantee that a pom file be a fixed, repeatable build description works okay for a single build of an artifact; and a user of just that built artifact won't have to change behavior if the pom is no longer parameterized. However, users who are not just interested in using pre-built artifacts but also in modifying, adding to or reusing the code do have to change their behavior if parameterized Maven builds disappear (yes, you have pointed out the state of affairs with the 0.6 and 0.7 releases; I'll point out that some of those making further use of the code have been using the current, not-yet-released poms for a good while.) 
Without some form of parameterized Maven builds, developers who now rely upon such parameterized builds will have to choose to fork the Apache poms and maintain their own parameterized build, or to repeatedly and manually edit static Apache pom files in order to change artifactIds and dependency versions (which is a frequent need when integrating Spark into a much larger and more complicated technology stack), or to switch over to using SBT in order to get parameterized builds (which, of course, would necessitate a lot of other changes, not all of them welcome.) Archetypes or something similar seems like a way to satisfy Maven's new requirement for static build configurations while at the same time providing a parameterized way to generate that configuration or a modified version of it -- solving the problem by adding a layer of abstraction. On Sun, Sep 15, 2013 at 6:12 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Mark, Could you describe a user whose behavior is changed by this, and how it is changed? This commit actually brings 0.8 in line with the 0.7 and 0.6 branches, where the scala version is hard coded in the released artifacts: http://repo1.maven.org/maven2/org/spark-project/spark-streaming_2.9.3/0.7.3/spark-streaming_2.9.3-0.7.3.pom That seems to me to minimize the changes in user behavior as much as possible. It would be bad if during the 0.8 release the format of our released artifacts changed in a way that caused things to break for users. One example of something that could break is an IDE or some other tool that consumes these builds
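To make Mark's "repeatedly and manually edit static Apache pom files" option concrete, the downstream chore looks roughly like the sketch below: stamping a concrete Scala binary version into every artifactId before each rebuild. The property name ${scala.binary.version}, the module list, and the paths are illustrative assumptions, not taken from the actual Spark poms.

```python
# Hypothetical downstream chore once parameterized builds go away:
# stamp a concrete Scala binary version into each pom before building.
SCALA_BINARY_VERSION = "2.9.3"  # whatever the larger stack is pinned to

def harden_pom(path):
    with open(path) as f:
        pom = f.read()
    # e.g. spark-core_${scala.binary.version} becomes spark-core_2.9.3
    pom = pom.replace("${scala.binary.version}", SCALA_BINARY_VERSION)
    with open(path, "w") as f:
        f.write(pom)

for module in ("core", "repl", "streaming"):
    harden_pom(module + "/pom.xml")
```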
Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC4)
Yes, we've moved on to RC5, thanks. On Sun, Sep 15, 2013 at 10:06 PM, Henry Saputra henry.sapu...@gmail.com wrote: Looks like this VOTE thread has been cancelled. Patrick has sent a VOTE for RC5 in a separate thread. - Henry On Saturday, September 14, 2013, Patrick Wendell wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit 32fc250): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc4/files/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-046/org/apache/spark/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc4/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! The vote is open until Tuesday, September 17th at 10:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC3)
I'll post another RC in a bit which addresses Mark's comments (though please continue to provide feedback on this one!). Suresh - it's signed with the following key: http://people.apache.org/~pwendell/9E4FE3AF.asc On Fri, Sep 13, 2013 at 11:28 AM, Mark Hamstra m...@clearstorydata.com wrote: [X] -1 Do not release this package because ... Prior, out-of-band discussion: Thanks for the insight Mark, we need to move this discussion to the main VOTE thread in the dev@ list to be official. Mark, could you kindly reply to Patrick's VOTE email thread with the -1 vote to make sure the community knows that there are missing pieces in the release artifacts proposed by Spark's RE (Patrick)? Thanks, Henry On Thu, Sep 12, 2013 at 10:57 PM, Mark Hamstra m...@clearstorydata.com wrote: Yeah, that may get tricky, because the check of the tests in the 'prepare' step and the running of the deploy goal in the 'perform' step (excuse my calling it 'release' previously) will want to change the build dependencies. We may end up needing to do as Patrick has been doing, but then run a separate script to make sure that the yarn and repl-bin modules get properly versioned, tagged, and uploaded. Maybe a maven-release-plugin expert knows how to get it to do just what we want, but I certainly don't see how myself right now. On Thu, Sep 12, 2013 at 10:45 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hmm, one potentially nasty issue here is if spark-core ends up depending on hadoop-client 2.0.x instead of 1.0.4 by default with these settings. We should make sure that doesn't happen. If you'll make another RC, here are a few other small fixes I'd suggest: - In the title tag of docs/_layout/global.html, use site.SPARK_VERSION_SHORT instead of SPARK_VERSION (it's kind of verbose now) - Fix the jets3t version thing mentioned here: https://github.com/mesos/spark/pull/919 (just remove the unneeded version from core/pom.xml) Matei On Sep 12, 2013, at 10:25 PM, Patrick Wendell pwend...@gmail.com wrote: Oh I see - okay, I'll try to make sure they (a) get pushed and (b) have the correct version. Thanks for bringing this up, would have totally missed it otherwise. On Thu, Sep 12, 2013 at 10:20 PM, Mark Hamstra m...@clearstorydata.com wrote: I just mean that with the yarn and repl-bin poms still specifying SNAPSHOT versions, any maven build that tries to use the hadoop2-yarn or repl-bin profile will not work, because those modules will not be able to find a SNAPSHOT parent pom. Including those profiles in the prepare and release steps should fix the problem, but you may need to manually sync up the versions of those two pom files first. On Thu, Sep 12, 2013 at 10:16 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Mark, I haven't been including those - I'll use that flag and try to publish again. The last sentence there, 'the maven build is broken' - does that refer to an additional problem, or just to the problem of me not including the flag? - Patrick On Thu, Sep 12, 2013 at 10:11 PM, Mark Hamstra m...@clearstorydata.com wrote: It's a definite 'do not release' from me, because you are still not picking up all of the modules in your prepare and release. Are you including -Phadoop2-yarn,repl-bin on the command line for your mvn prepare and mvn release? Because the yarn module and repl-bin module are not being processed by the maven-release-plugin, so the pom files for those modules still show their version as 0.8.0-incubating-SNAPSHOT instead of 0.8.0-incubating. That means that the maven build is broken. 
On Thu, Sep 12, 2013 at 3:57 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit ffacd17): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc3/files/ The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-034/org/apache/spark/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc3/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! The vote is open until Saturday, September 14th at 23:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [X] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC3)
Hey guys, we actually decided on a slightly different naming convention for the downloads. I'm going to amend the files in the next few minutes... in case anyone happens to be looking *this instant* (which I doubt), hold off until I update them. On Thu, Sep 12, 2013 at 3:57 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit ffacd17): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc3/files/ The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-034/org/apache/spark/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc3/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! The vote is open until Saturday, September 14th at 23:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: [VOTE] Release Apache Spark 0.8.0-incubating (RC3)
Fixed! On Thu, Sep 12, 2013 at 4:22 PM, Patrick Wendell pwend...@gmail.com wrote: Hey guys, we actually decided on a slightly different naming convention for the downloads. I'm going to amend the files in the next few minutes... in case anyone happens to be looking *this instant* (which I doubt), hold off until I update them. On Thu, Sep 12, 2013 at 3:57 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit ffacd17): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc3/files/ The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-034/org/apache/spark/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc3/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! The vote is open until Saturday, September 14th at 23:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
[VOTE] Release Apache Spark 0.8.0-incubating (RC3)
Please vote on releasing the following candidate as Apache Spark (incubating) version 0.8.0. This will be the first incubator release for Spark in Apache. The tag to be voted on is v0.8.0-incubating (commit ffacd17): https://github.com/apache/incubator-spark/releases/tag/v0.8.0-incubating The release files, including signatures, digests, etc can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc3/files/ The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-034/org/apache/spark/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-0.8.0-incubating-rc3/docs/ Please vote on releasing this package as Apache Spark 0.8.0-incubating! The vote is open until Saturday, September 14th at 23:00 UTC and passes if a majority of at least 3 +1 IPMC votes are cast. [ ] +1 Release this package as Apache Spark 0.8.0-incubating [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.incubator.apache.org/
Re: Spark 0.8.0-incubating RC2
Hey Chris, The only issue with CHANGES.txt is that we've only recently become more disciplined about tracking issues in JIRA and tracking version numbers when we do make JIRA issues. If we generated a CHANGES.txt based on JIRA, it would be largely incomplete, since many changes from the beginning of the release would be missing. What about if I created a CHANGES.txt based on the Git history? Would that be better than not having one at all? - Patrick On Wed, Sep 11, 2013 at 6:58 AM, Chris Mattmann mattm...@apache.org wrote: Hey Patrick, Looking good. If the license info and so forth has been vetted and looks good (which it sounds like Henry and others have checked out), I took a look at http://people.apache.org/~pwendell/spark-rc/ and the only thing I would recommend adding is some CHANGES.txt file that contains a JIRA change log of what is provided in this RC. But I would definitely proceed to a [VOTE] thread on the RC and let's get this going formally. Great work. Cheers, Chris -Original Message- From: Mattmann, jpluser chris.a.mattm...@jpl.nasa.gov Reply-To: dev@spark.incubator.apache.org dev@spark.incubator.apache.org Date: Friday, September 6, 2013 4:15 PM To: Patrick Wendell pwend...@gmail.com Cc: dev@spark.incubator.apache.org dev@spark.incubator.apache.org, Henry Saputra henry.sapu...@gmail.com Subject: Re: Spark 0.8.0-incubating RC2 Awesome, was going to tell you it might take a sec to sync. Woot. OK, more tonight... ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Patrick Wendell pwend...@gmail.com Date: Friday, September 6, 2013 2:14 PM To: jpluser chris.a.mattm...@jpl.nasa.gov Cc: dev@spark.incubator.apache.org dev@spark.incubator.apache.org, Henry Saputra henry.sapu...@gmail.com Subject: Re: Spark 0.8.0-incubating RC2 Thanks Chris - also it appears that my key has now been added to this file: http://people.apache.org/keys/group/spark.asc - Patrick On Fri, Sep 6, 2013 at 1:57 PM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Feedback coming, sorry, been swamped and only recently back from DC/DARPA, but will reply soon (hopefully tonight). ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Patrick Wendell pwend...@gmail.com Date: Friday, September 6, 2013 1:56 PM To: dev@spark.incubator.apache.org dev@spark.incubator.apache.org Cc: jpluser chris.a.mattm...@jpl.nasa.gov, Henry Saputra henry.sapu...@gmail.com Subject: Re: Spark 0.8.0-incubating RC2 Hey Chris, Henry... do you guys have feedback here? This was based largely on your feedback in the last round :) On Thu, Sep 5, 2013 at 9:58 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Evan, These are posted primarily for the purpose of having the Apache mentors look at the bundling format; they are not likely to be the exact commit we release. Matei will be merging in some doc stuff before the release, I'm pretty sure that includes your docs. 
- Patrick On Thu, Sep 5, 2013 at 9:25 PM, Evan Chan e...@ooyala.com wrote: Patrick, I'm planning to submit documentation PRs against mesos/spark by tomorrow, is that OK? We really should update the docs. thanks, Evan On Thu, Sep 5, 2013 at 9:20 PM, Patrick Wendell pwend...@gmail.com wrote: No, these are posted primarily for the purpose of having the Apache mentors look at the bundling format; they are not likely to be the exact commit we release (though this RC was fc6fbfe7d7e9171572c898d9e90301117517e60e). On Thu, Sep 5, 2013 at 9:14 PM, Mark Hamstra m...@clearstorydata.com wrote: Are these RCs not getting tagged in the repository, or am I just not looking in the right place? On Thu, Sep 5, 2013 at 8:08 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Matei asked me to pick this up because he's travelling this week. I cut a second release candidate from the head of the 0.8 branch (on mesos/spark github) to address the following issues: - RC is now
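Patrick's Git-history idea for CHANGES.txt from the exchange above can be scripted in a few lines. The tag range and line format below are assumptions for illustration, not the actual release tooling:

```python
import subprocess

# Assumed range: everything since the previous release tag. The format
# string is just one reasonable choice for a change-log line.
log = subprocess.run(
    ["git", "log", "--pretty=format:  %h  %s  (%an)", "v0.7.3..HEAD"],
    capture_output=True, text=True, check=True,
).stdout

with open("CHANGES.txt", "w") as f:
    f.write("Spark Change Log\n----------------\n\n" + log + "\n")
```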
Re: Spark 0.8.0-incubating RC2
Thanks Henry. The MLLib files have been fixed since you ran the tool. On Sat, Sep 7, 2013 at 11:25 PM, Henry Saputra henry.sapu...@gmail.com wrote: Hi Patrick, I ran the Apache RAT tool as shown at http://creadur.apache.org/rat/apache-rat/index.html: java -jar apache-rat-0.10.jar ~/Downloads/spark-0.8.0-src-incubating-RC2 However, we should add the Maven RAT plugin to Spark's pom.xml to support an integrated RAT check as part of CI later. - Henry On Sat, Sep 7, 2013 at 11:24 AM, Patrick Wendell pwend...@gmail.com wrote: Henry, Thanks a lot for your feedback. Could you let me know how you ran the Apache RAT tool so I can reproduce this? My sense is that the best next step is to do an RC that is built against the Apache Git and also includes both `src` and `bin`, in addition to cleaned up license files. Some inline responses below. 1. I only see source artifacts in Patrick's p.a.o URL. I assume the pre-built ones will also be published with hashes and signed? Yes, we'll do both src and binary releases. I'll hash and sign both. 2. For every ASF release, we need a designated release engineer (RE) who will drive the release process, including determining bugs to be included, making sure all files have the right ASF header (running the maven RAT plugin check), creating the release branch, updating the version for next development, and creating and correctly signing the release artifacts. I assume this would be Matei or Patrick? Yes, this might be me for this release because I've got the keys correctly set up. I'll chat with Matei when he's back. 3. The proposed source artifact 0.8.0-RC2's signature looks good and the hash looks good. However, it was generated against the github mesos:spark repo. Reminder that when we send the proposal for release to general@incubator.a.o we need to generate RC builds using the ASF git repo with the right tagged branch. Next RC we will take care of this. 4. I ran the RAT check for the source artifact and found a lot of source files do not have the ASF license header. For example, some in the repl directory have this: /* NSC -- new Scala compiler * Copyright 2005-2011 LAMP/EPFL * @author Paul Phillips */ Not sure if we need to add the ASF header to it since we technically put it under an apache package. Scala source files under mllib are missing ASF headers. See comment above. 5. Add the public key of the RE to http://people.apache.org/keys/group/spark.asc (@Chris do we still need to create a KEYS file in the Spark git repo?) This is now finished for me :)
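Henry's one-off RAT invocation is easy to wrap in a script until the Maven plugin integration he mentions lands; the sketch below shells out to the same jar and stores the report for inspection. The jar name and source directory are the ones from his email and would vary by setup.

```python
import subprocess

# Same invocation Henry ran, captured so a CI job can archive the report.
report = subprocess.run(
    ["java", "-jar", "apache-rat-0.10.jar",
     "spark-0.8.0-src-incubating-RC2"],
    capture_output=True, text=True, check=True,
).stdout

with open("rat-report.txt", "w") as f:
    f.write(report)  # review for files flagged as missing ASF headers
```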
Re: Spark 0.8.0-incubating RC2
Henry, Thanks a lot for your feedback. Could you let me know how you ran the Apache RAT tool so I can reproduce this? My sense is that the best next step is to do an RC that is built against the Apache Git and also includes both `src` and `bin`, in addition to cleaned up license files. Some inline responses below. 1. I only see source artifacts in Patrick's p.a.o URL. I assume the pre-built ones will also be published with hashes and signed? Yes, we'll do both src and binary releases. I'll hash and sign both. 2. For every ASF release, we need a designated release engineer (RE) who will drive the release process, including determining bugs to be included, making sure all files have the right ASF header (running the maven RAT plugin check), creating the release branch, updating the version for next development, and creating and correctly signing the release artifacts. I assume this would be Matei or Patrick? Yes, this might be me for this release because I've got the keys correctly set up. I'll chat with Matei when he's back. 3. The proposed source artifact 0.8.0-RC2's signature looks good and the hash looks good. However, it was generated against the github mesos:spark repo. Reminder that when we send the proposal for release to general@incubator.a.o we need to generate RC builds using the ASF git repo with the right tagged branch. Next RC we will take care of this. 4. I ran the RAT check for the source artifact and found a lot of source files do not have the ASF license header. For example, some in the repl directory have this: /* NSC -- new Scala compiler * Copyright 2005-2011 LAMP/EPFL * @author Paul Phillips */ Not sure if we need to add the ASF header to it since we technically put it under an apache package. Scala source files under mllib are missing ASF headers. See comment above. 5. Add the public key of the RE to http://people.apache.org/keys/group/spark.asc (@Chris do we still need to create a KEYS file in the Spark git repo?) This is now finished for me :)
Re: Spark 0.8.0-incubating RC2
Hey Chris, Henry... do you guys have feedback here? This was based largely on your feedback in the last round :) On Thu, Sep 5, 2013 at 9:58 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Evan, These are posted primarily for the purpose of having the Apache mentors look at the bundling format; they are not likely to be the exact commit we release. Matei will be merging in some doc stuff before the release, I'm pretty sure that includes your docs. - Patrick On Thu, Sep 5, 2013 at 9:25 PM, Evan Chan e...@ooyala.com wrote: Patrick, I'm planning to submit documentation PRs against mesos/spark by tomorrow, is that OK? We really should update the docs. thanks, Evan On Thu, Sep 5, 2013 at 9:20 PM, Patrick Wendell pwend...@gmail.com wrote: No, these are posted primarily for the purpose of having the Apache mentors look at the bundling format; they are not likely to be the exact commit we release (though this RC was fc6fbfe7d7e9171572c898d9e90301117517e60e). On Thu, Sep 5, 2013 at 9:14 PM, Mark Hamstra m...@clearstorydata.com wrote: Are these RCs not getting tagged in the repository, or am I just not looking in the right place? On Thu, Sep 5, 2013 at 8:08 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Matei asked me to pick this up because he's travelling this week. I cut a second release candidate from the head of the 0.8 branch (on mesos/spark github) to address the following issues: - RC is now hosted in an apache web space - RC now includes signature - RC now includes MD5 and SHA512 digests [tgz] http://people.apache.org/~pwendell/spark-rc/spark-0.8.0-src-incubating-RC2.tgz [all files] http://people.apache.org/~pwendell/spark-rc/ It would be great to get feedback on the release structure. I also changed the name to include src since we will be releasing both source and binary releases. I was a bit confused about how to attach my GPG key to the spark.asc file. I took the following steps: 1. Created a GPG key locally 2. Distributed the key to public key servers (gpg --send-key) 3. Added the exported key to my apache web space: http://people.apache.org/~pwendell/9E4FE3AF.asc 4. Added the key fingerprint at id.apache.org 5. Created an apache FOAF file with the key signature However, this doesn't seem sufficient to get my key on this page (at least, not yet): http://people.apache.org/keys/group/spark.asc Chris - are there other steps I missed? Is there a manual way to augment this file? - Patrick -- -- Evan Chan Staff Engineer e...@ooyala.com | http://www.ooyala.com/ http://www.facebook.com/ooyala http://www.linkedin.com/company/ooyala http://www.twitter.com/ooyala
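Steps 2 and 3 of Patrick's list correspond to standard gpg invocations; here is a sketch of just those two, using the key id from his email (the explicit --keyserver mirrors the flag Andy later noted he needed when verifying). Substitute your own key id and preferred keyserver.

```python
import subprocess

KEY_ID = "9E4FE3AF"  # from Patrick's email; use your own key id

# Step 2: distribute the public key to a public keyserver.
subprocess.run(["gpg", "--keyserver", "pgp.mit.edu", "--send-key", KEY_ID],
               check=True)

# Step 3: export an armored copy to upload to the apache web space.
with open(KEY_ID + ".asc", "w") as out:
    subprocess.run(["gpg", "--armor", "--export", KEY_ID],
                   stdout=out, check=True)
```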