Semantics of LGTM

2015-01-17 Thread Patrick Wendell
Hey All,

Just wanted to ping about a minor issue - but one that ends up having
consequences given Spark's volume of reviews and commits. As much as
possible, I think we should try to gear toward "Google-style"
LGTM on reviews. What I mean by this is that LGTM has the following
semantics:

"I know this code well, or I've looked at it close enough to feel
confident it should be merged. If there are issues/bugs with this code
later on, I feel confident I can help with them."

Here is an alternative semantic:

"Based on what I know about this part of the code, I don't see any
show-stopper problems with this patch".

The issue with the latter is that it ultimately erodes the
significance of LGTM, since subsequent reviewers need to reason about
what the person meant by saying LGTM. In contrast, having strong
semantics around LGTM can help streamline reviews a lot, especially as
reviewers get more experienced and gain trust from the committership.

There are several easy ways to give a more limited endorsement of a patch:
- "I'm not familiar with this code, but style, etc look good" (general
endorsement)
- "The build changes in this code LGTM, but I haven't reviewed the
rest" (limited LGTM)

If people are okay with this, I might add a short note on the wiki.
I'm sending this e-mail first, though, to see whether anyone wants to
express agreement or disagreement with this approach.

- Patrick




Re: Semantics of LGTM

2015-01-17 Thread Patrick Wendell
I think the ASF +1 is *slightly* different than Google's LGTM, because
it might convey wanting the patch/feature to be merged but not
necessarily saying you did a thorough review and stand behind its
technical contents. For instance, I've seen people pile on +1's to try
and indicate support for a feature or patch in some projects, even
though they didn't do a thorough technical review. This +1 is
definitely a useful mechanism.

There is definitely a lot of overlap in the meaning, though, and
it's largely because Spark had its own culture around reviews before
it was donated to the ASF, so there is a mix of two styles.

Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
proposed originally (as opposed to the one Sandy proposed, for example). This is
what I've seen every project using the LGTM convention do (Google, and
some open source projects such as Impala) to indicate technical
sign-off.

- Patrick

On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson  wrote:
> I think I've seen something like +2 = "strong LGTM" and +1 = "weak LGTM;
> someone else should review" before. It's nice to have a shortcut which isn't
> a sentence when talking about weaker forms of LGTM.
>
> On Sat, Jan 17, 2015 at 6:59 PM,  wrote:
>>
>> I think clarifying these semantics is definitely worthwhile. Maybe this
>> complicates the process with additional terminology, but the way I've used
>> these has been:
>>
>> +1 - I think this is safe to merge and, barring objections from others,
>> would merge it immediately.
>>
>> LGTM - I have no concerns about this patch, but I don't necessarily feel
>> qualified to make a final call about it.  The TM part acknowledges the
>> judgment as a little more subjective.
>>
>> I think having some concise way to express both of these is useful.
>>
>> -Sandy
>>
>> > On Jan 17, 2015, at 5:40 PM, Patrick Wendell  wrote:
>> >
>> > Hey All,
>> >
>> > Just wanted to ping about a minor issue - but one that ends up having
>> > consequence given Spark's volume of reviews and commits. As much as
>> > possible, I think that we should try and gear towards "Google Style"
>> > LGTM on reviews. What I mean by this is that LGTM has the following
>> > semantics:
>> >
>> > "I know this code well, or I've looked at it close enough to feel
>> > confident it should be merged. If there are issues/bugs with this code
>> > later on, I feel confident I can help with them."
>> >
>> > Here is an alternative semantic:
>> >
>> > "Based on what I know about this part of the code, I don't see any
>> > show-stopper problems with this patch".
>> >
>> > The issue with the latter is that it ultimately erodes the
>> > significance of LGTM, since subsequent reviewers need to reason about
>> > what the person meant by saying LGTM. In contrast, having strong
>> > semantics around LGTM can help streamline reviews a lot, especially as
>> > reviewers get more experienced and gain trust from the comittership.
>> >
>> > There are several easy ways to give a more limited endorsement of a
>> > patch:
>> > - "I'm not familiar with this code, but style, etc look good" (general
>> > endorsement)
>> > - "The build changes in this code LGTM, but I haven't reviewed the
>> > rest" (limited LGTM)
>> >
>> > If people are okay with this, I might add a short note on the wiki.
>> > I'm sending this e-mail first, though, to see whether anyone wants to
>> > express agreement or disagreement with this approach.
>> >
>> > - Patrick
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>




Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
Okay - so given all this I was going to put the following on the wiki
tentatively:

## Reviewing Code
Community code review is Spark's fundamental quality assurance
process. When reviewing a patch, your goal should be to help
streamline the committing process by giving committers confidence that the
patch has been verified by an additional party. You are encouraged to
(politely) submit technical feedback to the author to identify areas
for improvement or potential bugs.

If you feel a patch is ready for inclusion in Spark, indicate this to
committers with a comment: "I think this patch looks good". Spark uses
the LGTM convention for indicating the highest level of technical
sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
strong statement; it should be interpreted as follows: "I've
looked at this thoroughly and take as much ownership as if I wrote the
patch myself". If you comment LGTM, you will be expected to help with
bugs or follow-up issues on the patch. Judicious use of LGTMs is a
great way to gain credibility as a reviewer with the broader
community.

Reviewers are also welcome to argue against the inclusion of a
feature or patch; simply indicate this in the comments.

- Patrick

On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma  wrote:
> Patrick's original proposal LGTM :).  However, until now I have been under the
> impression of LGTM with special emphasis on the TM part. That said, I will be
> okay/happy (or responsible) for the patch, if it goes in.
>
> Prashant Sharma
>
>
>
> On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin  wrote:
>>
>> Maybe just to avoid LGTM as a single token when it is not actually
>> according to Patrick's definition, but anybody can still leave comments
>> like:
>>
>> "The direction of the PR looks good to me." or "+1 on the direction"
>>
>> "The build part looks good to me"
>>
>> ...
>>
>>
>> On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout 
>> wrote:
>>
>> > +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
>> > I've
>> > heard the semantics of "LGTM" expressed as "I've looked at this
>> > thoroughly
>> > and take as much ownership as if I wrote the patch myself".  My
>> > understanding is that this is the level of review we expect for all
>> > patches
>> > that ultimately go into Spark, so it's important to have a way to
>> > concisely
>> > describe when this has been done.
>> >
>> > Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
>> > cases I've seen, if someone else says "I looked at this very quickly and
>> > didn't see any glaring problems", it doesn't add any value for
>> > subsequent
>> > reviewers (someone still needs to take a thorough look).
>> >
>> > -Kay
>> >
>> > On Sat, Jan 17, 2015 at 8:04 PM,  wrote:
>> >
>> > > Yeah, the ASF +1 has become partly overloaded to mean both "I would
>> > > like
>> > > to see this feature" and "this patch should be committed", although,
>> > > at
>> > > least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
>> > > vote)
>> > > should unambiguously mean the latter unless qualified in some other
>> > > way.
>> > >
>> > > I don't have any opinion on the specific characters, but I agree with
>> > > Aaron that it would be nice to have some sort of abbreviation for both
>> > the
>> > > strong and weak forms of approval.
>> > >
>> > > -Sandy
>> > >
>> > > > On Jan 17, 2015, at 7:25 PM, Patrick Wendell 
>> > wrote:
>> > > >
>> > > > I think the ASF +1 is *slightly* different than Google's LGTM,
>> > > > because
>> > > > it might convey wanting the patch/feature to be merged but not
>> > > > necessarily saying you did a thorough review and stand behind it's
>> > > > technical contents. For instance, I've seen people pile on +1's to
>> > > > try
>> > > > and indicate support for a feature or patch in some projects, even
>> > > > though they didn't do a thorough technical review. This +1 is
>> > > > definitely a useful mechanism.
>> > > >
>> > > > There is definitely much overlap though in the meaning, though, and
>> > > > it's largely because Spark

Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
The wiki does not seem to be operational ATM, but I will do this when
it is back up.

On Mon, Jan 19, 2015 at 12:00 PM, Patrick Wendell  wrote:
> Okay - so given all this I was going to put the following on the wiki
> tentatively:
>
> ## Reviewing Code
> Community code review is Spark's fundamental quality assurance
> process. When reviewing a patch, your goal should be to help
> streamline the committing process by giving committers confidence this
> patch has been verified by an additional party. It's encouraged to
> (politely) submit technical feedback to the author to identify areas
> for improvement or potential bugs.
>
> If you feel a patch is ready for inclusion in Spark, indicate this to
> committers with a comment: "I think this patch looks good". Spark uses
> the LGTM convention for indicating the highest level of technical
> sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
> strong statement, it should be interpreted as the following: "I've
> looked at this thoroughly and take as much ownership as if I wrote the
> patch myself". If you comment LGTM you will be expected to help with
> bugs or follow-up issues on the patch. Judicious use of LGTM's is a
> great way to gain credibility as a reviewer with the broader
> community.
>
> It's also welcome for reviewers to argue against the inclusion of a
> feature or patch. Simply indicate this in the comments.
>
> - Patrick
>
> On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma  wrote:
>> Patrick's original proposal LGTM :).  However until now, I have been in the
>> impression of LGTM with special emphasis on TM part. That said, I will be
>> okay/happy(or Responsible ) for the patch, if it goes in.
>>
>> Prashant Sharma
>>
>>
>>
>> On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin  wrote:
>>>
>>> Maybe just to avoid LGTM as a single token when it is not actually
>>> according to Patrick's definition, but anybody can still leave comments
>>> like:
>>>
>>> "The direction of the PR looks good to me." or "+1 on the direction"
>>>
>>> "The build part looks good to me"
>>>
>>> ...
>>>
>>>
>>> On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout 
>>> wrote:
>>>
>>> > +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
>>> > I've
>>> > heard the semantics of "LGTM" expressed as "I've looked at this
>>> > thoroughly
>>> > and take as much ownership as if I wrote the patch myself".  My
>>> > understanding is that this is the level of review we expect for all
>>> > patches
>>> > that ultimately go into Spark, so it's important to have a way to
>>> > concisely
>>> > describe when this has been done.
>>> >
>>> > Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
>>> > cases I've seen, if someone else says "I looked at this very quickly and
>>> > didn't see any glaring problems", it doesn't add any value for
>>> > subsequent
>>> > reviewers (someone still needs to take a thorough look).
>>> >
>>> > -Kay
>>> >
>>> > On Sat, Jan 17, 2015 at 8:04 PM,  wrote:
>>> >
>>> > > Yeah, the ASF +1 has become partly overloaded to mean both "I would
>>> > > like
>>> > > to see this feature" and "this patch should be committed", although,
>>> > > at
>>> > > least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
>>> > > vote)
>>> > > should unambiguously mean the latter unless qualified in some other
>>> > > way.
>>> > >
>>> > > I don't have any opinion on the specific characters, but I agree with
>>> > > Aaron that it would be nice to have some sort of abbreviation for both
>>> > the
>>> > > strong and weak forms of approval.
>>> > >
>>> > > -Sandy
>>> > >
>>> > > > On Jan 17, 2015, at 7:25 PM, Patrick Wendell 
>>> > wrote:
>>> > > >
>>> > > > I think the ASF +1 is *slightly* different than Google's LGTM,
>>> > > > because
>>> > > > it might convey wanting the patch/feature to be merged but not
>>> > > > necessarily saying you did a thorough review and stand behind

Re: Standardized Spark dev environment

2015-01-20 Thread Patrick Wendell
To respond to Nick's original suggestion: I always thought it
would be useful to have a Docker image on which we run the tests and
build releases, so that we could have a consistent environment that
other packagers or people trying to exhaustively run Spark tests could
replicate (or at least look at) to understand exactly how we recommend
building Spark. Sean - do you think that is too much overhead?

In terms of providing images that we encourage as standard deployment
images of Spark and want to make portable across environments, that's
a much larger project and one with higher associated maintenance
overhead. So I'd be interested in seeing that evolve as its own
project (spark-deploy) or something associated with bigtop, etc.

- Patrick

On Tue, Jan 20, 2015 at 10:30 PM, Paolo Platter
 wrote:
> Hi all,
> I also tried the docker way and it works well.
> I suggest to look at sequenceiq/spark dockers, they are very active on that 
> field.
>
> Paolo
>
> Sent from my Windows Phone
> 
> From: jay vyas
> Sent: 21/01/2015 04:45
> To: Nicholas Chammas
> Cc: Will Benton; Spark dev
> list
> Subject: Re: Standardized Spark dev environment
>
> I can comment on both...  hi will and nate :)
>
> 1) Will's Dockerfile solution is the simplest, most direct solution to the
> dev environment question: it's an efficient way to build and develop Spark
> environments for dev/test.  It would be cool to put that Dockerfile
> (and/or maybe a shell script which uses it) in the top level of Spark as
> the build entry point.  For total platform portability, you could wrap it in a
> Vagrantfile to launch a lightweight VM, so that Windows worked equally
> well.
>
> 2) However, since Nate mentioned Vagrant and Bigtop, I have to chime in :)
> The Vagrant recipes in Bigtop are a nice reference deployment of how to
> deploy Spark in a heterogeneous Hadoop-style environment, and tighter
> integration testing with Bigtop for Spark releases would be lovely!  The
> Vagrant stuff uses Puppet to deploy an n-node VM or Docker-based cluster, in
> which users can easily select components (including
> Spark, YARN, HBase, Hadoop, etc.) by simply editing a YAML file:
> https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml
> As Nate said, it would be a lot of fun to get more cross-collaboration
> between the Spark and Bigtop communities.  Input on how we can better
> integrate Spark (whether it's Spork, HBase integration, smoke tests around
> the MLlib stuff, or whatever) is always welcome.
>
>
>
>
>
>
> On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> How many profiles (hadoop / hive /scala) would this development environment
>> support ?
>>
>> As many as we want. We probably want to cover a good chunk of the build
>> matrix  that Spark
>> officially supports.
>>
>> What does this provide, concretely?
>>
>> It provides a reliable way to create a "good" Spark development
>> environment. Roughly speaking, this probably should mean an environment
>> that matches Jenkins, since that's where we run "official" testing and
>> builds.
>>
>> For example, Spark has to run on Java 6 and Python 2.6. When devs build and
>> run Spark locally, we can make sure they're doing it on these versions of
>> the languages with a simple vagrant up.
>>
>> Nate, could you comment on how something like this would relate to the
>> Bigtop effort?
>>
>> http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
>>
>> Will, that's pretty sweet. I tried something similar a few months ago as an
>> experiment to try building/testing Spark within a container. Here's the
>> shell script I used
>> against the base CentOS Docker image to set up an environment ready to build
>> and test Spark.
>>
>> We want to run Spark unit tests within containers on Jenkins, so it might
>> make sense to develop a single Docker image that can be used as both a "dev
>> environment" as well as execution container on Jenkins.
>>
>> Perhaps that's the approach to take instead of looking into Vagrant.
>>
>> Nick
>>
>> On Tue Jan 20 2015 at 8:22:41 PM Will Benton  wrote:
>>
>> Hey Nick,
>> >
>> > I did something similar with a Docker image last summer; I haven't
>> updated
>> > the images to cache the dependencies for the current Spark master, but it
>> > would be trivial to do so:
>> >
>> > http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
>> >
>> >
>> > best,
>> > wb
>> >
>> >
>> > - Original Message -
>> > > From: "Nicholas Chammas" 
>> > > To: "Spark dev list" 
>> > > Sent: Tuesday, January 20, 2015 6:13:31 PM
>> > > Subject: Standardized Spark dev environment
>> > >
>> > > What do y'all think of creating a standardized Spark dev

Re: Standardized Spark dev environment

2015-01-21 Thread Patrick Wendell
> If the goal is a reproducible test environment then I think that is what
> Jenkins is. Granted you can only ask it for a test. But presumably you get
> the same result if you start from the same VM image as Jenkins and run the
> same steps.

But the issue is when users can't reproduce Jenkins failures. We don't
publish anywhere the exact set of packages and versions that
is installed on Jenkins. And it can change, since it's shared
infrastructure with other projects. So why not publish this manifest
as a Dockerfile and then have it run on Jenkins using that image? My
point is that this "VM image + steps" is not public anywhere.

> I bet it is not hard to set up and maintain. I bet it is easier than a VM.
> But unless Jenkins is using it aren't we just making another different
> standard build env in an effort to standardize? If it is not the same then
> it loses value as being exactly the same as the reference build env. Has a
> problem come up that this solves?

Right now the reference build env is an AMI I created and keep adding
stuff to when Spark gets new dependencies (e.g. the version of Ruby we
need to create the docs, new Python stats libraries, etc.). So if we
had a Docker image, I would use that for making the RCs as well,
and it could serve as a definitive reference for people who want to
understand exactly what set of things they need to build Spark.

>
> If the goal is just easing developer set up then what does a Docker image do
> - what does it set up for me? I don't know of stuff I need set up on OS X
> for me beyond the IDE.

There are actually a good number of packages you need to do a full
build of Spark, including a compliant Python version, a Java version,
certain Python packages, and Ruby and Jekyll for the docs, etc.
(mentioned a bit earlier).

- Patrick




Re: Standardized Spark dev environment

2015-01-21 Thread Patrick Wendell
Yep,

I think it's only useful (and likely to be maintained) if we actually
use this on Jenkins. So that was my proposal: basically, give people a
Dockerfile so they can understand exactly what versions of everything
we use for our reference build. And if they don't want to use Docker
directly, this will at least serve as an up-to-date list of
packages/versions they should try to install locally in whatever
environment they have.

- Patrick

On Wed, Jan 21, 2015 at 5:42 AM, Will Benton  wrote:
> - Original Message -----
>> From: "Patrick Wendell" 
>> To: "Sean Owen" 
>> Cc: "dev" , "jay vyas" , 
>> "Paolo Platter"
>> , "Nicholas Chammas" 
>> , "Will Benton" 
>> Sent: Wednesday, January 21, 2015 2:09:35 AM
>> Subject: Re: Standardized Spark dev environment
>
>> But the issue is when users can't reproduce Jenkins failures.
>
> Yeah, to answer Sean's question, this was part of the problem I was trying to 
> solve.  The other part was teasing out differences between the Fedora Java 
> environment and a more conventional Java environment.  I agree with Sean (and 
> I think this is your suggestion as well, Patrick) that making the environment 
> Jenkins runs a standard image that is available for public consumption would 
> be useful in general.
>
>
>
> best,
> wb




Upcoming Spark 1.2.1 RC

2015-01-21 Thread Patrick Wendell
Hey All,

I am planning to cut a 1.2.1 RC soon and wanted to notify people.

There are a handful of important fixes in the 1.2.1 branch
(http://s.apache.org/Mpn), particularly for Spark SQL. There was also
an issue publishing some of our artifacts with 1.2.0, and this release
would fix that for downstream projects.

You can track outstanding 1.2.1 blocker issues here at
http://s.apache.org/2v2 - I'm guessing all remaining blocker issues
will be fixed today.

I think we have a good handle on the remaining outstanding fixes, but
please let me know if you think there are severe outstanding fixes
that need to be backported into this branch or are not tracked above.

Thanks!
- Patrick




Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Patrick Wendell
One thing potentially not clear from this e-mail: there will be a 1:1
correspondence, so you can get an RDD to/from a DataFrame.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin  wrote:
> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> more users to be programming directly against SchemaRDD API rather than the
> core RDD API. SchemaRDD, through its less commonly used DSL originally
> designed for writing test cases, always has the data-frame like API. In
> 1.3, we are redesigning the API to make the API usable for end users.
>
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc), and
> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
> DataFrame.rdd will return the underlying RDD for all RDD methods.
>
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because they are not well documented. However,
> to maintain backward compatibility, we can create a type alias DataFrame
> that is still named SchemaRDD. This will maintain source compatibility for
> Scala. That said, we will have to update all existing materials to use
> DataFrame rather than SchemaRDD.
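
To make the alias idea in the message above concrete, here is a minimal,
self-contained Scala sketch. The Row and DataFrame below are toy stand-ins
rather than the real Spark classes, and the names are illustrative only; it
just shows how the old name can survive as a type alias and how an explicit
.rdd-style accessor provides the 1:1 correspondence mentioned at the top of
this message.

    // Toy illustration only -- these are stand-ins, not Spark's actual classes.
    final case class Row(values: Seq[Any])

    final class DataFrame(private val rows: Seq[Row]) {
      // stand-in for DataFrame.rdd, the escape hatch back to the RDD world
      def rdd: Seq[Row] = rows
    }

    object sql {
      // source compatibility: old code that says SchemaRDD keeps compiling
      type SchemaRDD = DataFrame
    }

    object AliasDemo extends App {
      val df = new DataFrame(Seq(Row(Seq("Alice", 30))))
      val legacy: sql.SchemaRDD = df   // accepted thanks to the alias
      println(legacy.rdd.head.values)  // List(Alice, 30)
    }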




[VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-26 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1061/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until Friday, January 30, at 07:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see
http://spark.apache.org/

- Patrick




Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Hey Sean,

The release script generates hashes in two places (take a look a bit
further down in the script), one for the published artifacts and the
other for the binaries. In the case of the binaries we use SHA512
because, AFAIK, the ASF does not require you to use SHA1 and SHA512 is
better. In the case of the published Maven artifacts we use SHA1
because my understanding is that this is what Maven requires. However, it
does appear that the format is now one that Maven cannot parse.

Anyways, it seems fine to just change the format of the hash per your PR.
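
To make the format point concrete: tools in the Maven ecosystem expect a
.sha1 file to contain only the bare hex digest, with no file name or extra
whitespace. The following is an illustrative Scala sketch of producing that
bare format; it is not the actual release script, and the file names are
hypothetical.

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest

    object BareChecksum {
      // Returns just the lowercase hex digest -- no file name, no spaces.
      def of(path: String, algorithm: String = "SHA-1"): String = {
        val bytes = Files.readAllBytes(Paths.get(path))
        MessageDigest.getInstance(algorithm)
          .digest(bytes)
          .map("%02x".format(_))
          .mkString
      }

      def main(args: Array[String]): Unit = {
        val artifact = args(0)  // e.g. some-artifact.jar
        // write some-artifact.jar.sha1 next to the artifact
        Files.write(Paths.get(artifact + ".sha1"), of(artifact).getBytes("UTF-8"))
      }
    }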

- Patrick

On Tue, Jan 27, 2015 at 5:00 AM, Sean Owen  wrote:
> I think there are several signing / hash issues that should be fixed
> before this release.
>
> Hashes:
>
> http://issues.apache.org/jira/browse/SPARK-5308
> https://github.com/apache/spark/pull/4161
>
> The hashes here are correct, but have two issues:
>
> As noted in the JIRA, the format of the hash file is "nonstandard" --
> at least, doesn't match what Maven outputs, and apparently which tools
> like Leiningen expect, which is just the hash with no file name or
> spaces. There are two ways to fix that: different command-line tools
> (see PR), or, just ask Maven to generate these hashes (a different,
> easy PR).
>
> However, is the script I modified above used to generate these hashes?
> It's generating SHA1 sums, but the output in this release candidate
> has (correct) SHA512 sums.
>
> This may be more than a nuisance, since last time for some reason
> Maven Central did not register the project hashes.
>
> http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-core_2.10%7C1.2.0%7Cjar
> does not show them but they exist:
> http://www.us.apache.org/dist/spark/spark-1.2.0/
>
> It may add up to a problem worth rooting out before this release.
>
>
> Signing:
>
> As noted in https://issues.apache.org/jira/browse/SPARK-5299 there are
> two signing keys in
> https://people.apache.org/keys/committer/pwendell.asc (9E4FE3AF,
> 00799F7E) but only one is in http://www.apache.org/dist/spark/KEYS
>
> However, these artifacts seem to be signed by FC8ED089 which isn't in either.
>
> Details details, but I'd say non-binding -1 at the moment.
>
>
> On Tue, Jan 27, 2015 at 7:02 AM, Patrick Wendell  wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.1!
>>
>> The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1061/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.1!
>>
>> The vote is open until Friday, January 30, at 07:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.2.1
>> [ ] -1 Do not release this package because ...
>>
>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>




Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Yes - the key issue is just due to me creating new keys this time
around. Anyways, let's take another stab at this. In the meantime,
please don't hesitate to test the release itself.

- Patrick

On Tue, Jan 27, 2015 at 10:00 AM, Sean Owen  wrote:
> Got it. Ignore the SHA512 issue since these aren't somehow expected by
> a policy or Maven to be in a certain format. Just wondered if the
> difference was intended.
>
> The Maven way of generating the SHA1 hashes is to set this on the
> install plugin, AFAIK, although I'm not sure if the intent was to hash
> files that Maven didn't create:
>
> <configuration>
>   <createChecksum>true</createChecksum>
> </configuration>
>
> As for the key issue, I think it's just a matter of uploading the new
> key in both places.
>
> We should all of course test the release anyway.
>
> On Tue, Jan 27, 2015 at 5:55 PM, Patrick Wendell  wrote:
>> Hey Sean,
>>
>> The release script generates hashes in two places (take a look a bit
>> further down in the script), one for the published artifacts and the
>> other for the binaries. In the case of the binaries we use SHA512
>> because, AFAIK, the ASF does not require you to use SHA1 and SHA512 is
>> better. In the case of the published Maven artifacts we use SHA1
>> because my understanding is this is what Maven requires. However, it
>> does appear that the format is now one that maven cannot parse.
>>
>> Anyways, it seems fine to just change the format of the hash per your PR.
>>
>> - Patrick
>>




Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Hey Sean,

Right now we don't publish every 2.11 binary, to avoid a combinatorial
explosion in the number of build artifacts we publish (there are other
parameters, such as whether Hive is included, etc.). We can revisit this
in future feature releases, but .1 releases like this one are reserved for
bug fixes.

- Patrick

On Tue, Jan 27, 2015 at 10:31 AM, Sean McNamara
 wrote:
> We're using spark on scala 2.11 /w hadoop2.4.  Would it be practical / make 
> sense to build a bin version of spark against scala 2.11 for versions other 
> than just hadoop1 at this time?
>
> Cheers,
>
> Sean
>
>
>> On Jan 27, 2015, at 12:04 AM, Patrick Wendell  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.1!
>>
>> The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1061/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.1!
>>
>> The vote is open until Friday, January 30, at 07:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.2.1
>> [ ] -1 Do not release this package because ...
>>
>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>




Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Okay - we've resolved all issues with the signatures and keys.
However, I'll leave the current vote open for a bit to solicit
additional feedback.

On Tue, Jan 27, 2015 at 10:43 AM, Sean McNamara
 wrote:
> Sounds good, that makes sense.
>
> Cheers,
>
> Sean
>
>> On Jan 27, 2015, at 11:35 AM, Patrick Wendell  wrote:
>>
>> Hey Sean,
>>
>> Right now we don't publish every 2.11 binary to avoid combinatorial
>> explosion of the number of build artifacts we publish (there are other
>> parameters such as whether hive is included, etc). We can revisit this
>> in future feature releases, but .1 releases like this are reserved for
>> bug fixes.
>>
>> - Patrick
>>
>> On Tue, Jan 27, 2015 at 10:31 AM, Sean McNamara
>>  wrote:
>>> We're using spark on scala 2.11 /w hadoop2.4.  Would it be practical / make 
>>> sense to build a bin version of spark against scala 2.11 for versions other 
>>> than just hadoop1 at this time?
>>>
>>> Cheers,
>>>
>>> Sean
>>>
>>>
>>>> On Jan 27, 2015, at 12:04 AM, Patrick Wendell  wrote:
>>>>
>>>> Please vote on releasing the following candidate as Apache Spark version 
>>>> 1.2.1!
>>>>
>>>> The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1061/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>>>>
>>>> Please vote on releasing this package as Apache Spark 1.2.1!
>>>>
>>>> The vote is open until Friday, January 30, at 07:00 UTC and passes
>>>> if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 1.2.1
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>>>
>>>> To learn more about Apache Spark, please see
>>>> http://spark.apache.org/
>>>>
>>>> - Patrick
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>
>>>
>




Friendly reminder/request to help with reviews!

2015-01-27 Thread Patrick Wendell
Hey All,

Just a reminder: as always around release time, we have a very large
volume of patches showing up near the deadline.

One thing that can help us maximize the number of patches we get in is
to have community involvement in performing code reviews. And in
particular, doing a thorough review and signing off on a patch with
LGTM can substantially increase the odds we can merge a patch
confidently.

If you are newer to Spark, finding a single area of the codebase to
focus on can still provide a lot of value to the project in the
reviewing process.

Cheers and good luck with everyone on work for this release.

- Patrick




[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-28 Thread Patrick Wendell
This vote is cancelled in favor of RC2.

On Tue, Jan 27, 2015 at 4:20 PM, Reynold Xin  wrote:
> +1
>
> Tested on Mac OS X
>
> On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar 
> wrote:
>>
>> +1
>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min
>>  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
>> -Dhadoop.version=2.6.0 -Phive -DskipTests
>> 2. Tested pyspark, mlib - running as well as compare results with 1.1.x &
>> 1.2.0
>> 2.1. statistics OK
>> 2.2. Linear/Ridge/Lasso Regression OK
>> 2.3. Decision Tree, Naive Bayes OK
>> 2.4. KMeans OK
>>Center And Scale OK
>>Fixed : org.apache.spark.SparkException in zip !
>> 2.5. rdd operations OK
>>State of the Union Texts - MapReduce, Filter,sortByKey (word count)
>> 2.6. recommendation OK
>>
>> Cheers
>> 
>>
>> On Mon, Jan 26, 2015 at 11:02 PM, Patrick Wendell 
>> wrote:
>>
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.2.1!
>> >
>> > The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>> >
>> >
>> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1061/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>> >
>> > Please vote on releasing this package as Apache Spark 1.2.1!
>> >
>> > The vote is open until Friday, January 30, at 07:00 UTC and passes
>> > if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.2.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > For a list of fixes in this release, see http://s.apache.org/Mpn.
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>> >
>> > - Patrick
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>
>




[VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc1 (commit b77f876):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1062/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/

Changes from rc1:
This has no code changes from RC1. Only minor changes to the release script.

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until  Saturday, January 31, at 10:04 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see
http://spark.apache.org/




Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Minor typo in the above e-mail - the tag is named v1.2.1-rc2 (not v1.2.1-rc1).

On Wed, Jan 28, 2015 at 2:06 AM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.2.1!
>
> The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1062/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
>
> Changes from rc1:
> This has no code changes from RC1. Only minor changes to the release script.
>
> Please vote on releasing this package as Apache Spark 1.2.1!
>
> The vote is open until  Saturday, January 31, at 10:04 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.1
> [ ] -1 Do not release this package because ...
>
> For a list of fixes in this release, see http://s.apache.org/Mpn.
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/




Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Yes - it fixes that issue.

On Wed, Jan 28, 2015 at 2:17 AM, Aniket  wrote:
> Hi Patrick,
>
> I am wondering if this version will address issues around certain artifacts
> not getting published in 1.2 which are gating people to migrate to 1.2. One
> such issue is https://issues.apache.org/jira/browse/SPARK-5144
>
> Thanks,
> Aniket
>
> On Wed Jan 28 2015 at 15:39:43 Patrick Wendell [via Apache Spark Developers
> List]  wrote:
>
>> Minor typo in the above e-mail - the tag is named v1.2.1-rc2 (not
>> v1.2.1-rc1).
>>
>> On Wed, Jan 28, 2015 at 2:06 AM, Patrick Wendell wrote:
>>
>> > Please vote on releasing the following candidate as Apache Spark version
>> 1.2.1!
>> >
>> > The tag to be voted on is v1.2.1-rc1 (commit b77f876):
>> >
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc2/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1062/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
>> >
>> > Changes from rc1:
>> > This has no code changes from RC1. Only minor changes to the release
>> script.
>> >
>> > Please vote on releasing this package as Apache Spark 1.2.1!
>> >
>> > The vote is open until  Saturday, January 31, at 10:04 UTC and passes
>> > if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.2.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > For a list of fixes in this release, see http://s.apache.org/Mpn.
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>>
>>
>>
>>
>>
>
>
>
>




Re: spark akka fork : is the source anywhere?

2015-01-28 Thread Patrick Wendell
It's maintained here:

https://github.com/pwendell/akka/tree/2.2.3-shaded-proto

Over time, this is something that would be great to get rid of, per rxin.
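
For anyone wondering how a build ends up on the fork rather than upstream
Akka, a hypothetical sbt fragment is sketched below. The organization and
version come from the POM fragment quoted further down; the module names and
Scala suffix are assumptions for illustration only.

    // Hypothetical build.sbt lines -- module names are illustrative.
    libraryDependencies ++= Seq(
      "org.spark-project.akka" % "akka-actor_2.10"  % "2.3.4-spark",
      "org.spark-project.akka" % "akka-remote_2.10" % "2.3.4-spark"
    )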

On Wed, Jan 28, 2015 at 3:33 PM, Reynold Xin  wrote:
> Hopefully problems like this will go away entirely in the next couple of
> releases. https://issues.apache.org/jira/browse/SPARK-5293
>
>
>
> On Wed, Jan 28, 2015 at 3:12 PM, jay vyas 
> wrote:
>
>> Hi spark. Where is akka coming from in spark ?
>>
>> I see the distribution referenced is a spark artifact... but not in the
>> apache namespace.
>>
>>  org.spark-project.akka
>>  2.3.4-spark
>>
>> Clearly this is a deliberate, thought-out change (see SPARK-1812), but it's
>> not clear where 2.3.4-spark is coming from and who is maintaining its
>> release?
>>
>> --
>> jay vyas
>>
>> PS
>>
>> I've had some conversations with will benton as well about this, and its
>> clear that some modifications to akka are needed, or else a protobuf error
>> occurs, which amount to serialization incompatibilities, hence if one wants
>> to build spark from sources, the patched akka is required (or else, manual
>> patching needs to be done)...
>>
>> 15/01/28 22:58:10 ERROR ActorSystemImpl: Uncaught fatal error from thread
>> [sparkWorker-akka.remote.default-remote-dispatcher-6] shutting down
>> ActorSystem [sparkWorker] java.lang.VerifyError: class
>> akka.remote.WireFormats$AkkaControlMessage overrides final method
>> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
>>




Re: Questions about Spark standalone resource scheduler

2015-02-02 Thread Patrick Wendell
Hey Jerry,

I think standalone mode will still add more features over time, but
the goal isn't really for it to become equivalent to what Mesos/YARN
are today. Or at least, I doubt Spark Standalone will ever attempt to
manage _other_ frameworks outside of Spark and become a general
purpose resource manager.

In terms of having better support for multi-tenancy, meaning multiple
*Spark* instances, this is something I think could be in scope in the
future. For instance, we added H/A to the standalone scheduler a while
back, because it let us support H/A streaming apps in a totally native
way. It's a trade-off between adding new features and keeping the scheduler
very simple and easy to use. We've tended to bias towards simplicity
as the main goal, since this is something we want to be really easy
"out of the box".

One thing to point out: a lot of people use the standalone mode with
some coarser-grained scheduler, such as running in a cloud service. In
this case they really just want a simple "inner" cluster manager. This
may even be the majority of all Spark installations. This is slightly
different from Hadoop environments, where they might just want nice
integration into the existing Hadoop stack via something like YARN.

- Patrick

On Mon, Feb 2, 2015 at 12:24 AM, Shao, Saisai  wrote:
> Hi all,
>
>
>
> I have some questions about the future development of Spark's standalone
> resource scheduler. We've heard some users have the requirements to have
> multi-tenant support in standalone mode, like multi-user management,
> resource management and isolation, whitelist of users. Seems current Spark
> standalone do not support such kind of functionalities, while resource
> schedulers like Yarn offers such kind of advanced managements, I'm not sure
> what's the future target of standalone resource scheduler, will it only
> target on simple implementation, and for advanced usage shift to YARN? Or
> will it plan to add some simple multi-tenant related functionalities?
>
>
>
> Thanks a lot for your comments.
>
>
>
> BR
>
> Jerry




Re: Spark Master Maven with YARN build is broken

2015-02-02 Thread Patrick Wendell
It's my fault; I'm sending a hotfix now.

On Mon, Feb 2, 2015 at 1:44 PM, Nicholas Chammas
 wrote:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/
>
> Is this is a known issue? It seems to have been broken since last night.
>
> Here's a snippet from the build output of one of the builds
> 
> :
>
> [error] bad symbolic reference. A signature in WebUI.class refers to
> term eclipse
> [error] in package org which is not available.
> [error] It may be completely missing from the current classpath, or
> the version on
> [error] the classpath might be incompatible with the version used when
> compiling WebUI.class.
> [error] bad symbolic reference. A signature in WebUI.class refers to term 
> jetty
> [error] in value org.eclipse which is not available.
> [error] It may be completely missing from the current classpath, or
> the version on
> [error] the classpath might be incompatible with the version used when
> compiling WebUI.class.
> [error]
> [error]  while compiling:
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/centos/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
> [error] during phase: erasure
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4
>
> Nick
>




Temporary jenkins issue

2015-02-02 Thread Patrick Wendell
Hey All,

I made a change to the Jenkins configuration (attempting to enable a
new plugin) that caused most builds to fail; I reverted the change
about 10 minutes ago.

If you've seen recent build failures like below, this was caused by
that change. Sorry about that.


ERROR: Publisher
com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver
aborted due to exception
java.lang.NoSuchMethodError:
hudson.model.AbstractBuild.getTestResultAction()Lhudson/tasks/test/AbstractTestResultAction;
at 
com.google.jenkins.flakyTestHandler.plugin.FlakyTestResultAction.(FlakyTestResultAction.java:78)
at 
com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver.perform(JUnitFlakyResultArchiver.java:89)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
at hudson.model.Build$BuildExecution.post2(Build.java:183)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
at hudson.model.Run.execute(Run.java:1784)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:89)
at hudson.model.Executor.run(Executor.java:240)


- Patrick




Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
The reported Windows issue only affects actually running Spark on
Windows (not job submission). However, I agree it's worth cutting a
new RC. I'm going to cancel this vote and propose RC3 with a single
additional patch. Let's try to vote that through so we can ship Spark
1.2.1.

- Patrick

On Sat, Jan 31, 2015 at 7:36 PM, Matei Zaharia  wrote:
> This looks like a pretty serious problem, thanks! Glad people are testing on 
> Windows.
>
> Matei
>
>> On Jan 31, 2015, at 11:57 AM, MartinWeindel  wrote:
>>
>> FYI: Spark 1.2.1rc2 does not work on Windows!
>>
>> On creating a Spark context you get following log output on my Windows
>> machine:
>> INFO  org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
>> ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in
>> C:\Users\mweindel\AppData\Local\Temp\. Ignoring this directory.
>> ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any
>> local dir.
>>
>> I have already located the cause. A newly added function chmod700() in
>> org.apache.spark.util.Utils uses functionality which only works on a Unix file
>> system.
>>
>> See also pull request [https://github.com/apache/spark/pull/4299] for my
>> suggestion how to resolve the issue.
>>
>> Best regards,
>>
>> Martin Weindel
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10370.html
>> Sent from the Apache Spark Developers List mailing list archive at 
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
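
For context on the quoted report, here is a minimal sketch of a portable way
to restrict a directory to its owner using only java.io.File methods, which
are available on both POSIX and Windows (each setter simply reports failure
instead of throwing where the underlying file system cannot honor it). This
is illustrative only and is not the actual fix in the pull request referenced
above.

    import java.io.File

    object OwnerOnly {
      // Best-effort "chmod 700": strip each permission for everyone, then
      // grant it back to the owner only. Each setter returns false if the
      // operation is not supported on the underlying file system.
      def chmod700(dir: File): Boolean = {
        dir.setReadable(false, false)   && dir.setReadable(true, true) &&
        dir.setWritable(false, false)   && dir.setWritable(true, true) &&
        dir.setExecutable(false, false) && dir.setExecutable(true, true)
      }

      def main(args: Array[String]): Unit = {
        val dir = new File(args(0))
        if (!chmod700(dir)) {
          System.err.println(s"Could not fully restrict permissions on $dir")
        }
      }
    }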




[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
This vote is cancelled in favor of RC3.

On Mon, Feb 2, 2015 at 8:50 PM, Patrick Wendell  wrote:
> The windows issue reported only affects actually running Spark on
> Windows (not job submission). However, I agree it's worth cutting a
> new RC. I'm going to cancel this vote and propose RC3 with a single
> additional patch. Let's try to vote that through so we can ship Spark
> 1.2.1.
>
> - Patrick
>
> On Sat, Jan 31, 2015 at 7:36 PM, Matei Zaharia  
> wrote:
>> This looks like a pretty serious problem, thanks! Glad people are testing on 
>> Windows.
>>
>> Matei
>>
>>> On Jan 31, 2015, at 11:57 AM, MartinWeindel  
>>> wrote:
>>>
>>> FYI: Spark 1.2.1rc2 does not work on Windows!
>>>
>>> On creating a Spark context you get following log output on my Windows
>>> machine:
>>> INFO  org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
>>> ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in
>>> C:\Users\mweindel\AppData\Local\Temp\. Ignoring this directory.
>>> ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any
>>> local dir.
>>>
>>> I have already located the cause. A newly added function chmod700() in
>>> org.apache.util.Utils uses functionality which only works on a Unix file
>>> system.
>>>
>>> See also pull request [https://github.com/apache/spark/pull/4299] for my
>>> suggestion how to resolve the issue.
>>>
>>> Best regards,
>>>
>>> Martin Weindel
>>>
>>>
>>>
>>> --
>>> View this message in context: 
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10370.html
>>> Sent from the Apache Spark Developers List mailing list archive at 
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-02 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1065/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/

Changes from rc2:
A single patch fixing a windows issue.

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until Friday, February 06, at 05:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[ANNOUNCE] branch-1.3 has been cut

2015-02-03 Thread Patrick Wendell
Hey All,

Just wanted to announce that we've cut the 1.3 branch which will
become the 1.3 release after community testing.

There are still some features that will go in (in higher level
libraries, and some stragglers in spark core), but overall this
indicates the end of major feature development for Spark 1.3 and a
transition into testing.

Within a few days I'll cut a snapshot package release for this so that
people can begin testing.

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-1.3

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.2.1-rc3 - Avro input format for Hadoop 2 broken/fix?

2015-02-04 Thread Patrick Wendell
Hi Markus,

That won't be included in 1.2.1 most likely because the release votes
have already started, and at that point we don't hold the release
except for major regression issues from 1.2.0. However, if this goes
through we can backport it into the 1.2 branch and it will end up in a
future maintenance release, or you can just build spark from that
branch as soon as it's in there.

- Patrick

On Wed, Feb 4, 2015 at 7:30 AM, M. Dale  wrote:
> SPARK-3039 "Spark assembly for new hadoop API (hadoop 2) contains
> avro-mapred for hadoop 1 API" was reopened
> and prevents v.1.2.1-rc3 from using Avro Input format for Hadoop 2
> API/instances (it includes the hadoop1 avro-mapred library files).
>
> What are the chances of getting the fix outlined here
> (https://github.com/medale/spark/compare/apache:v1.2.1-rc3...avro-hadoop2-v1.2.1-rc2)
> included in 1.2.1? My apologies, I do not know how to generate a pull
> request against a tag version.
>
> I did add pull request https://github.com/apache/spark/pull/4315 for the
> current 1.3.0-SNAPSHOT master on this issue. Even though 1.3.0 build already
> does not include avro-mapred in the spark assembly jar this minor change
> improves dependence convergence.
>
> Thanks,
> Markus
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: multi-line comment style

2015-02-04 Thread Patrick Wendell
Personally I have no opinion, but agree it would be nice to standardize.

- Patrick

On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen  wrote:
> One thing Marcelo pointed out to me is that the // style does not
> interfere with commenting out blocks of code with /* */, which is a
> small good thing. I am also accustomed to // style for multiline, and
> reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style
> inline always looks a little funny to me.
>
> On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout  
> wrote:
>> Hi all,
>>
>> The Spark Style Guide
>> 
>> says multi-line comments should formatted as:
>>
>> /*
>>  * This is a
>>  * very
>>  * long comment.
>>  */
>>
>> But in my experience, we almost always use "//" for multi-line comments:
>>
>> // This is a
>> // very
>> // long comment.
>>
>> Here are some examples:
>>
>>- Recent commit by Reynold, king of style:
>>
>> https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
>>- RDD.scala:
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
>>- DAGScheduler.scala:
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281
>>
>>
>> Any objections to me updating the style guide to reflect this?  As with
>> other style issues, I think consistency here is helpful (and formatting
>> multi-line comments as "//" does nicely visually distinguish code comments
>> from doc comments).
>>
>> -Kay
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: PSA: Maven supports parallel builds

2015-02-05 Thread Patrick Wendell
I've done this in the past, but back when I wasn't using Zinc it
didn't make a big difference. It's worth doing this in our jenkins
environment though.

- Patrick

On Thu, Feb 5, 2015 at 4:52 PM, Dirceu Semighini Filho
 wrote:
> Thanks Nicholas, I didn't know this.
>
> 2015-02-05 22:16 GMT-02:00 Nicholas Chammas :
>
>> Y'all may already know this, but I haven't seen it mentioned anywhere in
>> our docs or on here, and it's a pretty easy win.
>>
>> Maven supports parallel builds
>> <
>> https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
>> >
>> with the -T command line option.
>>
>> For example:
>>
>> ./build/mvn -T 1C -Dhadoop.version=1.2.1 -DskipTests clean package
>>
>> This will have Maven use 1 thread per core on your machine to build Spark.
>>
>> On my little MacBook air, this cuts the build time from 14 minutes to 10.5
>> minutes. A machine with more cores should see a bigger improvement.
>>
>> Note though that the docs mark this as experimental, so I wouldn't change
>> our reference build to use this. But it should be useful, for example, in
>> Jenkins or when working locally.
>>
>> Nick
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Improving metadata in Spark JIRA

2015-02-06 Thread Patrick Wendell
Per Nick's suggestion I added two components:

1. Spark Submit
2. Spark Scheduler

I figured I would just add these since if we decide later we don't
want them, we can simply merge them into Spark Core.

On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas
 wrote:
> Do we need some new components to be added to the JIRA project?
>
> Like:
>
> - scheduler
> - YARN
> - spark-submit
> - ...?
>
> Nick
>
>
> On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> +9000 on cleaning up JIRA.
>>
>> Thank you Sean for laying out some specific things to tackle. I will
>> assist with this.
>>
>> Regarding email, I think Sandy is right. I only get JIRA email for issues
>> I'm watching.
>>
>> Nick
>>
>> On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza 
>> wrote:
>>
>>> JIRA updates don't go to this list, they go to iss...@spark.apache.org.
>>> I
>>> don't think many are signed up for that list, and those that are probably
>>> have a flood of emails anyway.
>>>
>>> So I'd definitely be in favor of any JIRA cleanup that you're up for.
>>>
>>> -Sandy
>>>
>>> On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen  wrote:
>>>
>>> > I've wasted no time in wielding the commit bit to complete a number of
>>> > small, uncontroversial changes. I wouldn't commit anything that didn't
>>> > already appear to have review, consensus and little risk, but please
>>> > let me know if anything looked a little too bold, so I can calibrate.
>>> >
>>> >
>>> > Anyway, I'd like to continue some small house-cleaning by improving
>>> > the state of JIRA's metadata, in order to let it give us a little
>>> > clearer view on what's happening in the project:
>>> >
>>> > a. Add Component to every (open) issue that's missing one
>>> > b. Review all Critical / Blocker issues to de-escalate ones that seem
>>> > obviously neither
>>> > c. Correct open issues that list a Fix version that has already been
>>> > released
>>> > d. Close all issues Resolved for a release that has already been
>>> released
>>> >
>>> > The problem with doing so is that it will create a tremendous amount
>>> > of email to the list, like, several hundred. It's possible to make
>>> > bulk changes and suppress e-mail though, which could be done for all
>>> > but b.
>>> >
>>> > Better to suppress the emails when making such changes? or just not
>>> > bother on some of these?
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>> >
>>> >
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread Patrick Wendell
I'll add a +1 as well.

On Fri, Feb 6, 2015 at 2:38 PM, Matei Zaharia  wrote:
> +1
>
> Tested on Mac OS X.
>
> Matei
>
>
>> On Feb 2, 2015, at 8:57 PM, Patrick Wendell  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.1!
>>
>> The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc3/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1065/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/
>>
>> Changes from rc2:
>> A single patch fixing a windows issue.
>>
>> Please vote on releasing this package as Apache Spark 1.2.1!
>>
>> The vote is open until Friday, February 06, at 05:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.2.1
>> [ ] -1 Do not release this package because ...
>>
>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread Patrick Wendell
This vote passes with 5 +1 votes (3 binding) and no 0 or -1 votes.

+1 Votes:
Krishna Sankar
Sean Owen*
Chip Senkbeil
Matei Zaharia*
Patrick Wendell*

0 Votes:
(none)

-1 Votes:
(none)

On Fri, Feb 6, 2015 at 5:12 PM, Patrick Wendell  wrote:
> I'll add a +1 as well.
>
> On Fri, Feb 6, 2015 at 2:38 PM, Matei Zaharia  wrote:
>> +1
>>
>> Tested on Mac OS X.
>>
>> Matei
>>
>>
>>> On Feb 2, 2015, at 8:57 PM, Patrick Wendell  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 1.2.1!
>>>
>>> The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.1-rc3/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1065/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/
>>>
>>> Changes from rc2:
>>> A single patch fixing a windows issue.
>>>
>>> Please vote on releasing this package as Apache Spark 1.2.1!
>>>
>>> The vote is open until Friday, February 06, at 05:00 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.2.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Improving metadata in Spark JIRA

2015-02-08 Thread Patrick Wendell
I think we already have a YARN component.

https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20YARN

I don't think JIRA allows it to be mandatory, but if it does, that
would be useful.

On Sat, Feb 7, 2015 at 5:08 PM, Nicholas Chammas
 wrote:
> By the way, isn't it possible to make the "Component" field mandatory when
> people open new issues? Shouldn't we do that?
>
> Btw Patrick, don't we need a YARN component? I think our JIRA components
> should roughly match the components on the PR dashboard.
>
> Nick
>
> On Fri Feb 06 2015 at 12:25:52 PM Patrick Wendell 
> wrote:
>>
>> Per Nick's suggestion I added two components:
>>
>> 1. Spark Submit
>> 2. Spark Scheduler
>>
>> I figured I would just add these since if we decide later we don't
>> want them, we can simply merge them into Spark Core.
>>
>> On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas
>>  wrote:
>> > Do we need some new components to be added to the JIRA project?
>> >
>> > Like:
>> >
>> > - scheduler
>> > - YARN
>> > - spark-submit
>> > - ...?
>> >
>> > Nick
>> >
>> >
>> > On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas <
>> > nicholas.cham...@gmail.com> wrote:
>> >
>> >> +9000 on cleaning up JIRA.
>> >>
>> >> Thank you Sean for laying out some specific things to tackle. I will
>> >> assist with this.
>> >>
>> >> Regarding email, I think Sandy is right. I only get JIRA email for
>> >> issues
>> >> I'm watching.
>> >>
>> >> Nick
>> >>
>> >> On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza 
>> >> wrote:
>> >>
>> >>> JIRA updates don't go to this list, they go to
>> >>> iss...@spark.apache.org.
>> >>> I
>> >>> don't think many are signed up for that list, and those that are
>> >>> probably
>> >>> have a flood of emails anyway.
>> >>>
>> >>> So I'd definitely be in favor of any JIRA cleanup that you're up for.
>> >>>
>> >>> -Sandy
>> >>>
>> >>> On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen  wrote:
>> >>>
>> >>> > I've wasted no time in wielding the commit bit to complete a number
>> >>> > of
>> >>> > small, uncontroversial changes. I wouldn't commit anything that
>> >>> > didn't
>> >>> > already appear to have review, consensus and little risk, but please
>> >>> > let me know if anything looked a little too bold, so I can
>> >>> > calibrate.
>> >>> >
>> >>> >
>> >>> > Anyway, I'd like to continue some small house-cleaning by improving
>> >>> > the state of JIRA's metadata, in order to let it give us a little
>> >>> > clearer view on what's happening in the project:
>> >>> >
>> >>> > a. Add Component to every (open) issue that's missing one
>> >>> > b. Review all Critical / Blocker issues to de-escalate ones that
>> >>> > seem
>> >>> > obviously neither
>> >>> > c. Correct open issues that list a Fix version that has already been
>> >>> > released
>> >>> > d. Close all issues Resolved for a release that has already been
>> >>> released
>> >>> >
>> >>> > The problem with doing so is that it will create a tremendous amount
>> >>> > of email to the list, like, several hundred. It's possible to make
>> >>> > bulk changes and suppress e-mail though, which could be done for all
>> >>> > but b.
>> >>> >
>> >>> > Better to suppress the emails when making such changes? or just not
>> >>> > bother on some of these?
>> >>> >
>> >>> >
>> >>> > -
>> >>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >>> >
>> >>> >
>> >>>
>> >>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Unit tests

2015-02-08 Thread Patrick Wendell
Hey All,

The tests are in a not-amazing state right now due to a few compounding factors:

1. We've merged a large volume of patches recently.
2. The load on jenkins has been relatively high, exposing races and
other behavior not seen at lower load.

For those not familiar, the main issue is flaky (non-deterministic)
test failures. Right now I'm trying to prioritize keeping the
PullRequestBuilder in good shape, since it will block development if it
is down.

For other tests, let's try to keep filing JIRAs when we see issues
and use the flaky-test label (see http://bit.ly/1yRif9S).

I may contact people regarding specific tests. This is a very high
priority to get in good shape. This kind of thing is no one's "fault"
but just the result of a lot of concurrent development, and everyone
needs to pitch in to get back in a good place.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Keep or remove Debian packaging in Spark?

2015-02-09 Thread Patrick Wendell
I have wondered whether we should sort of deprecate it more
officially, since otherwise I think people have the reasonable
expectation based on the current code that Spark intends to support
"complete" Debian packaging as part of the upstream build. Having
something that's sort-of maintained but no one is helping review and
merge patches on it or make it fully functional, IMO that doesn't
benefit us or our users. There are a bunch of other projects that are
specifically devoted to packaging, so it seems like there is a clear
separation of concerns here.

On Mon, Feb 9, 2015 at 7:31 AM, Mark Hamstra  wrote:
>>
>> it sounds like nobody intends these to be used to actually deploy Spark
>
>
> I wouldn't go quite that far.  What we have now can serve as useful input
> to a deployment tool like Chef, but the user is then going to need to add
> some customization or configuration within the context of that tooling to
> get Spark installed just the way they want.  So it is not so much that the
> current Debian packaging can't be used as that it has never really been
> intended to be a completely finished product that a newcomer could, for
> example, use to install Spark completely and quickly to Ubuntu and have a
> fully-functional environment in which they could then run all of the
> examples, tutorials, etc.
>
> Getting to that level of packaging (and maintenance) is something that I'm
> not sure we want to do since that is a better fit with Bigtop and the
> efforts of Cloudera, Hortonworks, MapR, etc. to distribute Spark.
>
> On Mon, Feb 9, 2015 at 2:41 AM, Sean Owen  wrote:
>
>> This is a straw poll to assess whether there is support to keep and
>> fix, or remove, the Debian packaging-related config in Spark.
>>
>> I see several oldish outstanding JIRAs relating to problems in the
>> packaging:
>>
>> https://issues.apache.org/jira/browse/SPARK-1799
>> https://issues.apache.org/jira/browse/SPARK-2614
>> https://issues.apache.org/jira/browse/SPARK-3624
>> https://issues.apache.org/jira/browse/SPARK-4436
>> (and a similar idea about making RPMs)
>> https://issues.apache.org/jira/browse/SPARK-665
>>
>> The original motivation seems related to Chef:
>>
>>
>> https://issues.apache.org/jira/browse/SPARK-2614?focusedCommentId=14070908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14070908
>>
>> Mark's recent comments cast some doubt on whether it is essential:
>>
>> https://github.com/apache/spark/pull/4277#issuecomment-72114226
>>
>> and in recent conversations I didn't hear dissent to the idea of removing
>> this.
>>
>> Is this still useful enough to fix up? All else equal I'd like to
>> start to walk back some of the complexity of the build, but I don't
>> know how all-else-equal it is. Certainly, it sounds like nobody
>> intends these to be used to actually deploy Spark.
>>
>> I don't doubt it's useful to someone, but can they maintain the
>> packaging logic elsewhere?
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[ANNOUNCE] Apache Spark 1.2.1 Released

2015-02-09 Thread Patrick Wendell
Hi All,

I've just posted the 1.2.1 maintenance release of Apache Spark. We
recommend all 1.2.0 users upgrade to this release, as this release
includes stability fixes across all components of Spark.

- Download this release: http://spark.apache.org/downloads.html
- View the release notes:
http://spark.apache.org/releases/spark-release-1-2-1.html
- Full list of JIRA issues resolved in this release: http://s.apache.org/Mpn

Thanks to everyone who helped work on this release!

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: multi-line comment style

2015-02-09 Thread Patrick Wendell
Clearly there isn't a strictly optimal commenting format (pro's and
cons for both '//' and '/*'). My thought is for consistency we should
just chose one and put in the style guide.

On Mon, Feb 9, 2015 at 12:25 PM, Xiangrui Meng  wrote:
> Btw, I think allowing `/* ... */` without the leading `*` in lines is
> also useful. Check this line:
> https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55,
> where we put the R commands that can reproduce the test result. It is
> easier if we write in the following style:
>
> ~~~
> /*
>  Using the following R code to load the data and train the model using
> glmnet package.
>
>  library("glmnet")
>  data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE)
>  features <- as.matrix(data.frame(as.numeric(data$V2), as.numeric(data$V3)))
>  label <- as.numeric(data$V1)
>  weights <- coef(glmnet(features, label, family="gaussian", alpha = 0,
> lambda = 0))
>  */
> ~~~
>
> So people can copy & paste the R commands directly.
>
> Xiangrui
>
> On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng  wrote:
>> I like the `/* .. */` style more. Because it is easier for IDEs to
>> recognize it as a block comment. If you press enter in the comment
>> block with the `//` style, IDEs won't add `//` for you. -Xiangrui
>>
>> On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin  wrote:
>>> We should update the style doc to reflect what we have in most places
>>> (which I think is //).
>>>
>>>
>>>
>>> On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
>>>> FWIW I like the multi-line // over /* */ from a purely style standpoint.
>>>> The Google Java style guide[1] has some comment about code formatting tools
>>>> working better with /* */ but there doesn't seem to be any strong arguments
>>>> for one over the other I can find
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> [1]
>>>>
>>>> https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style
>>>>
>>>> On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell 
>>>> wrote:
>>>>
>>>> > Personally I have no opinion, but agree it would be nice to standardize.
>>>> >
>>>> > - Patrick
>>>> >
>>>> > On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen  wrote:
>>>> > > One thing Marcelo pointed out to me is that the // style does not
>>>> > > interfere with commenting out blocks of code with /* */, which is a
>>>> > > small good thing. I am also accustomed to // style for multiline, and
>>>> > > reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style
>>>> > > inline always looks a little funny to me.
>>>> > >
>>>> > > On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout <
>>>> kayousterh...@gmail.com>
>>>> > wrote:
>>>> > >> Hi all,
>>>> > >>
>>>> > >> The Spark Style Guide
>>>> > >> <
>>>> > https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
>>>> >
>>>> > >> says multi-line comments should formatted as:
>>>> > >>
>>>> > >> /*
>>>> > >>  * This is a
>>>> > >>  * very
>>>> > >>  * long comment.
>>>> > >>  */
>>>> > >>
>>>> > >> But in my experience, we almost always use "//" for multi-line
>>>> comments:
>>>> > >>
>>>> > >> // This is a
>>>> > >> // very
>>>> > >> // long comment.
>>>> > >>
>>>> > >> Here are some examples:
>>>> > >>
>>>> > >>- Recent commit by Reynold, king of style:
>>>> > >>
>>>> >
>>>> https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
>>>> > >>- RDD.scala:
>>>> > >>
>>>> >
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
>>>> > >>- DAGScheduler.scala:
>>>> > >>
>>>> >
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281
>>>> > >>
>>>> > >>
>>>> > >> Any objections to me updating the style guide to reflect this?  As
>>>> with
>>>> > >> other style issues, I think consistency here is helpful (and
>>>> formatting
>>>> > >> multi-line comments as "//" does nicely visually distinguish code
>>>> > comments
>>>> > >> from doc comments).
>>>> > >>
>>>> > >> -Kay
>>>> > >
>>>> > > -
>>>> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> > > For additional commands, e-mail: dev-h...@spark.apache.org
>>>> > >
>>>> >
>>>> > -
>>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>>> >
>>>> >
>>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Mail to u...@spark.apache.org failing

2015-02-09 Thread Patrick Wendell
Ah - we should update it to suggest mailing the dev@ list (and if
there is enough traffic maybe do something else).

I'm happy to add you if you can give an organization name, URL, a list
of which Spark components you are using, and a short description of
your use case.

On Mon, Feb 9, 2015 at 9:00 PM, Meethu Mathew  wrote:
> Hi,
>
> The mail id given in
> https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark seems to
> be failing. Can anyone tell me how to get added to Powered By Spark list?
>
> --
>
> Regards,
>
> *Meethu*

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Keep or remove Debian packaging in Spark?

2015-02-09 Thread Patrick Wendell
Mark was involved in adding this code (IIRC) and has also been the
most active in maintaining it. So I'd be interested in hearing his
thoughts on that proposal. Mark - would you be okay deprecating this
and having Spark instead work with the upstream projects that focus on
packaging?

My feeling is that it's better to just have nothing than to have
something not usable out-of-the-box (which to your point, is a lot
more work).

On Mon, Feb 9, 2015 at 4:10 PM,   wrote:
> If the Spark community wanted to not maintain debs/rpms directly via the
> project, it could direct interested efforts towards Apache Bigtop. Right now
> debs/rpms of Bigtop components, as well as related tests, are a focus.
>
> Something that would be great is if at least one Spark committer with
> interests in config/pkg/testing could be a liaison and point of contact for
> Bigtop efforts.
>
> Right now the focus is on Bigtop 0.9, which currently includes Spark 1.2. The JIRA for
> items included in 0.9 can be found here:
>
> https://issues.apache.org/jira/browse/BIGTOP-1480
>
>
>
> -Original Message-
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Monday, February 9, 2015 3:52 PM
> To: Nicholas Chammas
> Cc: Patrick Wendell; Mark Hamstra; dev
> Subject: Re: Keep or remove Debian packaging in Spark?
>
> What about this straw man proposal: deprecate in 1.3 with some kind of 
> message in the build, and remove for 1.4? And add a pointer to any 
> third-party packaging that might provide similar functionality?
>
> On Mon, Feb 9, 2015 at 6:47 PM, Nicholas Chammas  
> wrote:
>> +1 to an "official" deprecation + redirecting users to some other
>> +project
>> that will or already is taking this on.
>>
>> Nate?
>>
>>
>>
>> On Mon Feb 09 2015 at 10:08:27 AM Patrick Wendell 
>> wrote:
>>>
>>> I have wondered whether we should sort of deprecate it more
>>> officially, since otherwise I think people have the reasonable
>>> expectation based on the current code that Spark intends to support
>>> "complete" Debian packaging as part of the upstream build. Having
>>> something that's sort-of maintained but no one is helping review and
>>> merge patches on it or make it fully functional, IMO that doesn't
>>> benefit us or our users. There are a bunch of other projects that are
>>> specifically devoted to packaging, so it seems like there is a clear
>>> separation of concerns here.
>>>
>>> On Mon, Feb 9, 2015 at 7:31 AM, Mark Hamstra
>>> 
>>> wrote:
>>> >>
>>> >> it sounds like nobody intends these to be used to actually deploy
>>> >> Spark
>>> >
>>> >
>>> > I wouldn't go quite that far.  What we have now can serve as useful
>>> > input to a deployment tool like Chef, but the user is then going to
>>> > need to add some customization or configuration within the context
>>> > of that tooling to get Spark installed just the way they want.  So
>>> > it is not so much that the current Debian packaging can't be used
>>> > as that it has never really been intended to be a completely
>>> > finished product that a newcomer could, for example, use to install
>>> > Spark completely and quickly to Ubuntu and have a fully-functional
>>> > environment in which they could then run all of the examples,
>>> > tutorials, etc.
>>> >
>>> > Getting to that level of packaging (and maintenance) is something
>>> > that I'm not sure we want to do since that is a better fit with
>>> > Bigtop and the efforts of Cloudera, Hortonworks, MapR, etc. to
>>> > distribute Spark.
>>> >
>>> > On Mon, Feb 9, 2015 at 2:41 AM, Sean Owen  wrote:
>>> >
>>> >> This is a straw poll to assess whether there is support to keep
>>> >> and fix, or remove, the Debian packaging-related config in Spark.
>>> >>
>>> >> I see several oldish outstanding JIRAs relating to problems in the
>>> >> packaging:
>>> >>
>>> >> https://issues.apache.org/jira/browse/SPARK-1799
>>> >> https://issues.apache.org/jira/browse/SPARK-2614
>>> >> https://issues.apache.org/jira/browse/SPARK-3624
>>> >> https://issues.apache.org/jira/browse/SPARK-4436
>>> >> (and a similar idea about making RPMs)
>>> >> https://issues.apache.org/jira/browse/SPARK-665
>>> >>
>>> >> The original motivation seems related to Chef:
>

Re: New Metrics Sink class not packaged in spark-assembly jar

2015-02-09 Thread Patrick Wendell
Actually, to correct myself, the assembly jar is in
assembly/target/scala-2.11 (I think).

On Mon, Feb 9, 2015 at 10:42 PM, Patrick Wendell  wrote:

> Hi Judy,
>
> If you have added source files in the sink/ source folder, they should
> appear in the assembly jar when you build. One thing I noticed is that you
> are looking inside the "/dist" folder. That only gets populated if you run
> "make-distribution". The normal development process is just to do "mvn
> package" and then look at the assembly jar that is contained in core/target.
>
> - Patrick
>
> On Mon, Feb 9, 2015 at 10:02 PM, Judy Nash <
> judyn...@exchange.microsoft.com> wrote:
>
>>  Hello,
>>
>>
>>
>> Working on SPARK-5708 <https://issues.apache.org/jira/browse/SPARK-5708>
>> - Add Slf4jSink to Spark Metrics Sink.
>>
>>
>>
>> Wrote a new Slf4jSink class (see patch attached), but the new class is
>> not packaged as part of spark-assembly jar.
>>
>>
>>
>> Do I need to update build config somewhere to have this packaged?
>>
>>
>>
>> Current packaged class:
>>
>>
>>
>> Thought I must have missed something basic but can't figure out why.
>>
>>
>>
>> Thanks!
>>
>> Judy
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>


Re: New Metrics Sink class not packaged in spark-assembly jar

2015-02-09 Thread Patrick Wendell
Hi Judy,

If you have added source files in the sink/ source folder, they should
appear in the assembly jar when you build. One thing I noticed is that you
are looking inside the "/dist" folder. That only gets populated if you run
"make-distribution". The normal development process is just to do "mvn
package" and then look at the assembly jar that is contained in core/target.

- Patrick

On Mon, Feb 9, 2015 at 10:02 PM, Judy Nash 
wrote:

>  Hello,
>
>
>
> Working on SPARK-5708 
> - Add Slf4jSink to Spark Metrics Sink.
>
>
>
> Wrote a new Slf4jSink class (see patch attached), but the new class is not
> packaged as part of spark-assembly jar.
>
>
>
> Do I need to update build config somewhere to have this packaged?
>
>
>
> Current packaged class:
>
>
>
> Thought I must have missed something basic but can't figure out why.
>
>
>
> Thanks!
>
> Judy
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>


Re: Powered by Spark: Concur

2015-02-10 Thread Patrick Wendell
Thanks Paolo - I've fixed it.

On Mon, Feb 9, 2015 at 11:10 PM, Paolo Platter
 wrote:
> Hi,
>
> I checked the powered by wiki too and Agile Labs should be Agile Lab. The 
> link is wrong too, it should be www.agilelab.it.
> The description is correct.
>
> Thanks a lot
>
> Paolo
>
> Sent from my Windows Phone
> 
> From: Denny Lee
> Sent: 10/02/2015 07:41
> To: Matei Zaharia
> Cc: dev@spark.apache.org
> Subject: Re: Powered by Spark: Concur
>
> Thanks Matei - much appreciated!
>
> On Mon Feb 09 2015 at 10:23:57 PM Matei Zaharia 
> wrote:
>
>> Thanks Denny; added you.
>>
>> Matei
>>
>> > On Feb 9, 2015, at 10:11 PM, Denny Lee  wrote:
>> >
>> > Forgot to add Concur to the "Powered by Spark" wiki:
>> >
>> > Concur
>> > https://www.concur.com
>> > Spark SQL, MLLib
>> > Using Spark for travel and expenses analytics and personalization
>> >
>> > Thanks!
>> > Denny
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[ANNOUNCE] Spark 1.3.0 Snapshot 1

2015-02-11 Thread Patrick Wendell
Hey All,

I've posted Spark 1.3.0 snapshot 1. At this point the 1.3 branch is
ready for community testing and we are strictly merging fixes and
documentation across all components.

The release files, including signatures, digests, etc can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-snapshot1/

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1068/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/

Please report any issues with the release to this thread and/or to our
project JIRA. Thanks!

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread Patrick Wendell
The map will start with a capacity of 64, but will grow to accommodate
new data. Are you using the groupBy operator in Spark or are you using
Spark SQL's group by? This usually happens if you are grouping or
aggregating in a way that doesn't sufficiently condense the data
created from each input partition.
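
As a rough illustration (a sketch only; the input path and key parsing below
are made up, not taken from your job), an aggregation that condenses data per
partition via a map-side combine keeps that map far smaller than one that
buffers every value per key:

~~~
import org.apache.spark.SparkContext._   // pair-RDD functions (needed pre-1.3)

// Hypothetical job: count events per key from a CSV input.
val events = sc.textFile("hdfs:///path/to/input")      // assumed path
  .map(line => (line.split(",")(0), 1L))

// Buffers key -> all values in memory before summing; the per-partition
// AppendOnlyMap keeps growing with every record.
val slow = events.groupByKey().mapValues(_.sum)

// Combines values inside each partition before the shuffle, so the map
// only holds one running sum per distinct key.
val fast = events.reduceByKey(_ + _)
~~~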

- Patrick

On Wed, Feb 11, 2015 at 9:37 PM, fightf...@163.com  wrote:
> Hi,
>
> Really have no adequate solution got for this issue. Expecting any available
> analytical rules or hints.
>
> Thanks,
> Sun.
>
> 
> fightf...@163.com
>
>
> From: fightf...@163.com
> Date: 2015-02-09 11:56
> To: user; dev
> Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for
> large data sets
> Hi,
> Problem still exists. Any experts would take a look at this?
>
> Thanks,
> Sun.
>
> 
> fightf...@163.com
>
>
> From: fightf...@163.com
> Date: 2015-02-06 17:54
> To: user; dev
> Subject: Sort Shuffle performance issues about using AppendOnlyMap for large
> data sets
> Hi, all
> Recently we had caught performance issues when using spark 1.2.0 to read
> data from hbase and do some summary work.
> Our scenario is: read large data sets from HBase (maybe a 100G+ file),
> form an hbaseRDD, transform it to a SchemaRDD,
> group by and aggregate the data into fewer new summary data sets,
> and load the data into HBase (Phoenix).
>
> Our major issue: aggregating the large datasets into summary data sets
> consumes far too much time (1 hour+), which
> should not perform so badly. We got the dump file attached and a
> stacktrace from jstack like the following:
>
> From the stacktrace and dump file we can identify that processing large
> datasets causes the AppendOnlyMap to grow frequently,
> leading to a huge map entry size. We referenced the source code of
> org.apache.spark.util.collection.AppendOnlyMap and found that
> the map is initialized with a capacity of 64. That would be too small
> for our use case.
>
> So the question is: has anyone encountered such issues before? How were
> they resolved? I cannot find any JIRA issues for such problems, and
> if someone has seen this, please kindly let us know.
>
> A more specific question: is there any possibility for the user to
> define the map capacity in Spark? If so, please
> tell us how to achieve that.
>
> Best Thanks,
> Sun.
>
>Thread 22432: (state = IN_JAVA)
> - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87,
> line=224 (Compiled frame; information may be imprecise)
> - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable()
> @bci=1, line=38 (Interpreted frame)
> - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22,
> line=198 (Compiled frame)
> -
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object,
> scala.Function2) @bci=201, line=145 (Compiled frame)
> -
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
> scala.Function2) @bci=3, line=32 (Compiled frame)
> -
> org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
> @bci=141, line=205 (Compiled frame)
> -
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
> @bci=74, line=58 (Interpreted frame)
> -
> org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
> @bci=169, line=68 (Interpreted frame)
> -
> org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
> @bci=2, line=41 (Interpreted frame)
> - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted
> frame)
> - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196
> (Interpreted frame)
> -
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
> @bci=95, line=1145 (Interpreted frame)
> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
> (Interpreted frame)
> - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
>
>
> Thread 22431: (state = IN_JAVA)
> - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87,
> line=224 (Compiled frame; information may be imprecise)
> - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable()
> @bci=1, line=38 (Interpreted frame)
> - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22,
> line=198 (Compiled frame)
> -
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object,
> scala.Function2) @bci=201, line=145 (Compiled frame)
> -
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
> scala.Function2) @bci=3, line=32 (Compiled frame)
> -
> org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
> @bci=141, line=205 (Compiled frame)
> -
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
> @bci=74, line=58 (Interp

Re: driver fail-over in Spark streaming 1.2.0

2015-02-12 Thread Patrick Wendell
It will create and connect to new executors. The executors are mostly
stateless, so the program can resume with new executors.
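
For reference, the usual recovery pattern on the application side (a minimal
sketch, assuming a reliable checkpoint directory such as HDFS; the path and
app name are placeholders) is to build the context through
StreamingContext.getOrCreate, so a restarted driver reconstructs its state
from the checkpoint and then acquires a fresh set of executors:

~~~
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"   // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream graph here ...
  ssc
}

// Clean start: builds a new context. After a driver failure: reconstructs
// the context, pending batches and DStream graph from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
~~~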

On Wed, Feb 11, 2015 at 11:24 PM, lin  wrote:
> Hi, all
>
> In Spark Streaming 1.2.0, when the driver fails and a new driver starts
> with the most up-to-date checkpointed data, will the former executors
> connect to the new driver, or will the new driver start its own set
> of new executors? In which piece of code is that done?
>
> Any reply will be appreciated :)
>
> regards,
>
> lin

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to track issues that must wait for Spark 2.x in JIRA?

2015-02-12 Thread Patrick Wendell
Yeah my preference is also having a more open-ended "2+" for issues
that are clearly desirable but blocked by compatibility concerns.

What I would really want to avoid is major feature proposals sitting
around in our JIRA and tagged under some 2.X version. IMO JIRA isn't
the place for thoughts about very-long-term things. When we get these,
I'd be inclined to either close them as "won't fix" or "later".

On Thu, Feb 12, 2015 at 12:47 AM, Reynold Xin  wrote:
> It seems to me having a version that is 2+ is good for that? Once we move
> to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1
> or 2.1.0 .
>
> On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen  wrote:
>
>> Patrick and I were chatting about how to handle several issues which
>> clearly need a fix, and are easy, but can't be implemented until a
>> next major release like Spark 2.x since it would change APIs.
>> Examples:
>>
>> https://issues.apache.org/jira/browse/SPARK-3266
>> https://issues.apache.org/jira/browse/SPARK-3369
>> https://issues.apache.org/jira/browse/SPARK-4819
>>
>> We could simply make version 2.0.0 in JIRA. Although straightforward,
>> it might imply that release planning has begun for 2.0.0.
>>
>> The version could be called "2+" for now to better indicate its status.
>>
>> There is also a "Later" JIRA resolution. Although resolving the above
>> seems a little wrong, it might be reasonable if we're sure to revisit
>> "Later", well, at some well defined later. The three issues above risk
>> getting lost in the shuffle.
>>
>> We also wondered whether using "Later" is good or bad. It takes items
>> off the radar that aren't going to be acted on anytime soon -- and
>> there are lots of those right now. It might send a message that these
>> will be revisited when they are even less likely to if resolved.
>>
>> Any opinions?
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Replacing Jetty with TomCat

2015-02-17 Thread Patrick Wendell
Hey Niranda,

It seems to me that it would be a lot of effort to support multiple server
libraries inside of Spark like this, so I'm not sure that's a great solution.

If you are building an application that embeds Spark, is it not
possible for you to continue to use Jetty for Spark's internal servers
and use tomcat for your own server's? I would guess that many complex
applications end up embedding multiple server libraries in various
places (Spark itself has different transport mechanisms, etc.)

- Patrick

On Tue, Feb 17, 2015 at 7:14 PM, Niranda Perera
 wrote:
> Hi Sean,
> The main issue we have is running two web servers in a single product; we
> think it would not be an elegant solution.
>
> Could you please point me to the main areas where the Jetty server is tightly
> coupled, or extension points where I could plug in Tomcat instead of Jetty?
> If successful I could contribute it to the Spark project. :-)
>
> cheers
>
>
>
> On Mon, Feb 16, 2015 at 4:51 PM, Sean Owen  wrote:
>
>> There's no particular reason you have to remove the embedded Jetty
>> server, right? it doesn't prevent you from using it inside another app
>> that happens to run in Tomcat. You won't be able to switch it out
>> without rewriting a fair bit of code, no, but you don't need to.
>>
>> On Mon, Feb 16, 2015 at 5:08 AM, Niranda Perera
>>  wrote:
>> > Hi,
>> >
>> > We are thinking of integrating Spark server inside a product. Our current
>> > product uses Tomcat as its webserver.
>> >
>> > Is it possible to switch the Jetty webserver in Spark to Tomcat
>> > off-the-shelf?
>> >
>> > Cheers
>> >
>> > --
>> > Niranda
>>
>
>
>
> --
> Niranda

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.3.0!

The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1069/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/

Please vote on releasing this package as Apache Spark 1.3.0!

The vote is open until Saturday, February 21, at 08:03 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by
taking a Spark 1.2 workload and running on this release candidate,
then reporting any regressions.
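
For example, pointing an existing sbt build at the candidate (a sketch; the
resolver URL is the staging repository listed above) only needs a resolver
and a version bump:

~~~
// build.sbt
resolvers += "Spark 1.3.0 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1069/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
~~~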

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.3 QA period,
so -1 votes should only occur for significant regressions from 1.2.1.
Bugs already present in 1.2.X, minor regressions, or bugs related
to new features will not block this release.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Merging code into branch 1.3

2015-02-18 Thread Patrick Wendell
Hey Committers,

Now that Spark 1.3 rc1 is cut, please restrict branch-1.3 merges to
the following:

1. Fixes for issues blocking the 1.3 release (i.e. 1.2.X regressions)
2. Documentation and tests.
3. Fixes for non-blocker issues that are surgical, low-risk, and/or
outside of the core.

If there is a lower priority bug fix (a non-blocker) that requires
nontrivial code changes, do not merge it into 1.3. If something seems
borderline, feel free to reach out to me and we can work through it
together.

This is what we've done for the last few releases to make sure rc's
become progressively more stable, and it is important towards helping
us cut timely releases.

Thanks!

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Patrick Wendell
> UISeleniumSuite:
> *** RUN ABORTED ***
>   java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
> ...

This is a newer test suite. There is something flaky about it; we
should definitely fix it, but IMO it's not a blocker.

>
> Patrick this link gives a 404:
> https://people.apache.org/keys/committer/pwendell.asc

Works for me. Maybe it's some ephemeral issue?

> Finally, I already realized I failed to get the fix for
> https://issues.apache.org/jira/browse/SPARK-5669 correct, and that has
> to be correct for the release. I'll patch that up straight away,
> sorry. I believe the result of the intended fix is still as I
> described in SPARK-5669, so there is no bad news there. A local test
> seems to confirm it and I'm waiting on Jenkins. If it's all good I'll
> merge that fix. So, that much will need a new release, I apologize.

Thanks for finding this. I'm going to leave this open for continued testing...

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Performance] Possible regression in rdd.take()?

2015-02-18 Thread Patrick Wendell
I believe the heuristic governing the way that take() decides to fetch
partitions changed between these versions. It could be that in certain
cases the new heuristic is worse, but it might be good to just look at
the source code and see, for your number of elements taken and number
of partitions, if there was any effective change in how aggressively
spark fetched partitions.

This was quite a while ago, but I think the change was made because in
many cases the newer code works more efficiently.
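
To make "heuristic" concrete, here is an illustrative sketch (not the actual
Spark source; the real constants and estimate differ between 1.0.2 and 1.1.1,
which is exactly the change in question) of the general shape of the driver
loop behind take(n):

~~~
// Illustrative only: decide how many partitions to fetch on each pass.
// The 4x cap on growth per pass and the density-based estimate are
// placeholder assumptions, not Spark's real values.
def passSizes(n: Int, totalParts: Int, rowsPerPart: Int): Seq[Int] = {
  val passes = scala.collection.mutable.ArrayBuffer[Int]()
  var scanned = 0
  var collected = 0
  while (collected < n && scanned < totalParts) {
    val toTry =
      if (scanned == 0) 1                      // start cheap
      else if (collected == 0) scanned * 4     // saw nothing yet: scale up fast
      else {
        // estimate remaining partitions from the density seen so far,
        // but never grow by more than 4x per pass
        val estimate =
          math.ceil(scanned.toDouble * (n - collected) / collected).toInt
        math.min(math.max(estimate, 1), scanned * 4)
      }
    val thisPass = math.min(toTry, totalParts - scanned)
    passes += thisPass
    scanned += thisPass
    collected += thisPass * rowsPerPart
  }
  passes
}
~~~

A conservative first pass is cheap when the taken rows are dense in the first
partitions, but it costs extra scheduling rounds (and wall-clock latency) when
they are not, which is the kind of trade-off a change to this heuristic shifts.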

- Patrick

On Wed, Feb 18, 2015 at 4:47 PM, Matt Cheah  wrote:
> Hi everyone,
>
> Between Spark 1.0.2 and Spark 1.1.1, I have noticed that rdd.take()
> consistently has a slower execution time on the later release. I was
> wondering if anyone else has had similar observations.
>
> I have two setups where this reproduces. The first is a local test. I
> launched a spark cluster with 4 worker JVMs on my Mac, and launched a
> Spark-Shell. I retrieved the text file and immediately called rdd.take(N) on
> it, where N varied. The RDD is a plaintext CSV, 4GB in size, split over 8
> files, which ends up having 128 partitions, and a total of 8000 rows.
> The numbers I discovered between Spark 1.0.2 and Spark 1.1.1 are, with all
> numbers being in seconds:
>
> 1 items
>
> Spark 1.0.2: 0.069281, 0.012261, 0.011083
>
> Spark 1.1.1: 0.11577, 0.097636, 0.11321
>
>
> 4 items
>
> Spark 1.0.2: 0.023751, 0.069365, 0.023603
>
> Spark 1.1.1: 0.224287, 0.229651, 0.158431
>
>
> 10 items
>
> Spark 1.0.2: 0.047019, 0.049056, 0.042568
>
> Spark 1.1.1: 0.353277, 0.288965, 0.281751
>
>
> 40 items
>
> Spark 1.0.2: 0.216048, 0.198049, 0.796037
>
> Spark 1.1.1: 1.865622, 2.224424, 2.037672
>
> This small test suite indicates a consistently reproducible performance
> regression.
>
>
> I also notice this on a larger scale test. The cluster used is on EC2:
>
> ec2 instance type: m2.4xlarge
> 10 slaves, 1 master
> ephemeral storage
> 70 cores, 50 GB/box
>
> In this case, I have a 100GB dataset split into 78 files totally 350 million
> items, and I take the first 50,000 items from the RDD. In this case, I have
> tested this on different formats of the raw data.
>
> With plaintext files:
>
> Spark 1.0.2: 0.422s, 0.363s, 0.382s
>
> Spark 1.1.1: 4.54s, 1.28s, 1.221s, 1.13s
>
>
> With snappy-compressed Avro files:
>
> Spark 1.0.2: 0.73s, 0.395s, 0.426s
>
> Spark 1.1.1: 4.618s, 1.81s, 1.158s, 1.333s
>
> Again demonstrating a reproducible performance regression.
>
> I was wondering if anyone else observed this regression, and if so, if
> anyone would have any idea what could possibly have caused it between Spark
> 1.0.2 and Spark 1.1.1?
>
> Thanks,
>
> -Matt Cheah

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
So actually, the list of blockers on JIRA is a bit outdated. These
days I won't cut RC1 unless there are no known issues that I'm aware
of that would actually block the release (that's what the snapshot
ones are for). I'm going to clean those up and push others to do so
also.

The main issues I'm aware of that came about post RC1 are:
1. Python submission broken on YARN
2. The license issue in MLlib [now fixed].
3. Varargs broken for Java Dataframes [now fixed]

Re: Corey - yeah, as it stands now I try to wait if there are things
that look like implicit -1 votes.

On Mon, Feb 23, 2015 at 6:13 AM, Corey Nolet  wrote:
> Thanks Sean. I glossed over the comment about SPARK-5669.
>
> On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen  wrote:
>>
>> Yes my understanding from Patrick's comment is that this RC will not
>> be released, but, to keep testing. There's an implicit -1 out of the
>> gates there, I believe, and so the vote won't pass, so perhaps that's
>> why there weren't further binding votes. I'm sure that will be
>> formalized shortly.
>>
>> FWIW here are 10 issues still listed as blockers for 1.3.0:
>>
>> SPARK-5910 DataFrame.selectExpr("col as newName") does not work
>> SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
>> SPARK-5873 Can't see partially analyzed plans
>> SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
>> SPARK-5517 SPARK-5166 Add input types for Java UDFs
>> SPARK-5463 Fix Parquet filter push-down
>> SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
>> SPARK-5183 SPARK-5180 Document data source API
>> SPARK-3650 Triangle Count handles reverse edges incorrectly
>> SPARK-3511 Create a RELEASE-NOTES.txt file in the repo
>>
>>
>> On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet  wrote:
>> > This vote was supposed to close on Saturday but it looks like no PMCs
>> > voted
>> > (other than the implicit vote from Patrick). Was there a discussion
>> > offline
>> > to cut an RC2? Was the vote extended?
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
It's only been reported on this thread by Tom, so far.

On Mon, Feb 23, 2015 at 10:29 AM, Marcelo Vanzin  wrote:
> Hey Patrick,
>
> Do you have a link to the bug related to Python and Yarn? I looked at
> the blockers in Jira but couldn't find it.
>
> On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell  wrote:
>> So actually, the list of blockers on JIRA is a bit outdated. These
>> days I won't cut RC1 unless there are no known issues that I'm aware
>> of that would actually block the release (that's what the snapshot
>> ones are for). I'm going to clean those up and push others to do so
>> also.
>>
>> The main issues I'm aware of that came about post RC1 are:
>> 1. Python submission broken on YARN
>> 2. The license issue in MLlib [now fixed].
>> 3. Varargs broken for Java Dataframes [now fixed]
>>
>> Re: Corey - yeah, as it stands now I try to wait if there are things
>> that look like implicit -1 votes.
>
> --
> Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: UnusedStubClass in 1.3.0-rc1

2015-02-25 Thread Patrick Wendell
Hey Cody,

What build command are you using? In any case, we can actually comment
out the "unused" thing now in the root pom.xml. It existed just to
ensure that at least one dependency was listed in the shade plugin
configuration (otherwise, some work we do that requires the shade
plugin does not happen). However, now there are other things there. If
you just comment out the line in the root pom.xml adding this
dependency, does it work?

- Patrick

On Wed, Feb 25, 2015 at 7:53 AM, Cody Koeninger  wrote:
> So when building 1.3.0-rc1 I see the following warning:
>
> [WARNING] spark-streaming-kafka_2.10-1.3.0.jar, unused-1.0.0.jar define 1
> overlappping classes:
>
> [WARNING]   - org.apache.spark.unused.UnusedStubClass
>
>
> and when trying to build an assembly of a project that was previously using
> 1.3 snapshots without difficulty, I see the following errors:
>
>
> [error] (*:assembly) deduplicate: different file contents found in the
> following:
>
> [error]
> /Users/cody/.m2/repository/org/apache/spark/spark-streaming-kafka_2.10/1.3.0/spark-streaming-kafka_2.10-1.3.0.jar:org/apache/spark/unused/UnusedStubClass.class
>
> [error]
> /Users/cody/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
>
>
> This persists even after a clean / rebuild of both 1.3.0-rc1 and the
> project using it.
>
>
> I can just exclude that jar in the assembly definition, but is anyone else
> seeing similar issues?  If so, might be worth resolving rather than make
> users mess with assembly exclusions.
>
> I see that this class was introduced a while ago, related to SPARK-3812 but
> the jira issue doesn't have much detail.
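
For anyone hitting the same deduplicate error, a minimal sketch of the
assembly-side workaround Cody mentions, assuming an sbt build with the
sbt-assembly plugin (0.12+ auto-plugin keys) enabled in project/plugins.sbt;
since org.apache.spark.unused holds only the empty stub class, keeping a
single copy discards no real code:

// build.sbt fragment -- sketch only, not the project's actual build.
assemblyMergeStrategy in assembly := {
  // Keep one copy of the deliberately empty UnusedStubClass placeholder.
  case PathList("org", "apache", "spark", "unused", xs @ _*) =>
    MergeStrategy.first
  case other =>
    // Defer to whatever the plugin would have done by default.
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(other)
}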

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: UnusedStubClass in 1.3.0-rc1

2015-02-25 Thread Patrick Wendell
This has been around for multiple versions of Spark, so I am a bit
surprised to see it not working in your build.

- Patrick

On Wed, Feb 25, 2015 at 9:41 AM, Patrick Wendell  wrote:
> Hey Cody,
>
> What build command are you using? In any case, we can actually comment
> out the "unused" thing now in the root pom.xml. It existed just to
> ensure that at least one dependency was listed in the shade plugin
> configuration (otherwise, some work we do that requires the shade
> plugin does not happen). However, now there are other things there. If
> you just comment out the line in the root pom.xml adding this
> dependency, does it work?
>
> - Patrick
>
> On Wed, Feb 25, 2015 at 7:53 AM, Cody Koeninger  wrote:
>> So when building 1.3.0-rc1 I see the following warning:
>>
>> [WARNING] spark-streaming-kafka_2.10-1.3.0.jar, unused-1.0.0.jar define 1
>> overlappping classes:
>>
>> [WARNING]   - org.apache.spark.unused.UnusedStubClass
>>
>>
>> and when trying to build an assembly of a project that was previously using
>> 1.3 snapshots without difficulty, I see the following errors:
>>
>>
>> [error] (*:assembly) deduplicate: different file contents found in the
>> following:
>>
>> [error]
>> /Users/cody/.m2/repository/org/apache/spark/spark-streaming-kafka_2.10/1.3.0/spark-streaming-kafka_2.10-1.3.0.jar:org/apache/spark/unused/UnusedStubClass.class
>>
>> [error]
>> /Users/cody/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
>>
>>
>> This persists even after a clean / rebuild of both 1.3.0-rc1 and the
>> project using it.
>>
>>
>> I can just exclude that jar in the assembly definition, but is anyone else
>> seeing similar issues?  If so, might be worth resolving rather than make
>> users mess with assembly exclusions.
>>
>> I see that this class was introduced a while ago, related to SPARK-3812 but
>> the jira issue doesn't have much detail.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-25 Thread Patrick Wendell
Hey All,

Just a quick update on this thread. Issues have continued to trickle
in. Not all of them are blocker level but enough to warrant another
RC:

I've been keeping the JIRA dashboard up and running with the latest
status (sorry, long link):
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20%22Target%20Version%2Fs%22%20%3D%201.3.0%20AND%20(fixVersion%20IS%20EMPTY%20OR%20fixVersion%20!%3D%201.3.0)%20AND%20(Resolution%20IS%20EMPTY%20OR%20Resolution%20IN%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority%2C%20component

Once these are in I will cut another RC. Thanks everyone for the
continued voting!

- Patrick

On Mon, Feb 23, 2015 at 10:52 PM, Tathagata Das
 wrote:
> Hey all,
>
> I found a major issue where JobProgressListener (a listener used to keep
> track of jobs for the web UI) never forgets stages in one of its data
> structures. This is a blocker for long running applications.
> https://issues.apache.org/jira/browse/SPARK-5967
>
> I am testing a fix for this right now.
>
> TD
>
> On Mon, Feb 23, 2015 at 7:23 PM, Soumitra Kumar 
> wrote:
>
>> +1 (non-binding)
>>
>> For: https://issues.apache.org/jira/browse/SPARK-3660
>>
>> . Docs OK
>> . Example code is good
>>
>> -Soumitra.
>>
>>
>> On Mon, Feb 23, 2015 at 10:33 AM, Marcelo Vanzin 
>> wrote:
>>
>> > Hi Tom, are you using an sbt-built assembly by any chance? If so, take
>> > a look at SPARK-5808.
>> >
>> > I haven't had any problems with the maven-built assembly. Setting
>> > SPARK_HOME on the executors is a workaround if you want to use the sbt
>> > assembly.
>> >
>> > On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves
>> >  wrote:
>> > > Trying to run pyspark on yarn in client mode with basic wordcount
>> > example I see the following error when doing the collect:
>> > > Error from python worker:  /usr/bin/python: No module named
>> > sqlPYTHONPATH was:
>> >
>> /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jarjava.io.EOFException
>> >   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>> > at
>> >
>> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
>> >   at
>> >
>> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
>> >   at
>> >
>> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
>> >   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
>> >   at
>> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
>> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
>> > org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
>> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
>> >
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> >   at
>> >
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> >   at org.apache.spark.scheduler.Task.run(Task.scala:64)at
>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>> >   at
>> >
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >   at
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >   at java.lang.Thread.run(Thread.java:722)
>> > > any ideas on this?
>> > > Tom
>> > >
>> > >  On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell <
>> > pwend...@gmail.com> wrote:
>> > >
>> > >
>> > >  Please vote on releasing the following candidate as Apache Spark
>> > version 1.3.0!
>> > >
>> > > The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
>> > >
>> >
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
>> > >
>> > > The release files, including signatures, digests, etc. can be found at:
>> > > http://people.apache.org/~pwendell/spark-1.3.0-rc1/
>> > >
>> > > Release artifacts are signed with the following key:
>> > > https://people.apache.org/keys/committer/pwendell

Re: spark-ec2 default to Hadoop 2

2015-03-01 Thread Patrick Wendell
Yeah calling it Hadoop 2 was a very bad naming choice (of mine!), this
was back when CDH4 was the only real distribution available with some
of the newer Hadoop API's and packaging.

I think to not surprise people using this, it's best to keep v1 as the
default. Overall, we try not to change default values too often to
make upgrading easy for people.

- Patrick

On Sun, Mar 1, 2015 at 3:14 PM, Shivaram Venkataraman
 wrote:
> One reason I wouldn't change the default is that the Hadoop 2 launched by
> spark-ec2 is not a full Hadoop 2 distribution -- Its more of a hybrid
> Hadoop version built using CDH4 (it uses HDFS 2, but not YARN AFAIK).
>
> Also our default Hadoop version in the Spark build is still 1.0.4 [1], so
> it makes sense to stick to that in spark-ec2 as well ?
>
> [1] https://github.com/apache/spark/blob/master/pom.xml#L122
>
> Thanks
> Shivaram
>
> On Sun, Mar 1, 2015 at 2:59 PM, Nicholas Chammas  wrote:
>
>>
>> https://github.com/apache/spark/blob/fd8d283eeb98e310b1e85ef8c3a8af9e547ab5e0/ec2/spark_ec2.py#L162-L164
>>
>> Is there any reason we shouldn't update the default Hadoop major version in
>> spark-ec2 to 2?
>>
>> Nick
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-03-03 Thread Patrick Wendell
This vote is cancelled in favor of RC2.

On Thu, Feb 26, 2015 at 9:50 AM, Sandor Van Wassenhove
 wrote:
> FWIW, I tested the first rc and saw no regressions. I ran our benchmarks
> built against spark 1.3 and saw results consistent with spark 1.2/1.2.1.
>
> On 2/25/15, 5:51 PM, "Patrick Wendell"  wrote:
>
>>Hey All,
>>
>>Just a quick update on this thread. Issues have continued to trickle
>>in. Not all of them are blocker level but enough to warrant another
>>RC:
>>
>>I've been keeping the JIRA dashboard up and running with the latest
>>status (sorry, long link):
>>https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20%22Target%20Version%2Fs%22%20%3D%201.3.0%20AND%20(fixVersion%20IS%20EMPTY%20OR%20fixVersion%20!%3D%201.3.0)%20AND%20(Resolution%20IS%20EMPTY%20OR%20Resolution%20IN%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority%2C%20component
>>
>>Once these are in I will cut another RC. Thanks everyone for the
>>continued voting!
>>
>>- Patrick
>>
>>On Mon, Feb 23, 2015 at 10:52 PM, Tathagata Das
>> wrote:
>>> Hey all,
>>>
>>> I found a major issue where JobProgressListener (a listener used to keep
>>> track of jobs for the web UI) never forgets stages in one of its data
>>> structures. This is a blocker for long running applications.
>>>
>>>https://issues.apache.org/jira/browse/SPARK-5967
>>>
>>> I am testing a fix for this right now.
>>>
>>> TD
>>>
>>> On Mon, Feb 23, 2015 at 7:23 PM, Soumitra Kumar
>>>
>>> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> For:
>>>>https://issues.apache.org/jira/browse/SPARK-3660
>>>>
>>>> . Docs OK
>>>> . Example code is good
>>>>
>>>> -Soumitra.
>>>>
>>>>
>>>> On Mon, Feb 23, 2015 at 10:33 AM, Marcelo Vanzin 
>>>> wrote:
>>>>
>>>> > Hi Tom, are you using an sbt-built assembly by any chance? If so,
>>>>take
>>>> > a look at SPARK-5808.
>>>> >
>>>> > I haven't had any problems with the maven-built assembly. Setting
>>>> > SPARK_HOME on the executors is a workaround if you want to use the
>>>>sbt
>>>> > assembly.
>>>> >
>>>> > On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves
>>>> >  wrote:
>>>> > > Trying to run pyspark on yarn in client mode with basic wordcount
>>>> > example I see the following error when doing the collect:
>>>> > > Error from python worker:  /usr/bin/python: No module named
>>>> > sqlPYTHONPATH was:
>>>> >
>>>>
>>>>/grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3
>>>>.0-hadoop2.6.0.1.1411101121.jarjava.io.EOFException
>>>> >   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>> > at
>>>> >
>>>>
>>>>org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorker
>>>>Factory.scala:163)
>>>> >   at
>>>> >
>>>>
>>>>org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(Pyth
>>>>onWorkerFactory.scala:86)
>>>> >   at
>>>> >
>>>>
>>>>org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFacto
>>>>ry.scala:62)
>>>> >   at
>>>>org.apache.spark.SparkEnv.

[VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-03 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.3.0!

The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

Staging repositories for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1074/
(published with version '1.3.0')
https://repository.apache.org/content/repositories/orgapachespark-1075/
(published with version '1.3.0-rc2')

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/

Please vote on releasing this package as Apache Spark 1.3.0!

The vote is open until Saturday, March 07, at 04:17 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== How does this compare to RC1 ==
This patch includes a variety of bug fixes found in RC1.

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by
taking a Spark 1.2 workload and running on this release candidate,
then reporting any regressions.

If you are happy with this release based on your own testing, give a +1 vote.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.3 QA period,
so -1 votes should only occur for significant regressions from 1.2.1.
Bugs already present in 1.2.X, minor regressions, or bugs related
to new features will not block this release.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Patrick Wendell
Hey Marcelo,

Yes - I agree. That one trickled in just as I was packaging this RC.
However, I still put this out here to allow people to test the
existing fixes, etc.

- Patrick

On Wed, Mar 4, 2015 at 9:26 AM, Marcelo Vanzin  wrote:
> I haven't tested the rc2 bits yet, but I'd consider
> https://issues.apache.org/jira/browse/SPARK-6144 a serious regression
> from 1.2 (since it affects existing "addFile()" functionality if the
> URL is "hdfs:...").
>
> Will test other parts separately.
>
> On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell  wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.3.0!
>>
>> The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc2/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> Staging repositories for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1074/
>> (published with version '1.3.0')
>> https://repository.apache.org/content/repositories/orgapachespark-1075/
>> (published with version '1.3.0-rc2')
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.3.0!
>>
>> The vote is open until Saturday, March 07, at 04:17 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == How does this compare to RC1 ==
>> This patch includes a variety of bug fixes found in RC1.
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.2 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> If you are happy with this release based on your own testing, give a +1 vote.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.3 QA period,
>> so -1 votes should only occur for significant regressions from 1.2.1.
>> Bugs already present in 1.2.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>
>
> --
> Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Hey Mingyu,

I think it's broken out separately so we can record the time taken to
serialize the result. Once we've serialized it once, the second
serialization should be really simple since it's just wrapping
something that has already been turned into a byte buffer. Do you see
a specific issue with serializing it twice?

I think you need to have two steps if you want to record the time
taken to serialize the result, since that needs to be sent back to the
driver when the task completes.

- Patrick

On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim  wrote:
> Hi all,
>
> It looks like the result of task is serialized twice, once by serializer 
> (I.e. Java/Kryo depending on configuration) and once again by closure 
> serializer (I.e. Java). To link the actual code,
>
> The first one: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
> The second one: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226
>
> This serializes the "value", which is the result of task run twice, which 
> affects things like collect(), takeSample(), and toLocalIterator(). Would it 
> make sense to simply serialize the DirectTaskResult once using the regular 
> "serializer" (as opposed to closure serializer)? Would it cause problems when 
> the Accumulator values are not Kryo-serializable?
>
> Alternatively, if we can assume that Accumulator values are small, we can 
> closure-serialize those, put the serialized byte array in DirectTaskResult 
> with the raw task result "value", and serialize DirectTaskResult.
>
> What do people think?
>
> Thanks,
> Mingyu

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Yeah, it will result in a second serialized copy of the array (costing
some memory). But the computational overhead should be very small. The
absolute worst case here will be when doing a collect() or something
similar that just bundles the entire partition.

- Patrick

On Wed, Mar 4, 2015 at 5:47 PM, Mingyu Kim  wrote:
> The concern is really just the runtime overhead and memory footprint of
> Java-serializing an already-serialized byte array again. We originally
> noticed this when we were using RDD.toLocalIterator() which serializes the
> entire 64MB partition. We worked around this issue by kryo-serializing and
> snappy-compressing the partition on the executor side before returning it
> back to the driver, but this operation just felt redundant.
>
> Your explanation about reporting the time taken makes it clearer why it's
> designed this way. Since the byte array for the serialized task result
> shouldn't account for the majority of memory footprint anyways, I'm okay
> with leaving it as is, then.
>
> Thanks,
> Mingyu
>
>
>
>
>
> On 3/4/15, 5:07 PM, "Patrick Wendell"  wrote:
>
>>Hey Mingyu,
>>
>>I think it's broken out separately so we can record the time taken to
>>serialize the result. Once we've serialized it once, the second
>>serialization should be really simple since it's just wrapping
>>something that has already been turned into a byte buffer. Do you see
>>a specific issue with serializing it twice?
>>
>>I think you need to have two steps if you want to record the time
>>taken to serialize the result, since that needs to be sent back to the
>>driver when the task completes.
>>
>>- Patrick
>>
>>On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim  wrote:
>>> Hi all,
>>>
>>> It looks like the result of task is serialized twice, once by
>>>serializer (I.e. Java/Kryo depending on configuration) and once again by
>>>closure serializer (I.e. Java). To link the actual code,
>>>
>>> The first one:
>>>https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
>>> The second one:
>>>https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226
>>>
>>> This serializes the "value", which is the result of task run twice,
>>>which affects things like collect(), takeSample(), and
>>>toLocalIterator(). Would it make sense to simply serialize the
>>>DirectTaskResult once using the regular "serializer" (as opposed to
>>>closure serializer)? Would it cause problems when the Accumulator values
>>>are not Kryo-serializable?
>>>
>>> Alternatively, if we can assume that Accumulator values are small, we can
>>>closure-serialize those, put the serialized byte array in
>>>DirectTaskResult with the raw task result "value", and serialize
>>>DirectTaskResult.
>>>
>>> What do people think?
>>>
>>> Thanks,
>>> Mingyu
>
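
A rough sketch of the executor-side workaround Mingyu describes
(kryo-serializing and snappy-compressing each partition before collecting it
to the driver). The object name, the Array[Int] data, and constructing a
KryoSerializer from a fresh SparkConf inside the closure are illustrative
assumptions, not the code that was actually used:

import java.nio.ByteBuffer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import org.xerial.snappy.Snappy

object CompressedCollectSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compressed-collect"))
    val data = sc.parallelize(1 to 1000000, numSlices = 8)

    // Executor side: Kryo-serialize each partition and snappy-compress the bytes.
    val compressed = data.mapPartitions { iter =>
      val kryo = new KryoSerializer(new SparkConf()).newInstance()
      val bytes = kryo.serialize(iter.toArray).array()
      Iterator(Snappy.compress(bytes))
    }

    // Driver side: fetch the small compressed blobs and decode them locally.
    val kryo = new KryoSerializer(sc.getConf).newInstance()
    val values = compressed.collect().flatMap { blob =>
      kryo.deserialize[Array[Int]](ByteBuffer.wrap(Snappy.uncompress(blob)))
    }
    println(s"collected ${values.length} values")
    sc.stop()
  }
}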

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-04 Thread Patrick Wendell
I like #4 as well and agree with Aaron's suggestion.

- Patrick

On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson  wrote:
> I'm cool with #4 as well, but make sure we dictate that the values should
> be defined within an object with the same name as the enumeration (like we
> do for StorageLevel). Otherwise we may pollute a higher namespace.
>
> e.g. we SHOULD do:
>
> trait StorageLevel
> object StorageLevel {
>   case object MemoryOnly extends StorageLevel
>   case object DiskOnly extends StorageLevel
> }
>
> On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust 
> wrote:
>
>> #4 with a preference for CamelCaseEnums
>>
>> On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley 
>> wrote:
>>
>> > another vote for #4
>> > People are already used to adding "()" in Java.
>> >
>> >
>> > On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch 
>> wrote:
>> >
>> > > #4 but with MemoryOnly (more scala-like)
>> > >
>> > > http://docs.scala-lang.org/style/naming-conventions.html
>> > >
>> > > Constants, Values, Variable and Methods
>> > >
>> > > Constant names should be in upper camel case. That is, if the member is
>> > > final, immutable and it belongs to a package object or an object, it
>> may
>> > be
> > considered a constant (similar to Java's static final members):
>> > >
>> > >
>> > >1. object Container {
>> > >2. val MyConstant = ...
>> > >3. }
>> > >
>> > >
>> > > 2015-03-04 17:11 GMT-08:00 Xiangrui Meng :
>> > >
>> > > > Hi all,
>> > > >
>> > > > There are many places where we use enum-like types in Spark, but in
>> > > > different ways. Every approach has both pros and cons. I wonder
>> > > > whether there should be an "official" approach for enum-like types in
>> > > > Spark.
>> > > >
>> > > > 1. Scala's Enumeration (e.g., SchedulingMode, WorkerState, etc)
>> > > >
>> > > > * All types show up as Enumeration.Value in Java.
>> > > >
>> > > >
>> > >
>> >
>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
>> > > >
>> > > > 2. Java's Enum (e.g., SaveMode, IOMode)
>> > > >
>> > > > * Implementation must be in a Java file.
>> > > > * Values doesn't show up in the ScalaDoc:
>> > > >
>> > > >
>> > >
>> >
>> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
>> > > >
>> > > > 3. Static fields in Java (e.g., TripletFields)
>> > > >
>> > > > * Implementation must be in a Java file.
>> > > > * Doesn't need "()" in Java code.
>> > > > * Values don't show up in the ScalaDoc:
>> > > >
>> > > >
>> > >
>> >
>> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
>> > > >
>> > > > 4. Objects in Scala. (e.g., StorageLevel)
>> > > >
>> > > > * Needs "()" in Java code.
>> > > > * Values show up in both ScalaDoc and JavaDoc:
>> > > >
>> > > >
>> > >
>> >
>> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
>> > > >
>> > > >
>> > >
>> >
>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
>> > > >
>> > > > It would be great if we have an "official" approach for this as well
>> > > > as the naming convention for enum-like values ("MEMORY_ONLY" or
>> > > > "MemoryOnly"). Personally, I like 4) with "MEMORY_ONLY". Any
>> thoughts?
>> > > >
>> > > > Best,
>> > > > Xiangrui
>> > > >
>> > > > -
>> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > > > For additional commands, e-mail: dev-h...@spark.apache.org
>> > > >
>> > > >
>> > >
>> >
>>
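
A minimal sketch of option #4 as discussed above, with a caller that
pattern-matches on the values; marking the trait sealed is an assumption added
here (not something settled in the thread) so the compiler can flag
non-exhaustive matches:

// Sketch only: enum-like values as case objects nested in a same-named object.
sealed trait StorageLevel
object StorageLevel {
  case object MemoryOnly extends StorageLevel
  case object DiskOnly extends StorageLevel
}

// Example caller: the compiler warns if a value is left unhandled.
object StorageLevelDemo extends App {
  def describe(level: StorageLevel): String = level match {
    case StorageLevel.MemoryOnly => "keep blocks deserialized in memory"
    case StorageLevel.DiskOnly   => "spill blocks to local disk"
  }
  println(describe(StorageLevel.DiskOnly))
}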

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-05 Thread Patrick Wendell
Yes - only new or internal API's. I doubt we'd break any exposed APIs for
the purpose of clean up.

Patrick
On Mar 5, 2015 12:16 AM, "Mridul Muralidharan"  wrote:

> While I dont have any strong opinions about how we handle enum's
> either way in spark, I assume the discussion is targetted at (new) api
> being designed in spark.
> Rewiring what we already have exposed will lead to incompatible api
> change (StorageLevel for example, is in 1.0).
>
> Regards,
> Mridul
>
> On Wed, Mar 4, 2015 at 11:45 PM, Aaron Davidson 
> wrote:
> > That's kinda annoying, but it's just a little extra boilerplate. Can you
> > call it as StorageLevel.DiskOnly() from Java? Would it also work if they
> > were case classes with empty constructors, without the field?
> >
> > On Wed, Mar 4, 2015 at 11:35 PM, Xiangrui Meng  wrote:
> >
> >> `case object` inside an `object` doesn't show up in Java. This is the
> >> minimal code I found to make everything show up correctly in both
> >> Scala and Java:
> >>
> >> sealed abstract class StorageLevel // cannot be a trait
> >>
> >> object StorageLevel {
> >>   private[this] case object _MemoryOnly extends StorageLevel
> >>   final val MemoryOnly: StorageLevel = _MemoryOnly
> >>
> >>   private[this] case object _DiskOnly extends StorageLevel
> >>   final val DiskOnly: StorageLevel = _DiskOnly
> >> }
> >>
> >> On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell 
> >> wrote:
> >> > I like #4 as well and agree with Aaron's suggestion.
> >> >
> >> > - Patrick
> >> >
> >> > On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson 
> >> wrote:
> >> >> I'm cool with #4 as well, but make sure we dictate that the values
> >> should
> >> >> be defined within an object with the same name as the enumeration
> (like
> >> we
> >> >> do for StorageLevel). Otherwise we may pollute a higher namespace.
> >> >>
> >> >> e.g. we SHOULD do:
> >> >>
> >> >> trait StorageLevel
> >> >> object StorageLevel {
> >> >>   case object MemoryOnly extends StorageLevel
> >> >>   case object DiskOnly extends StorageLevel
> >> >> }
> >> >>
> >> >> On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust <
> >> mich...@databricks.com>
> >> >> wrote:
> >> >>
> >> >>> #4 with a preference for CamelCaseEnums
> >> >>>
> >> >>> On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley <
> jos...@databricks.com>
> >> >>> wrote:
> >> >>>
> >> >>> > another vote for #4
> >> >>> > People are already used to adding "()" in Java.
> >> >>> >
> >> >>> >
> >> >>> > On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch  >
> >> >>> wrote:
> >> >>> >
> >> >>> > > #4 but with MemoryOnly (more scala-like)
> >> >>> > >
> >> >>> > > http://docs.scala-lang.org/style/naming-conventions.html
> >> >>> > >
> >> >>> > > Constants, Values, Variable and Methods
> >> >>> > >
> >> >>> > > Constant names should be in upper camel case. That is, if the
> >> member is
> >> >>> > > final, immutable and it belongs to a package object or an
> object,
> >> it
> >> >>> may
> >> >>> > be
> >> >>> > > considered a constant (similar to Java's static final members):
> >> >>> > >
> >> >>> > >
> >> >>> > >1. object Container {
> >> >>> > >2. val MyConstant = ...
> >> >>> > >3. }
> >> >>> > >
> >> >>> > >
> >> >>> > > 2015-03-04 17:11 GMT-08:00 Xiangrui Meng :
> >> >>> > >
> >> >>> > > > Hi all,
> >> >>> > > >
> >> >>> > > > There are many places where we use enum-like types in Spark,
> but
> >> in
> >> >>> > > > different ways. Every approach has both pros and cons. I
> wonder
> >> >>> > > > whet

[VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-05 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.3.0!

The tag to be voted on is v1.3.0-rc3 (commit 4aaf48d4):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

Staging repositories for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1078

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/

Please vote on releasing this package as Apache Spark 1.3.0!

The vote is open until Monday, March 09, at 02:52 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== How does this compare to RC2 ==
This release includes the following bug fixes:

https://issues.apache.org/jira/browse/SPARK-6144
https://issues.apache.org/jira/browse/SPARK-6171
https://issues.apache.org/jira/browse/SPARK-5143
https://issues.apache.org/jira/browse/SPARK-6182
https://issues.apache.org/jira/browse/SPARK-6175

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by
taking a Spark 1.2 workload and running on this release candidate,
then reporting any regressions.

If you are happy with this release based on your own testing, give a +1 vote.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.3 QA period,
so -1 votes should only occur for significant regressions from 1.2.1.
Bugs already present in 1.2.X, minor regressions, or bugs related
to new features will not block this release.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-05 Thread Patrick Wendell
This vote is cancelled in favor of RC3.

On Wed, Mar 4, 2015 at 3:22 PM, Sean Owen  wrote:
> I think we will have to fix
> https://issues.apache.org/jira/browse/SPARK-5143 as well before the
> final 1.3.x.
>
> But yes everything else checks out for me, including sigs and hashes
> and building the source release.
>
> I have been following JIRA closely and am not aware of other blockers
> besides the ones already identified.
>
> On Wed, Mar 4, 2015 at 7:09 PM, Marcelo Vanzin  wrote:
>> -1 (non-binding) because of SPARK-6144.
>>
>> But aside from that I ran a set of tests on top of standalone and yarn
>> and things look good.
>>
>> On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell  wrote:
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 1.3.0!
>>>
>>> The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.3.0-rc2/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> Staging repositories for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1074/
>>> (published with version '1.3.0')
>>> https://repository.apache.org/content/repositories/orgapachespark-1075/
>>> (published with version '1.3.0-rc2')
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 1.3.0!
>>>
>>> The vote is open until Saturday, March 07, at 04:17 UTC and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.3.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> == How does this compare to RC1 ==
>>> This patch includes a variety of bug fixes found in RC1.
>>>
>>> == How can I help test this release? ==
>>> If you are a Spark user, you can help us test this release by
>>> taking a Spark 1.2 workload and running on this release candidate,
>>> then reporting any regressions.
>>>
>>> If you are happy with this release based on your own testing, give a +1 
>>> vote.
>>>
>>> == What justifies a -1 vote for this release? ==
>>> This vote is happening towards the end of the 1.3 QA period,
>>> so -1 votes should only occur for significant regressions from 1.2.1.
>>> Bugs already present in 1.2.X, minor regressions, or bugs related
>>> to new features will not block this release.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-06 Thread Patrick Wendell
I'll kick it off with a +1.

On Thu, Mar 5, 2015 at 6:52 PM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.3.0!
>
> The tag to be voted on is v1.3.0-rc3 (commit 4aaf48d4):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc3/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> Staging repositories for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1078
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/
>
> Please vote on releasing this package as Apache Spark 1.3.0!
>
> The vote is open until Monday, March 09, at 02:52 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == How does this compare to RC2 ==
> This release includes the following bug fixes:
>
> https://issues.apache.org/jira/browse/SPARK-6144
> https://issues.apache.org/jira/browse/SPARK-6171
> https://issues.apache.org/jira/browse/SPARK-5143
> https://issues.apache.org/jira/browse/SPARK-6182
> https://issues.apache.org/jira/browse/SPARK-6175
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.2 workload and running on this release candidate,
> then reporting any regressions.
>
> If you are happy with this release based on your own testing, give a +1 vote.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.3 QA period,
> so -1 votes should only occur for significant regressions from 1.2.1.
> Bugs already present in 1.2.X, minor regressions, or bugs related
> to new features will not block this release.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-06 Thread Patrick Wendell
Hey Sean,

> SPARK-5310 Update SQL programming guide for 1.3
> SPARK-5183 Document data source API
> SPARK-6128 Update Spark Streaming Guide for Spark 1.3

For these, the issue is that they are documentation JIRA's, which
don't need to be timed exactly with the release vote, since we can
update the documentation on the website whenever we want. In the past
I've just mentally filtered these out when considering RC's. I see a
few options here:

1. We downgrade such issues away from Blocker (more clear, but we risk
losing them in the fray if they really are things we want to have
before the release is posted).
2. We provide a filter to the community that excludes 'Documentation'
issues and shows all other blockers for 1.3. We can put this on the
wiki, for instance.

Which do you prefer?

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-06 Thread Patrick Wendell
Sean,

The docs are distributed and consumed in a fundamentally different way
than Spark code itself. So we've always considered the "deadline" for
doc changes to be when the release is finally posted.

If there are small inconsistencies with the docs present in the source
code for that release tag, IMO that doesn't matter much since we don't
even distribute the docs with Spark's binary releases and virtually no
one builds and hosts the docs on their own (that I am aware of, at
least). Perhaps we can recommend if people want to build the doc
sources that they should always grab the head of the most recent
release branch, to set expectations accordingly.

In the past we haven't considered it worth holding up the release
process for the purpose of the docs. It just doesn't make sense since
they are consumed "as a service". If we decide to change this
convention, it would mean shipping our releases later, since we
couldn't pipeline the doc finalization with voting.

- Patrick

On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen  wrote:
> Given the title and tagging, it sounds like there could be some
> must-have doc changes to go with what is being released as 1.3. It can
> be finished later, and published later, but then the docs source
> shipped with the release doesn't match the site, and until then, 1.3
> is released without some "must-have" docs for 1.3 on the site.
>
> The real question to me is: are there any further, absolutely
> essential doc changes that need to accompany 1.3 or not?
>
> If not, just resolve these. If there are, then it seems like the
> release has to block on them. If there are some docs that should have
> gone in for 1.3, but didn't, but aren't essential, well I suppose it
> bears thinking about how to not slip as much work, but it doesn't
> block.
>
> I think Documentation issues certainly can be a blocker and shouldn't
> be specially ignored.
>
>
> BTW the UISeleniumSuite issue is a real failure, but I do not think it
> is serious: http://issues.apache.org/jira/browse/SPARK-6205  It isn't
> a regression from 1.2.x, but only affects tests, and only affects a
> subset of build profiles.
>
>
>
>
> On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell  wrote:
>> Hey Sean,
>>
>>> SPARK-5310 Update SQL programming guide for 1.3
>>> SPARK-5183 Document data source API
>>> SPARK-6128 Update Spark Streaming Guide for Spark 1.3
>>
>> For these, the issue is that they are documentation JIRA's, which
>> don't need to be timed exactly with the release vote, since we can
>> update the documentation on the website whenever we want. In the past
>> I've just mentally filtered these out when considering RC's. I see a
>> few options here:
>>
>> 1. We downgrade such issues away from Blocker (more clear, but we risk
>> losing them in the fray if they really are things we want to have
>> before the release is posted).
>> 2. We provide a filter to the community that excludes 'Documentation'
>> issues and shows all other blockers for 1.3. We can put this on the
>> wiki, for instance.
>>
>> Which do you prefer?
>>
>> - Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-06 Thread Patrick Wendell
For now, I'll just put this as critical. We can discuss the
documentation stuff offline or in another thread.

On Fri, Mar 6, 2015 at 1:36 PM, Sean Owen  wrote:
> Although the problem is small, especially if indeed the essential docs
> changes are following just a couple days behind the final release, I
> mean, why the rush if they're essential? wait a couple days, finish
> them, make the release.
>
> Answer is, I think these changes aren't actually essential given the
> comment from tdas, so: just mark these Critical? (although ... they do
> say they're changes for the 1.3 release, so kind of funny to get to
> them for 1.3.x or 1.4, but that's not important now.)
>
> I thought that Blocker really meant Blocker in this project, as I've
> been encouraged to use it to mean "don't release without this." I
> think we should use it that way. Just thinking of it as "extra
> Critical" doesn't add anything. I don't think Documentation should be
> special-cased as less important, and I don't think there's confusion
> if Blocker means what it says, so I'd 'fix' that way.
>
> If nobody sees the Hive failure I observed, and if we can just zap
> those "Blockers" one way or the other, +1
>
>
> On Fri, Mar 6, 2015 at 9:17 PM, Patrick Wendell  wrote:
>> Sean,
>>
>> The docs are distributed and consumed in a fundamentally different way
>> than Spark code itself. So we've always considered the "deadline" for
>> doc changes to be when the release is finally posted.
>>
>> If there are small inconsistencies with the docs present in the source
>> code for that release tag, IMO that doesn't matter much since we don't
>> even distribute the docs with Spark's binary releases and virtually no
>> one builds and hosts the docs on their own (that I am aware of, at
>> least). Perhaps we can recommend if people want to build the doc
>> sources that they should always grab the head of the most recent
>> release branch, to set expectations accordingly.
>>
>> In the past we haven't considered it worth holding up the release
>> process for the purpose of the docs. It just doesn't make sense since
>> they are consumed "as a service". If we decide to change this
>> convention, it would mean shipping our releases later, since we
>> couldn't pipeline the doc finalization with voting.
>>
>> - Patrick
>>
>> On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen  wrote:
>>> Given the title and tagging, it sounds like there could be some
>>> must-have doc changes to go with what is being released as 1.3. It can
>>> be finished later, and published later, but then the docs source
>>> shipped with the release doesn't match the site, and until then, 1.3
>>> is released without some "must-have" docs for 1.3 on the site.
>>>
>>> The real question to me is: are there any further, absolutely
>>> essential doc changes that need to accompany 1.3 or not?
>>>
>>> If not, just resolve these. If there are, then it seems like the
>>> release has to block on them. If there are some docs that should have
>>> gone in for 1.3, but didn't, but aren't essential, well I suppose it
>>> bears thinking about how to not slip as much work, but it doesn't
>>> block.
>>>
>>> I think Documentation issues certainly can be a blocker and shouldn't
>>> be specially ignored.
>>>
>>>
>>> BTW the UISeleniumSuite issue is a real failure, but I do not think it
>>> is serious: http://issues.apache.org/jira/browse/SPARK-6205  It isn't
>>> a regression from 1.2.x, but only affects tests, and only affects a
>>> subset of build profiles.
>>>
>>>
>>>
>>>
>>> On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell  wrote:
>>>> Hey Sean,
>>>>
>>>>> SPARK-5310 Update SQL programming guide for 1.3
>>>>> SPARK-5183 Document data source API
>>>>> SPARK-6128 Update Spark Streaming Guide for Spark 1.3
>>>>
>>>> For these, the issue is that they are documentation JIRA's, which
>>>> don't need to be timed exactly with the release vote, since we can
>>>> update the documentation on the website whenever we want. In the past
>>>> I've just mentally filtered these out when considering RC's. I see a
>>>> few options here:
>>>>
>>>> 1. We downgrade such issues away from Blocker (more clear, but we risk
>>>> losing them in the fray if they really are things we want to have
>>>> before the release is posted).
>>>> 2. We provide a filter to the community that excludes 'Documentation'
>>>> issues and shows all other blockers for 1.3. We can put this on the
>>>> wiki, for instance.
>>>>
>>>> Which do you prefer?
>>>>
>>>> - Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Patrick Wendell
We probably want to revisit the way we do binaries in general for
1.4+. IMO, something worth forking a separate thread for.

I've been hesitating to add new binaries because people
(understandably) complain if you ever stop packaging older ones, but
on the other hand the ASF has complained that we have too many
binaries already and that we need to pare it down because of the large
volume of files. Doubling the number of binaries we produce for Scala
2.11 seemed like it would be too much.

One solution potentially is to actually package "Hadoop provided"
binaries and encourage users to use these by simply setting
HADOOP_HOME, or have instructions for specific distros. I've heard
that our existing packages don't work well on HDP for instance, since
there are some configuration quirks that differ from the upstream
Hadoop.

If we cut down on the cross building for Hadoop versions, then it is
more tenable to cross build for Scala versions without exploding the
number of binaries.

- Patrick

On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen  wrote:
> Yeah, interesting question of what is the better default for the
> single set of artifacts published to Maven. I think there's an
> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
> and cons discussed more at
>
> https://issues.apache.org/jira/browse/SPARK-5134
> https://github.com/apache/spark/pull/3917
>
> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia  wrote:
>> +1
>>
>> Tested it on Mac OS X.
>>
>> One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 
>> without Hive, which is kind of weird because people will more likely want 
>> Hadoop 2 with Hive. So it would be good to publish a build for that 
>> configuration instead. We can do it if we do a new RC, or it might be that 
>> binary builds may not need to be voted on (I forgot the details there).
>>
>> Matei

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Patrick Wendell
I think it's important to separate the goals from the implementation.
I agree with Matei on the goal - I think the goal needs to be to allow
people to download Apache Spark and use it with CDH, HDP, MapR,
whatever... This is the whole reason why HDFS and YARN have stable
API's, so that other projects can build on them in a way that works
across multiple versions. I wouldn't want to force users to upgrade
according only to some vendor timetable, that doesn't seem from the
ASF perspective like a good thing for the project. If users want to
get packages from Bigtop, or the vendors, that's totally fine too.

My point earlier was - I am not sure we are actually accomplishing
that goal now, because I've heard in some cases our "Hadoop 2.X"
packages actually don't work on certain distributions, even those that
are based on that Hadoop version. So one solution is to move towards
"bring your own Hadoop" binaries and have users just set HADOOP_HOME
and maybe document any vendor-specific configs that need to be set.
That also happens to solve the "too many binaries" problem, but only
incidentally.

- Patrick

On Sun, Mar 8, 2015 at 4:07 PM, Matei Zaharia  wrote:
> Our goal is to let people use the latest Apache release even if vendors fall 
> behind or don't want to package everything, so that's why we put out releases 
> for vendors' versions. It's fairly low overhead.
>
> Matei
>
>> On Mar 8, 2015, at 5:56 PM, Sean Owen  wrote:
>>
>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>> Maven artifacts.
>>
>> Patrick I see you just commented on SPARK-5134 and will follow up
>> there. Sounds like this may accidentally not be a problem.
>>
>> On binary tarball releases, I wonder if anyone has an opinion on my
>> opinion that these shouldn't be distributed for specific Hadoop
>> *distributions* to begin with. (Won't repeat the argument here yet.)
>> That resolves this n x m explosion too.
>>
>> Vendors already provide their own distribution, yes, that's their job.
>>
>>
>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar  wrote:
>>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
>>> Distributions X ...
>>>
>>> May be one option is to have a minimum basic set (which I know is what we
>>> are discussing) and move the rest to spark-packages.org. There the vendors
>>> can add the latest downloads - for example when 1.4 is released, HDP can
>>> build a release of HDP Spark 1.4 bundle.
>>>
>>> Cheers
>>> 
>>>
>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell  wrote:
>>>>
>>>> We probably want to revisit the way we do binaries in general for
>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>
>>>> I've been hesitating to add new binaries because people
>>>> (understandably) complain if you ever stop packaging older ones, but
>>>> on the other hand the ASF has complained that we have too many
>>>> binaries already and that we need to pare it down because of the large
>>>> volume of files. Doubling the number of binaries we produce for Scala
>>>> 2.11 seemed like it would be too much.
>>>>
>>>> One solution potentially is to actually package "Hadoop provided"
>>>> binaries and encourage users to use these by simply setting
>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>>> that our existing packages don't work well on HDP for instance, since
>>>> there are some configuration quirks that differ from the upstream
>>>> Hadoop.
>>>>
>>>> If we cut down on the cross building for Hadoop versions, then it is
>>>> more tenable to cross build for Scala versions without exploding the
>>>> number of binaries.
>>>>
>>>> - Patrick
>>>>
>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen  wrote:
>>>>> Yeah, interesting question of what is the better default for the
>>>>> single set of artifacts published to Maven. I think there's an
>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>> and cons discussed more at
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>> https://github.com/apache/spark/pull/3917
>>>>>
>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei 

Re: Block Transfer Service encryption support

2015-03-08 Thread Patrick Wendell
I think that yes, longer term we want to have encryption of all
communicated data. However Jeff, can you open a JIRA to discuss the
design before opening a pull request (it's fine to link to a WIP
branch if you'd like)? I'd like to better understand the performance
and operational complexity of using SSL for this in comparison with
alternatives. It would also be good to look at how the Hadoop
encryption works for their shuffle service, in terms of the design
decisions made there.

- Patrick

On Sun, Mar 8, 2015 at 5:42 PM, Jeff Turpin  wrote:
> I have already written most of the code, just finishing up the unit tests
> right now...
>
> Jeff
>
>
> On Sun, Mar 8, 2015 at 5:39 PM, Andrew Ash  wrote:
>
>> I'm interested in seeing this data transfer occurring over encrypted
>> communication channels as well.  Many customers require that all network
>> transfer occur encrypted to prevent the "soft underbelly" that's often
>> found inside a corporate network.
>>
>> On Fri, Mar 6, 2015 at 4:20 PM, turp1twin  wrote:
>>
>>> Is there a plan to implement SSL support for the Block Transfer Service
>>> (specifically, the NettyBlockTransferService implementation)? I can
>>> volunteer if needed...
>>>
>>> Jeff
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
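
To make the SSL option concrete, a heavily simplified, hypothetical sketch of
wiring TLS into a Netty pipeline; this is not Spark's NettyBlockTransferService
design (that is exactly what the proposed JIRA discussion is meant to settle),
and the class name and the externally initialized SSLContext are assumptions:

// Hypothetical sketch, not Spark's implementation: install an SslHandler in
// front of the existing handlers so all transferred bytes are encrypted.
import javax.net.ssl.SSLContext
import io.netty.channel.ChannelInitializer
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.SslHandler

// sslContext is assumed to be initialized elsewhere with key/trust material.
class TlsChannelInitializer(sslContext: SSLContext, isClient: Boolean)
  extends ChannelInitializer[SocketChannel] {

  override protected def initChannel(ch: SocketChannel): Unit = {
    val engine = sslContext.createSSLEngine()
    engine.setUseClientMode(isClient)
    // TLS sits first in the pipeline so later frame decoders only see plaintext.
    ch.pipeline().addFirst("ssl", new SslHandler(engine))
  }
}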

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Patrick Wendell
Hey All,

Today there was a JIRA posted with an observed regression around Spark
Streaming during certain recovery scenarios:

https://issues.apache.org/jira/browse/SPARK-6222

My preference is to go ahead and ship this release (RC3) as-is and if
this issue is isolated and resolved soon, we can make a patch release in
the next week or two.

At some point, the cost of continuing to hold the release re/vote is
so high that it's better to just ship the release. We can document
known issues and point users to a fix once it's available. We did this
in 1.2.0 as well (there were two small known issues) and I think as a
point of process, this approach is necessary given the size of the
project.

I wanted to notify this thread though, in case this changes anyone's
opinion on their release vote. I will leave the thread open at least
until the end of today.

Still +1 on RC3, for me.

- Patrick

On Mon, Mar 9, 2015 at 9:36 AM, Denny Lee  wrote:
> +1 (non-binding)
>
> Spark Standalone and YARN on Hadoop 2.6 on OSX plus various tests (MLLib,
> SparkSQL, etc.)
>
> On Mon, Mar 9, 2015 at 9:18 AM Tom Graves 
> wrote:
>>
>> +1. Built from source and ran Spark on yarn on hadoop 2.6 in cluster and
>> client mode.
>> Tom
>>
>>  On Thursday, March 5, 2015 8:53 PM, Patrick Wendell
>>  wrote:
>>
>>
>>  Please vote on releasing the following candidate as Apache Spark version
>> 1.3.0!
>>
>> The tag to be voted on is v1.3.0-rc2 (commit 4aaf48d4):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> Staging repositories for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1078
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.3.0!
>>
>> The vote is open until Monday, March 09, at 02:52 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == How does this compare to RC2 ==
>> This release includes the following bug fixes:
>>
>> https://issues.apache.org/jira/browse/SPARK-6144
>> https://issues.apache.org/jira/browse/SPARK-6171
>> https://issues.apache.org/jira/browse/SPARK-5143
>> https://issues.apache.org/jira/browse/SPARK-6182
>> https://issues.apache.org/jira/browse/SPARK-6175
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.2 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> If you are happy with this release based on your own testing, give a +1
>> vote.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.3 QA period,
>> so -1 votes should only occur for significant regressions from 1.2.1.
>> Bugs already present in 1.2.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Cross cutting internal changes to launch scripts

2015-03-09 Thread Patrick Wendell
Hey All,

Marcelo Vanzin has been working on a patch for a few months that
performs cross-cutting clean-up and fixes to the way that Spark's
launch scripts work (including PySpark, spark-submit, the daemon
scripts, etc.). The changes won't modify any public APIs in terms of
how those scripts are invoked.

Historically, such patches have been difficult to test due to the
number of interactions between components and interactions with
external environments. I'd like to welcome people to test and/or code
review this patch in their own environment. This patch is in the
very late stages of review and will likely be merged soon into master
(eventually 1.4).

https://github.com/apache/spark/pull/3916/files

I'll ping this thread again once it is merged and we can establish a
JIRA to encapsulate any issues. Just wanted to give a heads up as this
is one of the larger internal changes we've made to this
infrastructure since Spark 1.0.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-09 Thread Patrick Wendell
Does this matter for our own internal types in Spark? I don't think
any of these types are designed to be used in RDD records, for
instance.

On Mon, Mar 9, 2015 at 6:25 PM, Aaron Davidson  wrote:
> Perhaps the problem with Java enums that was brought up was actually that
> their hashCode is not stable across JVMs, as it depends on the memory
> location of the enum itself.
>
> On Mon, Mar 9, 2015 at 6:15 PM, Imran Rashid  wrote:
>
>> Can you expand on the serde issues w/ java enum's at all?  I haven't heard
>> of any problems specific to enums.  The java object serialization rules
>> seem very clear and it doesn't seem like different jvms should have a
>> choice on what they do:
>>
>>
>> http://docs.oracle.com/javase/6/docs/platform/serialization/spec/serial-arch.html#6469
>>
>> (in a nutshell, serialization must use enum.name())
>>
>> of course there are plenty of ways the user could screw this up (e.g. rename
>> the enums, or change their meaning, or remove them).  But then again, all
>> of java serialization has issues the user has to be aware
>> of.  E.g., if we go with case objects, then java serialization blows up if
>> you add another helper method, even if that helper method is completely
>> compatible.
>>
>> Some prior debate in the scala community:
>>
>> https://groups.google.com/d/msg/scala-internals/8RWkccSRBxQ/AN5F_ZbdKIsJ
>>
>> SO post on which version to use in scala:
>>
>>
>> http://stackoverflow.com/questions/1321745/how-to-model-type-safe-enum-types
>>
>> SO post about the macro-craziness people try to add to scala to make them
>> almost as good as a simple java enum:
>> (NB: the accepted answer doesn't actually work in all cases ...)
>>
>>
>> http://stackoverflow.com/questions/20089920/custom-scala-enum-most-elegant-version-searched
>>
>> Another proposal to add better enums built into scala ... but seems to be
>> dormant:
>>
>> https://groups.google.com/forum/#!topic/scala-sips/Bf82LxK02Kk
>>
>>
>>
>> On Thu, Mar 5, 2015 at 10:49 PM, Mridul Muralidharan 
>> wrote:
>>
>> >   I have a strong dislike for java enum's due to the fact that they
>> > are not stable across JVM's - if it undergoes serde, you end up with
>> > unpredictable results at times [1].
>> > One of the reasons why we prevent enum's from being key : though it is
>> > highly possible users might depend on it internally and shoot
>> > themselves in the foot.
>> >
>> > Would be better to keep away from them in general and use something more
>> > stable.
>> >
>> > Regards,
>> > Mridul
>> >
>> > [1] Having had to debug this issue for 2 weeks - I really really hate it.
>> >
>> >
>> > On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid 
>> wrote:
>> > > I have a very strong dislike for #1 (scala enumerations).   I'm ok with
>> > #4
>> > > (with Xiangrui's final suggestion, especially making it sealed &
>> > available
>> > > in Java), but I really think #2, java enums, are the best option.
>> > >
>> > > Java enums actually have some very real advantages over the other
>> > > approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There
>> > has
>> > > been endless debate in the Scala community about the problems with the
>> > > approaches in Scala.  Very smart, level-headed Scala gurus have
>> > complained
>> > > about their short-comings (Rex Kerr's name is coming to mind, though
>> I'm
>> > > not positive about that); there have been numerous well-thought out
>> > > proposals to give Scala a better enum.  But the powers-that-be in Scala
>> > > always reject them.  IIRC the explanation for rejecting is basically
>> that
>> > > (a) enums aren't important enough for introducing some new special
>> > feature,
>> > > scala's got bigger things to work on and (b) if you really need a good
>> > > enum, just use java's enum.
>> > >
>> > > I doubt it really matters that much for Spark internals, which is why I
>> > > think #4 is fine.  But I figured I'd give my spiel, because every
>> > developer
>> > > loves language wars :)
>> > >
>> > > Imran
>> > >
>> > >
>> > >
>> > > On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng 
>> wrote:
>> > >

[RESULT] [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-10 Thread Patrick Wendell
This vote passes with 13 +1 votes (6 binding) and no 0 or -1 votes:

+1 (13):
Patrick Wendell*
Marcelo Vanzin
Krishna Sankar
Sean Owen*
Matei Zaharia*
Sandy Ryza
Tom Graves*
Sean McNamara*
Denny Lee
Kostas Sakellis
Joseph Bradley*
Corey Nolet
GuoQiang Li

0:
-1:

I will finalize the release notes and packaging and will post the
release in the next two days.

- Patrick

On Mon, Mar 9, 2015 at 11:51 PM, GuoQiang Li  wrote:
> I'm sorry, this is my mistake. :)
>
>
> -- Original Message ------
> From: "Patrick Wendell";
> Sent: Tuesday, March 10, 2015, 2:20 PM
> To: "GuoQiang Li";
> Subject: Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
>
> Thanks! But please e-mail the dev list and not just me personally :)
>
> On Mon, Mar 9, 2015 at 11:08 PM, GuoQiang Li  wrote:
>> +1 (non-binding)
>>
>> Test on Mac OS X 10.10.2 and CentOS 6.5
>>
>>
>> -- Original --
>> From:  "Patrick Wendell";;
>> Date:  Fri, Mar 6, 2015 10:52 AM
>> To:  "dev@spark.apache.org";
>> Subject:  [VOTE] Release Apache Spark 1.3.0 (RC3)
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.3.0!
>>
>> The tag to be voted on is v1.3.0-rc2 (commit 4aaf48d4):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> Staging repositories for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1078
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.3.0!
>>
>> The vote is open until Monday, March 09, at 02:52 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == How does this compare to RC2 ==
>> This release includes the following bug fixes:
>>
>> https://issues.apache.org/jira/browse/SPARK-6144
>> https://issues.apache.org/jira/browse/SPARK-6171
>> https://issues.apache.org/jira/browse/SPARK-5143
>> https://issues.apache.org/jira/browse/SPARK-6182
>> https://issues.apache.org/jira/browse/SPARK-6175
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.2 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> If you are happy with this release based on your own testing, give a +1
>> vote.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.3 QA period,
>> so -1 votes should only occur for significant regressions from 1.2.1.
>> Bugs already present in 1.2.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Patrick Wendell
Hi All,

I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is
the fourth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-3-0.html
[2] http://spark.apache.org/downloads.html

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: May we merge into branch-1.3 at this point?

2015-03-13 Thread Patrick Wendell
Hey Sean,

Yes, go crazy. Once we close the release vote, it's open season to
merge backports into that release.

- Patrick

On Fri, Mar 13, 2015 at 9:31 AM, Mridul Muralidharan  wrote:
> Who is managing 1.3 release ? You might want to coordinate with them before
> porting changes to branch.
>
> Regards
> Mridul
>
> On Friday, March 13, 2015, Sean Owen  wrote:
>
>> Yeah, I'm guessing that is all happening quite literally as we speak.
>> The Apache git tag is the one of reference:
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>>
>> Open season on 1.3 branch then...
>>
>> On Fri, Mar 13, 2015 at 4:20 PM, Nicholas Chammas
>> > wrote:
>> > Looks like the release is out:
>> > http://spark.apache.org/releases/spark-release-1-3-0.html
>> >
>> > Though, interestingly, I think we are missing the appropriate v1.3.0 tag:
>> > https://github.com/apache/spark/releases
>> >
>> > Nick
>> >
>> > On Fri, Mar 13, 2015 at 6:07 AM Sean Owen > > wrote:
>> >>
>> >> Is the release certain enough that we can resume merging into
>> >> branch-1.3 at this point? I have a number of back-ports queued up and
>> >> didn't want to merge in case another last RC was needed. I see a few
>> >> commits to the branch though.
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>> For additional commands, e-mail: dev-h...@spark.apache.org 
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Wrong version on the Spark documentation page

2015-03-15 Thread Patrick Wendell
Cheng - what if you hold shift+refresh? For me the /latest link
correctly points to 1.3.0

On Sun, Mar 15, 2015 at 10:40 AM, Cheng Lian  wrote:
> It's still marked as 1.2.1 here http://spark.apache.org/docs/latest/
>
> But this page is updated (1.3.0)
> http://spark.apache.org/docs/latest/index.html
>
> Cheng
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-16 Thread Patrick Wendell
Hey Xiangrui,

Do you want to write up a straw man proposal based on this line of discussion?

- Patrick
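
As a small sketch of the concern, and of the Enum.name() workaround Kevin
describes below, using java.util.concurrent.TimeUnit as a stand-in Java enum
(sc is a SparkContext; the data is made up):

import java.util.concurrent.TimeUnit

val events = sc.parallelize(Seq(
  TimeUnit.SECONDS -> 1L, TimeUnit.MINUTES -> 1L, TimeUnit.SECONDS -> 1L))

// Partitioning here depends on Enum.hashCode(), which is identity-based and
// can differ from one JVM to the next:
val unstable = events.reduceByKey(_ + _)

// Keying by Enum.name() uses String.hashCode, which is specified and stable:
val stable = events.map { case (unit, n) => (unit.name(), n) }.reduceByKey(_ + _)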

On Mon, Mar 16, 2015 at 12:12 PM, Kevin Markey  wrote:
> In some applications, I have rather heavy use of Java enums which are needed
> for related Java APIs that the application uses.  And unfortunately, they
> are also used as keys.  As such, using the native hashcodes makes any
> function over keys unstable and unpredictable, so we now use Enum.name() as
> the key instead.  Oh well.  But it works and seems to work well.
>
> Kevin
>
>
> On 03/05/2015 09:49 PM, Mridul Muralidharan wrote:
>>
>>I have a strong dislike for java enum's due to the fact that they
>> are not stable across JVM's - if it undergoes serde, you end up with
>> unpredictable results at times [1].
>> One of the reasons why we prevent enum's from being key : though it is
>> highly possible users might depend on it internally and shoot
>> themselves in the foot.
>>
>> Would be better to keep away from them in general and use something more
>> stable.
>>
>> Regards,
>> Mridul
>>
>> [1] Having had to debug this issue for 2 weeks - I really really hate it.
>>
>>
>> On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid  wrote:
>>>
>>> I have a very strong dislike for #1 (scala enumerations).   I'm ok with
>>> #4
>>> (with Xiangrui's final suggestion, especially making it sealed &
>>> available
>>> in Java), but I really think #2, java enums, are the best option.
>>>
>>> Java enums actually have some very real advantages over the other
>>> approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There
>>> has
>>> been endless debate in the Scala community about the problems with the
>>> approaches in Scala.  Very smart, level-headed Scala gurus have
>>> complained
>>> about their short-comings (Rex Kerr's name is coming to mind, though I'm
>>> not positive about that); there have been numerous well-thought out
>>> proposals to give Scala a better enum.  But the powers-that-be in Scala
>>> always reject them.  IIRC the explanation for rejecting is basically that
>>> (a) enums aren't important enough for introducing some new special
>>> feature,
>>> scala's got bigger things to work on and (b) if you really need a good
>>> enum, just use java's enum.
>>>
>>> I doubt it really matters that much for Spark internals, which is why I
>>> think #4 is fine.  But I figured I'd give my spiel, because every
>>> developer
>>> loves language wars :)
>>>
>>> Imran
>>>
>>>
>>>
>>> On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng  wrote:
>>>
>>>> `case object` inside an `object` doesn't show up in Java. This is the
>>>> minimal code I found to make everything show up correctly in both
>>>> Scala and Java:
>>>>
>>>> sealed abstract class StorageLevel // cannot be a trait
>>>>
>>>> object StorageLevel {
>>>>private[this] case object _MemoryOnly extends StorageLevel
>>>>final val MemoryOnly: StorageLevel = _MemoryOnly
>>>>
>>>>private[this] case object _DiskOnly extends StorageLevel
>>>>final val DiskOnly: StorageLevel = _DiskOnly
>>>> }
>>>>
>>>> On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell 
>>>> wrote:
>>>>>
>>>>> I like #4 as well and agree with Aaron's suggestion.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson 
>>>>
>>>> wrote:
>>>>>>
>>>>>> I'm cool with #4 as well, but make sure we dictate that the values
>>>>
>>>> should
>>>>>>
>>>>>> be defined within an object with the same name as the enumeration
>>>>>> (like
>>>>
>>>> we
>>>>>>
>>>>>> do for StorageLevel). Otherwise we may pollute a higher namespace.
>>>>>>
>>>>>> e.g. we SHOULD do:
>>>>>>
>>>>>> trait StorageLevel
>>>>>> object StorageLevel {
>>>>>>case object MemoryOnly extends StorageLevel
>>>>>>case object DiskOnly extends StorageLevel
>>>>>> }
>>>>>>
>>>>>> On Wed, Mar

Re: enum-like types in Spark

2015-03-23 Thread Patrick Wendell
If the official solution from the Scala community is to use Java
enums, then it seems strange they aren't generated in scaladoc? Maybe
we can just fix that w/ Typesafe's help and then we can use them.
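
For reference, a stripped-down sketch of the case-object pattern Imran
describes below, showing the by-hand values/valueOf bookkeeping that a Java
enum gives for free (the status names are just for illustration):

sealed abstract class StageStatus

object StageStatus {
  case object Active extends StageStatus
  case object Complete extends StageStatus
  case object Failed extends StageStatus

  // Both of these come for free with a Java enum, but here they must be kept
  // in sync by hand every time a new status is added:
  val values: Seq[StageStatus] = Seq(Active, Complete, Failed)

  def fromString(s: String): StageStatus =
    values.find(_.toString.equalsIgnoreCase(s)).getOrElse(
      throw new IllegalArgumentException("Unknown stage status: " + s))
}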

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen  wrote:
> Yeah the fully realized #4, which gets back the ability to use it in
> switch statements (? in Scala but not Java?) does end up being kind of
> huge.
>
> I confess I'm swayed a bit back to Java enums, seeing what it
> involves. The hashCode() issue can be 'solved' with the hash of the
> String representation.
>
> On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid  wrote:
>> I've just switched some of my code over to the new format, and I just want
>> to make sure everyone realizes what we are getting into.  I went from 10
>> lines as java enums
>>
>> https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
>>
>> to 30 lines with the new format:
>>
>> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
>>
>> it's not just that it's verbose: each name has to be repeated 4 times, with
>> potential typos in some locations that won't be caught by the compiler.
>> Also, you have to manually maintain the "values" as you update the set of
>> enums; the compiler won't do it for you.
>>
>> The only downside I've heard for java enums is enum.hashcode().  OTOH, the
>> downsides for this version are: maintainability / verbosity, no values(),
>> more cumbersome to use from java, no enum map / enumset.
>>
>> I did put together a little util to at least get back the equivalent of
>> enum.valueOf() with this format
>>
>> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
>>
>> I'm not trying to prevent us from moving forward on this, its fine if this
>> is still what everyone wants, but I feel pretty strongly java enums make
>> more sense.
>>
>> thanks,
>> Imran
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Any guidance on when to back port and how far?

2015-03-24 Thread Patrick Wendell
My philosophy has been basically what you suggested, Sean. One thing
you didn't mention, though, is that if a bug fix seems complicated, I will
think very hard before back-porting it. This is because "fixes" can
introduce their own new bugs, in some cases worse than the original
issue. It's really bad to have someone upgrade to a patch release and see
a regression - with our current approach this almost never happens.

I will usually try to backport up to N-2, if it can be back-ported
reasonably easily (for instance, with minor or no code changes). The
reason I do this is that vendors do end up supporting older versions,
and it's nice for them if some committer has backported a fix that
they can then pull in, even if we never ship it.

In terms of doing older maintenance releases, I think we should do
these according to the severity of the issues (for instance, if there is a
security issue) or based on general demand from the community. I
haven't initiated many 1.X.2 releases recently because I didn't see
huge demand. However, personally I don't mind doing these if there is
a lot of demand, at least for releases where ".0" has gone out in the
last six months.

On Tue, Mar 24, 2015 at 11:23 AM, Michael Armbrust
 wrote:
> Two other criteria that I use when deciding what to backport:
>  - Is it a regression from a previous minor release?  I'm much more likely
> to backport fixes in this case, as I'd love for most people to stay up to
> date.
>  - How scary is the change?  I think the primary goal is stability of the
> maintenance branches.  When I am confident that something is isolated and
> unlikely to break things (i.e. I'm fixing a confusing error message), then
> i'm much more likely to backport it.
>
> Regarding the length of time to continue backporting, I mostly don't
> backport to N-1, but this is partially because SQL is changing too fast for
> that to generally be useful.  These old branches usually only get attention
> from me when there is an explicit request.
>
> I'd love to hear more feedback from others.
>
> Michael
>
> On Tue, Mar 24, 2015 at 6:13 AM, Sean Owen  wrote:
>
>> So far, my rule of thumb has been:
>>
>> - Don't back-port new features or improvements in general, only bug fixes
>> - Don't back-port minor bug fixes
>> - Back-port bug fixes that seem important enough to not wait for the
>> next minor release
>> - Back-port site doc changes to the release most likely to go out
>> next, to make it a part of the next site publish
>>
>> But, how far should back-ports go, in general? If the last minor
>> release was 1.N, then to branch 1.N surely. Farther back is a question
>> of expectation for support of past minor releases. Given the pace of
>> change and time available, I assume there's not much support for
>> continuing to use release 1.(N-1) and very little for 1.(N-2).
>>
>> Concretely: does anyone expect a 1.1.2 release ever? a 1.2.2 release?
>> It'd be good to hear the received wisdom explicitly.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
Yeah - to Nick's point, I think the way to do this is to pass in a
custom conf when you create a Hadoop RDD (that's AFAIK why the conf
field is there). Is there anything you can't do with that feature?
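
For illustration, a minimal sketch of that approach: build a JobConf for just
this one read, starting from the context-wide Hadoop configuration, and hand
it to hadoopRDD (the path and split size are placeholders):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Per-RDD settings live in this JobConf and don't touch sc.hadoopConfiguration.
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.min.split.size", "12345")
FileInputFormat.setInputPaths(jobConf, "/some/path")

val rdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])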

On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
 wrote:
> Imran, on your point to read multiple files together in a partition, is it
> not simpler to use the approach of copy Hadoop conf and set per-RDD
> settings for min split to control the input size per partition, together
> with something like CombineFileInputFormat?
>
> On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid  wrote:
>
>> I think this would be a great addition, I totally agree that you need to be
>> able to set these at a finer context than just the SparkContext.
>>
>> Just to play devil's advocate, though -- the alternative is for you just
>> subclass HadoopRDD yourself, or make a totally new RDD, and then you could
>> expose whatever you need.  Why is this solution better?  IMO the criteria
>> are:
>> (a) common operations
>> (b) error-prone / difficult to implement
>> (c) non-obvious, but important for performance
>>
>> I think this case fits (a) & (c), so I think its still worthwhile.  But its
>> also worth asking whether or not its too difficult for a user to extend
>> HadoopRDD right now.  There have been several cases in the past week where
>> we've suggested that a user should read from hdfs themselves (eg., to read
>> multiple files together in one partition) -- with*out* reusing the code in
>> HadoopRDD, though they would lose things like the metric tracking &
>> preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
>> refactoring to make that easier to do?  Or do we just need a good example?
>>
>> Imran
>>
>> (sorry for hijacking your thread, Koert)
>>
>>
>>
>> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers  wrote:
>>
>> > see email below. reynold suggested i send it to dev instead of user
>> >
>> > -- Forwarded message --
>> > From: Koert Kuipers 
>> > Date: Mon, Mar 23, 2015 at 4:36 PM
>> > Subject: hadoop input/output format advanced control
>> > To: "u...@spark.apache.org" 
>> >
>> >
>> > currently its pretty hard to control the Hadoop Input/Output formats used
>> > in Spark. The conventions seems to be to add extra parameters to all
>> > methods and then somewhere deep inside the code (for example in
>> > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
>> into
>> > settings on the Hadoop Configuration object.
>> >
>> > for example for compression i see "codec: Option[Class[_ <:
>> > CompressionCodec]] = None" added to a bunch of methods.
>> >
>> > how scalable is this solution really?
>> >
>> > for example i need to read from a hadoop dataset and i dont want the
>> input
>> > (part) files to get split up. the way to do this is to set
>> > "mapred.min.split.size". now i dont want to set this at the level of the
>> > SparkContext (which can be done), since i dont want it to apply to input
>> > formats in general. i want it to apply to just this one specific input
>> > dataset i need to read. which leaves me with no options currently. i
>> could
>> > go add yet another input parameter to all the methods
>> > (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
>> > etc.). but that seems ineffective.
>> >
>> > why can we not expose a Map[String, String] or some other generic way to
>> > manipulate settings for hadoop input/output formats? it would require
>> > adding one more parameter to all methods to deal with hadoop input/output
>> > formats, but after that its done. one parameter to rule them all
>> >
>> > then i could do:
>> > val x = sc.textFile("/some/path", formatSettings =
>> > Map("mapred.min.split.size" -> "12345"))
>> >
>> > or
>> > rdd.saveAsTextFile("/some/path, formatSettings =
>> > Map(mapred.output.compress" -> "true", "mapred.output.compression.codec"
>> ->
>> > "somecodec"))
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Experience using binary packages on various Hadoop distros

2015-03-24 Thread Patrick Wendell
Hey All,

For a while we've published binary packages with different Hadoop
clients pre-bundled. We currently have three interfaces to a Hadoop
cluster: (a) the HDFS client, (b) the YARN client, and (c) the Hive client.

Because (a) and (b) are supposed to be backwards-compatible
interfaces, my working assumption was that for the most part (modulo
Hive) our packages work with *newer* Hadoop versions. For instance,
our Hadoop 2.4 package should work with HDFS 2.6 and YARN 2.6.
However, I have heard murmurings that these are not compatible in
practice.

So I have three questions I'd like to put out to the community:

1. Have people had difficulty using 2.4 packages with newer Hadoop
versions? If so, what specific incompatibilities have you hit?
2. Have people had issues using our binary Hadoop packages in general
with commercial or Apache Hadoop distro's, such that you have to build
from source?
3. How would people feel about publishing a "bring your own Hadoop"
binary, where you are required to point us to a local Hadoop
distribution by setting HADOOP_HOME? This might be better for ensuring
full compatibility:
https://issues.apache.org/jira/browse/SPARK-6511

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Experience using binary packages on various Hadoop distros

2015-03-24 Thread Patrick Wendell
We can probably better explain that if you are not using HDFS or YARN,
you can download any binary.

However, my question was about whether the existing binaries work
well with newer Hadoop versions; I have heard some people suggest they do not, but
I'm looking for more specific issues.

On Tue, Mar 24, 2015 at 4:16 PM, Jey Kottalam  wrote:
> Could we gracefully fall back to an in-tree Hadoop binary (e.g. 1.0.4)
> in that case? I think many new Spark users are confused about why
> Spark has anything to do with Hadoop, e.g. I could see myself being
> confused when the download page asks me to select a "package type". I
> know that what I want is not "source code", but I'd have no idea how
> to choose amongst the apparently multiple types of binaries.
>
> On Tue, Mar 24, 2015 at 2:28 PM, Matei Zaharia  
> wrote:
>> Just a note, one challenge with the BYOH version might be that users who 
>> download that can't run in local mode without also having Hadoop. But if we 
>> describe it correctly then hopefully it's okay.
>>
>> Matei
>>
>>> On Mar 24, 2015, at 3:05 PM, Patrick Wendell  wrote:
>>>
>>> Hey All,
>>>
>>> For a while we've published binary packages with different Hadoop
>>> clients pre-bundled. We currently have three interfaces to a Hadoop
>>> cluster: (a) the HDFS client, (b) the YARN client, and (c) the Hive client.
>>>
>>> Because (a) and (b) are supposed to be backwards-compatible
>>> interfaces, my working assumption was that for the most part (modulo
>>> Hive) our packages work with *newer* Hadoop versions. For instance,
>>> our Hadoop 2.4 package should work with HDFS 2.6 and YARN 2.6.
>>> However, I have heard murmurings that these are not compatible in
>>> practice.
>>>
>>> So I have three questions I'd like to put out to the community:
>>>
>>> 1. Have people had difficulty using 2.4 packages with newer Hadoop
>>> versions? If so, what specific incompatibilities have you hit?
>>> 2. Have people had issues using our binary Hadoop packages in general
>>> with commercial or Apache Hadoop distro's, such that you have to build
>>> from source?
>>> 3. How would people feel about publishing a "bring your own Hadoop"
>>> binary, where you are required to point us to a local Hadoop
>>> distribution by setting HADOOP_HOME? This might be better for ensuring
>>> full compatibility:
>>> https://issues.apache.org/jira/browse/SPARK-6511
>>>
>>> - Patrick
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
I see - if you look, in the saving functions we have the option for
the user to pass an arbitrary Configuration.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894

It seems fine to have the same option for the loading functions, if
it's easy to just pass this config into the input format.
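
For example, a sketch of the save side: the new-API variant takes a per-call
Configuration, which is where settings like the compression property from
Koert's example below could go (rdd is assumed to be an RDD[(LongWritable,
Text)]; the path is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Copy the context-wide configuration and override it only for this write.
val outConf = new Configuration(sc.hadoopConfiguration)
outConf.set("mapred.output.compress", "true")

rdd.saveAsNewAPIHadoopFile("/some/output/path",
  classOf[LongWritable], classOf[Text],
  classOf[TextOutputFormat[LongWritable, Text]], outConf)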



On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers  wrote:
> the (compression) codec parameter that is now part of many saveAs... methods
> came from a similar need. see SPARK-763
> hadoop has many options like this. you either going to have to allow many
> more of these optional arguments to all the methods that read from hadoop
> inputformats and write to hadoop outputformats, or you force people to
> re-create these methods using HadoopRDD, i think (if thats even possible).
>
> On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers  wrote:
>>
>> i would like to use objectFile with some tweaks to the hadoop conf.
>> currently there is no way to do that, except recreating objectFile myself.
>> and some of the code objectFile uses i have no access to, since its private
>> to spark.
>>
>>
>> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell 
>> wrote:
>>>
>>> Yeah - to Nick's point, I think the way to do this is to pass in a
>>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
>>> field is there). Is there anything you can't do with that feature?
>>>
>>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
>>>  wrote:
>>> > Imran, on your point to read multiple files together in a partition, is
>>> > it
>>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>>> > settings for min split to control the input size per partition,
>>> > together
>>> > with something like CombineFileInputFormat?
>>> >
>>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>>> > wrote:
>>> >
>>> >> I think this would be a great addition, I totally agree that you need
>>> >> to be
>>> >> able to set these at a finer context than just the SparkContext.
>>> >>
>>> >> Just to play devil's advocate, though -- the alternative is for you
>>> >> just
>>> >> subclass HadoopRDD yourself, or make a totally new RDD, and then you
>>> >> could
>>> >> expose whatever you need.  Why is this solution better?  IMO the
>>> >> criteria
>>> >> are:
>>> >> (a) common operations
>>> >> (b) error-prone / difficult to implement
>>> >> (c) non-obvious, but important for performance
>>> >>
>>> >> I think this case fits (a) & (c), so I think its still worthwhile.
>>> >> But its
>>> >> also worth asking whether or not its too difficult for a user to
>>> >> extend
>>> >> HadoopRDD right now.  There have been several cases in the past week
>>> >> where
>>> >> we've suggested that a user should read from hdfs themselves (eg., to
>>> >> read
>>> >> multiple files together in one partition) -- with*out* reusing the
>>> >> code in
>>> >> HadoopRDD, though they would lose things like the metric tracking &
>>> >> preferred locations you get from HadoopRDD.  Does HadoopRDD need to
>>> >> some
>>> >> refactoring to make that easier to do?  Or do we just need a good
>>> >> example?
>>> >>
>>> >> Imran
>>> >>
>>> >> (sorry for hijacking your thread, Koert)
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
>>> >> wrote:
>>> >>
>>> >> > see email below. reynold suggested i send it to dev instead of user
>>> >> >
>>> >> > -- Forwarded message --
>>> >> > From: Koert Kuipers 
>>> >> > Date: Mon, Mar 23, 2015 at 4:36 PM
>>> >> > Subject: hadoop input/output format advanced control
>>> >> > To: "u...@spark.apache.org" 
>>> >> >
>>> >> >
>>> >> > currently its pretty hard to control the Hadoop Input/Output formats
>>> >> > used
>>> >> > in Spark. The conventions seems to be to add extra parameters to all
>>> >> > methods and then somewhere deep 

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Yeah I agree that might have been nicer, but I think for consistency
with the input APIs maybe we should do the same thing. We can also
give an example of how to clone sc.hadoopConfiguration and then set
some new values:

val conf = sc.hadoopConfiguration.clone()
  .set("k1", "v1")
  .set("k2", "v2")

val rdd = sc.objectFile(..., conf)

I have no idea if that's the correct syntax, but something like that
seems almost as easy as passing a hashmap with deltas.

- Patrick

On Wed, Mar 25, 2015 at 6:34 AM, Koert Kuipers  wrote:
> my personal preference would be something like a Map[String, String] that
> only reflects the changes you want to make the Configuration for the given
> input/output format (so system wide defaults continue to come from
> sc.hadoopConfiguration), similarly to what cascading/scalding did, but am
> arbitrary Configuration will work too.
>
> i will make a jira and pullreq when i have some time.
>
>
>
> On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell  wrote:
>>
>> I see - if you look, in the saving functions we have the option for
>> the user to pass an arbitrary Configuration.
>>
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894
>>
>> It seems fine to have the same option for the loading functions, if
>> it's easy to just pass this config into the input format.
>>
>>
>>
>> On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers  wrote:
>> > the (compression) codec parameter that is now part of many saveAs...
>> > methods
>> > came from a similar need. see SPARK-763
>> > hadoop has many options like this. you either going to have to allow
>> > many
>> > more of these optional arguments to all the methods that read from
>> > hadoop
>> > inputformats and write to hadoop outputformats, or you force people to
>> > re-create these methods using HadoopRDD, i think (if thats even
>> > possible).
>> >
>> > On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers 
>> > wrote:
>> >>
>> >> i would like to use objectFile with some tweaks to the hadoop conf.
>> >> currently there is no way to do that, except recreating objectFile
>> >> myself.
>> >> and some of the code objectFile uses i have no access to, since its
>> >> private
>> >> to spark.
>> >>
>> >>
>> >> On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell 
>> >> wrote:
>> >>>
>> >>> Yeah - to Nick's point, I think the way to do this is to pass in a
>> >>> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
>> >>> field is there). Is there anything you can't do with that feature?
>> >>>
>> >>> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
>> >>>  wrote:
>> >>> > Imran, on your point to read multiple files together in a partition,
>> >>> > is
>> >>> > it
>> >>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>> >>> > settings for min split to control the input size per partition,
>> >>> > together
>> >>> > with something like CombineFileInputFormat?
>> >>> >
>> >>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>> >>> > wrote:
>> >>> >
>> >>> >> I think this would be a great addition, I totally agree that you
>> >>> >> need
>> >>> >> to be
>> >>> >> able to set these at a finer context than just the SparkContext.
>> >>> >>
>> >>> >> Just to play devil's advocate, though -- the alternative is for you
>> >>> >> just
>> >>> >> subclass HadoopRDD yourself, or make a totally new RDD, and then
>> >>> >> you
>> >>> >> could
>> >>> >> expose whatever you need.  Why is this solution better?  IMO the
>> >>> >> criteria
>> >>> >> are:
>> >>> >> (a) common operations
>> >>> >> (b) error-prone / difficult to implement
>> >>> >> (c) non-obvious, but important for performance
>> >>> >>
>> >>> >> I think this case fits (a) & (c), so I think its still worthwhile.
>> >>> >> But its
>> >>> >> also worth asking whether or not

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Great - that's even easier. Maybe we could have a simple example in the doc.
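
Something along these lines, perhaps (a sketch only; the path and split size
are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Clone the context-wide Hadoop configuration and override settings for just
// this input, as Sandy suggests below.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("mapred.min.split.size", "12345")

// The new-API load variant accepts a per-call Configuration.
val rdd = sc.newAPIHadoopFile("/some/path", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], conf)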

On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza  wrote:
> Regarding Patrick's question, you can just do "new Configuration(oldConf)"
> to get a cloned Configuration object and add any new properties to it.
>
> -Sandy
>
> On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid  wrote:
>
>> Hi Nick,
>>
>> I don't remember the exact details of these scenarios, but I think the user
>> wanted a lot more control over how the files got grouped into partitions,
>> to group the files together by some arbitrary function.  I didn't think
>> that was possible w/ CombineFileInputFormat, but maybe there is a way?
>>
>> thanks
>>
>> On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath 
>> wrote:
>>
>> > Imran, on your point to read multiple files together in a partition, is
>> it
>> > not simpler to use the approach of copy Hadoop conf and set per-RDD
>> > settings for min split to control the input size per partition, together
>> > with something like CombineFileInputFormat?
>> >
>> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
>> > wrote:
>> >
>> > > I think this would be a great addition, I totally agree that you need
>> to
>> > be
>> > > able to set these at a finer context than just the SparkContext.
>> > >
>> > > Just to play devil's advocate, though -- the alternative is for you
>> just
>> > > subclass HadoopRDD yourself, or make a totally new RDD, and then you
>> > could
>> > > expose whatever you need.  Why is this solution better?  IMO the
>> criteria
>> > > are:
>> > > (a) common operations
>> > > (b) error-prone / difficult to implement
>> > > (c) non-obvious, but important for performance
>> > >
>> > > I think this case fits (a) & (c), so I think its still worthwhile.  But
>> > its
>> > > also worth asking whether or not its too difficult for a user to extend
>> > > HadoopRDD right now.  There have been several cases in the past week
>> > where
>> > > we've suggested that a user should read from hdfs themselves (eg., to
>> > read
>> > > multiple files together in one partition) -- with*out* reusing the code
>> > in
>> > > HadoopRDD, though they would lose things like the metric tracking &
>> > > preferred locations you get from HadoopRDD.  Does HadoopRDD need to
>> some
>> > > refactoring to make that easier to do?  Or do we just need a good
>> > example?
>> > >
>> > > Imran
>> > >
>> > > (sorry for hijacking your thread, Koert)
>> > >
>> > >
>> > >
>> > > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
>> > wrote:
>> > >
>> > > > see email below. reynold suggested i send it to dev instead of user
>> > > >
>> > > > -- Forwarded message --
>> > > > From: Koert Kuipers 
>> > > > Date: Mon, Mar 23, 2015 at 4:36 PM
>> > > > Subject: hadoop input/output format advanced control
>> > > > To: "u...@spark.apache.org" 
>> > > >
>> > > >
>> > > > currently its pretty hard to control the Hadoop Input/Output formats
>> > used
>> > > > in Spark. The conventions seems to be to add extra parameters to all
>> > > > methods and then somewhere deep inside the code (for example in
>> > > > PairRDDFunctions.saveAsHadoopFile) all these parameters get
>> translated
>> > > into
>> > > > settings on the Hadoop Configuration object.
>> > > >
>> > > > for example for compression i see "codec: Option[Class[_ <:
>> > > > CompressionCodec]] = None" added to a bunch of methods.
>> > > >
>> > > > how scalable is this solution really?
>> > > >
>> > > > for example i need to read from a hadoop dataset and i dont want the
>> > > input
>> > > > (part) files to get split up. the way to do this is to set
>> > > > "mapred.min.split.size". now i dont want to set this at the level of
>> > the
>> > > > SparkContext (which can be done), since i dont want it to apply to
>> > input
>> > > > formats in general. i want it to apply to just this one specific
>> input
>> > > > dataset i need to read. which leaves me with no options currently. i
>> > > could
>> > > > go add yet another input parameter to all the methods
>> > > > (SparkContext.textFile, SparkContext.hadoopFile,
>> > SparkContext.objectFile,
>> > > > etc.). but that seems ineffective.
>> > > >
>> > > > why can we not expose a Map[String, String] or some other generic way
>> > to
>> > > > manipulate settings for hadoop input/output formats? it would require
>> > > > adding one more parameter to all methods to deal with hadoop
>> > input/output
>> > > > formats, but after that its done. one parameter to rule them all
>> > > >
>> > > > then i could do:
>> > > > val x = sc.textFile("/some/path", formatSettings =
>> > > > Map("mapred.min.split.size" -> "12345"))
>> > > >
>> > > > or
>> > > > rdd.saveAsTextFile("/some/path, formatSettings =
>> > > > Map(mapred.output.compress" -> "true",
>> > "mapred.output.compression.codec"
>> > > ->
>> > > > "somecodec"))
>> > > >
>> > >
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Patrick Wendell
I think we have a version of mapPartitions that allows you to tell
Spark the partitioning is preserved:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639

We could also add a map function that does the same. Or you can just write
your map using an iterator.

- Patrick
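
For example, a sketch using the r3 RDD from the example quoted further below:
the same transformation written against the partition iterator, with
preservesPartitioning = true, keeps r3's partitioner so the second
reduceByKey no longer needs a shuffle.

// Equivalent to r3.map(x => (x._1, x._2 + 1)), but tells Spark the keys are
// untouched, so the partitioner is carried over.
val r4 = r3.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },
  preservesPartitioning = true)

val r5 = r4.reduceByKey(_ + _)  // no extra shuffle: r4 kept r3's partitioner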

On Thu, Mar 26, 2015 at 3:07 PM, Jonathan Coveney  wrote:
> This is just a deficiency of the api, imo. I agree: mapValues could
> definitely be a function (K, V)=>V1. The option isn't set by the function,
> it's on the RDD. So you could look at the code and do this.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
>
>  def mapValues[U](f: V => U): RDD[(K, U)] = {
> val cleanF = self.context.clean(f)
> new MapPartitionsRDD[(K, U), (K, V)](self,
>   (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
>   preservesPartitioning = true)
>   }
>
> What you want:
>
>  def mapValues[U](f: (K, V) => U): RDD[(K, U)] = {
> val cleanF = self.context.clean(f)
> new MapPartitionsRDD[(K, U), (K, V)](self,
>   (context, pid, iter) => iter.map { case t@(k, _) => (k, cleanF(t)) },
>   preservesPartitioning = true)
>   }
>
> One of the nice things about spark is that making such new operators is very
> easy :)
>
> 2015-03-26 17:54 GMT-04:00 Zhan Zhang :
>
>> Thanks Jonathan. You are right about rewriting the example.
>>
>> I mean providing such an option to developers so that it is controllable. The
>> example may seem silly, and I don't know the use cases.
>>
>> But, for example, if I also want to operate on both the key and the value to
>> generate some new value while keeping the key untouched, then mapValues may
>> not be able to do this.
>>
>> Changing the code to allow this is trivial, but I don't know whether there
>> is some special reason behind this.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>>
>>
>>
>> On Mar 26, 2015, at 2:49 PM, Jonathan Coveney  wrote:
>>
>> I believe if you do the following:
>>
>>
>> sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString
>>
>> (8) MapPartitionsRDD[34] at reduceByKey at :23 []
>>  |  MapPartitionsRDD[33] at mapValues at :23 []
>>  |  ShuffledRDD[32] at reduceByKey at :23 []
>>  +-(8) MapPartitionsRDD[31] at map at :23 []
>> |  ParallelCollectionRDD[30] at parallelize at :23 []
>>
>> The difference is that spark has no way to know that your map closure
>> doesn't change the key. if you only use mapValues, it does. Pretty cool that
>> they optimized that :)
>>
>> 2015-03-26 17:44 GMT-04:00 Zhan Zhang :
>>>
>>> Hi Folks,
>>>
>>> Does anybody know the reason for not allowing preservesPartitioning
>>> in RDD.map? Am I missing something here?
>>>
>>> The following example involves two shuffles. I think if preservesPartitioning
>>> were allowed, we could avoid the second one, right?
>>>
>>>  val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
>>>  val r2 = r1.map((_, 1))
>>>  val r3 = r2.reduceByKey(_+_)
>>>  val r4 = r3.map(x=>(x._1, x._2 + 1))
>>>  val r5 = r4.reduceByKey(_+_)
>>>  r5.collect.foreach(println)
>>>
>>> scala> r5.toDebugString
>>> res2: String =
>>> (8) ShuffledRDD[4] at reduceByKey at :29 []
>>>  +-(8) MapPartitionsRDD[3] at map at :27 []
>>> |  ShuffledRDD[2] at reduceByKey at :25 []
>>> +-(8) MapPartitionsRDD[1] at map at :23 []
>>>|  ParallelCollectionRDD[0] at parallelize at :21 []
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Unit test logs in Jenkins?

2015-04-01 Thread Patrick Wendell
Hey Marcelo,

Great question. Right now, some of the more active developers have an
account that allows them to log into this cluster to inspect logs (we
copy the logs from each run to a node on that cluster). The
infrastructure is maintained by the AMPLab.

I will put you in touch with someone there who can get you an account.

This is a short term solution. The longer term solution is to have
these scp'd regularly to an S3 bucket or somewhere people can get
access to them, but that's not ready yet.

- Patrick



On Wed, Apr 1, 2015 at 1:01 PM, Marcelo Vanzin  wrote:
> Hey all,
>
> Is there a way to access unit test logs in jenkins builds? e.g.,
> core/target/unit-tests.log
>
> That would be really helpful to debug build failures. The scalatest
> output isn't all that helpful.
>
> If that's currently not available, would it be possible to add those
> logs as build artifacts?
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.3.1

2015-04-04 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1080

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/

Please vote on releasing this package as Apache Spark 1.3.1!

The vote is open until Wednesday, April 08, at 01:10 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.2

2015-04-05 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.2!

The tag to be voted on is v1.2.2-rc1 (commit 7531b50):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7c684272b2e27

The list of fixes present in this release can be found at:
http://bit.ly/1DCNddt

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.2-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1082/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.2-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.2!

The vote is open until Thursday, April 08, at 00:30 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Patrick Wendell
I believe TD just forgot to set the fix version on the JIRA. There is
a fix for this in 1.3:

https://github.com/apache/spark/commit/03e263f5b527cf574f4ffcd5cd886f7723e3756e

- Patrick

On Mon, Apr 6, 2015 at 2:31 PM, Mark Hamstra  wrote:
> Is that correct, or is the JIRA just out of sync, since TD's PR was merged?
> https://github.com/apache/spark/pull/5008
>
> On Mon, Apr 6, 2015 at 11:10 AM, Hari Shreedharan
>  wrote:
>>
>> It does not look like https://issues.apache.org/jira/browse/SPARK-6222
>> made it. It was targeted towards this release.
>>
>>
>>
>>
>> Thanks, Hari
>>
>> On Mon, Apr 6, 2015 at 11:04 AM, York, Brennon
>>  wrote:
>>
>> > +1 (non-binding)
>> > Tested GraphX, build infrastructure, & core test suite on OSX 10.9 w/
>> > Java
>> > 1.7/1.8
>> > On 4/6/15, 5:21 AM, "Sean Owen"  wrote:
>> >>SPARK-6673 is not, in the end, relevant for 1.3.x I believe; we just
>> >>resolved it for 1.4 anyway. False alarm there.
>> >>
>> >>I back-ported SPARK-6205 into the 1.3 branch for next time. We'll pick
>> >>it up if there's another RC, but by itself is not something that needs
>> >>a new RC. (I will give the same treatment to branch 1.2 if needed in
>> >>light of the 1.2.2 release.)
>> >>
>> >>I applied the simple change in SPARK-6205 in order to continue
>> >>executing tests and all was well. I still see a few failures in Hive
>> >>tests:
>> >>
>> >>- show_create_table_serde *** FAILED ***
>> >>- show_tblproperties *** FAILED ***
>> >>- udf_std *** FAILED ***
>> >>- udf_stddev *** FAILED ***
>> >>
>> >>with ...
>> >>
>> >>mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0
>> >>-DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive
>> >>-Phive-0.13.1 -Dhadoop.version=2.6.0 test
>> >>
>> >>... but these are not regressions from 1.3.0.
>> >>
>> >>+1 from me at this point on the current artifacts.
>> >>
>> >>On Sun, Apr 5, 2015 at 9:24 AM, Sean Owen  wrote:
>> >>> Signatures and hashes are good.
>> >>> LICENSE, NOTICE still check out.
>> >>> Compiles for a Hadoop 2.6 + YARN + Hive profile.
>> >>>
>> >>> I still see the UISeleniumSuite test failure observed in 1.3.0, which
>> >>> is minor and already fixed. I don't know why I didn't back-port it:
>> >>> https://issues.apache.org/jira/browse/SPARK-6205
>> >>>
>> >>> If we roll another, let's get this easy fix in, but it is only an
>> >>> issue with tests.
>> >>>
>> >>>
>> >>> On JIRA, I checked open issues with Fix Version = 1.3.0 or 1.3.1 and
>> >>> all look legitimate (e.g. reopened or in progress)
>> >>>
>> >>>
>> >>> There is 1 open Blocker for 1.3.1 per Andrew:
>> >>> https://issues.apache.org/jira/browse/SPARK-6673 spark-shell.cmd can't
>> >>> start even when spark was built in Windows
>> >>>
>> >>> I believe this can be resolved quickly but as a matter of hygiene
>> >>> should be fixed or demoted before release.
>> >>>
>> >>>
>> >>> FYI there are 16 Critical issues marked for 1.3.0 / 1.3.1; worth
>> >>> examining before release to see how critical they are:
>> >>>
>> >>> SPARK-6701: Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python
>> >>> application (unassigned, Open, 4/3/15)
>> >>> SPARK-6484: Ganglia metrics xml reporter doesn't escape correctly
>> >>> (Josh Rosen, Open, 3/24/15)
>> >>> SPARK-6270: Standalone Master hangs when streaming job completes
>> >>> (unassigned, Open, 3/11/15)
>> >>> SPARK-6209: ExecutorClassLoader can leak connections after failing to
>> >>> load classes from the REPL class server (Josh Rosen, In Progress, 4/2/15)
>> >>> SPARK-5113: Audit and document use of hostnames and IP addresses in
>> >>> Spark (unassigned, Open, 3/24/15)
>> >>> SPARK-5098: Number of running tasks become negative after tasks lost
>> >>> (unassigned, Open, 1/14/15)
>> >>> SPARK-4925: Publish Spark SQL hive-thriftserver maven artifact
>> >>> (Patrick Wendell, Reopened, 3/23/15)
>> >>> SPARK-4922: Support dynamic allocation for coarse-grained
>> >>

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
What if you don't run zinc? I.e. just download maven and run that "mvn
package...". It might take longer, but I wonder if it will work.
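Roughly, something like this (the Maven version below is only an
example - any reasonably recent Maven 3.x should work - and the
-Dscala-2.11 flag is assumed since that is the build being attempted):

  # Grab a standalone Maven and invoke it directly, so build/mvn never
  # launches zinc.
  wget https://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
  tar xzf apache-maven-3.2.5-bin.tar.gz
  apache-maven-3.2.5/bin/mvn -Dscala-2.11 -DskipTests clean package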

On Mon, Apr 6, 2015 at 10:26 PM, mjhb  wrote:
> Similar problem on 1.2 branch:
>
> [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve
> dependencies for project
> org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts
> could not be resolved:
> org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
> org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to
> find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
> http://repository.apache.org/snapshots was cached in the local repository,
> resolution will not be reattempted until the update interval of
> apache.snapshots has elapsed or updates are forced -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal on project spark-core_2.11: Could not resolve dependencies for project
> org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts
> could not be resolved:
> org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
> org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to
> find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
> http://repository.apache.org/snapshots was cached in the local repository,
> resolution will not be reattempted until the update interval of
> apache.snapshots has elapsed or updates are forced
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11442.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
The issue is that if you invoke "build/mvn", it will start zinc again
if it sees that it has been killed.

The absolute most "sterile" thing to do is this:
1. Kill any zinc processes.
2. Clean up spark "git clean -fdx" (WARNING: this will delete any
staged changes you have, if you have code modifications or extra files
around)
3. Run the 2.11 script to change the versions.
4. Run "mvn package" with maven that you installed on your machine.
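A rough command sketch of those four steps (the version-change script
name and the -Dscala-2.11 flag are assumptions based on the 1.3-era
build docs; add your usual profiles):

  pkill -f zinc                     # 1. blunt, but stops any leftover zinc/nailgun server
  git clean -fdx                    # 2. WARNING: removes anything not committed
  dev/change-version-to-2.11.sh     # 3. rewrite the POMs for Scala 2.11
  mvn -Dscala-2.11 -DskipTests clean package   # 4. system-installed mvn, not build/mvn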


On Mon, Apr 6, 2015 at 10:43 PM, Marty Bower  wrote:
> I'm killing zinc (if it's running) before running each build attempt.
>
> Trying to build as "clean" as possible.
>
>
> On Mon, Apr 6, 2015 at 7:31 PM Patrick Wendell  wrote:
>>
>> What if you don't run zinc? I.e. just download maven and run that "mvn
>> package...". It might take longer, but I wonder if it will work.
>>
>> On Mon, Apr 6, 2015 at 10:26 PM, mjhb  wrote:
>> > Similar problem on 1.2 branch:
>> >
>> > [ERROR] Failed to execute goal on project spark-core_2.11: Could not
>> > resolve
>> > dependencies for project
>> > org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following
>> > artifacts
>> > could not be resolved:
>> > org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
>> > org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure
>> > to
>> > find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
>> > http://repository.apache.org/snapshots was cached in the local
>> > repository,
>> > resolution will not be reattempted until the update interval of
>> > apache.snapshots has elapsed or updates are forced -> [Help 1]
>> > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
>> > execute
>> > goal on project spark-core_2.11: Could not resolve dependencies for
>> > project
>> > org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following
>> > artifacts
>> > could not be resolved:
>> > org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
>> > org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure
>> > to
>> > find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
>> > http://repository.apache.org/snapshots was cached in the local
>> > repository,
>> > resolution will not be reattempted until the update interval of
>> > apache.snapshots has elapsed or updates are forced
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11442.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> > Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
One thing that I think can cause issues is running build/mvn with
Scala 2.10 and then trying to run it with 2.11, since I think we store
some downloaded jars relating to zinc that can end up in a bad state.
Not sure that's what is happening here, just an idea.
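If that is the problem, clearing the tools that build/mvn downloaded is
harmless, since they get re-fetched on the next run (the paths below
are an assumption about where build/mvn keeps its downloads):

  # Remove the cached zinc and scala distributions under build/
  rm -rf build/zinc-* build/scala-*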

On Mon, Apr 6, 2015 at 10:54 PM, Patrick Wendell  wrote:
> The issue is that if you invoke "build/mvn", it will start zinc again
> if it sees that it has been killed.
>
> The absolute most "sterile" thing to do is this:
> 1. Kill any zinc processes.
> 2. Clean up spark "git clean -fdx" (WARNING: this will delete any
> staged changes you have, if you have code modifications or extra files
> around)
> 3. Run the 2.11 script to change the versions.
> 4. Run "mvn package" with maven that you installed on your machine.
>
>
> On Mon, Apr 6, 2015 at 10:43 PM, Marty Bower  wrote:
>> I'm killing zinc (if it's running) before running each build attempt.
>>
>> Trying to build as "clean" as possible.
>>
>>
>> On Mon, Apr 6, 2015 at 7:31 PM Patrick Wendell  wrote:
>>>
>>> What if you don't run zinc? I.e. just download maven and run that "mvn
>>> package...". It might take longer, but I wonder if it will work.
>>>
>>> On Mon, Apr 6, 2015 at 10:26 PM, mjhb  wrote:
>>> > Similar problem on 1.2 branch:
>>> >
>>> > [ERROR] Failed to execute goal on project spark-core_2.11: Could not
>>> > resolve
>>> > dependencies for project
>>> > org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following
>>> > artifacts
>>> > could not be resolved:
>>> > org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
>>> > org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure
>>> > to
>>> > find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
>>> > http://repository.apache.org/snapshots was cached in the local
>>> > repository,
>>> > resolution will not be reattempted until the update interval of
>>> > apache.snapshots has elapsed or updates are forced -> [Help 1]
>>> > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
>>> > execute
>>> > goal on project spark-core_2.11: Could not resolve dependencies for
>>> > project
>>> > org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following
>>> > artifacts
>>> > could not be resolved:
>>> > org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
>>> > org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure
>>> > to
>>> > find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
>>> > http://repository.apache.org/snapshots was cached in the local
>>> > repository,
>>> > resolution will not be reattempted until the update interval of
>>> > apache.snapshots has elapsed or updates are forced
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> > http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11442.html
>>> > Sent from the Apache Spark Developers List mailing list archive at
>>> > Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>> >

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
The only thing that can persist outside of Spark is a live zinc
process. We took care to make sure this is a generally stateless
mechanism.
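A quick way to check whether a zinc server is still around (the port is
an assumption - build/mvn's default unless ZINC_PORT was overridden):

  lsof -i :3030     # anything listening here is most likely the zinc/nailgun server
  pkill -f zinc     # blunt, but clears it if something is found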

Both the 1.2.x and 1.3.x releases are built with Scala 2.11 for
packaging purposes, and those builds have run as recently as the last
few days, since we are voting on 1.2.2 and 1.3.1 right now. However,
there could be issues that only affect certain environments.

- Patrick

On Mon, Apr 6, 2015 at 11:02 PM, mjhb  wrote:
> I resorted to deleting the spark directory between each build earlier today
> (attempting maximum sterility) and then re-cloning from github and switching
> to the 1.2 or 1.3 branch.
>
> Does anything persist outside of the spark directory?
>
> Are you able to build either 1.2 or 1.3 w/ Scala-2.11?
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11447.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
Hmm... Make sure you are building with the right flags. I think you need
to pass -Dscala-2.11 to maven. Take a look at the upstream docs - I'm on
my phone right now so I can't easily access them.
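From memory it is roughly this (the script name and profiles are
assumptions - double-check against the building-spark docs for the
branch you are on):

  dev/change-version-to-2.11.sh
  mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package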
On Apr 7, 2015 1:01 AM, "mjhb"  wrote:

> I even deleted my local maven repository (.m2) but am still stuck when
> attempting to build w/ Scala-2.11:
>
> [ERROR] Failed to execute goal on project spark-core_2.11: Could not
> resolve
> dependencies for project
> org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following
> artifacts
> could not be resolved:
> org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT,
> org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not
> find artifact org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT
> in apache.snapshots (http://repository.apache.org/snapshots) -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal on project spark-core_2.11: Could not resolve dependencies for project
> org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following
> artifacts
> could not be resolved:
> org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT,
> org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not
> find artifact org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT
> in apache.snapshots (http://repository.apache.org/snapshots)
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11449.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>

