Re: Signing releases with pwendell or release manager's key?

2017-09-17 Thread Patrick Wendell
Spark's release pipeline is automated, and part of that automation includes
securely injecting this key for the purpose of signing. I asked the ASF to
provide a service account key several years ago, but they suggested that we
use a key attributed to an individual even if the process is automated.

I believe other projects that release with high frequency also have
automated the signing process.

This key is injected during the build process. A really ambitious release
manager could reverse engineer this in a way that reveals the private key;
however, if someone is a release manager, they can already do quite a few
nefarious things anyway.

It is true that we trust all previous release managers instead of only one.
We could probably rotate the Jenkins credentials periodically to compensate
for this, if we think this is a nontrivial risk.

- Patrick

On Sun, Sep 17, 2017 at 7:04 PM, Holden Karau  wrote:

> Would any of Patrick/Josh/Shane (or other PMC folks with
> understanding/opinions on this setup) care to comment? If this is a
> blocking issue I can cancel the current release vote thread while we
> discuss this some more.
>
> On Fri, Sep 15, 2017 at 5:18 PM Holden Karau  wrote:
>
>> Oh yes and to keep people more informed I've been updating a PR for the
>> release documentation as I go to write down some of this unwritten
>> knowledge -- https://github.com/apache/spark-website/pull/66
>>
>>
>> On Fri, Sep 15, 2017 at 5:12 PM Holden Karau 
>> wrote:
>>
>>> Also continuing the discussion from the vote threads, Shane probably has
>>> the best idea on the ACLs for Jenkins so I've CC'd him as well.
>>>
>>>
>>> On Fri, Sep 15, 2017 at 5:09 PM Holden Karau 
>>> wrote:
>>>
 Changing the release jobs, beyond the available parameters, right now
 depends on Josh Rosen, as there are some scripts which generate the jobs
 which aren't public. I've done temporary fixes in the past with the Python
 packaging but my understanding is that in the medium term it requires
 access to the scripts.

 So +CC Josh.

 On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:

> I think this needs to be fixed. It's true that there are barriers to
> publication, but the signature is what we use to authenticate Apache
> releases.
>
> If Patrick's key is available on Jenkins for any Spark committer to
> use, then the chances of a compromise are much higher than for a normal RM
> key.
>
> rb
>
> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen 
> wrote:
>
>> Yeah I had meant to ask about that in the past. While I presume
>> Patrick consents to this and all that, it does mean that anyone with 
>> access
>> to said Jenkins scripts can create a signed Spark release, regardless of
>> who they are.
>>
>> I haven't thought through whether that's a theoretical issue we can
>> ignore or something we need to fix up. For example you can't get a 
>> release
>> on the ASF mirrors without more authentication.
>>
>> How hard would it be to make the script take in a key? It sort of
>> looks like the script already takes GPG_KEY, but I don't know how to modify
>> the jobs. I suppose it would be ideal, in any event, for the actual 
>> release
>> manager to sign.
>>
>> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau 
>> wrote:
>>
>>> That's a good question. I built the release candidate; however, the
>>> Jenkins scripts don't take a parameter for configuring who signs them --
>>> they always sign with Patrick's key. You can see this from previous
>>> releases, which were managed by other folks but still signed by Patrick.
>>>
>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue 
>>> wrote:
>>>
 The signature is valid, but why was the release signed with Patrick
 Wendell's private key? Did Patrick build the release candidate?

>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
 --
 Twitter: https://twitter.com/holdenkarau

>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Patrick Wendell
One thing we could do is modify the release tooling to allow the key to be
injected each time, thus allowing any RM to insert their own key at build
time.
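
As a rough, untested sketch of what the signing step might look like (the
GPG_KEY variable and artifact names here are just placeholders; the real job
may expose different parameters):

    # Sign every artifact with the key the RM supplies, instead of a hard-coded one.
    GPG_KEY="${GPG_KEY:?set to the release manager's key ID}"
    for artifact in spark-*.tgz; do
      gpg --armor --detach-sign --local-user "$GPG_KEY" --output "$artifact.asc" "$artifact"
    done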

Patrick

On Mon, Sep 18, 2017 at 4:56 PM Ryan Blue  wrote:

> I don't understand why it is necessary to share a release key. If this is
> something that can be automated in a Jenkins job, then can it be a script
> with a reasonable set of build requirements for Mac and Ubuntu? That's the
> approach I've seen the most in other projects.
>
> I'm also not just concerned about release managers. Having a key stored
> persistently on outside infrastructure adds the most risk, as Luciano noted
> as well. We should also start publishing checksums in the Spark VOTE
> thread, which are currently missing. The risk I'm concerned about is that
> if the key were compromised, it would be possible to replace binaries with
> perfectly valid ones, at least on some mirrors. If the Apache copy were
> replaced, then we wouldn't even be able to catch that it had happened.
> Given the high profile of Spark and the number of companies that run it, I
> think we need to take extra care to make sure that can't happen, even if it
> is an annoyance for the release managers.
>
> On Sun, Sep 17, 2017 at 10:12 PM, Patrick Wendell 
> wrote:
>
>> Spark's release pipeline is automated and part of that automation includes
>> securely injecting this key for the purpose of signing. I asked the ASF to
>> provide a service account key several years ago but they suggested that we
>> use a key attributed to an individual even if the process is automated.
>>
>> I believe other projects that release with high frequency also have
>> automated the signing process.
>>
>> This key is injected during the build process. A really ambitious release
>> manager could reverse engineer this in a way that reveals the private key,
>> however if someone is a release manager then they themselves can do quite a
>> bit of nefarious things anyways.
>>
>> It is true that we trust all previous release managers instead of only
>> one. We could probably rotate the jenkins credentials periodically in order
>> to compensate for this, if we think this is a nontrivial risk.
>>
>> - Patrick
>>
>> On Sun, Sep 17, 2017 at 7:04 PM, Holden Karau 
>> wrote:
>>
>>> Would any of Patrick/Josh/Shane (or other PMC folks with
>>> understanding/opinions on this setup) care to comment? If this is a
>>> blocking issue I can cancel the current release vote thread while we
>>> discuss this some more.
>>>
>>> On Fri, Sep 15, 2017 at 5:18 PM Holden Karau 
>>> wrote:
>>>
>>>> Oh yes and to keep people more informed I've been updating a PR for the
>>>> release documentation as I go to write down some of this unwritten
>>>> knowledge -- https://github.com/apache/spark-website/pull/66
>>>>
>>>>
>>>> On Fri, Sep 15, 2017 at 5:12 PM Holden Karau 
>>>> wrote:
>>>>
>>>>> Also continuing the discussion from the vote threads, Shane probably
>>>>> has the best idea on the ACLs for Jenkins so I've CC'd him as well.
>>>>>
>>>>>
>>>>> On Fri, Sep 15, 2017 at 5:09 PM Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> Changing the release jobs, beyond the available parameters, right now
>>>>>> depends on Josh Rosen, as there are some scripts which generate the jobs
>>>>>> which aren't public. I've done temporary fixes in the past with the 
>>>>>> Python
>>>>>> packaging but my understanding is that in the medium term it requires
>>>>>> access to the scripts.
>>>>>>
>>>>>> So +CC Josh.
>>>>>>
>>>>>> On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:
>>>>>>
>>>>>>> I think this needs to be fixed. It's true that there are barriers to
>>>>>>> publication, but the signature is what we use to authenticate Apache
>>>>>>> releases.
>>>>>>>
>>>>>>> If Patrick's key is available on Jenkins for any Spark committer to
>>>>>>> use, then the chances of a compromise are much higher than for a normal 
>>>>>>> RM
>>>>>>> key.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen 
>>>>>

Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Patrick Wendell
Hey, I talked more with Josh Rosen about this; he has helped with automation
since I became less involved in release management.

I can think of a few different things that would improve our RM process based
on these suggestions:

(1) We could remove the signing step from the rest of the automation and ask
the RM to sign the artifacts locally as a last step (a rough sketch follows
below). This does mean we'd trust the RM's environment not to be compromised,
but it could be better if there is concern about centralization of risk. I'm
curious how other projects do this.

(2) We could rotate the RM position. BTW Holden Karau is doing this and
that's how this whole discussion started.

(3) We should make sure all build tooling automation is in the repo itself
so that the build is 100% reproducible by anyone. I think most of it is
already in dev/ [1], but there might be Jenkins configs, etc., that could be
put into the Spark repo.

[1] https://github.com/apache/spark/tree/master/dev/create-release
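
To make suggestion (1) concrete, a minimal sketch of the RM's local last step
(artifact names are placeholders, and the exact checksum tool would depend on
the RM's Mac/Ubuntu environment):

    # The RM signs the staged artifacts with their own key and produces checksums.
    for f in spark-*.tgz; do
      gpg --armor --detach-sign "$f"      # writes $f.asc
      shasum -a 512 "$f" > "$f.sha512"    # checksum to publish in the VOTE thread
    done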

- Patrick

On Mon, Sep 18, 2017 at 6:23 PM, Patrick Wendell 
wrote:

> One thing we could do is modify the release tooling to allow the key to be
> injected each time, thus allowing any RM to insert their own key at build
> time.
>
> Patrick
>
> On Mon, Sep 18, 2017 at 4:56 PM Ryan Blue  wrote:
>
>> I don't understand why it is necessary to share a release key. If this is
>> something that can be automated in a Jenkins job, then can it be a script
>> with a reasonable set of build requirements for Mac and Ubuntu? That's the
>> approach I've seen the most in other projects.
>>
>> I'm also not just concerned about release managers. Having a key stored
>> persistently on outside infrastructure adds the most risk, as Luciano noted
>> as well. We should also start publishing checksums in the Spark VOTE
>> thread, which are currently missing. The risk I'm concerned about is that
>> if the key were compromised, it would be possible to replace binaries with
>> perfectly valid ones, at least on some mirrors. If the Apache copy were
>> replaced, then we wouldn't even be able to catch that it had happened.
>> Given the high profile of Spark and the number of companies that run it, I
>> think we need to take extra care to make sure that can't happen, even if it
>> is an annoyance for the release managers.
>>
>> On Sun, Sep 17, 2017 at 10:12 PM, Patrick Wendell wrote:
>>
>>> Spark's release pipeline is automated and part of that automation
>>> includes securely injecting this key for the purpose of signing. I asked
>>> the ASF to provide a service account key several years ago but they
>>> suggested that we use a key attributed to an individual even if the process
>>> is automated.
>>>
>>> I believe other projects that release with high frequency also have
>>> automated the signing process.
>>>
>>> This key is injected during the build process. A really ambitious
>>> release manager could reverse engineer this in a way that reveals the
>>> private key, however if someone is a release manager then they themselves
>>> can do quite a bit of nefarious things anyways.
>>>
>>> It is true that we trust all previous release managers instead of only
>>> one. We could probably rotate the jenkins credentials periodically in order
>>> to compensate for this, if we think this is a nontrivial risk.
>>>
>>> - Patrick
>>>
>>> On Sun, Sep 17, 2017 at 7:04 PM, Holden Karau 
>>> wrote:
>>>
>>>> Would any of Patrick/Josh/Shane (or other PMC folks with
>>>> understanding/opinions on this setup) care to comment? If this is a
>>>> blocking issue I can cancel the current release vote thread while we
>>>> discuss this some more.
>>>>
>>>> On Fri, Sep 15, 2017 at 5:18 PM Holden Karau 
>>>> wrote:
>>>>
>>>>> Oh yes and to keep people more informed I've been updating a PR for
>>>>> the release documentation as I go to write down some of this unwritten
>>>>> knowledge -- https://github.com/apache/spark-website/pull/66
>>>>>
>>>>>
>>>>> On Fri, Sep 15, 2017 at 5:12 PM Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> Also continuing the discussion from the vote threads, Shane probably
>>>>>> has the best idea on the ACLs for Jenkins so I've CC'd him as well.
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 15, 2017 at 5:09 PM Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>>> C

Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Patrick Wendell
For the current release - maybe Holden could just sign the artifacts with
her own key manually, if this is a concern. I don't think that would
require modifying the release pipeline, except to just remove/ignore the
existing signatures.
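
If we do that, voters would simply verify against Holden's public key instead
of mine; roughly like this (the artifact name is only an example, and the KEYS
file is whatever the project publishes):

    gpg --import KEYS                                   # import the release managers' public keys
    gpg --verify spark-2.1.2-bin-hadoop2.7.tgz.asc \
                 spark-2.1.2-bin-hadoop2.7.tgz          # should report Holden's key, not mine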

- Patrick

On Mon, Sep 18, 2017 at 7:56 PM, Reynold Xin  wrote:

> Does anybody know whether this is a hard blocker? If it is not, we should
> probably push 2.1.2 forward quickly and do the infrastructure improvement
> in parallel.
>
> On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau 
> wrote:
>
>> I'm more than willing to help migrate the scripts as part of either this
>> release or the next.
>>
>> It sounds like there is a consensus developing around changing the
>> process -- should we hold off on the 2.1.2 release or roll this into the
>> next one?
>>
>> On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin 
>> wrote:
>>
>>> +1 to this. There should be a script in the Spark repo that has all
>>> the logic needed for a release. That script should take the RM's key
>>> as a parameter.
>>>
>>> If there's a desire to keep the current Jenkins job to create the
>>> release, it should be based on that script. But from what I'm seeing
>>> there are currently too many unknowns in the release process.
>>>
>>> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue 
>>> wrote:
>>> > I don't understand why it is necessary to share a release key. If this
>>> is
>>> > something that can be automated in a Jenkins job, then can it be a
>>> script
>>> > with a reasonable set of build requirements for Mac and Ubuntu? That's
>>> the
>>> > approach I've seen the most in other projects.
>>> >
>>> > I'm also not just concerned about release managers. Having a key stored
>>> > persistently on outside infrastructure adds the most risk, as Luciano
>>> noted
>>> > as well. We should also start publishing checksums in the Spark VOTE
>>> thread,
>>> > which are currently missing. The risk I'm concerned about is that if
>>> the key
>>> > were compromised, it would be possible to replace binaries with
>>> perfectly
>>> > valid ones, at least on some mirrors. If the Apache copy were
>>> replaced, then
>>> > we wouldn't even be able to catch that it had happened. Given the high
>>> > profile of Spark and the number of companies that run it, I think we
>>> need to
>>> > take extra care to make sure that can't happen, even if it is an
>>> annoyance
>>> > for the release managers.
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Patrick Wendell
Sounds good - thanks Holden!

On Mon, Sep 18, 2017 at 8:21 PM, Holden Karau  wrote:

> That sounds like a pretty good temporary workaround. If folks agree, I'll
> cancel the release vote for 2.1.2 and work on getting an RC2 out later this
> week, manually signed. I've filed JIRA SPARK-22055 & SPARK-22054 to port the
> release scripts and allow injecting the RM's key.
>
> On Mon, Sep 18, 2017 at 8:11 PM, Patrick Wendell 
> wrote:
>
>> For the current release - maybe Holden could just sign the artifacts with
>> her own key manually, if this is a concern. I don't think that would
>> require modifying the release pipeline, except to just remove/ignore the
>> existing signatures.
>>
>> - Patrick
>>
>> On Mon, Sep 18, 2017 at 7:56 PM, Reynold Xin  wrote:
>>
>>> Does anybody know whether this is a hard blocker? If it is not, we
>>> should probably push 2.1.2 forward quickly and do the infrastructure
>>> improvement in parallel.
>>>
>>> On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau 
>>> wrote:
>>>
>>>> I'm more than willing to help migrate the scripts as part of either
>>>> this release or the next.
>>>>
>>>> It sounds like there is a consensus developing around changing the
>>>> process -- should we hold off on the 2.1.2 release or roll this into the
>>>> next one?
>>>>
>>>> On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin 
>>>> wrote:
>>>>
>>>>> +1 to this. There should be a script in the Spark repo that has all
>>>>> the logic needed for a release. That script should take the RM's key
>>>>> as a parameter.
>>>>>
>>>>> If there's a desire to keep the current Jenkins job to create the
>>>>> release, it should be based on that script. But from what I'm seeing
>>>>> there are currently too many unknowns in the release process.
>>>>>
>>>>> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue 
>>>>> wrote:
>>>>> > I don't understand why it is necessary to share a release key. If
>>>>> this is
>>>>> > something that can be automated in a Jenkins job, then can it be a
>>>>> script
>>>>> > with a reasonable set of build requirements for Mac and Ubuntu?
>>>>> That's the
>>>>> > approach I've seen the most in other projects.
>>>>> >
>>>>> > I'm also not just concerned about release managers. Having a key
>>>>> stored
>>>>> > persistently on outside infrastructure adds the most risk, as
>>>>> Luciano noted
>>>>> > as well. We should also start publishing checksums in the Spark VOTE
>>>>> thread,
>>>>> > which are currently missing. The risk I'm concerned about is that if
>>>>> the key
>>>>> > were compromised, it would be possible to replace binaries with
>>>>> perfectly
>>>>> > valid ones, at least on some mirrors. If the Apache copy were
>>>>> replaced, then
>>>>> > we wouldn't even be able to catch that it had happened. Given the
>>>>> high
>>>>> > profile of Spark and the number of companies that run it, I think we
>>>>> need to
>>>>> > take extra care to make sure that can't happen, even if it is an
>>>>> annoyance
>>>>> > for the release managers.
>>>>>
>>>>> --
>>>>> Marcelo
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: sbt scala compiler crashes on spark-sql

2014-11-02 Thread Patrick Wendell
By the way - we can report issues to the Scala/Typesafe team if we
have a way to reproduce this. I just haven't found a reliable
reproduction yet.
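
For reference, here is the sequence people on this thread have been hitting it
with (assuming the bundled sbt launcher in a Spark checkout); it is not a
reliable reproduction, unfortunately:

    sbt/sbt clean
    sbt/sbt "project sql" compile    # crash is intermittent; a clean rebuild usually clears it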

- Patrick

On Sun, Nov 2, 2014 at 7:48 PM, Stephen Boesch  wrote:
> Yes, I have seen this same error - and for team members as well - repeatedly
> since June. As Patrick and Cheng mentioned, the next step is to do an sbt
> clean
>
> 2014-11-02 19:37 GMT-08:00 Cheng Lian :
>
>> I often see this when I first build the whole Spark project with SBT, then
>> modify some code and try to build and debug within IDEA, or vice versa.
>> A clean rebuild can always solve this.
>>
>> On Mon, Nov 3, 2014 at 11:28 AM, Patrick Wendell 
>> wrote:
>>
>> > Does this happen if you clean and recompile? I've seen failures on and
>> > off, but haven't been able to find one that I could reproduce from a
>> > clean build such that we could hand it to the scala team.
>> >
>> > - Patrick
>> >
>> > On Sun, Nov 2, 2014 at 7:25 PM, Imran Rashid 
>> > wrote:
>> > > I'm finding the scala compiler crashes when I compile the spark-sql
>> > project
>> > > in sbt.  This happens in both the 1.1 branch and master (full error
>> > > below).  The other projects build fine in sbt, and everything builds
>> > > fine
>> > > in maven.  Is there some sbt option I'm forgetting?  Anyone else
>> > > experiencing this?
>> > >
>> > > Also, are there up-to-date instructions on how to do common dev tasks
>> > > in
>> > > both sbt & maven?  I have only found these instructions on building
>> > > with
>> > > maven:
>> > >
>> > > http://spark.apache.org/docs/latest/building-with-maven.html
>> > >
>> > > and some general info here:
>> > >
>> > >
>> > > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>> > >
>> > > but I think this doesn't walk through a lot of the steps of a typical
>> > > dev
>> > > cycle, eg, continuous compilation, running one test, running one main
>> > > class, etc.  (especially since it seems like people still favor sbt
>> > > for
>> > > dev.)  If it doesn't already exist somewhere, I could try to put
>> > together a
>> > > brief doc for how to do the basics.  (I'm returning to spark dev after
>> > > a
>> > > little hiatus myself, and I'm hitting some stumbling blocks that are
>> > > probably common knowledge to everyone still dealing with it all the
>> > time.)
>> > >
>> > > thanks,
>> > > Imran
>> > >
>> > > --
>> > > full crash info from sbt:
>> > >
>> > >> project sql
>> > > [info] Set current project to spark-sql (in build
>> > > file:/Users/imran/spark/spark/)
>> > >> compile
>> > > [info] Compiling 62 Scala sources to
>> > > /Users/imran/spark/spark/sql/catalyst/target/scala-2.10/classes...
>> > > [info] Compiling 45 Scala sources and 39 Java sources to
>> > > /Users/imran/spark/spark/sql/core/target/scala-2.10/classes...
>> > > [error]
>> > > [error]  while compiling:
>> > >
>> >
>> > /Users/imran/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala
>> > > [error] during phase: jvm
>> > > [error]  library version: version 2.10.4
>> > > [error] compiler version: version 2.10.4
>> > > [error]   reconstructed args: -classpath
>> > >
>> >
>> > /Users/imran/spark/spark/sql/core/target/scala-2.10/classes:/Users/imran/spark/spark/core/target/scala-2.10/classes:/Users/imran/spark/spark/sql/catalyst/target/scala-2.10/classes:/Users/imran/spark/spark/lib_managed/jars/hadoop-client-1.0.4.jar:/Users/imran/spark/spark/lib_managed/jars/hadoop-core-1.0.4.jar:/Users/imran/spark/spark/lib_managed/jars/xmlenc-0.52.jar:/Users/imran/spark/spark/lib_managed/jars/commons-math-2.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-configuration-1.6.jar:/Users/imran/spark/spark/lib_managed/jars/commons-collections-3.2.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-lang-2.4.jar:/Users/imran/spark/spark/lib_managed/jars/commons-logging-1.1.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-digester-1.8.jar:/Users/imran/spark/spark/lib_managed/jars/commons-beanutils-1.7.0.jar:/Users/imran/spark/spark/li

branch-1.2 has been cut

2014-11-03 Thread Patrick Wendell
Hi All,

I've just cut the release branch for Spark 1.2, consistent with the end of
the scheduled feature window for the release. New commits to master will
need to be explicitly merged into branch-1.2 in order to be in the release.
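
For committers merging fixes back, the usual pattern is a cherry-pick from
master, roughly like this (the remote name is just an example):

    git checkout branch-1.2
    git cherry-pick -x <commit-sha-from-master>   # -x records the original commit hash
    git push apache branch-1.2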

This begins the transition into a QA period for Spark 1.2, with a
focus on testing and fixes. A few smaller features may still go in as
folks wrap up loose ends in the next 48 hours (or for developments in
alpha components).

To help with QA, I'll try to package up a SNAPSHOT release soon for
community testing; this worked well when testing Spark 1.1 before
official votes started. I might give it a few days to allow committers
to merge in back-logged fixes and other patches that were punted to
after the feature freeze.

Thanks to everyone who helped author and review patches over the last few weeks!

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Patrick Wendell
I'm a +1 on this as well. I think it will be a useful model as we
scale the project in the future, and it recognizes some informal process
we have now.

To respond to Sandy's comment: for changes that fall in between the
component boundaries or are straightforward, my understanding of this
model is that you wouldn't need an explicit sign-off. I think this is why,
unlike some other projects, we wouldn't e.g. lock down permissions to
portions of the source tree. If some obvious fix needs to go in,
people should just merge it.

- Patrick

On Wed, Nov 5, 2014 at 5:57 PM, Sandy Ryza  wrote:
> This seems like a good idea.
>
> An area that wasn't listed, but that I think could strongly benefit from
> maintainers, is the build.  Having consistent oversight over Maven, SBT,
> and dependencies would allow us to avoid subtle breakages.
>
> Component maintainers have come up several times within the Hadoop project,
> and I think one of the main reasons the proposals have been rejected is
> that, structurally, its effect is to slow down development.  As you
> mention, this is somewhat mitigated if being a maintainer leads committers
> to take on more responsibility, but it might be worthwhile to draw up more
> specific ideas on how to combat this?  E.g. do obvious changes, doc fixes,
> test fixes, etc. always require a maintainer?
>
> -Sandy
>
> On Wed, Nov 5, 2014 at 5:36 PM, Michael Armbrust 
> wrote:
>
>> +1 (binding)
>>
>> On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia 
>> wrote:
>>
>> > BTW, my own vote is obviously +1 (binding).
>> >
>> > Matei
>> >
>> > > On Nov 5, 2014, at 5:31 PM, Matei Zaharia 
>> > wrote:
>> > >
>> > > Hi all,
>> > >
>> > > I wanted to share a discussion we've been having on the PMC list, as
>> > well as call for an official vote on it on a public list. Basically, as
>> the
>> > Spark project scales up, we need to define a model to make sure there is
>> > still great oversight of key components (in particular internal
>> > architecture and public APIs), and to this end I've proposed
>> implementing a
>> > maintainer model for some of these components, similar to other large
>> > projects.
>> > >
>> > > As background on this, Spark has grown a lot since joining Apache.
>> We've
>> > had over 80 contributors/month for the past 3 months, which I believe
>> makes
>> > us the most active project in contributors/month at Apache, as well as
>> over
>> > 500 patches/month. The codebase has also grown significantly, with new
>> > libraries for SQL, ML, graphs and more.
>> > >
>> > > In this kind of large project, one common way to scale development is
>> to
>> > assign "maintainers" to oversee key components, where each patch to that
>> > component needs to get sign-off from at least one of its maintainers.
>> Most
>> > existing large projects do this -- at Apache, some large ones with this
>> > model are CloudStack (the second-most active project overall),
>> Subversion,
>> > and Kafka, and other examples include Linux and Python. This is also
>> > by-and-large how Spark operates today -- most components have a de-facto
>> > maintainer.
>> > >
>> > > IMO, adopting this model would have two benefits:
>> > >
>> > > 1) Consistent oversight of design for that component, especially
>> > regarding architecture and API. This process would ensure that the
>> > component's maintainers see all proposed changes and consider them to fit
>> > together in a good way.
>> > >
>> > > 2) More structure for new contributors and committers -- in particular,
>> > it would be easy to look up who's responsible for each module and ask
>> them
>> > for reviews, etc, rather than having patches slip between the cracks.
>> > >
>> > > We'd like to start with this in a light-weight manner, where the model only
>> > applies to certain key components (e.g. scheduler, shuffle) and
>> user-facing
>> > APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
>> > it if we deem it useful. The specific mechanics would be as follows:
>> > >
>> > > - Some components in Spark will have maintainers assigned to them,
>> where
>> > one of the maintainers needs to sign off on each patch to the component.
>> > > - Each component with maintainers will have at least 2 maintainers.
>> > > - Maintainers will be assigned from the most active and knowledgeable
>> > committers on that component by the PMC. The PMC can vote to add / remove
>> > maintainers, and maintained components, through consensus.
>> > > - Maintainers are expected to be active in responding to patches for
>> > their components, though they do not need to be the main reviewers for
>> them
>> > (e.g. they might just sign off on architecture / API). To prevent
>> inactive
>> > maintainers from blocking the project, if a maintainer isn't responding
>> in
>> > a reasonable time period (say 2 weeks), other committers can merge the
>> > patch, and the PMC will want to discuss adding another maintainer.
>> > >
>> > > If you'd like to see examples for this model, check out the following

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Patrick Wendell
I think new committers might or might not be maintainers (it would
depend on the PMC vote). I don't think it would affect what you could
merge: you can merge in any part of the source tree; you just need to
get sign-off if you want to touch a public API or make major
architectural changes. Most projects already require code review from
other committers before you commit something, so it's just a version
of that where you have specific people appointed to specific
components for review.

If you look, most large software projects have a maintainer model,
both in Apache and outside of it. CloudStack is probably the best
example in Apache since they are the second most active project
(roughly) after Spark. They have two levels of maintainers and much
stronger language - to quote them: "In general, maintainers only have
commit rights on the module for which they are responsible."

I'd like us to start with something simpler and lightweight as
proposed here. Really the proposal on the table is just to codify the
current de-facto process to make sure we stick by it as we scale. If
we want to add more formality to it or strictness, we can do it later.

- Patrick

On Thu, Nov 6, 2014 at 3:29 PM, Hari Shreedharan
 wrote:
> How would this model work with a new committer who gets voted in? Does it 
> mean that a new committer would be a maintainer for at least one area -- else 
> we could end up having committers who really can't merge anything significant 
> until he becomes a maintainer.
>
>
> Thanks,
> Hari
>
> On Thu, Nov 6, 2014 at 3:00 PM, Matei Zaharia 
> wrote:
>
>> I think you're misunderstanding the idea of "process" here. The point of 
>> process is to make sure something happens automatically, which is useful to 
>> ensure a certain level of quality. For example, all our patches go through 
>> Jenkins, and nobody will make the mistake of merging them if they fail 
>> tests, or RAT checks, or API compatibility checks. The idea is to get the 
>> same kind of automation for design on these components. This is a very 
>> common process for large software projects, and it's essentially what we had 
>> already, but formalizing it will make clear that this is the process we 
>> want. It's important to do it early in order to be able to refine the 
>> process as the project grows.
>> In terms of scope, again, the maintainers are *not* going to be the only 
>> reviewers for that component, they are just a second level of sign-off 
>> required for architecture and API. Being a maintainer is also not a 
>> "promotion", it's a responsibility. Since we don't have much experience yet 
>> with this model, I didn't propose automatic rules beyond that the PMC can 
>> add / remove maintainers -- presumably the PMC is in the best position to 
>> know what the project needs. I think automatic rules are exactly the kind of 
>> "process" you're arguing against. The "process" here is about ensuring 
>> certain checks are made for every code change, not about automating 
>> personnel and development decisions.
>> In any case, I appreciate your input on this, and we're going to evaluate 
>> the model to see how it goes. It might be that we decide we don't want it at 
>> all. However, from what I've seen of other projects (not Hadoop but projects 
>> with an order of magnitude more contributors, like Python or Linux), this is 
>> one of the best ways to have consistently great releases with a large 
>> contributor base and little room for error. With all due respect to what 
>> Hadoop's accomplished, I wouldn't use Hadoop as the best example to strive 
>> for; in my experience there I've seen patches reverted because of 
>> architectural disagreements, new APIs released and abandoned, and generally 
>> an experience that's been painful for users. A lot of the decisions we've 
>> made in Spark (e.g. time-based release cycle, built-in libraries, API 
>> stability rules, etc) were based on lessons learned there, in an attempt to 
>> define a better model.
>> Matei
>>> On Nov 6, 2014, at 2:18 PM, bc Wong  wrote:
>>>
>>> On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia wrote:
>>> 
>>> Ultimately, the core motivation is that the project has grown to the point 
>>> where it's hard to expect every committer to have full understanding of 
>>> every component. Some committers know a ton about systems but little about 
>>> machine learning, some are algorithmic whizzes but may not realize the 
>>> implications of changing something on the Python API, etc. This is just a 
>>> way to make sure that a domain expert has looked at the areas where it is 
>>> most likely for something to go wrong.
>>>
>>> Hi Matei,
>>>
>>> I understand where you're coming from. My suggestion is to solve this 
>>> without adding a new process. In the example above, those "algo whizzes" 
>>> committers should realize that they're touching the Python API, and loop in 
>>> some Python maintainers. Those Python maintainers would then re

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Patrick Wendell
Hey Greg,

Regarding subversion - I think the reference is to partial vs full
committers here:
https://subversion.apache.org/docs/community-guide/roles.html

- Patrick

On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein  wrote:
> -1 (non-binding)
>
> This is an idea that runs COMPLETELY counter to the Apache Way, and is
> to be severely frowned upon. This creates *unequal* ownership of the
> codebase.
>
> Each Member of the PMC should have *equal* rights to all areas of the
> codebase, whatever their purview. It should not be subjected to others'
> "ownership" except through the standard mechanisms of reviews and,
> if/when absolutely necessary, vetoes.
>
> Apache does not want "leads", "benevolent dictators" or "assigned
> maintainers", no matter how you may dress it up with multiple
> maintainers per component. The fact is that this creates an unequal
> level of ownership and responsibility. The Board has shut down
> projects that attempted or allowed for "Leads". Just a few months ago,
> there was a problem with somebody calling themself a "Lead".
>
> I don't know why you suggest that Apache Subversion does this. We
> absolutely do not. Never have. Never will. The Subversion codebase is
> owned by all of us, and we all care for every line of it. Some people
> know more than others, of course. But any one of us, can change any
> part, without being subjected to a "maintainer". Of course, we ask
> people with more knowledge of the component when we feel
> uncomfortable, but we also know when it is safe or not to make a
> specific change. And *always*, our fellow committers can review our
> work and let us know when we've done something wrong.
>
> Equal ownership reduces fiefdoms, enhances a feeling of community and
> project ownership, and creates a more open and inviting project.
>
> So again: -1 on this entire concept. Not good, to be polite.
>
> Regards,
> Greg Stein
> Director, Vice Chairman
> Apache Software Foundation
>
> On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
>> Hi all,
>>
>> I wanted to share a discussion we've been having on the PMC list, as well as 
>> call for an official vote on it on a public list. Basically, as the Spark 
>> project scales up, we need to define a model to make sure there is still 
>> great oversight of key components (in particular internal architecture and 
>> public APIs), and to this end I've proposed implementing a maintainer model 
>> for some of these components, similar to other large projects.
>>
>> As background on this, Spark has grown a lot since joining Apache. We've had 
>> over 80 contributors/month for the past 3 months, which I believe makes us 
>> the most active project in contributors/month at Apache, as well as over 500 
>> patches/month. The codebase has also grown significantly, with new libraries 
>> for SQL, ML, graphs and more.
>>
>> In this kind of large project, one common way to scale development is to 
>> assign "maintainers" to oversee key components, where each patch to that 
>> component needs to get sign-off from at least one of its maintainers. Most 
>> existing large projects do this -- at Apache, some large ones with this 
>> model are CloudStack (the second-most active project overall), Subversion, 
>> and Kafka, and other examples include Linux and Python. This is also 
>> by-and-large how Spark operates today -- most components have a de-facto 
>> maintainer.
>>
>> IMO, adopting this model would have two benefits:
>>
>> 1) Consistent oversight of design for that component, especially regarding 
>> architecture and API. This process would ensure that the component's 
>> maintainers see all proposed changes and consider them to fit together in a 
>> good way.
>>
>> 2) More structure for new contributors and committers -- in particular, it 
>> would be easy to look up who's responsible for each module and ask them for 
>> reviews, etc, rather than having patches slip between the cracks.
>>
>> We'd like to start with this in a light-weight manner, where the model only 
>> applies to certain key components (e.g. scheduler, shuffle) and user-facing 
>> APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it 
>> if we deem it useful. The specific mechanics would be as follows:
>>
>> - Some components in Spark will have maintainers assigned to them, where one 
>> of the maintainers needs to sign off on each patch to the component.
>> - Each component with maintainers will have at least 2 maintainers.
>> - Maintainers will be assigned from the most active and knowledgeable 
>> committers on that component by the PMC. The PMC can vote to add / remove 
>> maintainers, and maintained components, through consensus.
>> - Maintainers are expected to be active in responding to patches for their 
>> components, though they do not need to be the main reviewers for them (e.g. 
>> they might just sign off on architecture / API). To prevent inactive 
>> maintainers from blocking the project, if a maintainer isn't responding in a 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Patrick Wendell
In fact, if you look at the Subversion committer list, the majority of
people there have commit access only for particular areas of the
project:

http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS

On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell  wrote:
> Hey Greg,
>
> Regarding subversion - I think the reference is to partial vs full
> committers here:
> https://subversion.apache.org/docs/community-guide/roles.html
>
> - Patrick
>
> On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein  wrote:
>> -1 (non-binding)
>>
>> This is an idea that runs COMPLETELY counter to the Apache Way, and is
>> to be severely frowned upon. This creates *unequal* ownership of the
>> codebase.
>>
>> Each Member of the PMC should have *equal* rights to all areas of the
>> codebase, whatever their purview. It should not be subjected to others'
>> "ownership" except through the standard mechanisms of reviews and,
>> if/when absolutely necessary, vetoes.
>>
>> Apache does not want "leads", "benevolent dictators" or "assigned
>> maintainers", no matter how you may dress it up with multiple
>> maintainers per component. The fact is that this creates an unequal
>> level of ownership and responsibility. The Board has shut down
>> projects that attempted or allowed for "Leads". Just a few months ago,
>> there was a problem with somebody calling themself a "Lead".
>>
>> I don't know why you suggest that Apache Subversion does this. We
>> absolutely do not. Never have. Never will. The Subversion codebase is
>> owned by all of us, and we all care for every line of it. Some people
>> know more than others, of course. But any one of us, can change any
>> part, without being subjected to a "maintainer". Of course, we ask
>> people with more knowledge of the component when we feel
>> uncomfortable, but we also know when it is safe or not to make a
>> specific change. And *always*, our fellow committers can review our
>> work and let us know when we've done something wrong.
>>
>> Equal ownership reduces fiefdoms, enhances a feeling of community and
>> project ownership, and creates a more open and inviting project.
>>
>> So again: -1 on this entire concept. Not good, to be polite.
>>
>> Regards,
>> Greg Stein
>> Director, Vice Chairman
>> Apache Software Foundation
>>
>> On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
>>> Hi all,
>>>
>>> I wanted to share a discussion we've been having on the PMC list, as well 
>>> as call for an official vote on it on a public list. Basically, as the 
>>> Spark project scales up, we need to define a model to make sure there is 
>>> still great oversight of key components (in particular internal 
>>> architecture and public APIs), and to this end I've proposed implementing a 
>>> maintainer model for some of these components, similar to other large 
>>> projects.
>>>
>>> As background on this, Spark has grown a lot since joining Apache. We've 
>>> had over 80 contributors/month for the past 3 months, which I believe makes 
>>> us the most active project in contributors/month at Apache, as well as over 
>>> 500 patches/month. The codebase has also grown significantly, with new 
>>> libraries for SQL, ML, graphs and more.
>>>
>>> In this kind of large project, one common way to scale development is to 
>>> assign "maintainers" to oversee key components, where each patch to that 
>>> component needs to get sign-off from at least one of its maintainers. Most 
>>> existing large projects do this -- at Apache, some large ones with this 
>>> model are CloudStack (the second-most active project overall), Subversion, 
>>> and Kafka, and other examples include Linux and Python. This is also 
>>> by-and-large how Spark operates today -- most components have a de-facto 
>>> maintainer.
>>>
>>> IMO, adopting this model would have two benefits:
>>>
>>> 1) Consistent oversight of design for that component, especially regarding 
>>> architecture and API. This process would ensure that the component's 
>>> maintainers see all proposed changes and consider them to fit together in a 
>>> good way.
>>>
>>> 2) More structure for new contributors and committers -- in particular, it 
>>> would be easy to look up who's responsible for each module and ask them for 
>>> reviews, etc, rather than having patches slip between the cracks.

Re: Should new YARN shuffle service work with "yarn-alpha"?

2014-11-07 Thread Patrick Wendell
I bet it doesn't work. +1 on isolating its inclusion to only the
newer YARN APIs.
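
In other words, roughly this is what I'd expect to work and to be skipped,
respectively (the hadoop-2.4 profile/version here is only an illustrative
stable-YARN combination):

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests package            # builds network/yarn
    mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests package    # should skip it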

- Patrick

On Fri, Nov 7, 2014 at 11:43 PM, Sean Owen  wrote:
> I noticed that this doesn't compile:
>
> mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean 
> package
>
> [error] warning: [options] bootstrap class path not set in conjunction
> with -source 1.6
> [error] 
> /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:26:
> error: cannot find symbol
> [error] import org.apache.hadoop.yarn.server.api.AuxiliaryService;
> [error] ^
> [error]   symbol:   class AuxiliaryService
> [error]   location: package org.apache.hadoop.yarn.server.api
> [error] 
> /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:27:
> error: cannot find symbol
> [error] import 
> org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
> [error] ^
> ...
>
> Should it work? if not shall I propose to enable the service only with -Pyarn?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Should new YARN shuffle service work with "yarn-alpha"?

2014-11-08 Thread Patrick Wendell
I think you might be conflating two things. The first error you posted
was because YARN didn't standardize the shuffle API in alpha versions
so our spark-network-yarn module won't compile. We should just disable
that module if yarn alpha is used. spark-network-yarn is a leaf in the
intra-module dependency graph, and core doesn't depend on it.

This second error is something else. Maybe you are excluding
network-shuffle instead of spark-network-yarn?
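
If you just need the yarn-alpha build to compile in the meantime, one possible
workaround (untested; assumes Maven 3.2.1+, which can exclude modules with a
leading '!') is to drop only the leaf module:

    mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 \
        -pl '!network/yarn' -DskipTests clean package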



On Fri, Nov 7, 2014 at 11:50 PM, Sean Owen  wrote:
> Hm. Problem is, core depends directly on it:
>
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SecurityManager.scala:25:
> object sasl is not a member of package org.apache.spark.network
> [error] import org.apache.spark.network.sasl.SecretKeyHolder
> [error] ^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SecurityManager.scala:147:
> not found: type SecretKeyHolder
> [error] private[spark] class SecurityManager(sparkConf: SparkConf)
> extends Logging with SecretKeyHolder {
> [error]
>  ^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala:29:
> object RetryingBlockFetcher is not a member of package
> org.apache.spark.network.shuffle
> [error] import org.apache.spark.network.shuffle.{RetryingBlockFetcher,
> BlockFetchingListener, OneForOneBlockFetcher}
> [error]^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/deploy/worker/StandaloneWorkerShuffleService.scala:23:
> object sasl is not a member of package org.apache.spark.network
> [error] import org.apache.spark.network.sasl.SaslRpcHandler
> [error]
>
> ...
>
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:124:
> too many arguments for constructor ExternalShuffleClient: (x$1:
> org.apache.spark.network.util.TransportConf, x$2:
> String)org.apache.spark.network.shuffle.ExternalShuffleClient
> [error] new
> ExternalShuffleClient(SparkTransportConf.fromSparkConf(conf),
> securityManager,
> [error] ^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:39:
> object protocol is not a member of package
> org.apache.spark.network.shuffle
> [error] import org.apache.spark.network.shuffle.protocol.ExecutorShuffleInfo
> [error] ^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:214:
> not found: type ExecutorShuffleInfo
> [error] val shuffleConfig = new ExecutorShuffleInfo(
> [error]
> ...
>
>
> More refactoring needed? Either to support YARN alpha as a separate
> shuffle module, or sever this dependency?
>
> Of course this goes away when yarn-alpha goes away too.
>
>
> On Sat, Nov 8, 2014 at 7:45 AM, Patrick Wendell  wrote:
>> I bet it doesn't work. +1 on isolating its inclusion to only the
>> newer YARN APIs.
>>
>> - Patrick
>>
>> On Fri, Nov 7, 2014 at 11:43 PM, Sean Owen  wrote:
>>> I noticed that this doesn't compile:
>>>
>>> mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean 
>>> package
>>>
>>> [error] warning: [options] bootstrap class path not set in conjunction
>>> with -source 1.6
>>> [error] 
>>> /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:26:
>>> error: cannot find symbol
>>> [error] import org.apache.hadoop.yarn.server.api.AuxiliaryService;
>>> [error] ^
>>> [error]   symbol:   class AuxiliaryService
>>> [error]   location: package org.apache.hadoop.yarn.server.api
>>> [error] 
>>> /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:27:
>>> error: cannot find symbol
>>> [error] import 
>>> org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
>>> [error] ^
>>> ...
>>>
>>> Should it work? if not shall I propose to enable the service only with 
>>> -Pyarn?
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Should new YARN shuffle service work with "yarn-alpha"?

2014-11-08 Thread Patrick Wendell
Great - I think that should work, but if there are any issues we can
definitely fix them up.

On Sat, Nov 8, 2014 at 12:47 AM, Sean Owen  wrote:
> Oops, that was my mistake. I moved network/shuffle into yarn, when
> it's just that network/yarn should be removed from yarn-alpha. That
> makes yarn-alpha work. I'll run tests and open a quick JIRA / PR for
> the change.
>
> On Sat, Nov 8, 2014 at 8:23 AM, Patrick Wendell  wrote:
>> This second error is something else. Maybe you are excluding
>> network-shuffle instead of spark-network-yarn?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: getting exception when trying to build spark from master

2014-11-10 Thread Patrick Wendell
I reverted that patch to see if it fixes it.

On Mon, Nov 10, 2014 at 1:45 PM, Josh Rosen  wrote:
> It looks like the Jenkins maven builds are broken, too.  Based on the
> Jenkins logs, I think that this pull request may have broken things
> (although I'm not sure why):
>
> https://github.com/apache/spark/pull/3030#issuecomment-62436181
>
> On Mon, Nov 10, 2014 at 1:42 PM, Sadhan Sood  wrote:
>
>> Getting an exception while trying to build spark in spark-core:
>>
>> [ERROR]
>>
>>  while compiling:
>>
>> /Users/dev/tellapart_spark/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala
>>
>> during phase: typer
>>
>>  library version: version 2.10.4
>>
>> compiler version: version 2.10.4
>>
>>   reconstructed args: -deprecation -feature -classpath
>>
>>
>>   last tree to typer: Ident(enumDispatcher)
>>
>>   symbol: value enumDispatcher (flags: )
>>
>>symbol definition: val enumDispatcher:
>> java.util.EnumSet[javax.servlet.DispatcherType]
>>
>>  tpe: java.util.EnumSet[javax.servlet.DispatcherType]
>>
>>symbol owners: value enumDispatcher -> value $anonfun -> method
>> addFilters -> object JettyUtils -> package ui
>>
>>   context owners: value $anonfun -> value $anonfun -> method addFilters
>> -> object JettyUtils -> package ui
>>
>>
>> == Enclosing template or block ==
>>
>>
>> Block(
>>
>>   ValDef( // val filters: Array[String]
>>
>> 
>>
>> "filters"
>>
>> AppliedTypeTree(
>>
>>   "Array"
>>
>>   "String"
>>
>> )
>>
>> Apply(
>>
>>   conf.get("spark.ui.filters", "").split(',')."map"
>>
>>   Function( // val $anonfun: , tree.tpe=String => String
>>
>> ValDef( // x$1: String
>>
>> 
>>
>>   "x$1"
>>
>>// tree.tpe=String
>>
>>   
>>
>> )
>>
>> Apply( // def trim(): String in class String, tree.tpe=String
>>
>>   "x$1"."trim" // def trim(): String in class String,
>> tree.tpe=()String
>>
>>   Nil
>>
>> )
>>
>>   )
>>
>> )
>>
>>   )
>>
>>   Apply(
>>
>> "filters"."foreach"
>>
>> Match(
>>
>>   
>>
>>   CaseDef(
>>
>> Bind( // val filter: String
>>
>>   "filter"
>>
>>   Typed(
>>
>> "_" // tree.tpe=String
>>
>> "String"
>>
>>   )
>>
>> )
>>
>> If(
>>
>>   "filter"."isEmpty"."unary_$bang"
>>
>>   Block(
>>
>> // 7 statements
>>
>> Apply(
>>
>>   "logInfo"
>>
>>   Apply( // final def +(x$1: Any): String in class String,
>> tree.tpe=String
>>
>> "Adding filter: "."$plus" // final def +(x$1: Any): String
>> in class String, tree.tpe=(x$1: Any)String
>>
>> "filter" // val filter: String, tree.tpe=String
>>
>>   )
>>
>> )
>>
>> ValDef( // val holder: org.eclipse.jetty.servlet.FilterHolder
>>
>>   
>>
>>   "holder"
>>
>>   "FilterHolder"
>>
>>   Apply(
>>
>> new FilterHolder.""
>>
>> Nil
>>
>>   )
>>
>> )
>>
>> Apply( // def setClassName(x$1: String): Unit in class Holder,
>> tree.tpe=Unit
>>
>>   "holder"."setClassName" // def setClassName(x$1: String):
>> Unit in class Holder, tree.tpe=(x$1: String)Unit
>>
>>   "filter" // val filter: String, tree.tpe=String
>>
>> )
>>
>> Apply(
>>
>>   conf.get("spark.".+(filter).+(".params"),
>> "").split(',').map(((x$2: String) => x$2.trim()))."toSet"."foreach"
>>
>>   Function( // val $anonfun: 
>>
>> ValDef( // param: String
>>
>>
>>
>>   "param"
>>
>>   "String"
>>
>>   
>>
>> )
>>
>> If(
>>
>>   "param"."isEmpty"."unary_$bang"
>>
>>   Block(
>>
>> ValDef( // val parts: Array[String]
>>
>>   
>>
>>   "parts"
>>
>>// tree.tpe=Array[String]
>>
>>   Apply( // def split(x$1: String): Array[String] in
>> class String, tree.tpe=Array[String]
>>
>> "param"."split" // def split(x$1: String):
>> Array[String] in class String, tree.tpe=(x$1: String)Array[String]
>>
>> "="
>>
>>   )
>>
>> )
>>
>> If(
>>
>>   Apply( // def ==(x: Int): Boolean in class Int,
>> tree.tpe=Boolean
>>
>> "parts"."length"."$eq$eq" // def ==(x: Int):
>> Boolean in class Int, tree.tpe=(x: Int)Boolean
>>
>> 2
>>
>>   )
>>
>>   Apply( // def setInitParameter(x$1: String,x$2:
>> String): Unit in class Holder
>>
>> "holder"."setInitParameter" 

Re: JIRA + PR backlog

2014-11-11 Thread Patrick Wendell
I wonder if we should be linking to that dashboard somewhere from our
official docs or the wiki...

On Tue, Nov 11, 2014 at 12:23 PM, Nicholas Chammas
 wrote:
> Yeah, kudos to Josh for putting that together.
>
> On Tue, Nov 11, 2014 at 3:26 AM, Yu Ishikawa 
> wrote:
>
>> Great jobs!
>> I didn't know "Spark PR Dashboard."
>>
>> Thanks
>> Yu Ishikawa
>>
>>
>>
>> -
>> -- Yu Ishikawa
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/JIRA-PR-backlog-tp9157p9282.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[NOTICE] [BUILD] Minor changes to Spark's build

2014-11-11 Thread Patrick Wendell
Hey All,

I've just merged a patch that adds support for Scala 2.11 which will
have some minor implications for the build. These are due to the
complexities of supporting two versions of Scala in a single project.

1. The JDBC server will now require a special flag to build
-Phive-thriftserver on top of the existing flag -Phive. This is
because some build permutations (only in Scala 2.11) won't support the
JDBC server yet due to transitive dependency conflicts.

2. The build now uses non-standard source layouts in a few additional
places (we already did this for the Hive project) - the repl and the
examples modules. This is just fine for maven/sbt, but it may affect
users who import the build in IDE's that are using these projects and
want to build Spark from the IDE. I'm going to update our wiki to
include full instructions for making this work well in IntelliJ.
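
As a concrete example of point 1 above, build invocations would look roughly
like this (profiles beyond those mentioned in this email are just illustrative;
adjust for your environment):

    mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package   # Scala 2.10 build, JDBC server included
    mvn -Pscala-2.11 -Pyarn -Phive -DskipTests clean package          # Scala 2.11 build, no JDBC server yet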

If there are any other build related issues please respond to this
thread and we'll make sure they get sorted out. Thanks to Prashant
Sharma who is the author of this feature!

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Patrick Wendell
Yeah, Sandy and I were chatting about this today and didn't realize
-Pscala-2.10 was mandatory. This is a fairly invasive change, so I was
thinking maybe we could try to remove that. Also, if someone doesn't
give -Pscala-2.10, it fails in a way that is initially silent, which is
bad because most people won't know to do this.

https://issues.apache.org/jira/browse/SPARK-4375
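
To spell out the current Maven behavior as I understand it (sbt users are
unaffected because sbt adds the 2.10 profile automatically when -Pscala-2.11
is absent):

    mvn -Pscala-2.10 -DskipTests clean package   # works: the 2.10 profile is explicitly enabled
    mvn -DskipTests clean package                # appears to start fine but fails later, with no obvious hint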

On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma  wrote:
> Thanks Patrick, I have one suggestion that we should make passing
> -Pscala-2.10 mandatory for maven users. I am sorry for not mentioning this
> before. There is no way around passing that option for Maven
> users (only). However, this is unnecessary for sbt users because it is added
> automatically if -Pscala-2.11 is absent.
>
>
> Prashant Sharma
>
>
>
> On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen  wrote:
>
>> - Tip: when you rebase, IntelliJ will temporarily think things like the
>> Kafka module are being removed. Say 'no' when it asks if you want to remove
>> them.
>> - Can we go straight to Scala 2.11.4?
>>
>> On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell 
>> wrote:
>>
>> > Hey All,
>> >
>> > I've just merged a patch that adds support for Scala 2.11 which will
>> > have some minor implications for the build. These are due to the
>> > complexities of supporting two versions of Scala in a single project.
>> >
>> > 1. The JDBC server will now require a special flag to build
>> > -Phive-thriftserver on top of the existing flag -Phive. This is
>> > because some build permutations (only in Scala 2.11) won't support the
>> > JDBC server yet due to transitive dependency conflicts.
>> >
>> > 2. The build now uses non-standard source layouts in a few additional
>> > places (we already did this for the Hive project) - the repl and the
>> > examples modules. This is just fine for maven/sbt, but it may affect
>> > users who import the build in IDE's that are using these projects and
>> > want to build Spark from the IDE. I'm going to update our wiki to
>> > include full instructions for making this work well in IntelliJ.
>> >
>> > If there are any other build related issues please respond to this
>> > thread and we'll make sure they get sorted out. Thanks to Prashant
>> > Sharma who is the author of this feature!
>> >
>> > - Patrick
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Patrick Wendell
I think printing an error that says "-Pscala-2.10 must be enabled" is
probably okay. It's a slight regression but it's super obvious to
users. That could be a more elegant solution than the somewhat
complicated monstrosity I proposed on the JIRA.

On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma  wrote:
> One thing we can do is print a helpful error and break. I don't know
> about how this can be done, but since now I can write groovy inside maven
> build so we have more control. (Yay!!)
>
> Prashant Sharma
>
>
>
> On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell 
> wrote:
>>
>> Yeah Sandy and I were chatting about this today and didn't realize
>> -Pscala-2.10 was mandatory. This is a fairly invasive change, so I was
>> thinking maybe we could try to remove that. Also if someone doesn't
>> give -Pscala-2.10 it fails in a way that is initially silent, which is
>> bad because most people won't know to do this.
>>
>> https://issues.apache.org/jira/browse/SPARK-4375
>>
>> On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma 
>> wrote:
>> > Thanks Patrick, I have one suggestion that we should make passing
>> > -Pscala-2.10 mandatory for maven users. I am sorry for not mentioning
>> > this
>> > before. There is no way around passing that option for maven
>> > users (only). However, this is unnecessary for sbt users because it is
>> > added
>> > automatically if -Pscala-2.11 is absent.
>> >
>> >
>> > Prashant Sharma
>> >
>> >
>> >
>> > On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen  wrote:
>> >
>> >> - Tip: when you rebase, IntelliJ will temporarily think things like the
>> >> Kafka module are being removed. Say 'no' when it asks if you want to
>> >> remove
>> >> them.
>> >> - Can we go straight to Scala 2.11.4?
>> >>
>> >> On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell 
>> >> wrote:
>> >>
>> >> > Hey All,
>> >> >
>> >> > I've just merged a patch that adds support for Scala 2.11 which will
>> >> > have some minor implications for the build. These are due to the
>> >> > complexities of supporting two versions of Scala in a single project.
>> >> >
>> >> > 1. The JDBC server will now require a special flag to build
>> >> > -Phive-thriftserver on top of the existing flag -Phive. This is
>> >> > because some build permutations (only in Scala 2.11) won't support
>> >> > the
>> >> > JDBC server yet due to transitive dependency conflicts.
>> >> >
>> >> > 2. The build now uses non-standard source layouts in a few additional
>> >> > places (we already did this for the Hive project) - the repl and the
>> >> > examples modules. This is just fine for maven/sbt, but it may affect
>> >> > users who import the build in IDE's that are using these projects and
>> >> > want to build Spark from the IDE. I'm going to update our wiki to
>> >> > include full instructions for making this work well in IntelliJ.
>> >> >
>> >> > If there are any other build related issues please respond to this
>> >> > thread and we'll make sure they get sorted out. Thanks to Prashant
>> >> > Sharma who is the author of this feature!
>> >> >
>> >> > - Patrick
>> >> >
>> >> > -
>> >> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >> >
>> >> >
>> >>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Patrick Wendell
I actually do agree with this - let's see if we can find a solution
that doesn't regress this behavior. Maybe we can simply move the one
kafka example into its own project instead of having it in the
examples project.

On Wed, Nov 12, 2014 at 11:07 PM, Sandy Ryza  wrote:
> Currently there are no mandatory profiles required to build Spark.  I.e.
> "mvn package" just works.  It seems sad that we would need to break this.
>
> On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell 
> wrote:
>>
>> I think printing an error that says "-Pscala-2.10 must be enabled" is
>> probably okay. It's a slight regression but it's super obvious to
>> users. That could be a more elegant solution than the somewhat
>> complicated monstrosity I proposed on the JIRA.
>>
>> On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma 
>> wrote:
>> > One thing we can do is print a helpful error and break. I don't know
>> > about how this can be done, but since now I can write groovy inside
>> > maven
>> > build so we have more control. (Yay!!)
>> >
>> > Prashant Sharma
>> >
>> >
>> >
>> > On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell 
>> > wrote:
>> >>
>> >> Yeah Sandy and I were chatting about this today and didn't realize
>> >> -Pscala-2.10 was mandatory. This is a fairly invasive change, so I was
>> >> thinking maybe we could try to remove that. Also if someone doesn't
>> >> give -Pscala-2.10 it fails in a way that is initially silent, which is
>> >> bad because most people won't know to do this.
>> >>
>> >> https://issues.apache.org/jira/browse/SPARK-4375
>> >>
>> >> On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma
>> >> 
>> >> wrote:
>> >> > Thanks Patrick, I have one suggestion that we should make passing
>> >> > -Pscala-2.10 mandatory for maven users. I am sorry for not mentioning
>> >> > this
>> >> > before. There is no way around passing that option for maven
>> >> > users (only). However, this is unnecessary for sbt users because it is
>> >> > added
>> >> > automatically if -Pscala-2.11 is absent.
>> >> >
>> >> >
>> >> > Prashant Sharma
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen 
>> >> > wrote:
>> >> >
>> >> >> - Tip: when you rebase, IntelliJ will temporarily think things like
>> >> >> the
>> >> >> Kafka module are being removed. Say 'no' when it asks if you want to
>> >> >> remove
>> >> >> them.
>> >> >> - Can we go straight to Scala 2.11.4?
>> >> >>
>> >> >> On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell
>> >> >> 
>> >> >> wrote:
>> >> >>
>> >> >> > Hey All,
>> >> >> >
>> >> >> > I've just merged a patch that adds support for Scala 2.11 which
>> >> >> > will
>> >> >> > have some minor implications for the build. These are due to the
>> >> >> > complexities of supporting two versions of Scala in a single
>> >> >> > project.
>> >> >> >
>> >> >> > 1. The JDBC server will now require a special flag to build
>> >> >> > -Phive-thriftserver on top of the existing flag -Phive. This is
>> >> >> > because some build permutations (only in Scala 2.11) won't support
>> >> >> > the
>> >> >> > JDBC server yet due to transitive dependency conflicts.
>> >> >> >
>> >> >> > 2. The build now uses non-standard source layouts in a few
>> >> >> > additional
>> >> >> > places (we already did this for the Hive project) - the repl and
>> >> >> > the
>> >> >> > examples modules. This is just fine for maven/sbt, but it may
>> >> >> > affect
>> >> >> > users who import the build in IDE's that are using these projects
>> >> >> > and
>> >> >> > want to build Spark from the IDE. I'm going to update our wiki to
>> >> >> > include full instructions for making this work well in IntelliJ.
>> >> >> >
>> >> >> > If there are any other build related issues please respond to this
>> >> >> > thread and we'll make sure they get sorted out. Thanks to Prashant
>> >> >> > Sharma who is the author of this feature!
>> >> >> >
>> >> >> > - Patrick
>> >> >> >
>> >> >> >
>> >> >> > -
>> >> >> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> >> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >> >> >
>> >> >> >
>> >> >>
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Patrick Wendell
Hey Marcelo,

I'm not sure chaining activation works like that. At least in my
experience activation based on properties only works for properties
explicitly specified at the command line rather than declared
elsewhere in the pom.

https://gist.github.com/pwendell/6834223e68f254e6945e

In any case, I think Prashant just didn't document that his patch
required -Pscala-2.10 explicitly, which is what he said further up in
the thread. And Sandy has a solution that has better behavior than
that, which is nice.

- Patrick

On Thu, Nov 13, 2014 at 10:15 AM, Sandy Ryza  wrote:
> https://github.com/apache/spark/pull/3239 addresses this
>
> On Thu, Nov 13, 2014 at 10:05 AM, Marcelo Vanzin 
> wrote:
>>
>> Hello there,
>>
>> So I just took a quick look at the pom and I see two problems with it.
>>
>> - "activatedByDefault" does not work like you think it does. It only
>> "activates by default" if you do not explicitly activate other
>> profiles. So if you do "mvn package", scala-2.10 will be activated;
>> but if you do "mvn -Pyarn package", it will not.
>>
>> - you need to duplicate the "activation" stuff everywhere where the
>> profile is declared, not just in the root pom. (I spent quite some
>> time yesterday fighting a similar issue...)
>>
>> My suggestion here is to change the activation of scala-2.10 to look like
>> this:
>>
>> <activation>
>>   <property>
>>     <name>!scala-2.11</name>
>>   </property>
>> </activation>
>>
>> And change the scala-2.11 profile to do this:
>>
>> <properties>
>>   <scala-2.11>true</scala-2.11>
>> </properties>
>>
>> I haven't tested, but in my experience this will activate the
>> scala-2.10 profile by default, unless you explicitly activate the 2.11
>> profile, in which case that property will be set and scala-2.10 will
>> not activate. If you look at examples/pom.xml, that's the same
>> strategy used to choose which hbase profile to activate.
>>
>> Ah, and just to reinforce, the activation logic needs to be copied to
>> other places (e.g. examples/pom.xml, repl/pom.xml, and any other place
>> that has scala-2.x profiles).
>>
>>
>>
>> On Wed, Nov 12, 2014 at 11:14 PM, Patrick Wendell 
>> wrote:
>> > I actually do agree with this - let's see if we can find a solution
>> > that doesn't regress this behavior. Maybe we can simply move the one
>> > kafka example into its own project instead of having it in the
>> > examples project.
>> >
>> > On Wed, Nov 12, 2014 at 11:07 PM, Sandy Ryza 
>> > wrote:
>> >> Currently there are no mandatory profiles required to build Spark.
>> >> I.e.
>> >> "mvn package" just works.  It seems sad that we would need to break
>> >> this.
>> >>
>> >> On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell 
>> >> wrote:
>> >>>
>> >>> I think printing an error that says "-Pscala-2.10 must be enabled" is
>> >>> probably okay. It's a slight regression but it's super obvious to
>> >>> users. That could be a more elegant solution than the somewhat
>> >>> complicated monstrosity I proposed on the JIRA.
>> >>>
>> >>> On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma
>> >>> 
>> >>> wrote:
>> >>> > One thing we can do is print a helpful error and break. I don't
>> >>> > know
>> >>> > about how this can be done, but since now I can write groovy inside
>> >>> > maven
>> >>> > build so we have more control. (Yay!!)
>> >>> >
>> >>> > Prashant Sharma
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell
>> >>> > 
>> >>> > wrote:
>> >>> >>
>> >>> >> Yeah Sandy and I were chatting about this today and didn't realize
>> >>> >> -Pscala-2.10 was mandatory. This is a fairly invasive change, so I
>> >>> >> was
>> >>> >> thinking maybe we could try to remove that. Also if someone doesn't
>> >>> >> give -Pscala-2.10 it fails in a way that is initially silent, which
>> >>> >> is
>> >>> >> bad because most people won't know to do this.
>> >>> >>
>> >>> >> https://issues.apache.org/jira/browse/SPARK-4375
>> >>> >>
>> >>> &

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Patrick Wendell
> That's true, but note the code I posted activates a profile based on
> the lack of a property being set, which is why it works. Granted, I
> did not test that if you activate the other profile, the one with the
> property check will be disabled.

Ah yeah good call - so then we'd trigger 2.11-vs-not based on the
presence of -Dscala-2.11.
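
Roughly, as an untested sketch of that behavior:

  mvn -DskipTests package               # no scala-2.11 property set: scala-2.10 activates
  mvn -Dscala-2.11 -DskipTests package  # property present: scala-2.10 stays off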

Would that fix this issue then? It might be a simpler fix to merge
into the 1.2 branch than Sandy's patch since we're pretty late in the
game (though that patch does other things separately that I'd like to
see end up in Spark soon).

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Has anyone else observed this build break?

2014-11-14 Thread Patrick Wendell
A recent patch broke clean builds for me, I am trying to see how
widespread this issue is and whether we need to revert the patch.

The error I've seen is this when building the examples project:

spark-examples_2.10: Could not resolve dependencies for project
org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar

The reason for this error is that hbase-annotations is using a
"system" scoped dependency in their hbase-annotations pom, and this
doesn't work with certain JDK layouts such as that provided on Mac OS:

http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom
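
(If you want to check where the jdk.tools artifact enters your own build,
something along these lines should show it; the command is illustrative:)

  mvn dependency:tree -Dincludes=jdk.tools:jdk.tools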

Has anyone else seen this or is it just me?

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Has anyone else observed this build break?

2014-11-14 Thread Patrick Wendell
A workaround for this issue is identified here:
http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html

However, if this affects more users I'd prefer to just fix it properly
in our build.
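
As a stopgap on a machine that hits this (not necessarily the fix from the
post above), building with a JDK whose layout actually ships lib/tools.jar
avoids the error, e.g. on a Mac with Oracle JDK 7 installed:

  export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)
  mvn -DskipTests clean package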

On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell  wrote:
> A recent patch broke clean builds for me, I am trying to see how
> widespread this issue is and whether we need to revert the patch.
>
> The error I've seen is this when building the examples project:
>
> spark-examples_2.10: Could not resolve dependencies for project
> org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
> find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
> /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar
>
> The reason for this error is that hbase-annotations is using a
> "system" scoped dependency in their hbase-annotations pom, and this
> doesn't work with certain JDK layouts such as that provided on Mac OS:
>
> http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom
>
> Has anyone else seen this or is it just me?
>
> - Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Has anyone else observed this build break?

2014-11-14 Thread Patrick Wendell
I think in this case we can probably just drop that dependency, so
there is a simpler fix. But mostly I'm curious whether anyone else has
observed this.

On Fri, Nov 14, 2014 at 12:24 PM, Hari Shreedharan
 wrote:
> Seems like a comment on that page mentions a fix, which would add yet
> another profile though -- specifically telling mvn that if it is an apple
> jdk, use the classes.jar as the tools.jar as well, since Apple-packaged JDK
> 6 bundled them together.
>
> Link:
> http://permalink.gmane.org/gmane.comp.java.maven-plugins.mojo.user/4320
>
> I didn't test it, but maybe this can fix it?
>
> Thanks,
> Hari
>
>
> On Fri, Nov 14, 2014 at 12:21 PM, Patrick Wendell 
> wrote:
>>
>> A workaround for this issue is identified here:
>>
>> http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html
>>
>> However, if this affects more users I'd prefer to just fix it properly
>> in our build.
>>
>> On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell 
>> wrote:
>> > A recent patch broke clean builds for me, I am trying to see how
>> > widespread this issue is and whether we need to revert the patch.
>> >
>> > The error I've seen is this when building the examples project:
>> >
>> > spark-examples_2.10: Could not resolve dependencies for project
>> > org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
>> > find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
>> >
>> > /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar
>> >
>> > The reason for this error is that hbase-annotations is using a
>> > "system" scoped dependency in their hbase-annotations pom, and this
>> > doesn't work with certain JDK layouts such as that provided on Mac OS:
>> >
>> >
>> > http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom
>> >
>> > Has anyone else seen this or is it just me?
>> >
>> > - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Has anyone else observed this build break?

2014-11-15 Thread Patrick Wendell
Sounds like this is pretty specific to my environment so not a big
deal then. However, if we can safely exclude those packages it's worth
doing.

On Sat, Nov 15, 2014 at 7:27 AM, Ted Yu  wrote:
> I couldn't reproduce the problem using:
>
> java version "1.6.0_65"
> Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
> Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)
>
> Since hbase-annotations is a transitive dependency, I created the following
> pull request to exclude it from various hbase modules:
> https://github.com/apache/spark/pull/3286
>
> Cheers
>
> https://github.com/apache/spark/pull/3286
>
> On Sat, Nov 15, 2014 at 6:56 AM, Ted Yu  wrote:
>>
>> Sorry for the late reply.
>>
>> I tested my patch on Mac with the following JDK:
>>
>> java version "1.7.0_60"
>> Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
>> Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
>>
>> Let me see if the problem can be solved upstream in HBase
>> hbase-annotations module.
>>
>> Cheers
>>
>> On Fri, Nov 14, 2014 at 12:32 PM, Patrick Wendell 
>> wrote:
>>>
>>> I think in this case we can probably just drop that dependency, so
>>> there is a simpler fix. But mostly I'm curious whether anyone else has
>>> observed this.
>>>
>>> On Fri, Nov 14, 2014 at 12:24 PM, Hari Shreedharan
>>>  wrote:
>>> > Seems like a comment on that page mentions a fix, which would add yet
>>> > another profile though -- specifically telling mvn that if it is an
>>> > apple
>>> > jdk, use the classes.jar as the tools.jar as well, since Apple-packaged
>>> > JDK
>>> > 6 bundled them together.
>>> >
>>> > Link:
>>> > http://permalink.gmane.org/gmane.comp.java.maven-plugins.mojo.user/4320
>>> >
>>> > I didn't test it, but maybe this can fix it?
>>> >
>>> > Thanks,
>>> > Hari
>>> >
>>> >
>>> > On Fri, Nov 14, 2014 at 12:21 PM, Patrick Wendell 
>>> > wrote:
>>> >>
>>> >> A workaround for this issue is identified here:
>>> >>
>>> >>
>>> >> http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html
>>> >>
>>> >> However, if this affects more users I'd prefer to just fix it properly
>>> >> in our build.
>>> >>
>>> >> On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell 
>>> >> wrote:
>>> >> > A recent patch broke clean builds for me, I am trying to see how
>>> >> > widespread this issue is and whether we need to revert the patch.
>>> >> >
>>> >> > The error I've seen is this when building the examples project:
>>> >> >
>>> >> > spark-examples_2.10: Could not resolve dependencies for project
>>> >> > org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
>>> >> > find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
>>> >> >
>>> >> >
>>> >> > /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar
>>> >> >
>>> >> > The reason for this error is that hbase-annotations is using a
>>> >> > "system" scoped dependency in their hbase-annotations pom, and this
>>> >> > doesn't work with certain JDK layouts such as that provided on Mac
>>> >> > OS:
>>> >> >
>>> >> >
>>> >> >
>>> >> > http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom
>>> >> >
>>> >> > Has anyone else seen this or is it just me?
>>> >> >
>>> >> > - Patrick
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>>> >>
>>> >
>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Patrick Wendell
Neither is strictly optimal which is why we ended up supporting both.
Our reference build for packaging is Maven so you are less likely to
run into unexpected dependency issues, etc. Many developers use sbt as
well. It's somewhat a matter of religion, and the best thing might be to try both
and see which you prefer.

- Patrick

On Sun, Nov 16, 2014 at 1:47 PM, Mark Hamstra  wrote:
>>
>> The console mode of sbt (just run
>> sbt/sbt and then a long running console session is started that will accept
>> further commands) is great for building individual subprojects or running
>> single test suites.  In addition to being faster since its a long running
>> JVM, its got a lot of nice features like tab-completion for test case
>> names.
>
>
> We include the scala-maven-plugin in spark/pom.xml, so equivalent
> functionality is available using Maven.  You can start a console session
> with `mvn scala:console`.
>
>
> On Sun, Nov 16, 2014 at 1:23 PM, Michael Armbrust 
> wrote:
>
>> I'm going to have to disagree here.  If you are building a release
>> distribution or integrating with legacy systems then maven is probably the
>> correct choice.  However most of the core developers that I know use sbt,
>> and I think its a better choice for exploration and development overall.
>> That said, this probably falls into the category of a religious argument so
>> you might want to look at both options and decide for yourself.
>>
>> In my experience the SBT build is significantly faster with less effort
>> (and I think sbt is still faster even if you go through the extra effort of
>> installing zinc) and easier to read.  The console mode of sbt (just run
>> sbt/sbt and then a long running console session is started that will accept
>> further commands) is great for building individual subprojects or running
>> single test suites.  In addition to being faster since its a long running
>> JVM, its got a lot of nice features like tab-completion for test case
>> names.
>>
>> For example, if I wanted to see what test cases are available in the SQL
>> subproject you can do the following:
>>
>> [marmbrus@michaels-mbp spark (tpcds)]$ sbt/sbt
>> [info] Loading project definition from
>> /Users/marmbrus/workspace/spark/project/project
>> [info] Loading project definition from
>>
>> /Users/marmbrus/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
>> [info] Set current project to spark-parent (in build
>> file:/Users/marmbrus/workspace/spark/)
>> > sql/test-only **
>> --
>>  org.apache.spark.sql.CachedTableSuite
>> org.apache.spark.sql.DataTypeSuite
>>  org.apache.spark.sql.DslQuerySuite
>> org.apache.spark.sql.InsertIntoSuite
>> ...
>>
>> Another very useful feature is the development console, which starts an
>> interactive REPL including the most recent version of the code and a lot of
>> useful imports for some subprojects.  For example in the hive subproject it
>> automatically sets up a temporary database with a bunch of test data
>> pre-loaded:
>>
>> $ sbt/sbt hive/console
>> > hive/console
>> ...
>> import org.apache.spark.sql.hive._
>> import org.apache.spark.sql.hive.test.TestHive._
>> import org.apache.spark.sql.parquet.ParquetTestData
>> Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.7.0_45).
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala> sql("SELECT * FROM src").take(2)
>> res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86])
>>
>> Michael
>>
>> On Sun, Nov 16, 2014 at 3:27 AM, Dinesh J. Weerakkody <
>> dineshjweerakk...@gmail.com> wrote:
>>
>> > Hi Stephen and Sean,
>> >
>> > Thanks for correction.
>> >
>> > On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen  wrote:
>> >
>> > > No, the Maven build is the main one.  I would use it unless you have a
>> > > need to use the SBT build in particular.
>> > > On Nov 16, 2014 2:58 AM, "Dinesh J. Weerakkody" <
>> > > dineshjweerakk...@gmail.com> wrote:
>> > >
>> > >> Hi Yiming,
>> > >>
>> > >> I believe that both SBT and MVN is supported in SPARK, but SBT is
>> > >> preferred
>> > >> (I'm not 100% sure about this :) ). When I'm using MVN I got some
>> build
>> > >> failures. After that used SBT and works fine.
>> > >>
>> > >> You can go through these discussions regarding SBT vs MVN and learn
>> pros
>> > >> and cons of both [1] [2].
>> > >>
>> > >> [1]
>> > >>
>> > >>
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-td2315.html
>> > >>
>> > >> [2]
>> > >>
>> > >>
>> >
>> https://groups.google.com/forum/#!msg/spark-developers/OxL268v0-Qs/fBeBY8zmh3oJ
>> > >>
>> > >> Thanks,
>> > >>
>> > >> On Sun, Nov 16, 2014 at 7:11 AM, Yiming (John) Zhang <
>> sdi...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> >
>> > >> >
>> > >> > I am new in developing Spark and my current focus is about
>> > >> co-scheduling of
>> > >> > spark tasks. However, I am confused with the building tools:
>> sometimes
>> > >> the
>> > >> > doc

[ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-17 Thread Patrick Wendell
Hi All,

I've just posted a preview of the Spark 1.2.0. release for community
regression testing.

Issues reported now will get close attention, so please help us test!
You can help by running an existing Spark 1.X workload on this and
reporting any regressions. As we start voting, etc, the bar for
reported issues to hold the release will get higher and higher, so
test early!

The tag is v1.2.0-snapshot1 (commit 38c1fbd96)

The release files, including signatures, digests, etc can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-snapshot1

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1038/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-snapshot1-docs/

== Notes ==
- Maven artifacts are published for both Scala 2.10 and 2.11. Binary
distributions are not posted for Scala 2.11 yet, but will be posted
soon.

- There are two significant config default changes that users may want
to revert if doing A/B testing against older versions.

"spark.shuffle.manager" default has changed to "sort" (was "hash")
"spark.shuffle.blockTransferService" default has changed to "netty" (was "nio")

- This release contains a shuffle service for YARN. This jar is
present in all Hadoop 2.X binary packages in
"lib/spark-1.2.0-yarn-shuffle.jar"

Cheers,
Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Patrick Wendell
Hey Kevin,

If you are upgrading from 1.0.X to 1.1.X, check out the upgrade notes
here [1] - it could be that default changes caused a regression for
your workload. Do you still see a regression if you restore the
configuration changes?

It's great to hear specifically about issues like this, so please fork
a new thread and describe your workload if you see a regression. The
main focus of a patch release vote like this is to test regressions
against the previous release on the same line (e.g. 1.1.1 vs 1.1.0)
though of course we still want to be cognizant of 1.0-to-1.1
regressions and make sure we can address them down the road.

[1] https://spark.apache.org/releases/spark-release-1-1-0.html

On Mon, Nov 17, 2014 at 2:04 PM, Kevin Markey  wrote:
> +0 (non-binding)
>
> Compiled Spark, recompiled and ran application with 1.1.1 RC1 with Yarn,
> plain-vanilla Hadoop 2.3.0. No regressions.
>
> However, 12% to 22% increase in run time relative to 1.0.0 release.  (No
> other environment or configuration changes.)  Would have recommended +1 were
> it not for added latency.
>
> Not sure if added latency a function of 1.0 vs 1.1 or 1.0 vs 1.1.1 changes,
> as we've never tested with 1.1.0. But thought I'd share the results.  (This
> is somewhat disappointing.)
>
> Kevin Markey
>
>
> On 11/17/2014 11:42 AM, Debasish Das wrote:
>>
>> Andrew,
>>
>> I put up 1.1.1 branch and I am getting shuffle failures while doing
>> flatMap
>> followed by groupBy...My cluster memory is less than the memory I need and
>> therefore flatMap does around 400 GB of shuffle...memory is around 120
>> GB...
>>
>> 14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in stage 191.0 (TID
>> 4084, istgbd020.hadoop.istg.verizon.com): FetchFailed(null, shuffleId=4,
>> mapId=-1, reduceId=22)
>>
>> I searched on user-list and this issue has been found over there:
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-partitionBy-FetchFailed-td14760.html
>>
>> I wanted to make sure whether 1.1.1 does not have the same bug...-1 from
>> me
>> till we figure out the root cause...
>>
>> Thanks.
>>
>> Deb
>>
>> On Mon, Nov 17, 2014 at 10:33 AM, Andrew Or  wrote:
>>
>>> This seems like a legitimate blocker. We will cut another RC to include
>>> the
>>> revert.
>>>
>>> 2014-11-16 17:29 GMT-08:00 Kousuke Saruta :
>>>
Now I've finished reverting SPARK-4434 and opened a PR.


 (2014/11/16 17:08), Josh Rosen wrote:

> -1
>
> I found a potential regression in 1.1.1 related to spark-submit and
> cluster
> deploy mode: https://issues.apache.org/jira/browse/SPARK-4434
>
> I think that this is worth fixing.
>
> On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian 
> wrote:
>
>   +1
>>
>>
>> Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues
>> are
>> fixed. Hive version inspection works as expected.
>>
>>
>> On 11/15/14 8:25 AM, Zach Fry wrote:
>>
>>   +0
>>>
>>>
>>> I expect to start testing on Monday but won't have enough results to
>>> change
>>> my vote from +0
>>> until Monday night or Tuesday morning.
>>>
>>> Thanks,
>>> Zach
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-
>>> developers-list.1001551.n3.nabble.com/VOTE-Release-
>>> Apache-Spark-1-1-1-RC1-tp9311p9370.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>>
>>>
>>> -
>>
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>>

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Apache infra github sync down

2014-11-18 Thread Patrick Wendell
Hey All,

The Apache-->github mirroring is not working right now and hasn't been
working for more than 24 hours. This means that pull requests will not
appear as closed even though they have been merged. It also causes
diffs to display incorrectly in some cases. If you'd like to follow
progress by Apache infra on this issue you can watch this JIRA:

https://issues.apache.org/jira/browse/INFRA-8654

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Build break

2014-11-19 Thread Patrick Wendell
Hey All,

Just a heads up. I merged this patch last night which caused the Spark
build to break:

https://github.com/apache/spark/commit/397d3aae5bde96b01b4968dde048b6898bb6c914

The patch itself was fine and previously had passed on Jenkins. The
issue was that other intermediate changes merged since it last passed,
and the combination of those changes with the patch caused an issue
with our binary compatibility tests. This kind of race condition can
happen from time to time.

I've merged in a hot fix that should resolve this:
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0df02ca463a4126e5437b37114c6759a57ab71ee

We'll keep an eye on this and make sure future builds are passing.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark development with IntelliJ

2014-11-20 Thread Patrick Wendell
Hi All,

I noticed people sometimes struggle to get Spark set up in IntelliJ.
I'd like to maintain comprehensive instructions on our Wiki to make
this seamless for future developers. Due to some nuances of our build,
getting to the point where you can build + test every module from
within the IDE is not trivial. I created a reference here:

https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA

I'd love people to independently test this and/or share potential improvements.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Automated github closing of issues is not working

2014-11-21 Thread Patrick Wendell
After we merge pull requests in Spark they are closed via a special
message we put in each commit description ("Closes #XXX"). This
feature stopped working around 21 hours ago causing already-merged
pull requests to display as open.

I've contacted Github support with the issue. No word from them yet.

It is not clear whether this relates to the recent delays syncing with GitHub.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How spark and hive integrate in long term?

2014-11-22 Thread Patrick Wendell
There are two distinct topics when it comes to hive integration. Part
of the 1.3 roadmap will likely be better defining the plan for Hive
integration as Hive adds future versions.

1. Ability to interact with Hive metastore's from different versions
==> I.e. if a user has a metastore, can Spark SQL read the data? This
one we want need to solve by asking Hive for a stable metastore thrift
API, or adding sufficient features to the HCatalog API so we can use
that.

2. Compatibility with HQL over time as Hive adds new features.
==> This relates to how often we update our internal library
dependency on Hive and/or build support for new Hive features
internally.

On Sat, Nov 22, 2014 at 10:01 AM, Zhan Zhang  wrote:
> Thanks Cheng for the insights.
>
> Regarding the HCatalog, I did some initial investigation too and agree with 
> you. As of now, it seems not a good solution. I will try to talk to Hive 
> people to see whether there is such guarantee for downward compatibility for 
> thrift protocol. By the way, I tried some basic functions using hive-0.13 
> connect to hive-0.14 metastore, and it looks like they are compatible.
>
> Thanks.
>
> Zhan Zhang
>
>
> On Nov 22, 2014, at 7:14 AM, Cheng Lian  wrote:
>
>> Should emphasize that this is still a quick and rough conclusion, will 
>> investigate this in more detail after 1.2.0 release. Anyway we really like 
>> to provide Hive support in Spark SQL as smooth and clean as possible for 
>> both developers and end users.
>>
>> On 11/22/14 11:05 PM, Cheng Lian wrote:
>>>
>>> Hey Zhan,
>>>
>>> This is a great question. We are also seeking for a stable API/protocol 
>>> that works with multiple Hive versions (esp. 0.12+). SPARK-4114 
>>>  was opened for this. Did 
>>> some research into HCatalog recently, but I must confess that I'm not an 
>>> expert on HCatalog, actually spent only 1 day on exploring it. So please 
>>> don't hesitate to correct me if I was wrong about the conclusions I made 
>>> below.
>>>
>>> First, although HCatalog API is more pleasant to work with, it's 
>>> unfortunately feature incomplete. It only provides a subset of most 
>>> commonly used operations. For example, |HCatCreateTableDesc| maps only a 
>>> subset of |CreateTableDesc|, properties like |storeAsSubDirectories|, 
>>> |skewedColNames| and |skewedColValues| are missing. It's also impossible to 
>>> alter table properties via HCatalog API (Spark SQL uses this to implement 
>>> the |ANALYZE| command). The |hcat| CLI tool provides all those features 
>>> missing in HCatalog API via raw Metastore API, and is structurally similar 
>>> to the old Hive CLI.
>>>
>>> Second, HCatalog API itself doesn't ensure compatibility, it's the Thrift 
>>> protocol that matters. HCatalog is directly built upon raw Metastore API, 
>>> and talks the same Metastore Thrift protocol. The problem we encountered in 
>>> Spark SQL is that, usually we deploy Spark SQL Hive support with embedded 
>>> mode (for testing) or local mode Metastore, and this makes us suffer from 
>>> things like Metastore database schema changes. If Hive Metastore Thrift 
>>> protocol is guaranteed to be downward compatible, then hopefully we can 
>>> resort to remote mode Metastore and always depend on most recent Hive APIs. 
>>> I had a glance of Thrift protocol version handling code in Hive, it seems 
>>> that downward compatibility is not an issue. However I didn't find any 
>>> official documents about Thrift protocol compatibility.
>>>
>>> That said, in the future, hopefully we can only depend on most recent Hive 
>>> dependencies and remove the Hive shim layer introduced in branch 1.2. For 
>>> users who use exactly the same version of Hive as Spark SQL, they can use 
>>> either remote or local/embedded Metastore; while for users who want to 
>>> interact with existing legacy Hive clusters, they have to setup a remote 
>>> Metastore and let the Thrift protocol to handle compatibility.
>>>
>>> -- Cheng
>>>
>>> On 11/22/14 6:51 AM, Zhan Zhang wrote:
>>>
 Now Spark and hive integration is a very nice feature. But I am wondering
 what the long term roadmap is for spark integration with hive. Both of 
 these
 two projects are undergoing fast improvement and changes. Currently, my
 understanding is that spark hive sql part relies on hive meta store and
 basic parser to operate, and the thrift-server intercept hive query and
 replace it with its own engine.

 With every release of hive, there need a significant effort on spark part 
 to
 support it.

 For the metastore part, we may possibly replace it with hcatalog. But given
 the dependency of other parts on hive, e.g., metastore, thriftserver,
 hcatlog may not be able to help much.

 Does anyone have any insight or idea in mind?

 Thanks.

 Zhan Zhang



 --
 View this message in 
 context:http://apache-spark-developers-list.10015

Re: Apache infra github sync down

2014-11-22 Thread Patrick Wendell
Hi All,

Unfortunately this went back down again. I've opened a new JIRA to track it:

https://issues.apache.org/jira/browse/INFRA-8688

- Patrick

On Tue, Nov 18, 2014 at 10:24 PM, Patrick Wendell  wrote:
> Hey All,
>
> The Apache-->github mirroring is not working right now and hasn't been
> working for more than 24 hours. This means that pull requests will not
> appear as closed even though they have been merged. It also causes
> diffs to display incorrectly in some cases. If you'd like to follow
> progress by Apache infra on this issue you can watch this JIRA:
>
> https://issues.apache.org/jira/browse/INFRA-8654
>
> - Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-23 Thread Patrick Wendell
+1 (binding).

Don't see any evidence of regressions at this point. The issue
reported by Hector was not related to this release.

On Sun, Nov 23, 2014 at 9:50 AM, Debasish Das  wrote:
> -1 from me...same FetchFailed issue as what Hector saw...
>
> I am running Netflix dataset and dumping out recommendation for all users.
> It shuffles around 100 GB data on disk to run a reduceByKey per user on
> utils.BoundedPriorityQueue...The code runs fine with MovieLens1m dataset...
>
> I gave Spark 10 nodes, 8 cores, 160 GB of memory.
>
> Fails with the following FetchFailed errors.
>
> 14/11/23 11:51:22 WARN TaskSetManager: Lost task 28.0 in stage 188.0 (TID
> 2818, tblpmidn08adv-hdp.tdc.vzwcorp.com): FetchFailed(BlockManagerId(1,
> tblpmidn03adv-hdp.tdc.vzwcorp.com, 52528, 0), shuffleId=35, mapId=28,
> reduceId=28)
>
> It's a consistent behavior on master as well.
>
> I tested it both on YARN and Standalone. I compiled spark-1.1 branch
> (assuming it has all the fixes from RC2 tag.
>
> I am now compiling spark-1.0 branch and see if this issue shows up there as
> well. If it is related to hash/sort based shuffle most likely it won't show
> up on 1.0.
>
> Thanks.
>
> Deb
>
> On Thu, Nov 20, 2014 at 12:16 PM, Hector Yee  wrote:
>
>> Whoops I must have used the 1.2 preview and mixed them up.
>>
>> spark-shell -version shows  version 1.2.0
>>
>> Will update the bug https://issues.apache.org/jira/browse/SPARK-4516 to
>> 1.2
>>
>> On Thu, Nov 20, 2014 at 11:59 AM, Matei Zaharia 
>> wrote:
>>
>> > Ah, I see. But the spark.shuffle.blockTransferService property doesn't
>> > exist in 1.1 (AFAIK) -- what exactly are you doing to get this problem?
>> >
>> > Matei
>> >
>> > On Nov 20, 2014, at 11:50 AM, Hector Yee  wrote:
>> >
>> > This is whatever was in http://people.apache.org/~andrewor14/spark-1
>> > .1.1-rc2/
>> >
>> > On Thu, Nov 20, 2014 at 11:48 AM, Matei Zaharia > >
>> > wrote:
>> >
>> >> Hector, is this a comment on 1.1.1 or on the 1.2 preview?
>> >>
>> >> Matei
>> >>
>> >> > On Nov 20, 2014, at 11:39 AM, Hector Yee 
>> wrote:
>> >> >
>> >> > I think it is a race condition caused by netty deactivating a channel
>> >> while
>> >> > it is active.
>> >> > Switched to nio and it works fine
>> >> > --conf spark.shuffle.blockTransferService=nio
>> >> >
>> >> > On Thu, Nov 20, 2014 at 10:44 AM, Hector Yee 
>> >> wrote:
>> >> >
>> >> >> I'm still seeing the fetch failed error and updated
>> >> >> https://issues.apache.org/jira/browse/SPARK-3633
>> >> >>
>> >> >> On Thu, Nov 20, 2014 at 10:21 AM, Marcelo Vanzin <
>> van...@cloudera.com>
>> >> >> wrote:
>> >> >>
>> >> >>> +1 (non-binding)
>> >> >>>
>> >> >>> . ran simple things on spark-shell
>> >> >>> . ran jobs in yarn client & cluster modes, and standalone cluster
>> mode
>> >> >>>
>> >> >>> On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or 
>> >> wrote:
>> >>  Please vote on releasing the following candidate as Apache Spark
>> >> version
>> >>  1.1.1.
>> >> 
>> >>  This release fixes a number of bugs in Spark 1.1.0. Some of the
>> >> notable
>> >> >>> ones
>> >>  are
>> >>  - [SPARK-3426] Sort-based shuffle compression settings are
>> >> incompatible
>> >>  - [SPARK-3948] Stream corruption issues in sort-based shuffle
>> >>  - [SPARK-4107] Incorrect handling of Channel.read() led to data
>> >> >>> truncation
>> >>  The full list is at http://s.apache.org/z9h and in the CHANGES.txt
>> >> >>> attached.
>> >> 
>> >>  Additionally, this candidate fixes two blockers from the previous
>> RC:
>> >>  - [SPARK-4434] Cluster mode jar URLs are broken
>> >>  - [SPARK-4480][SPARK-4467] Too many open files exception from
>> shuffle
>> >> >>> spills
>> >> 
>> >>  The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d):
>> >>  http://s.apache.org/p8
>> >> 
>> >>  The release files, including signatures, digests, etc can be found
>> >> at:
>> >>  http://people.apache.org/~andrewor14/spark-1.1.1-rc2/
>> >> 
>> >>  Release artifacts are signed with the following key:
>> >>  https://people.apache.org/keys/committer/andrewor14.asc
>> >> 
>> >>  The staging repository for this release can be found at:
>> >> 
>> >> https://repository.apache.org/content/repositories/orgapachespark-1043/
>> >> 
>> >>  The documentation corresponding to this release can be found at:
>> >>  http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/
>> >> 
>> >>  Please vote on releasing this package as Apache Spark 1.1.1!
>> >> 
>> >>  The vote is open until Saturday, November 22, at 23:00 UTC and
>> >> passes if
>> >>  a majority of at least 3 +1 PMC votes are cast.
>> >>  [ ] +1 Release this package as Apache Spark 1.1.1
>> >>  [ ] -1 Do not release this package because ...
>> >> 
>> >>  To learn more about Apache Spark, please see
>> >>  http://spark.apache.org/
>> >> 
>> >>  Cheers,
>> >>  Andrew
>> >> 
>> >> 
>> >> 
>> --

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-23 Thread Patrick Wendell
Hey Stephen,

Thanks for bringing this up. Technically when we call a release vote
it needs to be on the exact commit that will be the final release.
However, one thing I've thought of doing for a while would be to
publish the maven artifacts using a version tag with $VERSION-rcX even
if the underlying commit has $VERSION in the pom files. Some recent
changes I've made to the way we do publishing in branch 1.2 should
make this pretty easy - it wasn't very easy before because we used
maven's publishing plugin which makes modifying the published version
tricky. Our current approach is, indeed, problematic because maven
artifacts are supposed to be immutable once they have a specific
version identifier.

I created SPARK-4568 to track this:
https://issues.apache.org/jira/browse/SPARK-4568

- Patrick

On Sun, Nov 23, 2014 at 8:11 PM, Matei Zaharia  wrote:
> Interesting, perhaps we could publish each one with two IDs, of which the rc 
> one is unofficial. The problem is indeed that you have to vote on a hash for 
> a potentially final artifact.
>
> Matei
>
>> On Nov 23, 2014, at 7:54 PM, Stephen Haberman  
>> wrote:
>>
>> Hi,
>>
>> I wanted to try 1.1.1-rc2 because we're running into SPARK-3633, but
>> the"rc" releases not being tagged with "-rcX" means the pre-built artifacts
>> are basically useless to me.
>>
>> (Pedantically, to test a release, I have to upload it into our internal
>> repo, to compile jobs, start clusters, etc. Invariably when an rcX artifact
>> ends up not being final, then I'm screwed, because I would have to clear
>> the local cache of any of our machines, dev/Jenkins/etc., that ever
>> downloaded the "formerly known as 1.1.1 but not really" rc artifacts.)
>>
>> What's frustrating is that I know other Apache projects do rc releases, and
>> even get them into Maven central, e.g.:
>>
>> http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.tapestry%22%20AND%20a%3A%22tapestry-ioc%22
>>
>> So, I apologize for the distraction from getting real work done, but
>> perhaps you guys could find a creative way to work around the
>> well-intentioned mandate on artifact voting?
>>
>> (E.g. perhaps have multiple votes, one for each successive rc (with -rcX
>> suffix), then, once blessed, another one on the actually-final/no-rcX
>> artifact (built from the last rc's tag); or publish no-rcX artifacts for
>> official voting, as today, but then, at the same time, add -rcX artifacts
>> to Maven central for non-binding/3rd party testing, etc.)
>>
>> Thanks,
>> Stephen
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Notes on writing complex spark applications

2014-11-23 Thread Patrick Wendell
Hey Evan,

It might be nice to merge this into existing documentation. In
particular, a lot of this could serve to update the current tuning
section and programming guides.

It could also work to paste this wholesale as a reference for Spark
users, but in that case it's less likely to get updated when other
things change, or be found by users reading through the spark docs.

- Patrick

On Sun, Nov 23, 2014 at 8:27 PM, Inkyu Lee  wrote:
> Very helpful!!
>
> thank you very much!
>
> 2014-11-24 2:17 GMT+09:00 Sam Bessalah :
>
>> Thanks Evan, this is great.
>> On Nov 23, 2014 5:58 PM, "Evan R. Sparks"  wrote:
>>
>> > Hi all,
>> >
>> > Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been
>> > working on a short document about writing high performance Spark
>> > applications based on our experience developing MLlib, GraphX, ml-matrix,
>> > pipelines, etc. It may be a useful document both for users and new Spark
>> > developers - perhaps it should go on the wiki?
>> >
>> > The document itself is here:
>> >
>> >
>> https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
>> > and I've created SPARK-4565
>> >  to track this.
>> >
>> > - Evan
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-28 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.0!

The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
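
A minimal signature check looks roughly like this (the artifact name is a
placeholder for whichever file you download):

  wget https://people.apache.org/keys/committer/pwendell.asc
  gpg --import pwendell.asc
  gpg --verify spark-1.2.0-rc1.tgz.asc spark-1.2.0-rc1.tgz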

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1048/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.0!

The vote is open until Tuesday, December 02, at 05:15 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.1.X, minor
regressions, or bugs related to new features will not block this
release.

== What default changes should I be aware of? ==
1. The default value of "spark.shuffle.blockTransferService" has been
changed to "netty"
--> Old behavior can be restored by switching to "nio"

2. The default value of "spark.shuffle.manager" has been changed to "sort".
--> Old behavior can be restored by setting "spark.shuffle.manager" to "hash".

== Other notes ==
Because this vote is occurring over a weekend, I will likely extend
the vote if this RC survives until the end of the vote period.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-29 Thread Patrick Wendell
Thanks for pointing this out, Matei. I don't think a minor typo like
this is a big deal. Hopefully it's clear to everyone this is the 1.2.0
release vote, as indicated by the subject and all of the artifacts.

On Sat, Nov 29, 2014 at 1:26 AM, Matei Zaharia  wrote:
> Hey Patrick, unfortunately you got some of the text here wrong, saying 1.1.0 
> instead of 1.2.0. Not sure it will matter since there can well be another RC 
> after testing, but we should be careful.
>
> Matei
>
>> On Nov 28, 2014, at 9:16 PM, Patrick Wendell  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.0!
>>
>> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1048/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.0!
>>
>> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.1.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening very late into the QA period compared with
>> previous votes, so -1 votes should only occur for significant
>> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>> regressions, or bugs related to new features will not block this
>> release.
>>
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.shuffle.blockTransferService" has been
>> changed to "netty"
>> --> Old behavior can be restored by switching to "nio"
>>
>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>> --> Old behavior can be restored by setting "spark.shuffle.manager" to 
>> "hash".
>>
>> == Other notes ==
>> Because this vote is occurring over a weekend, I will likely extend
>> the vote if this RC survives until the end of the vote period.
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Trouble testing after updating to latest master

2014-11-29 Thread Patrick Wendell
Thanks for reporting this. One thing to try is to just do a git clean
to make sure you have a totally clean working space ("git clean -fdx"
will blow away any differences you have from the repo, of course only
do that if you don't have other files around). Can you reproduce this
if you just run "sbt/sbt compile"? Also, if you can, can you reproduce
it if you checkout only the spark master branch and not merged with
your own code? Finally, if you can reproduce it on master, can you
perform a bisection to find out which commit caused it?
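
A bisection would look roughly like this (the "good" commit is a
placeholder for the last revision that built cleanly for you):

  git bisect start
  git bisect bad                    # current HEAD shows the problem
  git bisect good <last-good-sha>
  # git then checks out a candidate; build it and mark the result:
  sbt/sbt compile                   # then run: git bisect good (or bad)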

- Patrick

On Sat, Nov 29, 2014 at 10:29 PM, Ganelin, Ilya
 wrote:
> Hi all - I've just merged in the latest changes from the Spark master branch 
> to my local branch. I am able to build just fine with
> mvn clean package
> However, when I attempt to run dev/run-tests, I get the following error:
>
> Using /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home as 
> default JAVA_HOME.
> Note, this will be overridden by -java-home if it is set.
> Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
> [error] Got a return code of 1 on line 163 of the run-tests script.
>
> With an individual test I get the same error. I have tried downloading a new 
> copy of SBT 0.13.6 but it has not helped. Does anyone have any suggestions 
> for getting this running? Things worked fine before updating Spark.
> 
>
> The information contained in this e-mail is confidential and/or proprietary 
> to Capital One and/or its affiliates. The information transmitted herewith is 
> intended only for use by the individual or entity to which it is addressed.  
> If the reader of this message is not the intended recipient, you are hereby 
> notified that any review, retransmission, dissemination, distribution, 
> copying or other use of, or taking of any action in reliance upon this 
> information is strictly prohibited. If you have received this communication 
> in error, please contact the sender and delete the material from your 
> computer.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Trouble testing after updating to latest master

2014-11-29 Thread Patrick Wendell
Sounds good. Glad you got it working.

On Sat, Nov 29, 2014 at 11:16 PM, Ganelin, Ilya
 wrote:
> I am able to successfully run sbt/sbt compile and run the tests after
> running git clean -fdx. I'm guessing network issues wound up corrupting
> some of the files that had been downloaded. Thanks, Patrick!
>
>
> On 11/29/14, 10:52 PM, "Patrick Wendell"  wrote:
>
>>Thanks for reporting this. One thing to try is to just do a git clean
>>to make sure you have a totally clean working space ("git clean -fdx"
>>will blow away any differences you have from the repo, of course only
>>do that if you don't have other files around). Can you reproduce this
>>if you just run "sbt/sbt compile"? Also, if you can, can you reproduce
>>it if you check out only the Spark master branch, without merging in
>>your own code? Finally, if you can reproduce it on master, can you
>>perform a bisection to find out which commit caused it?
>>
>>- Patrick
>>
>>On Sat, Nov 29, 2014 at 10:29 PM, Ganelin, Ilya
>> wrote:
>>> Hi all - I've just merged in the latest changes from the Spark master
>>>branch to my local branch. I am able to build just fine with
>>> mvn clean package
>>> However, when I attempt to run dev/run-tests, I get the following error:
>>>
>>> Using /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home
>>>as default JAVA_HOME.
>>> Note, this will be overridden by -java-home if it is set.
>>> Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
>>> [error] Got a return code of 1 on line 163 of the run-tests script.
>>>
>>> With an individual test I get the same error. I have tried downloading
>>>a new copy of SBT 0.13.6 but it has not helped. Does anyone have any
>>>suggestions for getting this running? Things worked fine before updating
>>>Spark.
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Hey Ryan,

A few more things here. You should feel free to send patches to
Jenkins to test them, since this is the reference environment in which
we regularly run tests. This is the normal workflow for most
developers and we spend a lot of effort provisioning/maintaining a
very large jenkins cluster to allow developers access to this resource. A
common development approach is to locally run tests that you've added
in a patch, then send it to jenkins for the full run, and then try to
debug locally if you see specific unanticipated test failures.

One challenge we have is that given the proliferation of OS versions,
Java versions, Python versions, ulimits, etc. there is a combinatorial
number of environments in which tests could be run. It is very hard in
some cases to figure out post-hoc why a given test is not working in a
specific environment. I think a good solution here would be to use a
standardized docker container for running Spark tests and asking folks
to use that locally if they are trying to run all of the hundreds of
Spark tests.
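
As a sketch of that idea (the image name here is hypothetical; no such
standardized image exists today):

  # hypothetical image that pins the JDK, Python version, ulimits, etc.
  IMAGE=spark-test-env:latest
  docker run --rm -it -v "$PWD":/spark -w /spark "$IMAGE" ./dev/run-tests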

Another solution would be to mock out every system interaction in
Spark's tests including e.g. filesystem interactions to try and reduce
variance across environments. However, that seems difficult.

As the number of developers of Spark increases, it's definitely a good
idea for us to invest in developer infrastructure including things
like snapshot releases, better documentation, etc. Thanks for bringing
this up as a pain point.

- Patrick


On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
 wrote:
> thanks for the info, Matei and Brennon. I will try to switch my workflow to
> using sbt. Other potential action items:
>
> - currently the docs only contain information about building with maven,
> and even then don't cover many important cases, as I described in my
> previous email. If SBT is as much better as you've described then that
> should be made much more obvious. Wasn't it the case recently that there
> was only a page about building with SBT, and not one about building with
> maven? Clearer messaging around this needs to exist in the documentation,
> not just on the mailing list, imho.
>
> - +1 to better distinguishing between unit and integration tests, having
> separate scripts for each, improving documentation around common workflows,
> expectations of brittleness with each kind of test, advisability of just
> relying on Jenkins for certain kinds of tests to not waste too much time,
> etc. Things like the compiler crash should be discussed in the
> documentation, not just in the mailing list archives, if new contributors
> are likely to run into them through no fault of their own.
>
> - What is the algorithm you use to decide what tests you might have broken?
> Can we codify it in some scripts that other people can use?
>
>
>
> On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia 
> wrote:
>
>> Hi Ryan,
>>
>> As a tip (and maybe this isn't documented well), I normally use SBT for
>> development to avoid the slow build process, and use its interactive
>> console to run only specific tests. The nice advantage is that SBT can keep
>> the Scala compiler loaded and JITed across builds, making it faster to
>> iterate. To use it, you can do the following:
>>
>> - Start the SBT interactive console with sbt/sbt
>> - Build your assembly by running the "assembly" target in the assembly
>> project: assembly/assembly
>> - Run all the tests in one module: core/test
>> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
>> also supports tab completion)
>>
>> Running all the tests does take a while, and I usually just rely on
>> Jenkins for that once I've run the tests for the things I believed my patch
>> could break. But this is because some of them are integration tests (e.g.
>> DistributedSuite, which creates multi-process mini-clusters). Many of the
>> individual suites run fast without requiring this, however, so you can pick
>> the ones you want. Perhaps we should find a way to tag them so people  can
>> do a "quick-test" that skips the integration ones.
>>
>> The assembly builds are annoying but they only take about a minute for me
>> on a MacBook Pro with SBT warmed up. The assembly is actually only required
>> for some of the "integration" tests (which launch new processes), but I'd
>> recommend doing it all the time anyway since it would be very confusing to
>> run those with an old assembly. The Scala compiler crash issue can also be
>> a problem, but I don't see it very often with SBT. If it happens, I exit
>> SBT and do sbt clean.
>>
>> Anyway, this is useful feedback and I think we should try to improve some
>> of these suites, but hopefully you can also try the faster SBT process. At
>> the end of the day, if we want integration tests, the whole test process
>> will take an hour, but most of the developers I know leave that to Jenkins
>> and only run individual tests locally before submitting a patch.
>>
>> Matei
>>
>>
>> > On Nov 30, 2014, at 2:39

Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Hey Ryan,

The existing JIRA also covers publishing nightly docs:
https://issues.apache.org/jira/browse/SPARK-1517

- Patrick

On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
 wrote:
> Thanks Nicholas, glad to hear that some of this info will be pushed to the
> main site soon, but this brings up yet another point of confusion that I've
> struggled with, namely whether the documentation on github or that on
> spark.apache.org should be considered the primary reference for people
> seeking to learn about best practices for developing Spark.
>
> Trying to read docs starting from
> https://github.com/apache/spark/blob/master/docs/index.md right now, I find
> that all of the links to other parts of the documentation are broken: they
> point to relative paths that end in ".html", which will work when published
> on the docs-site, but that would have to end in ".md" if a person was to be
> able to navigate them on github.
>
> So expecting people to use the up-to-date docs on github (where all
> internal URLs 404 and the main github README suggests that the "latest
> Spark documentation" can be found on the actually-months-old docs-site
> <https://github.com/apache/spark#online-documentation>) is not a good
> solution. On the other hand, consulting months-old docs on the site is also
> problematic, as this thread and your last email have borne out.  The result
> is that there is no good place on the internet to learn about the most
> up-to-date best practices for using/developing Spark.
>
> Why not build http://spark.apache.org/docs/latest/ nightly (or every
> commit) off of what's in github, rather than having that URL point to the
> last release's docs (up to ~3 months old)? This way, casual users who want
> the docs for the released version they happen to be using (which is already
> frequently != "/latest" today, for many Spark users) can (still) find them
> at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
> point people to a site (/latest) that actually has up-to-date docs that
> reflect ToT and whose links work.
>
> If there are concerns about existing semantics around "/latest" URLs being
> broken, some new URL could be used, like
> http://spark.apache.org/docs/snapshot/, but given that everything under
> http://spark.apache.org/docs/latest/ is in a state of
> planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
> that serious an issue to me; anyone sending around permanent links to
> things under /latest is already going to have those links break / not make
> sense in the near future.
>
>
> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>>
>>- currently the docs only contain information about building with
>>maven,
>>and even then don't cover many important cases
>>
>>  All other points aside, I just want to point out that the docs document
>> both how to use Maven and SBT and clearly state
>> <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
>> that Maven is the "build of reference" while SBT may be preferable for
>> day-to-day development.
>>
>> I believe the main reason most people miss this documentation is that,
>> though it's up-to-date on GitHub, it hasn't been published yet to the docs
>> site. It should go out with the 1.2 release.
>>
>> Improvements to the documentation on building Spark belong here:
>> https://github.com/apache/spark/blob/master/docs/building-spark.md
>>
>> If there are clear recommendations that come out of this thread but are
>> not in that doc, they should be added in there. Other, less important
>> details may possibly be better suited for the Contributing to Spark
>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
>> guide.
>>
>> Nick
>>
>>
>> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell 
>> wrote:
>>
>>> Hey Ryan,
>>>
>>> A few more things here. You should feel free to send patches to
>>> Jenkins to test them, since this is the reference environment in which
>>> we regularly run tests. This is the normal workflow for most
>>> developers and we spend a lot of effort provisioning/maintaining a
>>> very large jenkins cluster to allow developers access to this resource. A
>>> common development approach is to locally run tests that you've added
>>> in a patch, then send it to jenkins for the full run, and then try to
>>> debug locally if you see specific unanticipated test failures.
>>>

Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Btw - the documentation on github represents the source code of our
docs, which is versioned with each release. Unfortunately github will
always try to render ".md" files, so it could look to a passerby like
this is supposed to represent published docs. This is a feature
limitation of github; AFAIK we cannot disable it.

The official published docs are associated with each release and
available on the apache.org website. I think "/latest" is a common
convention for referring to the latest *published release* docs, so
probably we can't change that (the audience for /latest is orders of
magnitude larger than for snapshot docs). However we could just add
/snapshot and publish docs there.

- Patrick

On Sun, Nov 30, 2014 at 6:15 PM, Patrick Wendell  wrote:
> Hey Ryan,
>
> The existing JIRA also covers publishing nightly docs:
> https://issues.apache.org/jira/browse/SPARK-1517
>
> - Patrick
>
> On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
>  wrote:
>> Thanks Nicholas, glad to hear that some of this info will be pushed to the
>> main site soon, but this brings up yet another point of confusion that I've
>> struggled with, namely whether the documentation on github or that on
>> spark.apache.org should be considered the primary reference for people
>> seeking to learn about best practices for developing Spark.
>>
>> Trying to read docs starting from
>> https://github.com/apache/spark/blob/master/docs/index.md right now, I find
>> that all of the links to other parts of the documentation are broken: they
>> point to relative paths that end in ".html", which will work when published
>> on the docs-site, but that would have to end in ".md" if a person was to be
>> able to navigate them on github.
>>
>> So expecting people to use the up-to-date docs on github (where all
>> internal URLs 404 and the main github README suggests that the "latest
>> Spark documentation" can be found on the actually-months-old docs-site
>> <https://github.com/apache/spark#online-documentation>) is not a good
>> solution. On the other hand, consulting months-old docs on the site is also
>> problematic, as this thread and your last email have borne out.  The result
>> is that there is no good place on the internet to learn about the most
>> up-to-date best practices for using/developing Spark.
>>
>> Why not build http://spark.apache.org/docs/latest/ nightly (or every
>> commit) off of what's in github, rather than having that URL point to the
>> last release's docs (up to ~3 months old)? This way, casual users who want
>> the docs for the released version they happen to be using (which is already
>> frequently != "/latest" today, for many Spark users) can (still) find them
>> at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
>> point people to a site (/latest) that actually has up-to-date docs that
>> reflect ToT and whose links work.
>>
>> If there are concerns about existing semantics around "/latest" URLs being
>> broken, some new URL could be used, like
>> http://spark.apache.org/docs/snapshot/, but given that everything under
>> http://spark.apache.org/docs/latest/ is in a state of
>> planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
>> that serious an issue to me; anyone sending around permanent links to
>> things under /latest is already going to have those links break / not make
>> sense in the near future.
>>
>>
>> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>>
>>>- currently the docs only contain information about building with
>>>maven,
>>>and even then don't cover many important cases
>>>
>>>  All other points aside, I just want to point out that the docs document
>>> both how to use Maven and SBT and clearly state
>>> <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
>>> that Maven is the "build of reference" while SBT may be preferable for
>>> day-to-day development.
>>>
>>> I believe the main reason most people miss this documentation is that,
>>> though it's up-to-date on GitHub, it hasn't been published yet to the docs
>>> site. It should go out with the 1.2 release.
>>>
>>> Improvements to the documentation on building Spark belong here:
>>> https://github.com/apache/spark/blob/master/docs/building-spark.md
>>>
>>> If there are clear recommendations that come out of this thread but are

Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Hi Ilya - you can just submit a pull request; the way we test changes
is to run them through Jenkins. You don't need to do anything special.
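
In other words, something like this (the branch name is just an example):

  git checkout -b my-fix        # example branch name
  # ...make and commit your changes...
  git push origin my-fix
  # then open a pull request against apache/spark on GitHub; Jenkins will
  # run the full test suite and post the results back on the PR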

On Sun, Nov 30, 2014 at 8:57 PM, Ganelin, Ilya
 wrote:
> Hi, Patrick - with regards to testing on Jenkins, is the process for this
> to submit a pull request for the branch or is there another interface we
> can use to submit a build to Jenkins for testing?
>
> On 11/30/14, 6:49 PM, "Patrick Wendell"  wrote:
>
>>Hey Ryan,
>>
>>A few more things here. You should feel free to send patches to
>>Jenkins to test them, since this is the reference environment in which
>>we regularly run tests. This is the normal workflow for most
>>developers and we spend a lot of effort provisioning/maintaining a
>>very large jenkins cluster to allow developers access to this resource. A
>>common development approach is to locally run tests that you've added
>>in a patch, then send it to jenkins for the full run, and then try to
>>debug locally if you see specific unanticipated test failures.
>>
>>One challenge we have is that given the proliferation of OS versions,
>>Java versions, Python versions, ulimits, etc. there is a combinatorial
>>number of environments in which tests could be run. It is very hard in
>>some cases to figure out post-hoc why a given test is not working in a
>>specific environment. I think a good solution here would be to use a
>>standardized docker container for running Spark tests and asking folks
>>to use that locally if they are trying to run all of the hundreds of
>>Spark tests.
>>
>>Another solution would be to mock out every system interaction in
>>Spark's tests including e.g. filesystem interactions to try and reduce
>>variance across environments. However, that seems difficult.
>>
>>As the number of developers of Spark increases, it's definitely a good
>>idea for us to invest in developer infrastructure including things
>>like snapshot releases, better documentation, etc. Thanks for bringing
>>this up as a pain point.
>>
>>- Patrick
>>
>>
>>On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
>> wrote:
>>> thanks for the info, Matei and Brennon. I will try to switch my
>>>workflow to
>>> using sbt. Other potential action items:
>>>
>>> - currently the docs only contain information about building with maven,
>>> and even then don't cover many important cases, as I described in my
>>> previous email. If SBT is as much better as you've described then that
>>> should be made much more obvious. Wasn't it the case recently that there
>>> was only a page about building with SBT, and not one about building with
>>> maven? Clearer messaging around this needs to exist in the
>>>documentation,
>>> not just on the mailing list, imho.
>>>
>>> - +1 to better distinguishing between unit and integration tests, having
>>> separate scripts for each, improving documentation around common
>>>workflows,
>>> expectations of brittleness with each kind of test, advisability of just
>>> relying on Jenkins for certain kinds of tests to not waste too much
>>>time,
>>> etc. Things like the compiler crash should be discussed in the
>>> documentation, not just in the mailing list archives, if new
>>>contributors
>>> are likely to run into them through no fault of their own.
>>>
>>> - What is the algorithm you use to decide what tests you might have
>>>broken?
>>> Can we codify it in some scripts that other people can use?
>>>
>>>
>>>
>>> On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia 
>>> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> As a tip (and maybe this isn't documented well), I normally use SBT for
>>>> development to avoid the slow build process, and use its interactive
>>>> console to run only specific tests. The nice advantage is that SBT can
>>>>keep
>>>> the Scala compiler loaded and JITed across builds, making it faster to
>>>> iterate. To use it, you can do the following:
>>>>
>>>> - Start the SBT interactive console with sbt/sbt
>>>> - Build your assembly by running the "assembly" target in the assembly
>>>> project: assembly/assembly
>>>> - Run all the tests in one module: core/test
>>>> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>>>>(this
>>>> also supports tab completion)
>>>>
>>>> Run

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-01 Thread Patrick Wendell
Hey All,

Just an update. Josh, Andrew, and others are working to reproduce
SPARK-4498 and fix it. Other than that issue no serious regressions
have been reported so far. If we are able to get a fix in for that
soon, we'll likely cut another RC with the patch.

Continued testing of RC1 is definitely appreciated!

I'll leave this vote open to allow folks to continue posting comments.
It's fine to still give "+1" from your own testing... i.e. you can
assume at this point SPARK-4498 will be fixed before releasing.

- Patrick

On Mon, Dec 1, 2014 at 3:30 PM, Matei Zaharia  wrote:
> +0.9 from me. Tested it on Mac and Windows (someone has to do it) and while 
> things work, I noticed a few recent scripts don't have Windows equivalents, 
> namely https://issues.apache.org/jira/browse/SPARK-4683 and 
> https://issues.apache.org/jira/browse/SPARK-4684. The first one at least 
> would be good to fix if we do another RC. Not blocking the release but useful 
> to fix in docs is https://issues.apache.org/jira/browse/SPARK-4685.
>
> Matei
>
>
>> On Dec 1, 2014, at 11:18 AM, Josh Rosen  wrote:
>>
>> Hi everyone,
>>
>> There's an open bug report related to Spark standalone which could be a 
>> potential release-blocker (pending investigation / a bug fix): 
>> https://issues.apache.org/jira/browse/SPARK-4498.  This issue seems 
>> non-deterministc and only affects long-running Spark standalone deployments, 
>> so it may be hard to reproduce.  I'm going to work on a patch to add 
>> additional logging in order to help with debugging.
>>
>> I just wanted to give an early heads-up about this issue and to get more
>> eyes on it in case anyone else has run into it or wants to help with 
>> debugging.
>>
>> - Josh
>>
>> On November 28, 2014 at 9:18:09 PM, Patrick Wendell (pwend...@gmail.com) 
>> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.0!
>>
>> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1048/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.0!
>>
>> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.1.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening very late into the QA period compared with
>> previous votes, so -1 votes should only occur for significant
>> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>> regressions, or bugs related to new features will not block this
>> release.
>>
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.shuffle.blockTransferService" has been
>> changed to "netty"
>> --> Old behavior can be restored by switching to "nio"
>>
>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>> --> Old behavior can be restored by setting "spark.shuffle.manager" to 
>> "hash".
>>
>> == Other notes ==
>> Because this vote is occurring over a weekend, I will likely extend
>> the vote if this RC survives until the end of the vote period.
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: keeping PR titles / descriptions up to date

2014-12-02 Thread Patrick Wendell
Also a note on this for committers - it's possible to re-word the
title during merging, by just running "git commit -a --amend" before
you push the PR.
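
Roughly (the remote name here is illustrative):

  # after merging the PR into your local branch, but before pushing:
  git commit -a --amend      # edit the title/description in your editor
  git push apache master     # illustrative remote name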

- Patrick

On Tue, Dec 2, 2014 at 12:50 PM, Mridul Muralidharan  wrote:
> I second that !
> Would also be great if the JIRA was updated accordingly too.
>
> Regards,
> Mridul
>
>
> On Wed, Dec 3, 2014 at 1:53 AM, Kay Ousterhout  
> wrote:
>> Hi all,
>>
>> I've noticed a bunch of times lately where a pull request changes to be
>> pretty different from the original pull request, and the title /
>> description never get updated.  Because the pull request title and
>> description are used as the commit message, the incorrect description lives
>> on forever, making it harder to understand the reason behind a particular
>> commit without going back and reading the entire conversation on the pull
>> request.  If folks could try to keep these up to date (and committers, try
>> to remember to verify that the title and description are correct before
>> merging pull requests), that would be awesome.
>>
>> -Kay
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spurious test failures, testing best practices

2014-12-02 Thread Patrick Wendell
Hey Ryan,

What if you run a single "mvn install" to install all libraries
locally - then can you "mvn compile -pl core"? I think this may be the
only way to make it work.
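
That is, something along these lines (add whatever profiles/flags you
normally build with):

  mvn -DskipTests install    # installs all modules, incl. network/common and network/shuffle, into ~/.m2
  mvn -pl core compile       # core can now resolve those artifacts from the local repo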

- Patrick

On Tue, Dec 2, 2014 at 2:40 PM, Ryan Williams
 wrote:
> Following on Mark's Maven examples, here is another related issue I'm
> having:
>
> I'd like to compile just the `core` module after a `mvn clean`, without
> building an assembly JAR first. Is this possible?
>
> Attempting to do it myself, the steps I performed were:
>
> - `mvn compile -pl core`: fails because `core` depends on `network/common`
> and `network/shuffle`, neither of which is installed in my local maven
> cache (and which don't exist in central Maven repositories, I guess? I
> thought Spark is publishing snapshot releases?)
>
> - `network/shuffle` also depends on `network/common`, so I'll `mvn install`
> the latter first: `mvn install -DskipTests -pl network/common`. That
> succeeds, and I see a newly built 1.3.0-SNAPSHOT jar in my local maven
> repository.
>
> - However, `mvn install -DskipTests -pl network/shuffle` subsequently
> fails, seemingly due to not finding network/common. Here's
>  a sample full
> output from running `mvn install -X -U -DskipTests -pl network/shuffle`
> from such a state (the -U was to get around a previous failure based on
> having cached a failed lookup of network-common-1.3.0-SNAPSHOT).
>
> - Thinking maven might be special-casing "-SNAPSHOT" versions, I tried
> replacing "1.3.0-SNAPSHOT" with "1.3.0.1" globally and repeating these
> steps, but the error seems to be the same
> .
>
> Any ideas?
>
> Thanks,
>
> -Ryan
>
> On Sun Nov 30 2014 at 6:37:28 PM Mark Hamstra 
> wrote:
>
>> >
>> > - Start the SBT interactive console with sbt/sbt
>> > - Build your assembly by running the "assembly" target in the assembly
>> > project: assembly/assembly
>> > - Run all the tests in one module: core/test
>> > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>> (this
>> > also supports tab completion)
>>
>>
>> The equivalent using Maven:
>>
>> - Start zinc
>> - Build your assembly using the mvn "package" or "install" target
>> ("install" is actually the equivalent of SBT's "publishLocal") -- this step
>> is the first step in
>> http://spark.apache.org/docs/latest/building-with-maven.
>> html#spark-tests-in-maven
>> - Run all the tests in one module: mvn -pl core test
>> - Run a specific suite: mvn -pl core
>> -DwildcardSuites=org.apache.spark.rdd.RDDSuite test (the -pl option isn't
>> strictly necessary if you don't mind waiting for Maven to scan through all
>> the other sub-projects only to do nothing; and, of course, it needs to be
>> something other than "core" if the test you want to run is in another
>> sub-project.)
>>
>> You also typically want to carry along in each subsequent step any relevant
>> command line options you added in the "package"/"install" step.
>>
>> On Sun, Nov 30, 2014 at 3:06 PM, Matei Zaharia 
>> wrote:
>>
>> > Hi Ryan,
>> >
>> > As a tip (and maybe this isn't documented well), I normally use SBT for
>> > development to avoid the slow build process, and use its interactive
>> > console to run only specific tests. The nice advantage is that SBT can
>> keep
>> > the Scala compiler loaded and JITed across builds, making it faster to
>> > iterate. To use it, you can do the following:
>> >
>> > - Start the SBT interactive console with sbt/sbt
>> > - Build your assembly by running the "assembly" target in the assembly
>> > project: assembly/assembly
>> > - Run all the tests in one module: core/test
>> > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>> (this
>> > also supports tab completion)
>> >
>> > Running all the tests does take a while, and I usually just rely on
>> > Jenkins for that once I've run the tests for the things I believed my
>> patch
>> > could break. But this is because some of them are integration tests (e.g.
>> > DistributedSuite, which creates multi-process mini-clusters). Many of the
>> > individual suites run fast without requiring this, however, so you can
>> pick
>> > the ones you want. Perhaps we should find a way to tag them so people
>> can
>> > do a "quick-test" that skips the integration ones.
>> >
>> > The assembly builds are annoying but they only take about a minute for me
>> > on a MacBook Pro with SBT warmed up. The assembly is actually only
>> required
>> > for some of the "integration" tests (which launch new processes), but I'd
>> > recommend doing it all the time anyway since it would be very confusing
>> to
>> > run those with an old assembly. The Scala compiler crash issue can also
>> be
>> > a problem, but I don't see it very often with SBT. If it happens, I exit
>> > SBT and do sbt clean.
>> >
>> > Anyway, this is useful feedback and I think we should try to improve some
>> > of these suites, but hopefully you can also try

Re: Ooyala Spark JobServer

2014-12-04 Thread Patrick Wendell
Hey Jun,

The Ooyala server is being maintained by its original author (Evan Chan)
here:

https://github.com/spark-jobserver/spark-jobserver

This is likely to stay as a standalone project for now, since it builds
directly on Spark's public APIs.

- Patrick

On Wed, Dec 3, 2014 at 9:02 PM, Jun Feng Liu  wrote:

> Hi, I am wondering about the status of the Ooyala Spark Jobserver; is there
> any plan to get it into the Spark release?
>
> Best Regards
>
>
> Jun Feng Liu
> IBM China Systems & Technology Laboratory in Beijing
>


Re: Exception adding resource files in latest Spark

2014-12-04 Thread Patrick Wendell
Thanks for flagging this. I reverted the relevant YARN fix in Spark
1.2 release. We can try to debug this in master.
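
For reference, the mechanics of that kind of revert look roughly like this
(the commit hash is a placeholder and the remote name is illustrative):

  git checkout branch-1.2
  git revert <sha-of-the-yarn-fix>    # placeholder for the offending commit
  git push apache branch-1.2          # illustrative remote name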

On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang  wrote:
> I created a ticket for this:
>
>   https://issues.apache.org/jira/browse/SPARK-4757
>
>
> Jianshi
>
> On Fri, Dec 5, 2014 at 1:31 PM, Jianshi Huang 
> wrote:
>>
>> Correction:
>>
>> According to Liancheng, this hotfix might be the root cause:
>>
>>
>> https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce
>>
>> Jianshi
>>
>> On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang 
>> wrote:
>>>
>>> Looks like the datanucleus*.jar shouldn't appear in the hdfs path in
>>> Yarn-client mode.
>>>
>>> Maybe this patch broke yarn-client.
>>>
>>>
>>> https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53
>>>
>>> Jianshi
>>>
>>> On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang 
>>> wrote:

 Actually my HADOOP_CLASSPATH has already been set to include
 /etc/hadoop/conf/*

 export
 HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase
 classpath)

 Jianshi

 On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang 
 wrote:
>
> Looks like somehow Spark failed to find the core-site.xml in
> /etc/hadoop/conf
>
> I've already set the following env variables:
>
> export YARN_CONF_DIR=/etc/hadoop/conf
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> export HBASE_CONF_DIR=/etc/hbase/conf
>
> Should I put $HADOOP_CONF_DIR/* to HADOOP_CLASSPATH?
>
> Jianshi
>
> On Fri, Dec 5, 2014 at 11:37 AM, Jianshi Huang
>  wrote:
>>
>> I got the following error during Spark startup (Yarn-client mode):
>>
>> 14/12/04 19:33:58 INFO Client: Uploading resource
>> file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar
>> ->
>> hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar
>> java.lang.IllegalArgumentException: Wrong FS:
>> hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar,
>> expected: file:///
>> at
>> org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>> at
>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>> at
>> org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
>> at
>> org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:257)
>> at
>> org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:242)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:242)
>> at
>> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:35)
>> at
>> org.apache.spark.deploy.yarn.ClientBase$class.createContainerLaunchContext(ClientBase.scala:350)
>> at
>> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:35)
>> at
>> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:80)
>> at
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140)
>> at
>> org.apache.spark.SparkContext.(SparkContext.scala:335)
>> at
>> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
>> at $iwC$$iwC.<init>(<console>:9)
>> at $iwC.<init>(<console>:18)
>> at <init>(<console>:20)
>> at .<init>(<console>:24)
>>
>> I'm using latest Spark built from master HEAD yesterday. Is this a
>> bug?
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/
>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Gi

Re: zinc invocation examples

2014-12-05 Thread Patrick Wendell
One thing I created a JIRA for a while back was to have a similar
script to "sbt/sbt" that transparently downloads Zinc, Scala, and
Maven in a subdirectory of Spark and sets it up correctly. I.e.
"build/mvn".

Outside of brew for MacOS there aren't good Zinc packages, and it's a
pain to figure out how to set it up.

https://issues.apache.org/jira/browse/SPARK-4501
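
A very rough sketch of what such a wrapper could look like (the zinc version
and download URL below are placeholders, not real ones):

  #!/usr/bin/env bash
  # hypothetical build/mvn: fetch zinc next to this script if needed,
  # start it, then hand everything off to maven
  ZINC_VERSION=0.3.5.3                                             # placeholder
  ZINC_URL="http://downloads.example.com/zinc-$ZINC_VERSION.tgz"   # placeholder
  BUILD_DIR="$(cd "$(dirname "$0")" && pwd)"
  ZINC_BIN="$BUILD_DIR/zinc-$ZINC_VERSION/bin/zinc"

  if [ ! -x "$ZINC_BIN" ]; then
    curl -L "$ZINC_URL" | tar -xz -C "$BUILD_DIR"
  fi
  "$ZINC_BIN" -start 2>/dev/null || true   # assumed harmless if already running
  exec mvn "$@"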

Prashant Sharma looked at this for a bit but I don't think he's
working on it actively any more, so if someone wanted to do this, I'd
be extremely grateful.

- Patrick

On Fri, Dec 5, 2014 at 11:05 AM, Ryan Williams
 wrote:
> fwiw I've been using `zinc -scala-home $SCALA_HOME -nailed -start` which:
>
> - starts a nailgun server as well,
> - uses my installed scala 2.{10,11}, as opposed to zinc's default 2.9.2
> : "If no options are passed to
> locate a version of Scala then Scala 2.9.2 is used by default (which is
> bundled with zinc)."
>
> The latter seems like it might be especially important.
>
>
> On Thu Dec 04 2014 at 4:25:32 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Oh, derp. I just assumed from looking at all the options that there was
>> something to it. Thanks Sean.
>>
>> On Thu Dec 04 2014 at 7:47:33 AM Sean Owen  wrote:
>>
>> > You just run it once with "zinc -start" and leave it running as a
>> > background process on your build machine. You don't have to do
>> > anything for each build.
>> >
>> > On Wed, Dec 3, 2014 at 3:44 PM, Nicholas Chammas
>> >  wrote:
>> > > https://github.com/apache/spark/blob/master/docs/
>> > building-spark.md#speeding-up-compilation-with-zinc
>> > >
>> > > Could someone summarize how they invoke zinc as part of a regular
>> > > build-test-etc. cycle?
>> > >
>> > > I'll add it in to the aforelinked page if appropriate.
>> > >
>> > > Nick
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-05 Thread Patrick Wendell
Hey All,

Thanks all for the continued testing!

The issue I mentioned earlier SPARK-4498 was fixed earlier this week
(hat tip to Mark Hamstra who contributed to fix).

In the interim a few smaller blocker-level issues with Spark SQL were
found and fixed (SPARK-4753, SPARK-4552, SPARK-4761).

There is currently an outstanding issue (SPARK-4740[1]) in Spark core
that needs to be fixed.

I want to thank in particular Shopify and Intel China who have
identified and helped test blocker issues with the release. This type
of workload testing around releases is really helpful for us.

Once things stabilize I will cut RC2. I think we're pretty close with this one.

- Patrick

On Wed, Dec 3, 2014 at 5:38 PM, Takeshi Yamamuro  wrote:
> +1 (non-binding)
>
> Checked on CentOS 6.5, compiled from the source.
> Ran various examples in stand-alone master and three slaves, and
> browsed the web UI.
>
> On Sat, Nov 29, 2014 at 2:16 PM, Patrick Wendell  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.2.0!
>>
>> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1048/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.0!
>>
>> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.1.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening very late into the QA period compared with
>> previous votes, so -1 votes should only occur for significant
>> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>> regressions, or bugs related to new features will not block this
>> release.
>>
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.shuffle.blockTransferService" has been
>> changed to "netty"
>> --> Old behavior can be restored by switching to "nio"
>>
>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>> --> Old behavior can be restored by setting "spark.shuffle.manager" to
>> "hash".
>>
>> == Other notes ==
>> Because this vote is occurring over a weekend, I will likely extend
>> the vote if this RC survives until the end of the vote period.
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Is this a little bug in BlockTransferMessage ?

2014-12-09 Thread Patrick Wendell
Hey Nick,

Thanks for bringing this up. I believe these Java tests are running in
the sbt build right now; the issue is that this particular bug was
flagged by the triggering of a runtime Java "assert" (not a normal
JUnit test assertion), and those are not enabled in our sbt tests. It
would be good to fix it so that assertions run when we do the sbt
tests; for some reason I think the sbt tests disable them by default.

I think the original issue is fixed now (that Sean found and
reported). It would be good to get assertions running in our tests,
but I'm not sure I'd block the release on it. The normal JUnit
assertions are running correctly.
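
One way to experiment with this locally might be to pass -ea through to the
JVM that the sbt launcher starts; whether that actually reaches the test JVM
depends on how the build runs/forks tests, so treat this purely as a sketch:

  # assumes the sbt/sbt script honors SBT_OPTS; "core/test" is just an example target
  SBT_OPTS="-ea" sbt/sbt "core/test"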

- Patrick

On Tue, Dec 9, 2014 at 3:35 PM, Nicholas Chammas
 wrote:
> OK. That's concerning. Hopefully that's the only bug we'll dig up once we
> run all the Java tests but who knows.
>
> Patrick,
>
> Shouldn't this be a release blocking bug for 1.2 (mostly just because it
> has already been covered by a unit test)? Well, that, as well as any other
> bugs that come up as we run these Java tests.
>
> Nick
>
> On Tue Dec 09 2014 at 6:32:53 PM Sean Owen  wrote:
>
>> I'm not so sure about SBT, but I'm looking at the output now and do
>> not see things like JavaAPISuite being run. I see them compiled. I'm
>> not as sure how to fix that. I think I have a solution for Maven on
>> SPARK-4159.
>>
>> On Tue, Dec 9, 2014 at 11:30 PM, Nicholas Chammas
>>  wrote:
>> > So all this time the tests that Jenkins has been running via Jenkins and
>> SBT
>> > + ScalaTest... those haven't been running any of the Java unit tests?
>> >
>> > SPARK-4159 only mentions Maven as a problem, but I'm wondering how these
>> > tests got through Jenkins OK.
>> >
>> > On Tue Dec 09 2014 at 5:34:22 PM Sean Owen  wrote:
>> >>
>> >> Yep, will do. The test does catch it -- it's just not being executed.
>> >> I think I have a reasonable start on re-enabling surefire + Java tests
>> >> for SPARK-4159.
>> >>
>> >> On Tue, Dec 9, 2014 at 10:30 PM, Aaron Davidson 
>> >> wrote:
>> >> > Oops, that does look like a bug. Strange that the
>> >> > BlockTransferMessageSuite
>> >> > did not catch this. "+1" sounds like the right solution, would you be
>> >> > able
>> >> > to submit a PR?
>> >> >
>> >> > On Tue, Dec 9, 2014 at 1:53 PM, Sean Owen  wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> https://github.com/apache/spark/blob/master/network/
>> shuffle/src/main/java/org/apache/spark/network/shuffle/
>> protocol/BlockTransferMessage.java#L70
>> >> >>
>> >> >> public byte[] toByteArray() {
>> >> >>   ByteBuf buf = Unpooled.buffer(encodedLength());
>> >> >>   buf.writeByte(type().id);
>> >> >>   encode(buf);
>> >> >>   assert buf.writableBytes() == 0 : "Writable bytes remain: " +
>> >> >> buf.writableBytes();
>> >> >>   return buf.array();
>> >> >> }
>> >> >>
>> >> >> Running the Java tests at last might have turned up a little bug
>> here,
>> >> >> but wanted to check. This makes a buffer to hold enough bytes to
>> >> >> encode the message. But it writes 1 byte, plus the message. This
>> makes
>> >> >> the buffer expand, and then does have nonzero capacity afterwards, so
>> >> >> the assert fails.
>> >> >>
>> >> >> So just needs a "+ 1" in the size?
>> >> >
>> >> >
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Build Spark 1.2.0-rc1 encounter exceptions when running HiveContext - Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy

2014-12-10 Thread Patrick Wendell
Hi Andrew,

It looks like somehow you are including jars from the upstream Apache
Hive 0.13 project on your classpath. For Spark 1.2 Hive 0.13 support,
we had to modify Hive to use a different version of Kryo that was
compatible with Spark's Kryo version.

https://github.com/pwendell/hive/commit/5b582f242946312e353cfce92fc3f3fa472aedf3

I would look through the actual classpath and make sure you aren't
including your own hive-exec jar somehow.
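
A quick way to check which jars on the classpath actually contain the class
(the paths here are illustrative):

  # the shaded class should come only from the spark assembly, not from a stock hive-exec jar
  for j in lib/spark-assembly-1.2.0-hadoop2.4.1.jar /opt/hive/lib/hive-exec-*.jar; do
    echo "== $j"
    jar tf "$j" | grep 'objenesis/strategy/InstantiatorStrategy' || echo "   (not present)"
  done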

- Patrick

On Wed, Dec 10, 2014 at 9:48 AM, Andrew Lee  wrote:
> Apologies for the format; somehow it got messed up and the linefeeds were removed. 
> Here's a reformatted version.
> Hi All,
> I tried to include the necessary libraries in SPARK_CLASSPATH in spark-env.sh to 
> include auxiliary JARs and the datanucleus*.jars from Hive; however, when I run 
> HiveContext, it gives me the following error:
>
> Caused by: java.lang.ClassNotFoundException: 
> com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy
>
> I have checked the JARs with jar tf, and it looks like this is already included 
> (shaded) in the assembly JAR (spark-assembly-1.2.0-hadoop2.4.1.jar), which is 
> already on the system classpath. I couldn't figure out what is 
> going on with the shading of the esotericsoftware JARs here. Any help is 
> appreciated.
>
>
> How to reproduce the problem?
> Run the following 3 statements in spark-shell ( This is how I launched my 
> spark-shell. cd /opt/spark; ./bin/spark-shell --master yarn --deploy-mode 
> client --queue research --driver-memory 1024M)
>
> import org.apache.spark.SparkContext
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_table (key INT, 
> value STRING)")
>
>
>
> A reference of my environment.
> Apache Hadoop 2.4.1
> Apache Hive 0.13.1
> Apache Spark branch-1.2 (installed under /opt/spark/, and config under 
> /etc/spark/)
> Maven build command:
>
> mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.4.1 
> -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests install
>
> Source Code commit label: eb4d457a870f7a281dc0267db72715cd00245e82
>
> My spark-env.sh have the following contents when I executed spark-shell:
>> HADOOP_HOME=/opt/hadoop/
>> HIVE_HOME=/opt/hive/
>> HADOOP_CONF_DIR=/etc/hadoop/
>> YARN_CONF_DIR=/etc/hadoop/
>> HIVE_CONF_DIR=/etc/hive/
>> HADOOP_SNAPPY_JAR=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name 
>> "snappy-java-*.jar")
>> HADOOP_LZO_JAR=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name 
>> "hadoop-lzo-*.jar")
>> SPARK_YARN_DIST_FILES=/user/spark/libs/spark-assembly-1.2.0-hadoop2.4.1.jar
>> export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
>> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
>> export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
>> export 
>> SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_SNAPPY_JAR:$HADOOP_LZO_JAR:$HIVE_CONF_DIR:/opt/hive/lib/datanucleus-api-jdo-3.2.6.jar:/opt/hive/lib/datanucleus-core-3.2.10.jar:/opt/hive/lib/datanucleus-rdbms-3.2.9.jar
>
>
>> Here's what I see from my stack trace.
>> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
>> details
>> Hive history 
>> file=/home/hive/log/alti-test-01/hive_job_log_b5db9539-4736-44b3-a601-04fa77cb6730_1220828461.txt
>> java.lang.NoClassDefFoundError: 
>> com/esotericsoftware/shaded/org/objenesis/strategy/InstantiatorStrategy
>>   at 
>> org.apache.hadoop.hive.ql.exec.Utilities.(Utilities.java:925)
>>   at 
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:9718)
>>   at 
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:9712)
>>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:434)
>>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
>>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:975)
>>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1040)
>>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
>>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
>>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
>>   at 
>> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
>>   at 
>> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
>>   at 
>> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
>>   at 
>> org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
>>   at 
>> org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
>>   at 
>> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
>>   at 
>> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
>>   at 
>> org.apache.spark.sql.SchemaRDDLike$class.$init$(S

[RESULT] [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-10 Thread Patrick Wendell
This vote is closed in favor of RC2.

On Fri, Dec 5, 2014 at 2:02 PM, Patrick Wendell  wrote:
> Hey All,
>
> Thanks all for the continued testing!
>
> The issue I mentioned earlier SPARK-4498 was fixed earlier this week
> (hat tip to Mark Hamstra who contributed to fix).
>
> In the interim a few smaller blocker-level issues with Spark SQL were
> found and fixed (SPARK-4753, SPARK-4552, SPARK-4761).
>
> There is currently an outstanding issue (SPARK-4740[1]) in Spark core
> that needs to be fixed.
>
> I want to thank in particular Shopify and Intel China who have
> identified and helped test blocker issues with the release. This type
> of workload testing around releases is really helpful for us.
>
> Once things stabilize I will cut RC2. I think we're pretty close with this 
> one.
>
> - Patrick
>
> On Wed, Dec 3, 2014 at 5:38 PM, Takeshi Yamamuro  
> wrote:
>> +1 (non-binding)
>>
>> Checked on CentOS 6.5, compiled from the source.
>> Ran various examples in stand-alone master and three slaves, and
>> browsed the web UI.
>>
>> On Sat, Nov 29, 2014 at 2:16 PM, Patrick Wendell  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.2.0!
>>>
>>> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1048/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 1.2.0!
>>>
>>> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.1.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> == What justifies a -1 vote for this release? ==
>>> This vote is happening very late into the QA period compared with
>>> previous votes, so -1 votes should only occur for significant
>>> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>>> regressions, or bugs related to new features will not block this
>>> release.
>>>
>>> == What default changes should I be aware of? ==
>>> 1. The default value of "spark.shuffle.blockTransferService" has been
>>> changed to "netty"
>>> --> Old behavior can be restored by switching to "nio"
>>>
>>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>>> --> Old behavior can be restored by setting "spark.shuffle.manager" to
>>> "hash".
>>>
>>> == Other notes ==
>>> Because this vote is occurring over a weekend, I will likely extend
>>> the vote if this RC survives until the end of the vote period.
>>>
>>> - Patrick
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-10 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.0!

The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1055/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/

Please vote on releasing this package as Apache Spark 1.2.0!

The vote is open until Saturday, December 13, at 21:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== What justifies a -1 vote for this release? ==
This vote is happening relatively late into the QA period, so
-1 votes should only occur for significant regressions from
1.0.2. Bugs already present in 1.1.X, minor
regressions, or bugs related to new features will not block this
release.

== What default changes should I be aware of? ==
1. The default value of "spark.shuffle.blockTransferService" has been
changed to "netty"
--> Old behavior can be restored by switching to "nio"

2. The default value of "spark.shuffle.manager" has been changed to "sort".
--> Old behavior can be restored by setting "spark.shuffle.manager" to "hash".
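
For illustration, a minimal sketch of restoring both old defaults
programmatically (assumes you construct your own SparkContext; the same keys
can equally be set in spark-defaults.conf or via --conf):

  import org.apache.spark.{SparkConf, SparkContext}

  // Restore the pre-1.2 shuffle behavior described above.
  val conf = new SparkConf()
    .setAppName("legacy-shuffle-defaults")
    .set("spark.shuffle.blockTransferService", "nio") // pre-1.2 behavior
    .set("spark.shuffle.manager", "hash")             // pre-1.2 behavior
  val sc = new SparkContext(conf)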

== How does this differ from RC1 ==
This has fixes for a handful of issues identified - some of the
notable fixes are:

[Core]
SPARK-4498: Standalone Master can fail to recognize completed/failed
applications

[SQL]
SPARK-4552: Query for empty parquet table in spark sql hive get
IllegalArgumentException
SPARK-4753: Parquet2 does not prune based on OR filters on partition columns
SPARK-4761: With JDBC server, set Kryo as default serializer and
disable reference tracking
SPARK-4785: When called with arguments referring column fields, PMOD throws NPE

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Is Apache JIRA down?

2014-12-10 Thread Patrick Wendell
I believe many apache services are/were down due to an outage.

On Wed, Dec 10, 2014 at 5:24 PM, Nicholas Chammas
 wrote:
> Nevermind, seems to be back up now.
>
> On Wed Dec 10 2014 at 7:46:30 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> For example: https://issues.apache.org/jira/browse/SPARK-3431
>>
>> Where do we report/track issues with JIRA itself being down?
>>
>> Nick
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: zinc invocation examples

2014-12-12 Thread Patrick Wendell
Hey York - I'm sending some feedback off-list, feel free to open a PR as well.


On Tue, Dec 9, 2014 at 12:05 PM, York, Brennon
 wrote:
> Patrick, I've nearly completed a basic build out for the SPARK-4501 issue
> (at https://github.com/brennonyork/spark/tree/SPARK-4501) and it would be
> great to get your initial read on it. Per this thread I need to add in the
> -scala-home call to zinc, but it's close to ready for a PR.
>
> On 12/5/14, 2:10 PM, "Patrick Wendell"  wrote:
>
>>One thing I created a JIRA for a while back was to have a similar
>>script to "sbt/sbt" that transparently downloads Zinc, Scala, and
>>Maven in a subdirectory of Spark and sets it up correctly. I.e.
>>"build/mvn".
>>
>>Outside of brew for MacOS there aren't good Zinc packages, and it's a
>>pain to figure out how to set it up.
>>
>>https://issues.apache.org/jira/browse/SPARK-4501
>>
>>Prashant Sharma looked at this for a bit but I don't think he's
>>working on it actively any more, so if someone wanted to do this, I'd
>>be extremely grateful.
>>
>>- Patrick
>>
>>On Fri, Dec 5, 2014 at 11:05 AM, Ryan Williams
>> wrote:
>>> fwiw I've been using `zinc -scala-home $SCALA_HOME -nailed -start`
>>>which:
>>>
>>> - starts a nailgun server as well,
>>> - uses my installed scala 2.{10,11}, as opposed to zinc's default 2.9.2
>>> <https://github.com/typesafehub/zinc#scala>: "If no options are passed to
>>> locate a version of Scala then Scala 2.9.2 is used by default (which is
>>> bundled with zinc)."
>>>
>>> The latter seems like it might be especially important.
>>>
>>>
>>> On Thu Dec 04 2014 at 4:25:32 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Oh, derp. I just assumed from looking at all the options that there was
>>>> something to it. Thanks Sean.
>>>>
>>>> On Thu Dec 04 2014 at 7:47:33 AM Sean Owen  wrote:
>>>>
>>>> > You just run it once with "zinc -start" and leave it running as a
>>>> > background process on your build machine. You don't have to do
>>>> > anything for each build.
>>>> >
>>>> > On Wed, Dec 3, 2014 at 3:44 PM, Nicholas Chammas
>>>> >  wrote:
>>>> > > https://github.com/apache/spark/blob/master/docs/
>>>> > building-spark.md#speeding-up-compilation-with-zinc
>>>> > >
>>>> > > Could someone summarize how they invoke zinc as part of a regular
>>>> > > build-test-etc. cycle?
>>>> > >
>>>> > > I'll add it in to the aforelinked page if appropriate.
>>>> > >
>>>> > > Nick
>>>> >
>>>>
>>
>>-
>>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Test failures after Jenkins upgrade

2014-12-15 Thread Patrick Wendell
Hey All,

It appears that a single test suite is failing after the jenkins
upgrade: "org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite".
My guess is the suite is not resilient in some way to differences in
the environment (JVM, OS version, or something else).

I'm going to disable the suite to get the build passing. This should
be done in the next 30 minutes or so.
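
For anyone curious, a hedged sketch of one way to disable a whole ScalaTest
suite (the class and package names follow the message above; the actual
hotfix may have been done differently):

  package org.apache.spark.streaming.rdd

  import org.scalatest.{FunSuite, Ignore}

  // @Ignore at the class level skips every test in the suite until the
  // annotation is removed again.
  @Ignore
  class WriteAheadLogBackedBlockRDDSuite extends FunSuite {
    test("placeholder") {
      // existing tests would remain here, just skipped
    }
  }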

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Test failures after Jenkins upgrade

2014-12-15 Thread Patrick Wendell
Ah cool Josh - I think for some reason we are hitting this every time
now. Since this is holding up a bunch of other patches, I just pushed
something ignoring the tests as a hotfix. Even waiting for a couple
hours is really expensive productivity-wise given the frequency with
which we run tests. We should just re-enable them when we merge the
appropriate fix.

On Mon, Dec 15, 2014 at 10:54 AM, Josh Rosen  wrote:
> There's a JIRA for this: https://issues.apache.org/jira/browse/SPARK-4826
>
> And two open PRs:
>
> https://github.com/apache/spark/pull/3695
> https://github.com/apache/spark/pull/3701
>
> We might be close to fixing this via one of those PRs, so maybe we should
> try using one of those instead?
>
> On December 15, 2014 at 10:51:46 AM, Patrick Wendell (pwend...@gmail.com)
> wrote:
>
> Hey All,
>
> It appears that a single test suite is failing after the jenkins
> upgrade: "org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite".
> My guess is the suite is not resilient in some way to differences in
> the environment (JVM, OS version, or something else).
>
> I'm going to disable the suite to get the build passing. This should
> be done in the next 30 minutes or so.
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Governance of the Jenkins whitelist

2014-12-15 Thread Patrick Wendell
Hey Andrew,

The list of admins is maintained by the Amplab as part of their
donation of this infrastructure. The reason why we need to have admins
is that the pull request builder will fetch and then execute arbitrary
user code, so we need to do a security audit before we can approve
testing new patches. Over time when we get to know users we usually
whitelist them so they can test whatever they want.

I can see offline if the Amplab would be open to adding you as an
admin. I think we've added people over time who are very involved in
the community. Just wanted to send this e-mail so people understand
how it works.

- Patrick

On Sat, Dec 13, 2014 at 11:43 PM, Andrew Ash  wrote:
> Jenkins is a really valuable tool for increasing quality of incoming
> patches to Spark, but I've noticed that there are often a lot of patches
> waiting for testing because they haven't been approved for testing.
>
> Certain users can instruct Jenkins to run on a PR, or add other users to a
> whitelist. How does governance work for that list of admins?  Meaning who
> is currently on it, and what are the requirements to be on that list?
>
> Can I be permissioned to allow Jenkins to run on certain PRs?  I've often
> come across well-intentioned PRs that are languishing because Jenkins has
> yet to run on them.
>
> Andrew

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Scala's Jenkins setup looks neat

2014-12-16 Thread Patrick Wendell
Yeah you can do it - just make sure they understand it is a new
feature so we're asking them to revisit it. They looked at it in the
past and they concluded they couldn't give us access without giving us
push access.

- Patrick

On Tue, Dec 16, 2014 at 6:06 PM, Reynold Xin  wrote:
> It's worth trying :)
>
>
> On Tue, Dec 16, 2014 at 6:02 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>>
>> News flash!
>>
>> From the latest version of the GitHub API
>> :
>>
>> Note that the repo:status OAuth scope
>>  grants targeted access to
>> Statuses *without* also granting access to repository code, while the repo
>> scope grants permission to code as well as statuses.
>>
>> As I understand it, ASF Infra has said no in the past to granting access
>> to statuses because it also granted push access.
>>
>> If so, this no longer appears to be the case.
>>
>> 1) Did I understand correctly and 2) should I open a new request with ASF
>> Infra to give us OAuth keys with repo:status access?
>>
>> Nick
>>
>> On Sat Sep 06 2014 at 1:29:53 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> Aww, that's a bummer...
>>>
>>>
>>> On Sat, Sep 6, 2014 at 1:10 PM, Reynold Xin  wrote:
>>>
 that would require github hooks permission and unfortunately asf infra
 wouldn't allow that.

 Maybe they will change their mind one day, but so far we asked about
 this and the answer has been no for security reasons.

 On Saturday, September 6, 2014, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> After reading Erik's email, I found this Scala PR and immediately
> noticed a few cool things:
>
> - Jenkins is hooked directly into GitHub somehow, so you get the "All is
>   well" message in the merge status window, presumably based on the last
>   test status
> - Jenkins is also tagging the PR based on its test status or need for
>   review
> - Jenkins is also tagging the PR for a specific milestone
>
> Do any of these things make sense to add to our setup? Or perhaps
> something
> inspired by these features?
>
> Nick
>

>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: RDD data flow

2014-12-16 Thread Patrick Wendell
> Why is that? Shouldn't all Partitions be Iterators? Clearly I'm missing
> something.

The Partition itself doesn't need to be an iterator - the iterator
comes from the result of compute(partition). The Partition is just an
identifier for that partition, not the data itself. Take a look at the
signature for compute() in the RDD class.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L97
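
A toy sketch of that relationship (these names are made up for illustration
and are not from Spark itself):

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // A Partition is only an identifier (here, just an index).
  class RangePartition(val index: Int) extends Partition

  // compute() is what turns a Partition identifier into an Iterator of data.
  class ToyRangeRDD(sc: SparkContext, numPartitions: Int, perPartition: Int)
    extends RDD[Int](sc, Nil) {

    override protected def getPartitions: Array[Partition] =
      (0 until numPartitions).map(i => new RangePartition(i): Partition).toArray

    override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
      val start = split.index * perPartition
      Iterator.range(start, start + perPartition)
    }
  }

Calling new ToyRangeRDD(sc, 4, 10).collect() would then yield 0 through 39:
the partitions carry no data, only enough information for compute() to
produce it.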

>
> On a related subject, I was thinking of documenting the data flow of RDDs in
> more detail. The code is not hard to follow, but it's nice to have a simple
> picture with the major components and some explanation of the flow.  The
> declaration of Partition is throwing me off.
>
> Thanks!
>
>
>
> -
> --
> Madhu
> https://www.linkedin.com/in/msiddalingaiah
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-tp9804.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-16 Thread Patrick Wendell
This vote has PASSED with 12 +1 votes (8 binding) and no 0 or -1 votes:

+1:
Matei Zaharia*
Madhu Siddalingaiah
Reynold Xin*
Sandy Ryza
Josh Rosen*
Mark Hamstra*
Denny Lee
Tom Graves*
GuiQiang Li
Nick Pentreath*
Sean McNamara*
Patrick Wendell*

0:

-1:

I'll finalize and package this release in the next 48 hours. Thanks to
everyone who contributed.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-16 Thread Patrick Wendell
I'm closing this vote now, will send results in a new thread.

On Sat, Dec 13, 2014 at 12:47 PM, Sean McNamara
 wrote:
> +1 tested on OS X and deployed+tested our apps via YARN into our staging 
> cluster.
>
> Sean
>
>
>> On Dec 11, 2014, at 10:40 AM, Reynold Xin  wrote:
>>
>> +1
>>
>> Tested on OS X.
>>
>> On Wednesday, December 10, 2014, Patrick Wendell  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.2.0!
>>>
>>> The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.0-rc2/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1055/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 1.2.0!
>>>
>>> The vote is open until Saturday, December 13, at 21:00 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> == What justifies a -1 vote for this release? ==
>>> This vote is happening relatively late into the QA period, so
>>> -1 votes should only occur for significant regressions from
>>> 1.0.2. Bugs already present in 1.1.X, minor
>>> regressions, or bugs related to new features will not block this
>>> release.
>>>
>>> == What default changes should I be aware of? ==
>>> 1. The default value of "spark.shuffle.blockTransferService" has been
>>> changed to "netty"
>>> --> Old behavior can be restored by switching to "nio"
>>>
>>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>>> --> Old behavior can be restored by setting "spark.shuffle.manager" to
>>> "hash".
>>>
>>> == How does this differ from RC1 ==
>>> This has fixes for a handful of issues identified - some of the
>>> notable fixes are:
>>>
>>> [Core]
>>> SPARK-4498: Standalone Master can fail to recognize completed/failed
>>> applications
>>>
>>> [SQL]
>>> SPARK-4552: Query for empty parquet table in spark sql hive get
>>> IllegalArgumentException
>>> SPARK-4753: Parquet2 does not prune based on OR filters on partition
>>> columns
>>> SPARK-4761: With JDBC server, set Kryo as default serializer and
>>> disable reference tracking
>>> SPARK-4785: When called with arguments referring column fields, PMOD
>>> throws NPE
>>>
>>> - Patrick
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>>> For additional commands, e-mail: dev-h...@spark.apache.org 
>>>
>>>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[ANNOUNCE] Requiring JIRA for inclusion in release credits

2014-12-16 Thread Patrick Wendell
Hey All,

Due to the very high volume of contributions, we're switching to an
automated process for generating release credits. This process relies
on JIRA for categorizing contributions, so it's not possible for us to
provide credits in the case where users submit pull requests with no
associated JIRA.

This needed to be automated because, with more than 1000 commits per
release, finding proper names for every commit and summarizing
contributions was taking on the order of days of time.

For 1.2.0 there were around 100 commits that did not have JIRA's. I'll
try to manually merge these into the credits, but please e-mail me
directly if you are not credited once the release notes are posted.
The notes should be posted within 48 hours of right now.

We already ask that users include a JIRA for pull requests, but now it
will be required for proper attribution. I've updated the contributing
guide on the wiki to reflect this.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Which committers care about Kafka?

2014-12-18 Thread Patrick Wendell
Hey Cody,

Thanks for reaching out with this. The lead on streaming is TD - he is
traveling this week though so I can respond a bit. To the high level
point of whether Kafka is important - it definitely is. Something like
80% of Spark Streaming deployments (anecdotally) ingest data from
Kafka. Also, good support for Kafka is something we generally want in
Spark and not a library. In some cases IIRC there were user libraries
that used unstable Kafka API's and we were somewhat waiting on Kafka
to stabilize them to merge things upstream. Otherwise users wouldn't
be able to use newer Kafka versions. This is a high level impression
only though, I haven't talked to TD about this recently so it's worth
revisiting given the developments in Kafka.

Please do bring things up like this on the dev list if there are
blockers for your usage - thanks for pinging it.

- Patrick

On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger  wrote:
> Now that 1.2 is finalized...  who are the go-to people to get some
> long-standing Kafka related issues resolved?
>
> The existing api is not sufficiently safe nor flexible for our production
> use.  I don't think we're alone in this viewpoint, because I've seen
> several different patches and libraries to fix the same things we've been
> running into.
>
> Regarding flexibility
>
> https://issues.apache.org/jira/browse/SPARK-3146
>
> has been outstanding since August, and IMHO an equivalent of this is
> absolutely necessary.  We wrote a similar patch ourselves, then found that
> PR and have been running it in production.  We wouldn't be able to get our
> jobs done without it.  It also allows users to solve a whole class of
> problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc).
>
> Regarding safety, I understand the motivation behind WriteAheadLog as a
> general solution for streaming unreliable sources, but Kafka already is a
> reliable source.  I think there's a need for an api that treats it as
> such.  Even aside from the performance issues of duplicating the
> write-ahead log in kafka into another write-ahead log in hdfs, I need
> exactly-once semantics in the face of failure (I've had failures that
> prevented reloading a spark streaming checkpoint, for instance).
>
> I've got an implementation i've been using
>
> https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka
> /src/main/scala/org/apache/spark/rdd/kafka
>
> Tresata has something similar at https://github.com/tresata/spark-kafka,
> and I know there were earlier attempts based on Storm code.
>
> Trying to distribute these kinds of fixes as libraries rather than patches
> to Spark is problematic, because large portions of the implementation are
> private[spark].
>
>  I'd like to help, but i need to know whose attention to get.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [RESULT] [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-18 Thread Patrick Wendell
Update: An Apache infrastructure issue prevented me from pushing this
last night. The issue was resolved today and I should be able to push
the final release artifacts tonight.

On Tue, Dec 16, 2014 at 9:20 PM, Patrick Wendell  wrote:
> This vote has PASSED with 12 +1 votes (8 binding) and no 0 or -1 votes:
>
> +1:
> Matei Zaharia*
> Madhu Siddalingaiah
> Reynold Xin*
> Sandy Ryza
> Josh Rosen*
> Mark Hamstra*
> Denny Lee
> Tom Graves*
> GuiQiang Li
> Nick Pentreath*
> Sean McNamara*
> Patrick Wendell*
>
> 0:
>
> -1:
>
> I'll finalize and package this release in the next 48 hours. Thanks to
> everyone who contributed.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
the third release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!

This release brings operational and performance improvements in Spark
core including a new network transport subsystem designed for very
large shuffles. Spark SQL introduces an API for external data sources
along with Hive 13 support, dynamic partitioning, and the
fixed-precision decimal type. MLlib adds a new pipeline-oriented
package (spark.ml) for composing multiple algorithms. Spark Streaming
adds a Python API and a write ahead log for fault tolerance. Finally,
GraphX has graduated from alpha and introduces a stable API along with
performance improvements.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone involved in creating, testing, and documenting this release!

[1] http://spark.apache.org/releases/spark-release-1-2-0.html
[2] http://spark.apache.org/downloads.html

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
Thanks for pointing out the tag issue. I've updated all links to point
to the correct tag (from the vote thread):

a428c446e23e628b746e0626cc02b7b3cadf588e

On Fri, Dec 19, 2014 at 1:55 AM, Sean Owen  wrote:
> Tag 1.2.0 is older than 1.2.0-rc2. I wonder if it just didn't get
> updated. I assume it's going to be 1.2.0-rc2 plus a few commits
> related to the release process.
>
> On Fri, Dec 19, 2014 at 9:50 AM, Shixiong Zhu  wrote:
>> Congrats!
>>
>> A little question about this release: Which commit is this release based on?
>> v1.2.0 and v1.2.0-rc2 are pointed to different commits in
>> https://github.com/apache/spark/releases
>>
>> Best Regards,
>>
>> Shixiong Zhu
>>
>> 2014-12-19 16:52 GMT+08:00 Patrick Wendell :
>>>
>>> I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
>>> the third release on the API-compatible 1.X line. It is Spark's
>>> largest release ever, with contributions from 172 developers and more
>>> than 1,000 commits!
>>>
>>> This release brings operational and performance improvements in Spark
>>> core including a new network transport subsystem designed for very
>>> large shuffles. Spark SQL introduces an API for external data sources
>>> along with Hive 13 support, dynamic partitioning, and the
>>> fixed-precision decimal type. MLlib adds a new pipeline-oriented
>>> package (spark.ml) for composing multiple algorithms. Spark Streaming
>>> adds a Python API and a write ahead log for fault tolerance. Finally,
>>> GraphX has graduated from alpha and introduces a stable API along with
>>> performance improvements.
>>>
>>> Visit the release notes [1] to read about the new features, or
>>> download [2] the release today.
>>>
>>> For errata in the contributions or release notes, please e-mail me
>>> *directly* (not on-list).
>>>
>>> Thanks to everyone involved in creating, testing, and documenting this
>>> release!
>>>
>>> [1] http://spark.apache.org/releases/spark-release-1-2-0.html
>>> [2] http://spark.apache.org/downloads.html
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Use mvn to build Spark 1.2.0 failed

2014-12-22 Thread Patrick Wendell
I also couldn't reproduce this issue.

On Mon, Dec 22, 2014 at 2:24 AM, Sean Owen  wrote:
> I just tried the exact same command and do not see any error. Maybe
> you can make sure you're starting from a clean extraction of the
> distro, and check your environment. I'm on OSX, Maven 3.2, Java 8 but
> I don't know that any of those would be relevant.
>
> On Mon, Dec 22, 2014 at 4:10 AM, wyphao.2007  wrote:
>> Hi all, Today download Spark source from 
>> http://spark.apache.org/downloads.html page, and I use
>>
>>
>>  ./make-distribution.sh --tgz -Phadoop-2.2 -Pyarn -DskipTests 
>> -Dhadoop.version=2.2.0 -Phive
>>
>>
>> to build the release, but I encountered an exception as follow:
>>
>>
>> [INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @ 
>> spark-parent ---
>> [INFO] Source directory: /home/q/spark/spark-1.2.0/src/main/scala added.
>> [INFO]
>> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
>> spark-parent ---
>> [INFO] 
>> 
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Spark Project Parent POM .. FAILURE [1.015s]
>> [INFO] Spark Project Networking .. SKIPPED
>> [INFO] Spark Project Shuffle Streaming Service ... SKIPPED
>> [INFO] Spark Project Core  SKIPPED
>> [INFO] Spark Project Bagel ... SKIPPED
>> [INFO] Spark Project GraphX .. SKIPPED
>> [INFO] Spark Project Streaming ... SKIPPED
>> [INFO] Spark Project Catalyst  SKIPPED
>> [INFO] Spark Project SQL . SKIPPED
>> [INFO] Spark Project ML Library .. SKIPPED
>> [INFO] Spark Project Tools ... SKIPPED
>> [INFO] Spark Project Hive  SKIPPED
>> [INFO] Spark Project REPL  SKIPPED
>> [INFO] Spark Project YARN Parent POM . SKIPPED
>> [INFO] Spark Project YARN Stable API . SKIPPED
>> [INFO] Spark Project Assembly  SKIPPED
>> [INFO] Spark Project External Twitter  SKIPPED
>> [INFO] Spark Project External Flume Sink . SKIPPED
>> [INFO] Spark Project External Flume .. SKIPPED
>> [INFO] Spark Project External MQTT ... SKIPPED
>> [INFO] Spark Project External ZeroMQ . SKIPPED
>> [INFO] Spark Project External Kafka .. SKIPPED
>> [INFO] Spark Project Examples  SKIPPED
>> [INFO] Spark Project YARN Shuffle Service  SKIPPED
>> [INFO] 
>> 
>> [INFO] BUILD FAILURE
>> [INFO] 
>> 
>> [INFO] Total time: 1.644s
>> [INFO] Finished at: Mon Dec 22 10:56:35 CST 2014
>> [INFO] Final Memory: 21M/481M
>> [INFO] 
>> 
>> [ERROR] Failed to execute goal 
>> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
>> on project spark-parent: Error finding remote resources manifests: 
>> /home/q/spark/spark-1.2.0/target/maven-shared-archive-resources/META-INF/NOTICE
>>  (No such file or directory) -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
>> switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions, please 
>> read the following articles:
>> [ERROR] [Help 1] 
>> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>>
>>
>> but the NOTICE file is in the download spark release:
>>
>>
>> [wyp@spark  /home/q/spark/spark-1.2.0]$ ll
>> total 248
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 assembly
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 bagel
>> drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 bin
>> drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 conf
>> -rw-rw-r-- 1 1000 1000   663 Dec 10 18:02 CONTRIBUTING.md
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 core
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 data
>> drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 dev
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 docker
>> drwxrwxr-x 7 1000 1000  4096 Dec 10 18:02 docs
>> drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 ec2
>> drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 examples
>> drwxrwxr-x 8 1000 1000  4096 Dec 10 18:02 external
>> drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 extras
>> drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 graphx
>> -rw-rw-r-- 1 1000 1000 45242 Dec 10 18:02 LICENSE
>> -rwxrwxr-x 1 1000 1000  7941 Dec 10 18:02 make-distribution.sh
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 mllib
>> drwxrwxr-x 5 1000 1000  4096 D

Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Xiangrui asked me to report that it's back and running :)

On Mon, Dec 22, 2014 at 3:21 PM, peng  wrote:
> Me 2 :)
>
>
> On 12/22/2014 06:14 PM, Andrew Ash wrote:
>
> Hi Xiangrui,
>
> That link is currently returning a 503 Over Quota error message.  Would you
> mind pinging back out when the page is back up?
>
> Thanks!
> Andrew
>
> On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng  wrote:
>>
>> Dear Spark users and developers,
>>
>> I'm happy to announce Spark Packages (http://spark-packages.org), a
>> community package index to track the growing number of open source
>> packages and libraries that work with Apache Spark. Spark Packages
>> makes it easy for users to find, discuss, rate, and install packages
>> for any version of Spark, and makes it easy for developers to
>> contribute packages.
>>
>> Spark Packages will feature integrations with various data sources,
>> management tools, higher level domain-specific libraries, machine
>> learning algorithms, code samples, and other Spark content. Thanks to
>> the package authors, the initial listing of packages includes
>> scientific computing libraries, a job execution server, a connector
>> for importing Avro data, tools for launching Spark on Google Compute
>> Engine, and many others.
>>
>> I'd like to invite you to contribute and use Spark Packages and
>> provide feedback! As a disclaimer: Spark Packages is a community index
>> maintained by Databricks and (by design) will include packages outside
>> of the ASF Spark project. We are excited to help showcase and support
>> all of the great work going on in the broader Spark community!
>>
>> Cheers,
>> Xiangrui
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: More general submitJob API

2014-12-22 Thread Patrick Wendell
A SparkContext is thread safe, so you can just have different threads
that create their own RDD's and do actions, etc.

- Patrick
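
For illustration, a minimal sketch of that pattern (assumes local mode just
for the example):

  import org.apache.spark.{SparkConf, SparkContext}

  object ConcurrentJobs {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("concurrent-jobs").setMaster("local[4]"))

      // Two threads sharing one SparkContext, each submitting its own job.
      val threads = (1 to 2).map { i =>
        new Thread {
          override def run(): Unit = {
            val result = sc.parallelize(1 to 1000000).map(_ * i).sum()
            println(s"job $i result: $result")
          }
        }
      }
      threads.foreach(_.start())
      threads.foreach(_.join())
      sc.stop()
    }
  }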

On Mon, Dec 22, 2014 at 4:15 PM, Alessandro Baretta
 wrote:
> Andrew,
>
> Thanks, yes, this is what I wanted: basically just to start multiple jobs
> concurrently in threads.
>
> Alex
>
> On Mon, Dec 22, 2014 at 4:04 PM, Andrew Ash  wrote:
>>
>> Hi Alex,
>>
>> SparkContext.submitJob() is marked as experimental -- most client programs
>> shouldn't be using it.  What are you looking to do?
>>
>> For multiplexing jobs, one thing you can do is have multiple threads in
>> your client JVM each submit jobs on your SparkContext job.  This is
>> described here in the docs:
>> http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
>>
>> Andrew
>>
>> On Mon, Dec 22, 2014 at 1:32 PM, Alessandro Baretta > > wrote:
>>
>>> Fellow Sparkers,
>>>
>>> I'm rather puzzled at the submitJob API. I can't quite figure out how it
>>> is
>>> supposed to be used. Is there any more documentation about it?
>>>
>>> Also, is there any simpler way to multiplex jobs on the cluster, such as
>>> starting multiple computations in as many threads in the driver and
>>> reaping
>>> all the results when they are available?
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Hey Nick,

I think Hitesh was just trying to be helpful and point out the policy
- not necessarily saying there was an issue. We've taken a close look
at this and I think we're in good shape here vis-a-vis this policy.

- Patrick

On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas
 wrote:
> Hitesh,
>
> From your link:
>
> You may not use ASF trademarks such as "Apache" or "ApacheFoo" or "Foo" in
> your own domain names if that use would be likely to confuse a relevant
> consumer about the source of software or services provided through your
> website, without written approval of the VP, Apache Brand Management or
> designee.
>
> The title on the packages website is "A community index of packages for
> Apache Spark." Furthermore, the footnote of the website reads "Spark
> Packages is a community site hosting modules that are not part of Apache
> Spark."
>
> I think there's nothing on there that would "confuse a relevant consumer
> about the source of software". It's pretty clear that the Spark Packages
> name is well within the ASF's guidelines.
>
> Have I misunderstood the ASF's policy?
>
> Nick
>
>
> On Mon Dec 22 2014 at 6:40:10 PM Hitesh Shah  wrote:
>>
>> Hello Xiangrui,
>>
>> If you have not already done so, you should look at
>> http://www.apache.org/foundation/marks/#domains for the policy on use of ASF
>> trademarked terms in domain names.
>>
>> thanks
>> -- Hitesh
>>
>> On Dec 22, 2014, at 12:37 PM, Xiangrui Meng  wrote:
>>
>> > Dear Spark users and developers,
>> >
>> > I'm happy to announce Spark Packages (http://spark-packages.org), a
>> > community package index to track the growing number of open source
>> > packages and libraries that work with Apache Spark. Spark Packages
>> > makes it easy for users to find, discuss, rate, and install packages
>> > for any version of Spark, and makes it easy for developers to
>> > contribute packages.
>> >
>> > Spark Packages will feature integrations with various data sources,
>> > management tools, higher level domain-specific libraries, machine
>> > learning algorithms, code samples, and other Spark content. Thanks to
>> > the package authors, the initial listing of packages includes
>> > scientific computing libraries, a job execution server, a connector
>> > for importing Avro data, tools for launching Spark on Google Compute
>> > Engine, and many others.
>> >
>> > I'd like to invite you to contribute and use Spark Packages and
>> > provide feedback! As a disclaimer: Spark Packages is a community index
>> > maintained by Databricks and (by design) will include packages outside
>> > of the ASF Spark project. We are excited to help showcase and support
>> > all of the great work going on in the broader Spark community!
>> >
>> > Cheers,
>> > Xiangrui
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits

2014-12-22 Thread Patrick Wendell
Hey Josh,

We don't explicitly track contributions to spark-ec2 in the Apache
Spark release notes. The main reason is that usually updates to
spark-ec2 include a corresponding update to spark so we get it there.
This may not always be the case though, so let me know if you think
there is something missing we should add.

- Patrick

On Mon, Dec 22, 2014 at 6:17 PM, Nicholas Chammas
 wrote:
> Does this include contributions made against the spark-ec2 repo?
>
> On Wed Dec 17 2014 at 12:29:19 AM Patrick Wendell 
> wrote:
>>
>> Hey All,
>>
>> Due to the very high volume of contributions, we're switching to an
>> automated process for generating release credits. This process relies
>> on JIRA for categorizing contributions, so it's not possible for us to
>> provide credits in the case where users submit pull requests with no
>> associated JIRA.
>>
>> This needed to be automated because, with more than 1000 commits per
>> release, finding proper names for every commit and summarizing
>> contributions was taking on the order of days of time.
>>
>> For 1.2.0 there were around 100 commits that did not have JIRA's. I'll
>> try to manually merge these into the credits, but please e-mail me
>> directly if you are not credited once the release notes are posted.
>> The notes should be posted within 48 hours of right now.
>>
>> We already ask that users include a JIRA for pull requests, but now it
>> will be required for proper attribution. I've updated the contributing
>> guide on the wiki to reflect this.
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits

2014-12-22 Thread Patrick Wendell
s/Josh/Nick/ - sorry!

On Mon, Dec 22, 2014 at 10:52 PM, Patrick Wendell  wrote:
> Hey Josh,
>
> We don't explicitly track contributions to spark-ec2 in the Apache
> Spark release notes. The main reason is that usually updates to
> spark-ec2 include a corresponding update to spark so we get it there.
> This may not always be the case though, so let me know if you think
> there is something missing we should add.
>
> - Patrick
>
> On Mon, Dec 22, 2014 at 6:17 PM, Nicholas Chammas
>  wrote:
>> Does this include contributions made against the spark-ec2 repo?
>>
>> On Wed Dec 17 2014 at 12:29:19 AM Patrick Wendell 
>> wrote:
>>>
>>> Hey All,
>>>
>>> Due to the very high volume of contributions, we're switching to an
>>> automated process for generating release credits. This process relies
>>> on JIRA for categorizing contributions, so it's not possible for us to
>>> provide credits in the case where users submit pull requests with no
>>> associated JIRA.
>>>
>>> This needed to be automated because, with more than 1000 commits per
>>> release, finding proper names for every commit and summarizing
>>> contributions was taking on the order of days of time.
>>>
>>> For 1.2.0 there were around 100 commits that did not have JIRA's. I'll
>>> try to manually merge these into the credits, but please e-mail me
>>> directly if you are not credited once the release notes are posted.
>>> The notes should be posted within 48 hours of right now.
>>>
>>> We already ask that users include a JIRA for pull requests, but now it
>>> will be required for proper attribution. I've updated the contributing
>>> guide on the wiki to reflect this.
>>>
>>> - Patrick
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Problems with large dataset using collect() and broadcast()

2014-12-24 Thread Patrick Wendell
Hi Will,

When you call collect() the item you are collecting needs to fit in
memory on the driver. Is it possible your driver program does not have
enough memory?

- Patrick

On Wed, Dec 24, 2014 at 9:34 PM, Will Yang  wrote:
> Hi all,
> In my case, I have a huge HashMap[(Int, Long), (Double, Double,
> Double)], say several GB to tens of GB. After each iteration, I need to
> collect() this HashMap and perform some calculation, and then broadcast()
> it to every node. Now I have 20GB for each executor, and after it
> performs collect(), it gets stuck at "Added rdd_xx_xx", with no further
> response shown on the Application UI.
>
> I've tried to lower the spark.shuffle.memoryFraction and
> spark.storage.memoryFraction settings, but it seems they can only handle a
> HashMap of up to about 2GB. What should I optimize for such conditions?
>
> (ps: sorry for my bad English & Grammar)
>
>
> Thanks

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?

http://spark.apache.org/docs/latest/configuration.html
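
For illustration, a minimal sketch of setting it programmatically (the key
can equally go in spark-defaults.conf; the output path is a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("overwrite-example")
    // Skips Hadoop's check for pre-existing output; existing part files in
    // the target directory may be overwritten or left mixed with new ones.
    .set("spark.hadoop.validateOutputSpecs", "false")
  val sc = new SparkContext(conf)
  sc.parallelize(1 to 10).saveAsTextFile("hdfs:///tmp/output")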

- Patrick

On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai  wrote:
> Hi,
>
>
>
> We have such requirements to save RDD output to HDFS with saveAsTextFile
> like API, but need to overwrite the data if existed. I'm not sure if current
> Spark support such kind of operations, or I need to check this manually?
>
>
>
> There's a thread in mailing list discussed about this
> (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
> I'm not sure this feature is enabled or not, or with some configurations?
>
>
>
> Appreciate your suggestions.
>
>
>
> Thanks a lot
>
> Jerry

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
So the behavior of overwriting existing directories IMO is something
we don't want to encourage. The reason why the Hadoop client has these
checks is that it's very easy for users to do unsafe things without
them. For instance, a user could overwrite an RDD that had 100
partitions with an RDD that has 10 partitions... and if they read back
the RDD they would get a corrupted RDD that has a combination of data
from the old and new RDD.

If users want to circumvent these safety checks, we need to make them
explicitly disable them. Given this, I think a config option is as
reasonable as any alternatives. This is already pretty easy IMO.
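
One hedged alternative that keeps the check enabled globally is to delete the
target explicitly, so the intent to overwrite is visible in user code
(assumes an existing SparkContext named sc, an RDD named rdd, and a
placeholder output path):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val output = new Path("hdfs:///tmp/output")
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(output)) {
    fs.delete(output, true) // recursively remove the old output first
  }
  rdd.saveAsTextFile(output.toString)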

- Patrick

On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao  wrote:
> I am wondering if we can provide more friendly API, other than configuration 
> for this purpose. What do you think Patrick?
>
> Cheng Hao
>
> -Original Message-
> From: Patrick Wendell [mailto:pwend...@gmail.com]
> Sent: Thursday, December 25, 2014 3:22 PM
> To: Shao, Saisai
> Cc: u...@spark.apache.org; dev@spark.apache.org
> Subject: Re: Question on saveAsTextFile with overwrite option
>
> Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?
>
> http://spark.apache.org/docs/latest/configuration.html
>
> - Patrick
>
> On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai  wrote:
>> Hi,
>>
>>
>>
>> We have such requirements to save RDD output to HDFS with
>> saveAsTextFile like API, but need to overwrite the data if existed.
>> I'm not sure if current Spark support such kind of operations, or I need to 
>> check this manually?
>>
>>
>>
>> There's a thread in mailing list discussed about this
>> (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp
>> ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
>> I'm not sure this feature is enabled or not, or with some configurations?
>>
>>
>>
>> Appreciate your suggestions.
>>
>>
>>
>> Thanks a lot
>>
>> Jerry
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
> commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



ANNOUNCE: New build script ./build/mvn

2014-12-27 Thread Patrick Wendell
Hi All,

A consistent piece of feedback from Spark developers has been that the
Maven build is very slow. Typesafe provides a tool called Zinc which
improves Scala compilation speed substantially with Maven, but is
difficult to install and configure, especially for platforms other
than Mac OS.

I've just merged a patch (authored by Brennon York) that provides an
automatically configured Maven instance with Zinc embedded in Spark.
E.g.:

./build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.3 package

It is hard to test changes like this across all environments, so
please give this a spin and report any issues on the Spark JIRA. It is
working correctly if you see the following message during compilation:

[INFO] Using zinc server for incremental compilation

Note that developers preferring their own Maven installation are
unaffected by this and can just ignore this new feature.

Cheers,
- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark driver main thread hanging after SQL insert

2015-01-02 Thread Patrick Wendell
Hi Alessandro,

Can you create a JIRA for this rather than reporting it on the dev
list? That's where we track issues like this. Thanks!

- Patrick

On Wed, Dec 31, 2014 at 8:48 PM, Alessandro Baretta
 wrote:
> Here's what the console shows:
>
> 15/01/01 01:12:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 58.0,
> whose tasks have all completed, from pool
> 15/01/01 01:12:29 INFO scheduler.DAGScheduler: Stage 58 (runJob at
> ParquetTableOperations.scala:326) finished in 5493.549 s
> 15/01/01 01:12:29 INFO scheduler.DAGScheduler: Job 41 finished: runJob at
> ParquetTableOperations.scala:326, took 5493.747061 s
>
> It is now 01:40:03, so the driver has been hanging for the last 28 minutes.
> The web UI on the other hand shows that all tasks completed successfully,
> and the output directory has been populated--although the _SUCCESS file is
> missing.
>
> It is worth noting that my code started this job as its own thread. The
> actual code looks like the following snippet, modulo some simplifications.
>
>   def save_to_parquet(allowExisting: Boolean = false) = {
>     val threads = tables.map(table => {
>       val thread = new Thread {
>         override def run {
>           table.insertInto(table.table_name)
>         }
>       }
>       thread.start
>       thread
>     })
>     threads.foreach(_.join)
>   }
>
> As far as I can see the insertInto call never returns. Any idea why?
>
> Alex

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark UI history job duration is wrong

2015-01-05 Thread Patrick Wendell
Thanks for reporting this - it definitely sounds like a bug. Please
open a JIRA for it. My guess is that we define the start or end time
of the job based on the current time instead of looking at data
encoded in the underlying event stream. That would cause it to not
work properly when loading from historical data.

- Patrick

On Mon, Jan 5, 2015 at 12:25 PM, Olivier Toupin
 wrote:
> Hello,
>
> I'm using Spark 1.2.0. When running an application, if I go into the UI
> and then into the jobs tab ("/jobs/"), the job durations are correct and the
> posted durations look OK.
>
> However, when I open the history ("history/app-/jobs/") for that application,
> the durations are wrong, showing milliseconds instead of the actual job
> time. The submitted time for each job (except maybe the first) is also
> different.
>
> The stages tab is unaffected and shows the correct duration for each stage in
> both modes.
>
> Should I open a bug?
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-UI-history-job-duration-is-wrong-tp10010.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Hang on Executor classloader lookup for the remote REPL URL classloader

2015-01-07 Thread Patrick Wendell
Hey Andrew,

So the executors in Spark will fetch classes from the driver node for
classes defined in the repl from an HTTP server on the driver. Is this
happening in the context of a repl session? Also, is it deterministic
or does it happen only periodically?

The reason all of the other threads are hanging is that there is a
global lock around classloading, so they all queue up.

Could you attach the full stack trace from the driver? Is it possible
that something in the network is blocking the transfer of bytes
between these two processes? Based on the stack trace it looks like it
sent an HTTP request and is waiting on the result back from the
driver.

One thing to check is to verify that the TCP connection between them
used for the repl class server is still alive from the vantage point
of both the executor and driver nodes. Another thing to try would be
to temporarily open up any firewalls that are on the nodes or in the
network and see if this makes the problem go away (to isolate it to an
exogenous-to-Spark network issue).
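
For the connectivity check, a hedged diagnostic sketch (the host, port, and
class path below are placeholders standing in for the driver's repl class
server address):

  import java.net.URL

  val url = new URL("http://driver-host:50123/org/example/SomeClass.class")
  val conn = url.openConnection()
  conn.setConnectTimeout(5000) // fail fast instead of hanging indefinitely
  conn.setReadTimeout(5000)
  val in = conn.getInputStream()
  println("first byte: " + in.read())
  in.close()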

- Patrick

On Wed, Aug 20, 2014 at 11:35 PM, Andrew Ash  wrote:
> Hi Spark devs,
>
> I'm seeing a stacktrace where the classloader that reads from the REPL is
> hung, and blocking all progress on that executor.  Below is that hung
> thread's stacktrace, and also the stacktrace of another hung thread.
>
> I thought maybe there was an issue with the REPL's JVM on the other side,
> but didn't see anything useful in that stacktrace either.
>
> Any ideas what I should be looking for?
>
> Thanks!
> Andrew
>
>
> "Executor task launch worker-0" daemon prio=10 tid=0x7f780c208000
> nid=0x6ae9 runnable [0x7f78c2eeb000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0x7f7e13ea9560> (a java.io.BufferedInputStream)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
> - locked <0x7f7e13e9eeb0> (a
> sun.net.www.protocol.http.HttpURLConnection)
> at java.net.URL.openStream(URL.java:1037)
> at
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:86)
> at
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:63)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> - locked <0x7f7fc9018980> (a
> org.apache.spark.repl.ExecutorClassLoader)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at org.apache.avro.util.ClassUtils.forName(ClassUtils.java:102)
> at org.apache.avro.util.ClassUtils.forName(ClassUtils.java:82)
> at
> org.apache.avro.specific.SpecificData.getClass(SpecificData.java:132)
> at
> org.apache.avro.specific.SpecificDatumReader.setSchema(SpecificDatumReader.java:69)
> at
> org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:126)
> at
> org.apache.avro.file.DataFileReader.(DataFileReader.java:97)
> at
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:59)
> at
> org.apache.avro.mapred.AvroRecordReader.(AvroRecordReader.java:41)
> at
> org.apache.avro.mapred.AvroInputFormat.getRecordReader(AvroInputFormat.java:71)
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:193)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:184)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
>
> And the other threads are stuck on the Class.forName0() method too:
>
> "Executor task launch worker-4" daemon prio=10 tid=0x7f780c20f000
> nid=0x6aed waiting for monitor entry [0x7f78c2ae8000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
>

Re: When will spark support "push" style shuffle?

2015-01-07 Thread Patrick Wendell
This question is conflating a few different concepts. I think the main
question is whether Spark will have a shuffle implementation that
streams data rather than persisting it to disk/cache as a buffer.
Spark currently decouples the shuffle write from the read using
disk/OS cache as a buffer. The two benefits of this approach this are
that it allows intra-query fault tolerance and it makes it easier to
elastically scale and reschedule work within a job. We consider these
to be design requirements (think about jobs that run for several hours
on hundreds of machines). Impala, and similar systems like Dremel and
F1, do not offer fault tolerance within a query at present. They also
require gang scheduling the entire set of resources that will exist
for the duration of a query.

A secondary question is whether our shuffle should have a barrier or
not. Spark's shuffle currently has a hard barrier between map and
reduce stages. We haven't seen really strong evidence that removing
the barrier is a net win. It can help the performance of a single job
(modestly), but in a multi-tenant workload, it leads to poor
utilization since you have a lot of reduce tasks that are taking up
slots waiting for mappers to finish. Many large scale users of
Map/Reduce disable this feature in production clusters for that
reason. Thus, we haven't seen compelling evidence for removing the
barrier at this point, given the complexity of doing so.

It is possible that future versions of Spark will support push-based
shuffles, potentially in a mode that removes some of Spark's fault
tolerance properties. But there are many other things we can still
optimize about the shuffle that would likely come before this.

- Patrick

On Wed, Jan 7, 2015 at 6:01 PM, 曹雪林  wrote:
> Hi,
>
>   I've heard a lot of complaints about Spark's "pull"-style shuffle. Is
> there any plan to support a "push"-style shuffle in the near future?
>
>   Currently, the shuffle phase must be completed before the next stage
> starts, whereas in Impala, it is said, the shuffled data is "streamed" to
> the next stage's handler, which saves a great deal of time. Will Spark
> support this mechanism one day?
>
> Thanks

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark development with IntelliJ

2015-01-08 Thread Patrick Wendell
Nick - yes. Do you mind moving it? I should have put it in the
"Contributing to Spark" page.

On Thu, Jan 8, 2015 at 3:22 PM, Nicholas Chammas
 wrote:
> Side question: Should this section
> 
> in
> the wiki link to Useful Developer Tools
> ?
>
> On Thu Jan 08 2015 at 6:19:55 PM Sean Owen  wrote:
>
>> I remember seeing this too, but it seemed to be transient. Try
>> compiling again. In my case I recall that IJ was still reimporting
>> some modules when I tried to build. I don't see this error in general.
>>
>> On Thu, Jan 8, 2015 at 10:38 PM, Bill Bejeck  wrote:
>> > I was having the same issue and that helped.  But now I get the following
>> > compilation error when trying to run a test from within Intellij (v 14)
>> >
>> > /Users/bbejeck/dev/github_clones/bbejeck-spark/sql/
>> catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
>> > Error:(308, 109) polymorphic expression cannot be instantiated to
>> expected
>> > type;
>> >  found   : [T(in method
>> > apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method
>> apply)]
>> >  required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(
>> in
>> > method functionToUdfBuilder)]
>> >   implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]):
>> > ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
>> >
>> > Any thoughts?
>> >
>> > ^
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark development with IntelliJ

2015-01-08 Thread Patrick Wendell
Actually I went ahead and did it.

On Thu, Jan 8, 2015 at 10:25 PM, Patrick Wendell  wrote:
> Nick - yes. Do you mind moving it? I should have put it in the
> "Contributing to Spark" page.
>
> On Thu, Jan 8, 2015 at 3:22 PM, Nicholas Chammas
>  wrote:
>> Side question: Should this section
>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup>
>> in
>> the wiki link to Useful Developer Tools
>> <https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools>?
>>
>> On Thu Jan 08 2015 at 6:19:55 PM Sean Owen  wrote:
>>
>>> I remember seeing this too, but it seemed to be transient. Try
>>> compiling again. In my case I recall that IJ was still reimporting
>>> some modules when I tried to build. I don't see this error in general.
>>>
>>> On Thu, Jan 8, 2015 at 10:38 PM, Bill Bejeck  wrote:
>>> > I was having the same issue and that helped.  But now I get the following
>>> > compilation error when trying to run a test from within Intellij (v 14)
>>> >
>>> > /Users/bbejeck/dev/github_clones/bbejeck-spark/sql/
>>> catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
>>> > Error:(308, 109) polymorphic expression cannot be instantiated to
>>> expected
>>> > type;
>>> >  found   : [T(in method
>>> > apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method
>>> apply)]
>>> >  required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(
>>> in
>>> > method functionToUdfBuilder)]
>>> >   implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]):
>>> > ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
>>> >
>>> > Any thoughts?
>>> >
>>> > ^
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Job priority

2015-01-11 Thread Patrick Wendell
Priority scheduling isn't something we've supported in Spark and we've
opted to support FIFO and Fair scheduling and asked users to try and
fit these to the needs of their applications.
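
For reference, a minimal sketch of the pool-based approximation that
comes up later in this thread. It assumes FAIR mode is enabled and
that an allocation file such as conf/fairscheduler.xml defines pools
named "high" and "low" with weights that differ by several orders of
magnitude; the pool and file names here are assumptions, not part of
any existing setup.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: assumes spark.scheduler.mode=FAIR and an allocation file
    // defining pools "high" and "low" with very different weights.
    val conf = new SparkConf()
      .setAppName("pool-based-priorities")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread land in the heavily weighted pool.
    sc.setLocalProperty("spark.scheduler.pool", "high")
    sc.parallelize(1 to 1000000).count()

    // Lower-priority work goes to the lightly weighted pool instead.
    sc.setLocalProperty("spark.scheduler.pool", "low")
    sc.parallelize(1 to 1000000).count()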

What I've seen of priority schedulers, such as the Linux CPU
scheduler, is that strict priority scheduling is never used in
practice because of priority starvation and other issues. So you end
up with a second tier of heuristics that exists to deal with issues
like starvation, priority inversion, etc., and these become very
complex over time.

That said, I looked at this a bit with @kayousterhout and I don't think
it would be very hard to implement a simple priority scheduler in the
current architecture. My main concern would be additional complexity
that would develop over time, based on looking at previous
implementations in the wild.

Alessandro, would you be able to open a JIRA and list some of your
requirements there? That way we could hear whether other people have
similar needs.

- Patrick

On Sun, Jan 11, 2015 at 10:07 AM, Mark Hamstra  wrote:
> Yes, if you are asking about developing a new priority queue job scheduling
> feature and not just about how job scheduling currently works in Spark, then
> that's a dev list issue.  The current job scheduling priority is at the
> granularity of pools containing jobs, not the jobs themselves; so if you
> require strictly job-level priority queuing, that would require a new
> development effort -- and one that I expect will involve a lot of tricky
> corner cases.
>
> Sorry for misreading the nature of your initial inquiry.
>
> On Sun, Jan 11, 2015 at 7:36 AM, Alessandro Baretta 
> wrote:
>
>> Cody,
>>
>> I might be able to improve the scheduling of my jobs by using a few
>> different pools with weights equal to, say, 1, 1e3 and 1e6, effectively
>> getting a small handful of priority classes. Still, this is really not
>> quite what I am describing. This is why my original post was on the dev
>> list. Let me then ask if there is any interest in having priority queue job
>> scheduling in Spark. This is something I might be able to pull off.
>>
>> Alex
>>
>> On Sun, Jan 11, 2015 at 6:21 AM, Cody Koeninger 
>> wrote:
>>
>>> If you set up a number of pools equal to the number of different priority
>>> levels you want, make the relative weights of those pools very different,
>>> and submit a job to the pool representing its priority, I think you'll get
>>> behavior equivalent to a priority queue. Try it and see.
>>>
>>> If I'm misunderstanding what you're trying to do, then I don't know.
>>>
>>>
>>> On Sunday, January 11, 2015, Alessandro Baretta 
>>> wrote:
>>>
 Cody,

 Maybe I'm not getting this, but it doesn't look like this page is
 describing a priority queue scheduling policy. What this section discusses
 is how resources are shared between queues. A weight-1000 pool will get
 1000 times more resources allocated to it than a priority 1 queue. Great,
 but not what I want. I want to be able to define an Ordering on my
 tasks representing their priority, and have Spark allocate all resources to
 the job that has the highest priority.

 Alex

 On Sat, Jan 10, 2015 at 10:11 PM, Cody Koeninger 
 wrote:

>
> http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
>
> "Setting a high weight such as 1000 also makes it possible to
> implement *priority* between pools--in essence, the weight-1000 pool
> will always get to launch tasks first whenever it has jobs active."
>
> On Sat, Jan 10, 2015 at 11:57 PM, Alessandro Baretta <
> alexbare...@gmail.com> wrote:
>
>> Mark,
>>
>> Thanks, but I don't see how this documentation solves my problem. You
>> are referring me to documentation of fair scheduling; whereas, I am 
>> asking
>> about as unfair a scheduling policy as can be: a priority queue.
>>
>> Alex
>>
>> On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra > > wrote:
>>
>>> -dev, +user
>>>
>>> http://spark.apache.org/docs/latest/job-scheduling.html
>>>
>>>
>>> On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
 Is it possible to specify a priority level for a job, such that the
 active
 jobs might be scheduled in order of priority?

 Alex

>>>
>>>
>>
>

>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Fwd: [ NOTICE ] Service Downtime Notification - R/W git repos

2015-01-13 Thread Patrick Wendell
FYI our git repo may be down for a few hours today.
-- Forwarded message --
From: "Tony Stevenson" 
Date: Jan 13, 2015 6:49 AM
Subject: [ NOTICE ] Service Downtime Notification - R/W git repos
To:
Cc:

Folks,

Please note that on Thursday 15th at 20:00 UTC the Infrastructure team
will be taking the read/write git repositories offline.  We expect
that this migration to last about 4 hours.

During the outage the service will be migrated from an old host to a
new one.   We intend to keep the URL the same for access to the repos
after the migration, but an alternate name is already in place in case
DNS updates take too long.   Please be aware it might take some hours
after the completion of the downtime for github to update and reflect
any changes.

The Infrastructure team have been trialling the new host for about a
week now, and [touch wood] have not had any problems with it.

The service is currently available by accessing repos via:
https://git-wip-us.apache.org

If you have any questions please address them to infrastruct...@apache.org




--
Cheers,
Tony

On behalf of the Apache Infrastructure Team

--
Tony Stevenson

t...@pc-tony.com
pct...@apache.org

http://www.pc-tony.com

GPG - 1024D/51047D66
--


Re: Bouncing Mails

2015-01-17 Thread Patrick Wendell
Akhil,

Those are handled by ASF infrastructure, not anyone in the Spark
project. So this list is not the appropriate place to ask for help.

- Patrick

On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das  wrote:
> My mails to the mailing list are getting rejected, have opened a Jira issue,
> can someone take a look at it?
>
> https://issues.apache.org/jira/browse/INFRA-9032
>
>
>
>
>
>
> Thanks
> Best Regards

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Semantics of LGTM

2015-01-17 Thread Patrick Wendell
Hey All,

Just wanted to ping about a minor issue - but one that ends up having
consequences given Spark's volume of reviews and commits. As much as
possible, I think that we should try and gear towards "Google Style"
LGTM on reviews. What I mean by this is that LGTM has the following
semantics:

"I know this code well, or I've looked at it close enough to feel
confident it should be merged. If there are issues/bugs with this code
later on, I feel confident I can help with them."

Here is an alternative semantic:

"Based on what I know about this part of the code, I don't see any
show-stopper problems with this patch".

The issue with the latter is that it ultimately erodes the
significance of LGTM, since subsequent reviewers need to reason about
what the person meant by saying LGTM. In contrast, having strong
semantics around LGTM can help streamline reviews a lot, especially as
reviewers get more experienced and gain trust from the committership.

There are several easy ways to give a more limited endorsement of a patch:
- "I'm not familiar with this code, but style, etc look good" (general
endorsement)
- "The build changes in this code LGTM, but I haven't reviewed the
rest" (limited LGTM)

If people are okay with this, I might add a short note on the wiki.
I'm sending this e-mail first, though, to see whether anyone wants to
express agreement or disagreement with this approach.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Semantics of LGTM

2015-01-17 Thread Patrick Wendell
I think the ASF +1 is *slightly* different than Google's LGTM, because
it might convey wanting the patch/feature to be merged but not
necessarily saying you did a thorough review and stand behind its
technical contents. For instance, I've seen people pile on +1's to try
and indicate support for a feature or patch in some projects, even
though they didn't do a thorough technical review. This +1 is
definitely a useful mechanism.

There is definitely much overlap in the meaning, though, and it's
largely because Spark had its own culture around reviews before it
was donated to the ASF, so there is a mix of the two styles.

Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
proposed originally (unlike the one Sandy proposed, e.g.). This is
what I've seen every project using the LGTM convention do (Google, and
some open source projects such as Impala) to indicate technical
sign-off.

- Patrick

On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson  wrote:
> I think I've seen something like +2 = "strong LGTM" and +1 = "weak LGTM;
> someone else should review" before. It's nice to have a shortcut which isn't
> a sentence when talking about weaker forms of LGTM.
>
> On Sat, Jan 17, 2015 at 6:59 PM,  wrote:
>>
>> I think clarifying these semantics is definitely worthwhile. Maybe this
>> complicates the process with additional terminology, but the way I've used
>> these has been:
>>
>> +1 - I think this is safe to merge and, barring objections from others,
>> would merge it immediately.
>>
>> LGTM - I have no concerns about this patch, but I don't necessarily feel
>> qualified to make a final call about it.  The TM part acknowledges the
>> judgment as a little more subjective.
>>
>> I think having some concise way to express both of these is useful.
>>
>> -Sandy
>>
>> > On Jan 17, 2015, at 5:40 PM, Patrick Wendell  wrote:
>> >
>> > Hey All,
>> >
>> > Just wanted to ping about a minor issue - but one that ends up having
>> > consequence given Spark's volume of reviews and commits. As much as
>> > possible, I think that we should try and gear towards "Google Style"
>> > LGTM on reviews. What I mean by this is that LGTM has the following
>> > semantics:
>> >
>> > "I know this code well, or I've looked at it close enough to feel
>> > confident it should be merged. If there are issues/bugs with this code
>> > later on, I feel confident I can help with them."
>> >
>> > Here is an alternative semantic:
>> >
>> > "Based on what I know about this part of the code, I don't see any
>> > show-stopper problems with this patch".
>> >
>> > The issue with the latter is that it ultimately erodes the
>> > significance of LGTM, since subsequent reviewers need to reason about
>> > what the person meant by saying LGTM. In contrast, having strong
>> > semantics around LGTM can help streamline reviews a lot, especially as
>> > reviewers get more experienced and gain trust from the comittership.
>> >
>> > There are several easy ways to give a more limited endorsement of a
>> > patch:
>> > - "I'm not familiar with this code, but style, etc look good" (general
>> > endorsement)
>> > - "The build changes in this code LGTM, but I haven't reviewed the
>> > rest" (limited LGTM)
>> >
>> > If people are okay with this, I might add a short note on the wiki.
>> > I'm sending this e-mail first, though, to see whether anyone wants to
>> > express agreement or disagreement with this approach.
>> >
>> > - Patrick
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
Okay - so given all this I was going to put the following on the wiki
tentatively:

## Reviewing Code
Community code review is Spark's fundamental quality assurance
process. When reviewing a patch, your goal should be to help
streamline the committing process by giving committers confidence this
patch has been verified by an additional party. It's encouraged to
(politely) submit technical feedback to the author to identify areas
for improvement or potential bugs.

If you feel a patch is ready for inclusion in Spark, indicate this to
committers with a comment: "I think this patch looks good". Spark uses
the LGTM convention for indicating the highest level of technical
sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
strong statement, it should be interpreted as the following: "I've
looked at this thoroughly and take as much ownership as if I wrote the
patch myself". If you comment LGTM you will be expected to help with
bugs or follow-up issues on the patch. Judicious use of LGTM's is a
great way to gain credibility as a reviewer with the broader
community.

It's also welcome for reviewers to argue against the inclusion of a
feature or patch. Simply indicate this in the comments.

- Patrick

On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma  wrote:
> Patrick's original proposal LGTM :).  However until now, I have been in the
> impression of LGTM with special emphasis on TM part. That said, I will be
> okay/happy(or Responsible ) for the patch, if it goes in.
>
> Prashant Sharma
>
>
>
> On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin  wrote:
>>
>> Maybe just to avoid LGTM as a single token when it is not actually
>> according to Patrick's definition, but anybody can still leave comments
>> like:
>>
>> "The direction of the PR looks good to me." or "+1 on the direction"
>>
>> "The build part looks good to me"
>>
>> ...
>>
>>
>> On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout 
>> wrote:
>>
>> > +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
>> > I've
>> > heard the semantics of "LGTM" expressed as "I've looked at this
>> > thoroughly
>> > and take as much ownership as if I wrote the patch myself".  My
>> > understanding is that this is the level of review we expect for all
>> > patches
>> > that ultimately go into Spark, so it's important to have a way to
>> > concisely
>> > describe when this has been done.
>> >
>> > Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
>> > cases I've seen, if someone else says "I looked at this very quickly and
>> > didn't see any glaring problems", it doesn't add any value for
>> > subsequent
>> > reviewers (someone still needs to take a thorough look).
>> >
>> > -Kay
>> >
>> > On Sat, Jan 17, 2015 at 8:04 PM,  wrote:
>> >
>> > > Yeah, the ASF +1 has become partly overloaded to mean both "I would
>> > > like
>> > > to see this feature" and "this patch should be committed", although,
>> > > at
>> > > least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
>> > > vote)
>> > > should unambiguously mean the latter unless qualified in some other
>> > > way.
>> > >
>> > > I don't have any opinion on the specific characters, but I agree with
>> > > Aaron that it would be nice to have some sort of abbreviation for both
>> > the
>> > > strong and weak forms of approval.
>> > >
>> > > -Sandy
>> > >
>> > > > On Jan 17, 2015, at 7:25 PM, Patrick Wendell 
>> > wrote:
>> > > >
>> > > > I think the ASF +1 is *slightly* different than Google's LGTM,
>> > > > because
>> > > > it might convey wanting the patch/feature to be merged but not
>> > > > necessarily saying you did a thorough review and stand behind it's
>> > > > technical contents. For instance, I've seen people pile on +1's to
>> > > > try
>> > > > and indicate support for a feature or patch in some projects, even
>> > > > though they didn't do a thorough technical review. This +1 is
>> > > > definitely a useful mechanism.
>> > > >
>> > > > There is definitely much overlap though in the meaning, though, and
>> > > > it's largely because Spark

Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
The wiki does not seem to be operational ATM, but I will do this when
it is back up.

On Mon, Jan 19, 2015 at 12:00 PM, Patrick Wendell  wrote:
> Okay - so given all this I was going to put the following on the wiki
> tentatively:
>
> ## Reviewing Code
> Community code review is Spark's fundamental quality assurance
> process. When reviewing a patch, your goal should be to help
> streamline the committing process by giving committers confidence this
> patch has been verified by an additional party. It's encouraged to
> (politely) submit technical feedback to the author to identify areas
> for improvement or potential bugs.
>
> If you feel a patch is ready for inclusion in Spark, indicate this to
> committers with a comment: "I think this patch looks good". Spark uses
> the LGTM convention for indicating the highest level of technical
> sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
> strong statement, it should be interpreted as the following: "I've
> looked at this thoroughly and take as much ownership as if I wrote the
> patch myself". If you comment LGTM you will be expected to help with
> bugs or follow-up issues on the patch. Judicious use of LGTM's is a
> great way to gain credibility as a reviewer with the broader
> community.
>
> It's also welcome for reviewers to argue against the inclusion of a
> feature or patch. Simply indicate this in the comments.
>
> - Patrick
>
> On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma  wrote:
>> Patrick's original proposal LGTM :).  However until now, I have been in the
>> impression of LGTM with special emphasis on TM part. That said, I will be
>> okay/happy(or Responsible ) for the patch, if it goes in.
>>
>> Prashant Sharma
>>
>>
>>
>> On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin  wrote:
>>>
>>> Maybe just to avoid LGTM as a single token when it is not actually
>>> according to Patrick's definition, but anybody can still leave comments
>>> like:
>>>
>>> "The direction of the PR looks good to me." or "+1 on the direction"
>>>
>>> "The build part looks good to me"
>>>
>>> ...
>>>
>>>
>>> On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout 
>>> wrote:
>>>
>>> > +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
>>> > I've
>>> > heard the semantics of "LGTM" expressed as "I've looked at this
>>> > thoroughly
>>> > and take as much ownership as if I wrote the patch myself".  My
>>> > understanding is that this is the level of review we expect for all
>>> > patches
>>> > that ultimately go into Spark, so it's important to have a way to
>>> > concisely
>>> > describe when this has been done.
>>> >
>>> > Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
>>> > cases I've seen, if someone else says "I looked at this very quickly and
>>> > didn't see any glaring problems", it doesn't add any value for
>>> > subsequent
>>> > reviewers (someone still needs to take a thorough look).
>>> >
>>> > -Kay
>>> >
>>> > On Sat, Jan 17, 2015 at 8:04 PM,  wrote:
>>> >
>>> > > Yeah, the ASF +1 has become partly overloaded to mean both "I would
>>> > > like
>>> > > to see this feature" and "this patch should be committed", although,
>>> > > at
>>> > > least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
>>> > > vote)
>>> > > should unambiguously mean the latter unless qualified in some other
>>> > > way.
>>> > >
>>> > > I don't have any opinion on the specific characters, but I agree with
>>> > > Aaron that it would be nice to have some sort of abbreviation for both
>>> > the
>>> > > strong and weak forms of approval.
>>> > >
>>> > > -Sandy
>>> > >
>>> > > > On Jan 17, 2015, at 7:25 PM, Patrick Wendell 
>>> > wrote:
>>> > > >
>>> > > > I think the ASF +1 is *slightly* different than Google's LGTM,
>>> > > > because
>>> > > > it might convey wanting the patch/feature to be merged but not
>>> > > > necessarily saying you did a thorough review and stand behind

Re: Standardized Spark dev environment

2015-01-20 Thread Patrick Wendell
To respond to the original suggestion by Nick: I always thought it
would be useful to have a Docker image on which we run the tests and
build releases, so that we could have a consistent environment that
other packagers or people trying to exhaustively run Spark tests could
replicate (or at least look at) to understand exactly how we recommend
building Spark. Sean - do you think that is too high an overhead?

In terms of providing images that we encourage as standard deployment
images of Spark and want to make portable across environments, that's
a much larger project and one with higher associated maintenance
overhead. So I'd be interested in seeing that evolve as its own
project (spark-deploy) or something associated with bigtop, etc.

- Patrick

On Tue, Jan 20, 2015 at 10:30 PM, Paolo Platter
 wrote:
> Hi all,
> I also tried the docker way and it works well.
> I suggest looking at the sequenceiq/spark Docker images; they are very active
> in that field.
>
> Paolo
>
> Inviata dal mio Windows Phone
> 
> Da: jay vyas
> Inviato: 21/01/2015 04:45
> A: Nicholas Chammas
> Cc: Will Benton; Spark dev 
> list
> Oggetto: Re: Standardized Spark dev environment
>
> I can comment on both...  hi will and nate :)
>
> 1) Will's Dockerfile solution is the simplest, most direct solution to the
> dev environment question: it's an efficient way to build and develop Spark
> environments for dev/test.  It would be cool to put that Dockerfile
> (and/or maybe a shell script which uses it) in the top level of Spark as
> the build entry point.  For total platform portability, you could wrap it in
> a Vagrantfile to launch a lightweight VM, so that Windows worked equally
> well.
>
> 2) However, since Nate mentioned vagrant and bigtop, I have to chime in :)
> the vagrant recipes in Bigtop are a nice reference deployment of how to
> deploy Spark in a heterogeneous Hadoop-style environment, and tighter
> integration testing w/ Bigtop for Spark releases would be lovely!  The
> vagrant stuff uses Puppet to deploy an n-node VM or Docker-based cluster, in
> which users can easily select components (including
> spark, yarn, hbase, hadoop, etc.) by simply editing a YAML file:
> https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml
> As Nate said, it would be a lot of fun to get more cross-collaboration
> between the Spark and Bigtop communities.  Input on how we can better
> integrate Spark (whether it's Spork, HBase integration, smoke tests around
> the MLlib stuff, or whatever) is always welcome.
>
>
>
>
>
>
> On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> How many profiles (hadoop / hive /scala) would this development environment
>> support ?
>>
>> As many as we want. We probably want to cover a good chunk of the build
>> matrix  that Spark
>> officially supports.
>>
>> What does this provide, concretely?
>>
>> It provides a reliable way to create a "good" Spark development
>> environment. Roughly speaking, this probably should mean an environment
>> that matches Jenkins, since that's where we run "official" testing and
>> builds.
>>
>> For example, Spark has to run on Java 6 and Python 2.6. When devs build and
>> run Spark locally, we can make sure they're doing it on these versions of
>> the languages with a simple vagrant up.
>>
>> Nate, could you comment on how something like this would relate to the
>> Bigtop effort?
>>
>> http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
>>
>> Will, that's pretty sweet. I tried something similar a few months ago as an
>> experiment to try building/testing Spark within a container. Here's the
>> shell script I used
>> against the base CentOS Docker image to set up an environment ready to build
>> and test Spark.
>>
>> We want to run Spark unit tests within containers on Jenkins, so it might
>> make sense to develop a single Docker image that can be used as both a "dev
>> environment" as well as execution container on Jenkins.
>>
>> Perhaps that's the approach to take instead of looking into Vagrant.
>>
>> Nick
>>
>> On Tue Jan 20 2015 at 8:22:41 PM Will Benton  wrote:
>>
>> Hey Nick,
>> >
>> > I did something similar with a Docker image last summer; I haven't
>> updated
>> > the images to cache the dependencies for the current Spark master, but it
>> > would be trivial to do so:
>> >
>> > http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
>> >
>> >
>> > best,
>> > wb
>> >
>> >
>> > - Original Message -
>> > > From: "Nicholas Chammas" 
>> > > To: "Spark dev list" 
>> > > Sent: Tuesday, January 20, 2015 6:13:31 PM
>> > > Subject: Standardized Spark dev environment
>> > >
>> > > What do y'all think of creating a standardized Spark dev

Re: Standardized Spark dev environment

2015-01-21 Thread Patrick Wendell
> If the goal is a reproducible test environment then I think that is what
> Jenkins is. Granted you can only ask it for a test. But presumably you get
> the same result if you start from the same VM image as Jenkins and run the
> same steps.

But the issue is when users can't reproduce Jenkins failures. We don't
publish anywhere the exact set of packages and versions that is
installed on Jenkins. And it can change, since it's shared
infrastructure with other projects. So why not publish this manifest
as a Dockerfile and then have it run on Jenkins using that image? My
point is that this "VM image + steps" is not public anywhere.

> I bet it is not hard to set up and maintain. I bet it is easier than a VM.
> But unless Jenkins is using it aren't we just making another different
> standard build env in an effort to standardize? If it is not the same then
> it loses value as being exactly the same as the reference build env. Has a
> problem come up that this solves?

Right now the reference build env is an AMI I created and keep adding
stuff to when Spark gets new dependencies (e.g. the version of ruby we
need to create the docs, new python stats libraries, etc). So if we
had a docker image, then I would use that for making the RC's as well
and it could serve as a definitive reference for people who want to
understand exactly what set of things they need to build Spark.

>
> If the goal is just easing developer set up then what does a Docker image do
> - what does it set up for me? I don't know of stuff I need set up on OS X
> for me beyond the IDE.

There are actually a good number of packages you need to do a full
build of Spark, including a compliant Python version, Java version,
certain Python packages, Ruby and Jekyll stuff for the docs, etc.
(as mentioned a bit earlier).

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Standardized Spark dev environment

2015-01-21 Thread Patrick Wendell
Yep,

I think it's only useful (and likely to be maintained) if we actually
use this on Jenkins. So that was my proposal. Basically give people a
Dockerfile so they can understand exactly what versions of everything
we use for our reference build. And if they don't want to use docker
directly, this will at least serve as an up-to-date list of
packages/versions they should try to install locally in whatever
environment they have.

- Patrick

On Wed, Jan 21, 2015 at 5:42 AM, Will Benton  wrote:
> - Original Message -----
>> From: "Patrick Wendell" 
>> To: "Sean Owen" 
>> Cc: "dev" , "jay vyas" , 
>> "Paolo Platter"
>> , "Nicholas Chammas" 
>> , "Will Benton" 
>> Sent: Wednesday, January 21, 2015 2:09:35 AM
>> Subject: Re: Standardized Spark dev environment
>
>> But the issue is when users can't reproduce Jenkins failures.
>
> Yeah, to answer Sean's question, this was part of the problem I was trying to 
> solve.  The other part was teasing out differences between the Fedora Java 
> environment and a more conventional Java environment.  I agree with Sean (and 
> I think this is your suggestion as well, Patrick) that making the environment 
> Jenkins runs in a standard image that is available for public consumption would 
> be useful in general.
>
>
>
> best,
> wb

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Upcoming Spark 1.2.1 RC

2015-01-21 Thread Patrick Wendell
Hey All,

I am planning to cut a 1.2.1 RC soon and wanted to notify people.

There are a handful of important fixes in the 1.2.1 branch
(http://s.apache.org/Mpn) particularly for Spark SQL. There was also
an issue publishing some of our artifacts with 1.2.0 and this release
would fix it for downstream projects.

You can track outstanding 1.2.1 blocker issues here at
http://s.apache.org/2v2 - I'm guessing all remaining blocker issues
will be fixed today.

I think we have a good handle on the remaining outstanding fixes, but
please let me know if you think there are severe outstanding fixes
that need to be backported into this branch or are not tracked above.

Thanks!
- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Patrick Wendell
One thing that is potentially not clear from this e-mail: there will be a 1:1
correspondence where you can get an RDD to/from a DataFrame.
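
For illustration, a rough sketch of what that round trip could look
like under the proposed 1.3 API. The toDF helper and the implicits
import are assumptions based on the proposal; only the DataFrame.rdd
accessor is described in the message quoted below.

    // Sketch only: assumes a SQLContext named sqlContext and the proposed
    // implicit conversions are in scope.
    import sqlContext.implicits._

    case class Person(name: String, age: Int)

    val peopleRDD = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31)))
    val peopleDF  = peopleRDD.toDF()   // RDD -> DataFrame
    val backToRDD = peopleDF.rdd       // DataFrame -> RDD[Row]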

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin  wrote:
> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> more users to be programming directly against SchemaRDD API rather than the
> core RDD API. SchemaRDD, through its less commonly used DSL originally
> designed for writing test cases, has always had the data-frame-like API. In
> 1.3, we are redesigning the API to make the API usable for end users.
>
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc), and
> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
> DataFrame.rdd will return the underlying RDD for all RDD methods.
>
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because they are not well documented. However,
> to maintain backward compatibility, we can create a type alias DataFrame
> that is still named SchemaRDD. This will maintain source compatibility for
> Scala. That said, we will have to update all existing materials to use
> DataFrame rather than SchemaRDD.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-26 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1061/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until Friday, January 30, at 07:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see
http://spark.apache.org/

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


