Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Patrick Wendell
Sounds good - thanks Holden!

On Mon, Sep 18, 2017 at 8:21 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> That sounds like a pretty good temporary workaround. If folks agree, I'll
> cancel the release vote for 2.1.2 and work on getting an RC2 out later this
> week, manually signed. I've filed JIRA SPARK-22055 & SPARK-22054 to port the
> release scripts and allow injecting of the RM's key.
>
> On Mon, Sep 18, 2017 at 8:11 PM, Patrick Wendell <patr...@databricks.com>
> wrote:
>
>> For the current release - maybe Holden could just sign the artifacts with
>> her own key manually, if this is a concern. I don't think that would
>> require modifying the release pipeline, except to just remove/ignore the
>> existing signatures.
>>
>> - Patrick
>>
>> On Mon, Sep 18, 2017 at 7:56 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Does anybody know whether this is a hard blocker? If it is not, we
>>> should probably push 2.1.2 forward quickly and do the infrastructure
>>> improvement in parallel.
>>>
>>> On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> I'm more than willing to help migrate the scripts as part of either
>>>> this release or the next.
>>>>
>>>> It sounds like there is a consensus developing around changing the
>>>> process -- should we hold off on the 2.1.2 release or roll this into the
>>>> next one?
>>>>
>>>> On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin <van...@cloudera.com>
>>>> wrote:
>>>>
>>>>> +1 to this. There should be a script in the Spark repo that has all
>>>>> the logic needed for a release. That script should take the RM's key
>>>>> as a parameter.
>>>>>
>>>>> If there's a desire to keep the current Jenkins job to create the
>>>>> release, it should be based on that script. But from what I'm seeing
>>>>> there are currently too many unknowns in the release process.
>>>>>
>>>>> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>> > I don't understand why it is necessary to share a release key. If
>>>>> this is
>>>>> > something that can be automated in a Jenkins job, then can it be a
>>>>> script
>>>>> > with a reasonable set of build requirements for Mac and Ubuntu?
>>>>> That's the
>>>>> > approach I've seen the most in other projects.
>>>>> >
>>>>> > I'm also not just concerned about release managers. Having a key
>>>>> stored
>>>>> > persistently on outside infrastructure adds the most risk, as
>>>>> Luciano noted
>>>>> > as well. We should also start publishing checksums in the Spark VOTE
>>>>> thread,
>>>>> > which are currently missing. The risk I'm concerned about is that if
>>>>> the key
>>>>> > were compromised, it would be possible to replace binaries with
>>>>> perfectly
>>>>> > valid ones, at least on some mirrors. If the Apache copy were
>>>>> replaced, then
>>>>> > we wouldn't even be able to catch that it had happened. Given the
>>>>> high
>>>>> > profile of Spark and the number of companies that run it, I think we
>>>>> need to
>>>>> > take extra care to make sure that can't happen, even if it is an
>>>>> annoyance
>>>>> > for the release managers.
>>>>>
>>>>> --
>>>>> Marcelo
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Patrick Wendell
For the current release - maybe Holden could just sign the artifacts with
her own key manually, if this is a concern. I don't think that would
require modifying the release pipeline, except to just remove/ignore the
existing signatures.

- Patrick

On Mon, Sep 18, 2017 at 7:56 PM, Reynold Xin  wrote:

> Does anybody know whether this is a hard blocker? If it is not, we should
> probably push 2.1.2 forward quickly and do the infrastructure improvement
> in parallel.
>
> On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau 
> wrote:
>
>> I'm more than willing to help migrate the scripts as part of either this
>> release or the next.
>>
>> It sounds like there is a consensus developing around changing the
>> process -- should we hold off on the 2.1.2 release or roll this into the
>> next one?
>>
>> On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin 
>> wrote:
>>
>>> +1 to this. There should be a script in the Spark repo that has all
>>> the logic needed for a release. That script should take the RM's key
>>> as a parameter.
>>>
>>> If there's a desire to keep the current Jenkins job to create the
>>> release, it should be based on that script. But from what I'm seeing
>>> there are currently too many unknowns in the release process.
>>>
>>> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue 
>>> wrote:
>>> > I don't understand why it is necessary to share a release key. If this
>>> is
>>> > something that can be automated in a Jenkins job, then can it be a
>>> script
>>> > with a reasonable set of build requirements for Mac and Ubuntu? That's
>>> the
>>> > approach I've seen the most in other projects.
>>> >
>>> > I'm also not just concerned about release managers. Having a key stored
>>> > persistently on outside infrastructure adds the most risk, as Luciano
>>> noted
>>> > as well. We should also start publishing checksums in the Spark VOTE
>>> thread,
>>> > which are currently missing. The risk I'm concerned about is that if
>>> the key
>>> > were compromised, it would be possible to replace binaries with
>>> perfectly
>>> > valid ones, at least on some mirrors. If the Apache copy were
>>> replaced, then
>>> > we wouldn't even be able to catch that it had happened. Given the high
>>> > profile of Spark and the number of companies that run it, I think we
>>> need to
>>> > take extra care to make sure that can't happen, even if it is an
>>> annoyance
>>> > for the release managers.
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Patrick Wendell
Hey, I talked more with Josh Rosen about this; he has helped with the automation
since I became less involved in release management.

I can think of a few different things that would improve our release management
based on these suggestions:

(1) We could remove the signing step from the rest of the automation and ask the
RM to sign the artifacts locally as a last step. This does mean we'd trust the
RM's environment not to be compromised, but it could be better if there is
concern about centralization of risk. I'm curious how other projects do this;
a rough sketch of what this could look like follows below.

(2) We could rotate the RM position. BTW Holden Karau is doing this and
that's how this whole discussion started.

(3) We should make sure all build tooling automation is in the repo itself
so that the build is 100% reproducible by anyone. I think most of it is
already in dev/ [1], but there might be Jenkins configs, etc. that could be
put into the Spark repo.

[1] https://github.com/apache/spark/tree/master/dev/create-release

- Patrick
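
As a rough illustration of option (1), a last-step local signing pass could look
something like the sketch below; the staging directory and key id are placeholders,
and this is not the actual release tooling:

{code}
# Minimal sketch (not the real release tooling): the RM signs each staged
# artifact with their own key and writes SHA-512 checksums for the VOTE thread.
# The staging directory and key id below are placeholders.
import hashlib
import subprocess
from pathlib import Path

STAGING_DIR = Path("spark-2.1.2-rc2-bin")   # hypothetical staging directory
RM_KEY_ID = "0xDEADBEEF"                    # the release manager's own key id

for artifact in sorted(STAGING_DIR.glob("*.tgz")):
    # Detached, ASCII-armored signature next to the artifact.
    subprocess.run(
        ["gpg", "--local-user", RM_KEY_ID, "--armor",
         "--output", f"{artifact}.asc", "--detach-sign", str(artifact)],
        check=True,
    )
    # SHA-512 checksum, computed in chunks so large tarballs are fine.
    digest = hashlib.sha512()
    with artifact.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    Path(f"{artifact}.sha512").write_text(f"{digest.hexdigest()}  {artifact.name}\n")
{code}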

On Mon, Sep 18, 2017 at 6:23 PM, Patrick Wendell <patr...@databricks.com>
wrote:

> One thing we could do is modify the release tooling to allow the key to be
> injected each time, thus allowing any RM to insert their own key at build
> time.
>
> Patrick
>
> On Mon, Sep 18, 2017 at 4:56 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> I don't understand why it is necessary to share a release key. If this is
>> something that can be automated in a Jenkins job, then can it be a script
>> with a reasonable set of build requirements for Mac and Ubuntu? That's the
>> approach I've seen the most in other projects.
>>
>> I'm also not just concerned about release managers. Having a key stored
>> persistently on outside infrastructure adds the most risk, as Luciano noted
>> as well. We should also start publishing checksums in the Spark VOTE
>> thread, which are currently missing. The risk I'm concerned about is that
>> if the key were compromised, it would be possible to replace binaries with
>> perfectly valid ones, at least on some mirrors. If the Apache copy were
>> replaced, then we wouldn't even be able to catch that it had happened.
>> Given the high profile of Spark and the number of companies that run it, I
>> think we need to take extra care to make sure that can't happen, even if it
>> is an annoyance for the release managers.
>>
>> On Sun, Sep 17, 2017 at 10:12 PM, Patrick Wendell <patr...@databricks.com
>> > wrote:
>>
>>> Sparks release pipeline is automated and part of that automation
>>> includes securely injecting this key for the purpose of signing. I asked
>>> the ASF to provide a service account key several years ago but they
>>> suggested that we use a key attributed to an individual even if the process
>>> is automated.
>>>
>>> I believe other projects that release with high frequency also have
>>> automated the signing process.
>>>
>>> This key is injected during the build process. A really ambitious
>>> release manager could reverse engineer this in a way that reveals the
>>> private key, however if someone is a release manager then they themselves
>>> can do quite a bit of nefarious things anyways.
>>>
>>> It is true that we trust all previous release managers instead of only
>>> one. We could probably rotate the jenkins credentials periodically in order
>>> to compensate for this, if we think this is a nontrivial risk.
>>>
>>> - Patrick
>>>
>>> On Sun, Sep 17, 2017 at 7:04 PM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> Would any of Patrick/Josh/Shane (or other PMC folks with
>>>> understanding/opinions on this setup) care to comment? If this is a
>>>> blocking issue I can cancel the current release vote thread while we
>>>> discuss this some more.
>>>>
>>>> On Fri, Sep 15, 2017 at 5:18 PM Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Oh yes and to keep people more informed I've been updating a PR for
>>>>> the release documentation as I go to write down some of this unwritten
>>>>> knowledge -- https://github.com/apache/spark-website/pull/66
>>>>>
>>>>>
>>>>> On Fri, Sep 15, 2017 at 5:12 PM Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Also continuing the discussion from the vote threads, Shane probably
>>>>>> has the best idea on the ACLs for Jenkins so I've CC'd him as well.
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 15, 2017 at 5:09 P

Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Patrick Wendell
One thing we could do is modify the release tooling to allow the key to be
injected each time, thus allowing any RM to insert their own key at build
time.

Patrick
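
To make that concrete, the change could be as small as the job exporting the RM's
key id into the environment the release script reads, instead of a fixed one; a
minimal sketch, where the script path, argument, and GPG_KEY variable name are
assumptions based on this thread:

{code}
# Sketch only: run an in-repo release script with the RM's key injected per
# invocation rather than a shared key baked into the Jenkins job. The script
# path and GPG_KEY variable name here are assumptions.
import os
import subprocess

def run_release(script_path: str, rm_key_id: str) -> None:
    env = dict(os.environ)
    env["GPG_KEY"] = rm_key_id          # whichever RM is cutting the release
    # The key's passphrase would come from Jenkins credentials or a local
    # prompt, never from the repository itself.
    subprocess.run([script_path], env=env, check=True)

# Hypothetical usage:
# run_release("dev/create-release/release-build.sh", "0xABCD1234")
{code}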

On Mon, Sep 18, 2017 at 4:56 PM Ryan Blue <rb...@netflix.com> wrote:

> I don't understand why it is necessary to share a release key. If this is
> something that can be automated in a Jenkins job, then can it be a script
> with a reasonable set of build requirements for Mac and Ubuntu? That's the
> approach I've seen the most in other projects.
>
> I'm also not just concerned about release managers. Having a key stored
> persistently on outside infrastructure adds the most risk, as Luciano noted
> as well. We should also start publishing checksums in the Spark VOTE
> thread, which are currently missing. The risk I'm concerned about is that
> if the key were compromised, it would be possible to replace binaries with
> perfectly valid ones, at least on some mirrors. If the Apache copy were
> replaced, then we wouldn't even be able to catch that it had happened.
> Given the high profile of Spark and the number of companies that run it, I
> think we need to take extra care to make sure that can't happen, even if it
> is an annoyance for the release managers.
>
> On Sun, Sep 17, 2017 at 10:12 PM, Patrick Wendell <patr...@databricks.com>
> wrote:
>
>> Sparks release pipeline is automated and part of that automation includes
>> securely injecting this key for the purpose of signing. I asked the ASF to
>> provide a service account key several years ago but they suggested that we
>> use a key attributed to an individual even if the process is automated.
>>
>> I believe other projects that release with high frequency also have
>> automated the signing process.
>>
>> This key is injected during the build process. A really ambitious release
>> manager could reverse engineer this in a way that reveals the private key,
>> however if someone is a release manager then they themselves can do quite a
>> bit of nefarious things anyways.
>>
>> It is true that we trust all previous release managers instead of only
>> one. We could probably rotate the jenkins credentials periodically in order
>> to compensate for this, if we think this is a nontrivial risk.
>>
>> - Patrick
>>
>> On Sun, Sep 17, 2017 at 7:04 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> Would any of Patrick/Josh/Shane (or other PMC folks with
>>> understanding/opinions on this setup) care to comment? If this is a
>>> blocking issue I can cancel the current release vote thread while we
>>> discuss this some more.
>>>
>>> On Fri, Sep 15, 2017 at 5:18 PM Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> Oh yes and to keep people more informed I've been updating a PR for the
>>>> release documentation as I go to write down some of this unwritten
>>>> knowledge -- https://github.com/apache/spark-website/pull/66
>>>>
>>>>
>>>> On Fri, Sep 15, 2017 at 5:12 PM Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Also continuing the discussion from the vote threads, Shane probably
>>>>> has the best idea on the ACLs for Jenkins so I've CC'd him as well.
>>>>>
>>>>>
>>>>> On Fri, Sep 15, 2017 at 5:09 PM Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Changing the release jobs, beyond the available parameters, right now
>>>>>> depends on Josh Rosen, as there are some scripts which generate the jobs
>>>>>> which aren't public. I've done temporary fixes in the past with the 
>>>>>> Python
>>>>>> packaging but my understanding is that in the medium term it requires
>>>>>> access to the scripts.
>>>>>>
>>>>>> So +CC Josh.
>>>>>>
>>>>>> On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> I think this needs to be fixed. It's true that there are barriers to
>>>>>>> publication, but the signature is what we use to authenticate Apache
>>>>>>> releases.
>>>>>>>
>>>>>>> If Patrick's key is available on Jenkins for any Spark committer to
>>>>>>> use, then the chance of a compromise are much higher than for a normal 
>>>>>>> RM
>>>>>>> key.
>>>>>>>
>>>>>>> rb

Re: Signing releases with pwendell or release manager's key?

2017-09-17 Thread Patrick Wendell
Spark's release pipeline is automated, and part of that automation includes
securely injecting this key for the purpose of signing. I asked the ASF to
provide a service account key several years ago, but they suggested that we
use a key attributed to an individual even if the process is automated.

I believe other projects that release with high frequency also have
automated the signing process.

This key is injected during the build process. A really ambitious release
manager could reverse-engineer this in a way that reveals the private key;
however, if someone is a release manager then they can already do quite a
few nefarious things anyway.

It is true that we trust all previous release managers instead of only one.
We could probably rotate the Jenkins credentials periodically in order to
compensate for this, if we think this is a nontrivial risk.

- Patrick
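
Since the signature is ultimately what downstream users rely on to authenticate a
release (as Ryan notes in the thread quoted below), the consumer-side check is worth
spelling out; a minimal sketch, assuming the signer's public key is already in the
local keyring and the expected checksum is published out-of-band (e.g. in the VOTE
thread), with illustrative file names:

{code}
# Sketch: verify a downloaded artifact against its detached .asc signature
# and an out-of-band SHA-512 checksum. File names here are illustrative.
import hashlib
import subprocess

def verify(artifact: str, expected_sha512: str) -> None:
    # gpg exits non-zero (and subprocess raises) if the signature is bad.
    subprocess.run(["gpg", "--verify", artifact + ".asc", artifact], check=True)
    digest = hashlib.sha512()
    with open(artifact, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha512:
        raise RuntimeError("checksum mismatch for " + artifact)

# verify("spark-2.1.2-bin-hadoop2.7.tgz", "<sha512 from the VOTE thread>")
{code}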

On Sun, Sep 17, 2017 at 7:04 PM, Holden Karau  wrote:

> Would any of Patrick/Josh/Shane (or other PMC folks with
> understanding/opinions on this setup) care to comment? If this is a
> blocking issue I can cancel the current release vote thread while we
> discuss this some more.
>
> On Fri, Sep 15, 2017 at 5:18 PM Holden Karau  wrote:
>
>> Oh yes and to keep people more informed I've been updating a PR for the
>> release documentation as I go to write down some of this unwritten
>> knowledge -- https://github.com/apache/spark-website/pull/66
>>
>>
>> On Fri, Sep 15, 2017 at 5:12 PM Holden Karau 
>> wrote:
>>
>>> Also continuing the discussion from the vote threads, Shane probably has
>>> the best idea on the ACLs for Jenkins so I've CC'd him as well.
>>>
>>>
>>> On Fri, Sep 15, 2017 at 5:09 PM Holden Karau 
>>> wrote:
>>>
 Changing the release jobs, beyond the available parameters, right now
 depends on Josh Rosen, as there are some scripts which generate the jobs
 which aren't public. I've done temporary fixes in the past with the Python
 packaging but my understanding is that in the medium term it requires
 access to the scripts.

 So +CC Josh.

 On Fri, Sep 15, 2017 at 4:38 PM Ryan Blue  wrote:

> I think this needs to be fixed. It's true that there are barriers to
> publication, but the signature is what we use to authenticate Apache
> releases.
>
> If Patrick's key is available on Jenkins for any Spark committer to
> use, then the chance of a compromise are much higher than for a normal RM
> key.
>
> rb
>
> On Fri, Sep 15, 2017 at 12:34 PM, Sean Owen 
> wrote:
>
>> Yeah I had meant to ask about that in the past. While I presume
>> Patrick consents to this and all that, it does mean that anyone with 
>> access
>> to said Jenkins scripts can create a signed Spark release, regardless of
>> who they are.
>>
>> I haven't thought through whether that's a theoretical issue we can
>> ignore or something we need to fix up. For example you can't get a 
>> release
>> on the ASF mirrors without more authentication.
>>
>> How hard would it be to make the script take in a key? it sort of
>> looks like the script already takes GPG_KEY, but don't know how to modify
>> the jobs. I suppose it would be ideal, in any event, for the actual 
>> release
>> manager to sign.
>>
>> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau 
>> wrote:
>>
>>> That's a good question, I built the release candidate however the
>>> Jenkins scripts don't take a parameter for configuring who signs them
>>> rather it always signs them with Patrick's key. You can see this from
>>> previous releases which were managed by other folks but still signed by
>>> Patrick.
>>>
>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue 
>>> wrote:
>>>
 The signature is valid, but why was the release signed with Patrick
 Wendell's private key? Did Patrick build the release candidate?

>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
 --
 Twitter: https://twitter.com/holdenkarau

>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
> --
> Twitter: https://twitter.com/holdenkarau
>


[jira] [Commented] (SPARK-16685) audit release docs are ambiguous

2016-07-24 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391168#comment-15391168
 ] 

Patrick Wendell commented on SPARK-16685:
-

These scripts are pretty old and I'm not sure if anyone still uses them. I had 
written them a while back as sanity tests for some release builds. Today, those 
things are tested broadly by the community so I think this has become 
redundant. [~rxin] are these still used? If not, it might be good to remove 
them from the source repo.

> audit release docs are ambiguous
> 
>
> Key: SPARK-16685
> URL: https://issues.apache.org/jira/browse/SPARK-16685
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.6.2
>Reporter: jay vyas
>Priority: Minor
>
> The dev/audit-release tooling is ambiguous.
> - should it run against a real cluster? if so when?
> - what should be in the release repo?  Just jars? tarballs?  ( i assume jars 
> because its a .ivy, but not sure).
> - 
> https://github.com/apache/spark/tree/master/dev/audit-release






[jira] [Resolved] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-13855.
-
   Resolution: Fixed
Fix Version/s: 1.6.1

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>    Assignee: Patrick Wendell
> Fix For: 1.6.1
>
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked s3 bucket spark-related-packages and noticed that no spark 1.6.1 
> present






[jira] [Commented] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-16 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196901#comment-15196901
 ] 

Patrick Wendell commented on SPARK-13855:
-

I've uploaded the artifacts, thanks.
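
For anyone who hit this before the upload completed, a quick way to confirm that the
package spark-ec2 fetches is actually present is a HEAD request against the URL from
the log below (sketch only):

{code}
# Sketch: HEAD the package URL that spark-ec2 downloads before launching a
# cluster; a 404 reproduces the failure below, a 200 means the upload is done.
import urllib.error
import urllib.request

URL = ("http://s3.amazonaws.com/spark-related-packages/"
       "spark-1.6.1-bin-hadoop2.4.tgz")

try:
    with urllib.request.urlopen(urllib.request.Request(URL, method="HEAD")) as resp:
        print(resp.status, URL)
except urllib.error.HTTPError as err:
    print(err.code, URL)
{code}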

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>    Assignee: Patrick Wendell
> Fix For: 1.6.1
>
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked s3 bucket spark-related-packages and noticed that no spark 1.6.1 
> present






[jira] [Assigned] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reassigned SPARK-13855:
---

Assignee: Patrick Wendell  (was: Michael Armbrust)

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>    Assignee: Patrick Wendell
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked s3 bucket spark-related-packages and noticed that no spark 1.6.1 
> present






Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Patrick Wendell
+1

On Wed, Dec 16, 2015 at 6:15 PM, Ted Yu  wrote:

> Ran test suite (minus docker-integration-tests)
> All passed
>
> +1
>
> [INFO] Spark Project External ZeroMQ .. SUCCESS [
> 13.647 s]
> [INFO] Spark Project External Kafka ... SUCCESS [
> 45.424 s]
> [INFO] Spark Project Examples . SUCCESS [02:06
> min]
> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
> 11.280 s]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 01:49 h
> [INFO] Finished at: 2015-12-16T17:06:58-08:00
>
> On Wed, Dec 16, 2015 at 4:37 PM, Andrew Or  wrote:
>
>> +1
>>
>> Mesos cluster mode regression in RC2 is now fixed (SPARK-12345
>>  / PR10332
>> ).
>>
>> Also tested on standalone client and cluster mode. No problems.
>>
>> 2015-12-16 15:16 GMT-08:00 Rad Gruchalski :
>>
>>> I also noticed that spark.replClassServer.host and
>>> spark.replClassServer.port aren’t used anymore. The transport now happens
>>> over the main RpcEnv.
>>>
>>> Kind regards,
>>> Radek Gruchalski
>>> ra...@gruchalski.com 
>>> de.linkedin.com/in/radgruchalski/
>>>
>>>
>>> *Confidentiality:*This communication is intended for the above-named
>>> person and may be confidential and/or legally privileged.
>>> If it has come to you in error you must take no action based on it, nor
>>> must you copy or show it to anyone; please delete/destroy and inform the
>>> sender immediately.
>>>
>>> On Wednesday, 16 December 2015 at 23:43, Marcelo Vanzin wrote:
>>>
>>> I was going to say that spark.executor.port is not used anymore in
>>> 1.6, but damn, there's still that akka backend hanging around there
>>> even when netty is being used... we should fix this, should be a
>>> simple one-liner.
>>>
>>> On Wed, Dec 16, 2015 at 2:35 PM, singinpirate 
>>> wrote:
>>>
>>> -0 (non-binding)
>>>
>>> I have observed that when we set spark.executor.port in 1.6, we get
>>> thrown a
>>> NPE in SparkEnv$.create(SparkEnv.scala:259). It used to work in 1.5.2. Is
>>> anyone else seeing this?
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>>
>>
>


[jira] [Updated] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12148:

Priority: Major  (was: Critical)

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SparkR
>Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?






[jira] [Updated] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12148:

Priority: Critical  (was: Minor)

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Wish
>  Components: R, SparkR
>Reporter: Michael Lawrence
>Priority: Critical
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?






[jira] [Updated] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12148:

Issue Type: Improvement  (was: Wish)

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SparkR
>Reporter: Michael Lawrence
>Priority: Critical
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?






[jira] [Updated] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12110:

Description: 
I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
spark-env to use python3. I can not run the tokenizer sample code. Is there a 
work around?

Kind regards

Andy

{code}
/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
658 raise Exception("You must build Spark with Hive. "
659 "Export 'SPARK_HIVE=true' and run "
--> 660 "build/sbt assembly", e)
661 
662 def _get_hive_ctx(self):

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
build/sbt assembly", Py4JJavaError('An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))




http://spark.apache.org/docs/latest/ml-features.html#tokenizer

from pyspark.ml.feature import Tokenizer, RegexTokenizer

sentenceDataFrame = sqlContext.createDataFrame([
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsDataFrame = tokenizer.transform(sentenceDataFrame)
for words_label in wordsDataFrame.select("words", "label").take(3):
  print(words_label)

---
Py4JJavaError Traceback (most recent call last)
/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
654 if not hasattr(self, '_scala_HiveContext'):
--> 655 self._scala_HiveContext = self._get_hive_ctx()
656 return self._scala_HiveContext

/root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
662 def _get_hive_ctx(self):
--> 663 return self._jvm.HiveContext(self._jsc.sc())
664 

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
700 return_value = get_return_value(answer, self._gateway_client, 
None,
--> 701 self._fqn)
702 

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 35 try:
---> 36 return f(*a, **kw)
 37 except py4j.protocol.Py4JJavaError as e:

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.io.IOException: Filesystem closed
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at 
org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1057)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:554)
at 
org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
at 
org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 15 mo

[jira] [Updated] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-12110:

Component/s: (was: ML)
 (was: SQL)
 (was: PySpark)
 EC2

> spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  Exception: ("You must build 
> Spark with Hive 
> 
>
> Key: SPARK-12110
> URL: https://issues.apache.org/jira/browse/SPARK-12110
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.1
> Environment: cluster created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>Reporter: Andrew Davidson
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I can not run the tokenizer sample code. Is there a 
> work around?
> Kind regards
> Andy
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 658 raise Exception("You must build Spark with Hive. "
> 659 "Export 'SPARK_HIVE=true' and run "
> --> 660 "build/sbt assembly", e)
> 661 
> 662 def _get_hive_ctx(self):
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError('An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
> http://spark.apache.org/docs/latest/ml-features.html#tokenizer
> from pyspark.ml.feature import Tokenizer, RegexTokenizer
> sentenceDataFrame = sqlContext.createDataFrame([
>   (0, "Hi I heard about Spark"),
>   (1, "I wish Java could use case classes"),
>   (2, "Logistic,regression,models,are,neat")
> ], ["label", "sentence"])
> tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
> wordsDataFrame = tokenizer.transform(sentenceDataFrame)
> for words_label in wordsDataFrame.select("words", "label").take(3):
>   print(words_label)
> ---
> Py4JJavaError Traceback (most recent call last)
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 654 if not hasattr(self, '_scala_HiveContext'):
> --> 655 self._scala_HiveContext = self._get_hive_ctx()
> 656 return self._scala_HiveContext
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 662 def _get_hive_ctx(self):
> --> 663 return self._jvm.HiveContext(self._jsc.sc())
> 664 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 700 return_value = get_return_value(answer, self._gateway_client, 
> None,
> --> 701 self._fqn)
> 702 
> /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
>   at py4j

[jira] [Commented] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036960#comment-15036960
 ] 

Patrick Wendell commented on SPARK-12110:
-

Hey Andrew, could you show the exact command you are using to run this 
example? Also, if you simply download Spark 1.5.1 and run the same command 
locally rather than in your modified EC2 cluster, does it work?
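
A minimal local run along those lines might look like the following sketch (assuming
a plain spark-1.5.1-bin-hadoop2.6 download; the local master setting is just for the
reproduction):

{code}
# Sketch: reproduce the report locally against a stock 1.5.1 binary build.
# The interesting step is HiveContext creation, which is what fails on EC2.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.ml.feature import Tokenizer

sc = SparkContext("local[2]", "spark-12110-repro")
sqlContext = HiveContext(sc)

sentenceDataFrame = sqlContext.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
for row in tokenizer.transform(sentenceDataFrame).select("words", "label").take(2):
    print(row)
{code}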

> spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  Exception: ("You must build 
> Spark with Hive 
> 
>
> Key: SPARK-12110
> URL: https://issues.apache.org/jira/browse/SPARK-12110
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.1
> Environment: cluster created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>Reporter: Andrew Davidson
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I can not run the tokenizer sample code. Is there a 
> work around?
> Kind regards
> Andy
> {code}
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 658 raise Exception("You must build Spark with Hive. "
> 659 "Export 'SPARK_HIVE=true' and run "
> --> 660 "build/sbt assembly", e)
> 661 
> 662 def _get_hive_ctx(self):
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError('An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
> http://spark.apache.org/docs/latest/ml-features.html#tokenizer
> from pyspark.ml.feature import Tokenizer, RegexTokenizer
> sentenceDataFrame = sqlContext.createDataFrame([
>   (0, "Hi I heard about Spark"),
>   (1, "I wish Java could use case classes"),
>   (2, "Logistic,regression,models,are,neat")
> ], ["label", "sentence"])
> tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
> wordsDataFrame = tokenizer.transform(sentenceDataFrame)
> for words_label in wordsDataFrame.select("words", "label").take(3):
>   print(words_label)
> ---
> Py4JJavaError Traceback (most recent call last)
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 654 if not hasattr(self, '_scala_HiveContext'):
> --> 655 self._scala_HiveContext = self._get_hive_ctx()
> 656 return self._scala_HiveContext
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 662 def _get_hive_ctx(self):
> --> 663 return self._jvm.HiveContext(self._jsc.sc())
> 664 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 700 return_value = get_return_value(answer, self._gateway_client, 
> None,
> --> 701 self._fqn)
> 702 
> /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at ja

Re: When to cut RCs

2015-12-02 Thread Patrick Wendell
In terms of advertising the status of the release and whether an
RC is likely to go out, the best mechanism I can think of is our current
one: using JIRA and respecting the semantics of a blocker JIRA. We
could do a better job, though, of creating a JIRA dashboard for each release and
linking to it publicly so it's very clear to people what is going on. I
have always used one privately when managing previous releases, but there's no
reason we can't put one up on the website or wiki.
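
The kind of per-release view this amounts to can be generated straight from JIRA; a
rough sketch using the public JIRA REST search API, where the JQL and version string
are only illustrative:

{code}
# Sketch: pull the unresolved blockers targeted at a release, which is the
# data a public dashboard or a weekly status mail would surface. The JQL and
# version string are illustrative.
import json
import urllib.parse
import urllib.request

jql = ('project = SPARK AND priority = Blocker AND resolution = Unresolved '
       'AND "Target Version/s" = 1.6.0')
url = ("https://issues.apache.org/jira/rest/api/2/search?"
       + urllib.parse.urlencode({"jql": jql, "fields": "summary,assignee"}))

with urllib.request.urlopen(url) as resp:
    for issue in json.load(resp)["issues"]:
        print(issue["key"], "-", issue["fields"]["summary"])
{code}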

IMO a mailing list is not a great mechanism for the fine-grained work of
release management because of the sheer complexity and volume of finalizing
a Spark release. Being a release manager means tracking, typically over the
course of several weeks, dozens of distinct issues and trying to prioritize
them, get more clarity from the reporters of those issues, possibly reaching
out to people on the phone or in person to get more details, etc. You want
a mutable dashboard where you can convey the current status clearly.

What might be good in the early stages is a weekly e-mail to the dev@ list
just refreshing what is on the JIRA and letting people know how things are
looking. So someone just passing by has some idea of how things are going
and can chime in, etc.

Once an RC is cut, we do mostly rely on the mailing list for
discussion. At that point the number of known issues is, I think, small
enough to discuss in an all-to-all fashion.

- Patrick

On Wed, Dec 2, 2015 at 1:25 PM, Sean Owen  wrote:

> On Wed, Dec 2, 2015 at 9:06 PM, Michael Armbrust 
> wrote:
> > This can be debated, but I explicitly ignored test and documentation
> issues.
> > Since the docs are published separately and easy to update, I don't think
> > its worth further disturbing the release cadence for these JIRAs.
>
> It makes sense to not hold up an RC since they don't affect testing of
> functionality. Prior releases have ultimately gone out with doc issues
> still outstanding (and bugs) though. This doesn't seem to be on
> anyone's release checklist, and maybe part of it is because they're
> let slide for RCs.  Your suggestion to check-point release status
> below sounds spot on; I sort of tried to do that earlier.
>
>
> > Up until today various committers have told me that there were known
> issues
> > with branch-1.6 that would cause them to -1 the release.  Whenever this
> > happened, I asked them to ensure there was a properly targeted blocker
> JIRA
> > open so people could publicly track the status of the release.  As long
> as
> > such issues were open, I only published a preview since making an RC is
> > pretty high cost.
>
> Makes sense if these are all getting translated into Blockers and
> resolved before an RC. It's the simplest mechanism to communicate and
> track this in a distributed way.
>
> "No blockers" is a minimal criterion for release. It still seems funny
> to release with so many issues targeted for 1.6.0, including issues
> that aren't critical or bugs. Sure, that's just hygiene. But without
> it, do people take "Target Version" seriously? if they don't, is there
> any force guiding people to prioritize or decide what to (not) work
> on? I'm sure the communication happens, just doesn't seem like it's
> fully on JIRA, which is ultimately suboptimal.
>
>
> > I actually did spent quite a bit of time asking people to close various
> > umbrella issues, and I was pretty strict about watching JIRA throughout
> the
> > process.  Perhaps as an additional step, future preview releases or
> branch
> > cuts can include a link to an authoritative dashboard that we will use to
> > decide when we are ready to make an RC.  I'm also open to other
> suggestions.
>
> Yes, that's great. It takes the same effort from everyone. Having a
> green light on a dashboard at release time is only the symptom of
> decent planning. The effect I think it really needs to have occurs
> now: what's really probably on the menu for 1.7? and periodically
> track against that goal. Then the release process is with any luck
> just a formality with no surprises.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


[jira] [Commented] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021511#comment-15021511
 ] 

Patrick Wendell commented on SPARK-11903:
-

I think it's simply dead code. SKIP_JAVA_TEST related to a check we did 
regarding whether Java 6 was being used instead of Java 7. It doesn't have 
anything to do with unit tests. Spark now requires Java 7, so the test has been 
removed, but the parser still handles that variable. It was just an omission 
not deleted as part of SPARK-7733 
(https://github.com/apache/spark/commit/e84815dc333a69368a48e0152f02934980768a14)
 /cc [~srowen].

> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73],
>  and tests are [always 
> skipped|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L170].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than [this 
> one|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
> If this option is not needed, we should deprecate and eventually remove it.






[jira] [Comment Edited] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021511#comment-15021511
 ] 

Patrick Wendell edited comment on SPARK-11903 at 11/23/15 4:29 AM:
---

I think it's simply dead code that should be deleted. SKIP_JAVA_TEST related to 
a check we did regarding whether Java 6 was being used instead of Java 7. It 
doesn't have anything to do with unit tests. Spark now requires Java 7, so the 
test has been removed, but the parser still handles that variable. It was just 
an omission not deleted as part of SPARK-7733 
(https://github.com/apache/spark/commit/e84815dc333a69368a48e0152f02934980768a14)
 /cc [~srowen].


was (Author: pwendell):
I think it's simply dead code. SKIP_JAVA_TEST related to a check we did 
regarding whether Java 6 was being used instead of Java 7. It doesn't have 
anything to do with unit tests. Spark now requires Java 7, so the test has been 
removed, but the parser still handles that variable. It was just an omission 
not deleted as part of SPARK-7733 
(https://github.com/apache/spark/commit/e84815dc333a69368a48e0152f02934980768a14)
 /cc [~srowen].

> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73],
>  and tests are [always 
> skipped|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L170].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than [this 
> one|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
> If this option is not needed, we should deprecate and eventually remove it.






Re: A proposal for Spark 2.0

2015-11-10 Thread Patrick Wendell
I also feel the same as Reynold. I agree we should minimize API breaks and
focus on fixing things around the edges that were mistakes (e.g. exposing
Guava and Akka) rather than any overhaul that could fragment the community.
Ideally a major release is a lightweight process we can do every couple of
years, with minimal impact for users.

- Patrick

On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> > For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model.
>
> +1 for this. The Python community went through a lot of turmoil over the
> Python 2 -> Python 3 transition because the upgrade process was too painful
> for too long. The Spark community will benefit greatly from our explicitly
> looking to avoid a similar situation.
>
> > 3. Assembly-free distribution of Spark: don’t require building an
> enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free
> distribution means.
>
> Nick
>
> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin  wrote:
>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature request in this thread. Not
>> that we shouldn’t be adding features, but we can always add features in
>> 1.7, 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and
>> major releases in Spark, based on discussion with several members of the
>> community: a major release should be low overhead and minimally disruptive
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>
>> If the community likes the above model, then to me it seems reasonable to
>> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>> major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing
>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 1.x.
>>
>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>> about user applications being unable to use Akka due to Spark’s dependency
>> on Akka.
>>
>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>
>> 4. Better class package structure for low level developer API’s. In
>> particular, we have some DeveloperApi (mostly various listener-related
>> classes) added over the years. Some packages include only one or two public
>> classes but a lot of private classes. A better structure is to have public
>> classes isolated to a few public packages, and these public packages should
>> have minimal private classes for low level developer APIs.
>>
>> 5. Consolidate task metric and accumulator API. Although having some
>> subtle differences, these two are very similar but have completely
>> different code path.
>>
>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>> moving them to other package(s). They are already used beyond SQL, e.g. in
>> ML pipelines, and will be used by streaming also.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has been end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>>


[jira] [Commented] (SPARK-11326) Support for authentication and encryption in standalone mode

2015-11-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997448#comment-14997448
 ] 

Patrick Wendell commented on SPARK-11326:
-

There are a few related conversations here:

1. The feature set and goals of the standalone scheduler. The main goal of that 
scheduler is to make it easy for people to download and run Spark with minimal 
extra dependencies. The main difference between the standalone mode and other 
schedulers is that we aren't providing support for scheduling frameworks other 
than Spark (and likely never will). Other than that, features are added on a 
case-by-case basis depending on whether there is sufficient commitment from the 
maintainers to support the feature long term.

2. Security in non-YARN modes. I would actually like to see better support for 
security in other modes of Spark, the main reason being to support the large 
number of users who are not running inside Hadoop deployments. BTW, I think the existing 
security architecture of Spark makes this possible, because the concern of 
distributing a shared secret is largely decoupled from the specific security 
mechanism. But we haven't really exposed public hooks for injecting secrets. 
There is also the question of secure job submission which is addressed in this 
JIRA. This needs some thought and probably makes sense to discuss in the Spark 
1.7 timeframe.

Overall I think some broader questions need to be answered, and it's something 
perhaps we can discuss once 1.6 is out the door as we think about 1.7.

> Support for authentication and encryption in standalone mode
> 
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components need to use the same secure 
> token for all network connections if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode, making 
> it more like YARN mode - application-internal communication and scheduler 
> communication.
> Such refactoring will allow the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown to users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing SASL authentication/encryption for connections between a client 
> (Client or AppClient) and the Spark Master, it becomes possible to introduce 
> pluggable authentication for the standalone deployment mode.
> h3.Improvements introduced by this patch
> This patch introduces the following changes:
> * The Spark driver or submission client does not have to use the same secret as 
> workers use to communicate with Master
> * Master is able to authenticate individual clients with the following rules:
> ** When connecting to the master, the client needs to specify 
> {{spark.authenticate.secret}} which is an authentication token for the user 
> specified by {{spark.authenticate.user}} ({{sparkSaslUser}} by default)
> ** Master configuration may include additional 
> {{spark.authenticate.secrets.}} entries for specifying 
> authentication token for particular users or 
> {{spark.authenticate.authenticatorClass}} which specifies an implementation of 
> external credentials provider (which is able to retrieve the authentication 
> token for a given user).
> ** Workers authenticate with Master as default user {{sparkSaslUser}}. 
> * The authorization rules are as follows:
> ** A regular user is able to manage only his own application (the application 
> which he submitted)
> ** A regular user is not able to register or manage workers
> ** Spark default user {{sparkSaslUser}} can manage all the applications
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations the env variable will override the conf variable if present. 
> - In all situations when a user has to pass a secret, it is better (safer) to 
> do this through the env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is r
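
A minimal sketch of the single-shared-secret setup that this ticket wants to split 
up (illustrative only: {{spark.authenticate}} and {{spark.authenticate.secret}} 
exist today, while per-user settings such as {{spark.authenticate.user}} and 
{{spark.authenticate.secrets.*}} are only proposed in this JIRA):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Today's standalone-mode security: one shared secret for everything.
// The master, workers, and every application must agree on this value.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("auth-sketch")
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret",
    sys.env.getOrElse("SPARK_AUTH_SECRET", "shared-secret"))

val sc = new SparkContext(conf)
{code}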

Re: State of the Build

2015-11-05 Thread Patrick Wendell
Hey Jakob,

The builds in Spark are largely maintained by me, Sean, and Michael
Armbrust (for SBT). For historical reasons, Spark supports both a Maven and
SBT build. Maven is the build of reference for packaging Spark and is used
by many downstream packagers and to build all Spark releases. SBT is more
often used by developers. Both builds inherit from the same pom files (and
rely on the same profiles) to minimize maintenance complexity of Spark's
very complex dependency graph.

If you are looking to make contributions that help with the build, I am
happy to point you towards some things that are consistent maintenance
headaches. There are two major pain points right now that I'd be thrilled
to see fixes for:

1. SBT relies on a different dependency conflict resolution strategy than
maven - causing all kinds of headaches for us. I have heard that newer
versions of SBT can (maybe?) use Maven as a dependency resolver instead of
Ivy. This would make our life so much better if it were possible, either by
virtue of upgrading SBT or somehow doing this ourselves.

2. We don't have a great way of auditing the net effect of dependency
changes when people make them in the build. I am working on a fairly clunky
patch to do this here:

https://github.com/apache/spark/pull/8531

It could be done much more nicely using SBT, but only provided (1) is
solved.

As for doing a major overhaul of the sbt build to decouple it from the pom files -
I'm not sure that's the best place to start, given that we need to continue to
support Maven; the coupling is intentional. But getting involved in the
build in general would be completely welcome.

- Patrick

On Thu, Nov 5, 2015 at 10:53 PM, Sean Owen  wrote:

> Maven isn't 'legacy', or supported for the benefit of third parties.
> SBT had some behaviors / problems that Maven didn't relative to what
> Spark needs. SBT is a development-time alternative only, and partly
> generated from the Maven build.
>
> On Fri, Nov 6, 2015 at 1:48 AM, Koert Kuipers  wrote:
> > People who do upstream builds of spark (think bigtop and hadoop distros)
> are
> > used to legacy systems like maven, so maven is the default build. I don't
> > think it will change.
> >
> > Any improvements for the sbt build are of course welcome (it is still
> used
> > by many developers), but i would not do anything that increases the
> burden
> > of maintaining two build systems.
> >
> > On Nov 5, 2015 18:38, "Jakob Odersky"  wrote:
> >>
> >> Hi everyone,
> >> in the process of learning Spark, I wanted to get an overview of the
> >> interaction between all of its sub-projects. I therefore decided to
> have a
> >> look at the build setup and its dependency management.
> >> Since I am a lot more comfortable using sbt than maven, I decided to try
> to
> >> port the maven configuration to sbt (with the help of automated tools).
> >> This led me to a couple of observations and questions on the build
> system
> >> design:
> >>
> >> First, currently, there are two build systems, maven and sbt. Is there a
> >> preferred tool (or future direction to one)?
> >>
> >> Second, the sbt build also uses maven "profiles" requiring the use of
> >> specific commandline parameters when starting sbt. Furthermore, since it
> >> relies on maven poms, dependencies to the scala binary version (_2.xx)
> are
> >> hardcoded and require running an external script when switching
> versions.
> >> Sbt could leverage built-in constructs to support cross-compilation and
> >> emulate profiles with configurations and new build targets. This would
> >> remove external state from the build (in that no extra steps need to be
> >> performed in a particular order to generate artifacts for a new
> >> configuration) and therefore improve stability and build reproducibility
> >> (maybe even build performance). I was wondering if implementing such
> >> functionality for the sbt build would be welcome?
> >>
> >> thanks,
> >> --Jakob
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
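
For reference, the built-in cross-building Jakob mentions looks roughly like this
in a plain sbt project (a sketch only - the Scala versions are illustrative, and
Spark's actual build drives this through the pom profiles and a helper script):

{code}
// build.sbt (sketch, not Spark's real build definition)
crossScalaVersions := Seq("2.10.5", "2.11.7")
scalaVersion := crossScalaVersions.value.head

// Prefixing a task with '+' runs it once per listed Scala version, so the
// binary version suffix (_2.10 / _2.11) no longer has to be hardcoded:
//   sbt "+compile" "+publishLocal"
{code}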


Re: test failed due to OOME

2015-11-02 Thread Patrick Wendell
I believe this is some bug in our tests. For some reason we are using way
more memory than necessary. We'll probably need to log into Jenkins and
heap dump some running tests and figure out what is going on.

On Mon, Nov 2, 2015 at 7:42 AM, Ted Yu  wrote:

> Looks like SparkListenerSuite doesn't OOM on QA runs compared to Jenkins
> builds.
>
> I wonder if this is due to difference between machines running QA tests vs
> machines running Jenkins builds.
>
> On Fri, Oct 30, 2015 at 1:19 PM, Ted Yu  wrote:
>
>> I noticed that the SparkContext created in each sub-test is not stopped
>> upon finishing the sub-test.
>>
>> Would stopping each SparkContext make a difference in terms of heap
>> memory consumption ?
>>
>> Cheers
>>
>> On Fri, Oct 30, 2015 at 12:04 PM, Mridul Muralidharan 
>> wrote:
>>
>>> It is giving OOM at 32GB ? Something looks wrong with that ... that is
>>> already on the higher side.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Fri, Oct 30, 2015 at 11:28 AM, shane knapp 
>>> wrote:
>>> > here's the current heap settings on our workers:
>>> > InitialHeapSize == 2.1G
>>> > MaxHeapSize == 32G
>>> >
>>> > system ram:  128G
>>> >
>>> > we can bump it pretty easily...  it's just a matter of deciding if we
>>> > want to do this globally (super easy, but will affect ALL maven builds
>>> > on our system -- not just spark) or on a per-job basis (this doesn't
>>> > scale that well).
>>> >
>>> > thoughts?
>>> >
>>> > On Fri, Oct 30, 2015 at 9:47 AM, Ted Yu  wrote:
>>> >> This happened recently on Jenkins:
>>> >>
>>> >>
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.3,label=spark-test/3964/console
>>> >>
>>> >> On Sun, Oct 18, 2015 at 7:54 AM, Ted Yu  wrote:
>>> >>>
>>> >>> From
>>> >>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3846/console
>>> >>> :
>>> >>>
>>> >>> SparkListenerSuite:
>>> >>> - basic creation and shutdown of LiveListenerBus
>>> >>> - bus.stop() waits for the event queue to completely drain
>>> >>> - basic creation of StageInfo
>>> >>> - basic creation of StageInfo with shuffle
>>> >>> - StageInfo with fewer tasks than partitions
>>> >>> - local metrics
>>> >>> - onTaskGettingResult() called when result fetched remotely ***
>>> FAILED ***
>>> >>>   org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task
>>> >>> 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in
>>> stage
>>> >>> 0.0 (TID 0, localhost): java.lang.OutOfMemoryError: Java heap space
>>> >>>  at java.util.Arrays.copyOf(Arrays.java:2271)
>>> >>>  at
>>> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>> >>>  at
>>> >>>
>>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>> >>>  at
>>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>>> >>>  at
>>> >>>
>>> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
>>> >>>  at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
>>> >>>  at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:182)
>>> >>>  at
>>> >>>
>>> org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:52)
>>> >>>  at
>>> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
>>> >>>  at
>>> >>>
>>> org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:49)
>>> >>>  at
>>> >>>
>>> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
>>> >>>  at
>>> >>>
>>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
>>> >>>  at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>>> >>>  at
>>> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>>> >>>  at
>>> >>>
>>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>>> >>>  at
>>> >>>
>>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>>> >>>  at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>>> >>>  at
>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> >>>  at
>>> >>>
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> >>>  at java.lang.Thread.run(Thread.java:745)
>>> >>>
>>> >>>
>>> >>> Should more heap be given to test suite ?
>>> >>>
>>> >>>
>>> >>> Cheers
>>> >>
>>> >>
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>> >
>>>
>>
>>
>
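
As a concrete illustration of Ted's suggestion above, a suite that stops its
SparkContext after every sub-test looks roughly like this (a sketch with made-up
names, not the actual SparkListenerSuite):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite}

class ExampleSuite extends FunSuite with BeforeAndAfterEach {

  private var sc: SparkContext = _

  override def beforeEach(): Unit = {
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("example"))
  }

  // Stopping the context releases executors, listener buses and cached
  // blocks, so heap usage does not accumulate across sub-tests.
  override def afterEach(): Unit = {
    if (sc != null) {
      sc.stop()
      sc = null
    }
  }

  test("simple count") {
    assert(sc.parallelize(1 to 10).count() == 10)
  }
}
{code}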


[jira] [Resolved] (SPARK-11236) Upgrade Tachyon dependency to 0.8.0

2015-11-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-11236.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Upgrade Tachyon dependency to 0.8.0
> ---
>
> Key: SPARK-11236
> URL: https://issues.apache.org/jira/browse/SPARK-11236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Calvin Jia
> Fix For: 1.6.0
>
>
> Update the tachyon-client dependency from 0.7.1 to 0.8.0. There are no new 
> dependencies added or Spark-facing APIs changed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11236) Upgrade Tachyon dependency to 0.8.0

2015-11-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11236:

Assignee: Calvin Jia

> Upgrade Tachyon dependency to 0.8.0
> ---
>
> Key: SPARK-11236
> URL: https://issues.apache.org/jira/browse/SPARK-11236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Calvin Jia
>Assignee: Calvin Jia
> Fix For: 1.6.0
>
>
> Update the tachyon-client dependency from 0.7.1 to 0.8.0. There are no new 
> dependencies added or Spark-facing APIs changed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11446:

Target Version/s: 1.6.0

> Spark 1.6 release notes
> ---
>
> Key: SPARK-11446
> URL: https://issues.apache.org/jira/browse/SPARK-11446
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>        Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> This is a staging location where we can keep track of changes that need to be 
> documented in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-11446:
---

 Summary: Spark 1.6 release notes
 Key: SPARK-11446
 URL: https://issues.apache.org/jira/browse/SPARK-11446
 Project: Spark
  Issue Type: Task
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Michael Armbrust
Priority: Critical


This is a staging location where we can keep track of changes that need to be 
documented in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11238) SparkR: Documentation change for merge function

2015-11-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984646#comment-14984646
 ] 

Patrick Wendell commented on SPARK-11238:
-

I created SPARK-11446 and linked it here.

> SparkR: Documentation change for merge function
> ---
>
> Key: SPARK-11238
> URL: https://issues.apache.org/jira/browse/SPARK-11238
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>  Labels: releasenotes
>
> As discussed in pull request: https://github.com/apache/spark/pull/9012, the 
> signature of the merge function will be changed, therefore documentation 
> change is required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984776#comment-14984776
 ] 

Patrick Wendell commented on SPARK-11446:
-

I think this is redundant with the "releasenotes" tag so I am closing it.

> Spark 1.6 release notes
> ---
>
> Key: SPARK-11446
> URL: https://issues.apache.org/jira/browse/SPARK-11446
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> This is a staging location where we can keep track of changes that need to be 
> documented in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11446) Spark 1.6 release notes

2015-11-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell closed SPARK-11446.
---
Resolution: Invalid

> Spark 1.6 release notes
> ---
>
> Key: SPARK-11446
> URL: https://issues.apache.org/jira/browse/SPARK-11446
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>        Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
>
> This is a staging location where we can keep track of changes that need to be 
> documented in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-26 Thread Patrick Wendell
I verified that the issue with build binaries being present in the source
release is fixed. Haven't done enough vetting for a full vote, but did
verify that.

On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.2
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 51 known issues in Spark 1.5.1, listed here:
> http://s.apache.org/spark-1.5.2
>
> The tag to be voted on is v1.5.2-rc1:
> https://github.com/apache/spark/releases/tag/v1.5.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> - as version 1.5.2-rc1:
> https://repository.apache.org/content/repositories/orgapachespark-1151
> - as version 1.5.2:
> https://repository.apache.org/content/repositories/orgapachespark-1150
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-v1.5.2-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.1. Bugs already
> present in 1.5.1 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.2?
> ===
> Please target 1.5.3 or 1.6.0.
>
>
>


Request to be added to Incubator PMC

2015-10-25 Thread Patrick Wendell
Hi All,

I would like to be added to the Incubator PMC to help mentor a new project.
I am an Apache Member. I am not sure of the exact process for being added, so I am
emailing this list as a first step!

Cheers,
- Patrick


[jira] [Commented] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973493#comment-14973493
 ] 

Patrick Wendell commented on SPARK-11305:
-

/cc [~srowen] for his thoughts.

> Remove Third-Party Hadoop Distributions Doc Page
> 
>
> Key: SPARK-11305
> URL: https://issues.apache.org/jira/browse/SPARK-11305
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>        Reporter: Patrick Wendell
>Priority: Critical
>
> There is a fairly old page in our docs that contains a bunch of assorted 
> information regarding running Spark on Hadoop clusters. I think this page 
> should be removed and merged into other parts of the docs because the 
> information is largely redundant and somewhat outdated.
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> There are four sections:
> 1. Compile time Hadoop version - this information I think can be removed in 
> favor of that on the "building spark" page. These days most "advanced users" 
> are building without bundling Hadoop, so I'm not sure giving them a bunch of 
> different Hadoop versions sends the right message.
> 2. Linking against Hadoop - this doesn't seem to add much beyond what is in 
> the programming guide.
> 3. Where to run Spark - redundant with the hardware provisioning guide.
> 4. Inheriting cluster configurations - I think this would be better as a 
> section at the end of the configuration page. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-25 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-11305:
---

 Summary: Remove Third-Party Hadoop Distributions Doc Page
 Key: SPARK-11305
 URL: https://issues.apache.org/jira/browse/SPARK-11305
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Priority: Critical


There is a fairly old page in our docs that contains a bunch of assorted 
information regarding running Spark on Hadoop clusters. I think this page 
should be removed and merged into other parts of the docs because the 
information is largely redundant and somewhat outdated.

http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html

There are four sections:

1. Compile time Hadoop version - this information I think can be removed in 
favor of that on the "building spark" page. These days most "advanced users" 
are building without bundling Hadoop, so I'm not sure giving them a bunch of 
different Hadoop versions sends the right message.

2. Linking against Hadoop - this doesn't seem to add much beyond what is in the 
programming guide.

3. Where to run Spark - redundant with the hardware provisioning guide.

4. Inheriting cluster configurations - I think this would be better as a 
section at the end of the configuration page. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973510#comment-14973510
 ] 

Patrick Wendell commented on SPARK-10971:
-

Reynold has sent out the vote email based on the original fix. Since that vote 
is likely to pass, this patch will probably be in 1.5.3.

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Sun Rui
> Fix For: 1.5.3, 1.6.0
>
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973510#comment-14973510
 ] 

Patrick Wendell edited comment on SPARK-10971 at 10/26/15 12:02 AM:


Reynold has sent out the vote email based on the tagged commit. Since that vote 
is likely to pass, this patch will probably be in 1.5.3.


was (Author: pwendell):
Reynold has sent out the vote email based on the original fix. Since that vote 
is likely to pass, this patch will probably be in 1.5.3.

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Sun Rui
> Fix For: 1.5.3, 1.6.0
>
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10971:

Fix Version/s: (was: 1.5.2)
   1.5.3

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Sun Rui
> Fix For: 1.5.3, 1.6.0
>
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: BUILD SYSTEM: amp-jenkins-worker-05 offline

2015-10-19 Thread Patrick Wendell
This is what I'm looking at:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/



On Mon, Oct 19, 2015 at 12:58 PM, shane knapp <skn...@berkeley.edu> wrote:

> all we did was reboot -05 and -03...  i'm seeing a bunch of green
> builds.  could you provide me w/some specific failures so i can look
> in to them more closely?
>
> On Mon, Oct 19, 2015 at 12:27 PM, Patrick Wendell <pwend...@gmail.com>
> wrote:
> > Hey Shane,
> >
> > It also appears that every Spark build is failing right now. Could it be
> > related to your changes?
> >
> > - Patrick
> >
> > On Mon, Oct 19, 2015 at 11:13 AM, shane knapp <skn...@berkeley.edu>
> wrote:
> >>
> >> worker 05 is back up now...  looks like the machine OOMed and needed
> >> to be kicked.
> >>
> >> On Mon, Oct 19, 2015 at 9:39 AM, shane knapp <skn...@berkeley.edu>
> wrote:
> >> > i'll have to head down to the colo and see what's up with it...  it
> >> > seems to be wedged (pings ok, can't ssh in) and i'll update the list
> >> > when i figure out what's wrong.
> >> >
> >> > i don't think it caught fire (#toosoon?), because everything else is
> >> > up and running.  :)
> >> >
> >> > shane
> >>
> >> --
> >> You received this message because you are subscribed to the Google
> Groups
> >> "amp-infra" group.
> >> To unsubscribe from this group and stop receiving emails from it, send
> an
> >> email to amp-infra+unsubscr...@googlegroups.com.
> >> For more options, visit https://groups.google.com/d/optout.
> >
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "amp-infra" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to amp-infra+unsubscr...@googlegroups.com.
> > For more options, visit https://groups.google.com/d/optout.
>
> --
> You received this message because you are subscribed to the Google Groups
> "amp-infra" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to amp-infra+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>


Re: BUILD SYSTEM: amp-jenkins-worker-05 offline

2015-10-19 Thread Patrick Wendell
I think many of them are coming form the Spark 1.4 builds:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.4-Maven-pre-YARN/3900/console

On Mon, Oct 19, 2015 at 1:44 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> This is what I'm looking at:
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>
>
>
> On Mon, Oct 19, 2015 at 12:58 PM, shane knapp <skn...@berkeley.edu> wrote:
>
>> all we did was reboot -05 and -03...  i'm seeing a bunch of green
>> builds.  could you provide me w/some specific failures so i can look
>> in to them more closely?
>>
>> On Mon, Oct 19, 2015 at 12:27 PM, Patrick Wendell <pwend...@gmail.com>
>> wrote:
>> > Hey Shane,
>> >
>> > It also appears that every Spark build is failing right now. Could it be
>> > related to your changes?
>> >
>> > - Patrick
>> >
>> > On Mon, Oct 19, 2015 at 11:13 AM, shane knapp <skn...@berkeley.edu>
>> wrote:
>> >>
>> >> worker 05 is back up now...  looks like the machine OOMed and needed
>> >> to be kicked.
>> >>
>> >> On Mon, Oct 19, 2015 at 9:39 AM, shane knapp <skn...@berkeley.edu>
>> wrote:
>> >> > i'll have to head down to the colo and see what's up with it...  it
>> >> > seems to be wedged (pings ok, can't ssh in) and i'll update the list
>> >> > when i figure out what's wrong.
>> >> >
>> >> > i don't think it caught fire (#toosoon?), because everything else is
>> >> > up and running.  :)
>> >> >
>> >> > shane
>> >>
>> >> --
>> >> You received this message because you are subscribed to the Google
>> Groups
>> >> "amp-infra" group.
>> >> To unsubscribe from this group and stop receiving emails from it, send
>> an
>> >> email to amp-infra+unsubscr...@googlegroups.com.
>> >> For more options, visit https://groups.google.com/d/optout.
>> >
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> Groups
>> > "amp-infra" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> an
>> > email to amp-infra+unsubscr...@googlegroups.com.
>> > For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "amp-infra" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to amp-infra+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>


Re: BUILD SYSTEM: amp-jenkins-worker-05 offline

2015-10-19 Thread Patrick Wendell
Hey Shane,

It also appears that every Spark build is failing right now. Could it be
related to your changes?

- Patrick

On Mon, Oct 19, 2015 at 11:13 AM, shane knapp  wrote:

> worker 05 is back up now...  looks like the machine OOMed and needed
> to be kicked.
>
> On Mon, Oct 19, 2015 at 9:39 AM, shane knapp  wrote:
> > i'll have to head down to the colo and see what's up with it...  it
> > seems to be wedged (pings ok, can't ssh in) and i'll update the list
> > when i figure out what's wrong.
> >
> > i don't think it caught fire (#toosoon?), because everything else is
> > up and running.  :)
> >
> > shane
>
> --
> You received this message because you are subscribed to the Google Groups
> "amp-infra" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to amp-infra+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>


[jira] [Assigned] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reassigned SPARK-11070:
---

Assignee: Patrick Wendell

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>    Assignee: Patrick Wendell
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-16 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961515#comment-14961515
 ] 

Patrick Wendell commented on SPARK-11070:
-

I removed them - I did leave 1.5.0 for now, but we can remove it in a bit - 
just because 1.5.1 is so new.

{code}
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.1.1 -m "Remving 
Spark 1.1.1 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.2.1 -m "Remving 
Spark 1.2.1 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.2.2 -m "Remving 
Spark 1.2.2 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.3.0 -m "Remving 
Spark 1.3.0 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.4.0 -m "Remving 
Spark 1.4.0 release"
{code}

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>Assignee: Patrick Wendell
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-11070.
-
Resolution: Fixed

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>    Assignee: Patrick Wendell
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-10-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10877:

Assignee: Davies Liu

> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>Assignee: Davies Liu
> Attachments: SparkFilterByKeyTest.scala
>
>
> I have some code that I’m running in a unit test suite, but the code I’m 
> running is failing with an assertion error.
> I have translated the JUnit test that was failing, to a Scala script that I 
> will attach to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false negative?
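
For context on question 2: {{hashUnsafeWords}} only accepts whole 8-byte words, and
the usual remedy is to round the buffer length up to the next multiple of 8 and
zero-pad it. The helper below is illustrative only, not Spark's actual fix:

{code}
// Round a byte length up to the next multiple of 8 (one "word").
def roundUpToWord(lengthInBytes: Int): Int = (lengthInBytes + 7) & ~7

assert(roundUpToWord(12) == 16)  // the 12-byte payload from this report
assert(roundUpToWord(16) == 16)  // already-aligned lengths are unchanged
{code}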



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Is "mllib" no longer Experimental?

2015-10-14 Thread Patrick Wendell
I would tend to agree with this approach. We should audit all
@Experimental labels before the 1.6 release and clear them out when
appropriate.

- Patrick

On Wed, Oct 14, 2015 at 2:13 AM, Sean Owen  wrote:

> Someone asked, is "ML pipelines" stable? I said, no, most of the key
> classes are still marked @Experimental, which matches my impression that
> things may still be subject to change.
>
> But then, I see that MLlib classes, which are de facto not seeing much
> further work and no API change, are also mostly marked @Experimental. If,
> generally, no more significant work is going into MLlib classes, is it time
> to remove most or all of those labels, to keep it meaningful?
>
> Sean
>


Re: Status of SBT Build

2015-10-14 Thread Patrick Wendell
Hi Jakob,

There is a temporary issue with the Scala 2.11 build in SBT. The problem is
this wasn't previously covered by our automated tests so it broke without
us knowing - this has been actively discussed on the dev list in the last
24 hours. I am trying to get it working in our test harness today.

In terms of fixing the underlying issues, I am not sure whether there is a
JIRA for it yet, but we should make one if not. Does anyone know?

- Patrick

On Wed, Oct 14, 2015 at 12:13 PM, Jakob Odersky  wrote:

> Hi everyone,
>
> I've been having trouble building Spark with SBT recently. Scala 2.11
> doesn't work and in all cases I get large amounts of warnings and even
> errors on tests.
>
> I was therefore wondering what the official status of spark with sbt is?
> Is it very new and still buggy or unmaintained and "falling to pieces"?
>
> In any case, I would be glad to help with any issues on setting up a clean
> and working build with sbt.
>
> thanks,
> --Jakob
>


[jira] [Updated] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11110:

Assignee: Jakob Odersky

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-11110
> URL: https://issues.apache.org/jira/browse/SPARK-11110
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>        Reporter: Patrick Wendell
>Assignee: Jakob Odersky
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).
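
For reference, the fix the 2.11 compiler is hinting at looks roughly like the sketch
below (class names are made up; the real NettyRpcEndpointRef change may differ):

{code}
import scala.annotation.meta.param

// Option 1: follow the compiler hint and target the annotation explicitly
// at the constructor parameter.
class EndpointRefA(@(transient @param) conf: AnyRef)

// Option 2: make the parameter a real (private) field, which is a valid
// target for @transient on its own.
class EndpointRefB(@transient private val conf: AnyRef)
{code}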



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11110:

Priority: Critical  (was: Major)

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-11110
> URL: https://issues.apache.org/jira/browse/SPARK-11110
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>        Reporter: Patrick Wendell
>Assignee: Jakob Odersky
>Priority: Critical
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-14 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-11110:
---

 Summary: Scala 2.11 build fails due to compiler errors
 Key: SPARK-11110
 URL: https://issues.apache.org/jira/browse/SPARK-11110
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell


Right now the 2.11 build is failing due to compiler errors in SBT (though not 
in Maven). I have updated our 2.11 compile test harness to catch this.

https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull

{code}
[error] 
/home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
 no valid targets for annotation on value conf - it is discarded unused. You 
may specify targets with meta-annotations, e.g. @(transient @param)
[error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
[error] 
{code}

This is one error, but there may be others past this point (the compile fails 
fast).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11115) IPv6 regression

2015-10-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958078#comment-14958078
 ] 

Patrick Wendell edited comment on SPARK-11115 at 10/15/15 12:38 AM:


The title of this says "Regression" - did it regress from a previous version? I 
am going to update the title, let me know if there is any issue.


was (Author: pwendell):
The title of this says "Regression" - did it regression from a previous 
version? I am going to update the title, let me know if there is any issue.

> IPv6 regression
> ---
>
> Key: SPARK-11115
> URL: https://issues.apache.org/jira/browse/SPARK-11115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }
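
To illustrate why the current ':'-based check breaks for IPv6 and how the suggested
Guava parser behaves (illustrative only, not proposed Spark code):

{code}
import com.google.common.net.HostAndPort

// IPv4-style host:port - one colon, hasPort is true.
assert(HostAndPort.fromString("host1:7077").hasPort)

// A bracketed IPv6 literal with a port parses correctly as well.
assert(HostAndPort.fromString("[2001:db8::1]:7077").hasPort)

// A bare IPv6 address contains colons but has no port, which is exactly
// the case a naive "host must not contain ':'" assertion rejects.
assert(!HostAndPort.fromString("2001:db8::1").hasPort)
{code}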



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11081) Shade Jersey dependency to work around the compatibility issue with Jersey2

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11081:

Component/s: Build

> Shade Jersey dependency to work around the compatibility issue with Jersey2
> ---
>
> Key: SPARK-11081
> URL: https://issues.apache.org/jira/browse/SPARK-11081
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Reporter: Mingyu Kim
>
> As seen from this thread 
> (https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCALte62yD8H3=2KVMiFs7NZjn929oJ133JkPLrNEj=vrx-d2...@mail.gmail.com%3E),
>  Spark is incompatible with Jersey 2, especially when Spark is embedded in an 
> application running with Jersey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11092) Add source URLs to API documentation.

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11092:

Assignee: Jakob Odersky

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Assignee: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.
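
A rough sketch of the kind of setting meant here, written for a plain scaladoc task
(Spark's real build goes through the sbt-unidoc plugin, so the exact key differs;
€{FILE_PATH} is scaladoc's per-file placeholder for the source path):

{code}
// build.sbt (sketch; Spark's actual build wires this through sbt-unidoc)
scalacOptions in (Compile, doc) ++= Seq(
  "-doc-source-url",
  s"https://github.com/apache/spark/tree/v${version.value}/€{FILE_PATH}.scala"
)
{code}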



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11115) Host verification is not correct for IPv6

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11115:

Summary: Host verification is not correct for IPv6  (was: IPv6 regression)

> Host verification is not correct for IPv6
> -
>
> Key: SPARK-11115
> URL: https://issues.apache.org/jira/browse/SPARK-11115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11115) IPv6 regression

2015-10-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958078#comment-14958078
 ] 

Patrick Wendell commented on SPARK-11115:
-

The title of this says "Regression" - did it regression from a previous 
version? I am going to update the title, let me know if there is any issue.

> IPv6 regression
> ---
>
> Key: SPARK-11115
> URL: https://issues.apache.org/jira/browse/SPARK-11115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: CentOS 6.7, Java 1.8.0_25, dual stack IPv4 + IPv6
>Reporter: Thomas Dudziak
>Priority: Critical
>
> When running Spark with -Djava.net.preferIPv6Addresses=true, I get this error:
> 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext.
> java.lang.AssertionError: assertion failed: Expected hostname
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.util.Utils$.checkHost(Utils.scala:805)
>   at 
> org.apache.spark.storage.BlockManagerId.<init>(BlockManagerId.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:107)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:190)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> Looking at the code in question, it seems that the code will only work for 
> IPv4 as it assumes ':' can't be part of the hostname (which it clearly can 
> for IPv6 addresses).
> Instead, the code should probably use Guava's HostAndPort class, i.e.:
>   def checkHost(host: String, message: String = "") {
> assert(!HostAndPort.fromString(host).hasPort, message)
>   }
>   def checkHostPort(hostPort: String, message: String = "") {
> assert(HostAndPort.fromString(hostPort).hasPort, message)
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11006:

Component/s: SQL

> Rename NullColumnAccess as NullColumnAccessor
> -
>
> Key: SPARK-11006
> URL: https://issues.apache.org/jira/browse/SPARK-11006
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Trivial
> Fix For: 1.6.0
>
>
> In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala,
> NullColumnAccess should be renamed as NullColumnAccessor so that the same
> convention is adhered to for the accessors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11111) Fast null-safe join

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11111:

Component/s: SQL

> Fast null-safe join
> ---
>
> Key: SPARK-11111
> URL: https://issues.apache.org/jira/browse/SPARK-11111
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Today, null safe joins are executed with a Cartesian product.
> {code}
> scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
> == Physical Plan ==
> TungstenProject [i#2,j#3,i#7,j#8]
>  Filter (i#2 <=> i#7)
>   CartesianProduct
>LocalTableScan [i#2,j#3], [[1,1]]
>LocalTableScan [i#7,j#8], [[1,1]]
> {code}
> One option is to add this rewrite to the optimizer:
> {code}
> select * 
> from t a 
> join t b 
>   on coalesce(a.i, ) = coalesce(b.i, ) AND (a.i <=> b.i)
> {code}
> Acceptance criteria: joins with only null safe equality should not result in 
> a Cartesian product.
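
For illustration, a minimal sketch of applying the proposed rewrite by hand on a 1.5-era SQLContext. The table t(i, j) matches the plan above; the literal 0 used as the coalesce default is only a placeholder, and any constant works because the null-safe predicate still decides the final match.

{code}
// The plain equality on coalesced keys gives the planner an equi-join key,
// so it should choose a hash or sort-merge join instead of CartesianProduct;
// the (a.i <=> b.i) predicate keeps the semantics identical to the original.
val rewritten = sqlContext.sql("""
  select *
  from t a join t b
    on coalesce(a.i, 0) = coalesce(b.i, 0) and (a.i <=> b.i)
""")
rewritten.explain()
{code}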



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11056) Improve documentation on how to build Spark efficiently

2015-10-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-11056:

Component/s: Documentation

> Improve documentation on how to build Spark efficiently
> ---
>
> Key: SPARK-11056
> URL: https://issues.apache.org/jira/browse/SPARK-11056
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> Slow build times are a common pain point for new Spark developers.  We should 
> improve the main documentation on building Spark to describe how to make 
> building Spark less painful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6230) Provide authentication and encryption for Spark's RPC

2015-10-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954523#comment-14954523
 ] 

Patrick Wendell commented on SPARK-6230:


Should we update Spark's documentation to explain this? I think at present it 
only discusses encrypted RPC via akka. But this will be the new recommended way 
to encrypt RPC.

> Provide authentication and encryption for Spark's RPC
> -
>
> Key: SPARK-6230
> URL: https://issues.apache.org/jira/browse/SPARK-6230
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Marcelo Vanzin
>
> Make sure the RPC layer used by Spark supports the auth and encryption 
> features of the network/common module.
> This kinda ignores akka; adding support for SASL to akka, while possible, 
> seems to be at odds with the direction being taken in Spark, so let's 
> restrict this to the new RPC layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-12 Thread Patrick Wendell
It's really easy to create and modify those builds. If the issue is that we
need to add SBT or Maven to the existing one, it's a short change. We can
just have it build both of them. I wasn't aware of things breaking before
in one build but not another.

- Patrick

On Mon, Oct 12, 2015 at 9:21 AM, Sean Owen <so...@cloudera.com> wrote:

> Yeah, was the issue that it had to be built vs Maven to show the error
> and this uses SBT -- or vice versa? that's why the existing test
> didn't detect it. Was just thinking of adding one more of these non-PR
> builds, but I forget if there was a reason this is hard. Certainly not
> worth building for each PR.
>
> On Mon, Oct 12, 2015 at 5:16 PM, Patrick Wendell <pwend...@gmail.com>
> wrote:
> > We already do automated compile testing for Scala 2.11 similar to Hadoop
> > versions:
> >
> > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/
> >
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-master-Scala211-Compile/buildTimeTrend
> >
> >
> > If you look, this build takes 7-10 minutes, so it's a nontrivial
> increase to
> > add it to all new PR's. Also, it's only broken once in the last few
> months
> > (despite many patches going in) - a pretty low failure rate. For
> scenarios
> > like this it's better to test it asynchronously. We can even just revert
> a
> > patch immediately if it's found to break 2.11.
> >
> > Put another way - we typically have 1000 patches or more per release.
> Even
> > at one jenkins run per patch: 7 minutes * 1000 = 7 days of developer
> > productivity loss. Compare that to having a few times where we have to
> > revert a patch and ask someone to resubmit (which maybe takes at most one
> > hour)... it's not worth it.
> >
> > - Patrick
> >
> > On Mon, Oct 12, 2015 at 8:24 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> There are many Jenkins jobs besides the pull request builder that
> >> build against various Hadoop combinations, for example, in the
> >> background. Is there an obstacle to building vs 2.11 on both Maven and
> >> SBT this way?
> >>
> >> On Mon, Oct 12, 2015 at 2:55 PM, Iulian Dragoș
> >> <iulian.dra...@typesafe.com> wrote:
> >> > Anything that can be done by a machine should be done by a machine. I
> am
> >> > not
> >> > sure we have enough data to say it's only once or twice per release,
> and
> >> > even if we were to issue a PR for each breakage, it's additional load
> on
> >> > committers and reviewers, not to mention our own work. I personally
> >> > don't
> >> > see how 2-3 minutes of compute time per PR can justify hours of work
> >> > plus
> >> > reviews.
> >
> >
>


Re: [ANNOUNCE] Announcing Spark 1.5.1

2015-10-11 Thread Patrick Wendell
I think Daniel is correct here. The source artifact incorrectly includes
jars. It is inadvertent and not part of our intended release process. This
was something I noticed in Spark 1.5.0 and filed a JIRA and was fixed by
updating our build scripts to fix it. However, our build environment was
not using the most current version of the build scripts. See related links:

https://issues.apache.org/jira/browse/SPARK-10511
https://github.com/apache/spark/pull/8774/files

I can update our build environment and we can repackage the Spark 1.5.1
source tarball. To not include sources.

- Patrick

On Sun, Oct 11, 2015 at 8:53 AM, Sean Owen  wrote:

> Daniel: we did not vote on a tag. Please again read the VOTE email I
> linked to you:
>
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-1-RC1-tt14310.html#none
>
> among other things, it contains a link to the concrete source (and
> binary) distribution under vote:
>
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> You can still examine it, sure.
>
> Dependencies are *not* bundled in the source release. You're again
> misunderstanding what you are seeing. Read my email again.
>
> I am still pretty confused about what the problem is. This is entirely
> business as usual for ASF projects. I'll follow up with you offline if
> you have any more doubts.
>
> On Sun, Oct 11, 2015 at 4:49 PM, Daniel Gruno 
> wrote:
> > Here's my issue:
> >
> > How am I to audit that the dependencies you bundle are in fact what you
> > claim they are?  How do I know they don't contain malware or - in light
> > of recent events - emissions test rigging? ;)
> >
> > I am not interested in a git tag - that means nothing in the ASF voting
> > process, you cannot vote on a tag, only on a release candidate. The VCS
> > in use is irrelevant in this issue. If you can point me to a release
> > candidate archive that was voted upon and does not contain binary
> > applications, all is well.
> >
> > If there is no such thing, and we cannot come to an understanding, I
> > will exercise my ASF Members' rights and bring this to the attention of
> > the board of directors and ask for a clarification of the legality of
> this.
> >
> > I find it highly irregular. Perhaps it is something some projects do in
> > the Java community, but that doesn't make it permissible in my view.
> >
> > With regards,
> > Daniel.
> >
> >
> > On 10/11/2015 05:42 PM, Sean Owen wrote:
> >> Still confused. Why are you saying we didn't vote on an archive? refer
> >> to the email I linked, which includes both the git tag and a link to
> >> all generated artifacts (also in my email).
> >>
> >> So, there are two things at play here:
> >>
> >> First, I am not sure what you mean that a source distro can't have
> >> binary files. It's supposed to have the source code of Spark, and
> >> shouldn't contain binary Spark. Nothing you listed are Spark binaries.
> >> However, a distribution might have a lot of things in it that support
> >> the source build, like copies of tools, test files, etc.  That
> >> explains I think the first couple lines that you identified.
> >>
> >> Still, I am curious why you are saying that would invalidate a source
> >> release? I have never heard anything like that.
> >>
> >> Second, I do think there are some binaries in here that aren't
> >> supposed to be there, like the build/ directory stuff. IIRC these were
> >> included accidentally and won't be in the next release. At least, I
> >> don't see why they need to be bundled. These are just local copies of
> >> third party tools though, and don't really matter. As it happens, the
> >> licenses that get distributed with the source distro even cover all of
> >> this stuff. I think that's not supposed to be there, but, also don't
> >> see it's 'invalid' as a result.
> >>
> >>
> >> On Sun, Oct 11, 2015 at 4:33 PM, Daniel Gruno 
> wrote:
> >>> On 10/11/2015 05:29 PM, Sean Owen wrote:
>  Of course, but what's making you think this was a binary-only
>  distribution?
> >>>
> >>> I'm not saying binary-only, I am saying your source release contains
> >>> binary programs, which would invalidate a release vote. Is there a
> >>> release candidate package, that is voted on (saying you have a git tag
> >>> does not satisfy this criteria, you need to vote on an actual archive
> of
> >>> files, otherwise there is no cogent proof of the release being from
> that
> >>> specific git tag).
> >>>
> >>> Here's what I found in your source release:
> >>>
> >>> Binary application (application/jar; charset=binary) found in
> >>> spark-1.5.1/sql/hive/src/test/resources/data/files/TestSerDe.jar
> >>>
> >>> Binary application (application/jar; charset=binary) found in
> >>>
> spark-1.5.1/sql/hive/src/test/resources/regression-test-SPARK-8489/test.jar
> >>>
> >>> Binary application (application/jar; charset=binary) found in
> >>> 

Re: [ANNOUNCE] Announcing Spark 1.5.1

2015-10-11 Thread Patrick Wendell
*to not include binaries.

On Sun, Oct 11, 2015 at 9:35 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> I think Daniel is correct here. The source artifact incorrectly includes
> jars. It is inadvertent and not part of our intended release process. This
> was something I noticed in Spark 1.5.0 and filed a JIRA and was fixed by
> updating our build scripts to fix it. However, our build environment was
> not using the most current version of the build scripts. See related links:
>
> https://issues.apache.org/jira/browse/SPARK-10511
> https://github.com/apache/spark/pull/8774/files
>
> I can update our build environment and we can repackage the Spark 1.5.1
> source tarball. To not include sources.
>
> - Patrick
>
> On Sun, Oct 11, 2015 at 8:53 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Daniel: we did not vote on a tag. Please again read the VOTE email I
>> linked to you:
>>
>>
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-1-RC1-tt14310.html#none
>>
>> among other things, it contains a link to the concrete source (and
>> binary) distribution under vote:
>>
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>
>> You can still examine it, sure.
>>
>> Dependencies are *not* bundled in the source release. You're again
>> misunderstanding what you are seeing. Read my email again.
>>
>> I am still pretty confused about what the problem is. This is entirely
>> business as usual for ASF projects. I'll follow up with you offline if
>> you have any more doubts.
>>
>> On Sun, Oct 11, 2015 at 4:49 PM, Daniel Gruno <humbed...@apache.org>
>> wrote:
>> > Here's my issue:
>> >
>> > How am I to audit that the dependencies you bundle are in fact what you
>> > claim they are?  How do I know they don't contain malware or - in light
>> > of recent events - emissions test rigging? ;)
>> >
>> > I am not interested in a git tag - that means nothing in the ASF voting
>> > process, you cannot vote on a tag, only on a release candidate. The VCS
>> > in use is irrelevant in this issue. If you can point me to a release
>> > candidate archive that was voted upon and does not contain binary
>> > applications, all is well.
>> >
>> > If there is no such thing, and we cannot come to an understanding, I
>> > will exercise my ASF Members' rights and bring this to the attention of
>> > the board of directors and ask for a clarification of the legality of
>> this.
>> >
>> > I find it highly irregular. Perhaps it is something some projects do in
>> > the Java community, but that doesn't make it permissible in my view.
>> >
>> > With regards,
>> > Daniel.
>> >
>> >
>> > On 10/11/2015 05:42 PM, Sean Owen wrote:
>> >> Still confused. Why are you saying we didn't vote on an archive? refer
>> >> to the email I linked, which includes both the git tag and a link to
>> >> all generated artifacts (also in my email).
>> >>
>> >> So, there are two things at play here:
>> >>
>> >> First, I am not sure what you mean that a source distro can't have
>> >> binary files. It's supposed to have the source code of Spark, and
>> >> shouldn't contain binary Spark. Nothing you listed are Spark binaries.
>> >> However, a distribution might have a lot of things in it that support
>> >> the source build, like copies of tools, test files, etc.  That
>> >> explains I think the first couple lines that you identified.
>> >>
>> >> Still, I am curious why you are saying that would invalidate a source
>> >> release? I have never heard anything like that.
>> >>
>> >> Second, I do think there are some binaries in here that aren't
>> >> supposed to be there, like the build/ directory stuff. IIRC these were
>> >> included accidentally and won't be in the next release. At least, I
>> >> don't see why they need to be bundled. These are just local copies of
>> >> third party tools though, and don't really matter. As it happens, the
>> >> licenses that get distributed with the source distro even cover all of
>> >> this stuff. I think that's not supposed to be there, but, also don't
>> >> see it's 'invalid' as a result.
>> >>
>> >>
>> >> On Sun, Oct 11, 2015 at 4:33 PM, Daniel Gruno <humbed...@apache.org>
>> wrote:
>> >>> On 10/11/2015 05:29 PM, Sea

Re: [ANNOUNCE] Announcing Spark 1.5.1

2015-10-11 Thread Patrick Wendell
Oh I see - yes it's the build/. I always thought release votes related to a
source tag rather than specific binaries. But maybe we can just fix it in
1.5.2 if there is concern about mutating binaries. It seems reasonable to
me.

For tests... in the past we've tried to avoid having jars inside of the
source tree, including some effort to generate jars on the fly which a lot
of our tests use. I am not sure whether it's a firm policy that you can't
have jars in test folders, though. If it is, we could probably do some
magic to get rid of these few ones that have crept in.

- Patrick

On Sun, Oct 11, 2015 at 9:57 PM, Sean Owen <so...@cloudera.com> wrote:

> Agree, but we are talking about the build/ bit right?
>
> I don't agree that it invalidates the release, which is probably the more
> important idea. As a point of process, you would not want to modify and
> republish the artifact that was already released after being voted on -
> unless it was invalid in which case we spin up 1.5.1.1 or something.
>
> But that build/ directory should go in future releases.
>
> I think he is talking about more than this though and the other jars look
> like they are part of tests, and still nothing to do with Spark binaries.
> Those can and should stay.
>
> On Mon, Oct 12, 2015, 5:35 AM Patrick Wendell <pwend...@gmail.com> wrote:
>
>> I think Daniel is correct here. The source artifact incorrectly includes
>> jars. It is inadvertent and not part of our intended release process. This
>> was something I noticed in Spark 1.5.0 and filed a JIRA and was fixed by
>> updating our build scripts to fix it. However, our build environment was
>> not using the most current version of the build scripts. See related links:
>>
>> https://issues.apache.org/jira/browse/SPARK-10511
>> https://github.com/apache/spark/pull/8774/files
>>
>> I can update our build environment and we can repackage the Spark 1.5.1
>> source tarball. To not include sources.
>>
>>
>> - Patrick
>>
>> On Sun, Oct 11, 2015 at 8:53 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Daniel: we did not vote on a tag. Please again read the VOTE email I
>>> linked to you:
>>>
>>>
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-1-RC1-tt14310.html#none
>>>
>>> among other things, it contains a link to the concrete source (and
>>> binary) distribution under vote:
>>>
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>
>>> You can still examine it, sure.
>>>
>>> Dependencies are *not* bundled in the source release. You're again
>>> misunderstanding what you are seeing. Read my email again.
>>>
>>> I am still pretty confused about what the problem is. This is entirely
>>> business as usual for ASF projects. I'll follow up with you offline if
>>> you have any more doubts.
>>>
>>> On Sun, Oct 11, 2015 at 4:49 PM, Daniel Gruno <humbed...@apache.org>
>>> wrote:
>>> > Here's my issue:
>>> >
>>> > How am I to audit that the dependencies you bundle are in fact what you
>>> > claim they are?  How do I know they don't contain malware or - in light
>>> > of recent events - emissions test rigging? ;)
>>> >
>>> > I am not interested in a git tag - that means nothing in the ASF voting
>>> > process, you cannot vote on a tag, only on a release candidate. The VCS
>>> > in use is irrelevant in this issue. If you can point me to a release
>>> > candidate archive that was voted upon and does not contain binary
>>> > applications, all is well.
>>> >
>>> > If there is no such thing, and we cannot come to an understanding, I
>>> > will exercise my ASF Members' rights and bring this to the attention of
>>> > the board of directors and ask for a clarification of the legality of
>>> this.
>>> >
>>> > I find it highly irregular. Perhaps it is something some projects do in
>>> > the Java community, but that doesn't make it permissible in my view.
>>> >
>>> > With regards,
>>> > Daniel.
>>> >
>>> >
>>> > On 10/11/2015 05:42 PM, Sean Owen wrote:
>>> >> Still confused. Why are you saying we didn't vote on an archive? refer
>>> >> to the email I linked, which includes both the git tag and a link to
>>> >> all generated artifacts (also in my email).
>>> >>
>>> >> So, there are two things at play here:

Re: [ANNOUNCE] Announcing Spark 1.5.1

2015-10-11 Thread Patrick Wendell
Yeah I mean I definitely think we're not violating the *spirit* of the "no
binaries" policy, in that we do not include any binary code that is used at
runtime. This is because the binaries we distribute relate only to build
and testing.

Whether we are violating the *letter* of the policy, I'm not so sure. In
the very strictest interpretation of "there cannot be any binary files in
your downloaded tarball" - we aren't honoring that. We got a lot of people
complaining about the sbt jar for instance when we were in the incubator. I
found those complaints a little pedantic, but we ended up removing it from
our source tree and adding things to download it for the user.

- Patrick

On Sun, Oct 11, 2015 at 10:12 PM, Sean Owen <so...@cloudera.com> wrote:

> No we are voting on the artifacts being released (too) in principle.
> Although of course the artifacts should be a deterministic function of the
> source at a certain point in time.
>
> I think the concern is about putting Spark binaries or its dependencies
> into a source release. That should not happen, but it is not what has
> happened here.
>
> On Mon, Oct 12, 2015, 6:03 AM Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Oh I see - yes it's the build/. I always thought release votes related to
>> a source tag rather than specific binaries. But maybe we can just fix it in
>> 1.5.2 if there is concern about mutating binaries. It seems reasonable to
>> me.
>>
>> For tests... in the past we've tried to avoid having jars inside of the
>> source tree, including some effort to generate jars on the fly which a lot
>> of our tests use. I am not sure whether it's a firm policy that you can't
>> have jars in test folders, though. If it is, we could probably do some
>> magic to get rid of these few ones that have crept in.
>>
>> - Patrick
>>
>> On Sun, Oct 11, 2015 at 9:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Agree, but we are talking about the build/ bit right?
>>>
>>> I don't agree that it invalidates the release, which is probably the
>>> more important idea. As a point of process, you would not want to modify
>>> and republish the artifact that was already released after being voted on -
>>> unless it was invalid in which case we spin up 1.5.1.1 or something.
>>>
>>> But that build/ directory should go in future releases.
>>>
>>> I think he is talking about more than this though and the other jars
>>> look like they are part of tests, and still nothing to do with Spark
>>> binaries. Those can and should stay.
>>>
>>> On Mon, Oct 12, 2015, 5:35 AM Patrick Wendell <pwend...@gmail.com>
>>> wrote:
>>>
>>>> I think Daniel is correct here. The source artifact incorrectly
>>>> includes jars. It is inadvertent and not part of our intended release
>>>> process. This was something I noticed in Spark 1.5.0 and filed a JIRA and
>>>> was fixed by updating our build scripts to fix it. However, our build
>>>> environment was not using the most current version of the build scripts.
>>>> See related links:
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-10511
>>>> https://github.com/apache/spark/pull/8774/files
>>>>
>>>> I can update our build environment and we can repackage the Spark 1.5.1
>>>> source tarball. To not include sources.
>>>>
>>>>
>>>> - Patrick
>>>>
>>>> On Sun, Oct 11, 2015 at 8:53 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> Daniel: we did not vote on a tag. Please again read the VOTE email I
>>>>> linked to you:
>>>>>
>>>>>
>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-1-RC1-tt14310.html#none
>>>>>
>>>>> among other things, it contains a link to the concrete source (and
>>>>> binary) distribution under vote:
>>>>>
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>>>
>>>>> You can still examine it, sure.
>>>>>
>>>>> Dependencies are *not* bundled in the source release. You're again
>>>>> misunderstanding what you are seeing. Read my email again.
>>>>>
>>>>> I am still pretty confused about what the problem is. This is entirely
>>>>> business as usual for ASF projects. I'll follow up with you offline if
>>>>> you have any more doubts.
>>>>>
>

Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-09 Thread Patrick Wendell
I would push back slightly. The reason we have the PR builds taking so long
is death by a million small things that we add. Doing a full 2.11 compile
takes on the order of minutes... it's a nontrivial increase to the build times.

It doesn't seem that bad to me to go back post-hoc once in a while and fix
2.11 bugs when they come up. It's on the order of once or twice per release
and the typesafe guys keep a close eye on it (thanks!). Compare that to
literally thousands of PR runs and a few minutes every time, IMO it's not
worth it.

On Fri, Oct 9, 2015 at 3:31 PM, Hari Shreedharan 
wrote:

> +1, much better than having a new PR each time to fix something for
> scala-2.11 every time a patch breaks it.
>
> Thanks,
> Hari Shreedharan
>
>
>
>
> On Oct 9, 2015, at 11:47 AM, Michael Armbrust 
> wrote:
>
> How about just fixing the warning? I get it; it doesn't stop this from
>> happening again, but still seems less drastic than tossing out the
>> whole mechanism.
>>
>
> +1
>
> It also does not seem that expensive to test only compilation for Scala
> 2.11 on PR builds.
>
>
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Patrick Wendell
I don't think we have a firm contract around that. So far we've never
removed old artifacts, but the ASF has asked us at times to decrease the
size of binaries we post. In the future at some point we may drop older
ones since we keep adding new ones.

If downstream projects are depending on our artifacts, I'd say just hold
tight for now until something changes. If it changes, then those projects
might need to build Spark on their own and host older hadoop versions, etc.

On Wed, Oct 7, 2015 at 9:59 AM, Nicholas Chammas <nicholas.cham...@gmail.com
> wrote:

> Thanks guys.
>
> Regarding this earlier question:
>
> More importantly, is there some rough specification for what packages we
> should be able to expect in this S3 bucket with every release?
>
> Is the implied answer that we should continue to expect the same set of
> artifacts for every release for the foreseeable future?
>
> Nick
> ​
>
> On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell <pwend...@gmail.com> wrote:
>
>> The missing artifacts are uploaded now. Things should propagate in the
>> next 24 hours. If there are still issues past then ping this thread. Thanks!
>>
>> - Patrick
>>
>> On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Thanks for looking into this Josh.
>>>
>>> On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen <joshro...@databricks.com>
>>> wrote:
>>>
>>>> I'm working on a fix for this right now. I'm planning to re-run a
>>>> modified copy of the release packaging scripts which will emit only the
>>>> missing artifacts (so we won't upload new artifacts with different SHAs for
>>>> the builds which *did* succeed).
>>>>
>>>> I expect to have this finished in the next day or so; I'm currently
>>>> blocked by some infra downtime but expect that to be resolved soon.
>>>>
>>>> - Josh
>>>>
>>>> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Blaž said:
>>>>>
>>>>> Also missing is
>>>>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>>>>> which breaks spark-ec2 script.
>>>>>
>>>>> This is the package I am referring to in my original email.
>>>>>
>>>>> Nick said:
>>>>>
>>>>> It appears that almost every version of Spark up to and including
>>>>> 1.5.0 has included a —bin-hadoop1.tgz release (e.g.
>>>>> spark-1.5.0-bin-hadoop1.tgz). However, 1.5.1 has no such package.
>>>>>
>>>>> Nick
>>>>> ​
>>>>>
>>>>> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl <snud...@gmail.com> wrote:
>>>>>
>>>>>> Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
>>>>>> 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.
>>>>>>
>>>>>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>>>>>>>
>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>>>>>
>>>>>>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I’m looking here:
>>>>>>>>
>>>>>>>> https://s3.amazonaws.com/spark-related-packages/
>>>>>>>>
>>>>>>>> I believe this is where one set of official packages is published.
>>>>>>>> Please correct me if this is not the case.
>>>>>>>>
>>>>>>>> It appears that almost every version of Spark up to and including
>>>>>>>> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
>>>>>>>> spark-1.5.0-bin-hadoop1.tgz).
>>>>>>>>
>>>>>>>> However, 1.5.1 has no such package. There is a
>>>>>>>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a
>>>>>>>> separate thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>>>>>>>
>>>>>>>> Was this intentional?
>>>>>>>>
>>>>>>>> More importantly, is there some rough specification for what
>>>>>>>> packages we should be able to expect in this S3 bucket with every 
>>>>>>>> release?
>>>>>>>>
>>>>>>>> This is important for those of us who depend on this publishing
>>>>>>>> venue (e.g. spark-ec2 and related tools).
>>>>>>>>
>>>>>>>> Nick
>>>>>>>> ​
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>


Re: Adding Spark Testing functionality

2015-10-06 Thread Patrick Wendell
Hey Holden,

It would be helpful if you could outline the set of features you'd imagine
being part of Spark in a short doc. I didn't see a README on the existing
repo, so it's hard to know exactly what is being proposed.

As a general point of process, we've typically avoided merging modules into
Spark that can exist outside of the project. A testing utility package that
is based on Spark's public APIs seems like a really useful thing for the
community, but it does seem like a good fit for a package library. At
least, this is my first question after taking a look at the project.

In any case, getting some high level view of the functionality you imagine
would be helpful to give more detailed feedback.

- Patrick

On Tue, Oct 6, 2015 at 3:12 PM, Holden Karau  wrote:

> Hi Spark Devs,
>
> So this has been brought up a few times before, and generally on the user
> list people get directed to use spark-testing-base. I'd like to start
> moving some of spark-testing-base's functionality into Spark so that people
> don't need a library to do what is (hopefully :p) a very common requirement
> across all Spark projects.
>
> To that end I was wondering what peoples thoughts are on where this should
> live inside of Spark. I was thinking it could either be a separate testing
> project (like sql or similar), or just put the bits to enable testing
> inside of each relevant project.
>
> I was also thinking it probably makes sense to only move the unit testing
> parts at the start and leave things like integration testing in a testing
> project since that could vary depending on the users environment.
>
> What are peoples thoughts?
>
> Cheers,
>
> Holden :)
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Patrick Wendell
The missing artifacts are uploaded now. Things should propagate in the next
24 hours. If there are still issues past then ping this thread. Thanks!

- Patrick

On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas  wrote:

> Thanks for looking into this Josh.
>
> On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen 
> wrote:
>
>> I'm working on a fix for this right now. I'm planning to re-run a
>> modified copy of the release packaging scripts which will emit only the
>> missing artifacts (so we won't upload new artifacts with different SHAs for
>> the builds which *did* succeed).
>>
>> I expect to have this finished in the next day or so; I'm currently
>> blocked by some infra downtime but expect that to be resolved soon.
>>
>> - Josh
>>
>> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Blaž said:
>>>
>>> Also missing is
>>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>>> which breaks spark-ec2 script.
>>>
>>> This is the package I am referring to in my original email.
>>>
>>> Nick said:
>>>
>>> It appears that almost every version of Spark up to and including 1.5.0
>>> has included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
>>> However, 1.5.1 has no such package.
>>>
>>> Nick
>>> ​
>>>
>>> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl  wrote:
>>>
 Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.

 On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu  wrote:

> hadoop1 package for Scala 2.10 wasn't in RC1 either:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I’m looking here:
>>
>> https://s3.amazonaws.com/spark-related-packages/
>>
>> I believe this is where one set of official packages is published.
>> Please correct me if this is not the case.
>>
>> It appears that almost every version of Spark up to and including
>> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
>> spark-1.5.0-bin-hadoop1.tgz).
>>
>> However, 1.5.1 has no such package. There is a
>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a
>> separate thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>
>> Was this intentional?
>>
>> More importantly, is there some rough specification for what packages
>> we should be able to expect in this S3 bucket with every release?
>>
>> This is important for those of us who depend on this publishing venue
>> (e.g. spark-ec2 and related tools).
>>
>> Nick
>> ​
>>
>
>

>>


Re: Spark 1.6 Release window is not updated in Spark-wiki

2015-10-01 Thread Patrick Wendell
BTW - the merge window for 1.6 is September+October. The QA window is
November and we'll expect to ship probably early december. We are on a
3 month release cadence, with the caveat that there is some
pipelining... as we finish release X we are already starting on
release X+1.

- Patrick

On Thu, Oct 1, 2015 at 11:30 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> Ah - I can update it. Usually I do it after the release is cut. It's
> just a standard 3 month cadence.
>
> On Thu, Oct 1, 2015 at 3:55 AM, Sean Owen <so...@cloudera.com> wrote:
>> My guess is that the 1.6 merge window should close at the end of
>> November (2 months from now)? I can update it but wanted to check if
>> anyone else has a preferred tentative plan.
>>
>> On Thu, Oct 1, 2015 at 2:20 AM, Meethu Mathew <meethu.mat...@flytxt.com> 
>> wrote:
>>> Hi,
>>> In the https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage the
>>> current release window has not been changed from 1.5. Can anybody give an
>>> idea of the expected dates for 1.6 version?
>>>
>>> Regards,
>>>
>>> Meethu Mathew
>>> Senior Engineer
>>> Flytxt
>>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.6 Release window is not updated in Spark-wiki

2015-10-01 Thread Patrick Wendell
Ah - I can update it. Usually I do it after the release is cut. It's
just a standard 3 month cadence.

On Thu, Oct 1, 2015 at 3:55 AM, Sean Owen  wrote:
> My guess is that the 1.6 merge window should close at the end of
> November (2 months from now)? I can update it but wanted to check if
> anyone else has a preferred tentative plan.
>
> On Thu, Oct 1, 2015 at 2:20 AM, Meethu Mathew  
> wrote:
>> Hi,
>> In the https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage the
>> current release window has not been changed from 1.5. Can anybody give an
>> idea of the expected dates for 1.6 version?
>>
>> Regards,
>>
>> Meethu Mathew
>> Senior Engineer
>> Flytxt
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Patrick Wendell
Hey Richard,

My assessment (just looked before I saw Sean's email) is the same as
his. The NOTICE file embeds other projects' licenses. If those
licenses themselves have pointers to other files or dependencies, we
don't embed them. I think this is standard practice.

- Patrick

On Thu, Sep 24, 2015 at 10:00 AM, Sean Owen  wrote:
> Hi Richard, those are messages reproduced from other projects' NOTICE
> files, not created by Spark. They need to be reproduced in Spark's
> NOTICE file to comply with the license, but their text may or may not
> apply to Spark's distribution. The intent is that users would track
> this back to the source project if interested to investigate what the
> upstream notice is about.
>
> Requirements vary by license, but I do not believe there is additional
> requirement to reproduce these other files. Their license information
> is already indicated in accordance with the license terms.
>
> What licenses are you looking for in LICENSE that you believe should be there?
>
> Getting all this right is both difficult and important. I've made some
> efforts over time to strictly comply with the Apache take on
> licensing, which is at http://www.apache.org/legal/resolved.html  It's
> entirely possible there's still a mistake somewhere in here (possibly
> a new dependency, etc). Please point it out if you see such a thing.
>
> But so far what you describe is "working as intended", as far as I
> know, according to Apache.
>
>
> On Thu, Sep 24, 2015 at 5:52 PM, Richard Hillegas  wrote:
>> -1 (non-binding)
>>
>> I was able to build Spark cleanly from the source distribution using the
>> command in README.md:
>>
>> build/mvn -DskipTests clean package
>>
>> However, while I was waiting for the build to complete, I started going
>> through the NOTICE file. I was confused about where to find licenses for 3rd
>> party software bundled with Spark. About halfway through the NOTICE file,
>> starting with Java Collections Framework, there is a list of licenses of the
>> form
>>
>>license/*.txt
>>
>> But there is no license subdirectory in the source distro. I couldn't find
>> the  *.txt license files for Java Collections Framework, Base64 Encoder, or
>> JZlib anywhere in the source distro. I couldn't find those files in license
>> subdirectories at the indicated home pages for those projects. (I did find
>> the license for JZLIB somewhere else, however:
>> http://www.jcraft.com/jzlib/LICENSE.txt.)
>>
>> In addition, I couldn't find licenses for those projects in the master
>> LICENSE file.
>>
>> Are users supposed to get licenses from the indicated 3rd party web sites?
>> Those online licenses could change. I would feel more comfortable if the ASF
>> were protected by our bundling the licenses inside our source distros.
>>
>> After looking for those three licenses, I stopped reading the NOTICE file.
>> Maybe I'm confused about how to read the NOTICE file. Where should users
>> expect to find the 3rd party licenses?
>>
>> Thanks,
>> -Rick
>>
>> Reynold Xin  wrote on 09/24/2015 12:27:25 AM:
>>
>>> From: Reynold Xin 
>>> To: "dev@spark.apache.org" 
>>> Date: 09/24/2015 12:28 AM
>>> Subject: [VOTE] Release Apache Spark 1.5.1 (RC1)
>>
>>
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.5.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> The release fixes 81 known issues in Spark 1.5.0, listed here:
>>> http://s.apache.org/spark-1.5.1
>>>
>>> The tag to be voted on is v1.5.1-rc1:
>>> https://github.com/apache/spark/commit/
>>> 4df97937dbf68a9868de58408b9be0bf87dbbb94
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release (1.5.1) can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1148/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>>>
>>> ===
>>> How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate,
>>> then reporting any regressions.
>>>
>>> 
>>> What justifies a -1 vote for this release?
>>> 
>>> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
>>> present in 

Re: RFC: packaging Spark without assemblies

2015-09-23 Thread Patrick Wendell
I think it would be a big improvement to get rid of it. It's not how
jars are supposed to be packaged and it has caused problems in many
different context over the years.

For me a key step in moving away would be to fully audit/understand
all compatibility implications of removing it. If other people are
supportive of this plan I can offer to help spend some time thinking
about any potential corner cases, etc.

- Patrick

On Wed, Sep 23, 2015 at 3:13 PM, Marcelo Vanzin  wrote:
> Hey all,
>
> This is something that we've discussed several times internally, but
> never really had much time to look into; but as time passes by, it's
> increasingly becoming an issue for us and I'd like to throw some ideas
> around about how to fix it.
>
> So, without further ado:
> https://github.com/vanzin/spark/pull/2/files
>
> (You can comment there or click "View" to read the formatted document.
> I thought that would be easier than sharing on Google Drive or Box or
> something.)
>
> It would be great to get people's feedback, especially if there are
> strong reasons for the assemblies that I'm not aware of.
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Why there is no snapshots for 1.5 branch?

2015-09-22 Thread Patrick Wendell
I just added snapshot builds for 1.5. They will take a few hours to
build, but once we get them working should publish every few hours.

https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging

- Patrick

On Mon, Sep 21, 2015 at 10:36 PM, Bin Wang  wrote:
> However I find some scripts in dev/audit-release, can I use them?
>
> Bin Wang wrote on Tue, Sep 22, 2015 at 1:34 PM:
>>
>> No, I mean push Spark to my private repository. Spark doesn't have a
>> build.sbt as far as I can see.
>>
>>> Fengdong Yu wrote on Tue, Sep 22, 2015 at 1:29 PM:
>>>
>>> Do you mean you want to publish the artifact to your private repository?
>>>
>>> if so, please using ‘sbt publish’
>>>
>>> add the following to your build.sbt:
>>>
>>> publishTo := {
>>>   val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
>>>   if (version.value.endsWith("SNAPSHOT"))
>>>     Some("snapshots" at nexus + "content/repositories/snapshots")
>>>   else
>>>     Some("releases" at nexus + "content/repositories/releases")
>>> }
>>>
>>>
>>>
>>> On Sep 22, 2015, at 13:26, Bin Wang  wrote:
>>>
>>> My project is using sbt (or maven), which needs to download dependencies
>>> from a maven repo. I have my own private maven repo with nexus but I don't
>>> know how to push my own build to it, can you give me a hint?
>>>
>>> Mark Hamstra wrote on Tue, Sep 22, 2015 at 1:25 PM:

 Yeah, whoever is maintaining the scripts and snapshot builds has fallen
 down on the job -- but there is nothing preventing you from checking out
 branch-1.5 and creating your own build, which is arguably a smarter thing 
 to
 do anyway.  If I'm going to use a non-release build, then I want the full
 git commit history of exactly what is in that build readily available, not
 just somewhat arbitrary JARs.

 On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang  wrote:
>
> But I cannot find 1.5.1-SNAPSHOT either at
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>
> Mark Hamstra wrote on Tue, Sep 22, 2015 at 12:55 PM:
>>
>> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.
>> The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 
>> release
>> candidates and then the 1.5.1 release.
>>
>> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  wrote:
>>>
>>> I'd like to use some important bug fixes in 1.5 branch and I look for
>>> the apache maven host, but don't find any snapshot for 1.5 branch.
>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>>
>>> I can find 1.4.X and 1.6.0 versions, why there is no snapshot for
>>> 1.5.X?
>>
>>

>>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Created] (FLINK-2699) Flink is filling Spark JIRA with incorrect PR links

2015-09-17 Thread Patrick Wendell (JIRA)
Patrick Wendell created FLINK-2699:
--

 Summary: Flink is filling Spark JIRA with incorrect PR links
 Key: FLINK-2699
 URL: https://issues.apache.org/jira/browse/FLINK-2699
 Project: Flink
  Issue Type: Bug
Reporter: Patrick Wendell
Priority: Blocker


I think you guys are using our script for synchronizing JIRA. However, you 
didn't adjust the target JIRA identifier so it is still posting to Spark. In 
the past few hours we've seen a lot of random Flink pull requests being linked 
on the Spark JIRA. This is obviously not desirable for us since they are 
different projects.

The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).

https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm

I saw these as recently as 5 hours ago - but if you've fixed it already go 
ahead and close this. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2699) Flink is filling Spark JIRA with incorrect PR links

2015-09-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated FLINK-2699:
---
Description: 
I think you guys are using our script for synchronizing JIRA. However, you 
didn't adjust the target JIRA identifier so it is still posting to Spark. In 
the past few hours we've seen a lot of random Flink pull requests being linked 
on the Spark JIRA. This is obviously not desirable for us since they are 
different projects.

The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).

https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm

I saw these as recently as 5 hours ago. There are around 23 links that were 
created - if you could go ahead and remove them that would be useful. Thanks!

  was:
I think you guys are using our script for synchronizing JIRA. However, you 
didn't adjust the target JIRA identifier so it is still posting to Spark. In 
the past few hours we've seen a lot of random Flink pull requests being linked 
on the Spark JIRA. This is obviously not desirable for us since they are 
different projects.

The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).

https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm

I saw these as recently as 5 hours ago - but if you've fixed it already go 
ahead and close this. Thanks.


> Flink is filling Spark JIRA with incorrect PR links
> ---
>
> Key: FLINK-2699
> URL: https://issues.apache.org/jira/browse/FLINK-2699
> Project: Flink
>  Issue Type: Bug
>Reporter: Patrick Wendell
>Priority: Blocker
>
> I think you guys are using our script for synchronizing JIRA. However, you 
> didn't adjust the target JIRA identifier so it is still posting to Spark. In 
> the past few hours we've seen a lot of random Flink pull requests being 
> linked on the Spark JIRA. This is obviously not desirable for us since they 
> are different projects.
> The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).
> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm
> I saw these as recently as 5 hours ago. There are around 23 links that were 
> created - if you could go ahead and remove them that would be useful. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2699) Flink is filling Spark JIRA with incorrect PR links

2015-09-17 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804497#comment-14804497
 ] 

Patrick Wendell commented on FLINK-2699:


Great - thanks for cleaning this up. No worries.

> Flink is filling Spark JIRA with incorrect PR links
> ---
>
> Key: FLINK-2699
> URL: https://issues.apache.org/jira/browse/FLINK-2699
> Project: Flink
>  Issue Type: Bug
>    Reporter: Patrick Wendell
>Priority: Blocker
>
> I think you guys are using our script for synchronizing JIRA. However, you 
> didn't adjust the target JIRA identifier so it is still posting to Spark. In 
> the past few hours we've seen a lot of random Flink pull requests being 
> linked on the Spark JIRA. This is obviously not desirable for us since they 
> are different projects.
> The JIRA links are being created by the user "Maximilian Michels" ([~mxm]).
> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mxm
> I saw these as recently as 5 hours ago. There are around 23 links that were 
> created - if you could go ahead and remove them that would be useful. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10650:

Description: 
In 1.5.0 there are some extra classes in the Spark docs - including a bunch of 
test classes. We need to figure out what commit introduced those and fix it. 
The obvious things like genJavadoc version have not changed.

http://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/streaming/ [before]
http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/ [after]


> Spark docs include test and other extra classes
> ---
>
> Key: SPARK-10650
> URL: https://issues.apache.org/jira/browse/SPARK-10650
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>    Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> In 1.5.0 there are some extra classes in the Spark docs - including a bunch 
> of test classes. We need to figure out what commit introduced those and fix 
> it. The obvious things like genJavadoc version have not changed.
> http://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/streaming/ 
> [before]
> http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/ 
> [after]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10650:

Priority: Critical  (was: Major)

> Spark docs include test and other extra classes
> ---
>
> Key: SPARK-10650
> URL: https://issues.apache.org/jira/browse/SPARK-10650
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>    Reporter: Patrick Wendell
>Assignee: Andrew Or
>Priority: Critical
>
> In 1.5.0 there are some extra classes in the Spark docs - including a bunch 
> of test classes. We need to figure out what commit introduced those and fix 
> it. The obvious things like genJavadoc version have not changed.
> http://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/streaming/ 
> [before]
> http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/ 
> [after]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10650:

Affects Version/s: 1.5.0

> Spark docs include test and other extra classes
> ---
>
> Key: SPARK-10650
> URL: https://issues.apache.org/jira/browse/SPARK-10650
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>    Reporter: Patrick Wendell
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10650:
---

 Summary: Spark docs include test and other extra classes
 Key: SPARK-10650
 URL: https://issues.apache.org/jira/browse/SPARK-10650
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10650) Spark docs include test and other extra classes

2015-09-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10650:

Target Version/s: 1.5.1

> Spark docs include test and other extra classes
> ---
>
> Key: SPARK-10650
> URL: https://issues.apache.org/jira/browse/SPARK-10650
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>    Reporter: Patrick Wendell
>Assignee: Andrew Or
>Priority: Critical
>
> In 1.5.0 there are some extra classes in the Spark docs - including a bunch 
> of test classes. We need to figure out what commit introduced those and fix 
> it. The obvious things like genJavadoc version have not changed.
> http://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/streaming/ 
> [before]
> http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/ 
> [after]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6942) Umbrella: UI Visualizations for Core and Dataframes

2015-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6942:
---
Assignee: Andrew Or  (was: Patrick Wendell)

> Umbrella: UI Visualizations for Core and Dataframes 
> 
>
> Key: SPARK-6942
> URL: https://issues.apache.org/jira/browse/SPARK-6942
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, SQL, Web UI
>        Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.5.0
>
>
> This is an umbrella issue for the assorted visualization proposals for 
> Spark's UI. The scope will likely cover Spark 1.4 and 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2015-09-15 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10620:
---

 Summary: Look into whether accumulator mechanism can replace 
TaskMetrics
 Key: SPARK-10620
 URL: https://issues.apache.org/jira/browse/SPARK-10620
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or


This task is simply to explore whether the internal representation used by 
TaskMetrics could be performed by using accumulators rather than having two 
separate mechanisms. Note that we need to continue to preserve the existing 
"Task Metric" data structures that are exposed to users through event logs etc. 
The question is can we use a single internal codepath and perhaps make this 
easier to extend in the future.

I think there are a few things to look into:
- How do the semantics of accumulators on stage retries differ from aggregate 
TaskMetrics for a stage? Could we implement clearer retry semantics for 
internal accumulators to allow them to be the same - for instance, zeroing 
accumulator values if a stage is retried (see discussion here: SPARK-10042).
- Are there metrics that do not fit well into the accumulator model, or would 
be difficult to update as an accumulator.
- If we expose metrics through accumulators in the future rather than 
continuing to add fields to TaskMetrics, what is the best way to coerce 
compatibility?
- Is it worth it to do this, or is the consolidation too complicated to justify?
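
To make the direction concrete, here is a minimal, user-level sketch that carries a per-task "metric" in a named accumulator using only the public 1.x accumulator API. It illustrates the idea being explored, not the internal TaskMetrics implementation (the object and metric names are made up):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorAsMetricSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("metric-sketch").setMaster("local[2]"))

    // A named accumulator plays the role of a task metric: each task adds to
    // it, the driver reads the merged value, and it shows up in the web UI.
    val recordsRead = sc.accumulator(0, "records read")

    sc.parallelize(1 to 1000, 4).foreach { _ =>
      recordsRead += 1  // per-task update, aggregated on the driver
    }

    println(s"records read = ${recordsRead.value}")
    sc.stop()
  }
}
{code}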



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2015-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10620:

Description: 
This task is simply to explore whether the internal representation used by 
TaskMetrics could be performed by using accumulators rather than having two 
separate mechanisms. Note that we need to continue to preserve the existing 
"Task Metric" data structures that are exposed to users through event logs etc. 
The question is can we use a single internal codepath and perhaps make this 
easier to extend in the future.

I think a full exploration would answer the following questions:
- How do the semantics of accumulators on stage retries differ from aggregate 
TaskMetrics for a stage? Could we implement clearer retry semantics for 
internal accumulators to allow them to be the same - for instance, zeroing 
accumulator values if a stage is retried (see discussion here: SPARK-10042).
- Are there metrics that do not fit well into the accumulator model, or would 
be difficult to update as an accumulator.
- If we expose metrics through accumulators in the future rather than 
continuing to add fields to TaskMetrics, what is the best way to coerce 
compatibility?
- Are there any other considerations?
- Is it worth it to do this, or is the consolidation too complicated to justify?

  was:
This task is simply to explore whether the internal representation used by 
TaskMetrics could be performed by using accumulators rather than having two 
separate mechanisms. Note that we need to continue to preserve the existing 
"Task Metric" data structures that are exposed to users through event logs etc. 
The question is can we use a single internal codepath and perhaps make this 
easier to extend in the future.

I think there are a few things to look into:
- How do the semantics of accumulators on stage retries differ from aggregate 
TaskMetrics for a stage? Could we implement clearer retry semantics for 
internal accumulators to allow them to be the same - for instance, zeroing 
accumulator values if a stage is retried (see discussion here: SPARK-10042).
- Are there metrics that do not fit well into the accumulator model, or would 
be difficult to update as an accumulator.
- If we expose metrics through accumulators in the future rather than 
continuing to add fields to TaskMetrics, what is the best way to coerce 
compatibility?
- Is it worth it to do this, or is the consolidation too complicated to justify?


> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2015-09-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745690#comment-14745690
 ] 

Patrick Wendell commented on SPARK-10620:
-

/cc [~imranr] and [~srowen] for any comments. In my mind the goal here is just 
to produce some design thoughts and not to actually do it (at this point).

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>        Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10511) Source releases should not include maven jars

2015-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10511:

Assignee: Luciano Resende

> Source releases should not include maven jars
> -
>
> Key: SPARK-10511
> URL: https://issues.apache.org/jira/browse/SPARK-10511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>    Reporter: Patrick Wendell
>Assignee: Luciano Resende
>Priority: Blocker
>
> I noticed our source jars seemed really big for 1.5.0. At least one 
> contributing factor is that, likely due to some change in the release script, 
> the maven jars are being bundled in with the source code in our build 
> directory. This runs afoul of the ASF policy on binaries in source releases - 
> we should fix it in 1.5.1.
> The issue (I think) is that we might invoke maven to compute the version 
> between when we checkout Spark from github and when we package the source 
> file. I think it could be fixed by simply clearing out the build/ directory 
> after that statement runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10623) Turning on predicate pushdown throws NoSuchElementException when RDD is empty

2015-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10623:

Component/s: SQL

> Turning on predicate pushdown throws NoSuchElementException when RDD is 
> empty 
> -
>
> Key: SPARK-10623
> URL: https://issues.apache.org/jira/browse/SPARK-10623
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ram Sriharsha
>Assignee: Zhan Zhang
>
> Turning on predicate pushdown for ORC datasources results in a 
> NoSuchElementException:
> scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
> df: org.apache.spark.sql.DataFrame = [name: string]
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> scala> df.explain
> == Physical Plan ==
> java.util.NoSuchElementException
> Disabling the pushdown makes things work again:
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false")
> scala> df.explain
> == Physical Plan ==
> Project [name#6]
>  Filter (age#7 < 15)
>   Scan 
> OrcRelation[file:/home/mydir/spark-1.5.0-SNAPSHOT/test/people][name#6,age#7]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10601) Spark SQL - Support for MINUS

2015-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10601:

Component/s: SQL

> Spark SQL - Support for MINUS
> -
>
> Key: SPARK-10601
> URL: https://issues.apache.org/jira/browse/SPARK-10601
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Richard Garris
>
> Spark SQL does not currently support SQL MINUS.
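
Until MINUS lands in the SQL parser, the DataFrame API's except already gives the same set-difference semantics. A small sketch against a 1.5-era SQLContext (the data and column names are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MinusViaExcept {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("minus-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val a = Seq((1, "alice"), (2, "bob"), (3, "carol")).toDF("id", "name")
    val b = Seq((2, "bob")).toDF("id", "name")

    // Equivalent of: SELECT id, name FROM a MINUS SELECT id, name FROM b
    a.except(b).show()

    sc.stop()
  }
}
{code}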



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10600) SparkSQL - Support for Not Exists in a Correlated Subquery

2015-09-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10600:

Component/s: SQL

> SparkSQL - Support for Not Exists in a Correlated Subquery
> --
>
> Key: SPARK-10600
> URL: https://issues.apache.org/jira/browse/SPARK-10600
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Richard Garris
>
> Spark SQL currently does not support NOT EXISTS clauses (e.g. 
> SELECT * FROM TABLE_A WHERE NOT EXISTS ( SELECT 1 FROM TABLE_B where 
> TABLE_B.id = TABLE_A.id))
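
Until correlated NOT EXISTS is supported, the same result can usually be expressed as a left outer join followed by a null filter. A hedged DataFrame sketch (the tables and columns are illustrative stand-ins for TABLE_A and TABLE_B):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NotExistsViaLeftOuterJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("not-exists-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val tableA = Seq((1, "alice"), (2, "bob"), (3, "carol")).toDF("id", "name")
    val tableB = Seq((2, "match")).toDF("id", "payload")

    // Rows of TABLE_A with no matching id in TABLE_B, i.e. WHERE NOT EXISTS (...)
    val notExists = tableA
      .join(tableB, tableA("id") === tableB("id"), "left_outer")
      .where(tableB("id").isNull)
      .select(tableA("id"), tableA("name"))

    notExists.show()
    sc.stop()
  }
}
{code}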



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10576) Move .java files out of src/main/scala

2015-09-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742280#comment-14742280
 ] 

Patrick Wendell commented on SPARK-10576:
-

FWIW - seems to me like moving them into /java makes sense. If we are going to 
have src/main/scala and src/main/java, might as well use them correctly. What 
do you think, [~rxin]?

> Move .java files out of src/main/scala
> --
>
> Key: SPARK-10576
> URL: https://issues.apache.org/jira/browse/SPARK-10576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Priority: Minor
>
> (I suppose I'm really asking for an opinion on this, rather than asserting it 
> must be done, but seems worthwhile. CC [~rxin] and [~pwendell])
> As pointed out on the mailing list, there are some Java files in the Scala 
> source tree:
> {code}
> ./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
> ./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
> ./core/src/main/scala/org/apache/spark/annotation/Experimental.java
> ./core/src/main/scala/org/apache/spark/annotation/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/Private.java
> ./core/src/main/scala/org/apache/spark/api/java/package-info.java
> ./core/src/main/scala/org/apache/spark/broadcast/package-info.java
> ./core/src/main/scala/org/apache/spark/executor/package-info.java
> ./core/src/main/scala/org/apache/spark/io/package-info.java
> ./core/src/main/scala/org/apache/spark/rdd/package-info.java
> ./core/src/main/scala/org/apache/spark/scheduler/package-info.java
> ./core/src/main/scala/org/apache/spark/serializer/package-info.java
> ./core/src/main/scala/org/apache/spark/util/package-info.java
> ./core/src/main/scala/org/apache/spark/util/random/package-info.java
> ./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
> ./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
> ./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
> ./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
> ./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
> ./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
> ./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/package-info.java
> ./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
> ./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
> ./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
> {code}
> It happens to work since the Scala compiler plugin is handling both.
> On its face, they should be in the Java source tree. I'm trying to figure out 
> if there are good reasons they have to be in this less intuitive location.
> I might try moving them just to see.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10511) Source releases should not include maven jars

2015-09-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10511:
---

 Summary: Source releases should not include maven jars
 Key: SPARK-10511
 URL: https://issues.apache.org/jira/browse/SPARK-10511
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Patrick Wendell
Priority: Blocker


I noticed our source jars seemed really big for 1.5.0. At least one 
contributing factor is that, likely due to some change in the release script, 
the maven jars are being bundled in with the source code in our build 
directory. This runs afoul of the ASF policy on binaries in source releases - 
we should fix it in 1.5.1.

The issue (I think) is that we might invoke maven to compute the version 
between when we checkout Spark from github and when we package the source file. 
I think it could be fixed by simply clearing out the build/ directory after 
that statement runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723792#comment-14723792
 ] 

Patrick Wendell commented on SPARK-10374:
-

Hey Matt,

I think the only thing that could have influenced you is that we changed our 
default advertised akka dependency. We used to advertise an older version of 
akka that shaded protobuf. What happens if you manually coerce that version of 
akka in your application?

Spark itself doesn't directly use protobuf. But some of our dependencies do, 
including both akka and Hadoop. My guess is that you are now in a situation 
where you can't reconcile the akka and hadoop protobuf versions and make them 
both happy. This would be consistent with the changes we made in 1.5 in 
SPARK-7042.

The fix would be to exclude all com.typesafe.akka artifacts from Spark and 
manually add org.spark-project.akka to your build.

However, since you didn't post a full stack trace, I can't know for sure 
whether it is akka that complains when you try to fix the protobuf version at 
2.4.
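
The report uses Gradle, but for reference this is roughly what that exclusion looks like in an sbt build, with coordinates taken from the dependency trees below. Treat it as a sketch of the approach rather than a verified fix:

{code}
// build.sbt sketch: drop the Typesafe akka that spark-core 1.5 pulls in and
// add back the Spark-forked akka (which bundles a shaded protobuf), so the
// application's protobuf 2.4.0a from CDH4 Hadoop is left alone.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.5.0")
    .excludeAll(ExclusionRule(organization = "com.typesafe.akka")),
  "org.spark-project.akka" %% "akka-remote" % "2.3.4-spark"
)
{code}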

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However 
> when I ran dependencyInsight again against Spark 1.5 and it looks like 
> protobuf is no longer shaded from the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- 

[jira] [Commented] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-08-31 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723844#comment-14723844
 ] 

Patrick Wendell commented on SPARK-10359:
-

The approach in SPARK-4123 was a bit different, but there is some overlap. We 
ended up reverting that patch because it wasn't working consistently. I'll 
close that one as a dup of this one.

> Enumerate Spark's dependencies in a file and diff against it for new pull 
> requests 
> ---
>
> Key: SPARK-10359
> URL: https://issues.apache.org/jira/browse/SPARK-10359
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>    Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> Sometimes when we have dependency changes it can be pretty unclear what 
> transitive set of things are changing. If we enumerate all of the 
> dependencies and put them in a source file in the repo, we can make it so 
> that it is very explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4123) Show dependency changes in pull requests

2015-08-31 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4123.

Resolution: Duplicate

I've proposed a slightly different approach in SPARK-10359, so I'm closing this 
since there is high overlap.

> Show dependency changes in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>        Reporter: Patrick Wendell
>Assignee: Brennon York
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9545) Run Maven tests in pull request builder if title has [test-maven] in it

2015-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-9545:
---
Summary: Run Maven tests in pull request builder if title has 
[test-maven] in it  (was: Run Maven tests in pull request builder if title 
has [maven-test] in it)

 Run Maven tests in pull request builder if title has [test-maven] in it
 -

 Key: SPARK-9545
 URL: https://issues.apache.org/jira/browse/SPARK-9545
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.6.0


 We have infrastructure now in the build tooling for running maven tests, but 
 it's not actually used anywhere. With a very minor change we can support 
 running maven tests if the pull request title has maven-test in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9547) Allow testing pull requests with different Hadoop versions

2015-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-9547.

   Resolution: Fixed
Fix Version/s: 1.6.0

 Allow testing pull requests with different Hadoop versions
 --

 Key: SPARK-9547
 URL: https://issues.apache.org/jira/browse/SPARK-9547
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.6.0


 Similar to SPARK-9545 we should allow testing different Hadoop profiles in 
 the PRB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9545) Run Maven tests in pull request builder if title has [maven-test] in it

2015-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-9545.

   Resolution: Fixed
Fix Version/s: 1.6.0

 Run Maven tests in pull request builder if title has [maven-test] in it
 -

 Key: SPARK-9545
 URL: https://issues.apache.org/jira/browse/SPARK-9545
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.6.0


 We have infrastructure now in the build tooling for running maven tests, but 
 it's not actually used anywhere. With a very minor change we can support 
 running maven tests if the pull request title has maven-test in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[ANNOUNCE] New testing capabilities for pull requests

2015-08-30 Thread Patrick Wendell
Hi All,

For pull requests that modify the build, you can now test different
build permutations as part of the pull request builder. To trigger
these, you add a special phrase to the title of the pull request.
Current options are:

[test-maven] - run tests using maven and not sbt
[test-hadoop1.0] - test using older hadoop versions (can use 1.0, 2.0,
2.2, and 2.3).

The relevant source code is here:
https://github.com/apache/spark/blob/master/dev/run-tests-jenkins#L193

This is useful because it allows up-front testing of build changes to
avoid breaks once a patch has already been merged.

I've documented this on the wiki:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Created] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-08-30 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10359:
---

 Summary: Enumerate Spark's dependencies in a file and diff against 
it for new pull requests 
 Key: SPARK-10359
 URL: https://issues.apache.org/jira/browse/SPARK-10359
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell


Sometimes when we have dependency changes it can be pretty unclear what 
transitive set of things are changing. If we enumerate all of the dependencies 
and put them in a source file in the repo, we can make it so that it is very 
explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Patrick Wendell
There is already code in place that restricts which tests run
depending on which code is modified. However, changes inside of
Spark's core currently require running all dependent tests. If you
have ideas about how to improve that heuristic, that would be
great.

- Patrick
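
One concrete mechanism for paring these down is ScalaTest's tagging support: mark the slow suites or cases with a tag and exclude that tag in the default PR run. A minimal sketch (the tag name and suite are made up for illustration):

{code}
import org.scalatest.{FunSuite, Tag}

// Marker for tests that are too slow to run on every pull request.
object SlowTest extends Tag("org.apache.spark.tags.SlowTest")

class ExampleSuite extends FunSuite {
  test("cheap check") {
    assert(1 + 1 === 2)
  }

  test("expensive end-to-end check", SlowTest) {
    // long-running work would go here
    assert(true)
  }
}

// Exclude the tagged tests in the PR builder, e.g. from sbt:
//   testOnly *ExampleSuite -- -l org.apache.spark.tags.SlowTest
// and run only the slow ones in a nightly job:
//   testOnly *ExampleSuite -- -n org.apache.spark.tags.SlowTest
{code}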

On Tue, Aug 25, 2015 at 1:33 PM, Marcelo Vanzin van...@cloudera.com wrote:
 Hello y'all,

 So I've been getting kinda annoyed with how many PR tests have been
 timing out. I took one of the logs from one of my PRs and started to
 do some crunching on the data from the output, and here's a list of
 the 5 slowest suites:

 307.14s HiveSparkSubmitSuite
 382.641s VersionsSuite
 398s CliSuite
 410.52s HashJoinCompatibilitySuite
 2508.61s HiveCompatibilitySuite

 Looking at those, I'm not surprised at all that we see so many
 timeouts. Is there any ongoing effort to trim down those tests
 (especially HiveCompatibilitySuite) or somehow restrict when they're
 run?

 Almost 1 hour to run a single test suite that affects a rather
 isolated part of the code base looks a little excessive to me.

 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


