Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-20 Thread Sean Owen
Two builds is indeed a pain, since it's an ongoing chore to keep them
in sync. For example, I am already seeing that the two do not quite
declare the same dependencies (see recent patch).

I think publishing artifacts to Maven Central should be considered a
hard requirement, if it isn't already one from the ASF (and it may be).
Certainly most people out there would be shocked if you told them
Spark is not in the repo at all. And that requires at least
maintaining a pom that declares the structure of the project.

This does not necessarily mean using Maven to build, but is a reason
that removing the pom is going to make this a lot harder for people to
consume as a project.

Maven has its pros and cons but there are plenty of people lurking
around who know it quite well. Certainly it's easier for the Hadoop
people to understand and work with. On the other hand, it supports
Scala although only via a plugin, which is weaker support. sbt seems
like a fairly new, basic, ad-hoc tool. Is there an advantage to it,
other than being Scala (which is an advantage)?

--
Sean Owen | Director, Data Science | London


On Fri, Feb 21, 2014 at 4:03 AM, Patrick Wendell  wrote:
> Hey All,
>
> It's very high overhead having two build systems in Spark. Before
> getting into a long discussion about the merits of sbt vs maven, I
> wanted to pose a simple question to the dev list:
>
> Is there anyone who feels that dropping either sbt or maven would have
> a major consequence for them?
>
> And I say "major consequence" meaning something becomes completely
> impossible now and can't be worked around. This is different from an
> "inconvenience", i.e., something which can be worked around but will
> require some investment.
>
> I'm posing the question in this way because, if there are features in
> either build system that are absolutely-un-available in the other,
> then we'll have to maintain both for the time being. I'm merely trying
> to see whether this is the case...
>
> - Patrick


[GitHub] incubator-spark pull request: SPARK-1111: URL Validation Throws Er...

2014-02-20 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/625#issuecomment-35705998
  
@aarondav fixed the nit, waiting for tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. To do so, please top-post your response.
If your project does not have this feature enabled and wishes so, or if the
feature is enabled but not working, please contact infrastructure at
infrastruct...@apache.org or file a JIRA ticket with INFRA.
---


[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35705443
  
Thanks @guojc and @andrewor14! LGTM. Maybe @pwendell wants to take a look 
as well?




[GitHub] incubator-spark pull request: SPARK-1111: URL Validation Throws Er...

2014-02-20 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/incubator-spark/pull/625#issuecomment-35704996
  
LGTM save the minor regex nit




[GitHub] incubator-spark pull request: SPARK-1111: URL Validation Throws Er...

2014-02-20 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/625#discussion_r9938103
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/ClientArguments.scala ---
@@ -115,3 +110,7 @@ private[spark] class ClientArguments(args: Array[String]) {
 System.exit(exitCode)
   }
 }
+
+object ClientArguments {
+  def isValidJarUrl(s: String) = s.matches("^(.+):(.+)jar")
--- End diff --

"^" is technically not needed since matches is full-string anyway

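The remark about `^` can be checked directly: `java.lang.String.matches` only succeeds when the whole input matches the pattern, so a leading anchor adds nothing. A minimal sketch, assuming nothing beyond the pattern in the diff; the object name `JarUrlCheck` and the sample URLs are illustrative and not taken from the pull request.

```scala
object JarUrlCheck {
  // Pattern as written in the diff, with the leading anchor.
  def withAnchor(s: String): Boolean = s.matches("^(.+):(.+)jar")

  // Same pattern without "^". String.matches already requires the
  // entire input to match, so both variants should agree.
  def withoutAnchor(s: String): Boolean = s.matches("(.+):(.+)jar")

  def main(args: Array[String]): Unit = {
    // Hypothetical inputs, chosen only to exercise the pattern.
    val samples = Seq(
      "hdfs://namenode:8020/apps/job.jar",
      "file:/tmp/job.jar",
      "job.jar",                     // no scheme separator
      "hdfs://namenode/job.jar.bak"  // extra text after "jar"
    )
    for (s <- samples) {
      println(s"$s anchored=${withAnchor(s)} unanchored=${withoutAnchor(s)}")
    }
  }
}
```

Because the match is full-string, the last sample is rejected by both variants even though it contains a substring that would satisfy the pattern.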



[GitHub] incubator-spark pull request: [SPARK-1094] Support MiMa for report...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/585#issuecomment-35704910
  
Merged build finished.




[GitHub] incubator-spark pull request: [SPARK-1094] Support MiMa for report...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/585#issuecomment-35704911
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12796/




[GitHub] incubator-spark pull request: [SPARK-1094] Support MiMa for report...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/585#issuecomment-35704781
  
 Merged build triggered.




[GitHub] incubator-spark pull request: [SPARK-1094] Support MiMa for report...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/585#issuecomment-35704782
  
Merged build started.




[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-20 Thread jyotiska
Github user jyotiska commented on the pull request:

https://github.com/apache/incubator-spark/pull/626#issuecomment-35704511
  
I think it is a good idea to add an extra flag for overwriting. If the flag 
is not present, Spark should throw an exception. I will see if the bug is also 
present in PySpark.




[GitHub] incubator-spark pull request: [SPARK-1094] Support MiMa for report...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/585#issuecomment-35704292
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12794/




[GitHub] incubator-spark pull request: [SPARK-1094] Support MiMa for report...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/585#issuecomment-35704289
  
Merged build finished.




[GitHub] incubator-spark pull request: [java8API] SPARK-964 Investigate the...

2014-02-20 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/incubator-spark/pull/539#issuecomment-35704205
  
Also, can you tell me where those implicits in JavaPairRDD are used? Can't 
we manually do the conversions there? At the very least, the implicits should 
be private[spark] so that Java users don't try to call them.




[GitHub] incubator-spark pull request: [SPARK-1094] Support MiMa for report...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/585#issuecomment-35704162
  
Merged build started.




[GitHub] incubator-spark pull request: Deprecated and added a few java api ...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/402#issuecomment-35704179
  
Build started.




[GitHub] incubator-spark pull request: Deprecated and added a few java api ...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/402#issuecomment-35704178
  
 Build triggered.




[GitHub] incubator-spark pull request: [java8API] SPARK-964 Investigate the...

2014-02-20 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/incubator-spark/pull/539#issuecomment-35704145
  
Hey Prashant, this looks pretty good at first glance. Can you also create a 
Java 8 version of the Streaming suite?




[GitHub] incubator-spark pull request: [SPARK-1094] Support MiMa for report...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/585#issuecomment-35704161
  
 Merged build triggered.




[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-20 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/incubator-spark/pull/626#issuecomment-35703455
  
Typically, this is done by writing to a temporary directory, taking care of 
multiple attempts for the same partition (the failure case) and multiple 
concurrent executions on the same partition (the speculative execution case). 
Once the job is done, the output is moved to the desired destination (or the 
directory is deleted if the job fails), as mapred does, for example.
(Moves are atomic NameNode operations.)

So when the output directory is "done", it is fully done: not partial, not in 
progress, etc.
The bug mentioned in particular, leftover files from previous jobs, is just 
scary!
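
The write-to-temporary-then-move pattern described above can be sketched as follows. This is only an illustration on the local filesystem using `java.nio.file`, not the Hadoop committer or any Spark code; the object name `TempThenMove`, the `_temporary-` prefix, and the `part-NNNNN` file names are assumptions made for the example, and task retries and speculative attempts are not modelled.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.util.control.NonFatal

object TempThenMove {
  // Write every partition under a temporary directory next to `dest`,
  // then publish the whole directory with a single move. If anything
  // fails first, the temporary directory is removed, so `dest` never
  // shows partial output.
  def writeJob(dest: Path, partitions: Seq[String]): Unit = {
    val parent = dest.toAbsolutePath.getParent
    val tmp = Files.createTempDirectory(parent, "_temporary-")
    try {
      partitions.zipWithIndex.foreach { case (data, i) =>
        val part = tmp.resolve(f"part-$i%05d")
        Files.write(part, data.getBytes(StandardCharsets.UTF_8))
      }
      // ATOMIC_MOVE fails rather than falling back to a copy when the
      // filesystem cannot perform the move atomically.
      Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE)
    } catch {
      case NonFatal(e) =>
        deleteRecursively(tmp)
        throw e
    }
  }

  private def deleteRecursively(p: Path): Unit = {
    if (Files.isDirectory(p)) {
      val children = Files.list(p)
      try {
        val it = children.iterator()
        while (it.hasNext) deleteRecursively(it.next())
      } finally children.close()
    }
    Files.deleteIfExists(p)
  }
}
```

A reader of `dest` then sees either no directory at all or the complete set of part files, which is the property the comment is asking for.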




Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-20 Thread Mridul Muralidharan
I am not sure if this is resolved now, but Maven was better at
building the assembly jars than sbt.
To the point where I stopped using sbt because of the unpredictable order in
which it unjars the dependencies to create the assembled jar (we do
have quite a lot of conflicting classes in our dependency tree :-( ).
I don't know if this is an artifact of how we specify it in the sbt
project, or something else ...

If this is still an issue, then using only sbt is a non-starter.

Regards,
Mridul




On Fri, Feb 21, 2014 at 9:33 AM, Patrick Wendell  wrote:
> Hey All,
>
> It's very high overhead having two build systems in Spark. Before
> getting into a long discussion about the merits of sbt vs maven, I
> wanted to pose a simple question to the dev list:
>
> Is there anyone who feels that dropping either sbt or maven would have
> a major consequence for them?
>
> And I say "major consequence" meaning something becomes completely
> impossible now and can't be worked around. This is different from an
> "inconvenience", i.e., something which can be worked around but will
> require some investment.
>
> I'm posing the question in this way because, if there are features in
> either build system that are absolutely-un-available in the other,
> then we'll have to maintain both for the time being. I'm merely trying
> to see whether this is the case...
>
> - Patrick


[GitHub] incubator-spark pull request: Deprecated and added a few java api ...

2014-02-20 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/402#discussion_r9937176
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
@@ -73,7 +74,7 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
* of the original partition.
*/
   def mapPartitionsWithIndex[R: ClassTag](
-  f: JFunction2[Int, java.util.Iterator[T], java.util.Iterator[R]],
+  f: JFunction2[Integer, java.util.Iterator[T], java.util.Iterator[R]],
--- End diff --

I am sorry, I think that was some other PR. Going to change this right 
away. 




[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/626#issuecomment-35701775
  
Can one of the admins verify this patch?




[GitHub] incubator-spark pull request: SPARK-1111: URL Validation Throws Er...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/625#issuecomment-35701664
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12793/




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35701668
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12792/




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35701667
  
Merged build finished.




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35701666
  
Merged build finished.




[GitHub] incubator-spark pull request: SPARK-1111: URL Validation Throws Er...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/625#issuecomment-35701663
  
Merged build finished.




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35701669
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12791/




[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-20 Thread CodingCat
GitHub user CodingCat opened a pull request:

https://github.com/apache/incubator-spark/pull/626

[SPARK-1100] prevent Spark from overwriting directory silently and leaving 
dirty directory

Thanks to Diana Carroll for reporting this issue.

The current saveAsTextFile/SequenceFile silently overwrites the output 
directory if it already exists. This behaviour is not desirable because:

1. overwriting the data silently is not user-friendly

2. if the number of partitions changes between two write operations, the 
output directory will contain results generated by both runs

My fix includes:

1. adding some new APIs with a flag that lets users define whether they want 
to overwrite the directory:

if the flag is set to true, the output directory is deleted first and 
the new data is then written, so that the output directory does not contain 
results from multiple runs; 

if the flag is set to false, Spark will throw an exception if the output 
directory already exists

2. I didn't change saveNewHadoopAPI because, in the new API, the overwrite 
flag is defined by the implementation of RecordWriter, so we don't need to 
control that in Spark

3. changed JavaAPI part

4. default behaviour is overwriting
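
The flag semantics listed above can be sketched on the local filesystem as follows. This is not the code in the pull request: the object `OverwriteFlagSketch`, the method `save`, and the `part-NNNNN` naming are hypothetical, and `java.nio.file` stands in for the Hadoop output APIs. The default of `overwrite = true` mirrors item 4.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{FileAlreadyExistsException, Files, Path}

object OverwriteFlagSketch {
  // Each element of `partitions` is written as one part file.
  // overwrite = true : delete any existing directory first, so no part
  //                    files from an earlier run are left behind.
  // overwrite = false: refuse to write into an existing directory.
  def save(dir: Path, partitions: Seq[String], overwrite: Boolean = true): Unit = {
    if (Files.exists(dir)) {
      if (!overwrite) throw new FileAlreadyExistsException(dir.toString)
      deleteRecursively(dir)
    }
    Files.createDirectories(dir)
    partitions.zipWithIndex.foreach { case (data, i) =>
      Files.write(dir.resolve(f"part-$i%05d"), data.getBytes(StandardCharsets.UTF_8))
    }
  }

  private def deleteRecursively(p: Path): Unit = {
    if (Files.isDirectory(p)) {
      val children = Files.list(p)
      try {
        val it = children.iterator()
        while (it.hasNext) deleteRecursively(it.next())
      } finally children.close()
    }
    Files.deleteIfExists(p)
  }
}
```

Writing three partitions and then two with `overwrite = true` leaves exactly two part files, which is the mixed-results case from the problem description; with `overwrite = false` the second call throws instead.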

-

Two questions

1. should we deprecate the old APIs without such a flag?

2. I noticed that Spark Streaming also calls these APIs. I think we 
don't need to change the related part in streaming? @tdas 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/CodingCat/incubator-spark SPARK-1100

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/626.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #626


commit 2ec87a1f63b4650036691e5bf5d484aae4e6d470
Author: CodingCat 
Date:   2014-02-21T05:32:17Z

add new APIs to enable users define whether to overwrite the output 
directory






[GitHub] incubator-spark pull request: SPARK-1114: Allow PySpark to use exi...

2014-02-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-spark/pull/622




[GitHub] incubator-spark pull request: SPARK-1111: URL Validation Throws Er...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/625#issuecomment-35700515
  
 Merged build triggered.




[GitHub] incubator-spark pull request: SPARK-1111: URL Validation Throws Er...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/625#issuecomment-35700516
  
Merged build started.




[GitHub] incubator-spark pull request: SPARK-1111: URL Validation Throws Er...

2014-02-20 Thread pwendell
GitHub user pwendell opened a pull request:

https://github.com/apache/incubator-spark/pull/625

SPARK-: URL Validation Throws Error for HDFS URL's

Fixes an error where HDFS URL's cause an exception. Should be merged into 
master and 0.9.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pwendell/incubator-spark url-validation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/625.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #625


commit fa25ee2b02aa5b4518e76938cce71aad7239ba31
Author: Patrick Wendell 
Date:   2014-02-21T05:29:19Z

SPARK-: URL Validation Throws Error for HDFS URL's






[GitHub] incubator-spark pull request: SPARK-1114: Allow PySpark to use exi...

2014-02-20 Thread ahirreddy
Github user ahirreddy commented on the pull request:

https://github.com/apache/incubator-spark/pull/622#issuecomment-35700324
  
Great! Thanks!

On Thu, Feb 20, 2014 at 9:21 PM, Matei Zaharia 
wrote:

> Looks good, I've merged this in.
> ---
> Reply to this email directly or view it on GitHub:
> https://github.com/apache/incubator-spark/pull/622#issuecomment-35700134




[GitHub] incubator-spark pull request: SPARK-1114: Allow PySpark to use exi...

2014-02-20 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/incubator-spark/pull/622#issuecomment-35700134
  
Looks good, I've merged this in.




Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-20 Thread Patrick Wendell
Hey Henry,

Yep, I wanted to reboot this since some time has passed and people may
have new or changed ways of using the build.

Maven makes the Apache publishing fairly seamless, but after the last
two releases I believe we could make it work with sbt as well. sbt
also supports publishing and other Apache projects such as Kafka
publish with sbt.

On Thu, Feb 20, 2014 at 8:50 PM, Henry Saputra  wrote:
> Thanks for bringing back the build systems discussions, Patrick.
> There was a long discussion way back before Spark joining ASF and as I
> remember there has not been clear "winner" between using sbt or maven.
>
> Maven makes it easier to publish the artifacts to Nexus repository,
> not sure if sbt can do  the same, and as I remember one of the
> limitations or drawbacks about maven is the use of profiles.
> Matei had suggested using some kind of Hadoop client detection as in
> Parquet project to manage the Hadoop versions to avoid profiles.
>
>
> - Henry
>
> On Thu, Feb 20, 2014 at 8:03 PM, Patrick Wendell  wrote:
>> Hey All,
>>
>> It's very high overhead having two build systems in Spark. Before
>> getting into a long discussion about the merits of sbt vs maven, I
>> wanted to pose a simple question to the dev list:
>>
>> Is there anyone who feels that dropping either sbt or maven would have
>> a major consequence for them?
>>
>> And I say "major consequence" meaning something becomes completely
>> impossible now and can't be worked around. This is different from an
>> "inconvenience", i.e., something which can be worked around but will
>> require some investment.
>>
>> I'm posing the question in this way because, if there are features in
>> either build system that are absolutely-un-available in the other,
>> then we'll have to maintain both for the time being. I'm merely trying
>> to see whether this is the case...
>>
>> - Patrick


[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35699690
  
 Merged build triggered.




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35699691
  
Merged build started.




Re: Signal/Noise Ratio

2014-02-20 Thread Henry Saputra
Daniel Gruno from ASF infra mentioned this when adding GitHub plugin
support to the dev@ list:
"We may, in the future, add the possibility to filter out certain
comments from being relayed to the ML (such as jenkins workflows etc),
but this will all depend on how this initial phase goes along."

It looks like Apache Spark needs the ability to filter out comments from Jenkins.

So if we could "filter" the Jenkins comments from being sent to the dev@
list, would this help reduce the noise?

- Henry

On Thu, Feb 20, 2014 at 1:01 PM, Andrew Ash  wrote:
> I'm fine with keeping the GitHub traffic if we can
>
> a) take away the Jenkins build started / build finished / build succeeded /
> build failed messages.  Those aren't "dev discussion" and are very noisy.
>  I don't think they help anyone, and people who care about those for a
> particular PR (because they're a reviewer or author on it) are already
> subscribed through GitHub.
> b) change the format of the emails that are sent out; I find them very
> poorly formatted.  I'd prefer no deep tab for the message.
>
> http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3c20140210192901.ce834922...@tyr.zones.apache.org%3E
>
> FWIW I'm filtering all emails from g...@git.apache.org straight to trash
> right now because of the noise.
>
>
> On Thu, Feb 20, 2014 at 12:51 PM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Guys,
>>
>> Whether you are a TLP or not the big key here is making sure that
>> dev discussion does not happen elsewhere outside of the list. You
>> can create e.g., a github-dev@spark.a.o list, but you will need
>> to make sure that:
>>
>> a) if dev discussion is happening there that it gets flowed up to
>> dev@spark.a.o. All development discussion must appear on the dev
>> list and must be traceable as a project discussion and decisions
>> appear on the list(s).
>>
>> b) automated/etc. email is simply that, and there isn't a ton of
>> discussion going on on those github emails, and that it's mostly
>> going on on the dev@spark.a.o list.
>>
>> If you can meet those 2 criteria/litmus test, I think it's fine.
>> The big concern is that if the discussion is not happening elsewhere,
>> then the decisions made for Apache Spark are based on information
>> that isn't co-located with the Apache Spark project. So that's the
>> thing that the PMC needs to keep in mind (note I said PMC now, yay!) :)
>>
>> Cheers and just keep the above in mind and you'll be good.
>>
>> Cheers,
>> Chris
>>
>>
>>
>>
>> -Original Message-
>> From: Andy Konwinski 
>> Reply-To: "dev@spark.incubator.apache.org" <dev@spark.incubator.apache.org>
>> Date: Thursday, February 20, 2014 12:36 PM
>> To: "dev@spark.incubator.apache.org" 
>> Subject: Re: Signal/Noise Ratio
>>
>> >That is a very valid point about the list archives (which a mail filter
>> >doesn't address and which impacts the community in a negative way).
>> >
>> >As of today we are a Top Level Project so I think we have a little more
>> >autonomy for this sort of dev vs separate list decision.
>> >
>> >
>> >On Thu, Feb 20, 2014 at 12:15 PM, Ethan Jewett 
>> wrote:
>> >
>> >> Is there anything stopping us from using a different list, segregated
>> >>from
>> >> the dev list? The Github emails significantly reduce the signal-noise
>> >>ratio
>> >> of this list, and while it is possible (but annoying) to filter them
>> >>out in
>> >> our individual inboxes, it makes the archives of the list much less
>> >>usable
>> >> in many ways.
>> >>
>> >>
>> >> On Tue, Feb 18, 2014 at 2:20 PM, Aaron Davidson 
>> >> wrote:
>> >>
>> >> > This is due, unfortunately, to Apache policies that all
>> >> development-related
>> >> > discussion should take place on the dev list. As we are attempting to
>> >> > graduate from an incubating project to an Apache top level project,
>> >>there
>> >> > were some concerns raised about GitHub, and the fastest solution to
>> >>avoid
>> >> > conflict related to our graduation was to CC dev@ for all GitHub
>> >> messages.
>> >> > Once our graduation is complete, we may be able to find a less noisy
>> >>way
>> >> of
>> >> > dealing with these messages.
>> >> >
>> >> > In the meantime, one simple solution is to filter out all messages
>> >>that
>> >> > come from g...@git.apache.org and are destined to
>> >> > dev@spark.incubator.apache.org.
>> >> >
>> >> >
>> >> > On Tue, Feb 18, 2014 at 10:04 AM, Gerard Maas 
>> >> > wrote:
>> >> >
>> >> > > +1 please.
>> >> > >
>> >> > >
>> >> > > On Tue, Feb 18, 2014 at 6:04 PM, Michael Ernest wrote:
>> >> > >
>> >> > > > +1
>> >> > > >
>> >> > > >
>> >> > > > On Tue, Feb 18, 2014 at 8:24 AM, Heiko Braun <
>> >> ike.br...@googlemail.com
>> >> > > > >wrote:
>> >> > > >
>> >> > > > >
>> >> > > > >
>> >> > > > > Wouldn't it be better to move the github messages to a dedicated
>> >> > email
>> >> > > > > list?
>> >> > > > >
>> >> > > > > Regards, Heiko
>> >> > > > >
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > --
>> >> > > > Michael Ernest
>> >> > > > Sr. 

[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35699760
  
 Merged build triggered.




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35699761
  
Merged build started.




Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-20 Thread Henry Saputra
Thanks for bringing back the build systems discussion, Patrick.
There was a long discussion way back, before Spark joined the ASF, and as I
remember there was no clear "winner" between sbt and maven.

Maven makes it easier to publish the artifacts to a Nexus repository;
I am not sure whether sbt can do the same. As I remember, one of the
drawbacks of maven is its reliance on profiles.
Matei had suggested using some kind of Hadoop client detection, as in the
Parquet project, to manage the Hadoop versions and avoid profiles.


- Henry

On Thu, Feb 20, 2014 at 8:03 PM, Patrick Wendell  wrote:
> Hey All,
>
> It's very high overhead having two build systems in Spark. Before
> getting into a long discussion about the merits of sbt vs maven, I
> wanted to pose a simple question to the dev list:
>
> Is there anyone who feels that dropping either sbt or maven would have
> a major consequence for them?
>
> And I say "major consequence" meaning something becomes completely
> impossible now and can't be worked around. This is different from an
> "inconvenience", i.e., something which can be worked around but will
> require some investment.
>
> I'm posing the question in this way because, if there are features in
> either build system that are absolutely-un-available in the other,
> then we'll have to maintain both for the time being. I'm merely trying
> to see whether this is the case...
>
> - Patrick


Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-20 Thread Mark Hamstra
Is dropping Maven an option, or must we have it to comply with the Apache
release process?


On Thu, Feb 20, 2014 at 8:03 PM, Patrick Wendell  wrote:

> Hey All,
>
> It's very high overhead having two build systems in Spark. Before
> getting into a long discussion about the merits of sbt vs maven, I
> wanted to pose a simple question to the dev list:
>
> Is there anyone who feels that dropping either sbt or maven would have
> a major consequence for them?
>
> And I say "major consequence" meaning something becomes completely
> impossible now and can't be worked around. This is different from an
> "inconvenience", i.e., something which can be worked around but will
> require some investment.
>
> I'm posing the question in this way because, if there are features in
> either build system that are absolutely-un-available in the other,
> then we'll have to maintain both for the time being. I'm merely trying
> to see whether this is the case...
>
> - Patrick
>


[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35697469
  
Merged build finished.




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35697471
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12790/




[DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-20 Thread Patrick Wendell
Hey All,

It's very high overhead having two build systems in Spark. Before
getting into a long discussion about the merits of sbt vs maven, I
wanted to pose a simple question to the dev list:

Is there anyone who feels that dropping either sbt or maven would have
a major consequence for them?

And I say "major consequence" meaning something becomes completely
impossible now and can't be worked around. This is different from an
"inconvenience", i.e., something which can be worked around but will
require some investment.

I'm posing the question in this way because, if there are features in
either build system that are absolutely-un-available in the other,
then we'll have to maintain both for the time being. I'm merely trying
to see whether this is the case...

- Patrick


[GitHub] incubator-spark pull request: Deprecated and added a few java api ...

2014-02-20 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/402#discussion_r9935572
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
@@ -73,7 +74,7 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
* of the original partition.
*/
   def mapPartitionsWithIndex[R: ClassTag](
-  f: JFunction2[Int, java.util.Iterator[T], java.util.Iterator[R]],
+  f: JFunction2[Integer, java.util.Iterator[T], java.util.Iterator[R]],
--- End diff --

Well, I think there was some context to that comment. Earlier, right above
that line, we were importing java.lang.Integer and then using it. He asked me
to remove the import and write java.lang.Integer instead, but later I
discovered that the import was not necessary. If you think spelling it out
explicitly is better, that can be done in a moment.


On Fri, Feb 21, 2014 at 7:26 AM, Patrick Wendell wrote:

> In core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala:
>
> > @@ -73,7 +74,7 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
> > * of the original partition.
> > */
> >    def mapPartitionsWithIndex[R: ClassTag](
> > -  f: JFunction2[Int, java.util.Iterator[T], java.util.Iterator[R]],
> > +  f: JFunction2[Integer, java.util.Iterator[T], java.util.Iterator[R]],
>
> @ScrapCodes I think @rxin is suggesting that you should actually write
> java.lang.Integer to make it more explicit.
>
> --
> Reply to this email directly or view it on GitHub.
>



-- 
Prashant




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35695580
  
 Merged build triggered.




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/624#issuecomment-35695581
  
Merged build started.




[GitHub] incubator-spark pull request: External spilling - fix Int.MaxValue...

2014-02-20 Thread andrewor14
GitHub user andrewor14 opened a pull request:

https://github.com/apache/incubator-spark/pull/624

External spilling - fix Int.MaxValue hash code collision bug

The original poster of this bug is @guojc, who opened a PR that preceded 
this one at https://github.com/apache/incubator-spark/pull/612.

ExternalAppendOnlyMap uses key hash code to order the buffer streams from 
which spilled files are read back into memory. When a buffer stream is empty, 
the default hash code for that stream is equal to Int.MaxValue. This is, 
however, a perfectly legitimate candidate for a key hash code. When reading 
from a spilled map containing such a key, a hash collision may occur, in which 
case we attempt to read from an empty stream and throw NoSuchElementException.

The fix is to maintain the invariant that empty buffer streams are never 
added back to the queue to be considered. This guarantees that we never read 
from an empty buffer stream, ever again.
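
The invariant described above can be illustrated with a small, self-contained
sketch. This is not the code from the pull request: `SpillMergeSketch`, its
`StreamBuffer` class and the `merge` method are hypothetical stand-ins, each
spilled stream is modelled as an in-memory sequence of (hash, value) pairs
already sorted by hash, and combining of equal keys is left out. It only shows
why a buffer that has run dry must never go back into the heap when its
default hash is Int.MaxValue.

```scala
import scala.collection.mutable

object SpillMergeSketch {
  // Hypothetical stand-in for a buffered view over one spilled, hash-sorted stream.
  final class StreamBuffer(val pairs: mutable.Queue[(Int, String)]) {
    def isEmpty: Boolean = pairs.isEmpty
    // An empty buffer reports Int.MaxValue, which is also a legitimate key hash.
    def minKeyHash: Int = if (pairs.isEmpty) Int.MaxValue else pairs.head._1
  }

  // PriorityQueue is a max-heap, so reverse the ordering to pop the smallest hash first.
  private val byMinHash: Ordering[StreamBuffer] =
    Ordering.by[StreamBuffer, Int](_.minKeyHash).reverse

  /** Merge hash-sorted streams, yielding (hash, value) pairs in hash order. */
  def merge(streams: Seq[Seq[(Int, String)]]): List[(Int, String)] = {
    val heap = mutable.PriorityQueue.empty[StreamBuffer](byMinHash)
    // Invariant: only non-empty buffers ever enter the heap.
    streams.foreach { s =>
      val buf = new StreamBuffer(mutable.Queue(s: _*))
      if (!buf.isEmpty) heap.enqueue(buf)
    }
    val out = List.newBuilder[(Int, String)]
    while (heap.nonEmpty) {
      val buf = heap.dequeue()
      out += buf.pairs.dequeue()          // safe: buf is non-empty by the invariant
      if (!buf.isEmpty) heap.enqueue(buf) // drained buffers are dropped, not re-queued
    }
    out.result()
  }
}
```

If the two `isEmpty` guards were removed, an empty buffer would sit in the heap
with hash Int.MaxValue alongside a real key carrying that same hash, and
dequeuing from it would throw NoSuchElementException, which is the failure mode
the description reports.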

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewor14/incubator-spark spilling-bug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/624.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #624


commit 21c1a39ebe7d0f10519621f3ef54aa6e89c08441
Author: Andrew Or 
Date:   2014-02-21T02:04:27Z

Add hash collision tests to ExternalAppendOnlyMapSuite

As of now, the test "spilling with hash collisions using the Int.MaxValue key" fails.
Fixing this behavior is the main goal of this PR.

commit c11f03b6e6e82617a826dc3acbd09a52760f143b
Author: Andrew Or 
Date:   2014-02-21T02:58:18Z

Fix Int.MaxValue hash collision bug in ExternalAppendOnlyMap

The solution is to maintain the invariant that mergeHeap contains only
non-empty StreamBuffers at the time next() is called.






[GitHub] incubator-spark pull request: Fix ExternalMap on case of key's has...

2014-02-20 Thread guojc
Github user guojc commented on the pull request:

https://github.com/apache/incubator-spark/pull/612#issuecomment-35694101
  
Yes, I'm Jiacheng Guo. It's fine if you can find another good solution for
this bug. Thanks for your work, guys.




[GitHub] incubator-spark pull request: Fix ExternalMap on case of key's has...

2014-02-20 Thread guojc
Github user guojc closed the pull request at:

https://github.com/apache/incubator-spark/pull/612




[GitHub] incubator-spark pull request: Allow PySpark to use existing JVM an...

2014-02-20 Thread ahirreddy
Github user ahirreddy commented on the pull request:

https://github.com/apache/incubator-spark/pull/622#issuecomment-35692179
  
https://spark-project.atlassian.net/browse/SPARK-1114




[GitHub] incubator-spark pull request: Allow PySpark to use existing JVM an...

2014-02-20 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/622#issuecomment-35691749
  
@ahirreddy Could you create a JIRA for this? Thanks




[GitHub] incubator-spark pull request: Deprecated and added a few java api ...

2014-02-20 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/402#discussion_r9934007
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
@@ -73,7 +74,7 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
* of the original partition.
*/
   def mapPartitionsWithIndex[R: ClassTag](
-  f: JFunction2[Int, java.util.Iterator[T], java.util.Iterator[R]],
+  f: JFunction2[Integer, java.util.Iterator[T], java.util.Iterator[R]],
--- End diff --

@ScrapCodes I think @rxin is suggesting that you should actually write 
`java.lang.Integer` to make it more explicit.
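
For readers following the review, here is a rough sketch of what the fully
qualified spelling looks like in a signature. `Fn2` and `BoxedIndexSketch` are
hypothetical stand-ins, not Spark's `JFunction2` or `JavaRDDLike`, and
partitions are modelled as plain sequences. Scala imports `java.lang._` by
default, so a bare `Integer` already means `java.lang.Integer`; writing it out
only makes it obvious to the reader that the Java-facing type is the boxed Java
class rather than Scala's `Int`.

```scala
import java.util.{ArrayList => JArrayList, Iterator => JIterator}

// Hypothetical stand-in for a Java-friendly two-argument function interface.
trait Fn2[A, B, R] extends Serializable {
  def call(a: A, b: B): R
}

object BoxedIndexSketch {
  // The index parameter is spelled java.lang.Integer so the boxed Java type is explicit.
  def mapPartitionsWithIndex[T, R](
      partitions: Seq[Seq[T]],
      f: Fn2[java.lang.Integer, JIterator[T], JIterator[R]]): Seq[R] = {
    val out = Seq.newBuilder[R]
    partitions.zipWithIndex.foreach { case (part, idx) =>
      // Copy the partition into a Java collection to hand the callback a java.util.Iterator.
      val in = new JArrayList[T]()
      part.foreach(x => in.add(x))
      val results = f.call(java.lang.Integer.valueOf(idx), in.iterator())
      while (results.hasNext) out += results.next()
    }
    out.result()
  }
}
```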




[GitHub] incubator-spark pull request: Deprecated and added a few java api ...

2014-02-20 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/402#discussion_r9933178
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala ---
@@ -73,7 +74,7 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
* of the original partition.
*/
   def mapPartitionsWithIndex[R: ClassTag](
-  f: JFunction2[Int, java.util.Iterator[T], java.util.Iterator[R]],
+  f: JFunction2[Integer, java.util.Iterator[T], java.util.Iterator[R]],
--- End diff --

yep, could you do this?




[GitHub] incubator-spark pull request: [SPARK-1094] Support MiMa for report...

2014-02-20 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/585#discussion_r9933131
  
--- Diff: project/MimaBuild.scala ---
@@ -0,0 +1,105 @@
+import com.typesafe.tools.mima.plugin.MimaKeys.{binaryIssueFilters, previousArtifact}
+import com.typesafe.tools.mima.plugin.MimaPlugin.mimaDefaultSettings
+
+object MimaBuild {
+
+  val ignoredABIProblems = {
+    import com.typesafe.tools.mima.core._
+    import com.typesafe.tools.mima.core.ProblemFilters._
+    /**
+     * A: Detections are semi private or likely to become semi private at some point.
+     */
+    Seq(exclude[MissingClassProblem]("org.apache.spark.util.XORShiftRandom"),
+      exclude[MissingClassProblem]("org.apache.spark.util.XORShiftRandom$"),
+      exclude[MissingMethodProblem]("org.apache.spark.util.Utils.cloneWritables"),
+      // Scheduler is not considered a public API.
+      excludePackage("org.apache.spark.deploy"),
+      // Was made private in 1.0
--- End diff --

It's sort of a hack but you can exclude these as packages like this:

```
excludePackage("org.apache.spark.util.collection.ExternalAppendOnlyMap#DiskMapIterator")
excludePackage("org.apache.spark.util.collection.ExternalAppendOnlyMap#ExternalIterator")
```




[GitHub] incubator-spark pull request: Super minor: Add require for mergeCo...

2014-02-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-spark/pull/623




[GitHub] incubator-spark pull request: Super minor: Add require for mergeCo...

2014-02-20 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/623#issuecomment-35687583
  
Thanks Aaron, looks good. I'll merge this into master and 0.9.




[GitHub] incubator-spark pull request: Super minor: Add require for mergeCo...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/623#issuecomment-35686973
  
Merged build finished.




[GitHub] incubator-spark pull request: Super minor: Add require for mergeCo...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/623#issuecomment-35686975
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12789/




[GitHub] incubator-spark pull request: Super minor: Add require for mergeCo...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/623#issuecomment-35685133
  
Merged build started.




[GitHub] incubator-spark pull request: Super minor: Add require for mergeCo...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/623#issuecomment-35685132
  
 Merged build triggered.




[GitHub] incubator-spark pull request: SPARK-929: Fully deprecate usage of ...

2014-02-20 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/615#issuecomment-35684716
  
Hm actually sorry that was totally wrong. Who uses this script externally 
at all? Why don't we just _not_ document this...




[GitHub] incubator-spark pull request: SPARK-929: Fully deprecate usage of ...

2014-02-20 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/615#issuecomment-35684474
  
I think SPARK_CLIENT_MEMORY isn't so hot either because most often 
`spark-class` isn't used to run a client, it's most often used by users to run 
examples. Maybe SPARK_CLASS_MEMORY? @asfgit 




Re: Signal/Noise Ratio

2014-02-20 Thread Xiangrui Meng
+1 If someone replies to a GitHub thread via spark-dev, it won't show up
on GitHub and it gets filtered by most people. -Xiangrui

On Thu, Feb 20, 2014 at 3:38 PM, Patrick Wendell  wrote:
> I'd personally like to see this go to a separate list.
>
> Until then I'd strongly recommend using filters to get rid of them.
> In gmail it's trivial...
>
> On Thu, Feb 20, 2014 at 1:07 PM, Ethan Jewett  wrote:
>> That would be fine. I would just like the problem fixed. The list has gone
>> from being a consistently pretty interesting and content-heavy read to
>> being a trudge to go through and attempt to extract the relevant
>> information from every day.
>>
>>
>> On Thu, Feb 20, 2014 at 3:01 PM, Andrew Ash  wrote:
>>
>>> I'm fine with keeping the GitHub traffic if we can
>>>
>>> a) take away the Jenkins build started / build finished / build succeeded /
>>> build failed messages.  Those aren't "dev discussion" and are very noisy.
>>>  I don't think they help anyone, and people who care about those for a
>>> particular PR (because they're a reviewer or author on it) are already
>>> subscribed through GitHub.
>>> b) change the format of the emails that are sent out; I find them very
>>> poorly formatted.  I'd prefer no deep tab for the message.
>>>
>>>
>>> http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3c20140210192901.ce834922...@tyr.zones.apache.org%3E
>>>
>>> FWIW I'm filtering all emails from g...@git.apache.org straight to trash
>>> right now because of the noise.
>>>
>>>
>>> On Thu, Feb 20, 2014 at 12:51 PM, Mattmann, Chris A (3980) <
>>> chris.a.mattm...@jpl.nasa.gov> wrote:
>>>
>>> > Guys,
>>> >
>>> > Whether you are a TLP or not the big key here is making sure that
>>> > dev discussion does not happen elsewhere outside of the list. You
>>> > can create e.g., a github-dev@spark.a.o list, but you will need
>>> > to make sure that:
>>> >
>>> > a) if dev discussion is happening there that it gets flowed up to
>>> > dev@spark.a.o. All development discussion must appear on the dev
>>> > list and must be traceable as a project discussion and decisions
>>> > appear on the list(s).
>>> >
>>> > b) automated/etc. email is simply that, and there isn't a ton of
>>> > discussion going on on those github emails, and that it's mostly
>>> > going on on the dev@spark.a.o list.
>>> >
>>> > If you can meet those 2 criteria/litmus test, I think it's fine.
>>> > The big concern is that if the discussion is not happening elsewhere,
>>> > then the decisions made for Apache Spark are based on information
>>> > that isn't co-located with the Apache Spark project. So that's the
>>> > thing that the PMC needs to keep in mind (note I said PMC now, yay!) :)
>>> >
>>> > Cheers and just keep the above in mind and you'll be good.
>>> >
>>> > Cheers,
>>> > Chris
>>> >
>>> >
>>> >
>>> >
>>> > -----Original Message-----
>>> > From: Andy Konwinski 
>>> > Reply-To: "dev@spark.incubator.apache.org" <
>>> dev@spark.incubator.apache.org
>>> > >
>>> > Date: Thursday, February 20, 2014 12:36 PM
>>> > To: "dev@spark.incubator.apache.org" 
>>> > Subject: Re: Signal/Noise Ratio
>>> >
>>> > >That is a very valid point about the list archives (which a mail filter
>>> > >doesn't address and which impacts the community in a negative way).
>>> > >
>>> > >As of today we are a Top Level Project so I think we have a little more
>>> > >autonomy for this sort of dev vs separate list decision.
>>> > >
>>> > >
>>> > >On Thu, Feb 20, 2014 at 12:15 PM, Ethan Jewett 
>>> > wrote:
>>> > >
>>> > >> Is there anything stopping us from using a different list, segregated
>>> > >>from
>>> > >> the dev list? The Github emails significantly reduce the signal-noise
>>> > >>ratio
>>> > >> of this list, and while it is possible (but annoying) to filter them
>>> > >>out in
>>> > >> our individual inboxes, it makes the archives of the list much less
>>> > >>usable
>>> > >> in many ways.
>>> > >>
>>> > >>
>>> > >> On Tue, Feb 18, 2014 at 2:20 PM, Aaron Davidson 
>>> > >> wrote:
>>> > >>
>>> > >> > This is due, unfortunately, to Apache policies that all
>>> > >> development-related
>>> > >> > discussion should take place on the dev list. As we are attempting
>>> to
>>> > >> > graduate from an incubating project to an Apache top level project,
>>> > >>there
>>> > >> > were some concerns raised about GitHub, and the fastest solution to
>>> > >>avoid
>>> > >> > conflict related to our graduation was to CC dev@ for all GitHub
>>> > >> messages.
>>> > >> > Once our graduation is complete, we may be able to find a less noisy
>>> > >>way
>>> > >> of
>>> > >> > dealing with these messages.
>>> > >> >
>>> > >> > In the meantime, one simple solution is to filter out all messages
>>> > >>that
>>> > >> > come from g...@git.apache.org and are destined to
>>> > >> > dev@spark.incubator.apache.org.
>>> > >> >
>>> > >> >
>>> > >> > On Tue, Feb 18, 2014 at 10:04 AM, Gerard Maas <
>>> gerard.m...@gmail.com>
>>> > >> > wrote:
>>> > >> >
>>> > >> > > +1 please.
>>> > >> > >
>>> 

[GitHub] incubator-spark pull request: SPARK-929: Fully deprecate usage of ...

2014-02-20 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/incubator-spark/pull/615#issuecomment-35684291
  
Right. What I mean is that calling the variable SPARK_DRIVER_MEMORY might 
be confusing in the context of yarn-standalone because its value would apply to 
the client and not the driver (if that's the right terminology).  Would 
SPARK_CLIENT_MEMORY possibly make more sense?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. To do so, please top-post your response.
If your project does not have this feature enabled and wishes so, or if the
feature is enabled but not working, please contact infrastructure at
infrastruct...@apache.org or file a JIRA ticket with INFRA.
---


[GitHub] incubator-spark pull request: Super minor: Add require for mergeCo...

2014-02-20 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/623#discussion_r9930713
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -77,6 +77,7 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
   partitioner: Partitioner,
   mapSideCombine: Boolean = true,
   serializerClass: String = null): RDD[(K, C)] = {
+require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
--- End diff --

fyi: this line is <100ch.




[GitHub] incubator-spark pull request: Super minor: Add require for mergeCo...

2014-02-20 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/incubator-spark/pull/623

Super minor: Add require for mergeCombiners in combineByKey

In 0.9.0 we changed the behavior from requiring that mergeCombiners be null 
when mapSideCombine was false to requiring that mergeCombiners *never* be null, 
for external sorting. This patch adds a require() so that the behavior change 
produces an explicit error message rather than an NPE.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/incubator-spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/623.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #623


commit 520b80c7bef100e7b1c2b0fb6388569ac0335681
Author: Aaron Davidson 
Date:   2014-02-20T23:41:20Z

Super minor: Add require for mergeCombiners in combineByKey

In 0.9.0 we changed the behavior from requiring that mergeCombiners
be null when mapSideCombine was false to requiring that mergeCombiners
*never* be null, for external sorting. This patch adds a require()
so that the behavior change produces an explicit error message rather
than an NPE.
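
For readers who want to see the effect of the check in isolation, here is a
minimal sketch. It is not the actual PairRDDFunctions code: CombineSketch and
its combine method are hypothetical stand-ins for combineByKey that only model
the null check and a two-way merge.

```scala
// Minimal sketch, not the real Spark implementation. `CombineSketch.combine`
// is a hypothetical stand-in for combineByKey on a single key.
object CombineSketch {
  def combine[V, C](
      values: Seq[V],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Option[C] = {
    // Fail fast with a descriptive message instead of a later NullPointerException.
    require(mergeCombiners != null, "mergeCombiners must be defined")

    // Split the input into two "partitions" so that mergeCombiners is exercised.
    val (left, right) = values.splitAt(values.length / 2)

    // Combine one partition: seed with createCombiner, then fold with mergeValue.
    def fold(part: Seq[V]): Option[C] =
      part.headOption.map(h => part.tail.foldLeft(createCombiner(h))(mergeValue))

    (fold(left), fold(right)) match {
      case (Some(a), Some(b)) => Some(mergeCombiners(a, b))
      case (a, b)             => a.orElse(b)
    }
  }
}
```

With a null mergeCombiners the call fails immediately with an
IllegalArgumentException carrying the message from require(), which is the
behavior the patch is after.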






Re: [GitHub] incubator-spark pull request: MLLIB-24: url of "Collaborative Filt...

2014-02-20 Thread Xiangrui Meng
Just want to test whether this message will be forwarded to github. -Xiangrui

On Wed, Feb 19, 2014 at 11:00 PM, asfgit  wrote:
> Github user asfgit closed the pull request at:
>
> https://github.com/apache/incubator-spark/pull/619


Re: Problem with akka.frameSize

2014-02-20 Thread Patrick Wendell
Thanks for this bug report... we'll look into this!


On Thu, Feb 20, 2014 at 8:39 AM, Guillaume Pitel  wrote:

>  Jira ticket created https://spark-project.atlassian.net/browse/SPARK-1112
>
> Guillaume
>
> Hi,
>
> I've sent a few emails to the user mailing list, but since I believe this
> is a bug, I think it's time to talk to the developers.
>
> So here is what happens: since we've migrated from 0.8.1 to 0.9, whatever
> the value of spark.akka.frameSize I set, the Executors lock when a collect
> tries to send more than 10MB of data to the driver.
>
> Since 10MB is the spark default, I suspect it could be related to
> something in the configuration. We still use System.setProperty to set the
> frameSize.
>
> As a workaround, setting the frameSize back to 10 seems to work.
>
> Guillaume
>
>
> --
>[image: eXenSa]
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80
>
> eXenSa S.A.S. 
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>


Re: Signal/Noise Ratio

2014-02-20 Thread Patrick Wendell
I'd personally like to see this go to a separate list.

Until then I'd strongly recommend using filters to get rid of them.
In gmail it's trivial...
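
As a rough illustration of what such a filter matches, here is a minimal
sketch. Message and ListFilter are hypothetical names; the "[GitHub]" subject
prefix and the AmplabJenkins author are taken from the notification emails in
this archive, and a real mail client would express the same rule in its own
filter syntax.

```scala
// Minimal sketch of a client-side filter. `Message` is a hypothetical type,
// not part of any mail library.
object ListFilter {
  final case class Message(subject: String, author: String)

  // Automated GitHub notifications carry this subject prefix. Human replies
  // ("Re: [GitHub] ...") do not start with it, so they are kept.
  def isGitHubNotification(m: Message): Boolean =
    m.subject.startsWith("[GitHub]")

  // Jenkins build-status comments are posted by this GitHub user.
  def isJenkinsNoise(m: Message): Boolean =
    isGitHubNotification(m) && m.author == "AmplabJenkins"

  // Keep everything except automated notifications.
  def discussionOnly(ms: Seq[Message]): Seq[Message] =
    ms.filterNot(isGitHubNotification)
}
```

Filtering only on isJenkinsNoise instead would keep review comments and drop
just the build-status messages.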

On Thu, Feb 20, 2014 at 1:07 PM, Ethan Jewett  wrote:
> That would be fine. I would just like the problem fixed. The list has gone
> from being a consistently pretty interesting and content-heavy read to
> being a trudge to go through and attempt to extract the relevant
> information from every day.
>
>
> On Thu, Feb 20, 2014 at 3:01 PM, Andrew Ash  wrote:
>
>> I'm fine with keeping the GitHub traffic if we can
>>
>> a) take away the Jenkins build started / build finished / build succeeded /
>> build failed messages.  Those aren't "dev discussion" and are very noisy.
>>  I don't think they help anyone, and people who care about those for a
>> particular PR (because they're a reviewer or author on it) are already
>> subscribed through GitHub.
>> b) change the format of the emails that are sent out; I find them very
>> poorly formatted.  I'd prefer no deep tab for the message.
>>
>>
>> http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3c20140210192901.ce834922...@tyr.zones.apache.org%3E
>>
>> FWIW I'm filtering all emails from g...@git.apache.org straight to trash
>> right now because of the noise.
>>
>>
>> On Thu, Feb 20, 2014 at 12:51 PM, Mattmann, Chris A (3980) <
>> chris.a.mattm...@jpl.nasa.gov> wrote:
>>
>> > Guys,
>> >
>> > Whether you are a TLP or not the big key here is making sure that
>> > dev discussion does not happen elsewhere outside of the list. You
>> > can create e.g., a github-dev@spark.a.o list, but you will need
>> > to make sure that:
>> >
>> > a) if dev discussion is happening there that it gets flowed up to
>> > dev@spark.a.o. All development discussion must appear on the dev
>> > list and must be traceable as a project discussion and decisions
>> > appear on the list(s).
>> >
>> > b) automated/etc. email is simply that, and there isn't a ton of
>> > discussion going on on those github emails, and that it's mostly
>> > going on on the dev@spark.a.o list.
>> >
>> > If you can meet those 2 criteria/litmus test, I think it's fine.
>> > The big concern is that if the discussion is not happening elsewhere,
>> > then the decisions made for Apache Spark are based on information
>> > that isn't co-located with the Apache Spark project. So that's the
>> > thing that the PMC needs to keep in mind (note I said PMC now, yay!) :)
>> >
>> > Cheers and just keep the above in mind and you'll be good.
>> >
>> > Cheers,
>> > Chris
>> >
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Andy Konwinski 
>> > Reply-To: "dev@spark.incubator.apache.org" <
>> dev@spark.incubator.apache.org
>> > >
>> > Date: Thursday, February 20, 2014 12:36 PM
>> > To: "dev@spark.incubator.apache.org" 
>> > Subject: Re: Signal/Noise Ratio
>> >
>> > >That is a very valid point about the list archives (which a mail filter
>> > >doesn't address and which impacts the community in a negative way).
>> > >
>> > >As of today we are a Top Level Project so I think we have a little more
>> > >autonomy for this sort of dev vs separate list decision.
>> > >
>> > >
>> > >On Thu, Feb 20, 2014 at 12:15 PM, Ethan Jewett 
>> > wrote:
>> > >
>> > >> Is there anything stopping us from using a different list, segregated
>> > >>from
>> > >> the dev list? The Github emails significantly reduce the signal-noise
>> > >>ratio
>> > >> of this list, and while it is possible (but annoying) to filter them
>> > >>out in
>> > >> our individual inboxes, it makes the archives of the list much less
>> > >>usable
>> > >> in many ways.
>> > >>
>> > >>
>> > >> On Tue, Feb 18, 2014 at 2:20 PM, Aaron Davidson 
>> > >> wrote:
>> > >>
>> > >> > This is due, unfortunately, to Apache policies that all
>> > >> development-related
>> > >> > discussion should take place on the dev list. As we are attempting
>> to
>> > >> > graduate from an incubating project to an Apache top level project,
>> > >>there
>> > >> > were some concerns raised about GitHub, and the fastest solution to
>> > >>avoid
>> > >> > conflict related to our graduation was to CC dev@ for all GitHub
>> > >> messages.
>> > >> > Once our graduation is complete, we may be able to find a less noisy
>> > >>way
>> > >> of
>> > >> > dealing with these messages.
>> > >> >
>> > >> > In the meantime, one simple solution is to filter out all messages
>> > >>that
>> > >> > come from g...@git.apache.org and are destined to
>> > >> > dev@spark.incubator.apache.org.
>> > >> >
>> > >> >
>> > >> > On Tue, Feb 18, 2014 at 10:04 AM, Gerard Maas <
>> gerard.m...@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > +1 please.
>> > >> > >
>> > >> > >
>> > >> > > On Tue, Feb 18, 2014 at 6:04 PM, Michael Ernest
>> > >>> > >> > > >wrote:
>> > >> > >
>> > >> > > > +1
>> > >> > > >
>> > >> > > >
>> > >> > > > On Tue, Feb 18, 2014 at 8:24 AM, Heiko Braun <
>> > >> ike.br...@googlemail.com
>> > >> > > > >wrote:
>> > >> > > >
>> > >> > > > >
>> > >> > > > >

[GitHub] incubator-spark pull request: Fix ExternalMap on case of key's has...

2014-02-20 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/incubator-spark/pull/612#issuecomment-35681966
  
@guojc Hey, we discussed this a little more and we thought of a 
different way of solving this that also simplifies some of the existing logic. 
I will create a new PR and make the appropriate changes there.

This is a serious bug and we intend to add it back into the 0.9 release. 
Thanks and I'll be sure to include you in the credits.




[GitHub] incubator-spark pull request: Fix ExternalMap on case of key's has...

2014-02-20 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/612#issuecomment-35681623
  
@guojc - hey since Andrew may propose a slightly different fix, I want to 
make sure you are credited with this in our release notes. Are you Jiacheng 
Guo? Found this name looking at twitter.




[GitHub] incubator-spark pull request: doctest updated for mapValues, flatM...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/621#issuecomment-35680739
  
Merged build started.




[GitHub] incubator-spark pull request: doctest updated for mapValues, flatM...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/621#issuecomment-35680738
  
 Merged build triggered.




[GitHub] incubator-spark pull request: Allow PySpark to use existing JVM an...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/622#issuecomment-35680688
  
Merged build finished.




[GitHub] incubator-spark pull request: Allow PySpark to use existing JVM an...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/622#issuecomment-35680690
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12787/




[GitHub] incubator-spark pull request: doctest updated for mapValues, flatM...

2014-02-20 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/621#issuecomment-35680133
  
Jenkins, test this please.




[GitHub] incubator-spark pull request: Allow PySpark to use existing JVM an...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/622#issuecomment-35678138
  
Merged build started.




[GitHub] incubator-spark pull request: Allow PySpark to use existing JVM an...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/622#issuecomment-35678137
  
 Merged build triggered.




[GitHub] incubator-spark pull request: Allow PySpark to use existing JVM an...

2014-02-20 Thread ahirreddy
GitHub user ahirreddy opened a pull request:

https://github.com/apache/incubator-spark/pull/622

Allow PySpark to use existing JVM and Gateway

Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark 
implementation of SparkConf to take existing SparkConf JVM handle. Change to 
PySpark SparkContext to allow subclass specific context initialization.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ahirreddy/incubator-spark pyspark-existing-jvm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/622.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #622


commit a86f45721c7f009adb3fc7070d3641569e999ffd
Author: Ahir Reddy 
Date:   2014-02-20T22:13:40Z

Patch to allow PySpark to use existing JVM and Gateway. Changes to
PySpark implementation of SparkConf to take existing SparkConf JVM
handle. Change to PySpark SparkContext to allow subclass specific
context initialization.






[GitHub] incubator-spark pull request: Add Security to Spark - Akka, Http, ...

2014-02-20 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/332#discussion_r9926321
  
--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.security.Credentials
+import org.apache.hadoop.security.UserGroupInformation
+
+import org.apache.spark.deploy.SparkHadoopUtil
+
+/** 
+ * Spark class responsible for security.  
+ */
+private[spark] class SecurityManager extends Logging {
+
+  private val isAuthOn = System.getProperty("spark.authenticate", 
"false").toBoolean
+  private val isUIAuthOn = System.getProperty("spark.authenticate.ui", 
"false").toBoolean
+  private val viewAcls = System.getProperty("spark.ui.view.acls", 
"").split(',').map(_.trim()).toSet
+  private val secretKey = generateSecretKey()
+  logDebug("is auth enabled = " + isAuthOn + " is uiAuth enabled = " + 
isUIAuthOn)
+ 
+  /**
+   * In Yarn mode it uses Hadoop UGI to pass the secret as that
+   * will keep it protected.  For a standalone SPARK cluster
+   * use a environment variable SPARK_SECRET to specify the secret.
+   * This probably isn't ideal but only the user who starts the process
+   * should have access to view the variable (at least on Linux).
+   * Since we can't set the environment variable we set the 
+   * java system property SPARK_SECRET so it will automatically
+   * generate a secret is not specified.  This definitely is not
+   * ideal since users can see it. We should switch to put it in 
+   * a config.
+   */
+  private def generateSecretKey(): String = {
+
+if (!isAuthenticationEnabled) return null
+// first check to see if secret already set, else generate it
+if (SparkHadoopUtil.get.isYarnMode) {
+  val credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
+  if (credentials != null) { 
+val secretKey = credentials.getSecretKey(new Text("akkaCookie"))
+if (secretKey != null) {
+  logDebug("in yarn mode, getting secret from credentials")
+  return new Text(secretKey).toString
+} else {
+  logDebug("getSecretKey: yarn mode, secret key from credentials 
is null")
+}
+  } else {
+logDebug("getSecretKey: yarn mode, credentials are null")
+  }
+}
+val secret = System.getProperty("SPARK_SECRET", 
System.getenv("SPARK_SECRET")) 
+if (secret != null && !secret.isEmpty()) return secret 
+// generate one 
+val sCookie = akka.util.Crypt.generateSecureCookie
+
+// if we generate we must be the first so lets set it so its used by 
everyone else
+if (SparkHadoopUtil.get.isYarnMode) {
+  val creds = new Credentials()
+  creds.addSecretKey(new Text("akkaCookie"), sCookie.getBytes())
+  SparkHadoopUtil.get.addCurrentUserCredentials(creds)
+  logDebug("adding secret to credentials yarn mode")
+} else {
+  System.setProperty("SPARK_SECRET", sCookie)
+  logDebug("adding secret to java property")
+}
+return sCookie
--- End diff --

The problem here is that this is not really a difference in behavior between 
Hadoop versions, which is what I think of SparkHadoopUtil as being used for. 
It's either a YARN deployment or, for instance, a standalone deployment. We can 
move a bit of the logic into SparkHadoopUtil, like the code inside the 
isYarnMode blocks, but we would still need the YARN check or need to abstract 
that out somewhere else.

It sounds like we are going to want to add better support for this for the 
standalone deploy, so for now I suggest we leave this as is until we get a 
better idea of how that is going to work.
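
To make the non-YARN part of the lookup order in the diff easier to follow,
here is a minimal sketch. SecretLookup, resolve, and generate are hypothetical
names, the YARN credentials branch is left out, and generate() is only a
stand-in for akka.util.Crypt.generateSecureCookie, not its implementation.

```scala
import java.security.SecureRandom

// Minimal sketch of the standalone lookup order: system property first, then
// environment variable, and a freshly generated secret as the last resort.
object SecretLookup {
  // Random hex string; a stand-in for akka.util.Crypt.generateSecureCookie.
  def generate(numBytes: Int = 32): String = {
    val bytes = new Array[Byte](numBytes)
    new SecureRandom().nextBytes(bytes)
    bytes.map(b => f"${b & 0xff}%02x").mkString
  }

  // Mirrors System.getProperty(key, System.getenv(key)) followed by the
  // non-empty check: an empty property value falls through to generation.
  def resolve(
      prop: Option[String],
      env: Option[String],
      fallback: () => String = () => generate()): String =
    prop.orElse(env).filter(_.nonEmpty).getOrElse(fallback())
}
```

A caller could pass sys.props.get("SPARK_SECRET") and
sys.env.get("SPARK_SECRET") as the first two arguments; keeping them as
parameters makes the precedence easy to test without touching process state.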



[GitHub] incubator-spark pull request: Fix ExternalMap on case of key's has...

2014-02-20 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/612#discussion_r9925806
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala
 ---
@@ -83,6 +83,28 @@ class ExternalAppendOnlyMapSuite extends FunSuite with 
BeforeAndAfter with Local
   (3, Set[Int](30
   }
 
+  test("insert with collision on hashCode Int.MaxValue") {
--- End diff --

I would rename this "spilling with..." instead of "insert with...". More 
details explained below.




[GitHub] incubator-spark pull request: Fix ExternalMap on case of key's has...

2014-02-20 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/incubator-spark/pull/612#issuecomment-35672119
  
Thanks again for finding this bug. A number of users have reported it 
before and we had not been able to provide a good answer, but this patch pretty 
much explains it.

I have left some comments regarding the corresponding test. In particular, 
the one you have now does not induce spilling, and so does not trigger the 
exception that you ran into. I left an example of how to make this happen.




[GitHub] incubator-spark pull request: Fix ExternalMap on case of key's has...

2014-02-20 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/612#discussion_r9925630
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala
 ---
@@ -83,6 +83,28 @@ class ExternalAppendOnlyMapSuite extends FunSuite with 
BeforeAndAfter with Local
   (3, Set[Int](30
   }
 
+  test("insert with collision on hashCode Int.MaxValue") {
+val conf = new SparkConf(false)
+sc = new SparkContext("local", "test", conf)
+
+val map = new ExternalAppendOnlyMap[Int, Int, 
ArrayBuffer[Int]](createCombiner,
+  mergeValue, mergeCombiners)
+
+map.insert(Int.MaxValue, 10)
+map.insert(2, 20)
+map.insert(3, 30)
+map.insert(Int.MaxValue, 100)
+map.insert(2, 200)
+map.insert(Int.MaxValue, 1000)
+val it = map.iterator
+assert(it.hasNext)
+val result = it.toSet[(Int, ArrayBuffer[Int])].map(kv => (kv._1, 
kv._2.toSet))
+assert(result == Set[(Int, Set[Int])](
+  (Int.MaxValue, Set[Int](10, 100, 1000)),
+  (2, Set[Int](20, 200)),
+  (3, Set[Int](30
--- End diff --

Even after setting the memory parameters, we still need to insert a lot 
into the map to induce spilling. I have been able to trigger the exception that 
you found with the following:

(1 until 10).foreach { i => map.insert(i, i) }
map.insert(Int.MaxValue, Int.MaxValue)

val it = map.iterator
while (it.hasNext) {
  it.next()
}




[GitHub] incubator-spark pull request: Fix ExternalMap on case of key's has...

2014-02-20 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/612#discussion_r9925460
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala
 ---
@@ -83,6 +83,28 @@ class ExternalAppendOnlyMapSuite extends FunSuite with 
BeforeAndAfter with Local
   (3, Set[Int](30
   }
 
+  test("insert with collision on hashCode Int.MaxValue") {
+val conf = new SparkConf(false)
+sc = new SparkContext("local", "test", conf)
+
--- End diff --

Looks like this test currently does not induce spilling. I would set up the 
memory constraints as follows:

val conf = new SparkConf()
conf.set("spark.shuffle.memoryFraction", "0.001")
sc = new SparkContext("local-cluster[1,1,512]", "test", conf)

(Note that in this test it is crucial for SparkConf to take in no 
arguments. This is a workaround for the hacky way we currently pass in 
environment variables in the tests)




Re: Signal/Noise Ratio

2014-02-20 Thread Ethan Jewett
That would be fine. I would just like the problem fixed. The list has gone
from being a consistently pretty interesting and content-heavy read to
being a trudge to go through and attempt to extract the relevant
information from every day.


On Thu, Feb 20, 2014 at 3:01 PM, Andrew Ash  wrote:

> I'm fine with keeping the GitHub traffic if we can
>
> a) take away the Jenkins build started / build finished / build succeeded /
> build failed messages.  Those aren't "dev discussion" and are very noisy.
>  I don't think they help anyone, and people who care about those for a
> particular PR (because they're a reviewer or author on it) are already
> subscribed through GitHub.
> b) change the format of the emails that are sent out; I find them very
> poorly formatted.  I'd prefer no deep tab for the message.
>
>
> http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3c20140210192901.ce834922...@tyr.zones.apache.org%3E
>
> FWIW I'm filtering all emails from g...@git.apache.org straight to trash
> right now because of the noise.

Re: Signal/Noise Ratio

2014-02-20 Thread Andrew Ash
I'm fine with keeping the GitHub traffic if we can

a) take away the Jenkins build started / build finished / build succeeded /
build failed messages.  Those aren't "dev discussion" and are very noisy.
 I don't think they help anyone, and people who care about those for a
particular PR (because they're a reviewer or author on it) are already
subscribed through GitHub.
b) change the format of the emails that are sent out; I find them very
poorly formatted.  I'd prefer no deep tab for the message.

http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3c20140210192901.ce834922...@tyr.zones.apache.org%3E

FWIW I'm filtering all emails from g...@git.apache.org straight to trash
right now because of the noise.


On Thu, Feb 20, 2014 at 12:51 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Guys,
>
> Whether you are a TLP or not the big key here is making sure that
> dev discussion does not happen elsewhere outside of the list. You
> can create e.g., a github-dev@spark.a.o list, but you will need
> to make sure that:
>
> a) if dev discussion is happening there that it gets flowed up to
> dev@spark.a.o. All development discussion must appear on the dev
> list and must be traceable as a project discussion and decisions
> appear on the list(s).
>
> b) automated/etc. email is simply that, and there isn't a ton of
> discussion going on on those github emails, and that it's mostly
> going on on the dev@spark.a.o list.
>
> If you can meet those 2 criteria/litmus test, I think it's fine.
> The big concern is that if the discussion is happening elsewhere,
> then the decisions made for Apache Spark are based on information
> that isn't co-located with the Apache Spark project. So that's the
> thing that the PMC needs to keep in mind (note I said PMC now, yay!) :)
>
> Cheers and just keep the above in mind and you'll be good.
>
> Cheers,
> Chris


Re: Signal/Noise Ratio

2014-02-20 Thread Mattmann, Chris A (3980)
Guys,

Whether you are a TLP or not the big key here is making sure that
dev discussion does not happen elsewhere outside of the list. You
can create e.g., a github-dev@spark.a.o list, but you will need
to make sure that:

a) if dev discussion is happening there that it gets flowed up to
dev@spark.a.o. All development discussion must appear on the dev
list and must be traceable as a project discussion and decisions
appear on the list(s).

b) automated/etc. email is simply that, and there isn't a ton of
discussion going on on those github emails, and that it's mostly
going on on the dev@spark.a.o list.

If you can meet those 2 criteria/litmus test, I think it's fine.
The big concern is that if the discussion is happening elsewhere,
then the decisions made for Apache Spark are based on information
that isn't co-located with the Apache Spark project. So that's the
thing that the PMC needs to keep in mind (note I said PMC now, yay!) :)

Cheers and just keep the above in mind and you'll be good.

Cheers,
Chris




-Original Message-
From: Andy Konwinski 
Reply-To: "dev@spark.incubator.apache.org" 
Date: Thursday, February 20, 2014 12:36 PM
To: "dev@spark.incubator.apache.org" 
Subject: Re: Signal/Noise Ratio

>That is a very valid point about the list archives (which a mail filter
>doesn't address and which impacts the community in a negative way).
>
>As of today we are a Top Level Project so I think we have a little more
>autonomy for this sort of dev vs separate list decision.



Re: ASF Board Meeting Summary - February 19, 2014

2014-02-20 Thread Nan Zhu
Congratulations to all! 

-- 
Nan Zhu


On Thursday, February 20, 2014 at 3:18 PM, Konstantin Boudnik wrote:

> We! ;)



Re: Signal/Noise Ratio

2014-02-20 Thread Andy Konwinski
That is a very valid point about the list archives (which a mail filter
doesn't address and which impacts the community in a negative way).

As of today we are a Top Level Project so I think we have a little more
autonomy for this sort of dev vs separate list decision.


On Thu, Feb 20, 2014 at 12:15 PM, Ethan Jewett  wrote:

> Is there anything stopping us from using a different list, segregated from
> the dev list? The Github emails significantly reduce the signal-noise ratio
> of this list, and while it is possible (but annoying) to filter them out in
> our individual inboxes, it makes the archives of the list much less usable
> in many ways.


Re: Fwd: ASF Board Meeting Summary - February 19, 2014

2014-02-20 Thread Konstantin Boudnik
We! ;)

On Thu, Feb 20, 2014 at 08:37AM, Andy Konwinski wrote:
> Congrats Spark community! I think this means we are officially now a TLP!
> -- Forwarded message --
> From: "Brett Porter" 
> Date: Feb 19, 2014 11:26 PM
> Subject: ASF Board Meeting Summary - February 19, 2014
> To: 
> Cc:
> 
> The February board meeting took place on the 19th.
> 
> The following directors were present:
> 
>   Shane Curcuru
>   Bertrand Delacretaz
>   Roy T. Fielding
>   Jim Jagielski
>   Chris Mattmann
>   Brett Porter
>   Greg Stein
> 
> Apologies were received from Sam Ruby.
> 
> The following officers were present:
> 
>   Ross Gardler
>   Rich Bowen
>   Craig L Russell
> 
> The following guests were present:
> 
>   Sean Kelly
>   Daniel Gruno
>   Phil Steitz
>   Jake Farrell
>   Marvin Humphrey
>   David Nalley
>   Noah Slater
> 
> The January minutes were approved.
> Minutes will be posted to http://www.apache.org/foundation/records/minutes/
> 
> The following reports were not approved and are expected next month:
> 
>  Report from the Apache Lenya Project  [Richard Frovarp]
> 
> The following reports were not received and are expected next month:
> 
>   Report from the Apache Abdera Project  [Ant Elder]
>   Report from the Apache Buildr Project  [Alex Boisvert]
>   Report from the Apache Click Project  [Malcolm Edgar]
>   Report from the Apache Community Development Project  [Luciano Resende]
>   Report from the Apache Continuum Project  [Brent Atkinson]
>   Report from the Apache Creadur Project  [Robert Burrell Donkin]
>   Report from the Apache DirectMemory Project  [Raffaele P. Guidi]
>   Report from the Apache Giraph Project  [Avery Ching]
>   Report from the Apache Velocity Project  [Nathan Bubna]
> 
> All other reports to the board were approved.
> 
> The following resolutions were passed unanimously:
> 
>   A. Establish the Apache Open Climate Workbench Project (Michael Joyce, VP)
>   B. Change the Apache Incubator Project Chair (Roman Shaposhnik, VP)
>   C. Establish the Apache Spark Project (Matei Zaharia, VP)
>   D. Establish the Apache Knox Project (Kevin Minder, VP)
> 
> The next board meeting will be on the 19th of March.


Re: Signal/Noise Ratio

2014-02-20 Thread Ethan Jewett
Is there anything stopping us from using a different list, segregated from
the dev list? The Github emails significantly reduce the signal-noise ratio
of this list, and while it is possible (but annoying) to filter them out in
our individual inboxes, it makes the archives of the list much less usable
in many ways.


On Tue, Feb 18, 2014 at 2:20 PM, Aaron Davidson  wrote:

> This is due, unfortunately, to Apache policies that all development-related
> discussion should take place on the dev list. As we are attempting to
> graduate from an incubating project to an Apache top level project, there
> were some concerns raised about GitHub, and the fastest solution to avoid
> conflict related to our graduation was to CC dev@ for all GitHub messages.
> Once our graduation is complete, we may be able to find a less noisy way of
> dealing with these messages.
>
> In the meantime, one simple solution is to filter out all messages that
> come from g...@git.apache.org and are destined to
> dev@spark.incubator.apache.org.
>
>
> On Tue, Feb 18, 2014 at 10:04 AM, Gerard Maas wrote:
>
> > +1 please.
> >
> >
> > On Tue, Feb 18, 2014 at 6:04 PM, Michael Ernest wrote:
> >
> > > +1
> > >
> > >
> > > On Tue, Feb 18, 2014 at 8:24 AM, Heiko Braun wrote:
> > >
> > > >
> > > >
> > > > Wouldn't it be better to move the github messages to a dedicated
> > > > email list?
> > > >
> > > > Regards, Heiko
> > > >
> > >
> > >
> > >
> > > --
> > > Michael Ernest
> > > Sr. Solutions Consultant
> > > West Coast
> > >
> >
>


[GitHub] incubator-spark pull request: Add Security to Spark - Akka, Http, ...

2014-02-20 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/332#discussion_r9921913
  
--- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
@@ -483,10 +496,131 @@ private[spark] class ConnectionManager(port: Int, conf: SparkConf) extends Loggi
 /*handleMessage(connection, message)*/
   }
 
-  private def handleMessage(connectionManagerId: ConnectionManagerId, message: Message) {
+  private def handleClientAuthNeg(
+      waitingConn: SendingConnection,
+      securityMsg: SecurityMessage,
+      connectionId : ConnectionId) {
+    if (waitingConn.isSaslComplete()) {
+      logDebug("Client sasl completed for id: "  + waitingConn.connectionId)
+      connectionsAwaitingSasl -= waitingConn.connectionId
+      waitingConn.getAuthenticated().synchronized {
+        waitingConn.getAuthenticated().notifyAll();
+      }
+      return
+    } else {
+      var replyToken : Array[Byte] = null
+      try {
+        replyToken = waitingConn.sparkSaslClient.saslResponse(securityMsg.getToken);
+        if (waitingConn.isSaslComplete()) {
+          logDebug("Client sasl completed after evaluate for id: " + waitingConn.connectionId)
+          connectionsAwaitingSasl -= waitingConn.connectionId
+          waitingConn.getAuthenticated().synchronized {
+            waitingConn.getAuthenticated().notifyAll()
+          }
+          return
+        }
+        var securityMsgResp = SecurityMessage.fromResponse(replyToken, securityMsg.getConnectionId)
+        var message = securityMsgResp.toBufferMessage
+        if (message == null) throw new Exception("Error creating security message")
+        sendSecurityMessage(waitingConn.getRemoteConnectionManagerId(), message)
+      } catch  {
+        case e: Exception => {
+          logError("Error doing sasl client: " + e)
+          waitingConn.close()
+          throw new Exception("error evaluating sasl response: " + e)
+        }
+      }
+    }
+  }
+
+  private def handleServerAuthNeg(
+      connection: Connection,
+      securityMsg: SecurityMessage,
+      connectionId: ConnectionId) {
+    if (!connection.isSaslComplete()) {
+      logDebug("saslContext not established")
+      var replyToken : Array[Byte] = null
+      try {
+        connection.synchronized {
+          if (connection.sparkSaslServer == null) {
+            logDebug("Creating sasl Server")
+            connection.sparkSaslServer = new SparkSaslServer(securityManager)
+          }
+        }
+        replyToken = connection.sparkSaslServer.response(securityMsg.getToken)
+        if (connection.isSaslComplete()) {
+          logDebug("Server sasl completed: " + connection.connectionId)
+        } else {
+          logDebug("Server sasl not completed: " + connection.connectionId)
+        }
+        if (replyToken != null) {
+          var securityMsgResp = SecurityMessage.fromResponse(replyToken, securityMsg.getConnectionId)
+          var message = securityMsgResp.toBufferMessage
+          if (message == null) throw new Exception("Error creating security Message")
+          sendSecurityMessage(connection.getRemoteConnectionManagerId(), message)
+        }
+      } catch {
+        case e: Exception => {
+          logError("Error in server auth negotiation: " + e)
+          // It would probably be better to send an error message telling other side auth failed
+          // but for now just close
+          connection.close()
+        }
+      }
+    } else {
+      logDebug("connection already established for this connection id: " + connection.connectionId)
+    }
+  }
+
+
+  private def handleAuthentication(conn: Connection, bufferMessage: BufferMessage): Boolean = {
+    if (bufferMessage.isSecurityNeg) {
+      logDebug("This is security neg message")
+
+      // parse as SecurityMessage
+      val securityMsg = SecurityMessage.fromBufferMessage(bufferMessage)
+      val connectionId = new ConnectionId(securityMsg.getConnectionId)
+
+      connectionsAwaitingSasl.get(connectionId) match {
+        case Some(waitingConn) => {
+          // Client - this must be in response to us doing Send
+          logDebug("Client handleAuth for id: " +  waitingConn.connectionId)
+          handleClientAuthNeg(waitingConn, securityMsg, connectionId)
+        }
+        case None => {
+          // Server - someone sent us something and we haven't authenticated yet
+          logDebug("Server handleAuth for id: " + connectionId)
+          handleServerAuthNeg(conn, securityMsg, connectionId)
+        }

Re: ASF Board Meeting Summary - February 19, 2014

2014-02-20 Thread Giri Iyengar
Awesome news. Congratulations, Spark!

-giri


On Thu, Feb 20, 2014 at 11:37 AM, Andy Konwinski wrote:

> Congrats Spark community! I think this means we are officially now a TLP!



-- 
GIRI IYENGAR, CTO
VELOS.IO
Simple. Powerful. Predictions.

440 NINTH AVE, 11TH FLOOR NEW YORK CITY, NY 10001
O: 917.525.2466x104   M: 914.924.7935
E: giri.iyengar@velos.io
W: www.velos.io


Re: Fwd: ASF Board Meeting Summary - February 19, 2014

2014-02-20 Thread Mridul Muralidharan
Wonderful news ! Congrats all :-)

Regards,
Mridul
On Feb 20, 2014 10:07 PM, "Andy Konwinski"  wrote:

> Congrats Spark community! I think this means we are officially now a TLP!


[GitHub] incubator-spark pull request: doctest updated for mapValues, flatM...

2014-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/621#issuecomment-35656954
  
Can one of the admins verify this patch?



