[GitHub] incubator-spark pull request: SPARK-1124: Fix infinite retries of ...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/641#issuecomment-35864121
  
Merged build started.




[GitHub] incubator-spark pull request: SPARK-1124: Fix infinite retries of ...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/641#issuecomment-35864120
  
 Merged build triggered.




[GitHub] incubator-spark pull request: SPARK-1124: Fix infinite retries of ...

2014-02-23 Thread mateiz
GitHub user mateiz opened a pull request:

https://github.com/apache/incubator-spark/pull/641

SPARK-1124: Fix infinite retries of reduce stage when a map stage failed

In the previous code, if you had a failing map stage and then tried to run 
reduce stages on it repeatedly, the first reduce stage would fail correctly, 
but the later ones would mistakenly believe that all map outputs are available 
and start failing infinitely with fetch failures from "null". See 
https://spark-project.atlassian.net/browse/SPARK-1124 for an example.

This PR also cleans up code style slightly where there was a variable named 
"s" and some weird map manipulation.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mateiz/incubator-spark spark-1124-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/641.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #641


commit cd32d5e4dee1291e4509e5965322b7ffe620b1f3
Author: Matei Zaharia 
Date:   2014-02-24T07:45:48Z

SPARK-1124: Fix infinite retries of reduce stage when a map stage failed

In the previous code, if you had a failing map stage and then tried to
run reduce stages on it repeatedly, the first reduce stage would fail
correctly, but the later ones would mistakenly believe that all map
outputs are available and start failing infinitely with fetch failures
from "null".






[GitHub] incubator-spark pull request: add threadPool shutdown hook when ki...

2014-02-23 Thread wchswchs
Github user wchswchs commented on the pull request:

https://github.com/apache/incubator-spark/pull/628#issuecomment-35863734
  
OK, I have closed it!




[GitHub] incubator-spark pull request: add threadPool shutdown hook when ki...

2014-02-23 Thread wchswchs
Github user wchswchs closed the pull request at:

https://github.com/apache/incubator-spark/pull/628




[GitHub] incubator-spark pull request: add threadPool shutdown hook when ki...

2014-02-23 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/incubator-spark/pull/628#issuecomment-35863650
  
Given this, can you close the pull request? Or do you plan to try an 
interrupt? That may also not fix the issue.




[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

2014-02-23 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/636#discussion_r9983025
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -686,6 +649,47 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: 
RDD[(K, V)])
   }
 
   /**
+   * Output the RDD to any Hadoop-supported storage system with new Hadoop 
API, using a Hadoop
+   * Job object for that storage system. The Job should set an 
OutputFormat and any output paths
+   * required (e.g. a table name to write to) in the same way as it would 
be configured for a Hadoop
+   * MapReduce job.
+   */
+  def saveAsNewAPIHadoopDataset(job: NewAPIHadoopJob) {
--- End diff --

In the new Hadoop API, does this really require a Job or just a 
Configuration? In the old API we only needed a configuration.
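
For reference, a hedged sketch of calling the method as drafted in this diff; all classes and settings below are illustrative, and note that a Job wraps a Configuration, which is exactly the question above:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

// Assumes `pairs: RDD[(Text, Text)]` already exists.
val job = new Job(pairs.context.hadoopConfiguration)
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[Text])
job.setOutputFormatClass(classOf[TextOutputFormat[Text, Text]])
FileOutputFormat.setOutputPath(job, new Path("/tmp/out"))
pairs.saveAsNewAPIHadoopDataset(job)
```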




[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

2014-02-23 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/incubator-spark/pull/636#issuecomment-35863485
  
Jenkins, this is OK to test




[GitHub] incubator-spark pull request: SPARK-1004: PySpark on YARN

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/640#issuecomment-35863413
  
 Merged build triggered.




[GitHub] incubator-spark pull request: SPARK-1004: PySpark on YARN

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/640#issuecomment-35863414
  
Merged build started.




[GitHub] incubator-spark pull request: SPARK-1004: PySpark on YARN

2014-02-23 Thread sryza
GitHub user sryza opened a pull request:

https://github.com/apache/incubator-spark/pull/640

SPARK-1004: PySpark on YARN

Make pyspark work in yarn-client mode. This builds on Josh's work. I 
verified it works on a 5-node cluster.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sryza/incubator-spark sandy-spark-1004

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/640.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #640


commit e752a6a1c8a9d7cbc31d7b911800e22db6fcb2b0
Author: Josh Rosen 
Date:   2014-01-24T18:19:58Z

Automatically set Yarn env vars in PySpark (SPARK-1030).

commit 0adcaa971086853b254baf32748811561bb6e209
Author: Josh Rosen 
Date:   2014-01-25T23:28:56Z

WIP towards PySpark on YARN:

- Remove reliance on SPARK_HOME on the workers.  Only the driver
  should know about SPARK_HOME.  On the workers, we ensure that the
  PySpark Python libraries are added to the PYTHONPATH.

- Add a Makefile for generating a "fat zip" that contains PySpark's
  Python dependencies.  This is a bit of a hack and I'd be open to
  better packaging tools, but this doesn't require any extra Python
  libraries.  This use case doesn't seem to be well-addressed by the
  existing Python packaging tools: there are plenty of tools to package
  complete Python environments (such as pyinstaller and virtualenv) or
  to bundle *individual* libraries (e.g. distutils), but few to generate
  portable fat zips or eggs.

This hasn't been tested with YARN and may not actually compile.

commit d4a71d0495d072d5b5364601e7cd0dc9a7c9c9b9
Author: Josh Rosen 
Date:   2014-02-19T06:27:21Z

Add missing setup.py file for PySpark.

commit dcda63863a41414ba5e410092dc4fbab2e353543
Author: Sandy Ryza 
Date:   2014-02-24T07:06:42Z

Improvements

commit 38546d4f282727f3ae112f1e564df72443b726f5
Author: Sandy Ryza 
Date:   2014-02-24T07:26:01Z

Don't set SPARK_JAR






[GitHub] incubator-spark pull request: [SPARK-1089] fix the regression prob...

2014-02-23 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/incubator-spark/pull/614#issuecomment-35862636
  
Prompted by this, I went ahead and tried fixing this problem in Scala itself, 
and it worked.
https://github.com/ScrapCodes/scala/tree/si-6502-fix




[GitHub] incubator-spark pull request: [SPARK-1089] fix the regression prob...

2014-02-23 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/incubator-spark/pull/614#issuecomment-35859704
  
Nice catch! And thanks for taking the time to dig into this. I am okay with 
this way of doing it; however, if you and others prefer, we can move this code to 
createInterpreter before creating SparkILoopInterpreter. Even if we don't, I 
think it's fine to merge.





[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread xoltar
Github user xoltar commented on the pull request:

https://github.com/apache/incubator-spark/pull/638#issuecomment-35857946
  
Thanks, the last change should address all code review comments. I also cleaned 
up some unneeded imports in PairRDDFunctionsSuite.




[GitHub] incubator-spark pull request: fix building with maven on Mac OS X

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/639#issuecomment-35856579
  
Can one of the admins verify this patch?




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/638#issuecomment-35856428
  
We should put this fix in 0.9 as well once it's ready to merge.




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/638#issuecomment-35856419
  
Thanks a lot for tracking this down, fixing it, and adding tests! I added 
some minor style comments, modulo those comments LGTM.




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/638#discussion_r9980747
  
--- Diff: 
core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
@@ -330,4 +335,74 @@ class PairRDDFunctionsSuite extends FunSuite with 
SharedSparkContext {
   (1, ArrayBuffer(1)),
   (2, ArrayBuffer(1
   }
+
+  test("saveNewAPIHadoopFile should call setConf if format is 
configurable") {
+val pairs = sc.parallelize(Array((new Integer(1), new Integer(1
+val conf = new Configuration()
+
+//No error, non-configurable formats still work
--- End diff --

Mind adding spaces after these? `// No error, non-configurable formats`... 
Also it would be nice (but up to you) to use `/* ... */` for multi-line 
comments.




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/638#discussion_r9980751
  
--- Diff: 
core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
@@ -330,4 +335,74 @@ class PairRDDFunctionsSuite extends FunSuite with 
SharedSparkContext {
   (1, ArrayBuffer(1)),
   (2, ArrayBuffer(1
   }
+
+  test("saveNewAPIHadoopFile should call setConf if format is 
configurable") {
+val pairs = sc.parallelize(Array((new Integer(1), new Integer(1
+val conf = new Configuration()
+
+//No error, non-configurable formats still work
+pairs.saveAsNewAPIHadoopFile[FakeFormat]("ignored")
+
+//Configurable intercepts get configured
+//ConfigTestFormat throws an exception if we try to write to it
+//when setConf hasn't been thrown first.
+//Assertion is in ConfigTestFormat.getRecordWriter
+pairs.saveAsNewAPIHadoopFile[ConfigTestFormat]("ignored")
+  }
+}
+
+// These classes are fakes for testing
+// "saveNewAPIHadoopFile should call setConf if format is configurable".
+// Unfortunately, they have to be top level classes, and not defined in
+// the test method, because otherwise Scala won't generate no-args 
constructors
+// and the test will therefore throw InstantiationException when 
saveAsNewAPIHadoopFile
+// tries to instantiate them with Class.newInstance.
+class FakeWriter extends RecordWriter[Integer,Integer] {
--- End diff --

`Integer, Integer`




[GitHub] incubator-spark pull request: fix building with maven on Mac OS X

2014-02-23 Thread witgo
GitHub user witgo opened a pull request:

https://github.com/apache/incubator-spark/pull/639

fix building with maven on Mac OS X

Building with Maven on Mac OS X throws: "Failure to find 
org.eclipse.paho:mqtt-client:jar:0.4.0 in 
https://repository.apache.org/content/repositories/releases was cached in the 
local repository, resolution will not be reattempted until the update interval 
of apache-repo has elapsed or updates are forced".

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/witgo/incubator-spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/639.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #639


commit 27c612fb0dbbf27ba5a20d870a5cbb5cf33f4d9f
Author: liguoqiang 
Date:   2014-02-24T04:00:36Z

Building with Maven on Mac OS X throws: "Failure to find 
org.eclipse.paho:mqtt-client:jar:0.4.0 in 
https://repository.apache.org/content/repositories/releases was cached in the 
local repository, resolution will not be reattempted until the update interval 
of apache-repo has elapsed or updates are forced".






[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/638#discussion_r9980734
  
--- Diff: 
core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
@@ -26,6 +26,11 @@ import com.google.common.io.Files
 
 import org.apache.spark.SparkContext._
 import org.apache.spark.{Partitioner, SharedSparkContext}
+import org.apache.hadoop.mapreduce._
--- End diff --

Mind making your new imports fit the normal style?


https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/638#discussion_r9980719
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -617,6 +617,10 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: 
RDD[(K, V)])
 attemptNumber)
   val hadoopContext = newTaskAttemptContext(wrappedConf.value, 
attemptId)
   val format = outputFormatClass.newInstance
+  format match {
+case c:Configurable => c.setConf(wrappedConf.value)
--- End diff --

I don't think this is specific to HBase - I think this is something we 
should really have been doing all along, but it only got noticed because 
HBase relies on this configuration.
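
An illustrative restatement of the pattern (not the PR's exact code): any reflectively created OutputFormat that implements Configurable gets the job Configuration injected before first use.

```scala
import org.apache.hadoop.conf.{Configurable, Configuration}

def newConfiguredInstance[T](cls: Class[T], conf: Configuration): T = {
  val instance = cls.newInstance()
  instance match {
    case c: Configurable => c.setConf(conf)  // e.g. HBase's TableOutputFormat
    case _ => // non-Configurable formats need no extra setup
  }
  instance
}
```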




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/638#discussion_r9980723
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -617,6 +617,10 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: 
RDD[(K, V)])
 attemptNumber)
   val hadoopContext = newTaskAttemptContext(wrappedConf.value, 
attemptId)
   val format = outputFormatClass.newInstance
+  format match {
+case c:Configurable => c.setConf(wrappedConf.value)
--- End diff --

Add a space after the colon: `case c: Configurable => `




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/638#issuecomment-35854652
  
Merged build finished.




[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/572#issuecomment-35854653
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12824/




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/638#issuecomment-35854654
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12823/




[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/572#issuecomment-35854651
  
Merged build finished.




Re: [DISCUSS] Extending public API

2014-02-23 Thread Matei Zaharia
My sense on all this is that it should be done on a case-by-case basis. To add 
a new API, it needs to be general enough that a lot of users will want to use 
it. If adding that API confuses users, that’s a problem. However, on the flip 
side, if it’s not a super-popular function but it’s just 10-20 lines of code, 
it may still be worth having. The maintenance burden on that is not too high, 
and users are used to fairly extensive collection libraries.

For the joins in particular, we added them because it’s quite easy to mess up 
writing joins by hand, even once you have cogroup().

One thing we do want to do is start implementing more specialized 
functionality, like statistics functions, in separate libraries. Right now 
there are some functions in the RDD API (e.g. sums, means, histograms, etc) 
that are fairly specific to this domain.

Matei

On Feb 23, 2014, at 10:18 AM, Amandeep Khurana  wrote:

> This makes sense. Thanks for clarifying, Mridul.
> 
> As Sean pointed out - a contrib module quickly turns into a legacy code
> base that becomes hard to maintain. From that perspective, I think the idea
> of a separate sparkbank github that is maintained by Spark contributors
> (along with users who wish to contribute add-ons like you've described) and
> adhere to the code quality and reviews like the main project seems
> appealing. And then not just sparkbank but other things that people might
> want to have as part of the project but don't belong to the core
> codebase could go there? I don't know if things like this have come up in
> past pull requests.
> 
> -Amandeep
> 
> PS: I'm not a spark committer/contributor so take my opinion fwiw. :)
> 
> 
> On Sun, Feb 23, 2014 at 1:40 AM, Mridul Muralidharan wrote:
> 
>> Good point, and I was purposefully vague on that since that is something
>> which our community should evolve imo : this was just an initial proposal
>> :-)
>> 
>> For example: there are multiple ways to do cartesian - and each has its own
>> trade-offs.
>> 
>> Another candidate could be, as I mentioned, new methods which can be
>> expressed as sequences of existing methods but would be slightly more
>> performant if done in one shot - like the self-cartesian PR, various types
>> of join (which could become a contrib of its own, btw!), experiments using
>> key indexes, ordering, etc.
>> 
>> Addition into sparkbank or contrib (or something better named!) does not
>> preclude future migration into core ... just an initial staging area for us
>> to evolve the API and get user feedback, without necessarily making the Spark
>> core API unstable.
>> 
>> Obviously, it is not a dumping ground for broken code/ideas ... and must
>> follow same level of scrutiny and rigour before committing.
>> Regards
>> Mridul
>> On Feb 23, 2014 11:53 AM, "Amandeep Khurana"  wrote:
>> 
>>> Mridul,
>>> 
>>> Can you give examples of APIs that people have contributed (or wanted
>>> to contribute) but you categorize as something that would go into
>>> piggybank-like (sparkbank)? Curious to know how you'd decide what
>>> should go where.
>>> 
>>> Amandeep
>>> 
 On Feb 22, 2014, at 10:06 PM, Mridul Muralidharan wrote:
 
 Hi,
 
 Over the past few months, I have seen a bunch of pull requests which have
 extended the Spark API ... most commonly RDD itself.
 
 Most of them are either relatively niche cases of specialization (which
 might not be useful in most cases) or idioms which can be expressed
 (sometimes with a minor perf penalty) using the existing API.
 
 While all of them have non-zero value (hence the effort to contribute, and
 gladly welcomed!) they are extending the API in nontrivial ways and have a
 maintenance cost ... and we already have a pending effort to clean up our
 interfaces prior to 1.0.
 
 I believe there is a need to keep the exposed API succinct, expressive and
 functional in Spark, while at the same time encouraging extensions and
 specialization within the Spark codebase so that other users can benefit from
 the shared contributions.
 
 One approach could be to start something akin to piggybank in Pig to
 contribute user-generated specializations, helper utils, etc.: bundled as
 part of Spark, but not part of core itself.
 
 Thoughts, comments?
 
 Regards,
 Mridul
>>> 
>> 



Re: standard way of running a compiled jar

2014-02-23 Thread Matei Zaharia
Yes, it is a supported option. I’m just wondering whether we want to create a 
script for it specifically. Maybe the same script could also allow submitting 
to the cluster or something.

Matei

On Feb 23, 2014, at 1:55 PM, Sandy Ryza  wrote:

> Is the client=driver mode still a supported option (outside of the REPLs),
> at least for the medium term?  My impression from reading the docs is that
> it's the most common, if not recommended, way to submit jobs.  If that's
> the case, I still think it's important, or at least helpful, to have
> something for this mode that addresses the issues below.
> 
> 
> On Sat, Feb 22, 2014 at 10:48 PM, Matei Zaharia 
> wrote:
> 
>> Hey Sandy,
>> 
>> In the long run, the ability to submit driver programs to run in the
>> cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve this.
>> This is a feature currently available in the standalone mode that runs the
>> driver on a worker node, but it is also how YARN works by default, and it
>> wouldn't be too bad to do in Mesos. With this, the user could compile a JAR
>> that excludes Spark and still get Spark on the classpath.
>> 
>> This was added in 0.9 as a slightly harder to invoke feature mainly to be
>> used for Spark Streaming (since the cluster can also automatically restart
>> your driver), but we can create a script around it for submissions. I'd
>> like to see a design for such a script that takes into account all the
>> deploy modes though, because it would be confusing to use it one way on
>> YARN and another way on standalone for instance. Already the YARN submit
>> client kind of does what you're looking for.
>> 
>> Matei
>> 
>> On Feb 22, 2014, at 2:08 PM, Sandy Ryza  wrote:
>> 
>>> Hey All,
>>> 
>>> I've encountered some confusion about how to run a Spark app from a
>>> compiled jar and wanted to bring up the recommended way.
>>> 
>>> It seems like the current standard options are:
>>> * Build an uber jar that contains the user jar and all of Spark.
>>> * Explicitly include the locations of the Spark jars on the client
>>> machine in the classpath.
>>> 
>>> Both of these options have a couple issues.
>>> 
>>> For the uber jar, this means unnecessarily sending all of Spark (and its
>>> dependencies) to every executor, as well as including Spark twice in the
>>> executor classpaths.  This also requires recompiling binaries against the
>>> latest version whenever the cluster version is upgraded, lest executor
>>> classpaths include two different versions of Spark at the same time.
>>> 
>>> Explicitly including the Spark jars in the classpath is a huge pain
>> because
>>> their locations can vary significantly between different installations
>> and
>>> platforms, and makes the invocation more verbose.
>>> 
>>> What seems ideal to me is a script that takes a user jar, sets up the
>> Spark
>>> classpath, and runs it.  This means only the user jar gets shipped across
>>> the cluster, but the user doesn't need to figure out how to get the Spark
>>> jars onto the client classpath.  This is similar to the "hadoop jar"
>>> command commonly used for running MapReduce jobs.
>>> 
>>> The spark-class script seems to do almost exactly this, but I've been
>> told
>>> it's meant only for internal Spark use (with the possible exception of
>>> yarn-standalone mode). It doesn't take a user jar as an argument, but one
>>> can be added by setting the SPARK_CLASSPATH variable.  This script could
>> be
>>> stabilized for user use.
>>> 
>>> Another option would be to have a "spark-app" script that does what
>>> spark-class does, but also masks the decision of whether to run the
>> driver
>>> in the client process or on the cluster (both standalone and YARN have
>>> modes for both of these).
>>> 
>>> Does this all make sense?
>>> -Sandy
>> 
>> 



[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/572#issuecomment-35852739
  
Merged build started.




[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/572#issuecomment-35852737
  
 Merged build triggered.




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/638#issuecomment-35852729
  
 Merged build triggered.




[GitHub] incubator-spark pull request: Spark-615: make mapPartitionsWithInd...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/606#issuecomment-35852724
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12822/




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/638#issuecomment-35852730
  
Merged build started.




[GitHub] incubator-spark pull request: Spark-615: make mapPartitionsWithInd...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/606#issuecomment-35852723
  
Build finished.




[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-23 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/incubator-spark/pull/626#issuecomment-35851996
  
but why not just prevent users from overwriting the directory, no matter 
whether it contains part-* files?




[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-23 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/incubator-spark/pull/626#issuecomment-35851751
  
I just went through the Spark Streaming documentation; it seems that it's safe 
to follow your suggestion, @pwendell.




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/638#discussion_r9979453
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -617,6 +617,10 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: 
RDD[(K, V)])
 attemptNumber)
   val hadoopContext = newTaskAttemptContext(wrappedConf.value, 
attemptId)
   val format = outputFormatClass.newInstance
+  format match {
+case c:Configurable => c.setConf(wrappedConf.value)
--- End diff --

do we need some comments here to indicate that this line is to support a 
special case in HBase?




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/638#issuecomment-35851045
  
Jenkins, test this please.




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/638#issuecomment-35850754
  
Can one of the admins verify this patch?




[GitHub] incubator-spark pull request: Spark-615: make mapPartitionsWithInd...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/606#issuecomment-35850760
  
Build started.




[GitHub] incubator-spark pull request: Spark-615: make mapPartitionsWithInd...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/606#issuecomment-35850759
  
 Build triggered.




[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/637#issuecomment-35850735
  
Merged build finished.




[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/637#issuecomment-35850736
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12821/




[GitHub] incubator-spark pull request: For outputformats that are Configura...

2014-02-23 Thread xoltar
GitHub user xoltar opened a pull request:

https://github.com/apache/incubator-spark/pull/638

For outputformats that are Configurable, call setConf before sending data 
to them.

This allows us to use, e.g., HBase's TableOutputFormat with 
PairRDDFunctions.saveAsNewAPIHadoopFile, which otherwise would throw a 
NullPointerException because the output table name hasn't been configured.
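
For example, with this change the following hedged sketch works; it assumes an RDD[(ImmutableBytesWritable, Put)] named `puts` and an HBase cluster reachable from the default configuration, and the table name "my_table" is illustrative:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat

val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
puts.saveAsNewAPIHadoopFile(
  "/ignored",  // TableOutputFormat ignores the output path
  classOf[ImmutableBytesWritable],
  classOf[Put],
  classOf[TableOutputFormat[ImmutableBytesWritable]],
  conf)
```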

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xoltar/incubator-spark SPARK-1108

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #638


commit 7cbcaa10bbf01cf04bba7f2883d1fb9564cd3660
Author: Bryn Keller 
Date:   2014-02-20T06:00:44Z

For outputformats that are Configurable, call setConf before sending data 
to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise 
would throw NullPointerException because the output table name hasn't been 
configured






[GitHub] incubator-spark pull request: SPARK-1122: allCollect functions for...

2014-02-23 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/incubator-spark/pull/635#issuecomment-35849312
  
@markhamstra @pwendell For the use cases, this allCollect operation may be 
useful in grid search for a good set of training parameters in machine 
learning problems. For example, if the dataset is only 500MB but training takes 
half an hour to finish and we have to try 100 different combinations of 
training parameters (e.g., rank, regularization constants, and termination 
tolerance), the wall-clock time can be reduced by distributing the dataset to 
multiple nodes and training in parallel. Another use case is the replicated 
join, though locality issues need to be addressed. I agree with you that the 
implementation is not efficient, as it puts a heavy load on the driver.
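
A rough sketch of that grid-search pattern, independent of how allCollect itself is implemented; `trainingSet`, `train`, and `params` are hypothetical stand-ins:

```scala
// Broadcast the small dataset once, then fit one model per parameter
// combination in parallel across the cluster.
val data = sc.broadcast(trainingSet.collect())
val results = sc.parallelize(params).map { p =>
  (p, train(data.value, p))
}.collect()
```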

@coderxiang , could you try to improve the implementation? 




[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/637#issuecomment-35849130
  
Merged build started.




[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/637#issuecomment-35849129
  
 Merged build triggered.




[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/582#issuecomment-35849112
  
Merged build finished.




[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/582#issuecomment-35849113
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12820/




[GitHub] incubator-spark pull request: SPARK-1122: allCollect functions for...

2014-02-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/635#issuecomment-35848534
  
@coderxiang btw - it might be something where we make it a private API so 
it can be used inside Spark if other packages need this to do broadcast 
joins. It would be good to understand the intended use case a bit more, though.




[GitHub] incubator-spark pull request: SPARK-1084 (part 1). Fix most build ...

2014-02-23 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/incubator-spark/pull/637

SPARK-1084 (part 1). Fix most build warnings.

This is a redo of https://github.com/apache/incubator-spark/pull/586

This contains all the same changes, minus dependency changes. It also 
rebases and squashes some commits that could be combined.

After this is in I'll propose part 2, which concerns dependencies.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/incubator-spark SPARK-1084.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-spark/pull/637.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #637


commit 2e52f136474abf911472af2bb639d704605cd171
Author: Sean Owen 
Date:   2014-02-11T14:37:03Z

Replace deprecated Ant  with 

commit a82b841df207128aec23ae9eb3a297e41d1bcc49
Author: Sean Owen 
Date:   2014-02-11T14:38:23Z

Remove dead scaladoc links

commit 3b7b2ad9c9a2536da51a1b6af7ebf2aff77fef32
Author: Sean Owen 
Date:   2014-02-11T14:39:48Z

Fix scaladoc invocation warning, and enable javac warnings properly, with 
plugin config updates

commit b5ccbc9c6360437afabcfc14e81321e5b7b38e4c
Author: Sean Owen 
Date:   2014-02-12T13:45:04Z

Fix one new style error introduced in scaladoc warning commit

commit 79f1c7acdb9634128d417d704a234058d2993bea
Author: Sean Owen 
Date:   2014-02-23T21:27:02Z

Fix two misc javadoc problems

commit ee1c1150d482243c190c71931852f2797ec79120
Author: Sean Owen 
Date:   2014-02-23T23:27:21Z

Suppress warnings about legitimate unchecked array creations, or change 
code to avoid it






[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/582#issuecomment-35847834
  
 Merged build triggered.




[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/582#issuecomment-35847835
  
Merged build started.




[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/582#issuecomment-35847506
  
Merged build finished.




[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/582#issuecomment-35847507
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12819/




[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/582#issuecomment-35845761
  
Merged build started.




[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/582#issuecomment-35845760
  
 Merged build triggered.




Re: standard way of running a compiled jar

2014-02-23 Thread Sandy Ryza
Is the client=driver mode still a supported option (outside of the REPLs),
at least for the medium term?  My impression from reading the docs is that
it's the most common, if not the recommended, way to submit jobs.  If that's
the case, I still think it's important, or at least helpful, to have
something for this mode that addresses the issues below.


On Sat, Feb 22, 2014 at 10:48 PM, Matei Zaharia wrote:

> Hey Sandy,
>
> In the long run, the ability to submit driver programs to run in the
> cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve this.
> This is a feature currently available in the standalone mode that runs the
> driver on a worker node, but it is also how YARN works by default, and it
> wouldn't be too bad to do in Mesos. With this, the user could compile a JAR
> that excludes Spark and still get Spark on the classpath.
>
> This was added in 0.9 as a slightly harder to invoke feature mainly to be
> used for Spark Streaming (since the cluster can also automatically restart
> your driver), but we can create a script around it for submissions. I'd
> like to see a design for such a script that takes into account all the
> deploy modes though, because it would be confusing to use it one way on
> YARN and another way on standalone for instance. Already the YARN submit
> client kind of does what you're looking for.
>
> Matei
>
> On Feb 22, 2014, at 2:08 PM, Sandy Ryza  wrote:
>
> > Hey All,
> >
> > I've encountered some confusion about how to run a Spark app from a
> > compiled jar and wanted to bring up the recommended way.
> >
> > It seems like the current standard options are:
> > * Build an uber jar that contains the user jar and all of Spark.
> > * Explicitly include the locations of the Spark jars on the client
> > machine in the classpath.
> >
> > Both of these options have a couple issues.
> >
> > For the uber jar, this means unnecessarily sending all of Spark (and its
> > dependencies) to every executor, as well as including Spark twice in the
> > executor classpaths.  This also requires recompiling binaries against the
> > latest version whenever the cluster version is upgraded, lest executor
> > classpaths include two different versions of Spark at the same time.
> >
> > Explicitly including the Spark jars in the classpath is a huge pain
> because
> > their locations can vary significantly between different installations
> and
> > platforms, and makes the invocation more verbose.
> >
> > What seems ideal to me is a script that takes a user jar, sets up the
> Spark
> > classpath, and runs it.  This means only the user jar gets shipped across
> > the cluster, but the user doesn't need to figure out how to get the Spark
> > jars onto the client classpath.  This is similar to the "hadoop jar"
> > command commonly used for running MapReduce jobs.
> >
> > The spark-class script seems to do almost exactly this, but I've been
> told
> > it's meant only for internal Spark use (with the possible exception of
> > yarn-standalone mode). It doesn't take a user jar as an argument, but one
> > can be added by setting the SPARK_CLASSPATH variable.  This script could
> be
> > stabilized for user use.
> >
> > Another option would be to have a "spark-app" script that does what
> > spark-class does, but also masks the decision of whether to run the
> driver
> > in the client process or on the cluster (both standalone and YARN have
> > modes for both of these).
> >
> > Does this all make sense?
> > -Sandy
>
>
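
One practical route to Matei's "compile a JAR that excludes Spark" is to mark
Spark as a provided dependency, so the assembled user jar ships without Spark
and picks it up from the cluster's classpath at runtime. A minimal sketch,
assuming an sbt build against the 0.9.0-incubating artifacts:

    // build.sbt (sketch only): keep Spark out of the user assembly so the
    // cluster's own Spark jars supply it at runtime.
    libraryDependencies +=
      "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided"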


[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread fommil
Github user fommil commented on the pull request:

https://github.com/apache/incubator-spark/pull/575#issuecomment-35844934
  
Actually, if somebody creates a ticket for me on 
https://github.com/fommil/jniloader, that's the best way to ensure that I'll 
actually update the license and release it. I would prefer to use Mozilla if 
you are happy with that, so please do let me know what you discover. See 
http://www.apache.org/legal/resolved.html#category-b





[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread fommil
Github user fommil commented on the pull request:

https://github.com/apache/incubator-spark/pull/575#issuecomment-35844613
  
@srowen hehe, oh, I know. Actually I'm more interested in knowing exactly 
*why* they don't like the LGPL. There have been so many discussions between the 
FSF and the ASF in the past that they don't quite appreciate that the rest of 
us neither understand either side's goals nor remember those previous 
discussions. I am at least confident that the thread has dusted off a lot of 
misconceptions about the LGPL and the ASF's licensing goals.

Re: Mozilla license, it's definitely listed under category B in that list.

Don't worry, `JNILoader` can be made with AL2 if it needs to be... it's 
only a file or two anyway.




[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings

2014-02-23 Thread srowen
Github user srowen closed the pull request at:

https://github.com/apache/incubator-spark/pull/586




[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings

2014-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/incubator-spark/pull/586#issuecomment-35843240
  
OK I'm going to come back with two PRs. One will have the squashed final 
output of this PR, and the other will have the parts related to dependencies 
(which are now quite trivial I think).




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-spark/pull/570




[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-23 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/incubator-spark/pull/626#issuecomment-35842285
  
@pwendell the second situation can be avoided, sorry, just a brain lapse. The 
only issue is whether any component relies on the fact that Spark allowed 
overwriting the directory before~




[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread dlwh
Github user dlwh commented on the pull request:

https://github.com/apache/incubator-spark/pull/575#issuecomment-35842233
  
@srowen @fommil Breeze is flexible enough that we can swap out different 
back ends quickly (and let users decide at runtime). So if need be, I can do 
the work to make both jblas and netlib-java work.




[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/incubator-spark/pull/575#issuecomment-35842122
  
@fommil ASF is silent on the MPL: 
http://www.apache.org/legal/resolved.html#category-a
But Mozilla says it's compatible with AL2: 
http://www.mozilla.org/MPL/license-policy.html
Given the nature of the MPL, I suspect there is no issue. But IANAL.

Sam, you see what happens when you poke the hornet's nest! I can tell you 
have pointed opinions about licensing, and encourage you to argue the case as 
long as you care to. The squabble is unlikely to conclude with ASF beards 
saying "LGPL is cool".

I suggest filing a calm second JIRA to ask if there is any official stance 
on MPL, as that may solve the issue. (Want me to do it?)

If not, I think Spark should just go with a different library.




[GitHub] incubator-spark pull request: SPARK-1122: allCollect functions for...

2014-02-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/635#issuecomment-35841766
  
Hey @coderxiang - this is interesting functionality but I'm -1 on including 
it in the standard API. The main reason is that this will perform poorly on 
most large datasets and make it easy for people to shoot themselves in the 
foot. A second reason is that the use case isn't totally clear - as per some of 
@markhamstra's comments.




[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-23 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/incubator-spark/pull/626#issuecomment-35841703
  
@pwendell Thanks for the comments. I also considered what you mentioned, but 
will that prevent other components like Spark Streaming from doing their job? 
(I'm not familiar with streaming, but it seems that it overwrites the existing 
directory...)

Also, how do we prevent the situation where the user accidentally runs the job 
over the same directory twice, but with a different partition number (the 
second run having a smaller value)? Eventually, the directory will contain 
results from both runs. 






[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/626#issuecomment-35841445
  
Hey @CodingCat, this approach has a few drawbacks. First, it will mean a 
pretty bad regression for some users. For instance, say that a user is calling 
saveAsHadoopFile(/my-dir) and that directory has some other random stuff in it 
as well. Previously it would have written Spark files alongside the other 
stuff, but with this patch it will silently delete the other data and create 
the directory fresh. Second, this changes the APIs all over the place, which 
we are trying not to do. Third, it's a little scary to have code in Spark 
that's deleting HDFS directories - I'd rather make the user do it explicitly.

What if we did the following: we look in the output directory and see if there 
are any part-XX files in there already, and if so we throw an exception and 
say that the directory already has output data in it.
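
A minimal sketch of that check (illustrative only; the helper name and 
exception type are placeholders, not the eventual patch), using the standard 
Hadoop FileSystem API:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Refuse to write if the output directory already holds part- files
    // from a previous run.
    def checkOutputDir(dir: String): Unit = {
      val path = new Path(dir)
      val fs = FileSystem.get(path.toUri, new Configuration())
      if (fs.exists(path) &&
          fs.listStatus(path).exists(_.getPath.getName.startsWith("part-"))) {
        throw new IllegalStateException(
          "Output directory " + dir + " already contains output data")
      }
    }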




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/incubator-spark/pull/570#issuecomment-35840761
  
@srowen thanks for this clean-up. I'm going to merge this into master.




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/570#issuecomment-35839844
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12818/




[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-23 Thread fommil
Github user fommil commented on the pull request:

https://github.com/apache/incubator-spark/pull/575#issuecomment-35839024
  
@mengxr looking through all the Apache authorised licenses, it would appear 
that the Mozilla license is a better fit with my goals since it would require 
distributors to make source code available if they make any modifications to 
`JNILoader`. Does that fit well with your project's goals? I'd rather have this 
than the Apache License.




[GitHub] incubator-spark pull request: [SPARK-1100] prevent Spark from over...

2014-02-23 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/incubator-spark/pull/626#issuecomment-35838665
  
OK, fixed some bugs and squashed the commits. I think it's ready for 
further review.




Re: [DISCUSS] Extending public API

2014-02-23 Thread Amandeep Khurana
This makes sense. Thanks for clarifying, Mridul.

As Sean pointed out, a contrib module quickly turns into a legacy code
base that becomes hard to maintain. From that perspective, I think the idea
of a separate sparkbank GitHub repo that is maintained by Spark contributors
(along with users who wish to contribute add-ons like you've described) and
adheres to the same code quality and review standards as the main project
seems appealing. And then not just sparkbank, but other things that people
might want as part of the project but that don't belong in the core codebase,
could go there? I don't know if things like this have come up in past pull
requests.

-Amandeep

PS: I'm not a spark committer/contributor so take my opinion fwiw. :)


On Sun, Feb 23, 2014 at 1:40 AM, Mridul Muralidharan wrote:

> Good point, and I was purposefully vague on that since that is something
> which our community should evolve imo : this was just an initial proposal
> :-)
>
> For example: there are multiple ways to do cartesian - and each has its own
> trade offs.
>
> Another candidate could be, as I mentioned, new methods which can be
> expressed as sequences of existing methods but would be slightly more
> performant if done in one shot - like the self cartesian PR, various types
> of join (which can become a contrib of its own btw !), experiments using
> key indexes, ordering, etc.
>
> Addition into sparkbank or contrib (or something better named!) does not
> preclude future migration into core ... just an initial staging area for us
> to evolve the API and get user feedback, without necessarily making the
> Spark core API unstable.
>
> Obviously, it is not a dumping ground for broken code/ideas ... and must
> follow the same level of scrutiny and rigour before committing.
> Regards
> Mridul
>  On Feb 23, 2014 11:53 AM, "Amandeep Khurana"  wrote:
>
> > Mridul,
> >
> > Can you give examples of APIs that people have contributed (or wanted
> > to contribute) but you categorize as something that would go into
> > piggybank-like (sparkbank)? Curious to know how you'd decide what
> > should go where.
> >
> > Amandeep
> >
> > > On Feb 22, 2014, at 10:06 PM, Mridul Muralidharan 
> > wrote:
> > >
> > > Hi,
> > >
> > >  Over the past few months, I have seen a bunch of pull requests which
> > have
> > > extended spark api ... most commonly RDD itself.
> > >
> > > Most of them are either relatively niche cases of specialization (which
> > > might not be useful for most cases) or idioms which can be expressed
> > > (sometimes with minor perf penalty) using existing api.
> > >
> > > While all of them have non zero value (hence the effort to contribute,
> > and
> > > gladly welcomed !) they are extending the api in nontrivial ways and
> > have a
> > > maintenance cost ... and we already have a pending effort to clean up
> our
> > > interfaces prior to 1.0
> > >
> > > I believe there is a need to keep exposed api succinct, expressive and
> > > functional in spark; while at the same time, encouraging extensions and
> > > specialization within spark codebase so that other users can benefit
> from
> > > the shared contributions.
> > >
> > > One approach could be to start something akin to piggybank in pig to
> > > contribute user generated specializations, helper utils, etc : bundled
> as
> > > part of spark, but not part of core itself.
> > >
> > > Thoughts, comments ?
> > >
> > > Regards,
> > > Mridul
> >
>


[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings

2014-02-23 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/incubator-spark/pull/586#issuecomment-35838441
  
Ah, great, that'll make it simple. We can only merge at the granularity of 
PRs, so it'd be great if you could split the dependency stuff into its own PR.




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/570#issuecomment-35838298
  
Merged build started.




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/570#issuecomment-35838296
  
 Merged build triggered.




Re: Anyone wants to look at SPARK-1123?

2014-02-23 Thread Nan Zhu
OK, I know where I was wrong 


Best, 

-- 
Nan Zhu
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
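
The likely mistake, assuming the trace quoted below tells the whole story: 
calling saveAsNewAPIHadoopFile(path) without an explicit OutputFormat type 
parameter lets the compiler infer Nothing for the format's ClassTag, and 
instantiating that class is what throws InstantiationException. A sketch of 
the corrected call, naming the new-API format explicitly (TextOutputFormat 
here is just an example):

    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    // "a" is the RDD[(String, String)] from the REPL session quoted below.
    a.saveAsNewAPIHadoopFile[TextOutputFormat[String, String]](
      "/Users/nanzhu/code/output_rdd")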


On Sunday, February 23, 2014 at 12:50 PM, Nan Zhu wrote:

> String, it should be; see the following helper functions 
> 
> private[spark] def getKeyClass() = implicitly[ClassTag[K]].runtimeClass
> 
> private[spark] def getValueClass() = implicitly[ClassTag[V]].runtimeClass
> 
> and this is what I ran 
> 
> scala> val a = sc.textFile("/Users/nanzhu/code/incubator-spark/LICENSE", 
> 2).map(line => ("a", "b"))
> 
> scala> a.saveAsNewAPIHadoopFile("/Users/nanzhu/code/output_rdd")
> java.lang.InstantiationException
> at 
> sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at java.lang.Class.newInstance(Class.java:374)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:632)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:590)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
> at $iwC$$iwC$$iwC.<init>(<console>:20)
> at $iwC$$iwC.<init>(<console>:22)
> at $iwC.<init>(<console>:24)
> at <init>(<console>:26)
> at .<init>(<console>:30)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:774)
> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1042)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:611)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:642)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:606)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:790)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:835)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:747)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:595)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:602)
> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:605)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:928)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:878)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:878)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:878)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:970)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> 
> 
> 
> 
> 
> 
> -- 
> Nan Zhu
> 
> 
> On Sunday, February 23, 2014 at 11:06 AM, Nick Pentreath wrote:
> 
> > Hi
> > 
> > What KeyClass and ValueClass are you trying to save as the keys/values of
> > your dataset?
> > 
> > 
> > 
> > On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu  > (mailto:zhunanmcg...@gmail.com)> wrote:
> > 
> > > Hi, all
> > > 
> > > I found a weird thing with saveAsNewAPIHadoopFile in
> > > PairRDDFunctions.scala while working on another issue:
> > > 
> > > saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the 
> > > time
> > > 
> > > I checked the commit history of the file; it seems that the API has
> > > existed for a long time. Has no one else hit this? (that's why I'm confused)
> > > 
> > > Best,
> > > 
> > > --
> > > Nan Zhu
> > > 
> > 
> > 
> > 
> > 
> 
> 



Re: Anyone wants to look at SPARK-1123?

2014-02-23 Thread Nan Zhu
String, it should be; see the following helper functions 

private[spark] def getKeyClass() = implicitly[ClassTag[K]].runtimeClass

private[spark] def getValueClass() = implicitly[ClassTag[V]].runtimeClass

and this is what I ran 

scala> val a = sc.textFile("/Users/nanzhu/code/incubator-spark/LICENSE", 
2).map(line => ("a", "b"))

scala> a.saveAsNewAPIHadoopFile("/Users/nanzhu/code/output_rdd")
java.lang.InstantiationException
at 
sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:374)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:632)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:590)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
at .<init>(<console>:30)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:774)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1042)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:611)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:642)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:606)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:790)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:835)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:747)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:595)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:602)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:605)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:928)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:878)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:878)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:878)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:970)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)






-- 
Nan Zhu


On Sunday, February 23, 2014 at 11:06 AM, Nick Pentreath wrote:

> Hi
> 
> What KeyClass and ValueClass are you trying to save as the keys/values of
> your dataset?
> 
> 
> 
> On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
> 
> > Hi, all
> > 
> > I found a weird thing with saveAsNewAPIHadoopFile in
> > PairRDDFunctions.scala while working on another issue:
> > 
> > saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the time
> > 
> > I checked the commit history of the file; it seems that the API has existed
> > for a long time. Has no one else hit this? (that's why I'm confused)
> > 
> > Best,
> > 
> > --
> > Nan Zhu
> > 
> 
> 
> 




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/incubator-spark/pull/570#issuecomment-35837259
  
@pwendell I addressed the last point about pulling slf4j-log4j12 up into core 
(non-test), and the indentation issue. Tests look good.




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/570#discussion_r9976665
  
--- Diff: project/SparkBuild.scala ---
@@ -236,13 +236,15 @@ object SparkBuild extends Build {
 publishLocalBoth <<= Seq(publishLocal in MavenCompile, 
publishLocal).dependOn
   ) ++ net.virtualvoid.sbt.graph.Plugin.graphSettings ++ ScalaStyleSettings
 
-  val slf4jVersion = "1.7.2"
+  val slf4jVersion = "1.7.5"
 
   val excludeCglib = ExclusionRule(organization = 
"org.sonatype.sisu.inject")
   val excludeJackson = ExclusionRule(organization = "org.codehaus.jackson")
   val excludeNetty = ExclusionRule(organization = "org.jboss.netty")
   val excludeAsm = ExclusionRule(organization = "asm")
   val excludeSnappy = ExclusionRule(organization = "org.xerial.snappy")
+  val excludeCommonsLogging = ExclusionRule(organization = 
"commons-logging")
+  val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
--- End diff --

@pwendell What I see left are dependencies from third-party libraries on 
slf4j-api, which is fine. Most depend on 1.7.5 (so it's good that the version 
in Spark is bumped to 1.7.5), and a few use 1.6.x, which should be entirely 
compatible. It's also OK for dependencies to have slf4j-log4j12. So AFAICT 
it's fine in this regard.




[GitHub] incubator-spark pull request: SPARK-1078: Replace lift-json with j...

2014-02-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/incubator-spark/pull/582#issuecomment-35837078
  
Yes, I'll make the changes today.  Thanks, Aaron!




[GitHub] incubator-spark pull request: MLLIB-25: Implicit ALS runs out of m...

2014-02-23 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/incubator-spark/pull/629#issuecomment-35835626
  
@srowen good catch, thanks Sean. Didn't really think about this when I 
wrote it. Shows that testing on larger scale input data / params is always 
required!




ask for receiving spark user mailing list

2014-02-23 Thread Lianhui Wang
Hi,
  I would like to subscribe to the Spark user mailing list.

-- 
thanks

王联辉(Lianhui Wang)
blog: http://blog.csdn.net/lance_123
Interests: databases, distributed systems, data mining, programming languages, Internet technologies, etc.


Re: Anyone wants to look at SPARK-1123?

2014-02-23 Thread Nick Pentreath
Hi

What KeyClass and ValueClass are you trying to save as the keys/values of
your dataset?



On Sun, Feb 23, 2014 at 10:48 AM, Nan Zhu  wrote:

> Hi, all
>
> I found a weird thing with saveAsNewAPIHadoopFile in
> PairRDDFunctions.scala while working on another issue:
>
> saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the time
>
> I checked the commit history of the file; it seems that the API has existed
> for a long time. Has no one else hit this? (that's why I'm confused)
>
> Best,
>
> --
> Nan Zhu
>
>


[GitHub] incubator-spark pull request: Add Security to Spark - Akka, Http, ...

2014-02-23 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/332#discussion_r9975936
  
--- Diff: 
core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
@@ -483,10 +496,131 @@ private[spark] class ConnectionManager(port: Int, 
conf: SparkConf) extends Loggi
 /*handleMessage(connection, message)*/
   }
 
-  private def handleMessage(connectionManagerId: ConnectionManagerId, 
message: Message) {
+  private def handleClientAuthNeg(
+  waitingConn: SendingConnection,
+  securityMsg: SecurityMessage, 
+  connectionId : ConnectionId) {
+if (waitingConn.isSaslComplete()) {
+  logDebug("Client sasl completed for id: "  + 
waitingConn.connectionId)
+  connectionsAwaitingSasl -= waitingConn.connectionId
+  waitingConn.getAuthenticated().synchronized {
+waitingConn.getAuthenticated().notifyAll();
+  }
+  return
+} else {
+  var replyToken : Array[Byte] = null
+  try {
+replyToken = 
waitingConn.sparkSaslClient.saslResponse(securityMsg.getToken);
+if (waitingConn.isSaslComplete()) {
+  logDebug("Client sasl completed after evaluate for id: " + 
waitingConn.connectionId)
+  connectionsAwaitingSasl -= waitingConn.connectionId
+  waitingConn.getAuthenticated().synchronized {
+waitingConn.getAuthenticated().notifyAll()
+  }
+  return
+}
+var securityMsgResp = SecurityMessage.fromResponse(replyToken, 
securityMsg.getConnectionId)
+var message = securityMsgResp.toBufferMessage
+if (message == null) throw new Exception("Error creating security 
message")
+sendSecurityMessage(waitingConn.getRemoteConnectionManagerId(), 
message)
+  } catch  {
+case e: Exception => {
+  logError("Error doing sasl client: " + e)
+  waitingConn.close()
+  throw new Exception("error evaluating sasl response: " + e)
+}
+  }
+}
+  }
+
+  private def handleServerAuthNeg(
+  connection: Connection, 
+  securityMsg: SecurityMessage,
+  connectionId: ConnectionId) {
+if (!connection.isSaslComplete()) {
+  logDebug("saslContext not established")
+  var replyToken : Array[Byte] = null
+  try {
+connection.synchronized {
+  if (connection.sparkSaslServer == null) {
+logDebug("Creating sasl Server")
+connection.sparkSaslServer = new 
SparkSaslServer(securityManager)
+  }
+}
+replyToken = 
connection.sparkSaslServer.response(securityMsg.getToken)
+if (connection.isSaslComplete()) {
+  logDebug("Server sasl completed: " + connection.connectionId) 
+} else {
+  logDebug("Server sasl not completed: " + connection.connectionId)
+}
+if (replyToken != null) {
+  var securityMsgResp = SecurityMessage.fromResponse(replyToken, 
securityMsg.getConnectionId)
+  var message = securityMsgResp.toBufferMessage
+  if (message == null) throw new Exception("Error creating 
security Message")
+  sendSecurityMessage(connection.getRemoteConnectionManagerId(), 
message)
+} 
+  } catch {
+case e: Exception => {
+  logError("Error in server auth negotiation: " + e)
+  // It would probably be better to send an error message telling 
other side auth failed
+  // but for now just close
+  connection.close()
+}
+  }
+} else {
+  logDebug("connection already established for this connection id: " + 
connection.connectionId) 
+}
+  }
+
+
+  private def handleAuthentication(conn: Connection, bufferMessage: 
BufferMessage): Boolean = {
+if (bufferMessage.isSecurityNeg) {
+  logDebug("This is security neg message")
+
+  // parse as SecurityMessage
+  val securityMsg = SecurityMessage.fromBufferMessage(bufferMessage)
+  val connectionId = new ConnectionId(securityMsg.getConnectionId)
+
+  connectionsAwaitingSasl.get(connectionId) match {
+case Some(waitingConn) => {
+  // Client - this must be in response to us doing Send
+  logDebug("Client handleAuth for id: " +  
waitingConn.connectionId)
+  handleClientAuthNeg(waitingConn, securityMsg, connectionId)
+}
+case None => {
+  // Server - someone sent us something and we haven't 
authenticated yet
+  logDebug("Server handleAuth for id: " + connectionId)
+  handleServerAuthNeg(conn, securityMsg, connectionId)
+}
 

[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings

2014-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/586#discussion_r9975107
  
--- Diff: project/SparkBuild.scala ---
@@ -340,7 +336,8 @@ object SparkBuild extends Build {
   def streamingSettings = sharedSettings ++ Seq(
 name := "spark-streaming",
 libraryDependencies ++= Seq(
-  "commons-io" % "commons-io" % "2.4"
+  "commons-io" % "commons-io" % "2.4",
+  "org.codehaus.jackson" % "jackson-mapper-asl" % "1.9.11"
--- End diff --

Also, then I don't see a particular reason to bother excluding jackson 
(1.8.8) dependencies from Hadoop. It could be a problem to have no Jackson at 
all. I can undo that.




[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings

2014-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/586#discussion_r9975099
  
--- Diff: project/SparkBuild.scala ---
@@ -340,7 +336,8 @@ object SparkBuild extends Build {
   def streamingSettings = sharedSettings ++ Seq(
 name := "spark-streaming",
 libraryDependencies ++= Seq(
-  "commons-io" % "commons-io" % "2.4"
+  "commons-io" % "commons-io" % "2.4",
+  "org.codehaus.jackson" % "jackson-mapper-asl" % "1.9.11"
--- End diff --

This was just making the sbt build consistent with Maven. But yeah, on 
second glance it does look like Streaming doesn't even use Jackson! This can be 
removed in both places. Commons IO is used. I'll wait on your comment about 
splitting into a separate PR before moving forward with fixes like this one.
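
For concreteness, a sketch of what the streaming block would presumably 
collapse back to once the Jackson line is dropped (based on the diff above, 
not a committed change):

    // project/SparkBuild.scala (sketch): streaming keeps only Commons IO.
    def streamingSettings = sharedSettings ++ Seq(
      name := "spark-streaming",
      libraryDependencies ++= Seq(
        "commons-io" % "commons-io" % "2.4"   // still used by streaming
      )
    )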




Re: [DISCUSS] Extending public API

2014-02-23 Thread Mridul Muralidharan
Good point, and I was purposefully vague on that since that is something
which our community should evolve imo : this was just an initial proposal
:-)

For example: there are multiple ways to do cartesian - and each has its own
trade offs.

Another candidate could be, as I mentioned, new methods which can be
expressed as sequences of existing methods but would be slightly more
performant if done in one shot - like the self cartesian PR, various types
of join (which can become a contrib of its own btw !), experiments using
key indexes, ordering, etc.

Addition into sparkbank or contrib (or something better named!) does not
preclude future migration into core ... just an initial staging area for us
to evolve the API and get user feedback, without necessarily making the
Spark core API unstable.

Obviously, it is not a dumping ground for broken code/ideas ... and must
follow the same level of scrutiny and rigour before committing.
Regards
Mridul
 On Feb 23, 2014 11:53 AM, "Amandeep Khurana"  wrote:

> Mridul,
>
> Can you give examples of APIs that people have contributed (or wanted
> to contribute) but you categorize as something that would go into
> piggybank-like (sparkbank)? Curious to know how you'd decide what
> should go where.
>
> Amandeep
>
> > On Feb 22, 2014, at 10:06 PM, Mridul Muralidharan 
> wrote:
> >
> > Hi,
> >
> >  Over the past few months, I have seen a bunch of pull requests which
> have
> > extended spark api ... most commonly RDD itself.
> >
> > Most of them are either relatively niche cases of specialization (which
> > might not be useful for most cases) or idioms which can be expressed
> > (sometimes with minor perf penalty) using existing api.
> >
> > While all of them have non zero value (hence the effort to contribute,
> and
> > gladly welcomed !) they are extending the api in nontrivial ways and
> have a
> > maintenance cost ... and we already have a pending effort to clean up our
> > interfaces prior to 1.0
> >
> > I believe there is a need to keep exposed api succinct, expressive and
> > functional in spark; while at the same time, encouraging extensions and
> > specialization within spark codebase so that other users can benefit from
> > the shared contributions.
> >
> > One approach could be to start something akin to piggybank in pig to
> > contribute user generated specializations, helper utils, etc : bundled as
> > part of spark, but not part of core itself.
> >
> > Thoughts, comments ?
> >
> > Regards,
> > Mridul
>


[GitHub] incubator-spark pull request: SPARK-1084. Fix most build warnings

2014-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/incubator-spark/pull/586#issuecomment-35827729
  
@aarondav Sure, it's already split into commits, and one of them has the 
dependency changes: 
https://github.com/srowen/incubator-spark/commit/6f2f67974bfedd40bafccd77abd0860dcbba4061
 Move this to another PR?




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/570#discussion_r9975070
  
--- Diff: project/SparkBuild.scala ---
@@ -236,13 +236,15 @@ object SparkBuild extends Build {
 publishLocalBoth <<= Seq(publishLocal in MavenCompile, 
publishLocal).dependOn
   ) ++ net.virtualvoid.sbt.graph.Plugin.graphSettings ++ ScalaStyleSettings
 
-  val slf4jVersion = "1.7.2"
+  val slf4jVersion = "1.7.5"
 
   val excludeCglib = ExclusionRule(organization = 
"org.sonatype.sisu.inject")
   val excludeJackson = ExclusionRule(organization = "org.codehaus.jackson")
   val excludeNetty = ExclusionRule(organization = "org.jboss.netty")
   val excludeAsm = ExclusionRule(organization = "asm")
   val excludeSnappy = ExclusionRule(organization = "org.xerial.snappy")
+  val excludeCommonsLogging = ExclusionRule(organization = 
"commons-logging")
+  val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
--- End diff --

I thought I got all of them but let me double-check with mvn 
dependency:tree, and then check that the sbt build does the same.




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/570#discussion_r9975071
  
--- Diff: project/SparkBuild.scala ---
@@ -268,9 +272,9 @@ object SparkBuild extends Build {
 "it.unimi.dsi" % "fastutil" % "6.4.4",
 "colt" % "colt" % "1.2.0",
 "org.apache.mesos" % "mesos"% "0.13.0",
-"net.java.dev.jets3t"  % "jets3t"   % "0.7.1",
+"net.java.dev.jets3t"  % "jets3t"   % "0.7.1" 
excludeAll(excludeCommonsLogging),
 "org.apache.derby" % "derby"% "10.4.2.0"   
  % "test",
-"org.apache.hadoop"% hadoopClient   % hadoopVersion 
excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
+"org.apache.hadoop"% hadoopClient% hadoopVersion 
excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib, 
excludeCommonsLogging, excludeSLF4J),
--- End diff --

Will do.




[GitHub] incubator-spark pull request: SPARK-1071: Tidy logging strategy an...

2014-02-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/incubator-spark/pull/570#discussion_r9975073
  
--- Diff: bagel/pom.xml ---
@@ -51,6 +51,11 @@
   scalacheck_${scala.binary.version}
   test
 
+
+  org.slf4j
+  slf4j-log4j12
+  test
--- End diff --

Yeah I think that's best, will modify it accordingly.




Re: [DISCUSS] Extending public API

2014-02-23 Thread Sean Owen
Thank you for bringing this up. I think the current committers are
bravely facing down a flood of PRs, and this (among other things) is a
step that needs to be taken to scale up and keep this fun. I'd love to
have a separate discussion about more steps, but for here I offer two
bits of advice from experience:

First, you guys most certainly can and should say 'no' to some
changes. It's part of keeping the project coherent. It's always good
to try to include all contributions, but, appreciating contributions
does not always mean accepting them. I have seen projects turned to
mush by the 'anything's welcome' mentality. Push back on contributors
to contribute the thing you think is right. Please keep the API
succinct, yes.

Second, contrib/ modules are problematic. It becomes a ball of legacy
code that you still have to keep maintaining to compile and run. In a
world of Github, I think 'contrib' stuff just belongs in other repos.
I know it sounds harmless to have a contrib, but I think you'd find
the consensus here is that contrib is a mistake.

$0.02 --
--
Sean Owen | Director, Data Science | London


On Sun, Feb 23, 2014 at 6:06 AM, Mridul Muralidharan  wrote:
> Hi,
>
>   Over the past few months, I have seen a bunch of pull requests which have
> extended spark api ... most commonly RDD itself.
>
> Most of them are either relatively niche cases of specialization (which
> might not be useful for most cases) or idioms which can be expressed
> (sometimes with minor perf penalty) using existing api.
>
> While all of them have non zero value (hence the effort to contribute, and
> gladly welcomed !) they are extending the api in nontrivial ways and have a
> maintenance cost ... and we already have a pending effort to clean up our
> interfaces prior to 1.0
>
> I believe there is a need to keep exposed api succinct, expressive and
> functional in spark; while at the same time, encouraging extensions and
> specialization within spark codebase so that other users can benefit from
> the shared contributions.
>
> One approach could be to start something akin to piggybank in pig to
> contribute user generated specializations, helper utils, etc : bundled as
> part of spark, but not part of core itself.
>
> Thoughts, comments ?
>
> Regards,
> Mridul


Re: [DISCUSS] Extending public API

2014-02-23 Thread Cheng Lian
I think SPARK-1063 (PR-503) “Add .sortBy(f) method on RDD” would be a good 
example. Note that I’m not saying that this PR is already qualified to be 
accepted, just take it as an example:
JIRA issue: https://spark-project.atlassian.net/browse/SPARK-1063
GitHub PR: https://github.com/apache/incubator-spark/pull/508
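
It is also a tidy illustration of Mridul's point that many such additions can 
be phrased with the existing API. A rough sketch of sortBy in terms of 
existing operations (illustrative only, not the code in PR 508; the exact 
implicits needed vary by Spark version):

    import org.apache.spark.SparkContext._   // implicit pair-RDD conversions
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Key each element by the projection, sort by that key, drop the key.
    def sortBy[T: ClassTag, K: Ordering: ClassTag](rdd: RDD[T])(f: T => K): RDD[T] =
      rdd.keyBy(f).sortByKey().map(_._2)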

On Feb 23, 2014, at 2:23 PM, Amandeep Khurana  wrote:

> Mridul,
> 
> Can you give examples of APIs that people have contributed (or wanted
> to contribute) but you categorize as something that would go into
> piggybank-like (sparkbank)? Curious to know how you'd decide what
> should go where.
> 
> Amandeep
> 
>> On Feb 22, 2014, at 10:06 PM, Mridul Muralidharan  wrote:
>> 
>> Hi,
>> 
>> Over the past few months, I have seen a bunch of pull requests which have
>> extended spark api ... most commonly RDD itself.
>> 
>> Most of them are either relatively niche cases of specialization (which
>> might not be useful for most cases) or idioms which can be expressed
>> (sometimes with minor perf penalty) using existing api.
>> 
>> While all of them have non zero value (hence the effort to contribute, and
>> gladly welcomed !) they are extending the api in nontrivial ways and have a
>> maintenance cost ... and we already have a pending effort to clean up our
>> interfaces prior to 1.0
>> 
>> I believe there is a need to keep exposed api succinct, expressive and
>> functional in spark; while at the same time, encouraging extensions and
>> specialization within spark codebase so that other users can benefit from
>> the shared contributions.
>> 
>> One approach could be to start something akin to piggybank in pig to
>> contribute user generated specializations, helper utils, etc : bundled as
>> part of spark, but not part of core itself.
>> 
>> Thoughts, comments ?
>> 
>> Regards,
>> Mridul



Anyone wants to look at SPARK-1123?

2014-02-23 Thread Nan Zhu
Hi, all  

I found a weird thing with saveAsNewAPIHadoopFile in PairRDDFunctions.scala 
while working on another issue:  

saveAsNewAPIHadoopFile throws java.lang.InstantiationException all the time.

I checked the commit history of the file; it seems that the API has existed for 
a long time. Has no one else hit this? (that's why I'm confused)

Best,  

--  
Nan Zhu



[GitHub] incubator-spark pull request: [java8API] SPARK-964 Investigate the...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/539#issuecomment-35826379
  
Build finished.




[GitHub] incubator-spark pull request: [SPARK-1102] Create a saveAsNewAPIHa...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/636#issuecomment-35826394
  
Can one of the admins verify this patch?




[GitHub] incubator-spark pull request: [java8API] SPARK-964 Investigate the...

2014-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/incubator-spark/pull/539#issuecomment-35826380
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12817/



