[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74776721
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27657/
Test PASSed.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4629





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74776715
  
[Test build #27657 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27657/consoleFull) for PR 4629 at commit [`dffe34e`](https://github.com/apache/spark/commit/dffe34ee262aa098c12323fd27995ce9f542fa95).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class Partitioner(object):`
  * `class SparkJobInfo(namedtuple("SparkJobInfo", "jobId stageIds status")):`
  * `class SparkStageInfo(namedtuple("SparkStageInfo",`
  * `class StatusTracker(object):`






[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74789577
  
Thanks for adding the test.

LGTM, so I'm going to merge this into `master` (1.4.0) and `branch-1.3` 
(1.3.0).  Thanks!





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24852684
  
--- Diff: python/pyspark/tests.py ---
@@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
         converted_rdd = RDD(data_python_rdd, self.sc)
         self.assertEqual(2, converted_rdd.count())
 
+    def test_narrow_dependency_in_join(self):
+        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
--- End diff --

Do these tests actually check for a narrow dependency at all? I think they will pass even without it.

I'm not sure of a better suggestion, though.  I had to use 
`getNarrowDependencies` in another PR to check this:

https://github.com/apache/spark/pull/4449/files#diff-4bc3643ce90b54113cad7104f91a075bR582

but I don't think that is even exposed in PySpark...





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24853424
  
--- Diff: python/pyspark/tests.py ---
@@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
         converted_rdd = RDD(data_python_rdd, self.sc)
         self.assertEqual(2, converted_rdd.count())
 
+    def test_narrow_dependency_in_join(self):
+        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
--- End diff --

This test only checks correctness; I will add more checks for the narrow dependency based on the Python progress API (#3027).





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24857765
  
--- Diff: python/pyspark/tests.py ---
@@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
         converted_rdd = RDD(data_python_rdd, self.sc)
         self.assertEqual(2, converted_rdd.count())
 
+    def test_narrow_dependency_in_join(self):
+        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
--- End diff --

I've merged #3027, so I think we can now test this by setting a job group, 
running a job, then querying the statusTracker to determine how many stages 
were actually run as part of that job.
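
Not from this PR, but a rough sketch of what such a check could look like with the Python status API from #3027; the helper name, job-group id, and app name below are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "stage-count-sketch")  # illustrative app name


def stages_run_in_group(sc, group, action):
    """Run `action` under a job group and count the distinct stages it ran."""
    sc.setJobGroup(group, "narrow dependency check")
    action()
    tracker = sc.statusTracker()
    stage_ids = set()
    for job_id in tracker.getJobIdsForGroup(group):
        info = tracker.getJobInfo(job_id)
        if info is not None:
            stage_ids.update(info.stageIds)
    return len(stage_ids)


rdd = sc.parallelize(range(10)).map(lambda x: (x, x)).partitionBy(2)
# A self-join of an already-partitioned RDD should run fewer stages than a
# join that has to shuffle both sides again.
stages = stages_run_in_group(sc, "narrow_check", lambda: rdd.join(rdd).count())
print(stages)
```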





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74763180
  
[Test build #27657 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27657/consoleFull) for PR 4629 at commit [`dffe34e`](https://github.com/apache/spark/commit/dffe34ee262aa098c12323fd27995ce9f542fa95).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24857907
  
--- Diff: python/pyspark/tests.py ---
@@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
         converted_rdd = RDD(data_python_rdd, self.sc)
         self.assertEqual(2, converted_rdd.count())
 
+    def test_narrow_dependency_in_join(self):
+        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
--- End diff --

done!





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-17 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24861247
  
--- Diff: python/pyspark/tests.py ---
@@ -740,6 +739,27 @@ def test_multiple_python_java_RDD_conversions(self):
         converted_rdd = RDD(data_python_rdd, self.sc)
         self.assertEqual(2, converted_rdd.count())
 
+    def test_narrow_dependency_in_join(self):
+        rdd = self.sc.parallelize(range(10)).map(lambda x: (x, x))
--- End diff --

nice!





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74575564
  
[Test build #27573 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27573/consoleFull) for PR 4629 at commit [`eb26c62`](https://github.com/apache/spark/commit/eb26c62f4a3dc5920df2d2624918826d32d97bb5).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24778077
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -961,7 +961,14 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   }
 
   /** Build the union of a list of RDDs. */
-  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = new UnionRDD(this, rdds)
+  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = {
+    val partitioners = rdds.map(_.partitioner).toSet
+    if (partitioners.size == 1 && partitioners.head.isDefined) {
+      new PartitionerAwareUnionRDD(this, rdds)
+    } else {
+      new UnionRDD(this, rdds)
+    }
+  }
 
   /** Build the union of a list of RDDs passed as variable-length arguments. */
   def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T] =
--- End diff --

Can we change this method to call the `union` method that you modified so 
the change will take effect here, too?





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24778143
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -330,6 +331,15 @@ private[spark] object PythonRDD extends Logging {
   }
 
   /**
+   * Return an RDD of values from an RDD of (Long, Array[Byte]), with preservePartitions=true
+   *
+   * This is useful for PySpark to have the partitioner after partitionBy()
+   */
+  def valueOfPair(pair: JavaPairRDD[Long, Array[Byte]]): JavaRDD[Array[Byte]] = {
--- End diff --

I think that `JavaPairRDD.values` should do the same thing; is there a 
reason why we can't call that directly from Python?





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/4629

[SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark

Currently, PySpark does not support a narrow dependency during cogroup/join when the two RDDs share the same partitioner, so an unnecessary extra shuffle stage is introduced.

The Python implementation of cogroup/join is different from the Scala one: it is built on union() and partitionBy(). This patch uses PartitionerAwareUnionRDD() in union() when all the RDDs have the same partitioner. It also correctly sets `preservesPartitioning` in map() and mapPartitions(), so that partitionBy() can skip the unnecessary shuffle stage.
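
As an illustration (not part of the PR text), here is a minimal PySpark sketch of the pattern this patch targets: both sides of a join already carry the same partitioner, so the join should not need another shuffle stage. The app name and data are made up for the example.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "narrow-dependency-sketch")  # illustrative app name

# Hash-partition both sides into the same number of partitions so that they
# share a partitioner.
left = sc.parallelize(range(10)).map(lambda x: (x, x)).partitionBy(2)
right = sc.parallelize(range(10)).map(lambda x: (x, x * 10)).partitionBy(2)

# With this patch, join() builds on a partitioner-aware union, so the already
# co-partitioned sides can be joined without shuffling them again.
joined = left.join(right)
print(joined.count())  # 10 joined pairs
```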

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark narrow

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4629.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4629


commit eb26c62f4a3dc5920df2d2624918826d32d97bb5
Author: Davies Liu dav...@databricks.com
Date:   2015-02-16T21:17:11Z

narrow dependency in PySpark







[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74579230
  
[Test build #27582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27582/consoleFull) for PR 4629 at commit [`ff5a0a6`](https://github.com/apache/spark/commit/ff5a0a6b5dd408f2a177459e6b5498ea72f57b85).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74581079
  
[Test build #27587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27587/consoleFull) for PR 4629 at commit [`cc28d97`](https://github.com/apache/spark/commit/cc28d97cc5c629102333ac9a91a7d323583cd4e6).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74585242
  
[Test build #27582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27582/consoleFull) for PR 4629 at commit [`ff5a0a6`](https://github.com/apache/spark/commit/ff5a0a6b5dd408f2a177459e6b5498ea72f57b85).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class Partitioner(object):`






[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74585248
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27582/
Test FAILed.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24778859
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -330,6 +331,15 @@ private[spark] object PythonRDD extends Logging {
   }
 
   /**
+   * Return an RDD of values from an RDD of (Long, Array[Byte]), with preservePartitions=true
+   *
+   * This is useful for PySpark to have the partitioner after partitionBy()
+   */
+  def valueOfPair(pair: JavaPairRDD[Long, Array[Byte]]): JavaRDD[Array[Byte]] = {
--- End diff --

In the Scala/Java API, RDD.values() changes an RDD of (K, V) into an RDD of V, so `preservePartitions` should not be `true`.

For PySpark, it changes the RDD from (hash, [(K, V)]) to (K, V), so `preservePartitions` should be `true`.
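
A minimal sketch of that distinction on the Python side, assuming the partitioner propagation this patch adds (the SparkContext setup and variable names are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "preserves-partitioning-sketch")  # illustrative

rdd = sc.parallelize(range(10)).map(lambda x: (x, x)).partitionBy(4)

# Without preservesPartitioning, the result no longer advertises a partitioner,
# so a later partitionBy()/join() would shuffle again.
dropped = rdd.mapPartitions(lambda it: it)

# With preservesPartitioning=True, the existing partitioner is kept and the
# shuffle can be skipped.
kept = rdd.mapPartitions(lambda it: it, preservesPartitioning=True)

print(dropped.partitioner is None)           # True
print(kept.partitioner == rdd.partitioner)   # True, with this patch applied
```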





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24780473
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -961,7 +961,14 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   }
 
   /** Build the union of a list of RDDs. */
-  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = new UnionRDD(this, rdds)
+  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = {
+    val partitioners = rdds.map(_.partitioner).toSet
+    if (partitioners.size == 1 && partitioners.head.isDefined) {
+      new PartitionerAwareUnionRDD(this, rdds)
+    } else {
+      new UnionRDD(this, rdds)
+    }
+  }
 
   /** Build the union of a list of RDDs passed as variable-length arguments. */
arguments. */
   def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T] =
--- End diff --

fixed





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74583747
  
[Test build #27573 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27573/consoleFull) for PR 4629 at commit [`eb26c62`](https://github.com/apache/spark/commit/eb26c62f4a3dc5920df2d2624918826d32d97bb5).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74583752
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27573/
Test PASSed.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74579783
  
[Test build #27583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27583/consoleFull) for PR 4629 at commit [`940245e`](https://github.com/apache/spark/commit/940245e37bf08492d6b5cd7cd82f8f0886f6f8ca).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74628437
  
[Test build #611 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/611/consoleFull) for PR 4629 at commit [`4d29932`](https://github.com/apache/spark/commit/4d29932172301731db904176636d530631f448ea).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74597976
  
[Test build #610 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/610/consoleFull) for PR 4629 at commit [`cc28d97`](https://github.com/apache/spark/commit/cc28d97cc5c629102333ac9a91a7d323583cd4e6).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4629#discussion_r24787685
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -961,11 +961,18 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   }
 
   /** Build the union of a list of RDDs. */
-  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = new UnionRDD(this, rdds)
+  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = {
+    val partitioners = rdds.map(_.partitioner).toSet
--- End diff --

If `_.partitioner` is an `Option`, then I think this can be simplified by using `flatMap` instead of `map`, since that would let you just check whether `partitioners.size == 1` on the next line without also needing the `isDefined` check.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74590143
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27583/
Test PASSed.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74590132
  
[Test build #27583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27583/consoleFull) for PR 4629 at commit [`940245e`](https://github.com/apache/spark/commit/940245e37bf08492d6b5cd7cd82f8f0886f6f8ca).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class Partitioner(object):`






[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74589214
  
[Test build #610 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/610/consoleFull) for PR 4629 at commit [`cc28d97`](https://github.com/apache/spark/commit/cc28d97cc5c629102333ac9a91a7d323583cd4e6).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74590835
  
[Test build #27587 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27587/consoleFull) for PR 4629 at commit [`cc28d97`](https://github.com/apache/spark/commit/cc28d97cc5c629102333ac9a91a7d323583cd4e6).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class Partitioner(object):`
  * `case class ParquetRelation2(`






[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74590851
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27587/
Test PASSed.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74607728
  
LGTM overall; this is tricky logic, though, so I'll take one more pass 
through when I get home.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74621492
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27612/
Test FAILed.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74621421
  
[Test build #27612 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27612/consoleFull) for PR 4629 at commit [`4d29932`](https://github.com/apache/spark/commit/4d29932172301731db904176636d530631f448ea).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74621490
  
[Test build #27612 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27612/consoleFull) for PR 4629 at commit [`4d29932`](https://github.com/apache/spark/commit/4d29932172301731db904176636d530631f448ea).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class Partitioner(object):`






[GitHub] spark pull request: [SPARK-5785] [PySpark] narrow dependency for c...

2015-02-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4629#issuecomment-74622588
  
[Test build #611 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/611/consoleFull) for PR 4629 at commit [`4d29932`](https://github.com/apache/spark/commit/4d29932172301731db904176636d530631f448ea).
 * This patch merges cleanly.

