date:20141020

[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread Shiti

Github user Shiti commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59886741
  
This issue occurs because the Maven build definition and plugin configuration 
for yarn-alpha and yarn-stable are the same as those for yarn common. As a 
result, the `scalaSource` setting of each child project is set to that of the 
parent, and since `scalaSource` is a `SettingKey[File]`, we cannot add multiple 
Scala source directories for the same project. The scalaStyle plugin depends on 
this `SettingKey` to determine which Scala files to check. Modifying the 
settings for the yarn-specific projects in the Scala build file when required 
is a better approach to fixing this issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59886278
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21976/consoleFull)
 for   PR 2087 at commit 
[`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030).
 * This patch merges cleanly.



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-20 Thread sryza

Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59885707
  
Jenkins, retest this please.



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-20 Thread sryza

Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59885695
  
Cool, the updated patch addresses the comments. It looks like the test failure 
was caused by a failure to fetch from git.



[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

2014-10-20 Thread yu-iskw

Github user yu-iskw closed the pull request at:

https://github.com/apache/spark/pull/1964



[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

2014-10-20 Thread yu-iskw

Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/1964#issuecomment-59885635
  
Because this patch does not fit the Spark design concept, I am closing this PR 
without merging.

(http://apache-spark-developers-list.1001551.n3.nabble.com/Standardized-Distance-Functions-in-MLlib-td8697.html)
Thank you very much for your cooperation.



[GitHub] spark pull request: [SPARK-4003] [SQL] add 3 types for java SQL co...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2850#issuecomment-59885505
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21975/consoleFull)
 for   PR 2850 at commit 
[`bb0508f`](https://github.com/apache/spark/commit/bb0508f1382186c20ddb80b6032f3fce5c6cf6aa).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4003] [SQL] add 3 types for java SQL co...

2014-10-20 Thread adrian-wang

Github user adrian-wang commented on the pull request:

https://github.com/apache/spark/pull/2850#issuecomment-59885063
  
retest this please.



[GitHub] spark pull request: [SPARK-3405] add subnet-id and vpc-id options ...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2872#issuecomment-59884935
  
Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2844#issuecomment-59884782
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21974/consoleFull)
 for   PR 2844 at commit 
[`1e8268d`](https://github.com/apache/spark/commit/1e8268d6111e4ad45e2acfe47d837718f2170461).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59884738
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21966/
Test PASSed.



[GitHub] spark pull request: [SPARK-3405] add subnet-id and vpc-id options ...

2014-10-20 Thread mvj101

GitHub user mvj101 opened a pull request:

https://github.com/apache/spark/pull/2872

[SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py

Based on this gist:
https://gist.github.com/amar-analytx/0b62543621e1f246c0a2

We use security group ids instead of security group to get around this 
issue:
https://github.com/boto/boto/issues/350
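
A minimal sketch of how the two options might be wired up, assuming an 
optparse-based script. Only the option names `--subnet-id` and `--vpc-id` come 
from the PR title; the help strings, the defaults, the `launch_kwargs` helper, 
and the keyword names it returns are hypothetical and are not taken from the 
patch.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with the two VPC-related options named in the PR title."""
    parser = OptionParser(usage="spark-ec2 [options] <action> <cluster_name>")
    # Help strings and defaults are assumptions made for this sketch.
    parser.add_option("--subnet-id", default=None,
                      help="VPC subnet to launch instances in")
    parser.add_option("--vpc-id", default=None,
                      help="VPC to create the security groups in")
    return parser


def launch_kwargs(opts, security_groups):
    """Choose keyword arguments for a hypothetical instance-launch call.

    `security_groups` is a list of objects with `.id` and `.name` attributes.
    Inside a VPC subnet the groups are referenced by id instead of by name.
    """
    if opts.subnet_id is not None:
        return {"subnet_id": opts.subnet_id,
                "security_group_ids": [g.id for g in security_groups]}
    return {"security_groups": [g.name for g in security_groups]}
```

With `--subnet-id` set, the helper passes group ids; without it, it falls back 
to group names, which is the behaviour the description above motivates.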

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mvj101/spark SPARK-3405

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2872.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2872


commit 52aaeec7b03251f3fcb4d1cf892df7c592e03408
Author: Mike Jennings 
Date:   2014-10-21T06:05:09Z

[SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py





[GitHub] spark pull request: [SPARK-4003] [SQL] add 3 types for java SQL co...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2850#issuecomment-59884695
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21968/consoleFull)
 for   PR 2850 at commit 
[`bb0508f`](https://github.com/apache/spark/commit/bb0508f1382186c20ddb80b6032f3fce5c6cf6aa).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59884734
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21966/consoleFull)
 for   PR 2868 at commit 
[`13585e8`](https://github.com/apache/spark/commit/13585e8738e35743c6c0ab482d34552f01939bd4).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class JavaFutureActionWrapper[S, T](futureAction: FutureAction[S], 
converter: S => T)`
  * `  class SerializableMapWrapper[A, B](underlying: collection.Map[A, B])`
  * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`
  * `class Predict(`
  * `case class EvaluatePython(`




[GitHub] spark pull request: [SPARK-4003] [SQL] add 3 types for java SQL co...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2850#issuecomment-59884702
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21968/
Test FAILed.



[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-20 Thread scwf

Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-59884283
  
@marmbrus in #2499, I reproduced the golden answers and changed some *.ql 
files because of the 0.13 changes; the tests passed on my local machine.
@zhzhan I don't follow you. Why replace the query plan?



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59884328
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21965/consoleFull)
 for   PR 2868 at commit 
[`6b05af0`](https://github.com/apache/spark/commit/6b05af042656b192e7b14954a433a75468df1d1c).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59884334
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21965/
Test PASSed.



[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-20 Thread JoshRosen

Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2844#discussion_r19131009
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -227,6 +217,7 @@ private object TorrentBroadcast extends Logging {
    * If removeFromDriver is true, also remove these persisted blocks on the driver.
    */
   def unpersist(id: Long, removeFromDriver: Boolean, blocking: Boolean) = {
+    logInfo(s"Unpersisting TorrentBroadcast $id")
--- End diff --

I'll try to get #2851 merged this week; I'm in the middle of some 
significant UI code cleanup and I'm planning to merge most of the existing UI 
patches or to re-implement them myself.



[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-20 Thread shivaram

Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/2844#discussion_r19130957
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -227,6 +217,7 @@ private object TorrentBroadcast extends Logging {
    * If removeFromDriver is true, also remove these persisted blocks on the driver.
    */
   def unpersist(id: Long, removeFromDriver: Boolean, blocking: Boolean) = {
+    logInfo(s"Unpersisting TorrentBroadcast $id")
--- End diff --

It's mostly for debugging which broadcasts have been removed and which have 
not. It can probably be made debug once we have a UI for this (#2851), but 
right now looking at the driver logs is the only way to figure out whether a 
broadcast variable has been removed.
Also, it's just one line per broadcast variable (we have 2-3 lines per 
variable when it is created).
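
To make the trade-off concrete, here is a small sketch using Python's standard 
`logging` module in place of Spark's `Logging` trait. The `make_logger` and 
`unpersist` helpers are hypothetical stand-ins; only the log message text 
mirrors the diff. With the logger at INFO, an info-level message reaches the 
log and a debug-level one is dropped.

```python
import io
import logging


def make_logger(level):
    """Return a logger that writes to an in-memory buffer at the given level."""
    buffer = io.StringIO()
    logger = logging.getLogger("broadcast-sketch")
    logger.handlers = []          # start from a clean handler list
    logger.propagate = False      # keep output out of the root logger
    logger.setLevel(level)
    logger.addHandler(logging.StreamHandler(buffer))
    return logger, buffer


def unpersist(logger, broadcast_id, at_debug=False):
    """Hypothetical stand-in for unpersist: only the log call matters here."""
    log = logger.debug if at_debug else logger.info
    log("Unpersisting TorrentBroadcast %d", broadcast_id)
```

Calling `unpersist(logger, 7)` on a logger created with `logging.INFO` writes 
one line to the buffer; the same call with `at_debug=True` writes nothing.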



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59883998
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21964/
Test PASSed.



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59883993
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21964/consoleFull)
 for   PR 2866 at commit 
[`c23897a`](https://github.com/apache/spark/commit/c23897aea7881eb819ec074073a4431ec8ba7eb5).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread Ishiihara

Github user Ishiihara commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59883834
  
@JoshRosen I have been looking into the compressed bitmap and already have a 
good idea of how to use a roaring bitmap to perform the task. If this work is 
not urgent, can you give me a day or two to get the compressed bitmap part 
completed? Thanks!



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread JoshRosen

Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59883237
  
@rxin that's a fair solution, too, although the bitmap needs to be 
losslessly compressed.

I could imagine cases where data is already partitioned but a user performs 
partition-preserving operations without specifying `preservesPartitioning`, 
then does a filtering operation that would otherwise benefit from partitioning. 
 In these cases, you might have this extreme bimodal distribution where most 
blocks are zero but the remaining blocks might be big.  In these cases, do you 
care about the exact sizes of those blocks?  Probably not in most cases, since 
there will be few blocks.
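
One way to read the idea, sketched in Python: record exactly which blocks are 
empty, and keep only an approximate size for the rest. The `CompressedSizes` 
class is hypothetical and is not the code in this PR; a plain Python integer 
stands in for the losslessly compressed bitmap, and a single average stands in 
for whatever size estimate is eventually chosen.

```python
class CompressedSizes:
    """Track which blocks are empty exactly; approximate the non-empty sizes."""

    def __init__(self, sizes):
        self.num_blocks = len(sizes)
        # Bit i is set when block i is empty. A Python int serves as the
        # bitmap; a compressed bitmap library would take its place in practice.
        self.empty_bitmap = 0
        total = 0
        non_empty = 0
        for i, size in enumerate(sizes):
            if size == 0:
                self.empty_bitmap |= 1 << i
            else:
                total += size
                non_empty += 1
        # One shared estimate for every non-empty block (integer average).
        self.avg_size = total // non_empty if non_empty else 0

    def size_of(self, i):
        """Return 0 for an empty block, otherwise the shared estimate."""
        if (self.empty_bitmap >> i) & 1:
            return 0
        return self.avg_size
```

Emptiness is never misreported, so a reducer can skip empty blocks safely; 
only the sizes of the few non-empty blocks lose precision.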

I'll look into folding this into the compressed version as you've suggested.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59883187
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21963/consoleFull)
 for   PR 2520 at commit 
[`c5b2a33`](https://github.com/apache/spark/commit/c5b2a3399d5c57ea0b5e0d15dabf7ee28d1ffaa5).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59883190
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21963/
Test PASSed.



[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-20 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2844#discussion_r19130611
  
--- Diff: 
core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala ---
@@ -84,6 +89,24 @@ class BroadcastSuite extends FunSuite with LocalSparkContext {
     assert(results.collect().toSet === (1 to numSlaves).map(x => (x, 10)).toSet)
   }
 
+  test("TorrentBroadcast's blockifyObject and unblockifyObject are inverses") {
+    import org.apache.spark.broadcast.TorrentBroadcast._
+    val blockSize = 1024
+    val conf = new SparkConf()
+    val compressionCodec = Some(new SnappyCompressionCodec(conf))
+    val serializer = new JavaSerializer(conf)
+    val objects = for (size <- Gen.choose(1, 1024 * 10)) yield {
--- End diff --

as discussed offline, maybe just use a random number generator here since 
Gen brings extra complexity but not much benefit in this specific case.
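
A rough Python sketch of the suggestion, assuming a seeded random number 
generator in place of a property-testing generator. The `blockify` and 
`unblockify` functions here are hypothetical analogues of `blockifyObject` and 
`unblockifyObject`; `pickle` and `zlib` are swapped in for the Java serializer 
and Snappy codec used in the actual test.

```python
import pickle
import random
import zlib


def blockify(obj, block_size):
    """Serialize and compress obj, then split the bytes into fixed-size blocks."""
    data = zlib.compress(pickle.dumps(obj))
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def unblockify(blocks):
    """Inverse of blockify: join the blocks, decompress, and deserialize."""
    return pickle.loads(zlib.decompress(b"".join(blocks)))


def check_round_trip(trials=20, seed=42, block_size=1024):
    """Round-trip byte strings of random size; return how many were checked."""
    rng = random.Random(seed)  # a fixed seed keeps any failure reproducible
    for _ in range(trials):
        size = rng.randint(1, 1024 * 10)
        obj = bytes(rng.getrandbits(8) for _ in range(size))
        blocks = blockify(obj, block_size)
        assert all(len(b) <= block_size for b in blocks)
        assert unblockify(blocks) == obj
    return trials
```

The size range matches the `Gen.choose(1, 1024 * 10)` bounds in the diff, and 
seeding the generator keeps the test deterministic from run to run.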



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59883025
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21973/consoleFull)
 for   PR 2520 at commit 
[`c5b2a33`](https://github.com/apache/spark/commit/c5b2a3399d5c57ea0b5e0d15dabf7ee28d1ffaa5).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4023] [MLlib] [PySpark] convert rdd int...

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2870#discussion_r19130591
  
--- Diff: python/pyspark/mllib/tests.py ---
@@ -202,6 +204,16 @@ def test_regression(self):
         self.assertTrue(dt_model.predict(features[3]) > 0)
 
 
+class StatTests(PySparkTestCase):
+    # SPARK-4023
+    def test_col_with_random_rdd(self):
+        data = RandomRDDs.normalVectorRDD(self.sc, 1000, 10, 10)
+        summary = Statistics.colStats(data)
+        self.assertEqual(1000, summary.count())
+        mean = summary.mean()
+        self.assertTrue(all(abs(v) < 0.1 for v in mean))
--- End diff --

This is a non-deterministic test. For SPARK-4023, we only need to test 
`colStats` and other methods for RDDs of numpy arrays and python arrays.
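
As an illustration of the deterministic style, here is a sketch that runs 
without Spark. The `col_means` function is a hypothetical pure-Python stand-in 
for `Statistics.colStats(...).mean()`, and the input mixes a plain list with 
`array.array` rows; an actual test would parallelize numpy arrays and Python 
arrays into an RDD and call `colStats` on it.

```python
from array import array


def col_means(rows):
    """Column-wise mean of equal-length rows (stand-in for colStats().mean())."""
    rows = [list(r) for r in rows]
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def test_col_means_fixed_input():
    # Fixed rows make the expected means exact, so the test cannot flake.
    data = [
        [1.0, 2.0, 3.0],
        array("d", [4.0, 5.0, 6.0]),
        array("d", [7.0, 8.0, 9.0]),
    ]
    assert col_means(data) == [4.0, 5.0, 6.0]
```

Because the input is fixed, the assertion can compare against exact values 
instead of a tolerance around zero.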



[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-20 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2844#discussion_r19130571
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -227,6 +217,7 @@ private object TorrentBroadcast extends Logging {
    * If removeFromDriver is true, also remove these persisted blocks on the driver.
    */
   def unpersist(id: Long, removeFromDriver: Boolean, blocking: Boolean) = {
+    logInfo(s"Unpersisting TorrentBroadcast $id")
--- End diff --

I don't feel super strongly over this one, but I feel given this is for 
"debugging" of exceptional cases, it should be in debug. If your worry is that 
the broadcast cleaner might clean up stuff prematurely, then I think we should 
log in the cleaner instead.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread sarutak

Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59882835
  
retest this please.



[GitHub] spark pull request: [SPARK-2759][CORE] Generic Binary File Support...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1658#issuecomment-59882709
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21972/consoleFull)
 for   PR 1658 at commit 
[`8ac288b`](https://github.com/apache/spark/commit/8ac288bc09e779f1b4c96dcb497ee4eca962439f).
 * This patch merges cleanly.



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59882564
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21969/
Test FAILed.



[GitHub] spark pull request: [SPARK-4023] [MLlib] [PySpark] convert rdd int...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2870#issuecomment-59882352
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21961/consoleFull)
 for   PR 2870 at commit 
[`0871576`](https://github.com/apache/spark/commit/087157620a85c14534ac76f44ff079df6151ea5b).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4023] [MLlib] [PySpark] convert rdd int...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2870#issuecomment-59882356
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21961/
Test PASSed.



[GitHub] spark pull request: [WIP] [SPARK-4031] Make torrent broadcast read...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2871#issuecomment-59882084
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21970/consoleFull)
 for   PR 2871 at commit 
[`8792ed8`](https://github.com/apache/spark/commit/8792ed8399f9d1501bf4a38694531a8440d65448).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread rxin

Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59882089
  
Actually instead of introducing a new one, what if we introduce a 
compressed bitmap that tracks zero-sized blocks, and then use avg size to track 
only non-zero blocks?
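
A minimal sketch of the idea above, assuming a plain `java.util.BitSet` in place of a real compressed bitmap. The name `AvgSizeWithEmptyBlocks` is hypothetical and is not Spark's `MapStatus` API; it only illustrates tracking empty blocks exactly while reporting an average for the rest.

```scala
import java.util.BitSet

// Hypothetical sketch, not Spark's MapStatus: remembers exactly which blocks
// are empty and reports the average non-empty size for every other block.
final class AvgSizeWithEmptyBlocks(emptyBlocks: BitSet, avgNonEmptySize: Long) {
  def sizeForBlock(reduceId: Int): Long =
    if (emptyBlocks.get(reduceId)) 0L else avgNonEmptySize
}

object AvgSizeWithEmptyBlocks {
  def apply(sizes: Array[Long]): AvgSizeWithEmptyBlocks = {
    // Assumption: an uncompressed BitSet stands in for the compressed bitmap.
    val empty = new BitSet(sizes.length)
    var total = 0L
    var nonEmpty = 0
    var i = 0
    while (i < sizes.length) {
      if (sizes(i) == 0L) {
        empty.set(i)
      } else {
        total += sizes(i)
        nonEmpty += 1
      }
      i += 1
    }
    // Average over non-empty blocks only, so zero-sized blocks do not drag it down.
    val avg = if (nonEmpty > 0) total / nonEmpty else 0L
    new AvgSizeWithEmptyBlocks(empty, avg)
  }
}
```

With this layout an empty block is never reported as non-empty, and the average is not skewed by zero-sized blocks.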



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59882086
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21971/consoleFull)
 for   PR 2087 at commit 
[`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2743#issuecomment-59882030
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21960/consoleFull)
 for   PR 2743 at commit 
[`c10229e`](https://github.com/apache/spark/commit/c10229e8a4eaa6944ea7c432437cdfafdb702ef5).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread rxin

Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59882025
  
Oh wow. Thanks for fixing this.



[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2743#issuecomment-59882034
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21960/
Test PASSed.



[GitHub] spark pull request: [WIP] [SPARK-4031] Make torrent broadcast read...

2014-10-20 Thread shivaram

Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/2871#issuecomment-59881872
  
@JoshRosen -- yes, that should be fine. I will rebase once #2844 is checked in.



[GitHub] spark pull request: [WIP] [SPARK-4031] Make torrent broadcast read...

2014-10-20 Thread JoshRosen

Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2871#issuecomment-59881787
  
This seems likely to merge-conflict with my PR #2844, so I'd like to merge 
that one first.



[GitHub] spark pull request: [WIP] [SPARK-4031] Make torrent broadcast read...

2014-10-20 Thread shivaram

GitHub user shivaram opened a pull request:

https://github.com/apache/spark/pull/2871

[WIP] [SPARK-4031] Make torrent broadcast read blocks on use.

This avoids reading broadcast variables when they are referenced in the 
closure but not used by the code.
Note: This is a WIP and a request for comments. I will update HttpBroadcast 
and add some tests if it sounds good.
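
A minimal sketch of read-on-use, assuming a hypothetical `LazyBroadcast` wrapper whose `fetch` function stands in for reading the broadcast blocks. This is not the TorrentBroadcast code in the patch; it only shows how a Scala `lazy val` defers the read until the value is first touched.

```scala
// Hypothetical sketch, not Spark's TorrentBroadcast: `fetch` is an assumed
// stand-in for whatever actually reads the broadcast blocks.
final class LazyBroadcast[T](val id: Long, fetch: () => T) {
  // Evaluated at most once, on first access, instead of at construction time.
  private lazy val cached: T = fetch()

  def value: T = cached
}
```

A closure that references such a wrapper but never calls `value` never triggers `fetch`.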

cc @rxin @JoshRosen for review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shivaram/spark-1 broadcast-read-value

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2871.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2871


commit 8792ed8399f9d1501bf4a38694531a8440d65448
Author: Shivaram Venkataraman 
Date:   2014-10-21T05:35:03Z

Make torrent broadcast read blocks on use.
This avoids reading broadcast variables when they are referenced
in the closure but not used by the code.





[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59881469
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21962/consoleFull)
 for   PR 2520 at commit 
[`c5b2a33`](https://github.com/apache/spark/commit/c5b2a3399d5c57ea0b5e0d15dabf7ee28d1ffaa5).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59881474
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21962/
Test FAILed.



[GitHub] spark pull request: [SPARK-4003] [SQL] add 3 types for java SQL co...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2850#issuecomment-59881463
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21968/consoleFull)
 for   PR 2850 at commit 
[`bb0508f`](https://github.com/apache/spark/commit/bb0508f1382186c20ddb80b6032f3fce5c6cf6aa).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-20 Thread zhzhan

Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-59881210
  
@scwf Did you also replace the query plan for hive0.13 in your other PR? 
I ask because I also saw some query plan changes in hive0.13.



[GitHub] spark pull request: [SPARK-3569][SQL] Add metadata field to Struct...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2701#issuecomment-59881165
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21967/consoleFull)
 for   PR 2701 at commit 
[`611d3c2`](https://github.com/apache/spark/commit/611d3c20cf4aed9927b596d89b9ac96b2cbbcdec).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-20 Thread zhzhan

Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-59880636
  
@marmbrus I think he refers to https://github.com/apache/spark/pull/2499



[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-20 Thread zhzhan

Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-59880350
  
@scwf The golden answers differ between hive12 and hive13. We need an 
extra shim layer to handle that.



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19129853
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
+  extends Logging with Serializable {
+
+  /**
+   * Compute the average precision of all the queries, truncated at ranking position k.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
+   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
+   * ground truth set is less than k.
+   *
+   * If a query has an empty ground truth set, zero will be returned together with a log warning.
--- End diff --

`returned` -> `used as precision`. We don't `return` zero.



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19129857
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
+  extends Logging with Serializable {
+
+  /**
+   * Compute the average precision of all the queries, truncated at ranking position k.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
+   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
+   * ground truth set is less than k.
+   *
+   * If a query has an empty ground truth set, zero will be returned together with a log warning.
+   *
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *    K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision, must be positive
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = {
+    require (k > 0,"ranking position k should be positive")
+    predictionAndLabels.map { case (pred, lab) =>
+      val labSet = lab.toSet
+      val n = math.min(pred.length, k)
+      var i = 0
+      var cnt = 0
+
+      while (i < n) {
+        if (labSet.contains(pred(i))) {
+          cnt += 1
+        }
+        i += 1
+      }
+      if (labSet.size == 0) {
--- End diff --

If `labSet` is empty, the `while` loop is wasted.

~~~
if (labSet.nonEmpty) {
  val n = math.min(...)
  ...
  cnt.toDouble / k
} else {
  logWarning("...")
  0.0
}
~~~
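
For reference, a self-contained sketch of the restructuring suggested above, written for a single query on plain arrays. The object and method names (`PrecisionSketch`, `precisionAtK`) are hypothetical; the PR itself does this inside an RDD `map` and calls `logWarning` in the empty branch, which is only marked with a comment here.

```scala
// Hypothetical standalone helper, not the code in the PR.
object PrecisionSketch {
  def precisionAtK[T](pred: Array[T], lab: Array[T], k: Int): Double = {
    require(k > 0, "ranking position k should be positive")
    val labSet = lab.toSet
    if (labSet.nonEmpty) {
      // Only scan the predictions when there is something to match against.
      val n = math.min(pred.length, k)
      var i = 0
      var cnt = 0
      while (i < n) {
        if (labSet.contains(pred(i))) {
          cnt += 1
        }
        i += 1
      }
      cnt.toDouble / k
    } else {
      // Empty ground truth set: the PR would log a warning here.
      0.0
    }
  }
}
```

The empty-set check comes first, so the `while` loop is skipped entirely when there is no ground truth.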



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19129865
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
+  extends Logging with Serializable {
+
+  /**
+   * Compute the average precision of all the queries, truncated at ranking position k.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
+   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
+   * ground truth set is less than k.
+   *
+   * If a query has an empty ground truth set, zero will be returned together with a log warning.
+   *
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *    K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision, must be positive
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = {
+    require (k > 0,"ranking position k should be positive")
+    predictionAndLabels.map { case (pred, lab) =>
+      val labSet = lab.toSet
+      val n = math.min(pred.length, k)
+      var i = 0
+      var cnt = 0
+
+      while (i < n) {
+        if (labSet.contains(pred(i))) {
+          cnt += 1
+        }
+        i += 1
+      }
+      if (labSet.size == 0) {
+        logWarning("Empty ground truth set, check input data")
+        0.0
+      } else {
+        cnt.toDouble / k
+      }
+    }.mean
+  }
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries.
+   * If a query has an empty ground truth set, the average precision will be zero and a log
+   * warining is generated.
+   */
+  lazy val meanAveragePrecision: Double = {
+    predictionAndLabels.map { case (pred, lab) =>
+      val labSet = lab.toSet
+      val labSetSize = labSet.size
+      var i = 0
+      var cnt = 0
+      var precSum = 0.0
+      val n = pred.length
+
+      while (i < n) {
+        if (labSet.contains(pred(i))) {
+          cnt += 1
+          precSum += cnt.toDouble / (i + 1)
+        }
+        i += 1
+      }
+      if (labSetSize == 0) {
+        logWarning("Empty ground truth set, check input data")
+        0.0
+      } else {
+        precSum / labSet.size
+      }
+    }.mean
+  }
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at ranking position k.
+   * The discounted cumulative gain at position k is computed as:
+   *    \sum_{i=1}^k (2^{relevance of ith item} - 1) / log(i + 1),
+   * and the NDCG is obtained by dividing the DCG value on the ground truth set. In the current
+   * implementation, the relevance value is binary.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at position n
+   * will be used. If the ground truth set contains n (n < k) results, the first n items will be
+   * used to compute the DCG value on the ground truth set.
+   *
+   * If a query has an empty ground truth set, zero will be returned together with a log warning.
--- End diff --

ditto:

[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19129866
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])
+  extends Logging with Serializable {
+
+  /**
+   * Compute the average precision of all the queries, truncated at ranking position k.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
+   * computed as #(relevant items retrived) / k. This formula also applies when the size of the
+   * ground truth set is less than k.
+   *
+   * If a query has an empty ground truth set, zero will be returned together with a log warning.
+   *
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *    K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision, must be positive
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = {
+    require (k > 0,"ranking position k should be positive")
+    predictionAndLabels.map { case (pred, lab) =>
+      val labSet = lab.toSet
+      val n = math.min(pred.length, k)
+      var i = 0
+      var cnt = 0
+
+      while (i < n) {
+        if (labSet.contains(pred(i))) {
+          cnt += 1
+        }
+        i += 1
+      }
+      if (labSet.size == 0) {
+        logWarning("Empty ground truth set, check input data")
+        0.0
+      } else {
+        cnt.toDouble / k
+      }
+    }.mean
+  }
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries.
+   * If a query has an empty ground truth set, the average precision will be zero and a log
+   * warining is generated.
+   */
+  lazy val meanAveragePrecision: Double = {
+    predictionAndLabels.map { case (pred, lab) =>
+      val labSet = lab.toSet
+      val labSetSize = labSet.size
+      var i = 0
+      var cnt = 0
+      var precSum = 0.0
+      val n = pred.length
+
+      while (i < n) {
+        if (labSet.contains(pred(i))) {
+          cnt += 1
+          precSum += cnt.toDouble / (i + 1)
+        }
+        i += 1
+      }
+      if (labSetSize == 0) {
+        logWarning("Empty ground truth set, check input data")
+        0.0
+      } else {
+        precSum / labSet.size
+      }
+    }.mean
+  }
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at ranking position k.
+   * The discounted cumulative gain at position k is computed as:
+   *    \sum_{i=1}^k (2^{relevance of ith item} - 1) / log(i + 1),
+   * and the NDCG is obtained by dividing the DCG value on the ground truth set. In the current
+   * implementation, the relevance value is binary.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the NDCG value at position n
+   * will be used. If the ground truth set contains n (n < k) results, the first n items will be
+   * used to compute the DCG value on the ground truth set.
+   *
+   * If a query has an empty ground truth set, zero will be returned together with a log warning.
+   *
+   * See the followi

[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19129859
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])])
+  extends Logging with Serializable {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
precision value will be
+   * computed as #(relevant items retrived) / k. This formula also applies 
when the size of the
+   * ground truth set is less than k.
+   *
+   * If a query has an empty ground truth set, zero will be returned 
together with a log warning.
+   *
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision, must be 
positive
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = {
+require (k > 0,"ranking position k should be positive")
+predictionAndLabels.map { case (pred, lab) =>
+  val labSet = lab.toSet
+  val n = math.min(pred.length, k)
+  var i = 0
+  var cnt = 0
+
+  while (i < n) {
+if (labSet.contains(pred(i))) {
+  cnt += 1
+}
+i += 1
+  }
+  if (labSet.size == 0) {
+logWarning("Empty ground truth set, check input data")
+0.0
+  } else {
+cnt.toDouble / k
+  }
+}.mean
+  }
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries.
+   * If a query has an empty ground truth set, the average precision will 
be zero and a log
+   * warining is generated.
+   */
+  lazy val meanAveragePrecision: Double = {
+predictionAndLabels.map { case (pred, lab) =>
+  val labSet = lab.toSet
+  val labSetSize = labSet.size
+  var i = 0
+  var cnt = 0
+  var precSum = 0.0
+  val n = pred.length
+
+  while (i < n) {
+if (labSet.contains(pred(i))) {
+  cnt += 1
+  precSum += cnt.toDouble / (i + 1)
+}
+i += 1
+  }
+  if (labSetSize == 0) {
--- End diff --

ditto (do not go through the while loop if labSet is empty)
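
A minimal sketch of the suggestion above, assuming a standalone per-query helper (the name `averagePrecision` is hypothetical and not part of the PR). Returning early when the ground truth set is empty avoids scanning the predictions at all. The PR's `logWarning` call is only noted in a comment so that the snippet runs without Spark.

```scala
// Hypothetical helper, not the PR's code: average precision for one query.
def averagePrecision[T](pred: Array[T], lab: Array[T]): Double = {
  val labSet = lab.toSet
  if (labSet.isEmpty) {
    // Nothing can be relevant, so skip the loop entirely.
    // (The PR would call logWarning here.)
    0.0
  } else {
    var i = 0
    var cnt = 0
    var precSum = 0.0
    while (i < pred.length) {
      if (labSet.contains(pred(i))) {
        cnt += 1
        // Precision at rank i + 1, accumulated only at relevant positions.
        precSum += cnt.toDouble / (i + 1)
      }
      i += 1
    }
    precSum / labSet.size
  }
}
```

With this shape the empty check happens once, before any iteration, and the non-empty branch is unchanged from the loop quoted in the diff.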



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19129863
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])])
+  extends Logging with Serializable {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
precision value will be
+   * computed as #(relevant items retrived) / k. This formula also applies 
when the size of the
+   * ground truth set is less than k.
+   *
+   * If a query has an empty ground truth set, zero will be returned 
together with a log warning.
+   *
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision, must be 
positive
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = {
+require (k > 0,"ranking position k should be positive")
+predictionAndLabels.map { case (pred, lab) =>
+  val labSet = lab.toSet
+  val n = math.min(pred.length, k)
+  var i = 0
+  var cnt = 0
+
+  while (i < n) {
+if (labSet.contains(pred(i))) {
+  cnt += 1
+}
+i += 1
+  }
+  if (labSet.size == 0) {
+logWarning("Empty ground truth set, check input data")
+0.0
+  } else {
+cnt.toDouble / k
+  }
+}.mean
+  }
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries.
+   * If a query has an empty ground truth set, the average precision will 
be zero and a log
+   * warining is generated.
+   */
+  lazy val meanAveragePrecision: Double = {
+predictionAndLabels.map { case (pred, lab) =>
+  val labSet = lab.toSet
+  val labSetSize = labSet.size
+  var i = 0
+  var cnt = 0
+  var precSum = 0.0
+  val n = pred.length
+
+  while (i < n) {
+if (labSet.contains(pred(i))) {
+  cnt += 1
+  precSum += cnt.toDouble / (i + 1)
+}
+i += 1
+  }
+  if (labSetSize == 0) {
+logWarning("Empty ground truth set, check input data")
+0.0
+  } else {
+precSum / labSet.size
+  }
+}.mean
+  }
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at 
ranking position k.
+   * The discounted cumulative gain at position k is computed as:
+   *\sum_{i=1}^k (2^{relevance of ith item} - 1) / log(i + 1),
+   * and the NDCG is obtained by dividing the DCG value on the ground 
truth set. In the current
+   * implementation, the relevance value is binary.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
NDCG value at position n
+   * will be used. If the ground truth set contains n (n < k) results, the 
first n items will be
+   * used to compute the DCG value on the ground truth set.
--- End diff --

This paragraph is not necessary because those cases are compatible with the 
definition of NDCG.
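
To illustrate why the two special cases follow from the definition, here is a rough per-query sketch with binary relevance. The helper name `ndcgAt` is hypothetical and the code is not taken from the PR. It assumes the natural logarithm for the discount, since the doc comment only says `log`. Truncating both sums at whatever is available (fewer than k predictions, or fewer than k ground truth items) needs no extra branches.

```scala
// Hypothetical helper, not the PR's code: NDCG at k for one query, binary relevance.
def ndcgAt[T](pred: Array[T], lab: Array[T], k: Int): Double = {
  require(k > 0, "ranking position k should be positive")
  val labSet = lab.toSet
  if (labSet.isEmpty) {
    0.0
  } else {
    // Gain is 2^1 - 1 = 1 for a relevant item and 0 otherwise.
    // Zero-based index i corresponds to rank i + 1, so the discount is log(i + 2).
    val dcg = pred.iterator.take(k).zipWithIndex.collect {
      case (item, i) if labSet.contains(item) => 1.0 / math.log(i + 2)
    }.sum
    // Ideal ranking: all relevant items first, truncated at k.
    val idealDcg = (0 until math.min(labSet.size, k)).iterator
      .map(i => 1.0 / math.log(i + 2))
      .sum
    dcg / idealDcg
  }
}
```

When `pred` has n < k entries, `take(k)` simply yields n items, and when the ground truth has n < k entries the ideal sum stops at n, which matches the behavior the removed paragraph described.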



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19129850
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])])
+  extends Logging with Serializable {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
precision value will be
+   * computed as #(relevant items retrived) / k. This formula also applies 
when the size of the
--- End diff --

`retrived` -> `retrieved`



[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19129869
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])])
+  extends Logging with Serializable {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
precision value will be
+   * computed as #(relevant items retrived) / k. This formula also applies 
when the size of the
+   * ground truth set is less than k.
+   *
+   * If a query has an empty ground truth set, zero will be returned 
together with a log warning.
+   *
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision, must be 
positive
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = {
+require (k > 0,"ranking position k should be positive")
+predictionAndLabels.map { case (pred, lab) =>
+  val labSet = lab.toSet
+  val n = math.min(pred.length, k)
+  var i = 0
+  var cnt = 0
+
+  while (i < n) {
+if (labSet.contains(pred(i))) {
+  cnt += 1
+}
+i += 1
+  }
+  if (labSet.size == 0) {
+logWarning("Empty ground truth set, check input data")
+0.0
+  } else {
+cnt.toDouble / k
+  }
+}.mean
+  }
+
+  /**
+   * Returns the mean average precision (MAP) of all the queries.
+   * If a query has an empty ground truth set, the average precision will 
be zero and a log
+   * warining is generated.
+   */
+  lazy val meanAveragePrecision: Double = {
+predictionAndLabels.map { case (pred, lab) =>
+  val labSet = lab.toSet
+  val labSetSize = labSet.size
+  var i = 0
+  var cnt = 0
+  var precSum = 0.0
+  val n = pred.length
+
+  while (i < n) {
+if (labSet.contains(pred(i))) {
+  cnt += 1
+  precSum += cnt.toDouble / (i + 1)
+}
+i += 1
+  }
+  if (labSetSize == 0) {
+logWarning("Empty ground truth set, check input data")
+0.0
+  } else {
+precSum / labSet.size
+  }
+}.mean
+  }
+
+  /**
+   * Compute the average NDCG value of all the queries, truncated at 
ranking position k.
+   * The discounted cumulative gain at position k is computed as:
+   *\sum_{i=1}^k (2^{relevance of ith item} - 1) / log(i + 1),
+   * and the NDCG is obtained by dividing the DCG value on the ground 
truth set. In the current
+   * implementation, the relevance value is binary.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
NDCG value at position n
+   * will be used. If the ground truth set contains n (n < k) results, the 
first n items will be
+   * used to compute the DCG value on the ground truth set.
+   *
+   * If a query has an empty ground truth set, zero will be returned 
together with a log warning.
+   *
+   * See the followi

[GitHub] spark pull request: SPARK-3568 [mllib] add ranking metrics

2014-10-20 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2667#discussion_r19129854
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for ranking algorithms.
+ *
+ * @param predictionAndLabels an RDD of (predicted ranking, ground truth 
set) pairs.
+ */
+@Experimental
+class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], 
Array[T])])
+  extends Logging with Serializable {
+
+  /**
+   * Compute the average precision of all the queries, truncated at 
ranking position k.
+   *
+   * If for a query, the ranking algorithm returns n (n < k) results, the 
precision value will be
+   * computed as #(relevant items retrived) / k. This formula also applies 
when the size of the
+   * ground truth set is less than k.
+   *
+   * If a query has an empty ground truth set, zero will be returned 
together with a log warning.
+   *
+   * See the following paper for detail:
+   *
+   * IR evaluation methods for retrieving highly relevant documents.
+   *K. Jarvelin and J. Kekalainen
+   *
+   * @param k the position to compute the truncated precision, must be 
positive
+   * @return the average precision at the first k ranking positions
+   */
+  def precisionAt(k: Int): Double = {
+require (k > 0,"ranking position k should be positive")
--- End diff --

`require(k > 0, "ranking ...` (remove space before `(` and add space after 
`,`)
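
For context, a small sketch of the per-query computation using the suggested `require(...)` formatting. The helper name `precisionAtK` is hypothetical and this is not the PR's code; it only mirrors the rule in the doc comment that the count of relevant items is divided by k even when fewer than k results are returned.

```scala
// Hypothetical helper, not the PR's code: precision at k for one query.
def precisionAtK[T](pred: Array[T], lab: Array[T], k: Int): Double = {
  require(k > 0, "ranking position k should be positive")
  val labSet = lab.toSet
  if (labSet.isEmpty) {
    // Empty ground truth set: the PR returns zero and logs a warning.
    0.0
  } else {
    // Count relevant items among the first k predictions, then divide by k.
    pred.iterator.take(k).count(labSet.contains).toDouble / k
  }
}
```

Calling it with a non-positive `k` fails fast with an `IllegalArgumentException` raised by `require`.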



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59880277
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21966/consoleFull) for PR 2868 at commit [`13585e8`](https://github.com/apache/spark/commit/13585e8738e35743c6c0ab482d34552f01939bd4).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-20 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-59880136
  
@scwf, which PR?



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59879975
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21965/consoleFull) for PR 2868 at commit [`6b05af0`](https://github.com/apache/spark/commit/6b05af042656b192e7b14954a433a75468df1d1c).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread codedeft

Github user codedeft commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59879666
  
test this please



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59879704
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21964/consoleFull) for PR 2866 at commit [`c23897a`](https://github.com/apache/spark/commit/c23897aea7881eb819ec074073a4431ec8ba7eb5).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-20 Thread scwf

Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-59879567
  
We can reproduce the golden answers for Hive 0.13, as I did in my closed PR. How about that?



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread JoshRosen

Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59879396
  
;retest



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread JoshRosen

Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59879420
  
Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59879138
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21963/consoleFull) for PR 2520 at commit [`c5b2a33`](https://github.com/apache/spark/commit/c5b2a3399d5c57ea0b5e0d15dabf7ee28d1ffaa5).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread codedeft

Github user codedeft commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59878898
  
Seems like there are lots of "line too long" messages. Will address this.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread sarutak

Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59878761
  
retest this please.



[GitHub] spark pull request: [SPARK-4023] [MLlib] [PySpark] convert rdd int...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2870#issuecomment-59878295
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21961/consoleFull) for PR 2870 at commit [`0871576`](https://github.com/apache/spark/commit/087157620a85c14534ac76f44ff079df6151ea5b).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59878293
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21962/consoleFull) for PR 2520 at commit [`c5b2a33`](https://github.com/apache/spark/commit/c5b2a3399d5c57ea0b5e0d15dabf7ee28d1ffaa5).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread pwendell

Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2743#issuecomment-59878151
  
@davies ah I see, thanks. This should have triggered the old one.



[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread davies

Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2743#issuecomment-59878121
  
@pwendell There are two PullRequestBuilder plugins: one is working, and the other (called NewSparkPullRequestBuilder) is still failing.



[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2743#issuecomment-59878023
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21960/consoleFull) for PR 2743 at commit [`c10229e`](https://github.com/apache/spark/commit/c10229e8a4eaa6944ea7c432437cdfafdb702ef5).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread sarutak

Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59877987
  
retest this please.



[GitHub] spark pull request: [SPARK-4023] [MLlib] [PySpark] convert rdd int...

2014-10-20 Thread davies

GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/2870

[SPARK-4023] [MLlib] [PySpark] convert rdd into RDD of Vector

Convert the input rdd to RDD of Vector.

cc @mengxr

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark fix4023

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2870.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2870


commit 087157620a85c14534ac76f44ff079df6151ea5b
Author: Davies Liu 
Date:   2014-10-21T04:35:15Z

convert rdd into RDD of Vector





[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59877810
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21959/consoleFull) for PR 2868 at commit [`9ea76df`](https://github.com/apache/spark/commit/9ea76df661a93b1ebdf5ce5a764c7549b2fcbfd0).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59877812
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21959/
Test FAILed.



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59877748
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21959/consoleFull) for PR 2868 at commit [`9ea76df`](https://github.com/apache/spark/commit/9ea76df661a93b1ebdf5ce5a764c7549b2fcbfd0).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread pwendell

Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2743#issuecomment-59877740
  
@davies This should have been fixed; I'm not sure what is going on.



[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread pwendell

Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2743#issuecomment-59877728
  
jenkins, test this please.



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59877519
  
Jenkins, add to whitelist.



[GitHub] spark pull request: [SPARK-3161][MLLIB] Adding a node Id caching m...

2014-10-20 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2868#issuecomment-59877524
  
test this please



[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-20 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-59877085
  
I think this is looking pretty good, but I'm not okay with merging it 
before the tests are passing for Hive 13.  Let me take a look and see how hard 
that will be.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59876973
  
**[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21955/consoleFull)** for PR 2520 at commit [`c5b2a33`](https://github.com/apache/spark/commit/c5b2a3399d5c57ea0b5e0d15dabf7ee28d1ffaa5) after a configured wait of `120m`.



[GitHub] spark pull request: [SPARK-3677] [BUILD] [YARN] pom.xml and SparkB...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2520#issuecomment-59876974
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21955/
Test FAILed.



[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2743#issuecomment-59876465
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/409/consoleFull) for PR 2743 at commit [`c10229e`](https://github.com/apache/spark/commit/c10229e8a4eaa6944ea7c432437cdfafdb702ef5).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class JavaFutureActionWrapper[S, T](futureAction: FutureAction[S], converter: S => T)`
  * `case class ReconnectWorker(masterUrl: String) extends DeployMessage`




[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2743#issuecomment-59876269
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/409/consoleFull) for PR 2743 at commit [`c10229e`](https://github.com/apache/spark/commit/c10229e8a4eaa6944ea7c432437cdfafdb702ef5).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59876264
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21957/consoleFull) for PR 2866 at commit [`c23897a`](https://github.com/apache/spark/commit/c23897aea7881eb819ec074073a4431ec8ba7eb5).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4019] Fix MapStatus compression bug tha...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2866#issuecomment-59876271
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21957/
Test FAILed.



[GitHub] spark pull request: [SPARK-2321] [WIP] Stable pull-based progress ...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2696#issuecomment-59875888
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21956/
Test PASSed.



[GitHub] spark pull request: [SPARK-2321] [WIP] Stable pull-based progress ...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2696#issuecomment-59875884
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21956/consoleFull) for PR 2696 at commit [`787444c`](https://github.com/apache/spark/commit/787444c4ee20693a8f8c4fb5320ee4c4133a0d91).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging`
  * `class JobUIData(`
  * `public final class JavaStatusAPITest`
  * `public static final class IdentityWithDelay implements Function`




[GitHub] spark pull request: [SPARK-4012] call tryOrExit instead of logUnca...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2864#issuecomment-59875800
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21958/
Test FAILed.



[GitHub] spark pull request: [SPARK-4012] call tryOrExit instead of logUnca...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2864#issuecomment-59875799
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21958/consoleFull) for PR 2864 at commit [`3893a7e`](https://github.com/apache/spark/commit/3893a7e051674df70124b09c386c13afdc5ab3d8).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class JavaFutureActionWrapper[S, T](futureAction: FutureAction[S], converter: S => T)`
  * `case class ReconnectWorker(masterUrl: String) extends DeployMessage`




[GitHub] spark pull request: [SPARK-4016] Allow user to show/hide UI metric...

2014-10-20 Thread JoshRosen

Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2867#issuecomment-59872654
  
@kayousterhout This might integrate nicely with my #2852, which introduces 
some new abstractions to simplify the web UI's table rendering code.  With my 
framework, I think you might be able to automatically generate the ids used to 
show / hide columns rather than having to have a class that holds a bunch of 
strings.



[GitHub] spark pull request: [SPARK-4016] Allow user to show/hide UI metric...

2014-10-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2867#issuecomment-59872492
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21952/
Test PASSed.



[GitHub] spark pull request: [SPARK-4016] Allow user to show/hide UI metric...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2867#issuecomment-59872487
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21952/consoleFull) for PR 2867 at commit [`e989560`](https://github.com/apache/spark/commit/e989560562b473624159e4e3554ec9898884a247).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4012] call tryOrExit instead of logUnca...

2014-10-20 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2864#issuecomment-59871965
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21958/consoleFull) for PR 2864 at commit [`3893a7e`](https://github.com/apache/spark/commit/3893a7e051674df70124b09c386c13afdc5ab3d8).
 * This patch merges cleanly.


