[GitHub] spark pull request: Added Companion Object for LogisticRegressionW...

2015-03-10 Thread kazk1018
Github user kazk1018 closed the pull request at:

https://github.com/apache/spark/pull/4915


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-6025] [MLlib] Add helper method evaluat...

2015-03-10 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/4906#discussion_r26191896
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/AbsoluteError.scala ---
@@ -61,4 +61,18 @@ object AbsoluteError extends Loss {
   math.abs(err)
 }.mean()
   }
+
+  /**
--- End diff --

But the return argument is different, no?





[GitHub] spark pull request: [SPARK-6275][Documentation]Miss toDF() functio...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4977#issuecomment-78211915
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-6275][Documentation]Miss toDF() functio...

2015-03-10 Thread zzcclp
GitHub user zzcclp opened a pull request:

https://github.com/apache/spark/pull/4977

[SPARK-6275][Documentation] Miss toDF() function in docs/sql-programming-guide.md

The `toDF()` function is missing from the examples in docs/sql-programming-guide.md.
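For context, the fix being described: as of Spark 1.3, converting an RDD of case classes to a DataFrame requires an explicit `toDF()` call after importing the SQLContext implicits, which the guide's snippets were missing. A minimal sketch of the corrected pattern (the data file path and names here are illustrative, not quoted from the patch):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object ToDFExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "toDF-example")
    val sqlContext = new SQLContext(sc)
    // Brings the RDD-to-DataFrameHolder implicit conversion into scope.
    import sqlContext.implicits._

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()  // without this call, the snippet does not compile in 1.3
    people.registerTempTable("people")
    sc.stop()
  }
}
```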

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zzcclp/spark SPARK-6275

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4977.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4977


commit 9a96c7bcb5c79d4befd3f76daadb89e3f96074ec
Author: zzcclp 
Date:   2015-03-11T05:41:30Z

Miss toDF()

Miss toDF()







[GitHub] spark pull request: [SPARK-5205][Streaming]:Inconsistent behaviour...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4135#issuecomment-78209621
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28459/
Test FAILed.





[GitHub] spark pull request: [SPARK-6274][Streaming][Examples] Added exampl...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4975#issuecomment-78209601
  
  [Test build #28463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28463/consoleFull) for PR 4975 at commit [`705cba1`](https://github.com/apache/spark/commit/705cba1512415b39076d5275df2cc288386feb8e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5205][Streaming]:Inconsistent behaviour...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4135#issuecomment-78209607
  
**[Test build #28459 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28459/consoleFull)** for PR 4135 at commit [`7051184`](https://github.com/apache/spark/commit/705118453ab6f9869ac090ca2bac009f167869cd) after a configured wait of `120m`.





[GitHub] spark pull request: [SPARK-6268][MLlib] KMeans parameter getter me...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4974#issuecomment-78209124
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28460/
Test PASSed.





[GitHub] spark pull request: [SPARK-6268][MLlib] KMeans parameter getter me...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4974#issuecomment-78209117
  
  [Test build #28460 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28460/consoleFull) for PR 4974 at commit [`f94a3d7`](https://github.com/apache/spark/commit/f94a3d7190911804309a7941ade5d5e96a3c2028).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6274][Streaming][Examples] Added exampl...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4975#issuecomment-78208901
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28462/
Test FAILed.





[GitHub] spark pull request: [SPARK-6274][Streaming][Examples] Added exampl...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4975#issuecomment-78208900
  
  [Test build #28462 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28462/consoleFull) for PR 4975 at commit [`75a3fad`](https://github.com/apache/spark/commit/75a3fad74ab6118ce42b573a5ac9158319c54dad).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaRecord implements java.io.Serializable `
  * `public final class JavaSqlNetworkWordCount `
  * `class JavaSQLContextSingleton `
  * `case class Record(word: String)`






[GitHub] spark pull request: [SPARK-6274][Streaming][Examples] Added exampl...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4975#issuecomment-78208822
  
  [Test build #28462 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28462/consoleFull) for PR 4975 at commit [`75a3fad`](https://github.com/apache/spark/commit/75a3fad74ab6118ce42b573a5ac9158319c54dad).
 * This patch merges cleanly.





[GitHub] spark pull request: [SQL][Minor] fix typo in comments

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4976#issuecomment-78208774
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SQL][Minor] fix typo in comments

2015-03-10 Thread liuhb86
GitHub user liuhb86 opened a pull request:

https://github.com/apache/spark/pull/4976

[SQL][Minor] fix typo in comments

Removed a repeated "from" in the comments.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liuhb86/spark mine

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4976.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4976


commit e280e7c1b93a4a7fa8f1ac12348f34a113f36382
Author: Hongbo Liu 
Date:   2015-03-11T05:34:09Z

[SQL][Minor] fix typo in comments







[GitHub] spark pull request: [SPARK-6274][Streaming][Examples] Added exampl...

2015-03-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4975#issuecomment-78208286
  
lgtm -- assuming it compiles





[GitHub] spark pull request: [SPARK-6274][Streaming][Examples] Added exampl...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4975#issuecomment-78207331
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28461/
Test FAILed.





[GitHub] spark pull request: [SPARK-6274][Streaming][Examples] Added exampl...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4975#issuecomment-78207330
  
  [Test build #28461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28461/consoleFull) for PR 4975 at commit [`5fbf789`](https://github.com/apache/spark/commit/5fbf7891ae3518f6552b8ac387ed425c2710185b).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaRecord implements java.io.Serializable `
  * `public final class JavaSqlNetworkWordCount `
  * `class JavaSQLContextSingleton `
  * `case class Record(word: String)`






[GitHub] spark pull request: [SPARK-6274][Streaming][Examples] Added exampl...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4975#issuecomment-78207239
  
  [Test build #28461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28461/consoleFull) for PR 4975 at commit [`5fbf789`](https://github.com/apache/spark/commit/5fbf7891ae3518f6552b8ac387ed425c2710185b).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26189764
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+    type Algorithm = Value
+    val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+    def next(): LearningState
+    def topicsMatrix: Matrix
+    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
--- End diff --

Probably @jkbradley can weigh in here. I think both changes seem reasonable; we would then have the Matrix computed from the RDD. If there is agreement, I can make the change on the PR.





[GitHub] spark pull request: [SPARK-6274][Streaming][Examples] Added exampl...

2015-03-10 Thread tdas
GitHub user tdas opened a pull request:

https://github.com/apache/spark/pull/4975

[SPARK-6274][Streaming][Examples] Added examples streaming + sql examples.

Added Scala, Java and Python streaming examples showing DataFrame and SQL 
operations within streaming.
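The pattern these examples demonstrate can be sketched as follows. This is a hedged reconstruction based only on the class names the test build reports (`Record`, `JavaSQLContextSingleton`, `JavaSqlNetworkWordCount`), not the PR's actual code: each streaming micro-batch RDD is converted to a DataFrame through a lazily instantiated singleton SQLContext, then queried with SQL.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Record(word: String)

// Lazily instantiated singleton, so the SQLContext survives across batches
// (and across driver restarts when recovering from a checkpoint).
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sc: SparkContext): SQLContext = synchronized {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

object SqlNetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "SqlNetworkWordCount", Seconds(2))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    words.foreachRDD { rdd: RDD[String] =>
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._
      // Convert the micro-batch RDD[String] to a DataFrame and run SQL on it.
      rdd.map(w => Record(w)).toDF().registerTempTable("words")
      sqlContext.sql("select word, count(*) as total from words group by word").show()
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```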


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tdas/spark streaming-sql-examples

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4975.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4975


commit 874b943731b9bd85cab3b1a2a28e30d66d82ddbf
Author: Tathagata Das 
Date:   2015-03-11T05:14:20Z

Added examples streaming + sql examples.







[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26189615
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+    type Algorithm = Value
+    val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+    def next(): LearningState
+    def topicsMatrix: Matrix
+    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
--- End diff --

How about `def topicsMatrix: Matrix` => `def termTopicDistributions: RDD[(Long, Vector)]`?





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26189097
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+    type Algorithm = Value
+    val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+    def next(): LearningState
+    def topicsMatrix: Matrix
+    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
+    def logLikelihood: Double
+    def logPrior: Double
+    def topicDistributions: RDD[(Long, Vector)]
+    def globalTopicTotals: LDA.TopicCounts
--- End diff --

I think I agree here. It would require some code/helper method on 
DistributedLDAModel since that is the only place that calls 
`globalTopicTotals`. I can make the change tonight/tomorrow.





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26188892
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+    type Algorithm = Value
+    val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+    def next(): LearningState
+    def topicsMatrix: Matrix
+    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
--- End diff --

The reason for a separate method is twofold. First, although you could calculate it from `topicsMatrix` in theory, the size of `topicsMatrix` could be very large (too large to fit in driver memory, as the docs warn). Second, `describeTopics` is intended to provide an interface for the implementation to extract a topics matrix bounded to only the top `maxTermsPerTopic` terms per topic; this is less likely to run the driver out of memory, and it keeps the computation of the top `n` terms distributed.
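The argument above can be illustrated with a sketch. Assuming a hypothetical `termTopicWeights` RDD with one row per vocabulary term (this is an illustration of the design point, not the PR's implementation), the top `maxTermsPerTopic` terms can be reduced per topic on the executors, so only `k * maxTermsPerTopic` entries ever reach the driver instead of a full `vocabSize * k` matrix:

```scala
import org.apache.spark.rdd.RDD

object DescribeTopicsSketch {
  // termTopicWeights: (termIndex, weight per topic), with k topics total.
  def describeTopics(
      termTopicWeights: RDD[(Int, Array[Double])],
      k: Int,
      maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])] = {
    // Keep only the heaviest maxTermsPerTopic (term, weight) pairs.
    def top(acc: List[(Int, Double)]): List[(Int, Double)] =
      acc.sortBy(-_._2).take(maxTermsPerTopic)

    // Re-key by topic; the top-n filter runs inside the aggregation on the
    // executors, bounding what collectAsMap() ships to the driver.
    val topPerTopic = termTopicWeights
      .flatMap { case (term, weights) =>
        weights.zipWithIndex.map { case (w, topic) => (topic, (term, w)) }
      }
      .aggregateByKey(List.empty[(Int, Double)])(
        (acc, tw) => top(tw :: acc),
        (a, b) => top(a ++ b))
      .collectAsMap()

    Array.tabulate(k) { topic =>
      val terms = topPerTopic.getOrElse(topic, Nil)
      (terms.map(_._1).toArray, terms.map(_._2).toArray)
    }
  }
}
```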





[GitHub] spark pull request: [SPARK-6268][MLlib] KMeans parameter getter me...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4974#issuecomment-78203721
  
  [Test build #28460 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28460/consoleFull) for PR 4974 at commit [`f94a3d7`](https://github.com/apache/spark/commit/f94a3d7190911804309a7941ade5d5e96a3c2028).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26188380
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+    type Algorithm = Value
+    val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+    def next(): LearningState
+    def topicsMatrix: Matrix
+    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
--- End diff --

Different implementations can be hidden by `topicsMatrix`, right?





[GitHub] spark pull request: [SPARK-6268][MLlib] KMeans parameter getter me...

2015-03-10 Thread hhbyyh
GitHub user hhbyyh opened a pull request:

https://github.com/apache/spark/pull/4974

[SPARK-6268][MLlib] KMeans parameter getter methods

jira: https://issues.apache.org/jira/browse/SPARK-6268

KMeans has many setters for parameters. It should have matching getters.
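To make the proposal concrete, here is a minimal sketch of the setter/getter pairing being proposed. The parameter names mirror KMeans' public setters; the getter names are assumed for illustration, not quoted from the patch:

```scala
// Sketch of a builder-style estimator with matching getters for each setter.
class KMeansSketch {
  private var k: Int = 2
  private var maxIterations: Int = 20
  private var epsilon: Double = 1e-4

  // Setters return this.type to keep the familiar chained-call style.
  def setK(k: Int): this.type = { this.k = k; this }
  def getK: Int = k

  def setMaxIterations(maxIterations: Int): this.type = {
    this.maxIterations = maxIterations; this
  }
  def getMaxIterations: Int = maxIterations

  def setEpsilon(epsilon: Double): this.type = { this.epsilon = epsilon; this }
  def getEpsilon: Double = epsilon
}

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val km = new KMeansSketch().setK(5).setMaxIterations(50)
    println(km.getK)             // getters expose what was configured
    println(km.getMaxIterations)
  }
}
```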

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hhbyyh/spark get4Kmeans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4974.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4974


commit f94a3d7190911804309a7941ade5d5e96a3c2028
Author: Yuhao Yang 
Date:   2015-03-11T04:26:19Z

add get for KMeans







[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26188379
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+    type Algorithm = Value
+    val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+    def next(): LearningState
+    def topicsMatrix: Matrix
+    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
+    def logLikelihood: Double
+    def logPrior: Double
+    def topicDistributions: RDD[(Long, Vector)]
+    def globalTopicTotals: LDA.TopicCounts
--- End diff --

Different implementations can be hidden by `topicDistributions`.





[GitHub] spark pull request: [SPARK-6211][Streaming] Add Python Kafka API u...

2015-03-10 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/4961#issuecomment-78202972
  
Hi @davies @tdas, I added the Python Kafka API unit test. It works well in my local test, but it always seems to fail in the Jenkins test with an error like:

```
Traceback (most recent call last):
  File "pyspark/streaming/tests.py", line 568, in setUp
    .loadClass("org.apache.spark.streaming.kafka.KafkaTestUtils")
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
Py4JJavaError: An error occurred while calling o4409.loadClass.
: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaTestUtils
```

I'm sure I added the Kafka assembly jar with `--jars`; as you can see, I 
dumped the argument to the console:

```
Additional argument: --jars 
/home/jenkins/workspace/SparkPullRequestBuilder/external/kafka-assembly/target/scala-2.10/spark-streaming-kafka-assembly_2.10-1.3.0-SNAPSHOT.jar
```

I'm not sure whether there is some environment difference or something that 
should be taken care of when testing in Jenkins. Could you please give me some hints? 
Thanks a lot.
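
One quick way to narrow down a `ClassNotFoundException` like this is to verify that the expected class file is actually packaged in the assembly jar (a jar is just a zip archive); if it is, the problem is more likely the driver classpath (e.g. `--jars` not applying to the driver in that environment) than the jar itself. A small sketch — the demo jar below is synthetic, and real assembly-jar paths are environment-specific:

```python
import os
import tempfile
import zipfile

def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the fully qualified class has an entry in the jar."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# Demo with a synthetic jar standing in for the Kafka assembly jar:
demo_jar = os.path.join(tempfile.mkdtemp(), "demo-assembly.jar")
with zipfile.ZipFile(demo_jar, "w") as z:
    z.writestr("org/apache/spark/streaming/kafka/KafkaTestUtils.class", b"")

found = jar_contains_class(
    demo_jar, "org.apache.spark.streaming.kafka.KafkaTestUtils")
missing = jar_contains_class(demo_jar, "com.example.NotThere")
```

If the class is present but loading still fails from the driver, passing the jar via the driver classpath as well (in addition to `--jars`) may be worth trying.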





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26187620
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + 
termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = 
v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+type Algorithm = Value
+val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+def next(): LearningState
+def topicsMatrix: Matrix
+def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], 
Array[Double])]
--- End diff --

Why is it not necessary? The LDASuite, which contains the 
Distributed/LocalModels, calls it. How they are created is up to the specific 
implementation of LDA. Could you be more specific about why it's not necessary?





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26187624
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + 
termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = 
v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+type Algorithm = Value
+val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+def next(): LearningState
+def topicsMatrix: Matrix
+def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], 
Array[Double])]
+def logLikelihood: Double
+def logPrior: Double
+def topicDistributions: RDD[(Long, Vector)]
+def globalTopicTotals: LDA.TopicCounts
--- End diff --

Ditto comment from above.





[GitHub] spark pull request: [SPARK-5205][Streaming]:Inconsistent behaviour...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4135#issuecomment-78201355
  
  [Test build #28459 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28459/consoleFull)
 for   PR 4135 at commit 
[`7051184`](https://github.com/apache/spark/commit/705118453ab6f9869ac090ca2bac009f167869cd).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5205][Streaming]:Inconsistent behaviour...

2015-03-10 Thread uncleGen
Github user uncleGen commented on the pull request:

https://github.com/apache/spark/pull/4135#issuecomment-78201163
  
retest this please





[GitHub] spark pull request: Added Companion Object for LogisticRegressionW...

2015-03-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4915#issuecomment-78201008
  
@kazk1018  You'll need to close this PR yourself.  Thanks!





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26187041
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + 
termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = 
v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+type Algorithm = Value
+val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+def next(): LearningState
+def topicsMatrix: Matrix
+def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], 
Array[Double])]
+def logLikelihood: Double
+def logPrior: Double
+def topicDistributions: RDD[(Long, Vector)]
+def globalTopicTotals: LDA.TopicCounts
--- End diff --

This is not necessary here.





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/4807#discussion_r26186996
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
 
   private[clustering] type TokenCount = Double
 
-  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
-  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
 
-  private[clustering] def index2term(termIndex: Long): Int = -(1 + 
termIndex).toInt
 
-  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = 
v._1 >= 0
+  object LearningAlgorithms extends Enumeration {
+type Algorithm = Value
+val Gibbs, EM = Value
+  }
 
-  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
+  private[clustering] trait LearningState {
+def next(): LearningState
+def topicsMatrix: Matrix
+def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], 
Array[Double])]
--- End diff --

This is not necessary, right? It should be removed from the `LearningState` 
trait.






[GitHub] spark pull request: [SPARK-6222][STREAMING] Make sure batches are ...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4964#issuecomment-78193252
  
  [Test build #28457 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28457/consoleFull)
 for   PR 4964 at commit 
[`fa93b87`](https://github.com/apache/spark/commit/fa93b871ba0fe22924ff0273e975e492a6a7043c).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6222][STREAMING] Make sure batches are ...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4964#issuecomment-78193259
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28457/
Test PASSed.





[GitHub] spark pull request: [SPARK-6271][SQL] Sort these tokens in alphabe...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4973#issuecomment-78193081
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-6185][SQL] Deltele repeated TOKEN. "TOK...

2015-03-10 Thread DoingDone9
Github user DoingDone9 commented on the pull request:

https://github.com/apache/spark/pull/4907#issuecomment-78192927
  
I have opened a new PR for sorting these tokens in alphabetical order: 
https://github.com/apache/spark/pull/4973 @yhuai @marmbrus





[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...

2015-03-10 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/4934#discussion_r26186113
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/AcceleratedGradientDescent.scala
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.optimization
+
+import scala.collection.mutable.ArrayBuffer
+
+import breeze.linalg.{DenseVector => BDV, norm}
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: DeveloperApi ::
+ * This class optimizes a vector of weights via accelerated (proximal) 
gradient descent.
+ * The implementation is based on TFOCS [[http://cvxr.com/tfocs]], 
described in Becker, Candes, and
+ * Grant 2010.
+ * @param gradient Delegate that computes the loss function value and 
gradient for a vector of
+ * weights.
+ * @param updater Delegate that updates weights in the direction of a 
gradient.
+ */
+@DeveloperApi
+class AcceleratedGradientDescent (private var gradient: Gradient, private 
var updater: Updater)
+  extends Optimizer {
+
+  private var stepSize: Double = 1.0
+  private var convergenceTol: Double = 1e-4
+  private var numIterations: Int = 100
+  private var regParam: Double = 0.0
+
+  /**
+   * Set the initial step size, used for the first step. Default 1.0.
+   * On subsequent steps, the step size will be adjusted by the 
acceleration algorithm.
+   */
+  def setStepSize(step: Double): this.type = {
+this.stepSize = step
+this
+  }
+
+  /**
+   * Set the optimization convergence tolerance. Default 1e-4.
+   * Smaller values will increase accuracy but require additional 
iterations.
+   */
+  def setConvergenceTol(tol: Double): this.type = {
+this.convergenceTol = tol
+this
+  }
+
+  /**
+   * Set the maximum number of iterations. Default 100.
+   */
+  def setNumIterations(iters: Int): this.type = {
+this.numIterations = iters
+this
+  }
+
+  /**
+   * Set the regularization parameter. Default 0.0.
+   */
+  def setRegParam(regParam: Double): this.type = {
+this.regParam = regParam
+this
+  }
+
+  /**
+   * Set a Gradient delegate for computing the loss function value and 
gradient.
+   */
+  def setGradient(gradient: Gradient): this.type = {
+this.gradient = gradient
+this
+  }
+
+  /**
+   * Set an Updater delegate for updating weights in the direction of a 
gradient.
+   * If regularization is used, the Updater will implement the 
regularization term's proximity
+   * operator. Thus the type of regularization penalty is configured by 
providing a corresponding
+   * Updater implementation.
+   */
+  def setUpdater(updater: Updater): this.type = {
+this.updater = updater
+this
+  }
+
+  /**
+   * Run accelerated gradient descent on the provided training data.
+   * @param data training data
+   * @param initialWeights initial weights
+   * @return solution vector
+   */
+  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): 
Vector = {
+val (weights, _) = AcceleratedGradientDescent.run(
+  data,
+  gradient,
+  updater,
+  stepSize,
+  convergenceTol,
+  numIterations,
+  regParam,
+  initialWeights)
+weights
+  }
+}
+
+/**
+ * :: DeveloperApi ::
+ * Top-level method to run accelerated (proximal) gradient descent.
+ */
+@DeveloperApi
+object AcceleratedGradientDescent extends Logging {
+  /**
+   * Run accelerated proximal gradient descent.
+   * The implementation is based on TFOCS [[http://cvxr.com/tfocs]], 
described in Becker, Candes,
+   * and Grant 2010. A li

[GitHub] spark pull request: [SPARK-1503][MLLIB] Initial AcceleratedGradien...

2015-03-10 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/4934#issuecomment-78192837
  
Hi, replying to some of the statements above:

> It seems @staple has already implemented backtracking (because he has 
results in the JIRA), but kept them out of this PR to keep it simple, so we can 
tackle that afterwards.

I wrote a backtracking implementation (and checked that it performs the 
same as the TFOCS implementation). Currently it is just a port of the TFOCS 
version. I’d need a little time to make it Scala/Spark idiomatic, but the 
turnaround would be pretty fast.

> For example, if we add line search option, what is the semantic of 
agd.setStepSize(1.0).useLineSearch()

TFOCS supports a suggested initial Lipschitz value (variable named 
‘L’), which is just a starting point for line search, so a corresponding 
behavior would be to use the step size as an initial suggestion only when line 
search is enabled. It may be desirable to use a parameter name like ‘L’ 
instead of ‘stepSize’ to make the meaning clearer.

In TFOCS you can disable backtracking line search by setting several 
parameters (L, Lexact, alpha, and beta) which individually control different 
aspects of the backtracking implementation. 
For Spark it may make sense to provide backtracking modes that are 
configured explicitly, for example a fixed Lipschitz bound (no backtracking), or 
backtracking line search based on the TFOCS implementation, or possibly an 
alternative line search implementation that is more conservative about 
performing round trip aggregations. Then there could be a setBacktrackingMode() 
setter to configure which mode is used.
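
For readers following along, the backtracking idea under discussion can be sketched on a toy one-dimensional problem. This is a generic Armijo-style backtracking line search, not the TFOCS or proposed Spark implementation, and all names are illustrative:

```python
def backtracking_gradient_descent(f, grad, x0, t0=1.0, alpha=0.5, beta=0.5,
                                  tol=1e-8, max_iter=1000):
    """Minimize a smooth 1-D function with Armijo backtracking line search.

    t0 is only an initial step-size suggestion; each iteration shrinks the
    trial step by `beta` until the sufficient-decrease condition holds.
    """
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            break
        t = t0
        # Armijo condition: f(x - t*g) <= f(x) - alpha * t * g**2
        while f(x - t * g) > f(x) - alpha * t * g * g:
            t *= beta
        x = x - t * g
    return x

# Minimize f(x) = (x - 3)^2; the minimizer is x = 3.
x_min = backtracking_gradient_descent(lambda x: (x - 3.0) ** 2,
                                      lambda x: 2.0 * (x - 3.0),
                                      x0=0.0)
```

In a distributed setting each trial evaluation of `f` costs a round-trip aggregation, which is why a mode that is conservative about extra evaluations is attractive.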

Moving forward there may be a need to support additional acceleration 
algorithms in addition to Auslender and Teboulle. These might be configurable 
via a setAlgorithm() function.

> Btw, I don't think we need to stick to the current GradientDescent API. 
The accelerated gradient takes a smooth convex function which provides gradient 
and optionally the Lipschitz constant. The implementation of Nesterov's method 
doesn't need to know RDDs.

This is good to know. I had been assuming we would stick with the existing 
GradientDescent API, including Gradient and Updater delegates. Currently the 
applySmooth and applyProjector functions (named the same as corresponding TFOCS 
functions) serve as a bridge between the acceleration implementation 
(relatively unaware of RDDs) and Spark-specific RDD aggregations.

This seems like a good time to mention that the backtracking implementation 
in TFOCS uses a system of caching the (expensive to compute) linear operator 
component of the objective function, which significantly reduces the cost of 
backtracking. A similar implementation is possible in Spark, though the 
performance benefit may not be as significant because two round trips would 
still be required per iteration. (See p. 3 of my design doc linked in the jira 
for some more detail.) One reason I suggested not implementing linear operator 
caching in the design doc is because it’s incompatible with the existing 
Gradient interface. If we are considering an alternative interface it may be 
worth revisiting this issue.

The objective function “interface” used by TFOCS involves the functions 
applyLinear (linear operator component of objective), applySmooth (smooth 
portion of objective), applyProjector (nonsmooth portion of objective). In 
addition there are a number of numeric and categorical parameters. 
Theoretically we could adopt a similar interface (with or without applyLinear, 
depending) where RDD-specific operations are encapsulated within the various 
apply* functions.
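
The smooth/nonsmooth split behind `applySmooth`/`applyProjector` is the standard proximal (forward-backward) gradient step; a scalar sketch may make the roles concrete. This is a generic illustration under my own naming, not the TFOCS code:

```python
def soft_threshold(z: float, lam: float) -> float:
    """Proximity operator of lam * |x| (the nonsmooth, 'applyProjector' role)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def proximal_gradient(grad_smooth, prox, x0: float, step: float,
                      iters: int) -> float:
    """Forward-backward splitting: a gradient step on the smooth part
    (the 'applySmooth' role) followed by the proximity operator of the
    nonsmooth part (the 'applyProjector' role)."""
    x = x0
    for _ in range(iters):
        x = prox(x - step * grad_smooth(x))
    return x

# Minimize 0.5*(x - 2)^2 + 0.5*|x|; the closed-form minimizer is x = 1.5.
step, lam = 0.5, 0.5
x_reg = proximal_gradient(
    grad_smooth=lambda x: x - 2.0,            # gradient of the smooth part
    prox=lambda z: soft_threshold(z, step * lam),  # prox of step * lam * |x|
    x0=0.0, step=step, iters=200)
```

In the RDD version the expensive part is `grad_smooth` (a cluster aggregation), while the prox step is cheap and local, which is what makes caching the linear-operator evaluations attractive.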

Finally, I wanted to mention that I’m in the bay area and am happy to 
meet in person to discuss this project if that would be helpful.





[GitHub] spark pull request: [SPARK-6271][SQL] Sort these tokens in alphabe...

2015-03-10 Thread DoingDone9
GitHub user DoingDone9 opened a pull request:

https://github.com/apache/spark/pull/4973

[SPARK-6271][SQL] Sort these tokens in alphabetic order to avoid further 
duplicate in HiveQl



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DoingDone9/spark sort_token

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4973.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4973


commit c3f046f8de7c418d4aa7e74afea9968a8baf9231
Author: DoingDone9 <799203...@qq.com>
Date:   2015-03-02T02:11:18Z

Merge pull request #1 from apache/master

merge lastest spark

commit cb1852d14f62adbd194b1edda4ec639ba942a8ba
Author: DoingDone9 <799203...@qq.com>
Date:   2015-03-05T07:05:10Z

Merge pull request #2 from apache/master

merge lastest spark

commit c87e8b6d8cb433376a7d14778915006c31f6c01c
Author: DoingDone9 <799203...@qq.com>
Date:   2015-03-10T07:46:12Z

Merge pull request #3 from apache/master

merge lastest spark

commit c7080b35532f68ed1a308f03fbab420a50b23920
Author: DoingDone9 <799203...@qq.com>
Date:   2015-03-11T03:04:28Z

Sort these tokens in alphabetic order to avoid further duplicate in HiveQl







[GitHub] spark pull request: [SPARK-6211][Streaming] Add Python Kafka API u...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4961#issuecomment-78192674
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28458/
Test FAILed.





[GitHub] spark pull request: [SPARK-6211][Streaming] Add Python Kafka API u...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4961#issuecomment-78192667
  
  [Test build #28458 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28458/consoleFull)
 for   PR 4961 at commit 
[`f66a067`](https://github.com/apache/spark/commit/f66a067a7cca20a63eb367f08d190acc2bd95262).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  class EmbeddedZookeeper(val zkConnect: String) `






[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78191533
  
  [Test build #28456 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28456/consoleFull)
 for   PR 4958 at commit 
[`04b35cb`](https://github.com/apache/spark/commit/04b35cb7d7ae1f1306dbcb78022960f6d6628a5d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6269] Use a different implementation of...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4972#issuecomment-78191541
  
  [Test build #28455 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28455/consoleFull)
 for   PR 4972 at commit 
[`ca063fc`](https://github.com/apache/spark/commit/ca063fc6d6f932483791e4ed311a099e05646231).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class ArrayReflect `






[GitHub] spark pull request: [SPARK-6269] Use a different implementation of...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4972#issuecomment-78191546
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28455/
Test PASSed.





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78191538
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28456/
Test PASSed.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread akopich
Github user akopich closed the pull request at:

https://github.com/apache/spark/pull/1269





[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4951#issuecomment-78190584
  
  [Test build #28453 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28453/consoleFull)
 for   PR 4951 at commit 
[`6dd74a0`](https://github.com/apache/spark/commit/6dd74a0f57678bdbfc6654433047e96ff1801429).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable 
with Serializable `






[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-78190574
  
@akopich  Since this is no longer an active PR, could you please close it?

It was very helpful to have this PR as a major basis for the initial LDA 
PR.  If you do end up using the merged LDA or future versions which may be 
added, it would be great to get your input about further improvements, 
especially if they can be added incrementally.  There are people actively 
working on online variational Bayes and on Gibbs sampling, which should have 
very different behavior from EM.





[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4951#issuecomment-78190589
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28453/
Test PASSed.





[GitHub] spark pull request: [SPARK-6185][SQL] Deltele repeated TOKEN. "TOK...

2015-03-10 Thread DoingDone9
Github user DoingDone9 commented on the pull request:

https://github.com/apache/spark/pull/4907#issuecomment-78190562
  
Sorting is a good idea; I will do it. @adrian-wang @marmbrus 
@chenghao-intel 





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/4807#issuecomment-78189159
  
@EntilZha  thx.
@mengxr what do you think?





[GitHub] spark pull request: [SPARK-6185][SQL] Deltele repeated TOKEN. "TOK...

2015-03-10 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/4907#issuecomment-78188942
  
+1 to sorting.
On Mar 10, 2015 7:26 PM, "Daoyuan Wang"  wrote:

> Again, I think we'd better sort these tokens in alphabetic order to avoid
> further duplicate
>






[GitHub] spark pull request: [SPARK-6185][SQL] Deltele repeated TOKEN. "TOK...

2015-03-10 Thread adrian-wang
Github user adrian-wang commented on the pull request:

https://github.com/apache/spark/pull/4907#issuecomment-78188846
  
Again, I think we'd better sort these tokens in alphabetical order to avoid 
further duplicates.





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78188394
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28452/
Test PASSed.





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78188389
  
  [Test build #28452 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28452/consoleFull)
 for   PR 4958 at commit 
[`9351dbc`](https://github.com/apache/spark/commit/9351dbcee01d8a092fbfa1fa2d095474fcc68015).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78187934
  
  [Test build #28451 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28451/consoleFull)
 for   PR 4958 at commit 
[`d81b7e7`](https://github.com/apache/spark/commit/d81b7e729632b1a557859f070337a02b1e1422ea).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78187940
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28451/
Test PASSed.





[GitHub] spark pull request: SPARK-6245 [SQL] jsonRDD() of empty RDD result...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4971#issuecomment-78186536
  
  [Test build #28450 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28450/consoleFull)
 for   PR 4971 at commit 
[`3c619e1`](https://github.com/apache/spark/commit/3c619e19ab762836261c8a9fa2b4240369965afe).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-6245 [SQL] jsonRDD() of empty RDD result...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4971#issuecomment-78186541
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28450/
Test PASSed.





[GitHub] spark pull request: [SPARK-6211][Streaming] Add Python Kafka API u...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4961#issuecomment-78186540
  
  [Test build #28458 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28458/consoleFull)
 for   PR 4961 at commit 
[`f66a067`](https://github.com/apache/spark/commit/f66a067a7cca20a63eb367f08d190acc2bd95262).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6222][STREAMING] Make sure batches are ...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4964#issuecomment-78186076
  
  [Test build #28457 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28457/consoleFull)
 for   PR 4964 at commit 
[`fa93b87`](https://github.com/apache/spark/commit/fa93b871ba0fe22924ff0273e975e492a6a7043c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5124][Core] A standard RPC interface an...

2015-03-10 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4588#discussion_r26183675
  
--- Diff: core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala ---
@@ -0,0 +1,370 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rpc
+
+import java.net.URI
+
+import scala.concurrent.Future
+import scala.concurrent.duration.FiniteDuration
+import scala.reflect.ClassTag
+
+import org.apache.spark.{SparkException, SecurityManager, SparkConf}
+import org.apache.spark.util.Utils
+
+/**
+ * An RPC environment. An [[RpcEndpoint]] needs to register itself with a name to an [[RpcEnv]]
+ * to receive messages. The [[RpcEnv]] then processes messages sent from an [[RpcEndpointRef]]
+ * or from remote nodes, and delivers them to the corresponding [[RpcEndpoint]]s.
+ *
+ * [[RpcEnv]] also provides methods to retrieve an [[RpcEndpointRef]] given a name or URI.
+ */
+private[spark] trait RpcEnv {
+
+  /**
+   * Return RpcEndpointRef of the registered [[RpcEndpoint]]. Will be used 
to implement
+   * [[RpcEndpoint.self]].
+   */
+  private[rpc] def endpointRef(endpoint: RpcEndpoint): RpcEndpointRef
+
+  /**
+   * Return the address that [[RpcEnv]] is listening to.
+   */
+  def address: RpcAddress
+
+  /**
+   * Register a [[RpcEndpoint]] with a name and return its 
[[RpcEndpointRef]]. [[RpcEnv]] does not
+   * guarantee thread-safety.
+   */
+  def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
+
+  /**
+   * Register a [[RpcEndpoint]] with a name and return its [[RpcEndpointRef]]. [[RpcEnv]] ensures
+   * that messages are delivered to the [[RpcEndpoint]] in a thread-safe manner.
+   */
+  def setupThreadSafeEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
+
+  /**
+   * Retrieve a [[RpcEndpointRef]] which is located in the driver via its 
name.
+   */
+  def setupDriverEndpointRef(name: String): RpcEndpointRef
+
+  /**
+   * Retrieve the [[RpcEndpointRef]] represented by `url`.
+   */
+  def setupEndpointRefByUrl(url: String): RpcEndpointRef
+
+  /**
+   * Retrieve the [[RpcEndpointRef]] represented by `systemName`, 
`address` and `endpointName`
--- End diff --

Sorry, I should have been clearer; I was going by other comments saying 
they use different system names.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-78184948
  
@renchengchang 
What do you mean by "topic vector"? A vector of p(t|d) \forall t? If so, 
you can find these vectors in the `RDD[DocumentParameters]` returned by the 
`infer(documents: RDD[Document], ...)` method. `DocumentParameters` stores a 
document and a vector of p(t|d) \forall t, which is referred to as `theta`. 
BTW, the order of documents remains the same. 
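Because the output order matches the input order, a positional join is enough to map each inferred topic vector back to its raw document without a document id. A hedged Java sketch of the idea, with a dummy stand-in for `infer` (the real method operates on RDDs):

```java
import java.util.*;

public class TopicJoin {
    // Stand-in for infer(): returns one p(t|d) vector per input document,
    // in the same order as the input (the property noted in the comment).
    static List<double[]> infer(List<String> docs) {
        List<double[]> thetas = new ArrayList<>();
        for (String d : docs) thetas.add(new double[] {0.5, 0.5});
        return thetas;
    }

    // The list index plays the role of a document id.
    static Map<String, double[]> joinByIndex(List<String> docs) {
        List<double[]> thetas = infer(docs);
        Map<String, double[]> byDoc = new LinkedHashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            byDoc.put(docs.get(i), thetas.get(i));
        }
        return byDoc;
    }
}
```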





[GitHub] spark pull request: [SPARK-6269] Use a different implementation of...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4972#issuecomment-78184695
  
  [Test build #28455 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28455/consoleFull)
 for   PR 4972 at commit 
[`ca063fc`](https://github.com/apache/spark/commit/ca063fc6d6f932483791e4ed311a099e05646231).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78184673
  
  [Test build #28456 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28456/consoleFull)
 for   PR 4958 at commit 
[`04b35cb`](https://github.com/apache/spark/commit/04b35cb7d7ae1f1306dbcb78022960f6d6628a5d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6269] Use a different implementation of...

2015-03-10 Thread mccheah
Github user mccheah commented on the pull request:

https://github.com/apache/spark/pull/4972#issuecomment-78184492
  
I believe we use Array.get in the visitArray method in SizeEstimator, so 
there's that as well.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread renchengchang
Github user renchengchang commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-78184357
  
Thanks.
I have a question:
if there is no document id, how can I know the relation between a topic 
vector and the raw text?

From: Avanesov Valeriy [mailto:notificati...@github.com]
Sent: March 10, 2015, 21:18
To: apache/spark
Cc: 任成常
Subject: [Promotional] Re: [spark] [SPARK-2199] [mllib] topic modeling 
(#1269)


@renchengchang
1. Hi.
2. Don't use code from this PR. Use either LDA (which is merged with mllib) 
or https://github.com/akopich/dplsa which is a further development of this PR.
3. I do not employ the concept of document id.







[GitHub] spark pull request: [SPARK-6269] Use a different implementation of...

2015-03-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4972#issuecomment-78184317
  
@mccheah thanks for submitting this. Is the only thing we use getLength?





[GitHub] spark pull request: [SPARK-6269] Use a different implementation of...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4972#issuecomment-78183939
  
  [Test build #28454 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28454/consoleFull)
 for   PR 4972 at commit 
[`1fe09de`](https://github.com/apache/spark/commit/1fe09de24d48cdfbff995a7ef1f82d72ff521509).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class ArrayReflect `






[GitHub] spark pull request: [SPARK-6269] Use a different implementation of...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4972#issuecomment-78183941
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28454/
Test FAILed.





[GitHub] spark pull request: [SPARK-6269] Use a different implementation of...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4972#issuecomment-78183787
  
  [Test build #28454 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28454/consoleFull)
 for   PR 4972 at commit 
[`1fe09de`](https://github.com/apache/spark/commit/1fe09de24d48cdfbff995a7ef1f82d72ff521509).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-6245 [SQL] jsonRDD() of empty RDD result...

2015-03-10 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/4971#discussion_r26182739
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala 
---
@@ -48,7 +48,11 @@ private[sql] object JsonRDD extends Logging {
 require(samplingRatio > 0, s"samplingRatio ($samplingRatio) should be 
greater than 0")
 val schemaData = if (samplingRatio > 0.99) json else 
json.sample(false, samplingRatio, 1)
 val allKeys =
-  parseJson(schemaData, 
columnNameOfCorruptRecords).map(allKeysWithValueTypes).reduce(_ ++ _)
+  if (schemaData.isEmpty()) {
+Set[(String,DataType)]()
--- End diff --

This is a super nit, but I think typically we would just do `Set.empty` 
here or maybe `Set.empty[(String, DataType)]` if you really want to be explicit.
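The guard in the diff reflects a general pitfall: `reduce` without an identity element is undefined on an empty collection (Spark's `RDD.reduce` throws on an empty RDD). A minimal Java sketch of the same fix, with hypothetical names:

```java
import java.util.*;

public class EmptyReduce {
    // Mirrors the patch: the caller returns an explicit empty set instead
    // of reducing over zero elements.
    static Set<String> allKeys(List<Set<String>> perRecordKeys) {
        return perRecordKeys.stream()
            .reduce((a, b) -> {
                Set<String> merged = new HashSet<>(a);
                merged.addAll(b);
                return merged;
            })
            // The Optional-returning reduce has no identity; supply the
            // empty-set default for the zero-record case.
            .orElse(Collections.emptySet());
    }
}
```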





[GitHub] spark pull request: [SPARK-6243][SQL] The Operation of match did n...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4959#issuecomment-78183498
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28448/
Test PASSed.





[GitHub] spark pull request: [SPARK-6269] Use a different implementation of...

2015-03-10 Thread mccheah
GitHub user mccheah opened a pull request:

https://github.com/apache/spark/pull/4972

[SPARK-6269] Use a different implementation of java.lang.reflect.Array

This patch uses a different implementation of java.lang.reflect.Array. The 
code is copied and pasted from
https://bugs.openjdk.java.net/browse/JDK-8051447 with appropriate style 
changes for this project. The appropriate code is in the public domain, so we 
can use it since the proper Apache licensing is on it.

The idea is to use pure Java code in implementing the methods there, as 
opposed to relying on native C code which ends up being ill-performing. This 
improves the performance of estimating the size of arrays when we are checking 
for spilling in Spark.

Here's the benchmark discussion from the ticket:

I did two tests. The first, less convincing, take-with-a-block-of-salt test 
I did was do a simple groupByKey operation to collect objects in a 4.0 GB text 
file RDD into 30,000 buckets. I ran 1 Master and 4 Spark Worker JVMs on my mac, 
fetching the RDD from a text file simply stored on disk, and saving it out to 
another file located on local disk. The wall clock times I got back before and 
after the change were:

Before: 352.195s, 343.871s, 359.080s
After: 342.929583s, 329.456623s, 326.151481s

So, there is a bit of an improvement after the change. I also did some 
YourKit profiling of the executors to get an idea of how much time was spent in 
size estimation before and after the change. I roughly saw that size estimation 
took up less of the time after my change, but YourKit's profiling can be 
inconsistent and who knows if I was profiling the executors that had the same 
data between runs?

The more convincing test I did was to run the size-estimation logic itself 
in an isolated unit test. I ran the following code:
```
val bigArray =
  Array.fill(1000)(Array.fill(1000)(java.util.UUID.randomUUID().toString()))
test("String arrays only perf testing") {
  val startTime = System.currentTimeMillis()
  for (i <- 1 to 5) {
    SizeEstimator.estimate(bigArray)
  }
  println("Runtime: " + (System.currentTimeMillis() - startTime) / 1000.0)
}
```
I wanted to use a 2D array specifically because I wanted to measure the 
performance of repeatedly calling Array.getLength. I used UUID-Strings to 
ensure that the strings were randomized (so object re-use doesn't happen), but 
that they would all be the same size. The results were as follows:

Before change: 209.275s, 190.107s, 185.424s
After change: 160.431s, 149.487s, 151.66s
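The native `java.lang.reflect.Array.getLength` call that the patch replaces can be approximated in pure Java with `instanceof` checks, which the JIT handles far better than the JNI round trip. A minimal sketch of that idea (hypothetical class name, not the patch's actual `ArrayReflect` code):

```java
public final class ArrayLength {
    // Pure-Java replacement for java.lang.reflect.Array.getLength: dispatch
    // on the array's runtime type instead of calling into native code.
    public static int getLength(Object array) {
        if (array instanceof Object[])  return ((Object[]) array).length;
        if (array instanceof int[])     return ((int[]) array).length;
        if (array instanceof long[])    return ((long[]) array).length;
        if (array instanceof double[])  return ((double[]) array).length;
        if (array instanceof float[])   return ((float[]) array).length;
        if (array instanceof boolean[]) return ((boolean[]) array).length;
        if (array instanceof byte[])    return ((byte[]) array).length;
        if (array instanceof short[])   return ((short[]) array).length;
        if (array instanceof char[])    return ((char[]) array).length;
        throw new IllegalArgumentException("Argument is not an array");
    }

    public static void main(String[] args) {
        int[][] grid = new int[3][7];
        System.out.println(getLength(grid));     // 3 (int[][] is an Object[])
        System.out.println(getLength(grid[0]));  // 7
    }
}
```

Note the `Object[]` check must come first so that all reference-array types (including nested primitive arrays like `int[][]`) take the fast path.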

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/palantir/spark 
feature/spark-6269-reflect-array

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4972.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4972









[GitHub] spark pull request: [SPARK-6243][SQL] The Operation of match did n...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4959#issuecomment-78183487
  
  [Test build #28448 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28448/consoleFull)
 for   PR 4959 at commit 
[`6278846`](https://github.com/apache/spark/commit/6278846a50fde4610d6e540dc367c8ad61aee83f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-6245 [SQL] jsonRDD() of empty RDD result...

2015-03-10 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/4971#issuecomment-78183462
  
LGTM, thanks @srowen 





[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4951#issuecomment-78183339
  
  [Test build #28453 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28453/consoleFull) for PR 4951 at commit [`6dd74a0`](https://github.com/apache/spark/commit/6dd74a0f57678bdbfc6654433047e96ff1801429).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6185][SQL] Deltele repeated TOKEN. "TOK...

2015-03-10 Thread DoingDone9
Github user DoingDone9 commented on the pull request:

https://github.com/apache/spark/pull/4907#issuecomment-78183146
  
@yhuai  Could it be merged?





[GitHub] spark pull request: [SPARK-6198][SQL] Support "select current_data...

2015-03-10 Thread DoingDone9
Github user DoingDone9 commented on the pull request:

https://github.com/apache/spark/pull/4926#issuecomment-78182986
  
could you test it @marmbrus 





[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3916#issuecomment-78182670
  
  [Test build #28447 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28447/consoleFull) for PR 3916 at commit [`18c7e4d`](https://github.com/apache/spark/commit/18c7e4db3b9713c4bc13487e3a15e59b6bf2dc58).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3916#issuecomment-78182678
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28447/
Test PASSed.





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78181015
  
  [Test build #28452 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28452/consoleFull) for PR 4958 at commit [`9351dbc`](https://github.com/apache/spark/commit/9351dbcee01d8a092fbfa1fa2d095474fcc68015).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78180500
  
  [Test build #28451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28451/consoleFull) for PR 4958 at commit [`d81b7e7`](https://github.com/apache/spark/commit/d81b7e729632b1a557859f070337a02b1e1422ea).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4958#issuecomment-78180187
  
lgtm





[GitHub] spark pull request: SPARK-6245 [SQL] jsonRDD() of empty RDD result...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4971#issuecomment-78179942
  
  [Test build #28450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28450/consoleFull) for PR 4971 at commit [`3c619e1`](https://github.com/apache/spark/commit/3c619e19ab762836261c8a9fa2b4240369965afe).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-6225 [CORE] [SQL] [STREAMING] Resolve mo...

2015-03-10 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/4950#discussion_r26181023
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -199,12 +199,12 @@ object MatrixFactorizationModel extends 
Loader[MatrixFactorizationModel] {
   assert(formatVersion == thisFormatVersion)
   val rank = (metadata \ "rank").extract[Int]
   val userFeatures = sqlContext.parquetFile(userPath(path))
-.map { case Row(id: Int, features: Seq[Double]) =>
-  (id, features.toArray)
+.map { case Row(id: Int, features: Seq[_]) =>
+  (id, features.asInstanceOf[Seq[Double]].toArray)
--- End diff --

I think you can also use the `@unchecked` annotation here, but getting the 
syntax right usually involves guessing until it works for me.
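
For reference, a minimal sketch of what that might look like on the quoted match — this is an untested guess at the syntax, with `sqlContext`, `userPath`, and `path` taken from the diff above, not a verified fix:

```scala
// Hypothetical: annotate the erased type argument with @unchecked to
// silence the erasure warning, instead of matching Seq[_] and casting.
val userFeatures = sqlContext.parquetFile(userPath(path))
  .map { case Row(id: Int, features: Seq[Double @unchecked]) =>
    (id, features.toArray)
  }
```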





[GitHub] spark pull request: SPARK-6245 [SQL] jsonRDD() of empty RDD result...

2015-03-10 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/4971

SPARK-6245 [SQL] jsonRDD() of empty RDD results in exception

Avoid `UnsupportedOperationException` from JsonRDD.inferSchema on empty RDD.

I'm not sure whether this is supposed to be an error (though a better one than 
`UnsupportedOperationException`); it seems this case can come up when the input 
is down-sampled so heavily that nothing is sampled at all.

Now stuff like this:
```
sqlContext.jsonRDD(sc.parallelize(List[String]()))
```
just results in
```
org.apache.spark.sql.DataFrame = []
```
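
For illustration, the guard inside `JsonRDD.inferSchema` could plausibly take a shape like this — a sketch only, with `sampled` and `inferFrom` as made-up names, not the actual patch:

```scala
// Hypothetical sketch: when sampling yields no records, fall back to an
// empty schema rather than reducing over an empty collection.
val schema =
  if (sampled.isEmpty()) StructType(Nil)
  else inferFrom(sampled)
```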

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-6245

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4971.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4971


commit 3c619e19ab762836261c8a9fa2b4240369965afe
Author: Sean Owen 
Date:   2015-03-11T00:26:42Z

Avoid UnsupportedOperationException from JsonRDD.inferSchema on empty RDD







[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/4958#discussion_r26180727
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1022,6 +1156,117 @@ results = sqlContext.sql("FROM src SELECT key, 
value").collect()
 
 
 
+## JDBC Connectivity
+
+Spark SQL also includes a datasource that can read data from other 
databases using JDBC.  This
+functionality is similar to JDBCRDD, but since the results are returned as 
a DataFrame they can
--- End diff --

I actually think for SEO we want to.  If people are searching for JdbcRDD then 
I'd love them to end up here.  Do you think I should say that it's going to be 
deprecated?





[GitHub] spark pull request: [SPARK-5183][SQL] Update SQL Docs with JDBC an...

2015-03-10 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/4958#discussion_r26180682
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1022,6 +1156,117 @@ results = sqlContext.sql("FROM src SELECT key, 
value").collect()
 
 
 
+## JDBC Connectivity
--- End diff --

JDBC To Other Databases?





[GitHub] spark pull request: [SPARK-5845] Time to cleanup spilled shuffle f...

2015-03-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4965#issuecomment-78178445
  
  [Test build #28449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28449/consoleFull) for PR 4965 at commit [`b946d08`](https://github.com/apache/spark/commit/b946d085c50e5e5072d7429fe8a634e5b401986c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5845] Time to cleanup spilled shuffle f...

2015-03-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4965#issuecomment-78178449
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28449/
Test FAILed.





[GitHub] spark pull request: Minor doc: Remove the extra blank line in data...

2015-03-10 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/4955#issuecomment-78177742
  
Thanks!  Merged to master and 1.3





[GitHub] spark pull request: [SPARK-6025] [MLlib] Add helper method evaluat...

2015-03-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4906#discussion_r26179909
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala 
---
@@ -108,6 +110,53 @@ class GradientBoostedTreesModel(
   }
 
   override protected def formatVersion: String = 
TreeEnsembleModel.SaveLoadV1_0.thisFormatVersion
+
+  /**
+   * Method to compute error or loss for every iteration of gradient 
boosting.
+   * @param data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+   * @param loss: evaluation metric.
+   * @return an array with index i having the losses or errors for the 
ensemble
+   * containing trees 1 to i + 1
+   */
+  def evaluateEachIteration(
+  data: RDD[LabeledPoint],
+  loss: Loss) : Array[Double] = {
+
+val sc = data.sparkContext
+val remappedData = algo match {
+  case Classification => data.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
+  case _ => data
+}
+val initialTree = trees(0)
+val numIterations = trees.length
+val evaluationArray = Array.fill(numIterations)(0.0)
+
+// Initial weight is 1.0
+var predictionErrorModel = remappedData.map {i =>
+  val pred = initialTree.predict(i.features)
+  val error = loss.computeError(i, pred)
+  (pred, error)
+}
+evaluationArray(0) = predictionErrorModel.values.mean()
+
+// Avoid the model being copied across numIterations.
+val broadcastTrees = sc.broadcast(trees)
+val broadcastWeights = sc.broadcast(treeWeights)
+
+(1 until numIterations).map {nTree =>
--- End diff --

space after {





[GitHub] spark pull request: [SPARK-6025] [MLlib] Add helper method evaluat...

2015-03-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4906#discussion_r26179903
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala 
---
@@ -108,6 +110,53 @@ class GradientBoostedTreesModel(
   }
 
   override protected def formatVersion: String = 
TreeEnsembleModel.SaveLoadV1_0.thisFormatVersion
+
+  /**
+   * Method to compute error or loss for every iteration of gradient 
boosting.
+   * @param data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+   * @param loss: evaluation metric.
+   * @return an array with index i having the losses or errors for the 
ensemble
+   * containing trees 1 to i + 1
+   */
+  def evaluateEachIteration(
+  data: RDD[LabeledPoint],
+  loss: Loss) : Array[Double] = {
+
+val sc = data.sparkContext
+val remappedData = algo match {
+  case Classification => data.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
+  case _ => data
+}
+val initialTree = trees(0)
+val numIterations = trees.length
+val evaluationArray = Array.fill(numIterations)(0.0)
+
+// Initial weight is 1.0
--- End diff --

may as well use initial weight explicitly in case that changes for some 
reason in the future
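
A sketch of what that could look like, reusing the names from the quoted diff and assuming `treeWeights` is aligned index-for-index with `trees`:

```scala
// Hypothetical: read the first tree's weight instead of hard-coding 1.0,
// so this stays correct if the initial weight ever changes.
val firstTreeWeight = treeWeights(0)
var predictionAndError = remappedData.map { lp =>
  val pred = initialTree.predict(lp.features) * firstTreeWeight
  (pred, loss.computeError(lp, pred))
}
```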





[GitHub] spark pull request: [SPARK-6025] [MLlib] Add helper method evaluat...

2015-03-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4906#discussion_r26179917
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala 
---
@@ -108,6 +110,53 @@ class GradientBoostedTreesModel(
   }
 
   override protected def formatVersion: String = 
TreeEnsembleModel.SaveLoadV1_0.thisFormatVersion
+
+  /**
+   * Method to compute error or loss for every iteration of gradient 
boosting.
+   * @param data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+   * @param loss: evaluation metric.
+   * @return an array with index i having the losses or errors for the 
ensemble
+   * containing trees 1 to i + 1
+   */
+  def evaluateEachIteration(
+  data: RDD[LabeledPoint],
+  loss: Loss) : Array[Double] = {
+
+val sc = data.sparkContext
+val remappedData = algo match {
+  case Classification => data.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
+  case _ => data
+}
+val initialTree = trees(0)
+val numIterations = trees.length
+val evaluationArray = Array.fill(numIterations)(0.0)
+
+// Initial weight is 1.0
+var predictionErrorModel = remappedData.map {i =>
+  val pred = initialTree.predict(i.features)
+  val error = loss.computeError(i, pred)
+  (pred, error)
+}
+evaluationArray(0) = predictionErrorModel.values.mean()
+
+// Avoid the model being copied across numIterations.
+val broadcastTrees = sc.broadcast(trees)
+val broadcastWeights = sc.broadcast(treeWeights)
+
+(1 until numIterations).map {nTree =>
+  predictionErrorModel = (remappedData zip predictionErrorModel) map {
+case (point, (pred, error)) => {
+  val newPred = pred + (
+broadcastTrees.value(nTree).predict(point.features) * 
broadcastWeights.value(nTree))
+  val newError = loss.computeError(point, newPred)
+  (newPred, newError)
+}
+  }
+  evaluationArray(nTree) = predictionErrorModel.values.mean()
+}
+evaluationArray
--- End diff --

You might want to explicitly unpersist the broadcast values before 
returning.  They will get unpersisted once their values go out of scope, but it 
might take longer.
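
Concretely, that could be as simple as the following sketch (names from the quoted diff; `Broadcast.unpersist` releases the copies cached on the executors):

```scala
// Hypothetical: release the broadcast values explicitly once all
// iterations have been evaluated, rather than waiting for GC.
broadcastTrees.unpersist(blocking = false)
broadcastWeights.unpersist(blocking = false)
evaluationArray
```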





[GitHub] spark pull request: [SPARK-6025] [MLlib] Add helper method evaluat...

2015-03-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4906#discussion_r26179907
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala 
---
@@ -108,6 +110,53 @@ class GradientBoostedTreesModel(
   }
 
   override protected def formatVersion: String = 
TreeEnsembleModel.SaveLoadV1_0.thisFormatVersion
+
+  /**
+   * Method to compute error or loss for every iteration of gradient 
boosting.
+   * @param data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+   * @param loss: evaluation metric.
+   * @return an array with index i having the losses or errors for the 
ensemble
+   * containing trees 1 to i + 1
+   */
+  def evaluateEachIteration(
+  data: RDD[LabeledPoint],
+  loss: Loss) : Array[Double] = {
+
+val sc = data.sparkContext
+val remappedData = algo match {
+  case Classification => data.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
+  case _ => data
+}
+val initialTree = trees(0)
+val numIterations = trees.length
+val evaluationArray = Array.fill(numIterations)(0.0)
+
+// Initial weight is 1.0
+var predictionErrorModel = remappedData.map {i =>
--- End diff --

predictionErrorModel is an odd name (model?).  I'd rename it to 
predictionAndError and possibly add an explicit type for clarity.
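
For example, a sketch of the rename with an explicit element type (the tuple type is inferred from the quoted diff's prediction/error pair):

```scala
// Hypothetical rename with an explicit type for clarity.
var predictionAndError: RDD[(Double, Double)] = remappedData.map { lp =>
  val pred = initialTree.predict(lp.features)
  (pred, loss.computeError(lp, pred))
}
```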





[GitHub] spark pull request: [SPARK-6025] [MLlib] Add helper method evaluat...

2015-03-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4906#discussion_r26179911
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala 
---
@@ -108,6 +110,53 @@ class GradientBoostedTreesModel(
   }
 
   override protected def formatVersion: String = 
TreeEnsembleModel.SaveLoadV1_0.thisFormatVersion
+
+  /**
+   * Method to compute error or loss for every iteration of gradient 
boosting.
+   * @param data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+   * @param loss: evaluation metric.
+   * @return an array with index i having the losses or errors for the 
ensemble
+   * containing trees 1 to i + 1
+   */
+  def evaluateEachIteration(
+  data: RDD[LabeledPoint],
+  loss: Loss) : Array[Double] = {
+
+val sc = data.sparkContext
+val remappedData = algo match {
+  case Classification => data.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
+  case _ => data
+}
+val initialTree = trees(0)
+val numIterations = trees.length
+val evaluationArray = Array.fill(numIterations)(0.0)
+
+// Initial weight is 1.0
+var predictionErrorModel = remappedData.map {i =>
+  val pred = initialTree.predict(i.features)
+  val error = loss.computeError(i, pred)
+  (pred, error)
+}
+evaluationArray(0) = predictionErrorModel.values.mean()
+
+// Avoid the model being copied across numIterations.
+val broadcastTrees = sc.broadcast(trees)
+val broadcastWeights = sc.broadcast(treeWeights)
+
+(1 until numIterations).map {nTree =>
+  predictionErrorModel = (remappedData zip predictionErrorModel) map {
--- End diff --

I would use mapPartitions.  Before iterating over the partition elements, 
extract the trees and weights from the broadcast variables.  I believe that 
reduces overhead a little.

Also, try to avoid infix notation since non-Scala people may not be used to 
it:
```
remappedData.zip(predictionErrorModel)
```
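
A rough sketch of the `mapPartitions` version of that update (names taken from the quoted diff; untested):

```scala
predictionErrorModel = remappedData.zip(predictionErrorModel).mapPartitions { iter =>
  // Read the broadcast values once per partition, not once per record.
  val tree = broadcastTrees.value(nTree)
  val weight = broadcastWeights.value(nTree)
  iter.map { case (point, (pred, error)) =>
    val newPred = pred + tree.predict(point.features) * weight
    (newPred, loss.computeError(point, newPred))
  }
}
```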




