[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...

2015-07-08 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/7263#discussion_r34228026
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -146,6 +151,55 @@ class Word2VecModel private[ml] (
 wordVectors: feature.Word2VecModel)
   extends Model[Word2VecModel] with Word2VecBase {
 
+
+  /**
+   * Return a map of every word to its Vector representation.
+   */
+  val getVectors: Map[String, Array[Float]] = wordVectors.getVectors
--- End diff --

oh my bad, I didn't see that new functionality.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119846476
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119846462
  
[Test build #36902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36902/console) for PR 6112 at commit [`5fd9380`](https://github.com/apache/spark/commit/5fd9380573859b23a18ffe65d9ee3335a792d7ba).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2017] [UI] Stage page hangs with many t...

2015-07-08 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/7296#issuecomment-119845615
  
Looks good. Did you do browser inspection and make sure this is actually working?





[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-119844053
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-119844045
  
[Test build #36903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36903/console) for PR 7255 at commit [`c658acb`](https://github.com/apache/spark/commit/c658acb3aceac946ba7aca2d1eff40ff472c8c6c).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7263#discussion_r34227718
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -146,6 +151,55 @@ class Word2VecModel private[ml] (
 wordVectors: feature.Word2VecModel)
   extends Model[Word2VecModel] with Word2VecBase {
 
+
+  /**
+   * Return a map of every word to its Vector representation.
+   */
+  val getVectors: Map[String, Array[Float]] = wordVectors.getVectors
--- End diff --

Actually, we can call `SparkContext.getOrCreate()` to get the active 
`SparkContext`.
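
The suggestion can be sketched as below (a minimal, illustrative sketch assuming Spark 1.4+, where `SparkContext.getOrCreate()` became available; the master, app name, and object name are placeholders, not part of the PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GetOrCreateSketch {
  def main(args: Array[String]): Unit = {
    // getOrCreate() returns the already-active SparkContext if one is
    // running, otherwise it constructs a new one from the given SparkConf.
    val conf = new SparkConf().setMaster("local[1]").setAppName("sketch")
    val sc = SparkContext.getOrCreate(conf)

    // Library code (e.g. a model method that needs to build an RDD or
    // DataFrame) can later fetch the same instance without having the
    // context passed in explicitly.
    val same = SparkContext.getOrCreate()
    assert(same eq sc)

    sc.stop()
  }
}
```

With this, a method like `getVectors` could lazily construct its DataFrame from the active context rather than requiring one as a parameter.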





[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7239#issuecomment-119843192
  
[Test build #36904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36904/consoleFull) for PR 7239 at commit [`cf7ef40`](https://github.com/apache/spark/commit/cf7ef40b9aed7bd48a79998758524a088145fd70).





[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-119843141
  
[Test build #36903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36903/consoleFull) for PR 7255 at commit [`c658acb`](https://github.com/apache/spark/commit/c658acb3aceac946ba7aca2d1eff40ff472c8c6c).





[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7239#issuecomment-119842653
  
Merged build started.





[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7239#issuecomment-119842597
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-119842586
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-119842625
  
Merged build started.





[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-08 Thread SaintBacchus
Github user SaintBacchus commented on the pull request:

https://github.com/apache/spark/pull/7239#issuecomment-119842176
  
@tianyi Thanks for the review and comment; I have removed it.





[GitHub] spark pull request: [SPARK-8881] Fix algorithm for scheduling exec...

2015-07-08 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request:

https://github.com/apache/spark/pull/7274#issuecomment-119841645
  
Can this be retested please?





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119841409
  
[Test build #36902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36902/consoleFull) for PR 6112 at commit [`5fd9380`](https://github.com/apache/spark/commit/5fd9380573859b23a18ffe65d9ee3335a792d7ba).





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119841336
  
 Merged build triggered.





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119841350
  
Merged build started.





[GitHub] spark pull request: [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [...

2015-07-08 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/7301#issuecomment-119840910
  
@JoshRosen Ping!





[GitHub] spark pull request: [SPARK-8931] [SQL] Fallback to interpreted eva...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7309#issuecomment-119840582
  
[Test build #36901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36901/consoleFull) for PR 7309 at commit [`969a612`](https://github.com/apache/spark/commit/969a612d09c1575910a840ff0df0dc2b089ee680).





[GitHub] spark pull request: [SPARK-8931] [SQL] Fallback to interpreted eva...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7309#issuecomment-119840475
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8931] [SQL] Fallback to interpreted eva...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7309#issuecomment-119840488
  
Merged build started.





[GitHub] spark pull request: [SPARK-8931] [SQL] Fallback to interpreted eva...

2015-07-08 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/7309#issuecomment-119840357
  
@JoshRosen Thanks, updated.





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119840263
  
[Test build #36900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36900/console) for PR 6112 at commit [`42341fb`](https://github.com/apache/spark/commit/42341fb9e109ddf77e949c8453de2a30c9e4e71f).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119840267
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8830][SQL] native levenshtein distance

2015-07-08 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/7236#issuecomment-119840066
  
LGTM





[GitHub] spark pull request: [SPARK-8830][SQL] native levenshtein distance

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7236#issuecomment-119839695
  
[Test build #1019 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1019/consoleFull) for PR 7236 at commit [`ee4c4de`](https://github.com/apache/spark/commit/ee4c4de852d6e88096869a7d22370ff5e8d87ba5).





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119839600
  
[Test build #36900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36900/consoleFull) for PR 6112 at commit [`42341fb`](https://github.com/apache/spark/commit/42341fb9e109ddf77e949c8453de2a30c9e4e71f).





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119839451
  
 Merged build triggered.





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119839460
  
Merged build started.





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226904
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -90,3 +90,19 @@ class ChiSqTestResult private[stat] (override val 
pValue: Double,
   super.toString
   }
 }
+
+/**
+ * :: Experimental ::
+ * Object containing the test results for the Kolmogorov-Smirnov test.
+ */
+@Experimental
+class KSTestResult private[stat] (override val pValue: Double,
--- End diff --

move `override val pValue: Double` to next line
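
The requested layout might look like this (a sketch only; the stub trait and the parameter besides `pValue` are illustrative, not the PR's exact signature):

```scala
package org.apache.spark.mllib.stat.test

// Minimal stub standing in for the real TestResult trait (illustrative).
trait SketchTestResult { def pValue: Double }

// `override val pValue: Double` moves to the line after the opening
// parenthesis, with Spark's four-space continuation indent.
class KSTestResult private[stat] (
    override val pValue: Double,
    val statistic: Double)
  extends SketchTestResult
```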





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226914
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala ---
@@ -153,4 +157,99 @@ class HypothesisTestSuite extends SparkFunSuite with 
MLlibTestSparkContext {
   Statistics.chiSqTest(sc.parallelize(continuousFeature, 2))
 }
   }
+
+  test("1 sample Kolmogorov-Smirnov test") {
+// Create theoretical distributions
+val stdNormalDist = new NormalDistribution(0, 1)
+val expDist = new ExponentialDistribution(0.6)
+val unifDist = new UniformRealDistribution()
+
+// set seeds
+val seed = 10L
+stdNormalDist.reseedRandomGenerator(seed)
+expDist.reseedRandomGenerator(seed)
+unifDist.reseedRandomGenerator(seed)
+
+// Sample data from the distributions and parallelize it
+val n = 10
+val sampledNorm = sc.parallelize(stdNormalDist.sample(n), 10)
+val sampledExp = sc.parallelize(expDist.sample(n), 10)
+val sampledUnif = sc.parallelize(unifDist.sample(n), 10)
+
+// Use an Apache Commons Math local KS test to verify calculations
+val ksTest = new KolmogorovSmirnovTest()
+val pThreshold = 0.05
+
+// Comparing a standard normal sample to a standard normal distribution
+val result1 = Statistics.ksTest(sampledNorm, "norm", 0, 1)
+val referenceStat1 = ksTest.kolmogorovSmirnovStatistic(stdNormalDist, 
sampledNorm.collect())
+val referencePVal1 = 1 - ksTest.cdf(referenceStat1, n)
+// Verify vs apache math commons ks test
+assert(result1.statistic ~== referenceStat1 relTol 1e-4)
+assert(result1.pValue ~== referencePVal1 relTol 1e-4)
+// Cannot reject null hypothesis
+assert(result1.pValue > pThreshold)
+
+// Comparing an exponential sample to a standard normal distribution
+val result2 = Statistics.ksTest(sampledExp, "norm", 0, 1)
+val referenceStat2 = ksTest.kolmogorovSmirnovStatistic(stdNormalDist, 
sampledExp.collect())
+val referencePVal2 = 1 - ksTest.cdf(referenceStat2, n)
+// verify vs apache math commons ks test
+assert(result2.statistic ~== referenceStat2 relTol 1e-4)
+assert(result2.pValue ~== referencePVal2 relTol 1e-4)
+// reject null hypothesis
+assert(result2.pValue < pThreshold)
+
+// Testing the use of a user provided CDF function
+// Distribution is not serializable, so we have to create it inside the lambda
+val expCDF = (x: Double) => new 
ExponentialDistribution(0.2).cumulativeProbability(x)
+
+// Comparing an exponential sample with mean X to an exponential 
distribution with mean Y
+// Where X != Y
+val result3 = Statistics.ksTest(sampledExp, expCDF)
+val referenceStat3 = ksTest.kolmogorovSmirnovStatistic(new 
ExponentialDistribution(0.2),
+  sampledExp.collect())
+val referencePVal3 = 1 - ksTest.cdf(referenceStat3, 
sampledNorm.count().toInt)
+// verify vs apache math commons ks test
+assert(result3.statistic ~== referenceStat3 relTol 1e-4)
+assert(result3.pValue ~== referencePVal3 relTol 1e-4)
+// reject null hypothesis
+assert(result3.pValue < pThreshold)
+
+/*
--- End diff --

Should define a new test for R equivalence.





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226900
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala ---
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import scala.annotation.varargs
+
+import org.apache.commons.math3.distribution.{NormalDistribution, 
RealDistribution}
+import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the two-sided Kolmogorov-Smirnov test for data sampled from a
+ * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+ * distribution of the sample data and the theoretical distribution we can 
provide a test for
+ * the null hypothesis that the sample data comes from that theoretical
distribution.
+ * For more information on KS Test:
+ * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+ *
+ * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+ * passes. We sort the RDD, and then perform the following operations on a 
per-partition basis:
+ * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+ * cumulative distribution value. We know the latter to be correct, while 
the former will be off by
+ * a constant (how large the constant is depends on how many values 
precede it in other partitions).
+ * However, given that this constant simply shifts the ECDF upwards, but 
doesn't change its shape,
+ * and furthermore, that constant is the same within a given partition, we 
can pick 2 values
+ * in each partition that can potentially resolve to the largest global 
distance. Namely, we
+ * pick the minimum distance and the maximum distance. Additionally, we 
keep track of how many
+ * elements are in each partition. Once these three values have been 
returned for every partition,
+ * we can collect and operate locally. Locally, we can now adjust each 
distance by the appropriate
+ * constant (the cumulative sum of # of elements in the prior partitions 
divided by the data set
+ * size). Finally, we take the maximum absolute value, and this is the 
statistic.
+ */
+private[stat] object KSTest extends Logging {
+
+  // Null hypothesis for the type of KS test to be included in the result.
+  object NullHypothesis extends Enumeration {
+type NullHypothesis = Value
+val oneSampleTwoSided = Value("Sample follows theoretical 
distribution")
+  }
+
+  /**
+   * Runs a KS test for 1 set of sample data, comparing it to a 
theoretical distribution
+   * @param data `RDD[Double]` data on which to run test
+   * @param cdf `Double => Double` function to calculate the theoretical 
CDF
+   * @return KSTestResult summarizing the test results (pval, statistic, 
and null hypothesis)
+   */
+  def testOneSample(data: RDD[Double], cdf: Double => Double): 
KSTestResult = {
+val n = data.count().toDouble
+val localData = data.sortBy(x => x).mapPartitions { part =>
+  val partDiffs = oneSampleDifferences(part, n, cdf) // local distances
+  searchOneSampleCandidates(partDiffs) // candidates: local extrema
+}.collect()
+val ksStat = searchOneSampleStatistic(localData, n) // result: global 
extreme
+evalOneSampleP(ksStat, n.toLong)
+  }
+
+  /**
+   * Runs a KS test for 1 set of sample data, comparing it to a 
theoretical distribution
+   * @param data `RDD[Double]` data on which to run test
+   * @param createDist `Unit => RealDistribution` function to create a 
theoretical distribution
+   * @return KSTestResult summarizing the test results (pval, statistic, 
and null hypothesis)
+   */
+  def testOneSample(data: RDD[Double], createDist: () => 
RealDistribution): KSTestResult = {
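The per-observation candidate distances described in the implementation note above can be sketched locally in plain Scala, without Spark. The object and method names below are illustrative only, not taken from the PR:

```scala
object KSStatisticSketch {
  // One-sample, two-sided KS statistic for a pre-sorted sample against a
  // theoretical CDF. For the i-th sorted value (0-based), the empirical CDF
  // jumps from i/n to (i + 1)/n, so both candidate distances must be checked.
  def ksStatistic(sorted: Seq[Double], cdf: Double => Double): Double = {
    val n = sorted.size.toDouble
    sorted.zipWithIndex.map { case (x, i) =>
      math.max(math.abs((i + 1) / n - cdf(x)), math.abs(cdf(x) - i / n))
    }.max
  }

  def main(args: Array[String]): Unit = {
    // Uniform(0,1) sample against the Uniform CDF (identity on [0, 1]).
    val d = ksStatistic(Seq(0.1, 0.2, 0.5, 0.7, 0.9), x => x)
    println(d) // largest gap is at x = 0.2, where the empirical CDF reaches 0.4
  }
}
```

The distributed version cannot use the global index `i` directly, which is exactly why the PR tracks per-partition extrema plus element counts and resolves them after a collect.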

[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226906
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -90,3 +90,19 @@ class ChiSqTestResult private[stat] (override val 
pValue: Double,
   super.toString
   }
 }
+
+/**
+ * :: Experimental ::
+ * Object containing the test results for the Kolmogorov-Smirnov test.
+ */
+@Experimental
+class KSTestResult private[stat] (override val pValue: Double,
+override val statistic: Double,
+override val nullHypothesis: String) extends TestResult[Int] {
+
+  override val degreesOfFreedom = 0
+
+  override def toString: String = {
+"Kolmogorov Smirnov test summary:\n" + super.toString
--- End diff --

`Kolmogorov-Smirnov`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8830][SQL] native levenshtein distance

2015-07-08 Thread tarekauel
Github user tarekauel commented on the pull request:

https://github.com/apache/spark/pull/7236#issuecomment-119839244
  
Can someone help with this?





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226881
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala ---
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import scala.annotation.varargs
+
+import org.apache.commons.math3.distribution.{NormalDistribution, 
RealDistribution}
+import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+ * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+ * distribution of the sample data and the theoretical distribution we can 
provide a test for the
+ * null hypothesis that the sample data comes from that theoretical 
distribution.
+ * For more information on KS Test:
+ * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+ *
+ * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+ * passes. We sort the RDD, and then perform the following operations on a 
per-partition basis:
+ * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+ * cumulative distribution value. We know the latter to be correct, while 
the former will be off by
+ * a constant (how large the constant is depends on how many values 
precede it in other partitions).
+ * However, given that this constant simply shifts the ECDF upwards, but 
doesn't change its shape,
+ * and furthermore, that constant is the same within a given partition, we 
can pick 2 values
+ * in each partition that can potentially resolve to the largest global 
distance. Namely, we
+ * pick the minimum distance and the maximum distance. Additionally, we 
keep track of how many
+ * elements are in each partition. Once these three values have been 
returned for every partition,
+ * we can collect and operate locally. Locally, we can now adjust each 
distance by the appropriate
+ * constant (the cumulative sum of # of elements in the prior partitions 
divided by the data set
+ * size). Finally, we take the maximum absolute value, and this is the 
statistic.
--- End diff --

See my previous comments about the text.





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226885
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala ---
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import scala.annotation.varargs
+
+import org.apache.commons.math3.distribution.{NormalDistribution, 
RealDistribution}
+import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+ * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+ * distribution of the sample data and the theoretical distribution we can 
provide a test for the
+ * null hypothesis that the sample data comes from that theoretical 
distribution.
+ * For more information on KS Test:
+ * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+ *
+ * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+ * passes. We sort the RDD, and then perform the following operations on a 
per-partition basis:
+ * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+ * cumulative distribution value. We know the latter to be correct, while 
the former will be off by
+ * a constant (how large the constant is depends on how many values 
precede it in other partitions).
+ * However, given that this constant simply shifts the ECDF upwards, but 
doesn't change its shape,
+ * and furthermore, that constant is the same within a given partition, we 
can pick 2 values
+ * in each partition that can potentially resolve to the largest global 
distance. Namely, we
+ * pick the minimum distance and the maximum distance. Additionally, we 
keep track of how many
+ * elements are in each partition. Once these three values have been 
returned for every partition,
+ * we can collect and operate locally. Locally, we can now adjust each 
distance by the appropriate
+ * constant (the cumulative sum of # of elements in the prior partitions 
divided by the data set
+ * size). Finally, we take the maximum absolute value, and this is the 
statistic.
+ */
+private[stat] object KSTest extends Logging {
+
+  // Null hypothesis for the type of KS test to be included in the result.
+  object NullHypothesis extends Enumeration {
+type NullHypothesis = Value
+val oneSampleTwoSided = Value("Sample follows theoretical 
distribution")
+  }
+
+  /**
+   * Runs a KS test for 1 set of sample data, comparing it to a 
theoretical distribution
+   * @param data `RDD[Double]` data on which to run test
+   * @param cdf `Double => Double` function to calculate the theoretical 
CDF
+   * @return KSTestResult summarizing the test results (pval, statistic, 
and null hypothesis)
+   */
+  def testOneSample(data: RDD[Double], cdf: Double => Double): 
KSTestResult = {
+val n = data.count().toDouble
+val localData = data.sortBy(x => x).mapPartitions { part =>
+  val partDiffs = oneSampleDifferences(part, n, cdf) // local distances
+  searchOneSampleCandidates(partDiffs) // candidates: local extrema
+}.collect()
+val ksStat = searchOneSampleStatistic(localData, n) // result: global 
extreme
+evalOneSampleP(ksStat, n.toLong)
+  }
+
+  /**
+   * Runs a KS test for 1 set of sample data, comparing it to a 
theoretical distribution
+   * @param data `RDD[Double]` data on which to run test
+   * @param createDist `Unit => RealDistribution` function to create a 
theoretical distribution
+   * @return KSTestResult summarizing the test results (pval, statistic, 
and null hypothesis)
+   */
+  def testOneSample(data: RDD[Double], createDist: () => 
RealDistribution): KSTestResult = {
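`evalOneSampleP` in the diff above turns the statistic into a p-value. As a rough stand-in for readers, the asymptotic Kolmogorov series gives the same quantity for large n; this is an illustrative approximation only, since the PR delegates to commons-math3's `KolmogorovSmirnovTest` rather than evaluating the series itself:

```scala
object KSPValueSketch {
  // Asymptotic two-sided p-value:
  //   P(D_n > d) ~= 2 * sum_{k >= 1} (-1)^(k-1) * exp(-2 * k^2 * n * d^2).
  // Valid only for d > 0 and reasonably large n; small-n exact values differ.
  def approxPValue(d: Double, n: Long): Double = {
    require(d > 0.0, "the alternating series is only usable for a positive statistic")
    val t2 = n * d * d
    val s = (1 to 100).map(k => math.pow(-1.0, k - 1.0) * math.exp(-2.0 * k * k * t2)).sum
    math.min(1.0, math.max(0.0, 2.0 * s)) // clamp truncation noise into [0, 1]
  }
}
```

As expected, the approximation decreases monotonically in the statistic: a small gap on a large sample keeps a sizable p-value, while a large gap drives it toward zero.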

[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226882
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala ---
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import scala.annotation.varargs
+
+import org.apache.commons.math3.distribution.{NormalDistribution, 
RealDistribution}
+import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+ * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+ * distribution of the sample data and the theoretical distribution we can 
provide a test for the
+ * the null hypothesis that the sample data comes from that theoretical 
distribution.
+ * For more information on KS Test:
+ * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+ *
+ * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+ * passes. We sort the RDD, and then perform the following operations on a 
per-partition basis:
+ * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+ * cumulative distribution value. We know the latter to be correct, while 
the former will be off by
+ * a constant (how large the constant is depends on how many values 
precede it in other partitions).
+ * However, given that this constant simply shifts the ECDF upwards, but 
doesn't change its shape,
+ * and furthermore, that constant is the same within a given partition, we 
can pick 2 values
+ * in each partition that can potentially resolve to the largest global 
distance. Namely, we
+ * pick the minimum distance and the maximum distance. Additionally, we 
keep track of how many
+ * elements are in each partition. Once these three values have been 
returned for every partition,
+ * we can collect and operate locally. Locally, we can now adjust each 
distance by the appropriate
+ * constant (the cumulative sum of # of elements in the prior partitions 
divided by the data set
+ * size). Finally, we take the maximum absolute value, and this is the 
statistic.
+ */
+private[stat] object KSTest extends Logging {
+
+  // Null hypothesis for the type of KS test to be included in the result.
+  object NullHypothesis extends Enumeration {
+type NullHypothesis = Value
+val oneSampleTwoSided = Value("Sample follows theoretical 
distribution")
--- End diff --

minor: `oneSampleTwoSided` -> `OneSampleTwoSided`





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226883
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/KSTest.scala ---
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import scala.annotation.varargs
+
+import org.apache.commons.math3.distribution.{NormalDistribution, 
RealDistribution}
+import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+ * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+ * distribution of the sample data and the theoretical distribution we can 
provide a test for the
+ * null hypothesis that the sample data comes from that theoretical 
distribution.
+ * For more information on KS Test:
+ * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+ *
+ * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+ * passes. We sort the RDD, and then perform the following operations on a 
per-partition basis:
+ * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+ * cumulative distribution value. We know the latter to be correct, while 
the former will be off by
+ * a constant (how large the constant is depends on how many values 
precede it in other partitions).
+ * However, given that this constant simply shifts the ECDF upwards, but 
doesn't change its shape,
+ * and furthermore, that constant is the same within a given partition, we 
can pick 2 values
+ * in each partition that can potentially resolve to the largest global 
distance. Namely, we
+ * pick the minimum distance and the maximum distance. Additionally, we 
keep track of how many
+ * elements are in each partition. Once these three values have been 
returned for every partition,
+ * we can collect and operate locally. Locally, we can now adjust each 
distance by the appropriate
+ * constant (the cumulative sum of # of elements in the prior partitions 
divided by the data set
+ * size). Finally, we take the maximum absolute value, and this is the 
statistic.
+ */
+private[stat] object KSTest extends Logging {
+
+  // Null hypothesis for the type of KS test to be included in the result.
+  object NullHypothesis extends Enumeration {
+type NullHypothesis = Value
+val oneSampleTwoSided = Value("Sample follows theoretical 
distribution")
+  }
+
+  /**
+   * Runs a KS test for 1 set of sample data, comparing it to a 
theoretical distribution
+   * @param data `RDD[Double]` data on which to run test
+   * @param cdf `Double => Double` function to calculate the theoretical 
CDF
+   * @return KSTestResult summarizing the test results (pval, statistic, 
and null hypothesis)
+   */
+  def testOneSample(data: RDD[Double], cdf: Double => Double): 
KSTestResult = {
+val n = data.count().toDouble
+val localData = data.sortBy(x => x).mapPartitions { part =>
+  val partDiffs = oneSampleDifferences(part, n, cdf) // local distances
+  searchOneSampleCandidates(partDiffs) // candidates: local extrema
+}.collect()
+val ksStat = searchOneSampleStatistic(localData, n) // result: global 
extreme
+evalOneSampleP(ksStat, n.toLong)
+  }
+
+  /**
+   * Runs a KS test for 1 set of sample data, comparing it to a 
theoretical distribution
+   * @param data `RDD[Double]` data on which to run test
+   * @param createDist `Unit => RealDistribution` function to create a 
theoretical distribution
+   * @return KSTestResult summarizing the test results (pval, statistic, 
and null hypothesis)
+   */
+  def testOneSample(data: RDD[Double], createDist: () => 
RealDistribution): KSTestResult = {
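The min/max-candidate trick in the implementation note can be checked with a local simulation of partitions, where `grouped` stands in for RDD partitions; the names here are hypothetical, not the PR's actual helpers:

```scala
object KSPartitionSketch {
  // Each simulated partition restarts its empirical CDF at zero, so all of its
  // distances are shifted by the same unknown constant. Returning only
  // (min, max, count) per partition suffices: after subtracting the shift
  // priorCount / n, the global extremum is among these candidates.
  def ksViaPartitions(sorted: Seq[Double], cdf: Double => Double, numParts: Int): Double = {
    val n = sorted.size.toDouble
    val partSize = math.ceil(sorted.size.toDouble / numParts).toInt
    val perPartition = sorted.grouped(partSize).toSeq.map { part =>
      val diffs = part.zipWithIndex.flatMap { case (x, i) =>
        Seq(cdf(x) - i / n, cdf(x) - (i + 1) / n) // locally shifted candidates
      }
      (diffs.min, diffs.max, part.size)
    }
    var prior = 0L // elements in all preceding partitions
    perPartition.flatMap { case (mn, mx, cnt) =>
      val shift = prior / n
      prior += cnt
      Seq(math.abs(mn - shift), math.abs(mx - shift))
    }.max
  }
}
```

For any number of partitions this agrees with the single-pass, globally indexed formula, which is the invariant the Spark implementation relies on when it collects only three values per partition.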

[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226866
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala 
---
@@ -158,4 +158,47 @@ object Statistics {
   def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
 ChiSqTest.chiSquaredFeatures(data)
   }
+
+  /**
+   * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+   * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+   * distribution of the sample data and the theoretical distribution we 
can provide a test for the
+   * null hypothesis that the sample data comes from that theoretical 
distribution.
+   * For more information on KS Test:
+   * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+   *
+   * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+   * passes. We sort the RDD, and then perform the following operations on 
a per-partition basis:
+   * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+   * cumulative distribution value. We know the latter to be correct, 
while the former will be off
+   * by a constant (how large the constant is depends on how many values 
precede it in other
+   * partitions).However, given that this constant simply shifts the ECDF 
upwards, but doesn't
--- End diff --

`.However` -> `. However`

`ECDF` is not defined. This is not a standard term in statistics. 
`empirical CDF` is fine.





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226869
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala 
---
@@ -158,4 +158,47 @@ object Statistics {
   def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
 ChiSqTest.chiSquaredFeatures(data)
   }
+
+  /**
+   * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+   * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+   * distribution of the sample data and the theoretical distribution we 
can provide a test for the
+   * null hypothesis that the sample data comes from that theoretical 
distribution.
+   * For more information on KS Test:
+   * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+   *
+   * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+   * passes. We sort the RDD, and then perform the following operations on 
a per-partition basis:
+   * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+   * cumulative distribution value. We know the latter to be correct, 
while the former will be off
+   * by a constant (how large the constant is depends on how many values 
precede it in other
+   * partitions).However, given that this constant simply shifts the ECDF 
upwards, but doesn't
+   * change its shape, and furthermore, that constant is the same within a 
given partition, we can
+   * pick 2 values in each partition that can potentially resolve to the 
largest global distance.
+   * Namely, we pick the minimum distance and the maximum distance. 
Additionally, we keep track of
+   * how many elements are in each partition. Once these three values have 
been returned for every
+   * partition, we can collect and operate locally. Locally, we can now 
adjust each distance by the
+   * appropriate constant (the cumulative sum of # of elements in the 
prior partitions divided by
+   * the data set size). Finally, we take the maximum absolute value, and 
this is the statistic.
--- End diff --

I would move this paragraph inside the method as implementation details, or 
only keep a copy in `KSTest`. End users do not need to know it.





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226855
  
--- Diff: docs/mllib-statistics.md ---
@@ -422,6 +422,41 @@ for i, result in enumerate(featureTestResults):
 
 
 
+Additionally, MLlib provides a 1-sample, 2-sided implementation of the 
Kolmogorov-Smirnov test
--- End diff --

`Kolmogorov-Smirnov` -> `Kolmogorov-Smirnov (KS)`

Otherwise, we use `KS` without definition.





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226874
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala 
---
@@ -158,4 +158,47 @@ object Statistics {
   def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
 ChiSqTest.chiSquaredFeatures(data)
   }
+
+  /**
+   * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+   * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+   * distribution of the sample data and the theoretical distribution we 
can provide a test for the
+   * null hypothesis that the sample data comes from that theoretical 
distribution.
+   * For more information on KS Test:
+   * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+   *
+   * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+   * passes. We sort the RDD, and then perform the following operations on 
a per-partition basis:
+   * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+   * cumulative distribution value. We know the latter to be correct, 
while the former will be off
+   * by a constant (how large the constant is depends on how many values 
precede it in other
+   * partitions).However, given that this constant simply shifts the ECDF 
upwards, but doesn't
+   * change its shape, and furthermore, that constant is the same within a 
given partition, we can
+   * pick 2 values in each partition that can potentially resolve to the 
largest global distance.
+   * Namely, we pick the minimum distance and the maximum distance. 
Additionally, we keep track of
+   * how many elements are in each partition. Once these three values have 
been returned for every
+   * partition, we can collect and operate locally. Locally, we can now 
adjust each distance by the
+   * appropriate constant (the cumulative sum of # of elements in the 
prior partitions divided by
+   * the data set size). Finally, we take the maximum absolute value, and 
this is the statistic.
+   * @param data an `RDD[Double]` containing the sample of data to test
+   * @param cdf a `Double => Double` function to calculate the theoretical 
CDF at a given value
+   * @return KSTestResult object containing test statistic, p-value, and 
null hypothesis.
+   */
+  def ksTest(data: RDD[Double], cdf: Double => Double): KSTestResult = {
+KSTest.testOneSample(data, cdf)
+  }
+
+  /**
+   * Convenience function to conduct a one-sample, two sided Kolmogorov 
Smirnov test for probability
--- End diff --

`two sided Kolmogorov Smirnov` -> `two-sided Kolmogorov-Smirnov`





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226864
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala 
---
@@ -158,4 +158,47 @@ object Statistics {
   def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
 ChiSqTest.chiSquaredFeatures(data)
   }
+
+  /**
+   * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
--- End diff --

ditto: `Kolmogorov-Smirnov (KS)`





[GitHub] spark pull request: [SPARK-8280][SPARK-8281][SQL]Handle NaN, null ...

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6835#discussion_r34226853
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathFunctionsSuite.scala
 ---
@@ -92,6 +105,62 @@ class MathFunctionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 checkEvaluation(c(Literal(1.0), Literal.create(null, DoubleType)), 
null, create_row(null))
   }
 
+  private def checkNaN(
+expression: Expression, inputRow: InternalRow = EmptyRow): Unit = {
+checkNaNWithoutCodegen(expression, inputRow)
+checkNaNWithGeneratedProjection(expression, inputRow)
+checkNaNWithOptimization(expression, inputRow)
+  }
+
+  private def checkNaNWithoutCodegen(
+expression: Expression,
+expected: Any,
+inputRow: InternalRow = EmptyRow): Unit = {
+val actual = try evaluate(expression, inputRow) catch {
+  case e: Exception => fail(s"Exception evaluating $expression", e)
+}
+if (!actual.asInstanceOf[Double].isNaN) {
+  val input = if (inputRow == EmptyRow) "" else s", input: $inputRow"
+  fail(s"Incorrect evaluation (codegen off): $expression, " +
+s"actual: $actual, " +
+s"expected: NaN$input")
+}
+  }
+
--- End diff --

remove the extra blank line here





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226872
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala 
---
@@ -158,4 +158,47 @@ object Statistics {
   def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
 ChiSqTest.chiSquaredFeatures(data)
   }
+
+  /**
+   * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+   * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+   * distribution of the sample data and the theoretical distribution we 
can provide a test for the
+   * the null hypothesis that the sample data comes from that theoretical 
distribution.
+   * For more information on KS Test:
+   * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+   *
+   * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+   * passes. We sort the RDD, and then perform the following operations on 
a per-partition basis:
+   * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+   * cumulative distribution value. We know the latter to be correct, 
while the former will be off
+   * by a constant (how large the constant is depends on how many values 
precede it in other
+   * partitions).However, given that this constant simply shifts the ECDF 
upwards, but doesn't
+   * change its shape, and furthermore, that constant is the same within a 
given partition, we can
+   * pick 2 values in each partition that can potentially resolve to the 
largest global distance.
+   * Namely, we pick the minimum distance and the maximum distance. 
Additionally, we keep track of
+   * how many elements are in each partition. Once these three values have 
been returned for every
+   * partition, we can collect and operate locally. Locally, we can now 
adjust each distance by the
+   * appropriate constant (the cumulative sum of # of elements in the 
prior partitions divided by
+   * the data set size). Finally, we take the maximum absolute value, and 
this is the statistic.
+   * @param data an `RDD[Double]` containing the sample of data to test
+   * @param cdf a `Double => Double` function to calculate the theoretical 
CDF at a given value
+   * @return KSTestResult object containing test statistic, p-value, and 
null hypothesis.
--- End diff --

link `KSTestResult`
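The implementation note quoted in the diff above describes collecting, per partition, a minimum distance, a maximum distance, and an element count, then shifting each partition's distances locally by the fraction of data in earlier partitions. A minimal local sketch of that adjustment step (hedged: `PartStats` and `ksStatistic` are illustrative names, not Spark's actual `KSTest` code):

```scala
// Each partition i reports (minDist, maxDist, count), where the distances
// are ECDF - CDF values that are off by a constant: the fraction of the
// data set that precedes this partition. Names here are illustrative only.
case class PartStats(minDist: Double, maxDist: Double, count: Long)

def ksStatistic(parts: Seq[PartStats]): Double = {
  val n = parts.map(_.count).sum.toDouble
  // Cumulative count of elements preceding each partition.
  val prefix = parts.scanLeft(0L)(_ + _.count).init
  parts.zip(prefix).map { case (p, before) =>
    val shift = before / n // constant ECDF offset for this partition
    math.max(math.abs(p.minDist + shift), math.abs(p.maxDist + shift))
  }.max // largest adjusted absolute distance = the KS statistic
}
```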





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226868
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala 
---
@@ -158,4 +158,47 @@ object Statistics {
   def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
 ChiSqTest.chiSquaredFeatures(data)
   }
+
+  /**
+   * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+   * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+   * distribution of the sample data and the theoretical distribution we 
can provide a test for the
+   * the null hypothesis that the sample data comes from that theoretical 
distribution.
+   * For more information on KS Test:
+   * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+   *
+   * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+   * passes. We sort the RDD, and then perform the following operations on 
a per-partition basis:
+   * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+   * cumulative distribution value. We know the latter to be correct, 
while the former will be off
+   * by a constant (how large the constant is depends on how many values 
precede it in other
+   * partitions).However, given that this constant simply shifts the ECDF 
upwards, but doesn't
+   * change its shape, and furthermore, that constant is the same within a 
given partition, we can
+   * pick 2 values in each partition that can potentially resolve to the 
largest global distance.
+   * Namely, we pick the minimum distance and the maximum distance. 
Additionally, we keep track of
+   * how many elements are in each partition. Once these three values have 
been returned for every
+   * partition, we can collect and operate locally. Locally, we can now 
adjust each distance by the
+   * appropriate constant (the cumulative sum of # of elements in the 
prior partitions divided by
--- End diff --

`#` -> `number`





[GitHub] spark pull request: [SPARK-8280][SPARK-8281][SQL]Handle NaN, null ...

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6835#discussion_r34226814
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/MathExpressionsSuite.scala ---
@@ -69,7 +69,7 @@ class MathExpressionsSuite extends QueryTest {
 if (f(-1) === math.log1p(-1)) {
   checkAnswer(
 nnDoubleData.select(c('b)),
-(1 to 9).map(n => Row(f(n * -0.1))) :+ Row(Double.NegativeInfinity)
+(1 to 9).map(n => Row(f(n * -0.1))) :+ Row(null)
--- End diff --

why the change from -inf to null?





[GitHub] spark pull request: [SPARK-8280][SPARK-8281][SQL]Handle NaN, null ...

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6835#discussion_r34226650
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala
 ---
@@ -248,30 +269,81 @@ case class Hypot(left: Expression, right: Expression)
 case class Pow(left: Expression, right: Expression)
   extends BinaryMathExpression(math.pow, "POWER") {
   override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): 
String = {
-defineCodeGen(ctx, ev, (c1, c2) => s"java.lang.Math.pow($c1, $c2)") + 
s"""
-  if (Double.valueOf(${ev.primitive}).isNaN()) {
-${ev.isNull} = true;
-  }
-  """
+defineCodeGen(ctx, ev, (c1, c2) => s"java.lang.Math.pow($c1, $c2)")
   }
 }
 
 case class Logarithm(left: Expression, right: Expression)
   extends BinaryMathExpression((c1, c2) => math.log(c2) / math.log(c1), 
"LOG") {
-  def this(child: Expression) = {
-this(EulerNumber(), child)
-  }
 
-  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): 
String = {
-val logCode = if (left.isInstanceOf[EulerNumber]) {
-  defineCodeGen(ctx, ev, (c1, c2) => s"java.lang.Math.log($c2)")
+  override def eval(input: InternalRow): Any = {
+val evalE1 = left.eval(input)
+if (evalE1 == null || evalE1.asInstanceOf[Double] <= 0.0) {
+  null
 } else {
-  defineCodeGen(ctx, ev, (c1, c2) => s"java.lang.Math.log($c2) / 
java.lang.Math.log($c1)")
+  val evalE2 = right.eval(input)
+  if (evalE2 == null || evalE2.asInstanceOf[Double] <= 0.0) {
+null
+  } else {
+math.log(evalE2.asInstanceOf[Double]) / 
math.log(evalE1.asInstanceOf[Double])
+  }
 }
-logCode + s"""
-  if (Double.valueOf(${ev.primitive}).isNaN()) {
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): 
String = {
+val eval1 = left.gen(ctx)
+val eval2 = right.gen(ctx)
+s"""
+  ${eval1.code}
+  boolean ${ev.isNull} = ${eval1.isNull} || ${eval1.primitive} <= 0.0;
+  ${ctx.javaType(dataType)} ${ev.primitive} = 
${ctx.defaultValue(dataType)};
+  if (${ev.isNull}) {
 ${ev.isNull} = true;
+  } else {
+${eval2.code}
+if (${eval2.isNull} || ${eval2.primitive} <= 0.0) {
+  ${ev.isNull} = true;
+} else {
+  ${ev.primitive} = java.lang.Math.${funcName}(${eval2.primitive}) 
/
+   java.lang.Math.${funcName}(${eval1.primitive});
+}
   }
 """
   }
+
+  // TODO: Hive's UDFLog doesn't support base in range (0.0, 1.0]
+  // If we want just behaves like Hive, use the code below and turn 
`udf_7` on
+
+//  override def eval(input: InternalRow): Any = {
+//val evalE1 = left.eval(input)
+//val evalE2 = right.eval(input)
+//if (evalE1 == null || evalE2 == null) {
+//  null
+//} else {
+//  if (evalE1.asInstanceOf[Double] <= 1.0 || 
evalE2.asInstanceOf[Double] <= 0.0) {
+//null
+//  } else {
+//math.log(evalE2.asInstanceOf[Double]) / 
math.log(evalE1.asInstanceOf[Double])
+//  }
+//}
+//  }
--- End diff --

We should support these. Just remove the commented-out code, and add an inline
comment for log noting that we support bases in (0.0, 1.0], unlike Hive.
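The null semantics in the `eval` above (null or non-positive base or argument yields null, anything else yields `log(x)/log(base)`) can be sketched locally with `Option` standing in for SQL null. This is a hedged illustration (`logWithBase` is a hypothetical helper, not the Spark expression itself):

```scala
// LOG(base, x): None for null or non-positive inputs; otherwise
// log(x)/log(base). Note bases in (0.0, 1.0] are allowed, unlike Hive.
def logWithBase(base: Option[Double], x: Option[Double]): Option[Double] =
  for {
    b <- base if b > 0.0 // non-positive base -> null
    v <- x if v > 0.0    // non-positive argument -> null
  } yield math.log(v) / math.log(b)
```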






[GitHub] spark pull request: [SPARK-8464][Core][Shuffle] Consider separatin...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7129#issuecomment-119838560
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8464][Core][Shuffle] Consider separatin...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7129#issuecomment-119838214
  
  [Test build #36888 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36888/console)
 for   PR 7129 at commit 
[`8f6e327`](https://github.com/apache/spark/commit/8f6e327b47f3bbaebe012914dc5dbe8aa69a4781).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  protected[this] case class SpilledFile(`
  * `  protected[this] class SpillReader(spill: SpilledFile) `
  * `  protected[this] class IteratorForPartition(partitionId: Int,`






[GitHub] spark pull request: [SPARK-4897] [PySpark] Python 3 support

2015-07-08 Thread rilut
Github user rilut commented on the pull request:

https://github.com/apache/spark/pull/5173#issuecomment-119837999
  
Sorry, I'm in a remote location for months. Maybe you/anyone could
help us create a new issue if it is still unresolved.






[GitHub] spark pull request: [SPARK-8280][SPARK-8281][SQL]Handle NaN, null ...

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6835#discussion_r34226508
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala
 ---
@@ -259,19 +285,14 @@ case class Atan2(left: Expression, right: Expression)
 null
   } else {
 // With codegen, the values returned by -0.0 and 0.0 are 
different. Handled with +0.0
-val result = math.atan2(evalE1.asInstanceOf[Double] + 0.0,
+math.atan2(evalE1.asInstanceOf[Double] + 0.0,
--- End diff --

I don't think you need to wrap here.
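The `+ 0.0` in the diff above exists because IEEE 754 distinguishes `-0.0` from `0.0` and `atan2` is sensitive to the sign of zero; adding positive zero normalizes `-0.0` to `+0.0` so the interpreted path matches codegen. A small sketch of the effect:

```scala
// atan2 with a negative-zero first argument and negative second argument
// returns -pi; adding +0.0 first flips it to +pi, since -0.0 + 0.0 == +0.0.
val withNegZero = math.atan2(-0.0, -1.0)       // sign of zero matters
val normalized  = math.atan2(-0.0 + 0.0, -1.0) // normalized to +0.0
```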





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34226418
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala
 ---
@@ -600,3 +604,146 @@ case class Logarithm(left: Expression, right: 
Expression)
 """
   }
 }
+
+case class Round(child: Expression, scale: Expression) extends Expression 
with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(0))
+  }
+
+  override def children: Seq[Expression] = Seq(child, scale)
+
+  override def nullable: Boolean = true
+
+  override def foldable: Boolean = child.foldable
+
+  override lazy val dataType: DataType = child.dataType match {
+  case DecimalType.Fixed(p, s) => DecimalType(p, _scale)
+  case t => t
+}
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(
+// rely on precedence to implicit cast String into Double
+TypeCollection(DecimalType, DoubleType, FloatType, LongType, 
IntegerType, ShortType, ByteType),
+TypeCollection(LongType, IntegerType, ShortType, ByteType))
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+child.dataType match {
+  case _: NumericType => // satisfy requirement
+  case dt =>
+return TypeCheckFailure(s"Only numeric type is allowed for ROUND 
function, got $dt")
+}
+scale match {
+  case Literal(value, LongType) =>
+if (value.asInstanceOf[Long] < Int.MinValue || 
value.asInstanceOf[Long] > Int.MaxValue) {
+  return TypeCheckFailure("ROUND scale argument out of allowed 
range")
+}
+  case _ =>
+if (scale.dataType.isInstanceOf[IntegralType] && scale.foldable) {
+  // TODO: How to check out of range for foldable LongType 
Expression
+  // satisfy requirement
+} else {
+  return TypeCheckFailure("Only foldable Integral Expression " +
+s"is allowed for ROUND scale arguments, got ${child.dataType}")
+}
+}
+TypeCheckSuccess
+  }
+
+  private lazy val scaleV = scale.eval(EmptyRow)
+  private lazy val _scale = if (scaleV != null) scaleV.asInstanceOf[Int] 
else 0
+
+  override def eval(input: InternalRow): Any = {
+val evalE = child.eval(input)
+if (evalE == null || scaleV == null) return null
+round(evalE)
+  }
+
+  private lazy val round: (Any) => (Any) = typedRound(child.dataType)
+
+  // Using dataType info to find an appropriate round method
+  private def typedRound(dt: DataType)(x: Any): Any = {
+dt match {
+  case _: DecimalType =>
+val decimal = x.asInstanceOf[Decimal]
+if (decimal.changePrecision(decimal.precision, _scale)) decimal 
else null
+  case ByteType =>
+numericRound(x.asInstanceOf[Byte], _scale)
+  case ShortType =>
+numericRound(x.asInstanceOf[Short], _scale)
+  case IntegerType =>
+numericRound(x.asInstanceOf[Int], _scale)
+  case LongType =>
+numericRound(x.asInstanceOf[Long], _scale)
+  case FloatType =>
+numericRound(x.asInstanceOf[Float], _scale)
+  case DoubleType =>
+numericRound(x.asInstanceOf[Double], _scale)
+}
+  }
+
+  private def numericRound[T](input: T, scale: Int)(implicit bdc: 
BigDecimalConverter[T]): T = {
+input match {
+  case f: Float if (f.isNaN || f.isInfinite) => return input
+  case d: Double if (d.isNaN || d.isInfinite) => return input
+  case _ =>
+}
+bdc.fromBigDecimal(bdc.toBigDecimal(input).setScale(scale, 
BigDecimal.RoundingMode.HALF_UP))
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): 
String = {
+val ce = child.gen(ctx)
+
+def round(primitive: String, integral: Boolean): String = {
+  val (p1, p2) = if (integral) ("new", "") else ("", ".valueOf")
+  s"""
+  ${ev.primitive} = $p1 java.math.BigDecimal$p2(${primitive}).
+setScale(${_scale}, java.math.BigDecimal.ROUND_HALF_UP)"""
+}
+
+def fractionalCheck(primitive: String, function: String): String = {
+  s"""
+  if (Double.isNaN(${primitive}) || Double.isInfinite(${primitive})){
+${ev.primitive} = ${primitive};
+  } else {
+${round(primitive, false)}.${function};
+  }"""
+}
+
+def decimalRound(): String = {
+  s"""
+  if (${ce.primitive}.changePrecision(${ce.primitive}.precision(), 
${_scale})) {
+${ev.primitive} = ${ce.primitive};
+  } else {
+${ev.isNull} = true;
+  }
+  """
+}
   

[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...

2015-07-08 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/7280#discussion_r34226374
  
--- Diff: R/pkg/inst/tests/test_sparkSQL.R ---
@@ -108,6 +108,14 @@ test_that("create DataFrame from RDD", {
   expect_equal(count(df), 10)
   expect_equal(columns(df), c("a", "b"))
   expect_equal(dtypes(df), list(c("a", "int"), c("b", "string")))
+
+  localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 
18), height=c(164.10, 181.4, 173.7))
+  schema <- structType(structField("name", "string"), structField("age", 
"integer"), structField("height", "float"))
+  df <- createDataFrame(sqlContext, localDF, schema)
--- End diff --

I checked this. The column is still `double` due to another problem I just 
submitted in #7311. That is, in `createDataFrame`, the given `schema` will be 
overwritten.

Although I solved that in #7311, I just found that a user-defined
schema can cause problems when collecting data from the dataframe.

That is because we serialize `double` in R to `Double` in Java. If we
define a column as `float` in R and create a dataframe based on this schema,
the serialized and deserialized `Double` will be stored in the `float` column.
Then when we collect the data from it, it will throw an error.

@shivaram What do you think? Do we need to fix #7311? Or do you think it is up
to users to define the correct schema?
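The failure mode described above (a boxed `Double` landing in a `float` column and blowing up on read) can be reproduced in miniature on the JVM side. A hedged sketch, not SparkR's actual deserializer:

```scala
// What R's serializer produces: a boxed java.lang.Double held as Any.
val boxed: Any = java.lang.Double.valueOf(173.7)

// A FloatType column expects to unbox a java.lang.Float; unboxing the
// Double as Float throws ClassCastException, while reading it back as
// Double works fine.
val failsAsFloat =
  try { boxed.asInstanceOf[Float]; false }
  catch { case _: ClassCastException => true }
```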





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34226166
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala
 ---
@@ -600,3 +604,146 @@ case class Logarithm(left: Expression, right: 
Expression)
 """
   }
 }
+
+case class Round(child: Expression, scale: Expression) extends Expression 
with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(0))
+  }
+
+  override def children: Seq[Expression] = Seq(child, scale)
+
+  override def nullable: Boolean = true
+
+  override def foldable: Boolean = child.foldable
+
+  override lazy val dataType: DataType = child.dataType match {
+  case DecimalType.Fixed(p, s) => DecimalType(p, _scale)
+  case t => t
+}
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(
+// rely on precedence to implicit cast String into Double
+TypeCollection(DecimalType, DoubleType, FloatType, LongType, 
IntegerType, ShortType, ByteType),
+TypeCollection(LongType, IntegerType, ShortType, ByteType))
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+child.dataType match {
+  case _: NumericType => // satisfy requirement
+  case dt =>
+return TypeCheckFailure(s"Only numeric type is allowed for ROUND 
function, got $dt")
+}
+scale match {
+  case Literal(value, LongType) =>
+if (value.asInstanceOf[Long] < Int.MinValue || 
value.asInstanceOf[Long] > Int.MaxValue) {
+  return TypeCheckFailure("ROUND scale argument out of allowed 
range")
+}
+  case _ =>
+if (scale.dataType.isInstanceOf[IntegralType] && scale.foldable) {
+  // TODO: How to check out of range for foldable LongType 
Expression
+  // satisfy requirement
+} else {
+  return TypeCheckFailure("Only foldable Integral Expression " +
+s"is allowed for ROUND scale arguments, got ${child.dataType}")
+}
+}
+TypeCheckSuccess
+  }
+
+  private lazy val scaleV = scale.eval(EmptyRow)
+  private lazy val _scale = if (scaleV != null) scaleV.asInstanceOf[Int] 
else 0
+
+  override def eval(input: InternalRow): Any = {
+val evalE = child.eval(input)
+if (evalE == null || scaleV == null) return null
+round(evalE)
+  }
+
+  private lazy val round: (Any) => (Any) = typedRound(child.dataType)
+
+  // Using dataType info to find an appropriate round method
+  private def typedRound(dt: DataType)(x: Any): Any = {
+dt match {
+  case _: DecimalType =>
+val decimal = x.asInstanceOf[Decimal]
+if (decimal.changePrecision(decimal.precision, _scale)) decimal 
else null
+  case ByteType =>
+numericRound(x.asInstanceOf[Byte], _scale)
+  case ShortType =>
+numericRound(x.asInstanceOf[Short], _scale)
+  case IntegerType =>
+numericRound(x.asInstanceOf[Int], _scale)
+  case LongType =>
+numericRound(x.asInstanceOf[Long], _scale)
+  case FloatType =>
+numericRound(x.asInstanceOf[Float], _scale)
+  case DoubleType =>
+numericRound(x.asInstanceOf[Double], _scale)
+}
+  }
+
+  private def numericRound[T](input: T, scale: Int)(implicit bdc: 
BigDecimalConverter[T]): T = {
+input match {
+  case f: Float if (f.isNaN || f.isInfinite) => return input
+  case d: Double if (d.isNaN || d.isInfinite) => return input
+  case _ =>
+}
+bdc.fromBigDecimal(bdc.toBigDecimal(input).setScale(scale, 
BigDecimal.RoundingMode.HALF_UP))
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): 
String = {
+val ce = child.gen(ctx)
+
+def round(primitive: String, integral: Boolean): String = {
+  val (p1, p2) = if (integral) ("new", "") else ("", ".valueOf")
+  s"""
+  ${ev.primitive} = $p1 java.math.BigDecimal$p2(${primitive}).
+setScale(${_scale}, java.math.BigDecimal.ROUND_HALF_UP)"""
+}
+
+def fractionalCheck(primitive: String, function: String): String = {
+  s"""
+  if (Double.isNaN(${primitive}) || Double.isInfinite(${primitive})){
+${ev.primitive} = ${primitive};
+  } else {
+${round(primitive, false)}.${function};
+  }"""
+}
+
+def decimalRound(): String = {
+  s"""
--- End diff --

move this into the case below. don't create a function



[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34226133
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BigDecimalConverter.scala
 ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+trait BigDecimalConverter[T] {
+  def toBigDecimal(in: T): BigDecimal
+  def fromBigDecimal(bd: BigDecimal): T
+}
+
+/**
+ * Helper type converters to work with BigDecimal
+ * from http://stackoverflow.com/a/30979266/1115193
+ */
+object BigDecimalConverter {
--- End diff --

I think we should just remove this class, and inline everything into round 
companion object.

No need for the implicit magic here.
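Without the converter class, the core of `numericRound` is just a round trip through `BigDecimal` with HALF_UP rounding. A minimal sketch for the `Double` case (hedged: `roundHalfUp` is an illustrative helper, not the inlined Spark code):

```scala
// HALF_UP rounds away from zero on ties, unlike math.rint's half-even.
def roundHalfUp(x: Double, scale: Int): Double =
  BigDecimal(x).setScale(scale, BigDecimal.RoundingMode.HALF_UP).toDouble
```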





[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

2015-07-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6994#discussion_r34226093
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala 
---
@@ -158,4 +158,47 @@ object Statistics {
   def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
 ChiSqTest.chiSquaredFeatures(data)
   }
+
+  /**
+   * Conduct the two-sided Kolmogorov Smirnov test for data sampled from a
+   * continuous distribution. By comparing the largest difference between 
the empirical cumulative
+   * distribution of the sample data and the theoretical distribution we 
can provide a test for the
+   * the null hypothesis that the sample data comes from that theoretical 
distribution.
+   * For more information on KS Test:
+   * @see [[https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test]]
+   *
+   * Implementation note: We seek to implement the KS test with a minimal 
number of distributed
+   * passes. We sort the RDD, and then perform the following operations on 
a per-partition basis:
+   * calculate an empirical cumulative distribution value for each 
observation, and a theoretical
+   * cumulative distribution value. We know the latter to be correct, 
while the former will be off
+   * by a constant (how large the constant is depends on how many values 
precede it in other
+   * partitions).However, given that this constant simply shifts the ECDF 
upwards, but doesn't
+   * change its shape, and furthermore, that constant is the same within a 
given partition, we can
+   * pick 2 values in each partition that can potentially resolve to the 
largest global distance.
+   * Namely, we pick the minimum distance and the maximum distance. 
Additionally, we keep track of
+   * how many elements are in each partition. Once these three values have 
been returned for every
+   * partition, we can collect and operate locally. Locally, we can now 
adjust each distance by the
+   * appropriate constant (the cumulative sum of # of elements in the 
prior partitions divided by
+   * the data set size). Finally, we take the maximum absolute value, and 
this is the statistic.
+   * @param data an `RDD[Double]` containing the sample of data to test
+   * @param cdf a `Double => Double` function to calculate the theoretical 
CDF at a given value
+   * @return KSTestResult object containing test statistic, p-value, and 
null hypothesis.
+   */
+  def ksTest(data: RDD[Double], cdf: Double => Double): KSTestResult = {
+KSTest.testOneSample(data, cdf)
+  }
+
+  /**
+   * Convenience function to conduct a one-sample, two sided Kolmogorov 
Smirnov test for probability
+   * distribution equality. Currently supports the normal distribution, 
taking as parameters
+   * the mean and standard deviation.
+   * (distName = "norm")
+   * @param data an `RDD[Double]` containing the sample of data to test
+   * @param distName a `String` name for a theoretical distribution
+   * @param params `Double*` specifying the parameters to be used for the 
theoretical distribution
+   * @return KSTestResult object containing test statistic, p-value, and 
null hypothesis.
+   */
+  def ksTest(data: RDD[Double], distName: String, params: Double*): 
KSTestResult = {
--- End diff --

The issue with overloading the name would show up in the Python API, 
because you cannot declare two methods with the same name. Then under this 
method, you cannot call the second argument `distName` or `data2`, which has to 
be more general like `y`. This is R's doc for the second arg:

~~~
y   either a numeric vector of data values, or a character string naming a
cumulative distribution function or an actual cumulative distribution
function such as pnorm. Only continuous CDFs are valid.
~~~

MATLAB uses `kstest2`. We can discuss more in the 2-sample test PR.

@srowen This is mostly mirroring R's API. No strong preference, but I would 
never type `kolmogorovSmirnovTest` without auto-completion. (Well, I just typed 
it ...)







[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34226080
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala ---
@@ -600,3 +604,146 @@ case class Logarithm(left: Expression, right: Expression)
 """
   }
 }
+
+case class Round(child: Expression, scale: Expression) extends Expression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+    this(child, Literal(0))
+  }
+
+  override def children: Seq[Expression] = Seq(child, scale)
+
+  override def nullable: Boolean = true
+
+  override def foldable: Boolean = child.foldable
+
+  override lazy val dataType: DataType = child.dataType match {
+    case DecimalType.Fixed(p, s) => DecimalType(p, _scale)
+    case t => t
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(
+    // rely on precedence to implicit cast String into Double
+    TypeCollection(DecimalType, DoubleType, FloatType, LongType, IntegerType, ShortType, ByteType),
+    TypeCollection(LongType, IntegerType, ShortType, ByteType))
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case _: NumericType => // satisfy requirement
+      case dt =>
+        return TypeCheckFailure(s"Only numeric type is allowed for ROUND function, got $dt")
+    }
+    scale match {
+      case Literal(value, LongType) =>
+        if (value.asInstanceOf[Long] < Int.MinValue || value.asInstanceOf[Long] > Int.MaxValue) {
+          return TypeCheckFailure("ROUND scale argument out of allowed range")
+        }
+      case _ =>
+        if (scale.dataType.isInstanceOf[IntegralType] && scale.foldable) {
+          // TODO: How to check out of range for foldable LongType Expression
+          // satisfy requirement
+        } else {
+          return TypeCheckFailure("Only foldable Integral Expression " +
+            s"is allowed for ROUND scale arguments, got ${child.dataType}")
+        }
+    }
+    TypeCheckSuccess
+  }
+
+  private lazy val scaleV = scale.eval(EmptyRow)
+  private lazy val _scale = if (scaleV != null) scaleV.asInstanceOf[Int] else 0
+
+  override def eval(input: InternalRow): Any = {
+    val evalE = child.eval(input)
+    if (evalE == null || scaleV == null) return null
+    round(evalE)
+  }
+
+  private lazy val round: (Any) => (Any) = typedRound(child.dataType)
+
+  // Using dataType info to find an appropriate round method
+  private def typedRound(dt: DataType)(x: Any): Any = {
--- End diff --

this is too complicated, I think.

just inline this in `nullSafeEval` - no need to create a function or use currying here. performance for the non-codegen path doesn't matter that much.






[GitHub] spark pull request: [SPARK-8931] [SQL] Fallback to interpreted eva...

2015-07-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/7309#issuecomment-119831685
  
There are a few ways we can do this. Rather than relying on environment variables, I'd consider making this an undocumented SQLConf setting and then using the test runner's system properties configuration to set that property in tests. For an example of this, look at the SparkConf setting that's used to control whether we throw exceptions when TaskMemoryManager detects a managed memory leak.

I think there's also a `spark.testing` system property / SparkConf that you might be able to use, but I'd grep the build configurations to confirm.
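The mechanism described above -- a setting that defaults to the forgiving production behavior but is flipped by the test harness so bugs fail loudly -- can be sketched in miniature (the flag name and functions here are hypothetical, not actual Spark configuration):

```python
import os


def evaluate_with_fallback(codegen, interpret, expr):
    """Use code generation, falling back to interpreted evaluation on
    failure - unless the (hypothetical) test flag disables the fallback,
    so that codegen bugs surface as failures instead of being masked."""
    allow_fallback = os.environ.get("CODEGEN_FALLBACK", "true") == "true"
    try:
        return codegen(expr)
    except Exception:
        if not allow_fallback:
            raise
        return interpret(expr)
```

In Spark the toggle would live in SQLConf and be set through the build's system properties rather than an environment variable.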





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34226048
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala ---
@@ -600,3 +604,146 @@ case class Logarithm(left: Expression, right: Expression)
 """
   }
 }
+
+case class Round(child: Expression, scale: Expression) extends Expression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+    this(child, Literal(0))
+  }
+
+  override def children: Seq[Expression] = Seq(child, scale)
+
+  override def nullable: Boolean = true
+
+  override def foldable: Boolean = child.foldable
+
+  override lazy val dataType: DataType = child.dataType match {
+    case DecimalType.Fixed(p, s) => DecimalType(p, _scale)
+    case t => t
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(
+    // rely on precedence to implicit cast String into Double
+    TypeCollection(DecimalType, DoubleType, FloatType, LongType, IntegerType, ShortType, ByteType),
+    TypeCollection(LongType, IntegerType, ShortType, ByteType))
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case _: NumericType => // satisfy requirement
+      case dt =>
+        return TypeCheckFailure(s"Only numeric type is allowed for ROUND function, got $dt")
+    }
+    scale match {
+      case Literal(value, LongType) =>
+        if (value.asInstanceOf[Long] < Int.MinValue || value.asInstanceOf[Long] > Int.MaxValue) {
+          return TypeCheckFailure("ROUND scale argument out of allowed range")
+        }
+      case _ =>
+        if (scale.dataType.isInstanceOf[IntegralType] && scale.foldable) {
+          // TODO: How to check out of range for foldable LongType Expression
+          // satisfy requirement
+        } else {
+          return TypeCheckFailure("Only foldable Integral Expression " +
+            s"is allowed for ROUND scale arguments, got ${child.dataType}")
+        }
+    }
+    TypeCheckSuccess
+  }
+
+  private lazy val scaleV = scale.eval(EmptyRow)
+  private lazy val _scale = if (scaleV != null) scaleV.asInstanceOf[Int] else 0
+
+  override def eval(input: InternalRow): Any = {
--- End diff --

use `nullSafeEval`





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34226031
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala ---
@@ -600,3 +604,146 @@ case class Logarithm(left: Expression, right: Expression)
 """
   }
 }
+
+case class Round(child: Expression, scale: Expression) extends Expression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+    this(child, Literal(0))
+  }
+
+  override def children: Seq[Expression] = Seq(child, scale)
+
+  override def nullable: Boolean = true
+
+  override def foldable: Boolean = child.foldable
+
+  override lazy val dataType: DataType = child.dataType match {
+    case DecimalType.Fixed(p, s) => DecimalType(p, _scale)
+    case t => t
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(
+    // rely on precedence to implicit cast String into Double
+    TypeCollection(DecimalType, DoubleType, FloatType, LongType, IntegerType, ShortType, ByteType),
+    TypeCollection(LongType, IntegerType, ShortType, ByteType))
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case _: NumericType => // satisfy requirement
+      case dt =>
+        return TypeCheckFailure(s"Only numeric type is allowed for ROUND function, got $dt")
+    }
+    scale match {
+      case Literal(value, LongType) =>
+        if (value.asInstanceOf[Long] < Int.MinValue || value.asInstanceOf[Long] > Int.MaxValue) {
+          return TypeCheckFailure("ROUND scale argument out of allowed range")
+        }
+      case _ =>
+        if (scale.dataType.isInstanceOf[IntegralType] && scale.foldable) {
+          // TODO: How to check out of range for foldable LongType Expression
+          // satisfy requirement
+        } else {
+          return TypeCheckFailure("Only foldable Integral Expression " +
+            s"is allowed for ROUND scale arguments, got ${child.dataType}")
--- End diff --

If you accept only IntegerType for the 2nd argument, a long will get implicitly cast to integer by the optimizer, so I think you can just check whether it is foldable.

also you should call super.checkInputDataTypes() to make sure the generic type checker can run successfully, and only run the foldability check on the 2nd argument after that.
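The ordering suggested above -- run the generic input-type check first, then the expression-specific foldability check -- is a general validation pattern. A minimal sketch (class and method names are illustrative, not Catalyst's actual API):

```python
class Expression:
    def check_input_data_types(self):
        """Generic check driven by declared input types (standing in for
        what super.checkInputDataTypes() provides in Catalyst)."""
        return (True, "")


class Round(Expression):
    def __init__(self, scale_foldable):
        self.scale_foldable = scale_foldable

    def check_input_data_types(self):
        # 1) let the generic type checker run first ...
        ok, msg = super().check_input_data_types()
        if not ok:
            return (ok, msg)
        # 2) ... and only then apply the expression-specific constraint
        if not self.scale_foldable:
            return (False, "Only foldable expressions are allowed for ROUND scale")
        return (True, "")
```

Running the generic check first means the specialized check can assume the arguments already have valid types.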






[GitHub] spark pull request: [SPARK-8881] Fix algorithm for scheduling exec...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7274#issuecomment-119830135
  
  [Test build #36897 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36897/console) for PR 7274 at commit [`2d6371c`](https://github.com/apache/spark/commit/2d6371c82b74b12d87b6d7ff93a2ec590d11516a).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-119830126
  
  [Test build #36899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36899/consoleFull) for PR 7255 at commit [`a6e0fc9`](https://github.com/apache/spark/commit/a6e0fc9b95d928527b3e74a1af8f3c21e2cb5172).





[GitHub] spark pull request: [SPARK-8881] Fix algorithm for scheduling exec...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7274#issuecomment-119830139
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8931] [SQL] Fallback to interpreted eva...

2015-07-08 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7309#issuecomment-119830088
  
Is there some environmental variable or config variable we set for tests?






[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7280#issuecomment-119830018
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Spark-7422][MLLIB] Add argmax to Vector, Spar...

2015-07-08 Thread GeorgeDittmar
Github user GeorgeDittmar commented on the pull request:

https://github.com/apache/spark/pull/6112#issuecomment-119829973
  
@mengxr is the MimaExcludes used for keeping builds clean between versions?





[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-119829987
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-119830006
  
Merged build started.





[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7280#issuecomment-119829957
  
  [Test build #36884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36884/console) for PR 7280 at commit [`6f9159d`](https://github.com/apache/spark/commit/6f9159dac8126cb1b714f9d37ed59aa932d5fad8).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7139#issuecomment-119829649
  
  [Test build #36898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36898/console) for PR 7139 at commit [`eff5ba1`](https://github.com/apache/spark/commit/eff5ba15333085adb9f95f1a953cf2c5f506fd2a).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread yijieshen
Github user yijieshen commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34225665
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala ---
@@ -600,3 +604,146 @@ case class Logarithm(left: Expression, right: Expression)
 """
   }
 }
+
+case class Round(child: Expression, scale: Expression) extends Expression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+    this(child, Literal(0))
+  }
+
+  override def children: Seq[Expression] = Seq(child, scale)
+
+  override def nullable: Boolean = true
--- End diff --

ok





[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7139#issuecomment-119829652
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7139#issuecomment-119829641
  
  [Test build #36898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36898/consoleFull) for PR 7139 at commit [`eff5ba1`](https://github.com/apache/spark/commit/eff5ba15333085adb9f95f1a953cf2c5f506fd2a).





[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7139#issuecomment-119829285
  
Merged build started.





[GitHub] spark pull request: [SPARK-8931] [SQL] Fallback to interpreted eva...

2015-07-08 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/7309#issuecomment-119829290
  
@JoshRosen I'm also thinking of that, but how can we easily turn it off for tests?





[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7139#issuecomment-119829259
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34225587
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala ---
@@ -600,3 +604,146 @@ case class Logarithm(left: Expression, right: Expression)
 """
   }
 }
+
+case class Round(child: Expression, scale: Expression) extends Expression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+    this(child, Literal(0))
+  }
+
+  override def children: Seq[Expression] = Seq(child, scale)
+
+  override def nullable: Boolean = true
--- End diff --

can you document when round can return null if child/scale are not null?





[GitHub] spark pull request: [WIP][SPARK-8313] R Spark packages support

2015-07-08 Thread brkyvz
Github user brkyvz commented on the pull request:

https://github.com/apache/spark/pull/7139#issuecomment-119828796
  
@shivaram @cafreeman I believe this is ready. I added unit and end-to-end tests.





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34225535
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/MathExpressionsSuite.scala ---
@@ -198,6 +198,15 @@ class MathExpressionsSuite extends QueryTest {
     testOneToOneMathFunction(rint, math.rint)
   }
 
+  test("round") {
+    checkAnswer(
--- End diff --

right now there is only one test for the SQL expression and no test for the DataFrame function you added to functions.scala. Just add one more test case that uses DataFrames.






[GitHub] spark pull request: [SPARK-8881] Fix algorithm for scheduling exec...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7274#issuecomment-119828401
  
  [Test build #36897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36897/consoleFull) for PR 7274 at commit [`2d6371c`](https://github.com/apache/spark/commit/2d6371c82b74b12d87b6d7ff93a2ec590d11516a).





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread yijieshen
Github user yijieshen commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34225366
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/MathExpressionsSuite.scala ---
@@ -198,6 +198,15 @@ class MathExpressionsSuite extends QueryTest {
     testOneToOneMathFunction(rint, math.rint)
   }
 
+  test("round") {
+    checkAnswer(
--- End diff --

you mean more test here?





[GitHub] spark pull request: [SPARK-8881] Fix algorithm for scheduling exec...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7274#issuecomment-119828071
  
Merged build started.





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r34225252
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/MathExpressionsSuite.scala ---
@@ -198,6 +198,15 @@ class MathExpressionsSuite extends QueryTest {
     testOneToOneMathFunction(rint, math.rint)
   }
 
+  test("round") {
+    checkAnswer(
--- End diff --

you should add a test case that calls the DataFrame round function too.






[GitHub] spark pull request: [SPARK-8881] Fix algorithm for scheduling exec...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7274#issuecomment-119828062
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-08 Thread tianyi
Github user tianyi commented on the pull request:

https://github.com/apache/spark/pull/7239#issuecomment-119827996
  
I agree with you.
Please remove the `trim` in `onStatementParsed`; it is really meaningless.
Besides that, @liancheng, this PR is now LGTM.





[GitHub] spark pull request: [SPARK-8881] Fix algorithm for scheduling exec...

2015-07-08 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/7274#issuecomment-119827713
  
Not sure... retest this please





[GitHub] spark pull request: [SPARK-7292] Local checkpointing

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119827564
  
  [Test build #1018 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1018/console) for PR 7279 at commit [`c449b38`](https://github.com/apache/spark/commit/c449b38f420e07e581541933c53060f319f948ec).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8942][SQL] use double not decimal when ...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7312#issuecomment-119827545
  
  [Test build #36896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36896/consoleFull) for PR 7312 at commit [`a4589fa`](https://github.com/apache/spark/commit/a4589fa23ceed1473fd9b315ca16e0d4773eff3a).





[GitHub] spark pull request: [SPARK-8926][SQL] Code review followup.

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7313#issuecomment-119827517
  
  [Test build #36895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36895/consoleFull) for PR 7313 at commit [`f8d5533`](https://github.com/apache/spark/commit/f8d55330f31be94627e4553c36e127c976ea3a50).





[GitHub] spark pull request: [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7301#issuecomment-119826318
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8942][SQL] use double not decimal when ...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7312#issuecomment-119825851
  
Merged build started.





[GitHub] spark pull request: [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [...

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7301#issuecomment-119826000
  
  [Test build #36887 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36887/console) for PR 7301 at commit [`e9217bd`](https://github.com/apache/spark/commit/e9217bd9d1f1cc53e99af4da51d3a63a4220f62b).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8942][SQL] use double not decimal when ...

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7312#issuecomment-119825780
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8926][SQL] Code review followup.

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7313#issuecomment-119825807
  
Merged build started.





[GitHub] spark pull request: [SPARK-8926][SQL] Code review followup.

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7313#issuecomment-119825758
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8926][SQL] Code review followup.

2015-07-08 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7313#issuecomment-119824675
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-8813][SQL] Support combine text/parquet...

2015-07-08 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/7210#issuecomment-119824697
  
Also, I think users can easily work around this issue without using 
`CombineFileInputFormat` by adding a `coalesce(n)` call, where `n` is the 
desired number of tasks. In MapReduce, the framework basically decides how many 
splits to use, but in Spark it can be controlled explicitly. For example:

```scala
sqlContext.read.parquet("hdfs://some/path").coalesce(1).collect()
```

This way, only a single task is used to read all the files at the given 
path. Does this trick work for you?
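The same trick works on plain RDDs as well. A minimal sketch, assuming a live `SparkContext` named `sc` and a placeholder input path (both illustrative, not from the PR):

```scala
// Sketch only: `sc` is an existing SparkContext; the HDFS path is a placeholder.
import org.apache.spark.rdd.RDD

val lines: RDD[String] = sc.textFile("hdfs://some/path") // one partition per input split by default
val fewer: RDD[String] = lines.coalesce(4)               // merge down to at most 4 partitions
println(fewer.partitions.length)                         // at most 4, so at most 4 read tasks
```

Note that `coalesce` with the default `shuffle = false` only merges partitions on the same executor (a narrow dependency), so it avoids moving data over the network, unlike `repartition`.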





[GitHub] spark pull request: [SPARK-8942][SQL] use double not decimal when ...

2015-07-08 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7312#issuecomment-119824609
  
Jenkins, retest this please.





  1   2   3   4   5   6   7   8   9   10   >