[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1733


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51866151
  
LGTM. Merged into both master and branch-1.1. Thanks!





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51857628
  
QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18340/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51853748
  
QA tests have started for PR 1733. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18340/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16085437
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import breeze.linalg.{DenseMatrix => BDM}
+import cern.jet.stat.Probability.chiSquareComplemented
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the chi-squared test for the input RDDs using the specified method.
+ * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
+ * on an input of type `Matrix` in which independence between columns is assessed.
+ * We also provide a method for computing the chi-squared statistic between each feature and the
+ * label for an input `RDD[LabeledPoint]`, return an `Array[ChiSquaredTestResult]` of size =
+ * number of features in the input RDD.
+ *
+ * Supported methods for goodness of fit: `pearson` (default)
+ * Supported methods for independence: `pearson` (default)
+ *
+ * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
+ */
+private[stat] object ChiSqTest extends Logging {
+
+  /**
+   * @param name String name for the method.
+   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
+   */
+  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
+
+  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
+  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
+    val dev = observed - expected
+    dev * dev / expected
+  })
+
+  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
+  object NullHypothesis extends Enumeration {
+    type NullHypothesis = Value
+    val goodnessOfFit = Value("observed follows the same distribution as expected.")
+    val independence = Value("observations in each column are statistically independent.")
+  }
+
+  // Method identification based on input methodName string
+  private def methodFromString(methodName: String): Method = {
+    methodName match {
+      case PEARSON.name => PEARSON
+      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
+    }
+  }
+
+  /**
+   * Conduct Pearson's independence test for each feature against the label across the input RDD.
+   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
+   * the independence test.
+   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
+   */
+  def chiSquaredFeatures(data: RDD[LabeledPoint],
+      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
+    val numCols = data.first().features.size
+    val results = new Array[ChiSqTestResult](numCols)
+    var labels = Array[Double]()
+    // At most 100 columns at a time
+    val batchSize = 100
+    var batch = 0
+    while (batch * batchSize < numCols) {
+      // The following block of code can be cleaned up and made public as
+      // chiSquared(data: RDD[(V1, V2)])
+      val startCol = batch * batchSize
+      val endCol = startCol + math.min(batchSize, numCols - startCol)
+      val pairCounts = data.flatMap { p =>
+        // assume dense vectors
+        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
+          (col, feature, p.label)
+        }
+      }.countByValue()
+
+      if (labels.size == 0) {
+        // Do this only once for the first column
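
For readers following the quoted diff, here is a minimal standalone sketch of how the per-cell `PEARSON` function shown above could be summed into a goodness-of-fit statistic. It is not taken from the patch: the object name `PearsonSketch`, the helper `goodnessOfFit`, and the rescaling of `expected` to the observed total are assumptions made for illustration, and the p-value step (which the patch delegates to `chiSquareComplemented`) is left out.

```scala
// Hypothetical illustration only; names below are not part of the pull request.
object PearsonSketch {
  // Per-cell contribution, mirroring the PEARSON function quoted in the diff.
  def cell(observed: Double, expected: Double): Double = {
    val dev = observed - expected
    dev * dev / expected
  }

  // Returns (statistic, degrees of freedom) for two equal-length count arrays.
  // Assumption: expected counts are positive and are rescaled to the observed total.
  def goodnessOfFit(observed: Array[Double], expected: Array[Double]): (Double, Int) = {
    require(observed.length == expected.length && observed.length > 1,
      "observed and expected must have the same size, greater than 1")
    require(expected.forall(_ > 0.0), "expected counts must be positive")
    val scale = observed.sum / expected.sum
    val statistic = observed.zip(expected).map { case (o, e) => cell(o, e * scale) }.sum
    (statistic, observed.length - 1)
  }
}
```

With observed counts (10, 20, 30) against a uniform expectation of (20, 20, 20), the cells contribute 5, 0 and 5, so the statistic is 10 with 2 degrees of freedom.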

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16085233
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSqTest.scala ---
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import breeze.linalg.{DenseMatrix => BDM}
+import cern.jet.stat.Probability.chiSquareComplemented
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the chi-squared test for the input RDDs using the specified method.
+ * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
+ * on an input of type `Matrix` in which independence between columns is assessed.
+ * We also provide a method for computing the chi-squared statistic between each feature and the
+ * label for an input `RDD[LabeledPoint]`, return an `Array[ChiSquaredTestResult]` of size =
+ * number of features in the input RDD.
+ *
+ * Supported methods for goodness of fit: `pearson` (default)
+ * Supported methods for independence: `pearson` (default)
+ *
+ * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
+ */
+private[stat] object ChiSqTest extends Logging {
+
+  /**
+   * @param name String name for the method.
+   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
+   */
+  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
+
+  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
+  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
+    val dev = observed - expected
+    dev * dev / expected
+  })
+
+  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
+  object NullHypothesis extends Enumeration {
+    type NullHypothesis = Value
+    val goodnessOfFit = Value("observed follows the same distribution as expected.")
+    val independence = Value("observations in each column are statistically independent.")
+  }
+
+  // Method identification based on input methodName string
+  private def methodFromString(methodName: String): Method = {
+    methodName match {
+      case PEARSON.name => PEARSON
+      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
+    }
+  }
+
+  /**
+   * Conduct Pearson's independence test for each feature against the label across the input RDD.
+   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
+   * the independence test.
+   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
+   */
+  def chiSquaredFeatures(data: RDD[LabeledPoint],
+      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
+    val numCols = data.first().features.size
+    val results = new Array[ChiSqTestResult](numCols)
+    var labels: Map[Double, Int] = null
+    // At most 100 columns at a time
+    val batchSize = 100
+    var batch = 0
+    while (batch * batchSize < numCols) {
+      // The following block of code can be cleaned up and made public as
+      // chiSquared(data: RDD[(V1, V2)])
+      val startCol = batch * batchSize
+      val endCol = startCol + math.min(batchSize, numCols - startCol)
+      val pairCounts = data.flatMap { p =>
+        // assume dense vectors
+        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
+          (col, feature, p.label)
+        }
+      }.countByValue()
+
+      if (labels == null) {
+        // Do this only once for the first column
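
The quoted loop stops before the contingency table is built, so the following is only a rough sketch of how (feature, label) pair counts for a single column might be turned into Pearson's independence statistic. It uses plain Scala collections rather than an RDD, and the object name `IndependenceSketch`, its signature, and the use of a `Map` keyed by (feature value, label) are assumptions for illustration, not the code under review.

```scala
// Hypothetical illustration only; not the implementation from the pull request.
object IndependenceSketch {
  // pairCounts maps (featureValue, label) to the number of rows with that pair,
  // similar in spirit to one column's share of the countByValue() result.
  // Returns (statistic, degrees of freedom).
  def statistic(pairCounts: Map[(Double, Double), Long]): (Double, Int) = {
    val total = pairCounts.values.sum.toDouble
    // Marginal totals per feature value (rows) and per label (columns).
    val rowSums = pairCounts.groupBy(_._1._1).map { case (f, m) => f -> m.values.sum.toDouble }
    val colSums = pairCounts.groupBy(_._1._2).map { case (l, m) => l -> m.values.sum.toDouble }
    val stat = (for {
      (f, r) <- rowSums.toSeq
      (l, c) <- colSums.toSeq
    } yield {
      // Expected count under independence: rowTotal * colTotal / grandTotal.
      val expected = r * c / total
      val observed = pairCounts.getOrElse((f, l), 0L).toDouble
      val dev = observed - expected
      dev * dev / expected
    }).sum
    (stat, (rowSums.size - 1) * (colSums.size - 1))
  }
}
```

Pairs that never occur are treated as zero counts, so every cell of the table contributes to the sum. The degrees of freedom follow the usual (rows - 1) * (columns - 1) rule.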

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51849419
  
QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18333/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51844208
  
QA tests have started for PR 1733. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18333/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-11 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16075494
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16026262
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-09 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024886
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024877
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import breeze.linalg.{DenseMatrix => BDM}
+import cern.jet.stat.Probability.chiSquareComplemented
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the chi-squared test for the input RDDs using the specified 
method.
+ * Goodness-of-fit test is conducted on two `Vectors`, whereas test of 
independence is conducted
+ * on an input of type `Matrix` in which independence between columns is 
assessed.
+ * We also provide a method for computing the chi-squared statistic 
between each feature and the
+ * label for an input `RDD[LabeledPoint]`, return an 
`Array[ChiSquaredTestResult]` of size =
+ * number of features in the inpuy RDD.
+ *
+ * Supported methods for goodness of fit: `pearson` (default)
+ * Supported methods for independence: `pearson` (default)
+ *
+ * More information on Chi-squared test: 
http://en.wikipedia.org/wiki/Chi-squared_test
+ */
+private[stat] object ChiSqTest extends Logging {
+
+  /**
+   * @param name String name for the method.
+   * @param chiSqFunc Function for computing the statistic given the 
observed and expected counts.
+   */
+  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
+
+  // Pearson's chi-squared test: 
http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
+  val PEARSON = new Method("pearson", (observed: Double, expected: Double) 
=> {
+val dev = observed - expected
+dev * dev / expected
+  })
+
+  // Null hypothesis for the two different types of chi-squared tests to 
be included in the result.
+  object NullHypothesis extends Enumeration {
+type NullHypothesis = Value
+val goodnessOfFit = Value("observed follows the same distribution as 
expected.")
+val independence = Value("observations in each column are 
statistically independent.")
+  }
+
+  // Method identification based on input methodName string
+  private def methodFromString(methodName: String): Method = {
+methodName match {
+  case PEARSON.name => PEARSON
+  case _ => throw new IllegalArgumentException("Unrecognized method 
for Chi squared test.")
+}
+  }
+
+  /**
+   * Conduct Pearson's independence test for each feature against the 
label across the input RDD.
+   * The contingency table is constructed from the raw (feature, label) 
pairs and used to conduct
+   * the independence test.
+   * Returns an array containing the ChiSquaredTestResult for every 
feature against the label.
+   */
+  def chiSquaredFeatures(data: RDD[LabeledPoint],
+      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
+    val numCols = data.first().features.size
+    val results = new Array[ChiSqTestResult](numCols)
+    var labels = Array[Double]()
+    // At most 100 columns at a time
+    val batchSize = 100
+    var batch = 0
+    while (batch * batchSize < numCols) {
+      // The following block of code can be cleaned up and made public as
+      // chiSquared(data: RDD[(V1, V2)])
+      val startCol = batch * batchSize
+      val endCol = startCol + math.min(batchSize, numCols - startCol)
+      val pairCounts = data.flatMap { p =>
+        // assume dense vectors
+        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
+          (col, feature, p.label)
+        }
+      }.countByValue()
+
+      if (labels.size == 0) {
+        // Do this only once for the first column
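
A minimal sketch, not part of the patch, of how the per-cell term computed by `PEARSON.chiSqFunc` could add up to a goodness-of-fit statistic. It uses plain Scala arrays rather than MLlib `Vector`s. The names `PearsonSketch` and `goodnessOfFitStatistic`, and the rescaling of `expected` to the total of `observed`, are assumptions made for illustration only.

```scala
object PearsonSketch {
  // Same per-cell term as the function passed to PEARSON in the diff above.
  def pearsonTerm(observed: Double, expected: Double): Double = {
    val dev = observed - expected
    dev * dev / expected
  }

  // Hypothetical helper: Pearson goodness-of-fit statistic over plain arrays.
  def goodnessOfFitStatistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length, "observed and expected must have the same size")
    require(expected.forall(_ > 0.0), "expected counts must be positive")
    // Assumption: rescale expected so that both arrays share the same total.
    val scale = observed.sum / expected.sum
    observed.zip(expected).map { case (o, e) => pearsonTerm(o, e * scale) }.sum
  }
}
```

With `n` categories the statistic would normally be compared against a chi-squared distribution with `n - 1` degrees of freedom.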

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024829
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import breeze.linalg.{DenseMatrix => BDM}
+import cern.jet.stat.Probability.chiSquareComplemented
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the chi-squared test for the input RDDs using the specified method.
+ * Goodness-of-fit test is conducted on two `Vectors`, whereas test of independence is conducted
+ * on an input of type `Matrix` in which independence between columns is assessed.
+ * We also provide a method for computing the chi-squared statistic between each feature and the
+ * label for an input `RDD[LabeledPoint]`, returning an `Array[ChiSquaredTestResult]` of size =
+ * number of features in the input RDD.
+ *
+ * Supported methods for goodness of fit: `pearson` (default)
+ * Supported methods for independence: `pearson` (default)
+ *
+ * More information on Chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
+ */
+private[stat] object ChiSqTest extends Logging {
+
+  /**
+   * @param name String name for the method.
+   * @param chiSqFunc Function for computing the statistic given the observed and expected counts.
+   */
+  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
+
+  // Pearson's chi-squared test: http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
+  val PEARSON = new Method("pearson", (observed: Double, expected: Double) => {
+    val dev = observed - expected
+    dev * dev / expected
+  })
+
+  // Null hypothesis for the two different types of chi-squared tests to be included in the result.
+  object NullHypothesis extends Enumeration {
+    type NullHypothesis = Value
+    val goodnessOfFit = Value("observed follows the same distribution as expected.")
+    val independence = Value("observations in each column are statistically independent.")
+  }
+
+  // Method identification based on input methodName string
+  private def methodFromString(methodName: String): Method = {
+    methodName match {
+      case PEARSON.name => PEARSON
+      case _ => throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
+    }
+  }
+
+  /**
+   * Conduct Pearson's independence test for each feature against the label across the input RDD.
+   * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
+   * the independence test.
+   * Returns an array containing the ChiSquaredTestResult for every feature against the label.
+   */
+  def chiSquaredFeatures(data: RDD[LabeledPoint],
+      methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
+    val numCols = data.first().features.size
+    val results = new Array[ChiSqTestResult](numCols)
+    var labels = Array[Double]()
+    // At most 100 columns at a time
+    val batchSize = 100
+    var batch = 0
+    while (batch * batchSize < numCols) {
+      // The following block of code can be cleaned up and made public as
+      // chiSquared(data: RDD[(V1, V2)])
+      val startCol = batch * batchSize
+      val endCol = startCol + math.min(batchSize, numCols - startCol)
+      val pairCounts = data.flatMap { p =>
+        // assume dense vectors
+        p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case (feature, col) =>
+          (col, feature, p.label)
+        }
+      }.countByValue()
+
+      if (labels.size == 0) {
+        // Do this only once for the first column
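
As a rough illustration of the step described in the doc comment above (building a contingency table from raw (feature, label) pairs and running Pearson's independence test on it), here is a sketch over local Scala collections instead of an RDD. `IndependenceSketch` and its three helpers are hypothetical names, not part of the patch, and the expected count for each cell is assumed to be rowSum * colSum / total.

```scala
object IndependenceSketch {
  // Hypothetical helper: rows are distinct feature values, columns are distinct
  // labels, both in order of first appearance; each cell is a pair count.
  def contingency(pairs: Seq[(Double, Double)]): Array[Array[Double]] = {
    require(pairs.nonEmpty, "need at least one (feature, label) pair")
    val features = pairs.map(_._1).distinct
    val labels = pairs.map(_._2).distinct
    val counts = pairs.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    features.map { f =>
      labels.map(l => counts.getOrElse((f, l), 0.0)).toArray
    }.toArray
  }

  // Pearson statistic, assuming every row and column sum is positive.
  def statistic(table: Array[Array[Double]]): Double = {
    val rowSums = table.map(_.sum)
    val colSums = table.transpose.map(_.sum)
    val total = rowSums.sum
    val terms = for {
      i <- table.indices
      j <- table(i).indices
    } yield {
      val expected = rowSums(i) * colSums(j) / total
      val dev = table(i)(j) - expected
      dev * dev / expected
    }
    terms.sum
  }

  // Degrees of freedom for an r x c table: (r - 1) * (c - 1).
  def degreesOfFreedom(table: Array[Array[Double]]): Int =
    (table.length - 1) * (table.head.length - 1)
}
```

A table built by `contingency` never has an empty row or column, because every row and column comes from at least one observed pair; a table supplied directly to `statistic` would need that checked first.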

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024698
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024688
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024028
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024037
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024031
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024040
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024043
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import breeze.linalg.{DenseMatrix => BDM}
+import cern.jet.stat.Probability.chiSquareComplemented
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the chi-squared test for the input RDDs using the specified 
method.
+ * Goodness-of-fit test is conducted on two `Vectors`, whereas test of 
independence is conducted
+ * on an input of type `Matrix` in which independence between columns is 
assessed.
+ * We also provide a method for computing the chi-squared statistic 
between each feature and the
+ * label for an input `RDD[LabeledPoint]`, return an 
`Array[ChiSquaredTestResult]` of size =
+ * number of features in the input RDD.
+ *
+ * Supported methods for goodness of fit: `pearson` (default)
+ * Supported methods for independence: `pearson` (default)
+ *
+ * More information on Chi-squared test: 
http://en.wikipedia.org/wiki/Chi-squared_test
+ */
+private[stat] object ChiSqTest extends Logging {
+
+  /**
+   * @param name String name for the method.
+   * @param chiSqFunc Function for computing the statistic given the 
observed and expected counts.
+   */
+  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
+
+  // Pearson's chi-squared test: 
http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
+  val PEARSON = new Method("pearson", (observed: Double, expected: Double) 
=> {
+val dev = observed - expected
+dev * dev / expected
+  })
+
+  // Null hypothesis for the two different types of chi-squared tests to 
be included in the result.
+  object NullHypothesis extends Enumeration {
+type NullHypothesis = Value
+val goodnessOfFit = Value("observed follows the same distribution as 
expected.")
+val independence = Value("observations in each column are 
statistically independent.")
+  }
+
+  // Method identification based on input methodName string
+  private def methodFromString(methodName: String): Method = {
+methodName match {
+  case PEARSON.name => PEARSON
+  case _ => throw new IllegalArgumentException("Unrecognized method 
for Chi squared test.")
+}
+  }
+
+  /**
+   * Conduct Pearson's independence test for each feature against the 
label across the input RDD.
+   * The contingency table is constructed from the raw (feature, label) 
pairs and used to conduct
+   * the independence test.
+   * Returns an array containing the ChiSquaredTestResult for every 
feature against the label.
+   */
+  def chiSquaredFeatures(data: RDD[LabeledPoint],
+  methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
+val numCols = data.first().features.size
+val results = new Array[ChiSqTestResult](numCols)
+var labels = Array[Double]()
+// At most 100 columns at a time
+val batchSize = 100
+var batch = 0
+while (batch * batchSize < numCols) {
+  // The following block of code can be cleaned up and made public as
+  // chiSquared(data: RDD[(V1, V2)])
+  val startCol = batch * batchSize
+  val endCol = startCol + math.min(batchSize, numCols - startCol)
+  val pairCounts = data.flatMap { p =>
+// assume dense vectors
+p.features.toArray.slice(startCol, endCol).zipWithIndex.map { case 
(feature, col) =>
+  (col, feature, p.label)
+}
+  }.countByValue()
+
+  if (labels.size == 0) {
+// Do this only once for the first column

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024030
  

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024033
  

[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16024027
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import breeze.linalg.{DenseMatrix => BDM}
+import cern.jet.stat.Probability.chiSquareComplemented
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the chi-squared test for the input RDDs using the specified 
method.
+ * Goodness-of-fit test is conducted on two `Vectors`, whereas test of 
independence is conducted
+ * on an input of type `Matrix` in which independence between columns is 
assessed.
+ * We also provide a method for computing the chi-squared statistic 
between each feature and the
+ * label for an input `RDD[LabeledPoint]`, return an 
`Array[ChiSquaredTestResult]` of size =
+ * number of features in the input RDD.
+ *
+ * Supported methods for goodness of fit: `pearson` (default)
+ * Supported methods for independence: `pearson` (default)
+ *
+ * More information on Chi-squared test: 
http://en.wikipedia.org/wiki/Chi-squared_test
+ */
+private[stat] object ChiSqTest extends Logging {
+
+  /**
+   * @param name String name for the method.
+   * @param chiSqFunc Function for computing the statistic given the 
observed and expected counts.
+   */
+  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
+
+  // Pearson's chi-squared test: 
http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
+  val PEARSON = new Method("pearson", (observed: Double, expected: Double) 
=> {
+val dev = observed - expected
+dev * dev / expected
+  })
+
+  // Null hypothesis for the two different types of chi-squared tests to 
be included in the result.
+  object NullHypothesis extends Enumeration {
+type NullHypothesis = Value
+val goodnessOfFit = Value("observed follows the same distribution as 
expected.")
+val independence = Value("observations in each column are 
statistically independent.")
+  }
+
+  // Method identification based on input methodName string
+  private def methodFromString(methodName: String): Method = {
+methodName match {
+  case PEARSON.name => PEARSON
+  case _ => throw new IllegalArgumentException("Unrecognized method 
for Chi squared test.")
+}
+  }
+
+  /**
+   * Conduct Pearson's independence test for each feature against the 
label across the input RDD.
+   * The contingency table is constructed from the raw (feature, label) 
pairs and used to conduct
+   * the independence test.
+   * Returns an array containing the ChiSquaredTestResult for every 
feature against the label.
+   */
+  def chiSquaredFeatures(data: RDD[LabeledPoint],
+  methodName: String = PEARSON.name): Array[ChiSqTestResult] = {
+val numCols = data.first().features.size
+val results = new Array[ChiSqTestResult](numCols)
+var labels = Array[Double]()
--- End diff --

`labels` could be initialized as `null`.
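
A minimal sketch of this suggestion, with a plain Scala collection standing in for the batched RDD loop. `LazyLabelsSketch`, `collectLabels`, and the in-memory `(col, feature, label)` triples are assumptions for illustration, not code from the PR:

```scala
// Hypothetical stand-in for the batch loop in chiSquaredFeatures (no Spark required).
object LazyLabelsSketch {
  // Each batch holds (col, feature, label) triples, mirroring the keys
  // that the diff passes to countByValue().
  def collectLabels(batches: Seq[Seq[(Int, Double, Double)]]): Array[Double] = {
    var labels: Array[Double] = null // null marks "not computed yet"
    for (batch <- batches) {
      if (labels == null) {
        // Do this only once, from the first column of the first batch.
        labels = batch.collect { case (0, _, label) => label }.distinct.toArray
      }
    }
    labels
  }
}
```

With an empty input the sketch returns `null`, so callers would have to check for it; an `Option[Array[Double]]` would express the same "not yet computed" state more safely.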





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51662596
  
QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18226/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51657309
  
QA tests have started for PR 1733. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18226/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51656796
  
QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18217/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16015802
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ * @tparam DF Return type of `degreesOfFreedom`
+ */
+@Experimental
+trait TestResult[DF] {
+
+  /**
+   *
+   */
+  def pValue: Double
+
+  /**
+   *
+   * @return
+   */
+  def degreesOfFreedom: DF
+
+  /**
+   *
+   * @return
+   */
+  def statistic: Double
+
+  /**
+   * String explaining the hypothesis test result.
+   * Specific classes implementing this trait should override this method 
to output test-specific
+   * information.
+   */
+  override def toString: String = {
+
+// String explaining what the p-value indicates.
+val pValueExplain = if (pValue <= 0.01) {
+  "Very strong presumption against null hypothesis."
+} else if (0.01 < pValue && pValue <= 0.05) {
+  "Strong presumption against null hypothesis."
+} else if (0.05 < pValue && pValue <= 0.1) {
+  "Low presumption against null hypothesis."
+} else {
+  "No presumption against null hypothesis."
+}
+
+s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
+s"statistic = $statistic \n" +
+s"pValue = $pValue \n" + pValueExplain
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Object containing the test results for the chi squared hypothesis test.
+ */
+@Experimental
+case class ChiSquaredTestResult(override val pValue: Double,
--- End diff --

Whether a correction is used can be reflected in the method name (`pearson`
vs. `yates`). I doubt there are many use cases for parsing the result back
from JSON, so let's not worry about it for now. I see case classes as data
structs that encapsulate immutable fields (the list of fields can be modified
in later releases, given that this is all experimental), but if there are
compiler optimization complications, I can change it to a regular class.
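
A minimal sketch of that idea, reusing the shape of `Method` and `methodFromString` from the diff. The `YATES` entry and the `statistic` helper are assumptions added for illustration (the PR only defines `PEARSON`); `YATES` applies the standard continuity correction, (|observed - expected| - 0.5)^2 / expected:

```scala
object MethodNameSketch {
  // Same shape as Method in ChiSquaredTest.scala.
  case class Method(name: String, chiSqFunc: (Double, Double) => Double)

  val PEARSON = Method("pearson", (observed: Double, expected: Double) => {
    val dev = observed - expected
    dev * dev / expected
  })

  // Hypothetical: not part of the PR.
  val YATES = Method("yates", (observed: Double, expected: Double) => {
    val dev = math.abs(observed - expected) - 0.5
    dev * dev / expected
  })

  // The method name alone tells the caller whether a correction was applied.
  def methodFromString(methodName: String): Method = methodName match {
    case PEARSON.name => PEARSON
    case YATES.name => YATES
    case _ =>
      throw new IllegalArgumentException("Unrecognized method for Chi squared test.")
  }

  // Hypothetical helper: sum the per-cell contributions for a goodness-of-fit statistic.
  def statistic(observed: Seq[Double], expected: Seq[Double], methodName: String): Double = {
    val method = methodFromString(methodName)
    observed.zip(expected).map { case (o, e) => method.chiSqFunc(o, e) }.sum
  }
}
```

For the observed counts (4, 6, 5) against an expected count of 5 per cell, `statistic(..., "pearson")` comes out at 0.4 (up to floating-point rounding), the same value asserted for this input in `HypothesisTestSuite`.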





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16014291
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ * @tparam DF Return type of `degreesOfFreedom`
+ */
+@Experimental
+trait TestResult[DF] {
+
+  /**
+   *
+   */
+  def pValue: Double
+
+  /**
+   *
+   * @return
+   */
+  def degreesOfFreedom: DF
+
+  /**
+   *
+   * @return
+   */
+  def statistic: Double
+
+  /**
+   * String explaining the hypothesis test result.
+   * Specific classes implementing this trait should override this method 
to output test-specific
+   * information.
+   */
+  override def toString: String = {
+
+// String explaining what the p-value indicates.
+val pValueExplain = if (pValue <= 0.01) {
+  "Very strong presumption against null hypothesis."
+} else if (0.01 < pValue && pValue <= 0.05) {
+  "Strong presumption against null hypothesis."
+} else if (0.05 < pValue && pValue <= 0.1) {
+  "Low presumption against null hypothesis."
+} else {
+  "No presumption against null hypothesis."
+}
+
+s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
+s"statistic = $statistic \n" +
+s"pValue = $pValue \n" + pValueExplain
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Object containing the test results for the chi squared hypothesis test.
+ */
+@Experimental
+case class ChiSquaredTestResult(override val pValue: Double,
--- End diff --

Btw, shall we rename it to `ChiSqTestResult`? Then `chiSqTest()` returns
`ChiSqTestResult`.





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16014267
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ * @tparam DF Return type of `degreesOfFreedom`
+ */
+@Experimental
+trait TestResult[DF] {
+
+  /**
+   *
+   */
+  def pValue: Double
+
+  /**
+   *
+   * @return
+   */
+  def degreesOfFreedom: DF
+
+  /**
+   *
+   * @return
+   */
+  def statistic: Double
+
+  /**
+   * String explaining the hypothesis test result.
+   * Specific classes implementing this trait should override this method 
to output test-specific
+   * information.
+   */
+  override def toString: String = {
+
+// String explaining what the p-value indicates.
+val pValueExplain = if (pValue <= 0.01) {
+  "Very strong presumption against null hypothesis."
+} else if (0.01 < pValue && pValue <= 0.05) {
+  "Strong presumption against null hypothesis."
+} else if (0.05 < pValue && pValue <= 0.1) {
+  "Low presumption against null hypothesis."
+} else {
+  "No presumption against null hypothesis."
+}
+
+s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
+s"statistic = $statistic \n" +
+s"pValue = $pValue \n" + pValueExplain
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Object containing the test results for the chi squared hypothesis test.
+ */
+@Experimental
+case class ChiSquaredTestResult(override val pValue: Double,
--- End diff --

No case class features are used, especially pattern matching. This case 
class will extend `Product5` and make it impossible to add a field, for 
example, whether correction is used or not. Also, with a case class, it is very 
hard to add a static method. We might want to write the test result to JSON and 
later parse it back. A natural choice would be 
`ChiSquaredTestResult.fromJSON(json: String)` but it is very complicated to 
match the type signature generated by Scala's compiler. We had this problem 
with `LabeledPoint` in MLlib, which is a public case class.
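
A rough sketch of the alternative, assuming a regular class with a hand-written companion object. `ChiSqResultSketch` and `fromKeyValues` are hypothetical names, the field list is inferred from the accessors used in the test suite, and the `key=value` parser is only a dependency-free stand-in for the `fromJSON` idea, not a JSON parser:

```scala
// Hypothetical sketch: a regular class rather than a case class.
class ChiSqResultSketch(
    val pValue: Double,
    val degreesOfFreedom: Int,
    val statistic: Double,
    val method: String,
    val nullHypothesis: String) {

  // Serialize as ';'-separated key=value pairs (values must not contain ';').
  override def toString: String =
    s"pValue=$pValue;degreesOfFreedom=$degreesOfFreedom;statistic=$statistic;" +
      s"method=$method;nullHypothesis=$nullHypothesis"
}

// The companion object is written by hand, so a static-style factory can live here.
object ChiSqResultSketch {
  def fromKeyValues(s: String): ChiSqResultSketch = {
    val kv = s.split(";").map { pair =>
      val idx = pair.indexOf('=')
      pair.substring(0, idx).trim -> pair.substring(idx + 1).trim
    }.toMap
    new ChiSqResultSketch(
      kv("pValue").toDouble,
      kv("degreesOfFreedom").toInt,
      kv("statistic").toDouble,
      kv("method"),
      kv("nullHypothesis"))
  }
}
```

A later release could add a further constructor parameter with a default value (for example, a correction flag) without changing how existing call sites are written.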





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51651306
  
QA tests have started for PR 1733. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18217/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16009835
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala ---
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrices, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.stat.test.ChiSquaredTest
+import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class HypothesisTestSuite extends FunSuite with LocalSparkContext {
+
+  test("chi squared pearson goodness of fit") {
+
+val observed = new DenseVector(Array[Double](4, 6, 5))
+val pearson = Statistics.chiSqTest(observed)
+
+// Results validated against the R command `chisq.test(c(4, 6, 5), 
p=c(1/3, 1/3, 1/3))`
+assert(pearson.statistic === 0.4)
+assert(pearson.degreesOfFreedom === 2)
+assert(pearson.pValue ~= 0.8187 relTol 1e-4)
+assert(pearson.method === ChiSquaredTest.PEARSON.name)
+assert(pearson.nullHypothesis === 
ChiSquaredTest.NullHypothesis.goodnessOfFit.toString)
+
+// different expected and observed sum
+val observed1 = new DenseVector(Array[Double](21, 38, 43, 80))
+val expected1 = new DenseVector(Array[Double](3, 5, 7, 20))
+val pearson1 = Statistics.chiSqTest(observed1, expected1)
+
+// Results validated against the R command
+// `chisq.test(c(21, 38, 43, 80), p=c(3/35, 1/7, 1/5, 4/7))`
+assert(pearson1.statistic ~= 14.1429 relTol 1e-4)
+assert(pearson1.degreesOfFreedom === 3)
+assert(pearson1.pValue ~= 0.002717 relTol 1e-4)
+assert(pearson1.method === ChiSquaredTest.PEARSON.name)
+assert(pearson1.nullHypothesis === 
ChiSquaredTest.NullHypothesis.goodnessOfFit.toString)
+
+// SparseVector representation to make sure memory doesn't blow up
--- End diff --

It's actually meant as a note to perf testers, but okay.





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r16009653
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ * @tparam DF Return type of `degreesOfFreedom`
+ */
+@Experimental
+trait TestResult[DF] {
+
+  /**
+   *
+   */
+  def pValue: Double
+
+  /**
+   *
+   * @return
+   */
+  def degreesOfFreedom: DF
+
+  /**
+   *
+   * @return
+   */
+  def statistic: Double
+
+  /**
+   * String explaining the hypothesis test result.
+   * Specific classes implementing this trait should override this method 
to output test-specific
+   * information.
+   */
+  override def toString: String = {
+
+// String explaining what the p-value indicates.
+val pValueExplain = if (pValue <= 0.01) {
+  "Very strong presumption against null hypothesis."
+} else if (0.01 < pValue && pValue <= 0.05) {
+  "Strong presumption against null hypothesis."
+} else if (0.05 < pValue && pValue <= 0.1) {
+  "Low presumption against null hypothesis."
+} else {
+  "No presumption against null hypothesis."
+}
+
+s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
+s"statistic = $statistic \n" +
+s"pValue = $pValue \n" + pValueExplain
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Object containing the test results for the chi squared hypothesis test.
+ */
+@Experimental
+case class ChiSquaredTestResult(override val pValue: Double,
--- End diff --

A case class is a logical choice here, since it's essentially an immutable 
object holding a bunch of invariant fields and doesn't do any stateful 
computation inside the class. Is there a development plan for extending this 
class in the future?





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51574118
  
Verified test results with R and all good :)
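
The goodness-of-fit statistics asserted in the test suite can also be reproduced by hand. A minimal sketch in plain Scala, assuming the rescaling of `expected` that the `chiSqTest` doc comment describes; `PearsonSketch` and `pearsonStatistic` are illustrative names, not MLlib API:

```scala
object PearsonSketch {
  // Pearson statistic: sum over categories of (observed - expected)^2 / expected.
  def pearsonStatistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length, "vectors must have the same size")
    // Rescale expected so that it sums to the observed total.
    val scale = observed.sum / expected.sum
    observed.zip(expected).map { case (o, e) =>
      val scaled = e * scale
      val dev = o - scaled
      dev * dev / scaled
    }.sum
  }

  // Degrees of freedom for a goodness-of-fit test: number of categories minus one.
  def degreesOfFreedom(observed: Array[Double]): Int = observed.length - 1
}
```

For the observed counts `(4, 6, 5)` against a uniform expectation, and for `(21, 38, 43, 80)` against `(3, 5, 7, 20)`, this sketch agrees with the 0.4 and 14.1429 asserted in the suite.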





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981590
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala ---
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrices, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.stat.test.ChiSquaredTest
+import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class HypothesisTestSuite extends FunSuite with LocalSparkContext {
+
+  test("chi squared pearson goodness of fit") {
+
+val observed = new DenseVector(Array[Double](4, 6, 5))
+val pearson = Statistics.chiSqTest(observed)
+
+// Results validated against the R command `chisq.test(c(4, 6, 5), 
p=c(1/3, 1/3, 1/3))`
+assert(pearson.statistic === 0.4)
+assert(pearson.degreesOfFreedom === 2)
+assert(pearson.pValue ~= 0.8187 relTol 1e-4)
--- End diff --

`~=` -> `~==`. The latter gives a more informative message when a check 
fails. (Please also update the other places.)
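
A minimal sketch of the difference, assuming a Boolean-returning `~=` and a throwing `~==`. `ToleranceSketch` is a hypothetical stand-in written for illustration; it is not the implementation in `org.apache.spark.mllib.util.TestingUtils`, and its tolerance rule and exception type are assumptions.

```scala
object ToleranceSketch {
  // Right-hand side of a comparison: the expected value and its relative tolerance.
  final case class Expected(value: Double, relTol: Double)

  implicit class DoubleCompare(actual: Double) {
    // Enables the `0.8187 relTol 1e-4` syntax.
    def relTol(tol: Double): Expected = Expected(actual, tol)

    private def close(e: Expected): Boolean =
      math.abs(actual - e.value) <= e.relTol * math.abs(e.value)

    // Boolean check: a failing `assert(a ~= b relTol t)` only reports that the assertion failed.
    def ~=(e: Expected): Boolean = close(e)

    // Throwing check: the failure message carries both values and the tolerance.
    def ~==(e: Expected): Boolean = {
      if (!close(e)) {
        throw new AssertionError(
          s"Expected $actual ~== ${e.value} within relative tolerance ${e.relTol}")
      }
      true
    }
  }
}
```

With `import ToleranceSketch._` in scope, `assert(x ~== 0.8187 relTol 1e-4)` groups as `x ~== (0.8187 relTol 1e-4)`, because Scala gives operators ending in `=` the lowest precedence.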





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981559
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala ---
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrices, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.stat.test.ChiSquaredTest
+import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class HypothesisTestSuite extends FunSuite with LocalSparkContext {
+
+  test("chi squared pearson goodness of fit") {
+
+val observed = new DenseVector(Array[Double](4, 6, 5))
+val pearson = Statistics.chiSqTest(observed)
+
+// Results validated against the R command `chisq.test(c(4, 6, 5), 
p=c(1/3, 1/3, 1/3))`
+assert(pearson.statistic === 0.4)
+assert(pearson.degreesOfFreedom === 2)
+assert(pearson.pValue ~= 0.8187 relTol 1e-4)
+assert(pearson.method === ChiSquaredTest.PEARSON.name)
+assert(pearson.nullHypothesis === 
ChiSquaredTest.NullHypothesis.goodnessOfFit.toString)
+
+// different expected and observed sum
+val observed1 = new DenseVector(Array[Double](21, 38, 43, 80))
+val expected1 = new DenseVector(Array[Double](3, 5, 7, 20))
+val pearson1 = Statistics.chiSqTest(observed1, expected1)
+
+// Results validated against the R command
+// `chisq.test(c(21, 38, 43, 80), p=c(3/35, 1/7, 1/5, 4/7))`
+assert(pearson1.statistic ~= 14.1429 relTol 1e-4)
+assert(pearson1.degreesOfFreedom === 3)
+assert(pearson1.pValue ~= 0.002717 relTol 1e-4)
+assert(pearson1.method === ChiSquaredTest.PEARSON.name)
+assert(pearson1.nullHypothesis === 
ChiSquaredTest.NullHypothesis.goodnessOfFit.toString)
+
+// SparseVector representation to make sure memory doesn't blow up
--- End diff --

Remove commented blocks.





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981500
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ * @tparam DF Return type of `degreesOfFreedom`
+ */
+@Experimental
+trait TestResult[DF] {
+
+  /**
+   *
+   */
+  def pValue: Double
+
+  /**
+   *
+   * @return
+   */
+  def degreesOfFreedom: DF
+
+  /**
+   *
+   * @return
+   */
+  def statistic: Double
+
+  /**
+   * String explaining the hypothesis test result.
+   * Specific classes implementing this trait should override this method 
to output test-specific
+   * information.
+   */
+  override def toString: String = {
+
+// String explaining what the p-value indicates.
+val pValueExplain = if (pValue <= 0.01) {
+  "Very strong presumption against null hypothesis."
+} else if (0.01 < pValue && pValue <= 0.05) {
+  "Strong presumption against null hypothesis."
+} else if (0.05 < pValue && pValue <= 0.1) {
+  "Low presumption against null hypothesis."
+} else {
+  "No presumption against null hypothesis."
+}
+
+s"degrees of freedom = ${degreesOfFreedom.toString} \n" +
+s"statistic = $statistic \n" +
+s"pValue = $pValue \n" + pValueExplain
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Object containing the test results for the chi squared hypothesis test.
+ */
+@Experimental
+case class ChiSquaredTestResult(override val pValue: Double,
--- End diff --

Does it need to be a case class? The Scala compiler adds many methods to a 
case class, which makes it very hard to extend.
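
As a rough illustration of what the compiler adds, the sketch below declares the same two fields as a case class and as a plain class. `CaseResult` and `PlainResult` are hypothetical names, not classes from the PR.

```scala
object CaseClassSketch {
  // Case class: the compiler synthesizes apply, unapply, copy, equals, hashCode,
  // toString and the Product members, all of which become part of the public API.
  final case class CaseResult(pValue: Double, statistic: Double)

  // Plain class: only what is written here is exposed.
  class PlainResult(val pValue: Double, val statistic: Double) {
    override def toString: String = s"PlainResult(pValue = $pValue, statistic = $statistic)"
  }

  // Exercises a few of the synthesized members of the case class.
  def synthesizedMembersDemo(): Seq[Any] = {
    val r = CaseResult(0.8187, 0.4)   // companion `apply`
    val CaseResult(p, s) = r          // companion `unapply`
    Seq(r.copy(statistic = 1.0), r == CaseResult(p, s), r.productArity)
  }
}
```

None of the calls in `synthesizedMembersDemo` would compile against `PlainResult`, which is the trade-off under discussion: less convenience, but a smaller surface to keep compatible.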





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981448
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import breeze.linalg.{DenseMatrix => BDM}
+import cern.jet.stat.Probability.chiSquareComplemented
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the Chi-squared test for the input RDDs using the specified 
method.
+ * Goodness-of-fit test is conducted on two `Vectors`, whereas test of 
independence is conducted
+ * on an input of type `Matrix` in which independence between columns is 
assessed.
+ * We also provide a method for computing the chi-squared statistic 
between each feature and the
+ * label for an input `RDD[LabeledPoint]`, return an 
`Array[ChiSquaredTestResult]` of size =
+ * number of features in the input RDD.
+ *
+ * Supported methods for goodness of fit: `pearson` (default)
+ * Supported methods for independence: `pearson` (default)
+ *
+ * More information on Chi-squared test: 
http://en.wikipedia.org/wiki/Chi-squared_test
+ */
+private[stat] object ChiSquaredTest extends Logging {
+
+  /**
+   * @param name String name for the method.
+   * @param chiSqFunc Function for computing the statistic given the 
observed and expected counts.
+   */
+  case class Method(name: String, chiSqFunc: (Double, Double) => Double)
+
+  // Pearson's chi-squared test: 
http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
+  val PEARSON = new Method("pearson", (observed: Double, expected: Double) 
=> {
+val dev = observed - expected
+dev * dev / expected
+  })
+
+  // Null hypothesis for the two different types of chi-squared tests to 
be included in the result.
+  object NullHypothesis extends Enumeration {
+type NullHypothesis = Value
+val goodnessOfFit = Value("observed follows the same distribution as 
expected.")
+val independence = Value("observations in each column are 
statistically independent.")
+  }
+
+  // Method identification based on input methodName string
+  private def methodFromString(methodName: String): Method = {
+methodName match {
+  case PEARSON.name => PEARSON
+  case _ => throw new IllegalArgumentException("Unrecognized method 
for Chi squared test.")
+}
+  }
+
+  /**
+   * Conduct Pearson's independence test for each feature against the 
label across the input RDD.
+   * The contingency table is constructed from the raw (feature, label) 
pairs and used to conduct
+   * the independence test.
+   * Returns an array containing the ChiSquaredTestResult for every 
feature against the label.
+   */
+  def chiSquaredFeatures(data: RDD[LabeledPoint],
+  methodName: String = PEARSON.name): Array[ChiSquaredTestResult] = {
+val numCols = data.first().features.size
+val results = new Array[ChiSquaredTestResult](numCols)
+var labels = Array[Double]()
+var col = 0
+while (col < numCols) {
--- End diff --

This could be done in a single pass (or in batches if numCols is large):

~~~
data.flatMap { p =>
  // assume dense vectors
  // emit one (feature index, label, feature value) key per feature
  p.features.toArray.view.zipWithIndex.map { case (f, j) =>
    (j, p.label, f)
  }
}.countByValue()
~~~
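
The same idea can be sketched without Spark, assuming dense features and categorical values. `SinglePassSketch`, `Point`, `countTriples`, and `byFeature` are hypothetical names used only for illustration; on an RDD the counting step would be the `countByValue()` call above.

```scala
object SinglePassSketch {
  // Hypothetical stand-in for MLlib's LabeledPoint with dense features.
  final case class Point(label: Double, features: Array[Double])

  // One traversal: emit a (featureIndex, label, featureValue) key per cell and count occurrences.
  def countTriples(data: Seq[Point]): Map[(Int, Double, Double), Long] =
    data.iterator
      .flatMap(p => p.features.iterator.zipWithIndex.map { case (f, j) => (j, p.label, f) })
      .foldLeft(Map.empty[(Int, Double, Double), Long]) { (acc, key) =>
        acc.updated(key, acc.getOrElse(key, 0L) + 1L)
      }

  // Regroup the counts by feature index: one (label, value) -> count table per feature.
  def byFeature(counts: Map[(Int, Double, Double), Long]): Map[Int, Map[(Double, Double), Long]] =
    counts.groupBy(_._1._1).map { case (j, cells) =>
      j -> cells.map { case ((_, label, value), n) => (label, value) -> n }
    }
}
```

Each per-feature table holds the cell counts of that feature's contingency matrix, so the independence test can then run once per feature without another pass over the data.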



[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981456
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ * @tparam DF Return type of `degreesOfFreedom`
+ */
+@Experimental
+trait TestResult[DF] {
+
+  /**
+   *
+   */
+  def pValue: Double
+
+  /**
+   *
--- End diff --

doc





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981454
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ * @tparam DF Return type of `degreesOfFreedom`
+ */
+@Experimental
+trait TestResult[DF] {
+
+  /**
+   *
--- End diff --

doc?





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981458
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ * @tparam DF Return type of `degreesOfFreedom`
+ */
+@Experimental
+trait TestResult[DF] {
+
+  /**
+   *
+   */
+  def pValue: Double
+
+  /**
+   *
+   * @return
+   */
+  def degreesOfFreedom: DF
+
+  /**
+   *
--- End diff --

doc





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981447
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import breeze.linalg.{DenseMatrix => BDM}
+import cern.jet.stat.Probability.chiSquareComplemented
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the Chi-squared test for the input RDDs using the specified 
method.
+ * Goodness-of-fit test is conducted on two `Vectors`, whereas test of 
independence is conducted
+ * on an input of type `Matrix` in which independence between columns is 
assessed.
+ * We also provide a method for computing the chi-squared statistic 
between each feature and the
+ * label for an input `RDD[LabeledPoint]`, return an 
`Array[ChiSquaredTestResult]` of size =
+ * number of features in the input RDD.
+ *
+ * Supported methods for goodness of fit: `pearson` (default)
+ * Supported methods for independence: `pearson` (default)
+ *
+ * More information on Chi-squared test: 
http://en.wikipedia.org/wiki/Chi-squared_test
+ */
+private[stat] object ChiSquaredTest extends Logging {
--- End diff --

minor: `ChiSquaredTest` -> `ChiSqTest` (to match the public method names)





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981435
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala 
---
@@ -89,4 +91,64 @@ object Statistics {
*/
   @Experimental
   def corr(x: RDD[Double], y: RDD[Double], method: String): Double = 
Correlations.corr(x, y, method)
+
+  /**
+   * :: Experimental ::
+   * Conduct Pearson's chi-squared goodness of fit test of the observed 
data against the
+   * expected distribution.
+   *
+   * Note: the two input Vectors need to have the same size.
+   *   `observed` cannot contain negative values.
+   *   `expected` cannot contain nonpositive values.
+   *
+   * @param observed Vector containing the observed categorical 
counts/relative frequencies.
+   * @param expected Vector containing the expected categorical 
counts/relative frequencies.
+   * `expected` is rescaled if the `expected` sum differs 
from the `observed` sum.
+   * @return ChiSquaredTest object containing the test statistic, degrees 
of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSqTest(observed: Vector,
--- End diff --

the following style may be better:

~~~
def chiSqTest(observed: Vector, expected: Vector): ChiSquaredTestResult =
  ChiSquaredTest.chiSquared(observed, expected)
~~~

~~~
def chiSqTest(observed: Vector, expected: Vector): ChiSquaredTestResult = {
  ChiSquaredTest.chiSquared(observed, expected)
}
~~~





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981441
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
@@ -89,4 +91,64 @@ object Statistics {
*/
   @Experimental
   def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
+
+  /**
+   * :: Experimental ::
+   * Conduct Pearson's chi-squared goodness of fit test of the observed data against the
+   * expected distribution.
+   *
+   * Note: the two input Vectors need to have the same size.
+   *   `observed` cannot contain negative values.
+   *   `expected` cannot contain nonpositive values.
+   *
+   * @param observed Vector containing the observed categorical counts/relative frequencies.
+   * @param expected Vector containing the expected categorical counts/relative frequencies.
+   * `expected` is rescaled if the `expected` sum differs from the `observed` sum.
+   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSqTest(observed: Vector,
+  expected: Vector): ChiSquaredTestResult = ChiSquaredTest.chiSquared(observed, expected)
+
+  /**
+   * :: Experimental ::
+   * Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform
+   * distribution, with each category having an expected frequency of `1 / observed.size`.
+   *
+   * Note: `observed` cannot contain negative values.
+   *
+   * @param observed Vector containing the observed categorical counts/relative frequencies.
+   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSqTest(observed: Vector): ChiSquaredTestResult = ChiSquaredTest.chiSquared(observed)
+
+  /**
+   * :: Experimental ::
+   * Conduct Pearson's independence test on the input contingency matrix, which cannot contain
+   * negative entries or columns or rows that sum up to 0.
+   *
+   * @param counts The contingency matrix.
+   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSqTest(counts: Matrix): ChiSquaredTestResult = ChiSquaredTest.chiSquaredMatrix(counts)
+
+  /**
+   * :: Experimental ::
+   * Conduct Pearson's independence test for every feature against the label across the input RDD.
+   * For each feature, the (feature, label) pairs are converted into a contingency matrix for which
+   * the chi-squared statistic is computed.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the Labeled dataset.
--- End diff --

mention categorical here?
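
A minimal sketch of the conversion the doc comment describes, in plain Scala with a local `Seq` standing in for the RDD. The object and method names are hypothetical and not taken from the patch. Each distinct feature value becomes its own row, which is why the input only makes sense for categorical features:

```scala
object ContingencySketch {
  /** Build a contingency table from categorical (feature, label) pairs.
    * Rows follow the sorted distinct feature values, columns the sorted
    * distinct labels. Hypothetical helper, for illustration only. */
  def contingency(pairs: Seq[(Double, Double)]): Array[Array[Double]] = {
    val features = pairs.map(_._1).distinct.sorted
    val labels = pairs.map(_._2).distinct.sorted
    val counts = Array.fill(features.size, labels.size)(0.0)
    // Every observed pair increments exactly one cell.
    for ((f, l) <- pairs) {
      counts(features.indexOf(f))(labels.indexOf(l)) += 1.0
    }
    counts
  }
}
```

A continuous feature would produce roughly one row per observation, so the resulting table would be too sparse for the test to be useful.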





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981444
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/ChiSquaredTest.scala ---
@@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import breeze.linalg.{DenseMatrix => BDM}
+import cern.jet.stat.Probability.chiSquareComplemented
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+/**
+ * Conduct the Chi-squared test for the input RDDs using the specified method.
--- End diff --

`Chi` -> `chi`





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15981438
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
@@ -89,4 +91,64 @@ object Statistics {
*/
   @Experimental
   def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
+
+  /**
+   * :: Experimental ::
+   * Conduct Pearson's chi-squared goodness of fit test of the observed data against the
+   * expected distribution.
+   *
+   * Note: the two input Vectors need to have the same size.
+   *   `observed` cannot contain negative values.
+   *   `expected` cannot contain nonpositive values.
+   *
+   * @param observed Vector containing the observed categorical counts/relative frequencies.
+   * @param expected Vector containing the expected categorical counts/relative frequencies.
+   * `expected` is rescaled if the `expected` sum differs from the `observed` sum.
+   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSqTest(observed: Vector,
+  expected: Vector): ChiSquaredTestResult = ChiSquaredTest.chiSquared(observed, expected)
+
+  /**
+   * :: Experimental ::
+   * Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform
+   * distribution, with each category having an expected frequency of `1 / observed.size`.
+   *
+   * Note: `observed` cannot contain negative values.
+   *
+   * @param observed Vector containing the observed categorical counts/relative frequencies.
+   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSqTest(observed: Vector): ChiSquaredTestResult = ChiSquaredTest.chiSquared(observed)
+
+  /**
+   * :: Experimental ::
+   * Conduct Pearson's independence test on the input contingency matrix, which cannot contain
+   * negative entries or columns or rows that sum up to 0.
+   *
+   * @param counts The contingency matrix.
+   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSqTest(counts: Matrix): ChiSquaredTestResult = ChiSquaredTest.chiSquaredMatrix(counts)
--- End diff --

`counts` -> `observed`? This table could also be probabilities.
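
For reference, a plain-Scala sketch of Pearson's independence statistic on a table of observed values. The names are hypothetical and this is not the patch's implementation. Expected cell values are derived from the row and column sums, so the computation also runs on a table of proportions, although the statistic then scales with the table total:

```scala
object IndependenceSketch {
  /** Pearson's statistic and degrees of freedom for a table of observed
    * values. Hypothetical helper; the p-value computation is omitted. */
  def chiSqInd(observed: Array[Array[Double]]): (Double, Int) = {
    require(observed.forall(_.forall(_ >= 0.0)), "entries must be non-negative")
    val rowSums = observed.map(_.sum)
    val colSums = observed.transpose.map(_.sum)
    require(rowSums.forall(_ > 0.0) && colSums.forall(_ > 0.0),
      "rows and columns must not sum to 0")
    val total = rowSums.sum
    val statistic = (for {
      i <- observed.indices
      j <- observed(i).indices
    } yield {
      // Expected value of a cell under independence.
      val expected = rowSums(i) * colSums(j) / total
      val diff = observed(i)(j) - expected
      diff * diff / expected
    }).sum
    (statistic, (rowSums.length - 1) * (colSums.length - 1))
  }
}
```

Dividing every cell by the total leaves the degrees of freedom unchanged but divides the statistic by the same total, so a caller passing probabilities would need to account for the sample size separately.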





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-08 Thread dorx
Github user dorx commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51570348
  
@mengxr @jkbradley @falaki 
In case you guys haven't noticed, the latest version implements the 
discussed APIs.





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51545655
  
QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18150/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51541696
  
QA tests have started for PR 1733. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18150/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-06 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51347255
  
The previous proposal may be hard to implement in Python. Another solution 
would be to separate the goodness-of-fit test from the independence test, e.g., 
`chiSqGofTest` and `chiSqIndTest`.

~~~
def chiSqGofTest(counts: Vector)

def chiSqGofTest(counts: Vector, p: Vector)

def chiSqIndTest(counts: Matrix)

def chiSqIndTest[V1, V2](observations: RDD[(V1, V2)])
~~~

We can also add direct RDD support, which may be unnecessary:

~~~
def chiSqGofTest[V](observations: RDD[V], p: Map[V, Double])
~~~

Since we only support `pearson`, we can hide `method` in the public API for 
now.
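
A rough sketch of what the optional raw-observation variant could compute, with a local `Seq[V]` standing in for `RDD[V]`. The object name, the return type, and the rescaling of `p` are assumptions for illustration; the p-value is omitted:

```scala
object GofFromObservationsSketch {
  /** Goodness-of-fit statistic and degrees of freedom from raw categorical
    * observations and a map of category probabilities. Hypothetical helper. */
  def chiSqGof[V](observations: Seq[V], p: Map[V, Double]): (Double, Int) = {
    require(p.values.forall(_ > 0.0), "probabilities must be positive")
    val counts = observations.groupBy(identity).map { case (k, vs) => k -> vs.size.toDouble }
    require(counts.keys.forall(p.contains), "every observed category needs a probability")
    val n = observations.size.toDouble
    val pSum = p.values.sum
    val statistic = p.map { case (category, prob) =>
      // Rescale so that the expected counts sum to the number of observations.
      val expected = n * prob / pSum
      val diff = counts.getOrElse(category, 0.0) - expected
      diff * diff / expected
    }.sum
    (statistic, p.size - 1)
  }
}
```

Categories listed in `p` but never observed contribute a count of 0, which is one reason a map keyed by category is more convenient here than a positional vector.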





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15857945
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ */
+@Experimental
+trait TestResult {
+
+  def pValue: Double
+
+  def degreesOfFreedom: Array[Long]
--- End diff --

`df` should be an array of doubles, or we can make it a generic type. In 
t-tests and F-tests, `df` is not always an integer.
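
One way the generic option could look, as a sketch only. The `...Sketch` names and the two result classes are hypothetical and exist purely to show the type parameter in use:

```scala
/** Hypothetical variant of the trait with the degrees-of-freedom type
  * left to each test. */
trait TestResultSketch[DF] {
  def pValue: Double
  def degreesOfFreedom: DF
  def statistic: Double
}

// A chi-squared test can keep an integral number of degrees of freedom.
final case class ChiSqResultSketch(pValue: Double, degreesOfFreedom: Int, statistic: Double)
  extends TestResultSketch[Int]

// A test with fractional degrees of freedom picks Double instead.
final case class FractionalDfResultSketch(pValue: Double, degreesOfFreedom: Double, statistic: Double)
  extends TestResultSketch[Double]
```

With a type parameter, each test states its own `df` representation and callers get it back without casting.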





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51298178
  
@dorx I checked R's implementation and finally figured out what is going on.

1. When only a vector `x` is given, it is treated as a vector containing 
frequency counts for categories and tested against a multinomial distribution.
2. When a matrix `x` is given, it is treated as a contingency table and the 
test is for independence.
3. When both `x` and `y` are given, both vectors are treated as factors 
(categorical values) and the test is for independence.

I want to suggest the following APIs:

~~~
// test observed frequencies against multinomial distribution with
// `p = (1/n, 1/n, ..., 1/n)`
def chiSqTest(counts: Vector)

// test observed frequencies against the given multinomial distribution
def chiSqTest(counts: Vector, p: Vector)

// test independence using the given contingency table 
def chiSqTest(counts: Matrix)

// test independence using the given observed pairs (assuming categorical
// values)
def chiSqTest[V1, V2](observations: RDD[(V1, V2)])
~~~
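
To make the first two signatures concrete, here is a plain-Scala sketch of the goodness-of-fit statistic on arrays instead of `Vector`. The object name and return type are assumptions, the rescaling of `p` to the observed total is one possible convention, and the p-value is omitted:

```scala
object GofSketch {
  /** Pearson's goodness-of-fit statistic and degrees of freedom for observed
    * counts against probabilities `p`. Hypothetical helper. */
  def chiSqGof(counts: Array[Double], p: Array[Double]): (Double, Int) = {
    require(counts.length == p.length, "counts and p must have the same size")
    require(counts.forall(_ >= 0.0), "counts must be non-negative")
    require(p.forall(_ > 0.0), "p must be positive")
    // Rescale p so that the expected counts sum to the observed total.
    val scale = counts.sum / p.sum
    val statistic = counts.zip(p).map { case (observed, prob) =>
      val expected = prob * scale
      (observed - expected) * (observed - expected) / expected
    }.sum
    (statistic, counts.length - 1)
  }

  /** Test against the uniform distribution `p = (1/n, ..., 1/n)`. */
  def chiSqGof(counts: Array[Double]): (Double, Int) =
    chiSqGof(counts, Array.fill(counts.length)(1.0 / counts.length))
}
```

For counts `(10, 20, 30)` the uniform variant expects 20 per category, giving a statistic of 10 with 2 degrees of freedom.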





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51290954
  
I think we should either allow users to input the raw observations or use 
`Map[_, Long]` for input frequencies. I'm going to take a look at R's 
implementation ...





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51289600
  
QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17975/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15854511
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ */
+@Experimental
+trait TestResult {
+
+  def pValue: Double
+
+  def degreesOfFreedom: Array[Long]
+
+  def statistic: Double
+
+  /**
+   * String explaining the hypothesis test result.
+   * Specific classes implementing this trait should override this method to output test-specific
+   * information.
+   */
+  override def toString: String = {
+
+val pValueExplain = if (pValue <= 0.01) {
+  "Very strong presumption against null hypothesis."
+} else if (0.01 < pValue && pValue <= 0.05) {
+  "Strong presumption against null hypothesis."
+} else if (0.05 < pValue && pValue <= 0.1) {
+  "Low presumption against null hypothesis."
+} else {
+  "No presumption against null hypothesis."
+}
+
+s"degrees of freedom = ${degreesOfFreedom.mkString} \n" +
--- End diff --

`mkString("[", ",", "]")`
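
The difference is easy to see with more than one entry. A small illustration with made-up values (the object name and the numbers are hypothetical):

```scala
object MkStringSketch {
  // Two hypothetical degrees-of-freedom values.
  val degreesOfFreedom: Array[Long] = Array(2L, 13L)

  // Without arguments the elements are concatenated with no separator.
  val fused: String = degreesOfFreedom.mkString // "213"

  // With start, separator, and end the individual values stay readable.
  val bracketed: String = degreesOfFreedom.mkString("[", ",", "]") // "[2,13]"
}
```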





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15854488
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ */
+@Experimental
+trait TestResult {
+
+  def pValue: Double
+
+  def degreesOfFreedom: Array[Long]
+
+  def statistic: Double
--- End diff --

ditto: doc





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15854484
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ */
+@Experimental
+trait TestResult {
+
+  def pValue: Double
--- End diff --

documentation





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15854474
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
@@ -89,4 +90,76 @@ object Statistics {
*/
   @Experimental
   def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
+
+  /**
+   * :: Experimental ::
+   * Conduct the Chi-squared goodness of fit test of the observed data against the
+   * expected distribution.
+   *
+   * Note: the two input RDDs need to have the same number of partitions and the same number of
+   * elements in each partition.
+   *
+   * @param observed RDD[Double] containing the observed counts.
+   * @param expected RDD[Double] containing the expected counts. If the observed total differs from
+   * the expected total, this RDD is rescaled to sum up to the observed total.
+   * @param method String specifying the method to use for the Chi-squared test.
+   *   Supported: `pearson` (default)
+   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSquared(observed: RDD[Double],
--- End diff --

`chiSqTest` sounds good.





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15854487
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.test
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Trait for hypothesis test results.
+ */
+@Experimental
+trait TestResult {
+
+  def pValue: Double
+
+  def degreesOfFreedom: Array[Long]
--- End diff --

ditto: doc





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51289343
  
QA results for PR 1733:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17974/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15854426
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
@@ -89,4 +90,76 @@ object Statistics {
*/
   @Experimental
   def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
+
+  /**
+   * :: Experimental ::
+   * Conduct the Chi-squared goodness of fit test of the observed data against the
--- End diff --

`Chi` -> `chi`





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15854417
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
@@ -89,4 +90,76 @@ object Statistics {
*/
   @Experimental
   def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
+
+  /**
+   * :: Experimental ::
+   * Conduct the Chi-squared goodness of fit test of the observed data against the
+   * expected distribution.
+   *
+   * Note: the two input RDDs need to have the same number of partitions and the same number of
+   * elements in each partition.
+   *
+   * @param observed RDD[Double] containing the observed counts.
+   * @param expected RDD[Double] containing the expected counts. If the observed total differs from
+   * the expected total, this RDD is rescaled to sum up to the observed total.
+   * @param method String specifying the method to use for the Chi-squared test.
+   *   Supported: `pearson` (default)
+   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSquared(observed: RDD[Double],
+  expected: RDD[Double],
+  method: String): ChiSquaredTestResult = {
+ChiSquaredTest.chiSquared(observed, expected, method)
+  }
+
+  /**
+   * :: Experimental ::
+   * Conduct the Chi-squared goodness of fit test of the observed data against the
+   * expected distribution.
--- End diff --

mention `pearson` here?

minor: I think it should be fine to remove the rest of the doc and point 
users to the method with the full set of parameters, so we only maintain one 
copy.





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1733#discussion_r15854415
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
@@ -89,4 +90,76 @@ object Statistics {
*/
   @Experimental
   def corr(x: RDD[Double], y: RDD[Double], method: String): Double = Correlations.corr(x, y, method)
+
+  /**
+   * :: Experimental ::
+   * Conduct the Chi-squared goodness of fit test of the observed data against the
+   * expected distribution.
+   *
+   * Note: the two input RDDs need to have the same number of partitions and the same number of
+   * elements in each partition.
+   *
+   * @param observed RDD[Double] containing the observed counts.
+   * @param expected RDD[Double] containing the expected counts. If the observed total differs from
+   * the expected total, this RDD is rescaled to sum up to the observed total.
+   * @param method String specifying the method to use for the Chi-squared test.
+   *   Supported: `pearson` (default)
+   * @return ChiSquaredTest object containing the test statistic, degrees of freedom, p-value,
+   * the method used, and the null hypothesis.
+   */
+  @Experimental
+  def chiSquared(observed: RDD[Double],
--- End diff --

Shall we call it `chiSqTest` (following R's `chisq.test`)? We need `test` in the 
method name because χ² is also a distribution. I feel `chiSqTest` may be better 
than `chiSquaredTest` because the test is also called the `chi-square test`, 
without the `d`.
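
For readers following the doc comment in the diff above, here is a minimal local sketch of the Pearson statistic it describes, including the rescaling of the expected counts to the observed total. It uses plain Scala collections rather than RDDs, and the names `PearsonSketch`, `pearsonStatistic`, and `degreesOfFreedom` are hypothetical; this is not the implementation in the PR, and the p-value computation is omitted.

```scala
// Hypothetical local sketch of the Pearson goodness-of-fit statistic.
// It is NOT the code from this pull request and does not use Spark.
object PearsonSketch {

  /** Sum over categories of (observed - expected)^2 / expected. */
  def pearsonStatistic(observed: Seq[Double], expected: Seq[Double]): Double = {
    require(observed.size == expected.size, "observed and expected must have the same length")
    require(expected.forall(_ > 0.0), "expected counts must be positive")
    // Rescale expected so that it sums to the observed total,
    // as the doc comment says is done when the totals differ.
    val scale = observed.sum / expected.sum
    observed.zip(expected).map { case (o, e) =>
      val scaled = e * scale
      (o - scaled) * (o - scaled) / scaled
    }.sum
  }

  /** Degrees of freedom for a goodness-of-fit test over k categories: k - 1. */
  def degreesOfFreedom(observed: Seq[Double]): Int = observed.size - 1
}
```

With observed counts 10, 20, 30 and a uniform expected distribution given as 1, 1, 1, the expected counts are rescaled to 20 each, so the statistic is 10 with 2 degrees of freedom.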





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51287008
  
remove space between `@` and `jkbradley` ~ :)





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51286717
  
QA tests have started for PR 1733. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17975/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread dorx
Github user dorx commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51286506
  
@mengxr @ jkbradley @falaki 
PR ready for review now.





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-51286427
  
QA tests have started for PR 1733. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17974/consoleFull





[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-50953135
  
QA results for PR 1733:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17744/consoleFull




[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1733#issuecomment-50953122
  
QA tests have started for PR 1733. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17744/consoleFull




[GitHub] spark pull request: [SPARK-2515][mllib] Chi Squared test

2014-08-01 Thread dorx
GitHub user dorx opened a pull request:

https://github.com/apache/spark/pull/1733

[SPARK-2515][mllib] Chi Squared test



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dorx/spark chisquare

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1733.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1733


commit ff17423bd714592d38b69df426382838216cd133
Author: Doris Xin 
Date:   2014-07-25T19:31:35Z

WIP

commit 6598379979e1ed69de6956ebf56ad0b7b47029bf
Author: Doris Xin 
Date:   2014-07-25T22:29:08Z

API and code structure.

commit 706d436aea3db8b8cf15db0bcccb25e19c121a78
Author: Doris Xin 
Date:   2014-07-25T22:38:07Z

Added API for RDD[Vector]

commit 3d615828a913b341c9fc7afe6e371f3950d591ab
Author: Doris Xin 
Date:   2014-07-25T22:54:23Z

input names

commit e6b83f35375701f71f699697a83236e7e0c76d6c
Author: Doris Xin 
Date:   2014-08-01T20:33:04Z

reviewer comments

commit 4e4e36199aa81d9d1628322c499e40556fbdc6ef
Author: Doris Xin 
Date:   2014-08-02T02:15:57Z

WIP

commit 50703a57712ced5afbed4e2be73a268e7009c0c9
Author: Doris Xin 
Date:   2014-08-02T02:20:03Z

merge master

commit bc7eb2eeba4e2ccf10b891e4ce59db55823cea3b
Author: Doris Xin 
Date:   2014-08-02T03:48:05Z

unit passed; still need docs and some refactoring



