[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17421


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108535695
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

OK, no problem, I just wanted to check.





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-28 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108511847
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

Oh yah sorry, it's anything which is a new sub-directory. When I was
reading this PR yesterday I thought this was a new directory, but looking at it
today that isn't the case, sorry.
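
The module-versus-package distinction this sub-thread settles on can be sketched with the standard library alone. This is a minimal illustration, not Spark's actual layout: the `demo` tree below is hypothetical, with `stat.py` standing in for a new single-file module and `linalg/` for a sub-directory package. Only the directory-level entry would need its own line in a `setup.py` `packages` list; a new `.py` file inside an already-listed package comes along with it.

```python
import os
import pkgutil
import tempfile

# Build a throwaway tree (hypothetical names, not the real pyspark layout):
#   demo/__init__.py
#   demo/ml/__init__.py
#   demo/ml/stat.py             <- a module (single file)
#   demo/ml/linalg/__init__.py  <- a package (sub-directory)
root = tempfile.mkdtemp()
ml_dir = os.path.join(root, "demo", "ml")
os.makedirs(os.path.join(ml_dir, "linalg"))
for path in (
    os.path.join(root, "demo", "__init__.py"),
    os.path.join(ml_dir, "__init__.py"),
    os.path.join(ml_dir, "stat.py"),
    os.path.join(ml_dir, "linalg", "__init__.py"),
):
    open(path, "w").close()

# iter_modules reports, for each entry directly under demo/ml,
# whether it is a package (a directory) or a plain module (a file).
kinds = {
    info.name: ("package" if info.ispkg else "module")
    for info in pkgutil.iter_modules([ml_dir])
}
print(kinds)
```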





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108485003
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

@holdenk  If we need to add pyspark.ml.stat to setup.py, then why are we 
not adding the other analogous modules: pyspark.ml.{classification, clustering, 
regression,...}?





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-27 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108299791
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

Thanks @jkbradley, I reverted setup.py.





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-27 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108286819
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -0,0 +1,104 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark import since, SparkContext
+from pyspark.ml.common import _java2py, _py2java
+from pyspark.ml.wrapper import _jvm
+
+
+class ChiSquareTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct Pearson's independence test for every feature against the label. For each feature,
+    the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared
+    statistic is computed. All label and feature values must be categorical.
+
+    The null hypothesis is that the occurrence of the outcomes is statistically independent.
+
+    :param dataset:
+      DataFrame of categorical labels and categorical features.
+      Real-valued features will be treated as categorical for each distinct value.
+    :param featuresCol:
+      Name of features column in dataset, of type `Vector` (`VectorUDT`).
+    :param labelCol:
+      Name of label column in dataset, of any numerical type.
+    :return:
+      DataFrame containing the test result for every feature against the label.
+      This DataFrame will contain a single Row with the following fields:
+      - `pValues: Vector`
+      - `degreesOfFreedom: Array[Int]`
+      - `statistics: Vector`
+      Each of these fields has one value per feature.
+
+    >>> from pyspark.ml.linalg import Vectors
+    >>> from pyspark.ml.stat import ChiSquareTest
+    >>> dataset = [[0, Vectors.dense([0, 0, 1])],
+    ...            [0, Vectors.dense([1, 0, 1])],
+    ...            [1, Vectors.dense([2, 1, 1])],
+    ...            [1, Vectors.dense([3, 1, 1])]]
+    >>> dataset = spark.createDataFrame(dataset, ["label", "features"])
+    >>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label')
+    >>> chiSqResult.select("degreesOfFreedom").collect()[0]
+    Row(degreesOfFreedom=[3, 1, 0])
+
+    .. versionadded:: 2.2.0
+
+    """
+    @staticmethod
+    @since("2.2.0")
+    def test(dataset, featuresCol, labelCol):
+        """
+        Perform a Pearson's independence test using dataset.
+        """
+        sc = SparkContext._active_spark_context
+        javaTestObj = _jvm().org.apache.spark.ml.stat.ChiSquareTest
+        args = [_py2java(sc, arg) for arg in (dataset, featuresCol, labelCol)]
+        return _java2py(sc, javaTestObj.test(*args))
+
+
+if __name__ == "__main__":
+    import doctest
+    import pyspark.ml.stat
+    from pyspark.sql import SparkSession
+
+    globs = pyspark.ml.stat.__dict__.copy()
+    # The small batch size here ensures that we see multiple batches,
+    # even in these small test examples:
+    spark = SparkSession.builder \
+        .master("local[2]") \
+        .appName("ml.stat tests") \
+        .getOrCreate()
+    sc = spark.sparkContext
+    globs['sc'] = sc
+    globs['spark'] = spark
+    import tempfile
+
+    temp_path = tempfile.mkdtemp()
--- End diff --

I don't think this test is using the temp path?
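
For readers following the docstring in the diff above, here is a rough pure-Python sketch of the calculation it describes: each feature column is paired with the label, the pairs are counted into a contingency table, and a Pearson chi-squared statistic is computed with (distinct feature values - 1) * (distinct labels - 1) degrees of freedom. The helper `chi_square` is a hypothetical name introduced only for this illustration; it is not how the Scala `ChiSquareTest` is implemented, and it does not compute p-values.

```python
from collections import Counter


def chi_square(feature_values, labels):
    """Return (statistic, degrees_of_freedom) for one feature column."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # contingency counts
    feature_totals = Counter(feature_values)         # row totals
    label_totals = Counter(labels)                   # column totals
    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# Same data as the doctest in the diff: one label and three features per row.
labels = [0, 0, 1, 1]
features = [[0, 0, 1], [1, 0, 1], [2, 1, 1], [3, 1, 1]]

# zip(*features) transposes the rows into one tuple per feature column.
results = [chi_square(column, labels) for column in zip(*features)]
statistics = [stat for stat, _ in results]
degrees_of_freedom = [dof for _, dof in results]
print(degrees_of_freedom)  # [3, 1, 0], matching the doctest's expected Row
```

The third feature is constant, so it has a single distinct value and zero degrees of freedom, which is why the doctest expects `[3, 1, 0]`.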





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-27 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108299231
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

Sub-modules aren't automatically packaged so we do need to explicitly add 
it.





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-27 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108287406
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -41,9 +41,7 @@
 import tempfile
 import array as pyarray
 import numpy as np
-from numpy import (
-    abs, all, arange, array, array_equal, dot, exp, inf, mean, ones, random, tile, zeros)
-from numpy import sum as array_sum
+from numpy import abs, all, arange, array, array_equal, inf, ones, tile, zeros
--- End diff --

Thanks for cleaning up the numpy imports :) +1





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108296529
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

Wait, do we need to update setup.py?  This is creating a module, not a 
package, right?





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-27 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108286662
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

@holdenk thanks for catching that, should be fixed now.





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-27 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108283757
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

We just took it out in 
https://github.com/apache/spark/commit/314cf51ded52834cfbaacf58d3d05a220965ca2a 
, but since this is adding back in ml.stat we also need to update setup.py (you 
might need to update your branch from the latest master to see this).





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108026117
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -0,0 +1,87 @@
+from pyspark import since, SparkContext
+from pyspark.ml.common import _java2py, _py2java
+from pyspark.ml.wrapper import _jvm
+
+
+class ChiSquareTest(object):
--- End diff --

Also, we put the triple-quotes on their own line elsewhere in pyspark
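
A minimal sketch of the docstring layout being asked for, next to the layout used in the diff under review. Both class names are hypothetical and exist only for this illustration; the `.. note:: Experimental` line is the marker that appears in the later revision of the diff.

```python
import inspect


class ExampleStat(object):
    """
    .. note:: Experimental

    Summary text starts on its own line, after the opening triple-quotes.
    """


class CrampedStat(object):
    """ Summary text shares a line with the opening triple-quotes.
    """


# inspect.getdoc strips the leading blank line and the common indentation,
# so the first style still renders cleanly in generated documentation.
first_line = inspect.getdoc(ExampleStat).splitlines()[0]
print(first_line)  # .. note:: Experimental
```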





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108023140
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1692,6 +1692,23 @@ def test_new_java_array(self):
 self.assertEqual(_java2py(self.sc, java_array), [])
 
 
+class ChiSquareTestTests(SparkSessionTestCase):
+
+    def test_ChiSquareTest(self):
+        labels = [1, 2, 0]
+        vectors = [_convert_to_vector([0, 1, 2]),
+                   _convert_to_vector([1, 1, 1]),
+                   _convert_to_vector([2, 1, 0])]
+        data = zip(labels, vectors)
+        df = self.spark.createDataFrame(data, ['label', 'feat'])
+        res = ChiSquareTest.test(df, 'feat', 'label')
+        # pValues = res.select("pValues").collect())
--- End diff --

(Noting that this can be updated once the Spark SQL bug is fixed)





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108022935
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1692,6 +1692,23 @@ def test_new_java_array(self):
 self.assertEqual(_java2py(self.sc, java_array), [])
 
 
+class ChiSquareTestTests(SparkSessionTestCase):
+
+    def test_ChiSquareTest(self):
--- End diff --

This is a little arbitrary, but to follow other examples, write this as: 
```test_chisquaretest```





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108026690
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1692,6 +1692,23 @@ def test_new_java_array(self):
 self.assertEqual(_java2py(self.sc, java_array), [])
 
 
+class ChiSquareTestTests(SparkSessionTestCase):
+
+    def test_ChiSquareTest(self):
+        labels = [1, 2, 0]
+        vectors = [_convert_to_vector([0, 1, 2]),
+                   _convert_to_vector([1, 1, 1]),
+                   _convert_to_vector([2, 1, 0])]
+        data = zip(labels, vectors)
--- End diff --

Same for the doc test





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108022929
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -0,0 +1,87 @@
+from pyspark import since, SparkContext
+from pyspark.ml.common import _java2py, _py2java
+from pyspark.ml.wrapper import _jvm
+
+
+class ChiSquareTest(object):
--- End diff --

Mark as Experimental  (Search for other examples to see how this is marked)





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108023008
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1692,6 +1692,23 @@ def test_new_java_array(self):
 self.assertEqual(_java2py(self.sc, java_array), [])
 
 
+class ChiSquareTestTests(SparkSessionTestCase):
+
+    def test_ChiSquareTest(self):
+        labels = [1, 2, 0]
+        vectors = [_convert_to_vector([0, 1, 2]),
--- End diff --

Use DenseVector, not _convert_to_vector.  (use public APIs wherever 
possible)





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108026677
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -0,0 +1,102 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark import since, SparkContext
+from pyspark.ml.common import _java2py, _py2java
+from pyspark.ml.wrapper import _jvm
+
+
+class ChiSquareTest(object):
+    """ Conduct Pearson's independence test for every feature against the label. For each feature,
+    the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared
+    statistic is computed. All label and feature values must be categorical.
+
+    The null hypothesis is that the occurrence of the outcomes is statistically independent.
+
+    :param dataset:
--- End diff --

Same for the return value text





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108026673
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -0,0 +1,102 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark import since, SparkContext
+from pyspark.ml.common import _java2py, _py2java
+from pyspark.ml.wrapper import _jvm
+
+
+class ChiSquareTest(object):
+    """ Conduct Pearson's independence test for every feature against the label. For each feature,
+    the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared
+    statistic is computed. All label and feature values must be categorical.
+
+    The null hypothesis is that the occurrence of the outcomes is statistically independent.
+
+    :param dataset:
--- End diff --

Copy param text from the Scala doc, unless there's a need to customize it 
for Python





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108022984
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -0,0 +1,87 @@
+from pyspark import since, SparkContext
+from pyspark.ml.common import _java2py, _py2java
+from pyspark.ml.wrapper import _jvm
+
+
+class ChiSquareTest(object):
--- End diff --

Mark as Experimental  (Search for other examples of this)





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108023069
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1692,6 +1692,23 @@ def test_new_java_array(self):
 self.assertEqual(_java2py(self.sc, java_array), [])
 
 
+class ChiSquareTestTests(SparkSessionTestCase):
+
+    def test_ChiSquareTest(self):
+        labels = [1, 2, 0]
+        vectors = [_convert_to_vector([0, 1, 2]),
+                   _convert_to_vector([1, 1, 1]),
+                   _convert_to_vector([2, 1, 0])]
+        data = zip(labels, vectors)
--- End diff --

It can also be nicer to write this in a per-row format, rather than zipping 
labels and vectors which are defined separately.  See other examples of 
createDataFrame in this file.
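
A small sketch of the per-row layout suggested here, using plain Python lists so it runs without Spark. In the actual test each feature list would presumably be wrapped with the public `Vectors.dense` (or `DenseVector`) API and the rows passed to `createDataFrame`; that step is left out and is an assumption about how the test would be rewritten.

```python
# Zipped style from the diff: labels and vectors are defined separately,
# so the reader has to line them up by position.
labels = [1, 2, 0]
vectors = [[0, 1, 2], [1, 1, 1], [2, 1, 0]]
zipped = list(zip(labels, vectors))  # list() because zip is lazy on Python 3

# Per-row style: each (label, features) pair is written together.
rows = [
    (1, [0, 1, 2]),
    (2, [1, 1, 1]),
    (0, [2, 1, 0]),
]

# Both spellings describe the same data; only readability differs.
print(rows == zipped)  # True
```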





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108026186
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -0,0 +1,104 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark import since, SparkContext
+from pyspark.ml.common import _java2py, _py2java
+from pyspark.ml.wrapper import _jvm
+
+
+class ChiSquareTest(object):
+    """ Conduct Pearson's independence test for every feature against the label. For each feature,
--- End diff --

I just saw you changed this from the Scala doc b/c I left "RDD" there.  
Would you mind correcting the Scala doc too?





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread MrBago
GitHub user MrBago opened a pull request:

https://github.com/apache/spark/pull/17421

[SPARK-20040][ML][python] pyspark wrapper for ChiSquareTest

## What changes were proposed in this pull request?

A pyspark wrapper for spark.ml.stat.ChiSquareTest

## How was this patch tested?

unit tests
doctests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MrBago/spark chiSquareTestWrapper

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17421


commit a6bc10c9aa9166e7274d9c9ca3959a15b70e87ec
Author: Bago Amirbekian 
Date:   2017-03-24T23:58:21Z

Added pyspark wrapper for ChiSquareTest and associated tests.



