[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...

2017-05-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18122


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...

2017-05-30 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18122#discussion_r119146702
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -538,6 +538,19 @@ def test_rformula_force_index_label(self):
         transformedDF2 = model2.transform(df)
         self.assertEqual(transformedDF2.head().label, 0.0)
 
+    def test_rformula_string_indexer_order_type(self):
+        df = self.spark.createDataFrame([
+            (1.0, 1.0, "a"),
+            (0.0, 2.0, "b"),
+            (1.0, 0.0, "a")], ["y", "x", "s"])
+        rf = RFormula(formula="y ~ x + s", stringIndexerOrderType="alphabetDesc")
+        self.assertEqual(rf.getStringIndexerOrderType(), 'alphabetDesc')
+        transformedDF = rf.fit(df).transform(df)
+        observed = transformedDF.select("features").collect()
+        expected = [[1.0, 0.0], [2.0, 1.0], [0.0, 0.0]]
+        for i in range(0, len(expected)):
+            self.assertTrue((observed[i]["features"].toArray() == expected[i]).all())
--- End diff --

Minor: we usually prefer 
```self.assertTrue(all(observed[i]["features"].toArray() == expected[i]))```.
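
For anyone checking the `expected` values in the test above, here is a minimal pure-Python sketch of how they come about. It assumes the behaviour described in the RFormula documentation, namely that the last category after ordering is dropped when string columns are encoded. `dummy_code` is a hypothetical helper written for this illustration, not a PySpark API, and it only covers the alphabetical order types.

```python
def dummy_code(values, descending):
    """Dummy-code a string column, dropping the last category after ordering.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    # Order the distinct categories alphabetically (alphabetDesc when descending=True).
    labels = sorted(set(values), reverse=descending)
    # Assumption from the RFormula docs: the last category after ordering is dropped.
    kept = labels[:-1]
    # One indicator column per kept category.
    return [[1.0 if v == label else 0.0 for label in kept] for v in values]


# Same rows as the test: (y, x, s)
rows = [(1.0, 1.0, "a"), (0.0, 2.0, "b"), (1.0, 0.0, "a")]

# With alphabetDesc the order is ["b", "a"], so "a" is dropped and "b" maps to 1.0.
s_encoded = dummy_code([s for _, _, s in rows], descending=True)

# The feature vector is x followed by the encoded s column.
features = [[x] + enc for (_, x, _), enc in zip(rows, s_encoded)]
print(features)  # [[1.0, 0.0], [2.0, 1.0], [0.0, 0.0]]
```

With an ascending alphabetical order the roles would flip: "b" would be dropped and "a" would map to 1.0.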





[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...

2017-05-28 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18122#discussion_r118844255
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -3032,6 +3032,18 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM
     ...
     >>> str(loadedModel)
     'RFormulaModel(ResolvedRFormula(label=y, terms=[x,s], hasIntercept=true)) (uid=...)'
+    >>> rf = RFormula(formula="y ~ x + s", stringIndexerOrderType="alphabetDesc")
+    >>> rf.getStringIndexerOrderType()
+    'alphabetDesc'
+    >>> rf.fit(df).transform(df).show()
+    +---+---+---+---------+-----+
+    |  y|  x|  s| features|label|
+    +---+---+---+---------+-----+
+    |1.0|1.0|  a|[1.0,0.0]|  1.0|
+    |0.0|2.0|  b|[2.0,1.0]|  0.0|
+    |0.0|0.0|  a|(2,[],[])|  0.0|
+    +---+---+---+---------+-----+
+    ...
--- End diff --

Could you move the newly added test to ```tests.py```? We keep only the basic 
doc tests here, since they serve as both tests and examples; other tests 
should be placed in ```tests.py```. Thanks.
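
To illustrate the split being asked for, here is a small stand-alone sketch using only the Python standard library. `descending_labels` and `DescendingLabelsTests` are hypothetical names made up for this example and have nothing to do with the Spark code base: the docstring carries one basic example that doubles as a test, while the edge cases sit in a separate `unittest` class, as they would in a test module.

```python
import doctest
import unittest


def descending_labels(labels):
    """Return the distinct labels in descending alphabetical order.

    The single doctest below is the basic case that also reads as an example.

    >>> descending_labels(["a", "b", "a"])
    ['b', 'a']
    """
    return sorted(set(labels), reverse=True)


class DescendingLabelsTests(unittest.TestCase):
    """Edge cases that would clutter the docstring belong in the test module."""

    def test_empty_input(self):
        self.assertEqual(descending_labels([]), [])

    def test_single_label(self):
        self.assertEqual(descending_labels(["a", "a"]), ["a"])


# Run the docstring example explicitly, without relying on __main__.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for doc_test in finder.find(descending_labels, "descending_labels", module=False,
                            globs={"descending_labels": descending_labels}):
    runner.run(doc_test)
doctest_results = runner.summarize(verbose=False)

# Run the unit tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DescendingLabelsTests)
unittest_result = unittest.TextTestRunner(verbosity=0).run(suite)
```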





[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...

2017-05-26 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18122#discussion_r118796569
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -3043,26 +3055,35 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM
                             "Force to index label whether it is numeric or string",
                             typeConverter=TypeConverters.toBoolean)
 
+    stringIndexerOrderType = Param(Params._dummy(), "stringIndexerOrderType",
+                                   "How to order categories of a string FEATURE column used by " +
--- End diff --

Changed it to lower case now. 





[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...

2017-05-26 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18122#discussion_r118780954
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -3043,26 +3055,35 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM
                             "Force to index label whether it is numeric or string",
                             typeConverter=TypeConverters.toBoolean)
 
+    stringIndexerOrderType = Param(Params._dummy(), "stringIndexerOrderType",
+                                   "How to order categories of a string FEATURE column used by " +
--- End diff --

Is capitalizing FEATURE common here?





[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...

2017-05-26 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/18122

[SPARK-20899][PySpark] PySpark supports stringIndexerOrderType in RFormula

## What changes were proposed in this pull request?

PySpark supports stringIndexerOrderType in RFormula as in #17967. 

## How was this patch tested?
docstring test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark PythonRFormula

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18122


commit 4bca4d95613e6e18361de8fe0a36667182c2d446
Author: actuaryzhang 
Date:   2017-05-26T07:40:22Z

Pyhton port for Rformula stringIndexerOrderType



