[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18122 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18122#discussion_r119146702

    --- Diff: python/pyspark/ml/tests.py ---
    @@ -538,6 +538,19 @@ def test_rformula_force_index_label(self):
             transformedDF2 = model2.transform(df)
             self.assertEqual(transformedDF2.head().label, 0.0)
    +    def test_rformula_string_indexer_order_type(self):
    +        df = self.spark.createDataFrame([
    +            (1.0, 1.0, "a"),
    +            (0.0, 2.0, "b"),
    +            (1.0, 0.0, "a")], ["y", "x", "s"])
    +        rf = RFormula(formula="y ~ x + s", stringIndexerOrderType="alphabetDesc")
    +        self.assertEqual(rf.getStringIndexerOrderType(), 'alphabetDesc')
    +        transformedDF = rf.fit(df).transform(df)
    +        observed = transformedDF.select("features").collect()
    +        expected = [[1.0, 0.0], [2.0, 1.0], [0.0, 0.0]]
    +        for i in range(0, len(expected)):
    +            self.assertTrue((observed[i]["features"].toArray() == expected[i]).all())
    --- End diff --

    Minor: We usually prefer ```self.assertTrue(all(observed[i]["features"].toArray() == expected[i]))```.
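Editor's note: the `expected` values in the test above can be checked by hand. The following is a minimal pure-Python sketch, not Spark's implementation, and it assumes that `alphabetDesc` indexes the distinct strings in descending alphabetical order and that the last category after ordering is dropped when one-hot encoding. The helper names `index_labels_alphabet_desc` and `one_hot_drop_last` are hypothetical and exist only for this illustration.

```python
def index_labels_alphabet_desc(values):
    """Assign indices to distinct strings in descending alphabetical order."""
    ordered = sorted(set(values), reverse=True)
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot_drop_last(index, num_categories):
    """One-hot encode an index, omitting the column of the last category."""
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec


# Same rows as the test: columns are (y, x, s).
rows = [(1.0, 1.0, "a"), (0.0, 2.0, "b"), (1.0, 0.0, "a")]

# "b" sorts first in descending order, so "a" becomes the dropped category.
mapping = index_labels_alphabet_desc([s for _, _, s in rows])

# Features for "y ~ x + s": the numeric x followed by the encoded s.
features = [[x] + one_hot_drop_last(mapping[s], len(mapping)) for _, x, s in rows]

for row in features:
    print(row)
```

Under these assumptions the single indicator column marks `s == "b"`, which is why the rows with `s == "a"` end in `0.0`.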
[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18122#discussion_r118844255

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -3032,6 +3032,18 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM
         ...
         >>> str(loadedModel)
         'RFormulaModel(ResolvedRFormula(label=y, terms=[x,s], hasIntercept=true)) (uid=...)'
    +    >>> rf = RFormula(formula="y ~ x + s", stringIndexerOrderType="alphabetDesc")
    +    >>> rf.getStringIndexerOrderType()
    +    'alphabetDesc'
    +    >>> rf.fit(df).transform(df).show()
    +    +---+---+---+---------+-----+
    +    |  y|  x|  s| features|label|
    +    +---+---+---+---------+-----+
    +    |1.0|1.0|  a|[1.0,0.0]|  1.0|
    +    |0.0|2.0|  b|[2.0,1.0]|  0.0|
    +    |0.0|0.0|  a|(2,[],[])|  0.0|
    +    +---+---+---+---------+-----+
    +    ...
    --- End diff --

    Could you move the newly added test to ```tests.py```? We keep only the basic doctests here, where they serve as both tests and examples; other tests should be placed in ```tests.py```. Thanks.
[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18122#discussion_r118796569

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -3043,26 +3055,35 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM
                                "Force to index label whether it is numeric or string",
                                typeConverter=TypeConverters.toBoolean)
    +    stringIndexerOrderType = Param(Params._dummy(), "stringIndexerOrderType",
    +                                   "How to order categories of a string FEATURE column used by " +
    --- End diff --

    Changed it to lower case now.
[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18122#discussion_r118780954

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -3043,26 +3055,35 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM
                                "Force to index label whether it is numeric or string",
                                typeConverter=TypeConverters.toBoolean)
    +    stringIndexerOrderType = Param(Params._dummy(), "stringIndexerOrderType",
    +                                   "How to order categories of a string FEATURE column used by " +
    --- End diff --

    Is capitalizing FEATURE common here?
[GitHub] spark pull request #18122: [SPARK-20899][PySpark] PySpark supports stringInd...
GitHub user actuaryzhang opened a pull request:

    https://github.com/apache/spark/pull/18122

    [SPARK-20899][PySpark] PySpark supports stringIndexerOrderType in RFormula

    ## What changes were proposed in this pull request?

    PySpark supports stringIndexerOrderType in RFormula as in #17967.

    ## How was this patch tested?

    docstring test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark PythonRFormula

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18122

commit 4bca4d95613e6e18361de8fe0a36667182c2d446
Author: actuaryzhang
Date: 2017-05-26T07:40:22Z

    Pyhton port for Rformula stringIndexerOrderType