[GitHub] spark pull request #19892: [SPARK-22797][PySpark] Bucketizer support multi-c...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19892
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/19892#discussion_r162900053

--- Diff: python/pyspark/ml/feature.py ---
@@ -315,13 +315,19 @@ class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable)
 @inherit_doc
-class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
-                 JavaMLReadable, JavaMLWritable):
-    """
-    Maps a column of continuous features to a column of feature buckets.
-
-    >>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
-    >>> df = spark.createDataFrame(values, ["values"])
+class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols,
+                 HasHandleInvalid, JavaMLReadable, JavaMLWritable):
+    """
+    Maps a column of continuous features to a column of feature buckets. Since 2.3.0,
+    :py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols`
+    parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters
+    are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while
--- End diff --

@holdenk this comment will need to be changed as per #19993 - but that has not been merged yet. I think #19993 will block 2.3 though, so we could preemptively change the doc here to match the Scala side in #19993 about throwing an exception.
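For context, the behavior change under discussion (fail fast instead of warn-and-ignore when both the single-column and multi-column params are set) could be expressed along these lines. This is a hedged sketch only, not the #19993 implementation; the `check_exclusive_params` helper is hypothetical, and it assumes the `inputCols`/`outputCols` params added by this PR.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.master("local[1]").getOrCreate()


def check_exclusive_params(bucketizer):
    """Hypothetical helper mirroring the check #19993 proposes on the Scala side:
    raise an exception when both the single-column and the multi-column params
    are set, instead of logging a warning and silently ignoring inputCols."""
    if bucketizer.isSet(bucketizer.inputCol) and bucketizer.isSet(bucketizer.inputCols):
        raise ValueError(
            "Only one of inputCol and inputCols may be set, but both are set.")


# This instance sets both inputCol and inputCols, so the check raises.
b = Bucketizer(splits=[-float("inf"), 0.5, float("inf")],
               inputCol="x", outputCol="y",
               inputCols=["x1", "x2"], outputCols=["y1", "y2"])
check_exclusive_params(b)  # raises ValueError under the proposed behavior
```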
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/19892#discussion_r161719111

--- Diff: python/pyspark/ml/feature.py ---
@@ -317,26 +317,34 @@ class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable)
 @inherit_doc
-class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
-                 JavaMLReadable, JavaMLWritable):
-    """
-    Maps a column of continuous features to a column of feature buckets.
-
-    >>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
-    >>> df = spark.createDataFrame(values, ["values"])
+class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols,
+                 HasHandleInvalid, JavaMLReadable, JavaMLWritable):
+    """
+    Maps a column of continuous features to a column of feature buckets. Since 2.3.0,
+    :py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols`
+    parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters
+    are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while
+    :py:attr:`inputCols` will be ignored. The :py:attr:`splits` parameter is only used for single
+    column usage, and :py:attr:`splitsArray` is for multiple columns.
+
+    >>> values = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, float("nan")),
+    ...     (float("nan"), 1.0), (float("nan"), 0.0)]
+    >>> df = spark.createDataFrame(values, ["values1", "values2"])
     >>> bucketizer = Bucketizer(splits=[-float("inf"), 0.5, 1.4, float("inf")],
-    ...     inputCol="values", outputCol="buckets")
-    >>> bucketed = bucketizer.setHandleInvalid("keep").transform(df).collect()
-    >>> len(bucketed)
-    6
-    >>> bucketed[0].buckets
-    0.0
-    >>> bucketed[1].buckets
-    0.0
-    >>> bucketed[2].buckets
-    1.0
-    >>> bucketed[3].buckets
-    2.0
+    ...     inputCol="values1", outputCol="buckets")
+    >>> bucketed = bucketizer.setHandleInvalid("keep").transform(df)
--- End diff --

It may actually be neater to show only `values1` and `bucketed` - so perhaps `.transform(df.select('values1'))`?
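For reference, the suggested single-column doctest could read roughly as follows; this is a sketch combining `df.select("values1")` with the `show()` suggestion from the later comment, and the exact `show(truncate=False)` rendering shown here is illustrative rather than verified doctest output.

```python
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df.select("values1"))
>>> bucketed.show(truncate=False)
+-------+-------+
|values1|buckets|
+-------+-------+
|0.1    |0.0    |
|0.4    |0.0    |
|1.2    |1.0    |
|1.5    |2.0    |
|NaN    |3.0    |
|NaN    |3.0    |
+-------+-------+
```

With `handleInvalid="keep"` and splits `[-inf, 0.5, 1.4, inf]`, the NaN values fall into the extra bucket 3.0, so only the input column and its bucketed output appear, which keeps the doctest compact.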
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/19892#discussion_r161683821

--- Diff: python/pyspark/ml/feature.py ---
@@ -317,13 +317,19 @@ class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable)
 @inherit_doc
-class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
-                 JavaMLReadable, JavaMLWritable):
-    """
-    Maps a column of continuous features to a column of feature buckets.
-
-    >>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
-    >>> df = spark.createDataFrame(values, ["values"])
+class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols,
+                 HasHandleInvalid, JavaMLReadable, JavaMLWritable):
+    """
+    Maps a column of continuous features to a column of feature buckets. Since 2.3.0,
+    :py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols`
+    parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters
+    are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while
+    :py:attr:`inputCols` will be ignored. The :py:attr:`splits` parameter is only used for single
+    column usage, and :py:attr:`splitsArray` is for multiple columns.
+
+    >>> values = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, float("nan")),
+    ...     (float("nan"), 1.0), (float("nan"), 0.0)]
+    >>> df = spark.createDataFrame(values, ["values", "numbers"])
--- End diff --

`values1` & `values2`?
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/19892#discussion_r161683714

--- Diff: python/pyspark/ml/feature.py ---
@@ -347,6 +353,28 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
     >>> bucketed = bucketizer.setHandleInvalid("skip").transform(df).collect()
     >>> len(bucketed)
     4
+    >>> bucketizer2 = Bucketizer(splitsArray=
+    ...     [[-float("inf"), 0.5, 1.4, float("inf")], [-float("inf"), 0.5, float("inf")]],
+    ...     inputCols=["values", "numbers"], outputCols=["buckets1", "buckets2"])
+    >>> bucketed2 = bucketizer2.setHandleInvalid("keep").transform(df).collect()
+    >>> len(bucketed2)
+    6
+    >>> bucketed2[0].buckets1
--- End diff --

Perhaps it would be cleaner to do a `df.show()` here? Likewise above for `bucketed` we could change that part of the doctest too.
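A `show()`-based version of the multi-column doctest could look roughly like this, keeping the `values`/`numbers` column names from the diff above. This is a sketch; the exact `show(truncate=False)` rendering is illustrative and would need to be checked against real output before being used as a doctest.

```python
>>> bucketizer2 = Bucketizer(
...     splitsArray=[[-float("inf"), 0.5, 1.4, float("inf")],
...                  [-float("inf"), 0.5, float("inf")]],
...     inputCols=["values", "numbers"], outputCols=["buckets1", "buckets2"])
>>> bucketed2 = bucketizer2.setHandleInvalid("keep").transform(df)
>>> bucketed2.show(truncate=False)
+------+-------+--------+--------+
|values|numbers|buckets1|buckets2|
+------+-------+--------+--------+
|0.1   |0.0    |0.0     |0.0     |
|0.4   |1.0    |0.0     |1.0     |
|1.2   |1.3    |1.0     |1.0     |
|1.5   |NaN    |2.0     |2.0     |
|NaN   |1.0    |3.0     |1.0     |
|NaN   |0.0    |3.0     |0.0     |
+------+-------+--------+--------+
```

Showing the full frame makes the per-column splits visible at a glance: the first column is bucketed against three split points and the second against two, with NaN rows kept in the extra bucket because of `handleInvalid="keep"`.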
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/19892#discussion_r161684641

--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -134,6 +134,16 @@ def toListFloat(value):
                 return [float(v) for v in value]
         raise TypeError("Could not convert %s to list of floats" % value)
+    @staticmethod
+    def toListListFloat(value):
--- End diff --

We need a test case in `ParamTypeConversionTests` for this new method; see `test_list_float` for reference.
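A test modeled on `test_list_float` and added to `ParamTypeConversionTests` might look roughly like the sketch below. The `test_list_list_float` name and the specific assertions are assumptions; the intent is simply to exercise `toListListFloat` through the new `splitsArray` param.

```python
def test_list_list_float(self):
    # Ints in the nested lists should be converted to floats by toListListFloat.
    b = Bucketizer(splitsArray=[[-0.1, 0.5, 3], [-5, 1.5, 6]])
    self.assertEqual(b.getSplitsArray(), [[-0.1, 0.5, 3.0], [-5.0, 1.5, 6.0]])
    self.assertTrue(all([type(v) == list for v in b.getSplitsArray()]))
    self.assertTrue(all([type(v) == float for v in b.getSplitsArray()[0]]))
    # Values that cannot be converted to a list of lists of floats should raise.
    self.assertRaises(TypeError, lambda: Bucketizer(splitsArray=["a", 1.0]))
    self.assertRaises(TypeError, lambda: Bucketizer(splitsArray=[[-5, 1.5, 6], ["a", 1.0]]))
```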
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19892#discussion_r157685008

--- Diff: python/pyspark/ml/feature.py ---
@@ -315,13 +315,19 @@ class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable)
 @inherit_doc
-class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
-                 JavaMLReadable, JavaMLWritable):
-    """
-    Maps a column of continuous features to a column of feature buckets.
-
-    >>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
-    >>> df = spark.createDataFrame(values, ["values"])
+class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols,
+                 HasHandleInvalid, JavaMLReadable, JavaMLWritable):
+    """
+    Maps a column of continuous features to a column of feature buckets. Since 2.3.0,
+    :py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols`
+    parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters
+    are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while
--- End diff --

Note: there is work in progress to change this behavior to throw an exception instead of a log warning. Remember to change this documentation later.