[GitHub] spark pull request #19892: [SPARK-22797][PySpark] Bucketizer support multi-c...

2018-01-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19892


---




[GitHub] spark pull request #19892: [SPARK-22797][PySpark] Bucketizer support multi-c...

2018-01-22 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/19892#discussion_r162900053
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -315,13 +315,19 @@ class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable)
 
 
 @inherit_doc
-class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
-                 JavaMLReadable, JavaMLWritable):
-    """
-    Maps a column of continuous features to a column of feature buckets.
-
-    >>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
-    >>> df = spark.createDataFrame(values, ["values"])
+class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols,
+                 HasHandleInvalid, JavaMLReadable, JavaMLWritable):
+    """
+    Maps a column of continuous features to a column of feature buckets. Since 2.3.0,
+    :py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols`
+    parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters
+    are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while
--- End diff --

@holdenk this comment will need to be changed as per #19993 - but that has
not been merged yet. I think #19993 will block 2.3 though, so we could
preemptively change the doc here to match the Scala side in #19993 about
throwing an exception.
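
For illustration, a minimal sketch of the check #19993 points toward, rendered in Python (the method name and message below are assumptions; the actual change is on the Scala side):

    # Hypothetical validation mirroring the direction of #19993:
    # raise instead of logging a warning when both the single-column
    # and multi-column params are set. `_validate_params` is an
    # assumed name, not actual pyspark API.
    def _validate_params(self):
        if self.isSet(self.inputCol) and self.isSet(self.inputCols):
            raise ValueError(
                "Only one of inputCol and inputCols must be set for Bucketizer.")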


---




[GitHub] spark pull request #19892: [SPARK-22797][PySpark] Bucketizer support multi-c...

2018-01-16 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/19892#discussion_r161719111
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -317,26 +317,34 @@ class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable)
 
 
 @inherit_doc
-class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
-                 JavaMLReadable, JavaMLWritable):
-    """
-    Maps a column of continuous features to a column of feature buckets.
-
-    >>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
-    >>> df = spark.createDataFrame(values, ["values"])
+class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols,
+                 HasHandleInvalid, JavaMLReadable, JavaMLWritable):
+    """
+    Maps a column of continuous features to a column of feature buckets. Since 2.3.0,
+    :py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols`
+    parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters
+    are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while
+    :py:attr:`inputCols` will be ignored. The :py:attr:`splits` parameter is only used for single
+    column usage, and :py:attr:`splitsArray` is for multiple columns.
+
+    >>> values = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, float("nan")),
+    ...           (float("nan"), 1.0), (float("nan"), 0.0)]
+    >>> df = spark.createDataFrame(values, ["values1", "values2"])
     >>> bucketizer = Bucketizer(splits=[-float("inf"), 0.5, 1.4, float("inf")],
-    ...                         inputCol="values", outputCol="buckets")
-    >>> bucketed = bucketizer.setHandleInvalid("keep").transform(df).collect()
-    >>> len(bucketed)
-    6
-    >>> bucketed[0].buckets
-    0.0
-    >>> bucketed[1].buckets
-    0.0
-    >>> bucketed[2].buckets
-    1.0
-    >>> bucketed[3].buckets
-    2.0
+    ...                         inputCol="values1", outputCol="buckets")
+    >>> bucketed = bucketizer.setHandleInvalid("keep").transform(df)
--- End diff --

It may actually be neater to show only `values1` and `bucketed` - so 
perhaps `.transform(df.select('values1'))`?
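
Sketched out, that suggestion might read something like this (a hedged sketch: the expected rows are hand-computed for handleInvalid="keep", where NaN values land in an extra bucket, and show()'s column padding is approximated):

    >>> bucketed = bucketizer.setHandleInvalid("keep").transform(df.select("values1"))
    >>> bucketed.show()
    +-------+-------+
    |values1|buckets|
    +-------+-------+
    |    0.1|    0.0|
    |    0.4|    0.0|
    |    1.2|    1.0|
    |    1.5|    2.0|
    |    NaN|    3.0|
    |    NaN|    3.0|
    +-------+-------+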


---




[GitHub] spark pull request #19892: [SPARK-22797][PySpark] Bucketizer support multi-c...

2018-01-16 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/19892#discussion_r161683821
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -317,13 +317,19 @@ class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable)
 
 
 @inherit_doc
-class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
-                 JavaMLReadable, JavaMLWritable):
-    """
-    Maps a column of continuous features to a column of feature buckets.
-
-    >>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
-    >>> df = spark.createDataFrame(values, ["values"])
+class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols,
+                 HasHandleInvalid, JavaMLReadable, JavaMLWritable):
+    """
+    Maps a column of continuous features to a column of feature buckets. Since 2.3.0,
+    :py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols`
+    parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters
+    are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while
+    :py:attr:`inputCols` will be ignored. The :py:attr:`splits` parameter is only used for single
+    column usage, and :py:attr:`splitsArray` is for multiple columns.
+
+    >>> values = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, float("nan")),
+    ...           (float("nan"), 1.0), (float("nan"), 0.0)]
+    >>> df = spark.createDataFrame(values, ["values", "numbers"])
--- End diff --

`values1` & `values2`?


---




[GitHub] spark pull request #19892: [SPARK-22797][PySpark] Bucketizer support multi-c...

2018-01-16 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/19892#discussion_r161683714
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -347,6 +353,28 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
     >>> bucketed = bucketizer.setHandleInvalid("skip").transform(df).collect()
     >>> len(bucketed)
     4
+    >>> bucketizer2 = Bucketizer(splitsArray=
+    ...     [[-float("inf"), 0.5, 1.4, float("inf")], [-float("inf"), 0.5, float("inf")]],
+    ...     inputCols=["values", "numbers"], outputCols=["buckets1", "buckets2"])
+    >>> bucketed2 = bucketizer2.setHandleInvalid("keep").transform(df).collect()
+    >>> len(bucketed2)
+    6
+    >>> bucketed2[0].buckets1
--- End diff --

Perhaps it would be cleaner to do a `df.show()` here? Likewise above for 
`bucketed` we could change that part of the doctest too.
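
A hedged sketch of what the multi-column part could look like with show() (the expected output is hand-computed from the splitsArray above for handleInvalid="keep", where NaN values go to an extra bucket; show()'s exact column widths are an approximation):

    >>> bucketed2 = bucketizer2.setHandleInvalid("keep").transform(df)
    >>> bucketed2.show()
    +------+-------+--------+--------+
    |values|numbers|buckets1|buckets2|
    +------+-------+--------+--------+
    |   0.1|    0.0|     0.0|     0.0|
    |   0.4|    1.0|     0.0|     1.0|
    |   1.2|    1.3|     1.0|     1.0|
    |   1.5|    NaN|     2.0|     2.0|
    |   NaN|    1.0|     3.0|     1.0|
    |   NaN|    0.0|     3.0|     0.0|
    +------+-------+--------+--------+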


---




[GitHub] spark pull request #19892: [SPARK-22797][PySpark] Bucketizer support multi-c...

2018-01-16 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/19892#discussion_r161684641
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -134,6 +134,16 @@ def toListFloat(value):
                 return [float(v) for v in value]
         raise TypeError("Could not convert %s to list of floats" % value)
 
+    @staticmethod
+    def toListListFloat(value):
--- End diff --

We need a test case in `ParamTypeConversionTests` for this new method; see 
`test_list_float` for reference.
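
A minimal sketch of such a test, modeled on `test_list_float` (the exact assertions below are assumptions, not the final test):

    def test_list_list_float(self):
        # Ints inside the nested lists should be coerced to floats.
        b = Bucketizer(splitsArray=[[-0.1, 0.5, 3], [-5, 1.5]])
        self.assertEqual(b.getSplitsArray(), [[-0.1, 0.5, 3.0], [-5.0, 1.5]])
        self.assertTrue(all([type(v) == list for v in b.getSplitsArray()]))
        self.assertTrue(all([type(v) == float for v in b.getSplitsArray()[0]]))
        # Non-numeric values should raise, whether at the top level or nested.
        self.assertRaises(TypeError, lambda: Bucketizer(splitsArray=["a", 1.0]))
        self.assertRaises(TypeError,
                          lambda: Bucketizer(splitsArray=[[-5, 1.5], ["a", 1.0]]))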


---




[GitHub] spark pull request #19892: [SPARK-22797][PySpark] Bucketizer support multi-c...

2017-12-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19892#discussion_r157685008
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -315,13 +315,19 @@ class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable)
 
 
 @inherit_doc
-class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
-                 JavaMLReadable, JavaMLWritable):
-    """
-    Maps a column of continuous features to a column of feature buckets.
-
-    >>> values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
-    >>> df = spark.createDataFrame(values, ["values"])
+class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols,
+                 HasHandleInvalid, JavaMLReadable, JavaMLWritable):
+    """
+    Maps a column of continuous features to a column of feature buckets. Since 2.3.0,
+    :py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols`
+    parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters
+    are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while
--- End diff --

Note: there is work in progress to change this behavior to throw an exception
instead of a log warning. Remember to update this documentation later.


---
