[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-07-13 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/17922


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-07-11 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r126767873
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -170,6 +170,15 @@ def toVector(value):
             raise TypeError("Could not convert %s to vector" % value)
 
     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
+        """
+        if isinstance(value, Matrix):
+            return value
--- End diff --

Is this method really necessary?  It's not actually converting anything, 
just checking the type.  If this wasn't here and the user put something other 
than a Matrix, what error would be raised?
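To make the question concrete: in the snippet as quoted, a non-Matrix value falls off the end of the `isinstance` check (presumably the PR raises a `TypeError` after it, mirroring `toVector`, but that part is cut off above). A minimal stand-in sketch of the two behaviors, using dummy classes in place of the real `pyspark.ml.linalg` types:

```python
# Stand-ins for pyspark.ml.linalg.Matrix / DenseMatrix (hypothetical, so
# this sketch runs without pyspark installed).
class Matrix:
    pass

class DenseMatrix(Matrix):
    def __init__(self, values):
        self.values = values

def toMatrix(value):
    """Convert a value to a Matrix, if possible."""
    if isinstance(value, Matrix):
        return value
    # Without an explicit raise here, the function would implicitly return
    # None and the bad value would only fail later, far from the call site.
    raise TypeError("Could not convert %s to matrix" % value)

m = toMatrix(DenseMatrix([[1.0, 2.0]]))  # a Matrix passes through unchanged
try:
    toMatrix([[1.0, 2.0]])               # a plain list is rejected early
except TypeError as exc:
    message = str(exc)
```

So the method is less a converter than a type guard: its value is the early, descriptive `TypeError` rather than any actual conversion.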





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-07-11 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r126765747
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -246,18 +246,55 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti
                     "be used in the model. Supported options: auto, binomial, multinomial",
                     typeConverter=TypeConverters.toString)
 
+    lowerBoundsOnCoefficients = Param(Params._dummy(), "lowerBoundsOnCoefficients",
+                                      "The lower bounds on coefficients if fitting under bound "
+                                      "constrained optimization. The bound matrix must be "
+                                      "compatible with the shape "
+                                      "(1, number of features) for binomial regression, or "
+                                      "(number of classes, number of features) "
+                                      "for multinomial regression.",
+                                      typeConverter=TypeConverters.toMatrix)
+
+    upperBoundsOnCoefficients = Param(Params._dummy(), "upperBoundsOnCoefficients",
+                                      "The upper bounds on coefficients if fitting under bound "
+                                      "constrained optimization. The bound matrix must be "
+                                      "compatible with the shape "
+                                      "(1, number of features) for binomial regression, or "
+                                      "(number of classes, number of features) "
+                                      "for multinomial regression.",
+                                      typeConverter=TypeConverters.toMatrix)
--- End diff --

I think you can condense this and the above text blocks some more
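One way to act on this suggestion would be to share a single description template between the two near-identical Params (a sketch only; the `%s` template constant below is hypothetical, not the PR's actual code):

```python
# Hypothetical shared template for the lower/upper coefficient-bounds
# Param descriptions, so the long text appears only once.
_BOUNDS_ON_COEFFICIENTS_DESC = (
    "The %s bounds on coefficients if fitting under bound constrained "
    "optimization. The bound matrix must be compatible with the shape "
    "(1, number of features) for binomial regression, or "
    "(number of classes, number of features) for multinomial regression."
)

# Each Param then substitutes only the word that differs.
lower_desc = _BOUNDS_ON_COEFFICIENTS_DESC % "lower"
upper_desc = _BOUNDS_ON_COEFFICIENTS_DESC % "upper"
```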





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-07-11 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r126769478
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
 except OSError:
 pass
 
+def logistic_regression_check_thresholds(self):
+self.assertIsInstance(
+LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+LogisticRegressionModel
+)
+
+self.assertRaisesRegexp(
+ValueError,
+"Logistic Regression getThreshold found inconsistent.*$",
+LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+)
+
+def test_binomial_logistic_regression_bounds(self):
--- End diff --

I agree this is probably overkill for testing this. The functionality is 
already in Scala and should be tested there; here in Python we are just 
setting the parameters.





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-07-11 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r126766060
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -246,18 +246,55 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti
                     "be used in the model. Supported options: auto, binomial, multinomial",
                     typeConverter=TypeConverters.toString)
 
+    lowerBoundsOnCoefficients = Param(Params._dummy(), "lowerBoundsOnCoefficients",
+                                      "The lower bounds on coefficients if fitting under bound "
+                                      "constrained optimization. The bound matrix must be "
+                                      "compatible with the shape "
+                                      "(1, number of features) for binomial regression, or "
+                                      "(number of classes, number of features) "
+                                      "for multinomial regression.",
+                                      typeConverter=TypeConverters.toMatrix)
+
+    upperBoundsOnCoefficients = Param(Params._dummy(), "upperBoundsOnCoefficients",
+                                      "The upper bounds on coefficients if fitting under bound "
+                                      "constrained optimization. The bound matrix must be "
+                                      "compatible with the shape "
+                                      "(1, number of features) for binomial regression, or "
+                                      "(number of classes, number of features) "
+                                      "for multinomial regression.",
+                                      typeConverter=TypeConverters.toMatrix)
+
+    lowerBoundsOnIntercepts = Param(Params._dummy(), "lowerBoundsOnIntercepts",
+                                    "The lower bounds on intercepts if fitting under bound "
+                                    "constrained optimization. The bounds vector size must be"
+                                    "equal with 1 for binomial regression, or the number of"
+                                    "lasses for multinomial regression.",
+                                    typeConverter=TypeConverters.toVector)
+
+    upperBoundsOnIntercepts = Param(Params._dummy(), "upperBoundsOnIntercepts",
+                                    "The upper bounds on intercepts if fitting under bound "
+                                    "constrained optimization. The bound vector size must be "
+                                    "equal with 1 for binomial regression, or the number of "
+                                    "classes for multinomial regression.",
+                                    typeConverter=TypeConverters.toVector)
+
     @keyword_only
     def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
                  maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True,
                  threshold=0.5, thresholds=None, probabilityCol="probability",
                  rawPredictionCol="rawPrediction", standardization=True, weightCol=None,
-                 aggregationDepth=2, family="auto"):
+                 aggregationDepth=2, family="auto",
+                 lowerBoundsOnCoefficients=None, upperBoundsOnCoefficients=None,
--- End diff --

should fill up the previous line before starting another, here and below





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-06-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r123769816
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -170,6 +170,15 @@ def toVector(value):
             raise TypeError("Could not convert %s to vector" % value)
 
     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
--- End diff --

This is not a big issue, but you can still refer to the 
[clarification](https://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api)
 in the MLlib user guide for the naming convention in MLlib.





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-06-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r123768466
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
 except OSError:
 pass
 
+def logistic_regression_check_thresholds(self):
+self.assertIsInstance(
+LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+LogisticRegressionModel
+)
+
+self.assertRaisesRegexp(
+ValueError,
+"Logistic Regression getThreshold found inconsistent.*$",
+LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+)
+
+def test_binomial_logistic_regression_bounds(self):
--- End diff --

For PySpark, we should only check that the output is consistent with Scala. 
The most straightforward way to write this test is to load data directly and 
run constrained LR on it:
```
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
..
```
This will make the test case simple and time-saving. Thanks.





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-06-23 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r123766355
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
 except OSError:
 pass
 
+def logistic_regression_check_thresholds(self):
+self.assertIsInstance(
+LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+LogisticRegressionModel
+)
+
+self.assertRaisesRegexp(
+ValueError,
+"Logistic Regression getThreshold found inconsistent.*$",
+LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+)
+
+def test_binomial_logistic_regression_bounds(self):
--- End diff --

Example datasets are not that good for checking constraints, and a generator 
seems like a better idea than creating a large enough example by hand. I can 
of course remove it, if this is an issue.





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-06-23 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r123765079
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -170,6 +170,15 @@ def toVector(value):
             raise TypeError("Could not convert %s to vector" % value)
 
     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
--- End diff --

While I am aware of this, the distinction between `ml.linalg` and 
`mllib.linalg` is a common source of confusion for PySpark users. Of 
course we could be more forgiving, and automatically convert objects to the 
required class.





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-06-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r123707614
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -170,6 +170,15 @@ def toVector(value):
             raise TypeError("Could not convert %s to vector" % value)
 
     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
--- End diff --

```ML``` -> ```MLlib```; ```MLlib``` is the only official name for both the 
```spark.mllib``` and ```spark.ml``` packages.





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-06-23 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r123711301
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
 except OSError:
 pass
 
+def logistic_regression_check_thresholds(self):
+self.assertIsInstance(
+LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+LogisticRegressionModel
+)
+
+self.assertRaisesRegexp(
+ValueError,
+"Logistic Regression getThreshold found inconsistent.*$",
+LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+)
+
+def test_binomial_logistic_regression_bounds(self):
--- End diff --

@zero323 We usually run PySpark MLlib tests by loading a dataset from 
```data/mllib/``` or manually generating a dummy/hard-coded dataset, rather 
than rewriting the same test case as in Scala. We keep PySpark tests as simple 
as possible. You can refer to [this test 
case](https://github.com/apache/spark/blob/master/python/pyspark/ml/tests.py#L664).
 Thanks.





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-06-03 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r119992981
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -819,6 +847,84 @@ def logistic_regression_check_thresholds(self):
             LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
         )
 
+    def test_binomial_logistic_regression_bounds(self):
--- End diff --

Usually there is no need to write exactly the same test as Scala in PySpark; 
we can use a simple test that loads a dataset or generates a very simple 
dataset and runs constrained LR on it. You can refer to test cases in 
[tests.py](https://github.com/apache/spark/blob/master/python/pyspark/ml/tests.py)
 or other tests like 
[this](https://github.com/apache/spark/blob/master/python/pyspark/ml/classification.py#L200).





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-05-09 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r11267
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -374,6 +415,48 @@ def getFamily(self):
 """
 return self.getOrDefault(self.family)
 
+@since("2.2.0")
--- End diff --

Probably. I've seen that the Scala version has been targeted for 2.2.1, so who 
knows? But let's make it 2.3.





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-05-09 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r115497704
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -71,6 +71,34 @@
 ser = PickleSerializer()
 
 
+def generate_multinomial_logistic_input(
+        weights, x_mean, x_variance, add_intercept, n_points, seed=None):
+    """Creates multinomial logistic dataset"""
+
+    if seed:
+        np.random.seed(seed)
+    n_features = x_mean.shape[0]
+
+    x = np.random.randn(n_points, n_features)
+    x = x * np.sqrt(x_variance) + x_mean
+
+    if add_intercept:
+        x = np.hstack([x, np.ones((n_points, 1))])
+
+    # Compute margins
+    margins = np.hstack([np.zeros((n_points, 1)), x.dot(weights.T)])
+    # Shift to avoid overflow and compute probs
+    probs = np.exp(np.subtract(margins, margins.max(axis=1).reshape(n_points, -1)))
+    # Compute cumulative prob
+    cum_probs = np.cumsum(probs / probs.sum(axis=1).reshape(n_points, -1), axis=1)
+    # Asign class
--- End diff --

"Assign class", though IMO you could also just do away with the comments in 
this section.
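For reference, the generator above can be exercised standalone. In this sketch the truncated class-assignment step is reconstructed with a uniform draw against the cumulative probabilities (the `argmax` trick below is my reconstruction, not necessarily what the PR used), and the example weights and shapes are illustrative:

```python
import numpy as np

def generate_multinomial_logistic_input(
        weights, x_mean, x_variance, add_intercept, n_points, seed=None):
    """Creates a multinomial logistic dataset as (features, labels) arrays."""
    if seed is not None:
        np.random.seed(seed)
    n_features = x_mean.shape[0]

    x = np.random.randn(n_points, n_features)
    x = x * np.sqrt(x_variance) + x_mean
    if add_intercept:
        x = np.hstack([x, np.ones((n_points, 1))])

    # Margins: class 0 is the pivot, so its margin is fixed at zero
    margins = np.hstack([np.zeros((n_points, 1)), x.dot(weights.T)])
    # Shift by the row max to avoid overflow in exp, then normalize
    probs = np.exp(margins - margins.max(axis=1).reshape(n_points, -1))
    cum_probs = np.cumsum(probs / probs.sum(axis=1).reshape(n_points, -1), axis=1)
    # Reconstructed step: draw one uniform per row and take the first class
    # whose cumulative probability reaches it
    y = (np.random.rand(n_points, 1) <= cum_probs).argmax(axis=1)
    return x, y

# Illustrative call: 2 non-pivot classes over 2 features plus an intercept
weights = np.array([[1.0, -2.0, 0.5], [-1.5, 0.7, 0.0]])
x_mean = np.array([0.0, 1.0])
x_variance = np.array([1.0, 2.0])
x, y = generate_multinomial_logistic_input(
    weights, x_mean, x_variance, add_intercept=True, n_points=50, seed=42)
```

With a fixed `seed` the dataset is reproducible, which is the property that makes a generator competitive with a hard-coded example for bound-constraint tests.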





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-05-09 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r115497473
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -374,6 +415,48 @@ def getFamily(self):
 """
 return self.getOrDefault(self.family)
 
+@since("2.2.0")
--- End diff --

Since we're voting on 2.2 now, I presume this will make it for 2.3.

