[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/17922

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r126767873

--- Diff: python/pyspark/ml/param/__init__.py ---
```
@@ -170,6 +170,15 @@ def toVector(value):
         raise TypeError("Could not convert %s to vector" % value)

     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
+        """
+        if isinstance(value, Matrix):
+            return value
```
--- End diff --

Is this method really necessary? It's not actually converting anything, just checking the type. If this weren't here and the user passed something other than a Matrix, what error would be raised?
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r126765747

--- Diff: python/pyspark/ml/classification.py ---
```
@@ -246,18 +246,55 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti
                    "be used in the model. Supported options: auto, binomial, multinomial",
                    typeConverter=TypeConverters.toString)

+    lowerBoundsOnCoefficients = Param(Params._dummy(), "lowerBoundsOnCoefficients",
+                                      "The lower bounds on coefficients if fitting under bound "
+                                      "constrained optimization. The bound matrix must be "
+                                      "compatible with the shape "
+                                      "(1, number of features) for binomial regression, or "
+                                      "(number of classes, number of features) "
+                                      "for multinomial regression.",
+                                      typeConverter=TypeConverters.toMatrix)
+
+    upperBoundsOnCoefficients = Param(Params._dummy(), "upperBoundsOnCoefficients",
+                                      "The upper bounds on coefficients if fitting under bound "
+                                      "constrained optimization. The bound matrix must be "
+                                      "compatible with the shape "
+                                      "(1, number of features) for binomial regression, or "
+                                      "(number of classes, number of features) "
+                                      "for multinomial regression.",
+                                      typeConverter=TypeConverters.toMatrix)
```
--- End diff --

I think you can condense this and the text blocks above somewhat.
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r126769478

--- Diff: python/pyspark/ml/tests.py ---
```
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
         except OSError:
             pass

+    def logistic_regression_check_thresholds(self):
+        self.assertIsInstance(
+            LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+            LogisticRegressionModel
+        )
+
+        self.assertRaisesRegexp(
+            ValueError,
+            "Logistic Regression getThreshold found inconsistent.*$",
+            LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+        )
+
+    def test_binomial_logistic_regression_bounds(self):
```
--- End diff --

I agree this is probably overkill for testing this. The functionality is already in Scala and should be tested there; here in Python we are just setting the parameters.
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r126766060

--- Diff: python/pyspark/ml/classification.py ---
```
@@ -246,18 +246,55 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti
                    "be used in the model. Supported options: auto, binomial, multinomial",
                    typeConverter=TypeConverters.toString)

+    lowerBoundsOnCoefficients = Param(Params._dummy(), "lowerBoundsOnCoefficients",
+                                      "The lower bounds on coefficients if fitting under bound "
+                                      "constrained optimization. The bound matrix must be "
+                                      "compatible with the shape "
+                                      "(1, number of features) for binomial regression, or "
+                                      "(number of classes, number of features) "
+                                      "for multinomial regression.",
+                                      typeConverter=TypeConverters.toMatrix)
+
+    upperBoundsOnCoefficients = Param(Params._dummy(), "upperBoundsOnCoefficients",
+                                      "The upper bounds on coefficients if fitting under bound "
+                                      "constrained optimization. The bound matrix must be "
+                                      "compatible with the shape "
+                                      "(1, number of features) for binomial regression, or "
+                                      "(number of classes, number of features) "
+                                      "for multinomial regression.",
+                                      typeConverter=TypeConverters.toMatrix)
+
+    lowerBoundsOnIntercepts = Param(Params._dummy(), "lowerBoundsOnIntercepts",
+                                    "The lower bounds on intercepts if fitting under bound "
+                                    "constrained optimization. The bounds vector size must be "
+                                    "equal with 1 for binomial regression, or the number of "
+                                    "classes for multinomial regression.",
+                                    typeConverter=TypeConverters.toVector)
+
+    upperBoundsOnIntercepts = Param(Params._dummy(), "upperBoundsOnIntercepts",
+                                    "The upper bounds on intercepts if fitting under bound "
+                                    "constrained optimization. The bound vector size must be "
+                                    "equal with 1 for binomial regression, or the number of "
+                                    "classes for multinomial regression.",
+                                    typeConverter=TypeConverters.toVector)
+
     @keyword_only
     def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
                  maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True,
                  threshold=0.5, thresholds=None, probabilityCol="probability",
                  rawPredictionCol="rawPrediction", standardization=True, weightCol=None,
-                 aggregationDepth=2, family="auto"):
+                 aggregationDepth=2, family="auto",
+                 lowerBoundsOnCoefficients=None, upperBoundsOnCoefficients=None,
```
--- End diff --

You should fill up the previous line before starting another, here and below.
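The shapes described in the parameter docs above are easy to get wrong, so here is a small numpy sketch of what each bound matrix and vector must look like. The variable names are illustrative only, not part of the PySpark API:

```python
import numpy as np

n_features, n_classes = 3, 4

# Coefficient bounds: shape (1, number of features) for binomial regression
lower_coef_binomial = np.zeros((1, n_features))

# ...or shape (number of classes, number of features) for multinomial regression
lower_coef_multinomial = np.zeros((n_classes, n_features))

# Intercept bounds: a vector of size 1 for binomial regression,
# or size = number of classes for multinomial regression
lower_int_binomial = np.zeros(1)
lower_int_multinomial = np.zeros(n_classes)
```

These arrays would still need to be converted to `pyspark.ml.linalg` matrix/vector types (the job of the `toMatrix`/`toVector` converters discussed in this thread) before being passed to the estimator.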
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r123769816

--- Diff: python/pyspark/ml/param/__init__.py ---
```
@@ -170,6 +170,15 @@ def toVector(value):
         raise TypeError("Could not convert %s to vector" % value)

     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
```
--- End diff --

This is not a big issue, but you can still refer to the [clarification](https://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api) in the MLlib user guide to see the naming convention in MLlib.
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r123768466

--- Diff: python/pyspark/ml/tests.py ---
```
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
         except OSError:
             pass

+    def logistic_regression_check_thresholds(self):
+        self.assertIsInstance(
+            LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+            LogisticRegressionModel
+        )
+
+        self.assertRaisesRegexp(
+            ValueError,
+            "Logistic Regression getThreshold found inconsistent.*$",
+            LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+        )
+
+    def test_binomial_logistic_regression_bounds(self):
```
--- End diff --

For PySpark, we only need to check that the output is consistent with Scala. The most straightforward way to write this test is to load the data directly and run constrained LR on it:
```
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
..
```
This will make the test case simple and time-saving. Thanks.
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r123766355

--- Diff: python/pyspark/ml/tests.py ---
```
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
         except OSError:
             pass

+    def logistic_regression_check_thresholds(self):
+        self.assertIsInstance(
+            LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+            LogisticRegressionModel
+        )
+
+        self.assertRaisesRegexp(
+            ValueError,
+            "Logistic Regression getThreshold found inconsistent.*$",
+            LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+        )
+
+    def test_binomial_logistic_regression_bounds(self):
```
--- End diff --

Example datasets are not that good for checking constraints, and a generator seems like a better idea than creating a large enough example by hand. I can of course remove it if this is an issue.
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r123765079

--- Diff: python/pyspark/ml/param/__init__.py ---
```
@@ -170,6 +170,15 @@ def toVector(value):
         raise TypeError("Could not convert %s to vector" % value)

     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
```
--- End diff --

While I am aware of this, the distinction between `ml.linalg` and `mllib.linalg` is a common source of confusion for PySpark users. Of course, we could be more forgiving and automatically convert objects to the required class.
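A "more forgiving" converter of the kind suggested here could, for example, accept 2-D numpy arrays in addition to existing Matrix instances. The sketch below illustrates the pattern with a minimal placeholder `Matrix` class standing in for `pyspark.ml.linalg.Matrix` (the class, the `to_matrix` helper, and the column-major storage assumption all mirror but are not taken from the PR's actual code):

```python
import numpy as np


class Matrix:
    """Minimal placeholder for pyspark.ml.linalg.Matrix (hypothetical, for illustration)."""

    def __init__(self, num_rows, num_cols, values):
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.values = list(values)  # stored column-major, as DenseMatrix does


def to_matrix(value):
    """Return `value` as a Matrix, converting 2-D numpy arrays along the way."""
    if isinstance(value, Matrix):
        return value
    if isinstance(value, np.ndarray) and value.ndim == 2:
        rows, cols = value.shape
        # ml.linalg matrices are column-major, so flatten with order="F"
        return Matrix(rows, cols, value.ravel(order="F"))
    raise TypeError("Could not convert %s to matrix" % str(value))
```

Mirroring the existing `toVector` converter, anything unconvertible raises a `TypeError` with a clear message, instead of failing later with an opaque Py4J error on the JVM side.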
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r123707614

--- Diff: python/pyspark/ml/param/__init__.py ---
```
@@ -170,6 +170,15 @@ def toVector(value):
         raise TypeError("Could not convert %s to vector" % value)

     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
```
--- End diff --

```ML``` -> ```MLlib```; ```MLlib``` is the only official name for both the ```spark.mllib``` and ```spark.ml``` packages.
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r123711301

--- Diff: python/pyspark/ml/tests.py ---
```
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
         except OSError:
             pass

+    def logistic_regression_check_thresholds(self):
+        self.assertIsInstance(
+            LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+            LogisticRegressionModel
+        )
+
+        self.assertRaisesRegexp(
+            ValueError,
+            "Logistic Regression getThreshold found inconsistent.*$",
+            LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+        )
+
+    def test_binomial_logistic_regression_bounds(self):
```
--- End diff --

@zero323 We usually run PySpark MLlib tests by loading a dataset from ```data/mllib/``` or manually generating a dummy/hard-coded dataset, rather than rewriting the same test case as in Scala. We keep PySpark tests as simple as possible. You can refer to [this test case](https://github.com/apache/spark/blob/master/python/pyspark/ml/tests.py#L664). Thanks.
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r119992981

--- Diff: python/pyspark/ml/tests.py ---
```
@@ -819,6 +847,84 @@ def logistic_regression_check_thresholds(self):
             LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
         )

+    def test_binomial_logistic_regression_bounds(self):
```
--- End diff --

Usually there's no need to write exactly the same test as in Scala for PySpark; we can use a simple test that loads a dataset, or generates a very simple one, and runs constrained LR on it. You can refer to the test cases in [tests.py](https://github.com/apache/spark/blob/master/python/pyspark/ml/tests.py) or other tests like [this](https://github.com/apache/spark/blob/master/python/pyspark/ml/classification.py#L200).
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r11267

--- Diff: python/pyspark/ml/classification.py ---
```
@@ -374,6 +415,48 @@ def getFamily(self):
         """
         return self.getOrDefault(self.family)

+    @since("2.2.0")
```
--- End diff --

Probably. I've seen that the Scala version has been targeted for 2.2.1, so who knows? But let's make it 2.3.
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r115497704

--- Diff: python/pyspark/ml/tests.py ---
```
@@ -71,6 +71,34 @@
 ser = PickleSerializer()

+def generate_multinomial_logistic_input(
+        weights, x_mean, x_variance, add_intercept, n_points, seed=None):
+    """Creates multinomial logistic dataset"""
+
+    if seed:
+        np.random.seed(seed)
+    n_features = x_mean.shape[0]
+
+    x = np.random.randn(n_points, n_features)
+    x = x * np.sqrt(x_variance) + x_mean
+
+    if add_intercept:
+        x = np.hstack([x, np.ones((n_points, 1))])
+
+    # Compute margins
+    margins = np.hstack([np.zeros((n_points, 1)), x.dot(weights.T)])
+    # Shift to avoid overflow and compute probs
+    probs = np.exp(np.subtract(margins, margins.max(axis=1).reshape(n_points, -1)))
+    # Compute cumulative prob
+    cum_probs = np.cumsum(probs / probs.sum(axis=1).reshape(n_points, -1), axis=1)
+    # Asign class
```
--- End diff --

"Assign class", though IMO you could also just do away with the comments in this section.
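The quoted diff cuts off right after the "Asign class" comment. One common way to complete that step is inverse-CDF sampling: draw one uniform number per point and count how many cumulative probabilities fall below it. This is a sketch of that technique, not necessarily the PR's exact code:

```python
import numpy as np

rng = np.random.default_rng(42)

# Cumulative class probabilities per point; each row ends at 1.0
# (this is the shape cum_probs has in the generator above)
cum_probs = np.array([[0.2, 0.5, 1.0],
                      [0.7, 0.9, 1.0]])

# Draw one uniform number per point...
u = rng.uniform(size=(cum_probs.shape[0], 1))

# ...and pick the first class whose cumulative probability exceeds it:
# counting the entries strictly below u gives that class index directly
labels = (cum_probs < u).sum(axis=1)
```

For instance, a draw of u = 0.6 against the first row (cumulative probs 0.2, 0.5, 1.0) lands in the interval (0.5, 1.0], so the assigned label is 2.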
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r115497473

--- Diff: python/pyspark/ml/classification.py ---
```
@@ -374,6 +415,48 @@ def getFamily(self):
         """
         return self.getOrDefault(self.family)

+    @since("2.2.0")
```
--- End diff --

Since we're voting on 2.2 now, I presume this will make it for 2.3.