[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

MLnick Fri, 17 Mar 2017 02:48:37 -0700

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17316#discussion_r106615221
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -871,6 +872,164 @@ def idf(self):
     
     
     @inherit_doc
    +class Imputer(JavaEstimator, HasInputCols, JavaMLReadable, JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Imputation estimator for completing missing values, either using the mean or the median
    +    of the column in which the missing values are located. The input column should be of
    +    DoubleType or FloatType. Currently Imputer does not support categorical features and
    +    possibly creates incorrect values for a categorical feature.
    +
    +    Note that the mean/median value is computed after filtering out missing values.
    +    All Null values in the input column are treated as missing, and so are also imputed. For
    +    computing median, :py:meth:`approxQuantile` is used with a relative error of 0.001.
    +
    +    >>> df = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan")), (float("nan"), 3.0),
    +    ...                             (4.0, 4.0), (5.0, 5.0)], ["a", "b"])
    +    >>> imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
    +    >>> model = imputer.fit(df)
    +    >>> model.surrogateDF.show()
    +    +---+---+
    +    |  a|  b|
    +    +---+---+
    +    |3.0|4.0|
    +    +---+---+
    +    ...
    +    >>> model.transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  1.0|  4.0|
    +    |2.0|NaN|  2.0|  4.0|
    +    |NaN|3.0|  3.0|  3.0|
    +    ...
    +    >>> imputer.setStrategy("median").setMissingValue(1.0).fit(df).transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  4.0|  NaN|
    +    ...
    +    >>> imputerPath = temp_path + "/imputer"
    +    >>> imputer.save(imputerPath)
    +    >>> loadedImputer = Imputer.load(imputerPath)
    +    >>> loadedImputer.getStrategy() == imputer.getStrategy()
    +    True
    +    >>> loadedImputer.getMissingValue()
    +    1.0
    +    >>> modelPath = temp_path + "/imputer-model"
    +    >>> model.save(modelPath)
    +    >>> loadedModel = ImputerModel.load(modelPath)
    +    >>> loadedModel.transform(df).head().out_a == model.transform(df).head().out_a
    +    True
    +
    +    .. versionadded:: 2.2.0
    +    """
    +
    +    outputCols = Param(Params._dummy(), "outputCols",
    +                       "output column names.", typeConverter=TypeConverters.toListString)
    +
    +    strategy = Param(Params._dummy(), "strategy",
    +                     "strategy for imputation. If mean, then replace missing values using the mean "
    +                     "value of the feature. If median, then replace missing values using the "
    +                     "median value of the feature.",
    +                     typeConverter=TypeConverters.toString)
    +
    +    missingValue = Param(Params._dummy(), "missingValue",
    +                         "The placeholder for the missing values. All occurrences of missingValue "
    +                         "will be imputed.", typeConverter=TypeConverters.toFloat)
    +
    +    @keyword_only
    +    def __init__(self, strategy="mean", missingValue=float("nan"), inputCols=None,
    +                 outputCols=None):
    +        """
    +        __init__(self, strategy="mean", missingValue=float("nan"), inputCols=None, \
    +                 outputCols=None):
    +        """
    +        super(Imputer, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.Imputer", self.uid)
    +        self._setDefault(strategy="mean", missingValue=float("nan"))
    +        kwargs = self._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.2.0")
    +    def setParams(self, strategy="mean", missingValue=float("nan"), inputCols=None,
    +                  outputCols=None):
    +        """
    +        setParams(self, strategy="mean", missingValue=float("nan"), inputCols=None, \
    +                  outputCols=None)
    +        Sets params for this Imputer.
    +        """
    +        kwargs = self._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.2.0")
    +    def setOutputCols(self, value):
    +        """
    +        Sets the value of :py:attr:`outputCols`.
    +        """
    +        return self._set(outputCols=value)
    +
    +    @since("2.2.0")
    +    def getOutputCols(self):
    +        """
    +        Gets the value of :py:attr:`outputCols` or its default value.
    +        """
    +        return self.getOrDefault(self.outputCols)
    --- End diff --
    
    Do we really need that? The first call to `$(inputCols)` in `validateAndTransformSchema`
    will just throw an error with `Failed to find a default value ...`
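
The lookup behavior the comment relies on can be sketched in plain Python. This is a toy illustration only: `ToyParams` and its method names are hypothetical stand-ins, not Spark's or PySpark's actual `Params` API, and everything after the `Failed to find a default value` prefix in the error text is the sketch's own wording. The assumption is that reading a param that has neither an explicit value nor a default fails at the first lookup, so no separate up-front check is needed to surface the problem.

```python
class ToyParams:
    """Hypothetical stand-in for a params holder (not Spark's real class)."""

    def __init__(self):
        self._values = {}    # params set explicitly by the caller
        self._defaults = {}  # params that were given a default

    def set_default(self, **kwargs):
        self._defaults.update(kwargs)
        return self

    def set(self, **kwargs):
        self._values.update(kwargs)
        return self

    def get_or_default(self, name):
        # An explicit value wins, then the default, otherwise fail loudly.
        if name in self._values:
            return self._values[name]
        if name in self._defaults:
            return self._defaults[name]
        # Wording after the prefix is made up for this sketch.
        raise LookupError("Failed to find a default value: " + name)


# Mirrors the defaults in the diff: strategy and missingValue have
# defaults, inputCols does not.
imputer = ToyParams().set_default(strategy="mean", missingValue=float("nan"))
strategy = imputer.get_or_default("strategy")

try:
    imputer.get_or_default("inputCols")
    lookup_error = None
except LookupError as exc:
    lookup_error = str(exc)
```

Here `strategy` resolves from its default, while the `inputCols` lookup raises because nothing was ever set for it.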


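The imputation semantics described in the docstring of the diff above can be sketched without Spark. This is an illustration under stated assumptions, not the PR's implementation: `surrogate` and `impute` are hypothetical helper names, `statistics.median` stands in for `approxQuantile` (it is exact, and averages the two middle values for even-sized inputs), and the rule that NaN is always left out of the statistic but only replaced when it matches `missingValue` is inferred from the doctest output.

```python
import math
from statistics import mean, median


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def _matches_placeholder(value, missing_value):
    # None always counts as missing; NaN only matches a NaN placeholder.
    if value is None:
        return True
    if _is_nan(value) or _is_nan(missing_value):
        return _is_nan(value) and _is_nan(missing_value)
    return value == missing_value


def surrogate(column, strategy="mean", missing_value=float("nan")):
    # Assumption: NaN never contributes to the statistic, whatever the placeholder.
    present = [v for v in column
               if not _matches_placeholder(v, missing_value) and not _is_nan(v)]
    return mean(present) if strategy == "mean" else median(present)


def impute(column, strategy="mean", missing_value=float("nan")):
    # Replace only the entries that match the placeholder (or are None).
    fill = surrogate(column, strategy, missing_value)
    return [fill if _matches_placeholder(v, missing_value) else v for v in column]


nan = float("nan")
col_a = [1.0, 2.0, nan, 4.0, 5.0]   # column "a" from the doctest
col_b = [nan, nan, 3.0, 4.0, 5.0]   # column "b" from the doctest

mean_a = impute(col_a)
mean_b = impute(col_b)
median_a = impute(col_a, strategy="median", missing_value=1.0)
median_b = impute(col_b, strategy="median", missing_value=1.0)
```

With the doctest's data, the mean strategy fills column `a` with 3.0 and column `b` with 4.0. With `strategy="median"` and `missing_value=1.0`, the first entry of `a` becomes 4.0 while the NaN entries are left untouched, which matches the `out_a` and `out_b` values shown in the doctest.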
