HyukjinKwon commented on code in PR #36127:
URL: https://github.com/apache/spark/pull/36127#discussion_r846923142


##########
python/pyspark/pandas/generic.py:
##########
@@ -3181,6 +3181,83 @@ def ffill(
 
     pad = ffill
 
+    # TODO: add 'axis', 'inplace', 'limit_direction', 'limit_area', 'downcast'
+    def interpolate(
+        self: FrameLike,
+        method: Optional[str] = None,
+        limit: Optional[int] = None,
+    ) -> FrameLike:
+        """
+        Fill NaN values using an interpolation method.
+
+        Parameters
+        ----------
+        method : str, default 'linear'
+            Interpolation technique to use. One of:
+
+            * 'linear': Ignore the index and treat the values as equally
+              spaced.
+
+        limit : int, optional
+            Maximum number of consecutive NaNs to fill. Must be greater than
+            0.
+
+        Returns
+        -------
+        Series or DataFrame or None
+            Returns the same object type as the caller, interpolated at
+            some or all NA values.
+
+        See Also
+        --------
+        fillna : Fill missing values using different methods.
+
+        Examples
+        --------
+        Filling in NA via linear interpolation.
+
+        >>> s = ps.Series([0, 1, np.nan, 3])
+        >>> s
+        0    0.0
+        1    1.0
+        2    NaN
+        3    3.0
+        dtype: float64
+        >>> s.interpolate()
+        0    0.0
+        1    1.0
+        2    2.0
+        3    3.0
+        dtype: float64
+
+        Fill the DataFrame forward (that is, going down) along each column
+        using linear interpolation.
+
+        Note how the last entry in column 'a' is interpolated differently,
+        because there is no entry after it to use for interpolation.
+        Note how the first entry in column 'b' remains NA, because there
+        is no entry before it to use for interpolation.
+
+        >>> df = ps.DataFrame([(0.0, np.nan, -1.0, 1.0),
+        ...                    (np.nan, 2.0, np.nan, np.nan),
+        ...                    (2.0, 3.0, np.nan, 9.0),
+        ...                    (np.nan, 4.0, -4.0, 16.0)],
+        ...                   columns=list('abcd'))
+        >>> df
+             a    b    c     d
+        0  0.0  NaN -1.0   1.0
+        1  NaN  2.0  NaN   NaN
+        2  2.0  3.0  NaN   9.0
+        3  NaN  4.0 -4.0  16.0
+        >>> df.interpolate(method='linear')
+             a    b    c     d
+        0  0.0  NaN -1.0   1.0
+        1  1.0  2.0 -2.0   5.0
+        2  2.0  3.0 -3.0   9.0
+        3  2.0  4.0 -4.0  16.0
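For reference, the `'linear'` behavior shown in the examples above (interior gaps interpolated on equally spaced positions, trailing NaNs carried forward, leading NaNs left as-is) can be sketched in plain Python. This is an illustrative helper only, not part of the PR or the pyspark.pandas implementation:

```python
import math


def linear_interpolate(values, limit=None):
    """Sketch of 'linear' interpolation over a plain list of floats.

    Interior NaN runs are filled by treating positions as equally
    spaced; trailing NaNs are forward-filled with the last known
    value; leading NaNs stay NaN (mirroring the docstring examples).
    `limit` caps how many NaNs of a run are filled.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if not math.isnan(out[i]):
            i += 1
            continue
        start = i  # first NaN of a consecutive run
        while i < n and math.isnan(out[i]):
            i += 1
        prev_idx = start - 1
        if prev_idx < 0:
            continue  # leading NaNs: no left neighbor, leave as NaN
        filled = 0
        if i < n:
            # Interior gap: interpolate between the two neighbors.
            prev_v, next_v = out[prev_idx], out[i]
            span = i - prev_idx
            for j in range(start, i):
                if limit is not None and filled >= limit:
                    break
                out[j] = prev_v + (next_v - prev_v) * (j - prev_idx) / span
                filled += 1
        else:
            # Trailing gap: carry the last known value forward.
            for j in range(start, n):
                if limit is not None and filled >= limit:
                    break
                out[j] = out[prev_idx]
                filled += 1
    return out


# Matches the Series example above: the NaN at position 2 becomes 2.0.
print(linear_interpolate([0.0, 1.0, float("nan"), 3.0]))
```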

Review Comment:
   I think we should probably add a `Notes` section describing that this API 
is expensive, because the underlying Window functions are executed within a 
single executor.
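   One possible wording for such a section, following the numpydoc 
convention the docstring already uses (a sketch only, not the final text):

```
Notes
-----
This API is expensive internally: the interpolation is computed with
Window functions over a single partition, so all data is processed
within one executor.
```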



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

