[ https://issues.apache.org/jira/browse/SPARK-38844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38844:
------------------------------------

    Assignee:     (was: Apache Spark)

> impl Series.interpolate and DataFrame.interpolate
> -------------------------------------------------
>
>                 Key: SPARK-38844
>                 URL: https://issues.apache.org/jira/browse/SPARK-38844
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> h2. Goal:
> [pandas's interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html]
> supports many methods. _linear_ is applied by default; the other methods
> ( _pad_ _ffill_ _backfill_ _bfill_ ) can also be implemented in pandas API on Spark.
> The remaining ones (including _quadratic_ _cubic_ _spline_ ) cannot be
> implemented easily, since scipy is used internally and the window frame
> required is complex.
> Since the methods ( _pad_ _ffill_ _backfill_ _bfill_ ) are already implemented
> in pandas API on Spark via _fillna_, this work currently focuses on
> implementing the missing *linear interpolation*.
>
> h2. Impl:
> To implement linear interpolation, two extra window functions are added:
> one ( _null_index_ ) computes the position of each missing value within its
> run of consecutive missing values, the other ( _last_not_null_ ) keeps the
> last non-missing value.
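
For illustration only, the two helper columns described above can be sketched in plain Python. This is not the code from the patch, and it uses a simple loop rather than Spark window functions. The names `null_index_and_last_not_null` and `helper_columns` are made up for this sketch, missing values are assumed to be represented as `None` or NaN, and the backward variants are assumed to be the same scan applied to the reversed sequence.

```python
import math
from typing import List, Optional, Tuple


def null_index_and_last_not_null(
    values: List[Optional[float]],
) -> Tuple[List[int], List[Optional[float]]]:
    """Scan `values` once, left to right.

    null_index[i] is the 1-based position of values[i] inside its run of
    consecutive missing values (0 when values[i] is present).
    last_not_null[i] is the most recent present value seen up to and
    including position i (None if there is none yet).
    """
    null_index: List[int] = []
    last_not_null: List[Optional[float]] = []
    run = 0
    last: Optional[float] = None
    for v in values:
        if v is None or math.isnan(v):
            run += 1          # one step deeper into the current gap
        else:
            run = 0           # a present value ends the gap
            last = v
        null_index.append(run)
        last_not_null.append(last)
    return null_index, last_not_null


def helper_columns(values: List[Optional[float]]):
    """Forward columns scan the list as-is; backward columns scan it reversed."""
    idx_fwd, last_fwd = null_index_and_last_not_null(values)
    idx_bwd, last_bwd = null_index_and_last_not_null(values[::-1])
    return idx_fwd, last_fwd, idx_bwd[::-1], last_bwd[::-1]
```

For the sample series nan, 1, nan, nan, nan, 5, 6, nan, nan, `helper_columns` gives a forward null index of 1, 0, 1, 2, 3, 0, 0, 1, 2 and a backward null index of 1, 0, 3, 2, 1, 0, 0, 2, 1.
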
> ||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled (limit=1)||
> |1|nan|1|nan|1|1|-|-|
> |2|1|0|1|0|1| | |
> |3|nan|1|1|3|5|2.0|2.0|
> |4|nan|2|1|2|5|3.0|-|
> |5|nan|3|1|1|5|4.0|-|
> |6|5|0|5|0|5| | |
> |7|6|0|6|0|6| | |
> |8|nan|1|6|2|nan|6.0|6.0|
> |9|nan|2|6|1|nan|6.0|-|
>
> * for the NaNs at indices (3,4,5), we always compute the filled value via
> ( _last_not_null_backward_ - _last_not_null_forward_ ) /
> ( _null_index_forward_ + _null_index_backward_ ) * _null_index_forward_
> + _last_not_null_forward_
> * for the NaN at index 1, skip it due to the default *limit_direction* =
> _forward_
> * for the NaNs at indices (8,9), fill them like _ffill_ with the value
> _last_not_null_forward_
> * if _limit_ is set, then NaNs with _null_index_forward_ greater than
> _limit_ will not be interpolated.
>
> h2. Plan
> 1. implement the basic _linear interpolate_ with param _limit_
> 2. add param _limit_direction_
> 3. add param _limit_area_

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
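
As a rough check of the fill rules listed in the issue, here is a minimal Python sketch that applies them to one missing row at a time, assuming the default *limit_direction* = _forward_. The function name `fill_linear` is hypothetical, `None` stands in for NaN, and a return value of `None` means the row is left missing. It is a sketch of the rules as stated, not the pandas-on-Spark implementation.

```python
from typing import Optional


def fill_linear(
    null_index_forward: int,
    last_not_null_forward: Optional[float],
    null_index_backward: int,
    last_not_null_backward: Optional[float],
    limit: Optional[int] = None,
) -> Optional[float]:
    """Filled value for a single missing row, or None to leave it missing."""
    if last_not_null_forward is None:
        # Leading NaN: nothing to the left, skipped when filling forward.
        return None
    if limit is not None and null_index_forward > limit:
        # Too deep into the gap for the requested limit.
        return None
    if last_not_null_backward is None:
        # Trailing NaN: nothing to the right, so behave like ffill.
        return last_not_null_forward
    # Interior NaN: move null_index_forward equal steps from the left
    # value towards the right value.
    gap = null_index_forward + null_index_backward
    step = (last_not_null_backward - last_not_null_forward) / gap
    return step * null_index_forward + last_not_null_forward
```

With the helper values from the table, `fill_linear(2, 1.0, 2, 5.0)` (index 4) evaluates (5 - 1) / (2 + 2) * 2 + 1 = 3.0, and the same call with `limit=1` returns `None` because _null_index_forward_ is 2.
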