[jira] [Updated] (SPARK-38425) Avoid possible errors due to incorrect file size or type supplied in hadoop conf.

2022-04-09 Thread luyuan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luyuan updated SPARK-38425:
---
Summary: Avoid possible errors due to incorrect file size or type supplied 
in hadoop conf.  (was: Avoid possible errors due to incorrect file size or type 
supplied in spark conf.)

> Avoid possible errors due to incorrect file size or type supplied in hadoop 
> conf.
> -
>
> Key: SPARK-38425
> URL: https://issues.apache.org/jira/browse/SPARK-38425
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: luyuan
>Priority: Major
>
> This would avoid failures in case the files are a bit large or a user places
> a binary file inside the HADOOP_CONF_DIR, neither of which is supported at
> the moment.
> The reason is that the underlying etcd store limits the size of each entry
> to 1.5 MiB.
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/]
> We can handle this the same way as SPARK-32221:
> https://issues.apache.org/jira/browse/SPARK-32221.
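> As a rough illustration of the direction (the helper and constant below are
> made up for this sketch, not the actual patch; SPARK-32221 added a similar
> guard for SPARK_CONF_DIR):
> {code:python}
> # Hypothetical sketch: skip binary files and enforce a size budget when
> # building the Hadoop-conf ConfigMap, since etcd caps each entry at
> # roughly 1.5 MiB. All names here are made up.
> import os
> 
> MAX_CONFIGMAP_BYTES = 1024 * 1024  # stay safely under the etcd entry limit
> 
> def select_hadoop_conf_files(hadoop_conf_dir):
>     selected, total = [], 0
>     for name in sorted(os.listdir(hadoop_conf_dir)):
>         path = os.path.join(hadoop_conf_dir, name)
>         if not os.path.isfile(path):
>             continue  # skip subdirectories and other non-regular entries
>         try:
>             with open(path, "rb") as f:
>                 f.read().decode("utf-8")  # crude check: text files only
>         except UnicodeDecodeError:
>             continue  # skip binary files such as jars
>         total += os.path.getsize(path)
>         if total > MAX_CONFIGMAP_BYTES:
>             raise RuntimeError("HADOOP_CONF_DIR contents exceed the budget")
>         selected.append(path)
>     return selected
> {code}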






[jira] [Commented] (SPARK-38425) Avoid possible errors due to incorrect file size or type supplied in hadoop conf.

2022-04-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519947#comment-17519947
 ] 

Apache Spark commented on SPARK-38425:
--

User 'lyssg' has created a pull request for this issue:
https://github.com/apache/spark/pull/35667

> Avoid possible errors due to incorrect file size or type supplied in hadoop 
> conf.
> -
>
> Key: SPARK-38425
> URL: https://issues.apache.org/jira/browse/SPARK-38425
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: luyuan
>Priority: Major
>
> This would avoid failures in case the files are a bit large or a user places
> a binary file inside the HADOOP_CONF_DIR, neither of which is supported at
> the moment.
> The reason is that the underlying etcd store limits the size of each entry
> to 1.5 MiB.
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/]
> We can handle this the same way as SPARK-32221:
> https://issues.apache.org/jira/browse/SPARK-32221.






[jira] [Assigned] (SPARK-38425) Avoid possible errors due to incorrect file size or type supplied in hadoop conf.

2022-04-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38425:


Assignee: (was: Apache Spark)

> Avoid possible errors due to incorrect file size or type supplied in hadoop 
> conf.
> -
>
> Key: SPARK-38425
> URL: https://issues.apache.org/jira/browse/SPARK-38425
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: luyuan
>Priority: Major
>
> This would avoid failures in case the files are a bit large or a user places
> a binary file inside the HADOOP_CONF_DIR, neither of which is supported at
> the moment.
> The reason is that the underlying etcd store limits the size of each entry
> to 1.5 MiB.
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/]
> We can handle this the same way as SPARK-32221:
> https://issues.apache.org/jira/browse/SPARK-32221.






[jira] [Assigned] (SPARK-38425) Avoid possible errors due to incorrect file size or type supplied in hadoop conf.

2022-04-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38425:


Assignee: Apache Spark

> Avoid possible errors due to incorrect file size or type supplied in hadoop 
> conf.
> -
>
> Key: SPARK-38425
> URL: https://issues.apache.org/jira/browse/SPARK-38425
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: luyuan
>Assignee: Apache Spark
>Priority: Major
>
> This would avoid failures in case the files are a bit large or a user places
> a binary file inside the HADOOP_CONF_DIR, neither of which is supported at
> the moment.
> The reason is that the underlying etcd store limits the size of each entry
> to 1.5 MiB.
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/]
> We can handle this the same way as SPARK-32221:
> https://issues.apache.org/jira/browse/SPARK-32221.






[jira] [Commented] (SPARK-38425) Avoid possible errors due to incorrect file size or type supplied in hadoop conf.

2022-04-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519949#comment-17519949
 ] 

Apache Spark commented on SPARK-38425:
--

User 'lyssg' has created a pull request for this issue:
https://github.com/apache/spark/pull/35667

> Avoid possible errors due to incorrect file size or type supplied in hadoop 
> conf.
> -
>
> Key: SPARK-38425
> URL: https://issues.apache.org/jira/browse/SPARK-38425
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: luyuan
>Priority: Major
>
> This would avoid failures in case the files are a bit large or a user places
> a binary file inside the HADOOP_CONF_DIR, neither of which is supported at
> the moment.
> The reason is that the underlying etcd store limits the size of each entry
> to 1.5 MiB.
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/]
> We can handle this the same way as SPARK-32221:
> https://issues.apache.org/jira/browse/SPARK-32221.






[jira] [Created] (SPARK-38843) Fix translate metadata col filters

2022-04-09 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-38843:
---

 Summary: Fix translate metadata col filters
 Key: SPARK-38843
 URL: https://issues.apache.org/jira/browse/SPARK-38843
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yuming Wang


Filters on metadata columns actually cannot be pushed down, even though FileSourceStrategy reports them as pushed:
{noformat}
07:24:45.131 ERROR org.apache.spark.sql.execution.datasources.FileSourceStrategy: Pushed Filters: IsNotNull(_metadata.file_path),StringContains(_metadata.file_path,data/f0)
{noformat}
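A hypothetical PySpark repro of the kind of query involved (assuming an active {{spark}} session; the path is made up, and the {{_metadata}} column needs Spark 3.3+):
{code:python}
# Hypothetical repro, not taken from the ticket: filter on the hidden
# _metadata.file_path column of a file-based source.
df = spark.read.parquet("/tmp/data")  # path is made up
df.filter(df["_metadata.file_path"].contains("data/f0")).show()
# FileSourceStrategy then lists IsNotNull/StringContains on
# _metadata.file_path under "Pushed Filters", even though this generated
# per-file column cannot actually be pushed to the source.
{code}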







[jira] [Assigned] (SPARK-38843) Fix translate metadata col filters

2022-04-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38843:


Assignee: Apache Spark

> Fix translate metadata col filters
> --
>
> Key: SPARK-38843
> URL: https://issues.apache.org/jira/browse/SPARK-38843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Filters on metadata columns actually cannot be pushed down, even though
> FileSourceStrategy reports them as pushed:
> {noformat}
> 07:24:45.131 ERROR org.apache.spark.sql.execution.datasources.FileSourceStrategy: Pushed Filters: IsNotNull(_metadata.file_path),StringContains(_metadata.file_path,data/f0)
> {noformat}






[jira] [Assigned] (SPARK-38843) Fix translate metadata col filters

2022-04-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38843:


Assignee: (was: Apache Spark)

> Fix translate metadata col filters
> --
>
> Key: SPARK-38843
> URL: https://issues.apache.org/jira/browse/SPARK-38843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Filters on metadata columns actually cannot be pushed down, even though
> FileSourceStrategy reports them as pushed:
> {noformat}
> 07:24:45.131 ERROR org.apache.spark.sql.execution.datasources.FileSourceStrategy: Pushed Filters: IsNotNull(_metadata.file_path),StringContains(_metadata.file_path,data/f0)
> {noformat}






[jira] [Commented] (SPARK-38843) Fix translate metadata col filters

2022-04-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520050#comment-17520050
 ] 

Apache Spark commented on SPARK-38843:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/36126

> Fix translate metadata col filters
> --
>
> Key: SPARK-38843
> URL: https://issues.apache.org/jira/browse/SPARK-38843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Filters on metadata columns actually cannot be pushed down, even though
> FileSourceStrategy reports them as pushed:
> {noformat}
> 07:24:45.131 ERROR org.apache.spark.sql.execution.datasources.FileSourceStrategy: Pushed Filters: IsNotNull(_metadata.file_path),StringContains(_metadata.file_path,data/f0)
> {noformat}






[jira] [Commented] (SPARK-38843) Fix translate metadata col filters

2022-04-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520051#comment-17520051
 ] 

Apache Spark commented on SPARK-38843:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/36126

> Fix translate metadata col filters
> --
>
> Key: SPARK-38843
> URL: https://issues.apache.org/jira/browse/SPARK-38843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Filters on metadata columns actually cannot be pushed down, even though
> FileSourceStrategy reports them as pushed:
> {noformat}
> 07:24:45.131 ERROR org.apache.spark.sql.execution.datasources.FileSourceStrategy: Pushed Filters: IsNotNull(_metadata.file_path),StringContains(_metadata.file_path,data/f0)
> {noformat}






[jira] [Resolved] (SPARK-37398) Inline type hints for python/pyspark/ml/classification.py

2022-04-09 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37398.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 36071
[https://github.com/apache/spark/pull/36071]

> Inline type hints for python/pyspark/ml/classification.py
> -
>
> Key: SPARK-37398
> URL: https://issues.apache.org/jira/browse/SPARK-37398
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: alper tankut turker
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/classification.pyi to 
> python/pyspark/ml/classification.py.






[jira] [Assigned] (SPARK-37398) Inline type hints for python/pyspark/ml/classification.py

2022-04-09 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37398:
--

Assignee: alper tankut turker

> Inline type hints for python/pyspark/ml/classification.py
> -
>
> Key: SPARK-37398
> URL: https://issues.apache.org/jira/browse/SPARK-37398
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: alper tankut turker
>Priority: Major
>
> Inline type hints from python/pyspark/ml/classification.pyi to 
> python/pyspark/ml/classification.py.






[jira] [Assigned] (SPARK-37395) Inline type hint files for files in python/pyspark/ml

2022-04-09 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37395:
--

Assignee: Maciej Szymkiewicz

> Inline type hint files for files in python/pyspark/ml
> -
>
> Key: SPARK-37395
> URL: https://issues.apache.org/jira/browse/SPARK-37395
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Currently there are type hint stub files ({{*.pyi}}) to show the expected 
> types for functions, but we can also take advantage of static type checking 
> within the functions by inlining the type hints.
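> For illustration, a made-up before/after of such an inlining (the class and
> method below are hypothetical, not from the pyspark.ml sources):
> {code:python}
> # Before: the hints live only in a stub file (module.pyi); the body of the
> # implementation in module.py is not statically checked.
> #     def set_threshold(self, value: float) -> "Classifier": ...
> 
> # After: the same hints are inlined into module.py, so type checkers can
> # verify the function body against them as well.
> class Classifier:
>     def set_threshold(self, value: float) -> "Classifier":
>         self.threshold = value
>         return self
> {code}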






[jira] [Created] (SPARK-38844) impl Series.interpolate and DataFrame.interpolate

2022-04-09 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-38844:


 Summary: impl Series.interpolate and DataFrame.interpolate
 Key: SPARK-38844
 URL: https://issues.apache.org/jira/browse/SPARK-38844
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: zhengruifeng


h2. Goal:

[pandas' interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html] supports many methods; _linear_ is applied by default, and other methods (_pad_, _ffill_, _backfill_, _bfill_) can also be implemented in the pandas API on Spark.

The remaining ones (including _quadratic_, _cubic_, and _spline_) cannot be implemented easily, since scipy is used internally and the required window frame is complex.

Since the methods (_pad_, _ffill_, _backfill_, _bfill_) are already implemented in the pandas API on Spark via {_}fillna{_}, this work currently focuses on implementing the missing *linear interpolation*.
h2. Impl:

To implement the linear interpolation, two extra window functions are added: one ({_}null_index{_}) computes the index of each missing value within its consecutive run, and the other ({_}last_not_null{_}) keeps the last non-missing value.
||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled (limit=1)||
|1|nan|1|nan|1|1|-|-|
|2|1|0|1|0|1| | |
|3|nan|1|1|3|5|2.0|2.0|
|4|nan|2|1|2|5|3.0|-|
|5|nan|3|1|1|5|4.0|-|
|6|5|0|5|0|5| | |
|7|6|0|6|0|6| | |
|8|nan|1|6|2|nan|6.0|6.0|
|9|nan|2|6|1|nan|6.0|-|
 * for the NaNs at indices (3, 4, 5), we always compute the filled value via the formula below (see also the sketch after this list)

({_}last_not_null_backward{_} - {_}last_not_null_forward{_}) / ({_}null_index_forward{_} + {_}null_index_backward{_}) * _null_index_forward_ + _last_not_null_forward_

 * for the NaN at index 1, skip it due to the default *limit_direction* = _forward_

 * for the NaN at index 8, fill it like _ffill_ with value _last_not_null_forward_

 * if _limit_ is set, then NaNs with _null_index_forward_ greater than _limit_ will not be interpolated.
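For concreteness, here is a plain-Python sketch of this filling rule. It is illustration only: the function and names below are made up, and the real change would implement the rule with the window functions described above.
{code:python}
# A plain-Python sketch of the filling rule above (illustration only; the
# actual implementation uses the two window functions, not Python loops).
import math

def linear_interpolate(values, limit=None):
    n = len(values)

    def scan(seq):
        # Position of each NaN within its run, and the last non-NaN value.
        idx, val = [0] * n, [math.nan] * n
        dist, last = 0, math.nan
        for i, v in seq:
            if math.isnan(v):
                dist += 1
            else:
                dist, last = 0, v
            idx[i], val[i] = dist, last
        return idx, val

    fwd_idx, fwd_val = scan(enumerate(values))
    bwd_idx, bwd_val = scan(reversed(list(enumerate(values))))

    out = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        if math.isnan(fwd_val[i]):
            continue  # default limit_direction='forward': leading NaNs stay
        if limit is not None and fwd_idx[i] > limit:
            continue  # beyond the fill limit
        if math.isnan(bwd_val[i]):
            out[i] = fwd_val[i]  # trailing NaNs behave like ffill
        else:
            out[i] = ((bwd_val[i] - fwd_val[i])
                      / (fwd_idx[i] + bwd_idx[i]) * fwd_idx[i] + fwd_val[i])
    return out

nan = math.nan
print(linear_interpolate([nan, 1, nan, nan, nan, 5, 6, nan, nan]))
# [nan, 1, 2.0, 3.0, 4.0, 5, 6, 6.0, 6.0]   -- the 'filled' column
print(linear_interpolate([nan, 1, nan, nan, nan, 5, 6, nan, nan], limit=1))
# [nan, 1, 2.0, nan, nan, 5, 6, 6.0, nan]   -- the 'filled (limit=1)' column
{code}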

h2. Plan

1. implement the basic _linear_ interpolation with the _limit_ param

2. add the _limit_direction_ param

3. add the _limit_area_ param






[jira] [Commented] (SPARK-38844) impl Series.interpolate and DataFrame.interpolate

2022-04-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520063#comment-17520063
 ] 

Apache Spark commented on SPARK-38844:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/36127

> impl Series.interpolate and DataFrame.interpolate
> -
>
> Key: SPARK-38844
> URL: https://issues.apache.org/jira/browse/SPARK-38844
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> h2. Goal:
> [pandas' interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html] supports many methods; _linear_ is applied by default, and other methods (_pad_, _ffill_, _backfill_, _bfill_) can also be implemented in the pandas API on Spark.
> The remaining ones (including _quadratic_, _cubic_, and _spline_) cannot be implemented easily, since scipy is used internally and the required window frame is complex.
> Since the methods (_pad_, _ffill_, _backfill_, _bfill_) are already implemented in the pandas API on Spark via {_}fillna{_}, this work currently focuses on implementing the missing *linear interpolation*.
> h2. Impl:
> To implement the linear interpolation, two extra window functions are added: one ({_}null_index{_}) computes the index of each missing value within its consecutive run, and the other ({_}last_not_null{_}) keeps the last non-missing value.
> ||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled (limit=1)||
> |1|nan|1|nan|1|1|-|-|
> |2|1|0|1|0|1| | |
> |3|nan|1|1|3|5|2.0|2.0|
> |4|nan|2|1|2|5|3.0|-|
> |5|nan|3|1|1|5|4.0|-|
> |6|5|0|5|0|5| | |
> |7|6|0|6|0|6| | |
> |8|nan|1|6|2|nan|6.0|6.0|
> |9|nan|2|6|1|nan|6.0|-|
>  * for the NaNs at indices (3, 4, 5), we always compute the filled value via
> ({_}last_not_null_backward{_} - {_}last_not_null_forward{_}) / ({_}null_index_forward{_} + {_}null_index_backward{_}) * _null_index_forward_ + _last_not_null_forward_
>  * for the NaN at index 1, skip it due to the default *limit_direction* = _forward_
>  * for the NaN at index 8, fill it like _ffill_ with value _last_not_null_forward_
>  * if _limit_ is set, then NaNs with _null_index_forward_ greater than _limit_ will not be interpolated.
> h2. Plan
> 1. implement the basic _linear_ interpolation with the _limit_ param
> 2. add the _limit_direction_ param
> 3. add the _limit_area_ param






[jira] [Assigned] (SPARK-38844) impl Series.interpolate and DataFrame.interpolate

2022-04-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38844:


Assignee: Apache Spark

> impl Series.interpolate and DataFrame.interpolate
> -
>
> Key: SPARK-38844
> URL: https://issues.apache.org/jira/browse/SPARK-38844
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> h2. Goal:
> [pandas' interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html] supports many methods; _linear_ is applied by default, and other methods (_pad_, _ffill_, _backfill_, _bfill_) can also be implemented in the pandas API on Spark.
> The remaining ones (including _quadratic_, _cubic_, and _spline_) cannot be implemented easily, since scipy is used internally and the required window frame is complex.
> Since the methods (_pad_, _ffill_, _backfill_, _bfill_) are already implemented in the pandas API on Spark via {_}fillna{_}, this work currently focuses on implementing the missing *linear interpolation*.
> h2. Impl:
> To implement the linear interpolation, two extra window functions are added: one ({_}null_index{_}) computes the index of each missing value within its consecutive run, and the other ({_}last_not_null{_}) keeps the last non-missing value.
> ||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled (limit=1)||
> |1|nan|1|nan|1|1|-|-|
> |2|1|0|1|0|1| | |
> |3|nan|1|1|3|5|2.0|2.0|
> |4|nan|2|1|2|5|3.0|-|
> |5|nan|3|1|1|5|4.0|-|
> |6|5|0|5|0|5| | |
> |7|6|0|6|0|6| | |
> |8|nan|1|6|2|nan|6.0|6.0|
> |9|nan|2|6|1|nan|6.0|-|
>  * for the NaNs at indices (3, 4, 5), we always compute the filled value via
> ({_}last_not_null_backward{_} - {_}last_not_null_forward{_}) / ({_}null_index_forward{_} + {_}null_index_backward{_}) * _null_index_forward_ + _last_not_null_forward_
>  * for the NaN at index 1, skip it due to the default *limit_direction* = _forward_
>  * for the NaN at index 8, fill it like _ffill_ with value _last_not_null_forward_
>  * if _limit_ is set, then NaNs with _null_index_forward_ greater than _limit_ will not be interpolated.
> h2. Plan
> 1. implement the basic _linear_ interpolation with the _limit_ param
> 2. add the _limit_direction_ param
> 3. add the _limit_area_ param






[jira] [Assigned] (SPARK-38844) impl Series.interpolate and DataFrame.interpolate

2022-04-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38844:


Assignee: (was: Apache Spark)

> impl Series.interpolate and DataFrame.interpolate
> -
>
> Key: SPARK-38844
> URL: https://issues.apache.org/jira/browse/SPARK-38844
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> h2. Goal:
> [pandas' interpolate|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html] supports many methods; _linear_ is applied by default, and other methods (_pad_, _ffill_, _backfill_, _bfill_) can also be implemented in the pandas API on Spark.
> The remaining ones (including _quadratic_, _cubic_, and _spline_) cannot be implemented easily, since scipy is used internally and the required window frame is complex.
> Since the methods (_pad_, _ffill_, _backfill_, _bfill_) are already implemented in the pandas API on Spark via {_}fillna{_}, this work currently focuses on implementing the missing *linear interpolation*.
> h2. Impl:
> To implement the linear interpolation, two extra window functions are added: one ({_}null_index{_}) computes the index of each missing value within its consecutive run, and the other ({_}last_not_null{_}) keeps the last non-missing value.
> ||index||value||_null_index_forward_||_last_not_null_forward_||_null_index_backward_||_last_not_null_backward_||filled||filled (limit=1)||
> |1|nan|1|nan|1|1|-|-|
> |2|1|0|1|0|1| | |
> |3|nan|1|1|3|5|2.0|2.0|
> |4|nan|2|1|2|5|3.0|-|
> |5|nan|3|1|1|5|4.0|-|
> |6|5|0|5|0|5| | |
> |7|6|0|6|0|6| | |
> |8|nan|1|6|2|nan|6.0|6.0|
> |9|nan|2|6|1|nan|6.0|-|
>  * for the NaNs at indices (3, 4, 5), we always compute the filled value via
> ({_}last_not_null_backward{_} - {_}last_not_null_forward{_}) / ({_}null_index_forward{_} + {_}null_index_backward{_}) * _null_index_forward_ + _last_not_null_forward_
>  * for the NaN at index 1, skip it due to the default *limit_direction* = _forward_
>  * for the NaN at index 8, fill it like _ffill_ with value _last_not_null_forward_
>  * if _limit_ is set, then NaNs with _null_index_forward_ greater than _limit_ will not be interpolated.
> h2. Plan
> 1. implement the basic _linear_ interpolation with the _limit_ param
> 2. add the _limit_direction_ param
> 3. add the _limit_area_ param






[jira] [Reopened] (SPARK-34444) Pushdown scalar-subquery filter to FileSourceScan

2022-04-09 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reopened SPARK-34444:
-

> Pushdown scalar-subquery filter to FileSourceScan
> -
>
> Key: SPARK-34444
> URL: https://issues.apache.org/jira/browse/SPARK-34444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> We can push down the scalar-subquery filter {{b = (select max(d) from t2)}} to FileSourceScan:
> {code:scala}
> // Two small parquet tables.
> sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM range(5L)")
> sql("CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(20)")
> // The scalar subquery evaluates to a single constant at runtime, so the
> // resulting filter on b could be pushed down to the file source scan.
> sql("SELECT * FROM t1 WHERE b = (select max(d) from t2)").show
> {code}






[jira] [Commented] (SPARK-34444) Pushdown scalar-subquery filter to FileSourceScan

2022-04-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520080#comment-17520080
 ] 

Apache Spark commented on SPARK-34444:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/36128

> Pushdown scalar-subquery filter to FileSourceScan
> -
>
> Key: SPARK-34444
> URL: https://issues.apache.org/jira/browse/SPARK-34444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> We can push down the scalar-subquery filter {{b = (select max(d) from t2)}} to FileSourceScan:
> {code:scala}
> // Two small parquet tables.
> sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM range(5L)")
> sql("CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(20)")
> // The scalar subquery evaluates to a single constant at runtime, so the
> // resulting filter on b could be pushed down to the file source scan.
> sql("SELECT * FROM t1 WHERE b = (select max(d) from t2)").show
> {code}






[jira] [Assigned] (SPARK-34444) Pushdown scalar-subquery filter to FileSourceScan

2022-04-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34444:


Assignee: (was: Apache Spark)

> Pushdown scalar-subquery filter to FileSourceScan
> -
>
> Key: SPARK-34444
> URL: https://issues.apache.org/jira/browse/SPARK-34444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> We can push down the scalar-subquery filter {{b = (select max(d) from t2)}} to FileSourceScan:
> {code:scala}
> // Two small parquet tables.
> sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM range(5L)")
> sql("CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(20)")
> // The scalar subquery evaluates to a single constant at runtime, so the
> // resulting filter on b could be pushed down to the file source scan.
> sql("SELECT * FROM t1 WHERE b = (select max(d) from t2)").show
> {code}






[jira] [Commented] (SPARK-34444) Pushdown scalar-subquery filter to FileSourceScan

2022-04-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520079#comment-17520079
 ] 

Apache Spark commented on SPARK-34444:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/36128

> Pushdown scalar-subquery filter to FileSourceScan
> -
>
> Key: SPARK-34444
> URL: https://issues.apache.org/jira/browse/SPARK-34444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> We can push down the scalar-subquery filter {{b = (select max(d) from t2)}} to FileSourceScan:
> {code:scala}
> // Two small parquet tables.
> sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM range(5L)")
> sql("CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(20)")
> // The scalar subquery evaluates to a single constant at runtime, so the
> // resulting filter on b could be pushed down to the file source scan.
> sql("SELECT * FROM t1 WHERE b = (select max(d) from t2)").show
> {code}






[jira] [Assigned] (SPARK-34444) Pushdown scalar-subquery filter to FileSourceScan

2022-04-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34444:


Assignee: Apache Spark

> Pushdown scalar-subquery filter to FileSourceScan
> -
>
> Key: SPARK-34444
> URL: https://issues.apache.org/jira/browse/SPARK-34444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> We can push down the scalar-subquery filter {{b = (select max(d) from t2)}} to FileSourceScan:
> {code:scala}
> // Two small parquet tables.
> sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM range(5L)")
> sql("CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(20)")
> // The scalar subquery evaluates to a single constant at runtime, so the
> // resulting filter on b could be pushed down to the file source scan.
> sql("SELECT * FROM t1 WHERE b = (select max(d) from t2)").show
> {code}


