date:20211201

[jira] [Updated] (SPARK-37512) Support TimedeltaIndex creation (from Series/Index) and TimedeltaIndex.astype

2021-12-01 Thread Xinrong Meng (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-37512:
-
Summary: Support TimedeltaIndex creation (from Series/Index) and 
TimedeltaIndex.astype  (was: Support TimedeltaIndex creation given a timedelta 
Series/Index)

> Support TimedeltaIndex creation (from Series/Index) and TimedeltaIndex.astype
> -
>
> Key: SPARK-37512
> URL: https://issues.apache.org/jira/browse/SPARK-37512
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> To solve the issues below:
> {code:java}
> >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)])
> >>> idx
> TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], 
> dtype='timedelta64[ns]', freq=None)
> >>> ps.TimedeltaIndex(idx)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  
>  
> {code:java}
> >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20])
> >>> s
> 10          1 days 00:00:00
> 20   0 days 00:00:00.02
> dtype: timedelta64[ns]
> >>> ps.TimedeltaIndex(s)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-36396) Implement DataFrame.cov

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452194#comment-17452194
 ] 

Apache Spark commented on SPARK-36396:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34778

> Implement DataFrame.cov
> ---
>
> Key: SPARK-36396
> URL: https://issues.apache.org/jira/browse/SPARK-36396
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452178#comment-17452178
 ] 

Apache Spark commented on SPARK-37326:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/34777

> Support TimestampNTZ in CSV data source
> ---
>
> Key: SPARK-37326
> URL: https://issues.apache.org/jira/browse/SPARK-37326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452177#comment-17452177
 ] 

Apache Spark commented on SPARK-37326:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/34777

> Support TimestampNTZ in CSV data source
> ---
>
> Key: SPARK-37326
> URL: https://issues.apache.org/jira/browse/SPARK-37326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452175#comment-17452175
 ] 

Apache Spark commented on SPARK-37512:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34776

> Support TimedeltaIndex creation given a timedelta Series/Index
> --
>
> Key: SPARK-37512
> URL: https://issues.apache.org/jira/browse/SPARK-37512
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> To solve the issues below:
> {code:java}
> >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)])
> >>> idx
> TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], 
> dtype='timedelta64[ns]', freq=None)
> >>> ps.TimedeltaIndex(idx)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  
>  
> {code:java}
> >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20])
> >>> s
> 10          1 days 00:00:00
> 20   0 days 00:00:00.02
> dtype: timedelta64[ns]
> >>> ps.TimedeltaIndex(s)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index

2021-12-01 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37512:


Assignee: (was: Apache Spark)

> Support TimedeltaIndex creation given a timedelta Series/Index
> --
>
> Key: SPARK-37512
> URL: https://issues.apache.org/jira/browse/SPARK-37512
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> To solve the issues below:
> {code:java}
> >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)])
> >>> idx
> TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], 
> dtype='timedelta64[ns]', freq=None)
> >>> ps.TimedeltaIndex(idx)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  
>  
> {code:java}
> >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20])
> >>> s
> 10          1 days 00:00:00
> 20   0 days 00:00:00.02
> dtype: timedelta64[ns]
> >>> ps.TimedeltaIndex(s)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index

2021-12-01 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37512:


Assignee: Apache Spark

> Support TimedeltaIndex creation given a timedelta Series/Index
> --
>
> Key: SPARK-37512
> URL: https://issues.apache.org/jira/browse/SPARK-37512
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
>  
> To solve the issues below:
> {code:java}
> >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)])
> >>> idx
> TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], 
> dtype='timedelta64[ns]', freq=None)
> >>> ps.TimedeltaIndex(idx)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  
>  
> {code:java}
> >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20])
> >>> s
> 10          1 days 00:00:00
> 20   0 days 00:00:00.02
> dtype: timedelta64[ns]
> >>> ps.TimedeltaIndex(s)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452174#comment-17452174
 ] 

Apache Spark commented on SPARK-37512:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34776

> Support TimedeltaIndex creation given a timedelta Series/Index
> --
>
> Key: SPARK-37512
> URL: https://issues.apache.org/jira/browse/SPARK-37512
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> To solve the issues below:
> {code:java}
> >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)])
> >>> idx
> TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], 
> dtype='timedelta64[ns]', freq=None)
> >>> ps.TimedeltaIndex(idx)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  
>  
> {code:java}
> >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20])
> >>> s
> 10          1 days 00:00:00
> 20   0 days 00:00:00.02
> dtype: timedelta64[ns]
> >>> ps.TimedeltaIndex(s)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452172#comment-17452172
 ] 

Apache Spark commented on SPARK-37511:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34775

> Introduce TimedeltaIndex to pandas API on Spark
> ---
>
> Key: SPARK-37511
> URL: https://issues.apache.org/jira/browse/SPARK-37511
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Introduce TimedeltaIndex to pandas API on Spark.
> Properties, functions, and basic operations of TimedeltaIndex will be 
> supported in follow-up PRs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37516) Uses Python's standard string formatter for SQL API in PySpark

2021-12-01 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37516:


Assignee: Apache Spark

> Uses Python's standard string formatter for SQL API in PySpark
> --
>
> Key: SPARK-37516
> URL: https://issues.apache.org/jira/browse/SPARK-37516
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> This is similar with SPARK-37436. It aims to add the Python string formatter 
> support for {{SparkSession.sql}} API.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37516) Uses Python's standard string formatter for SQL API in PySpark

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452171#comment-17452171
 ] 

Apache Spark commented on SPARK-37516:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34774

> Uses Python's standard string formatter for SQL API in PySpark
> --
>
> Key: SPARK-37516
> URL: https://issues.apache.org/jira/browse/SPARK-37516
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This is similar with SPARK-37436. It aims to add the Python string formatter 
> support for {{SparkSession.sql}} API.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37516) Uses Python's standard string formatter for SQL API in PySpark

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452170#comment-17452170
 ] 

Apache Spark commented on SPARK-37516:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34774

> Uses Python's standard string formatter for SQL API in PySpark
> --
>
> Key: SPARK-37516
> URL: https://issues.apache.org/jira/browse/SPARK-37516
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This is similar with SPARK-37436. It aims to add the Python string formatter 
> support for {{SparkSession.sql}} API.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37516) Uses Python's standard string formatter for SQL API in PySpark

2021-12-01 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37516:


Assignee: (was: Apache Spark)

> Uses Python's standard string formatter for SQL API in PySpark
> --
>
> Key: SPARK-37516
> URL: https://issues.apache.org/jira/browse/SPARK-37516
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This is similar with SPARK-37436. It aims to add the Python string formatter 
> support for {{SparkSession.sql}} API.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-37516) Uses Python's standard string formatter for SQL API in PySpark

2021-12-01 Thread Hyukjin Kwon (Jira)

Hyukjin Kwon created SPARK-37516:


 Summary: Uses Python's standard string formatter for SQL API in 
PySpark
 Key: SPARK-37516
 URL: https://issues.apache.org/jira/browse/SPARK-37516
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


This is similar with SPARK-37436. It aims to add the Python string formatter 
support for {{SparkSession.sql}} API.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-37494) Unify v1 and v2 options output of `SHOW CREATE TABLE` command

2021-12-01 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37494.
-
Resolution: Fixed

Issue resolved by pull request 34753
[https://github.com/apache/spark/pull/34753]

> Unify v1 and v2 options output of `SHOW CREATE TABLE` command
> -
>
> Key: SPARK-37494
> URL: https://issues.apache.org/jira/browse/SPARK-37494
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37494) Unify v1 and v2 options output of `SHOW CREATE TABLE` command

2021-12-01 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37494:
---

Assignee: PengLei

> Unify v1 and v2 options output of `SHOW CREATE TABLE` command
> -
>
> Key: SPARK-37494
> URL: https://issues.apache.org/jira/browse/SPARK-37494
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-33898) Support SHOW CREATE TABLE in v2

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-33898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452164#comment-17452164
 ] 

Apache Spark commented on SPARK-33898:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34773

> Support SHOW CREATE TABLE in v2
> ---
>
> Key: SPARK-33898
> URL: https://issues.apache.org/jira/browse/SPARK-33898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: PengLei
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result

2021-12-01 Thread koert kuipers (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452118#comment-17452118
 ] 

koert kuipers commented on SPARK-37476:
---

note that changing my case class to:
{code:java}
case class SumAndProduct(sum: java.lang.Double, product: java.lang.Double)  
{code}
did not fix anything. same errors.

> udaf doesnt work with nullable (or option of) case class result 
> 
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
>Reporter: koert kuipers
>Priority: Minor
>
> i have a need to have a dataframe aggregation return a nullable case class. 
> there seems to be no way to get this to work. the suggestion to wrap the 
> result in an option doesnt work either.
> first attempt using nulls:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double)
> val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] {
>   def zero: SumAndProduct = null
>   def reduce(b: SumAndProduct, a: Double): SumAndProduct =
>     if (b == null) {
>       SumAndProduct(a, a)
>     } else {
>       SumAndProduct(b.sum + a, b.product * a)
>     }
>   def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct =
>     if (b1 == null) {
>       b2
>     } else if (b2 == null) {
>       b1
>     } else {
>       SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>     }
>   def finish(r: SumAndProduct): SumAndProduct = r
>   def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
>   def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$3(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1491.0 (TID 1929)
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).sum AS sum#20070
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).product AS product#20071
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125)
>  {code}
> i dont really understand the error, this result is not a top-level row object.
> anyhow taking the advice to heart and using option we get to the second 
> attempt using options:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double) 
> val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], 
> Option[SumAndProduct]] {
>   def zero: Option[SumAndProduct] = None
>   def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] =
>     b
>       .map{ b => SumAndProduct(b.sum + a, b.product * a) }
>       .orElse{ Option(SumAndProduct(a, a)) }
>   def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): 
> Option[SumAndProduct] =
>     b1.map{ b1 =>
>       b2.map{ b2 =>
>         SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>       }.getOrElse(b1)
>     }.orElse(b2)
>   def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r
>   def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
>   def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$4(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.998 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1493.0 (TID 1930)
> java.lang.AssertionError: index (1) should < 1
>     at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:142)
>     at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:338)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$5(Aggregatio

[jira] [Assigned] (SPARK-37514) Remove workarounds due to older pandas

2021-12-01 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37514:


Assignee: Takuya Ueshin

> Remove workarounds due to older pandas
> --
>
> Key: SPARK-37514
> URL: https://issues.apache.org/jira/browse/SPARK-37514
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> Now that we upgraded the minimum version of pandas to {{1.0.5}}.
> We can remove workarounds for pandas API on Spark to run with older pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-37514) Remove workarounds due to older pandas

2021-12-01 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37514.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34772
[https://github.com/apache/spark/pull/34772]

> Remove workarounds due to older pandas
> --
>
> Key: SPARK-37514
> URL: https://issues.apache.org/jira/browse/SPARK-37514
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.3.0
>
>
> Now that we upgraded the minimum version of pandas to {{1.0.5}}.
> We can remove workarounds for pandas API on Spark to run with older pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py

2021-12-01 Thread Jyoti Singh (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452114#comment-17452114
 ] 

Jyoti Singh commented on SPARK-37430:
-

Sure, I’ll check the other one first.

> Inline type hints for python/pyspark/mllib/linalg/distributed.py
> 
>
> Key: SPARK-37430
> URL: https://issues.apache.org/jira/browse/SPARK-37430
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mlib/linalg/distributed.pyi to 
> python/pyspark/mllib/linalg/distributed.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py

2021-12-01 Thread Maciej Szymkiewicz (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452110#comment-17452110
 ] 

Maciej Szymkiewicz commented on SPARK-37430:


Thank you [~jyoti_08], but please keep in mind that before we get deeper into 
SPARK-3739 we should fully resolve SPARK-37094.

> Inline type hints for python/pyspark/mllib/linalg/distributed.py
> 
>
> Key: SPARK-37430
> URL: https://issues.apache.org/jira/browse/SPARK-37430
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mlib/linalg/distributed.pyi to 
> python/pyspark/mllib/linalg/distributed.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py

2021-12-01 Thread Maciej Szymkiewicz (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452110#comment-17452110
 ] 

Maciej Szymkiewicz edited comment on SPARK-37430 at 12/2/21, 1:35 AM:
--

Thank you [~jyoti_08], but please keep in mind that before we get deeper into 
SPARK-37396 we should fully resolve SPARK-37094.


was (Author: zero323):
Thank you [~jyoti_08], but please keep in mind that before we get deeper into 
SPARK-3739 we should fully resolve SPARK-37094.

> Inline type hints for python/pyspark/mllib/linalg/distributed.py
> 
>
> Key: SPARK-37430
> URL: https://issues.apache.org/jira/browse/SPARK-37430
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mlib/linalg/distributed.pyi to 
> python/pyspark/mllib/linalg/distributed.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py

2021-12-01 Thread Jyoti Singh (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452103#comment-17452103
 ] 

Jyoti Singh commented on SPARK-37430:
-

I can start working on this issue.

> Inline type hints for python/pyspark/mllib/linalg/distributed.py
> 
>
> Key: SPARK-37430
> URL: https://issues.apache.org/jira/browse/SPARK-37430
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mlib/linalg/distributed.pyi to 
> python/pyspark/mllib/linalg/distributed.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result

2021-12-01 Thread koert kuipers (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452101#comment-17452101
 ] 

koert kuipers edited comment on SPARK-37476 at 12/2/21, 12:58 AM:
--

i get that scala Double cannot be null. however i dont understand how this is 
relevant? my case class can be null, yet it fails when i try to return null for 
the case class. in so far i know a nullable struct is perfectly valid in spark?


was (Author: koert):
i get that scala Double cannot be null. however i dont understand how this is 
relevant? my case class can be null, yet it fails when i try to return null for 
the case class.

> udaf doesnt work with nullable (or option of) case class result 
> 
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
>Reporter: koert kuipers
>Priority: Minor
>
> i have a need to have a dataframe aggregation return a nullable case class. 
> there seems to be no way to get this to work. the suggestion to wrap the 
> result in an option doesnt work either.
> first attempt using nulls:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double)
> val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] {
>   def zero: SumAndProduct = null
>   def reduce(b: SumAndProduct, a: Double): SumAndProduct =
>     if (b == null) {
>       SumAndProduct(a, a)
>     } else {
>       SumAndProduct(b.sum + a, b.product * a)
>     }
>   def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct =
>     if (b1 == null) {
>       b2
>     } else if (b2 == null) {
>       b1
>     } else {
>       SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>     }
>   def finish(r: SumAndProduct): SumAndProduct = r
>   def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
>   def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$3(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1491.0 (TID 1929)
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).sum AS sum#20070
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).product AS product#20071
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125)
>  {code}
> i dont really understand the error, this result is not a top-level row object.
> anyhow taking the advice to heart and using option we get to the second 
> attempt using options:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double) 
> val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], 
> Option[SumAndProduct]] {
>   def zero: Option[SumAndProduct] = None
>   def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] =
>     b
>       .map{ b => SumAndProduct(b.sum + a, b.product * a) }
>       .orElse{ Option(SumAndProduct(a, a)) }
>   def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): 
> Option[SumAndProduct] =
>     b1.map{ b1 =>
>       b2.map{ b2 =>
>         SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>       }.getOrElse(b1)
>     }.orElse(b2)
>   def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r
>   def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
>   def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$4(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.998 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1493.0 (TID 1930)
> java.lang.AssertionError: index (1) should < 1
>     at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:142)
>     a

[jira] [Commented] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result

2021-12-01 Thread koert kuipers (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452101#comment-17452101
 ] 

koert kuipers commented on SPARK-37476:
---

i get that scala Double cannot be null. however i dont understand how this is 
relevant? my case class can be null, yet it fails when i try to return null for 
the case class.

> udaf doesnt work with nullable (or option of) case class result 
> 
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
>Reporter: koert kuipers
>Priority: Minor
>
> i have a need to have a dataframe aggregation return a nullable case class. 
> there seems to be no way to get this to work. the suggestion to wrap the 
> result in an option doesnt work either.
> first attempt using nulls:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double)
> val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] {
>   def zero: SumAndProduct = null
>   def reduce(b: SumAndProduct, a: Double): SumAndProduct =
>     if (b == null) {
>       SumAndProduct(a, a)
>     } else {
>       SumAndProduct(b.sum + a, b.product * a)
>     }
>   def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct =
>     if (b1 == null) {
>       b2
>     } else if (b2 == null) {
>       b1
>     } else {
>       SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>     }
>   def finish(r: SumAndProduct): SumAndProduct = r
>   def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
>   def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$3(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1491.0 (TID 1929)
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).sum AS sum#20070
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).product AS product#20071
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125)
>  {code}
> i dont really understand the error, this result is not a top-level row object.
> anyhow taking the advice to heart and using option we get to the second 
> attempt using options:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double) 
> val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], 
> Option[SumAndProduct]] {
>   def zero: Option[SumAndProduct] = None
>   def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] =
>     b
>       .map{ b => SumAndProduct(b.sum + a, b.product * a) }
>       .orElse{ Option(SumAndProduct(a, a)) }
>   def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): 
> Option[SumAndProduct] =
>     b1.map{ b1 =>
>       b2.map{ b2 =>
>         SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>       }.getOrElse(b1)
>     }.orElse(b2)
>   def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r
>   def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
>   def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$4(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.998 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1493.0 (TID 1930)
> java.lang.AssertionError: index (1) should < 1
>     at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:142)
>     at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:338)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>     at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$5(Ag

[jira] [Commented] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result

2021-12-01 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452099#comment-17452099
 ] 

Hyukjin Kwon commented on SPARK-37476:
--

Awesome! In Scala, their double cannot be null:

{code}
scala> val a: Double = null
:22: error: an expression of type Null is ineligible for implicit 
conversion
   val a: Double = null
   ^

scala> null.asInstanceOf[Double]
res1: Double = 0.0
{code}

probably that's why the initial reproducer failed to do that.

> udaf doesnt work with nullable (or option of) case class result 
> 
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
>Reporter: koert kuipers
>Priority: Minor
>
> i have a need to have a dataframe aggregation return a nullable case class. 
> there seems to be no way to get this to work. the suggestion to wrap the 
> result in an option doesnt work either.
> first attempt using nulls:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double)
> val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] {
>   def zero: SumAndProduct = null
>   def reduce(b: SumAndProduct, a: Double): SumAndProduct =
>     if (b == null) {
>       SumAndProduct(a, a)
>     } else {
>       SumAndProduct(b.sum + a, b.product * a)
>     }
>   def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct =
>     if (b1 == null) {
>       b2
>     } else if (b2 == null) {
>       b1
>     } else {
>       SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>     }
>   def finish(r: SumAndProduct): SumAndProduct = r
>   def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
>   def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$3(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1491.0 (TID 1929)
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).sum AS sum#20070
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).product AS product#20071
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125)
>  {code}
> i dont really understand the error, this result is not a top-level row object.
> anyhow taking the advice to heart and using option we get to the second 
> attempt using options:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double) 
> val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], 
> Option[SumAndProduct]] {
>   def zero: Option[SumAndProduct] = None
>   def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] =
>     b
>       .map{ b => SumAndProduct(b.sum + a, b.product * a) }
>       .orElse{ Option(SumAndProduct(a, a)) }
>   def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): 
> Option[SumAndProduct] =
>     b1.map{ b1 =>
>       b2.map{ b2 =>
>         SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>       }.getOrElse(b1)
>     }.orElse(b2)
>   def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r
>   def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
>   def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$4(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.998 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1493.0 (TID 1930)
> java.lang.AssertionError: index (1) should < 1
>     at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:142)
>     at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:338)
>     at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$S

[jira] [Commented] (SPARK-37515) minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)

2021-12-01 Thread Sungpeo Kook (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452097#comment-17452097
 ] 

Sungpeo Kook commented on SPARK-37515:
--

I couldn't find the proper JIRA's components for Kafka Streaming. 

 

> minRatePerPartition works as "max messages per partition per a batch" (it 
> should be per seconds)
> 
>
> Key: SPARK-37515
> URL: https://issues.apache.org/jira/browse/SPARK-37515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Sungpeo Kook
>Assignee: Apache Spark
>Priority: Major
>
> {{maxRatePerPartition}} means "max messages per partition per second".
> But minRatePerPartition does not. ("max messages per partition per a batch"). 
> This is a bug.
>  
> minRatePerPartition should be works as "min messages per partition per 
> second".
>  * 
> [https://github.com/apache/spark/blob/f97de309792f382ae823894e978f7e54f34f1a29/docs/configuration.md?plain=1#L2878-L2886]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37515) minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)

2021-12-01 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37515:


Assignee: (was: Apache Spark)

> minRatePerPartition works as "max messages per partition per a batch" (it 
> should be per seconds)
> 
>
> Key: SPARK-37515
> URL: https://issues.apache.org/jira/browse/SPARK-37515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Sungpeo Kook
>Priority: Major
>
> {{maxRatePerPartition}} means "max messages per partition per second".
> But minRatePerPartition does not. ("max messages per partition per a batch"). 
> This is a bug.
>  
> minRatePerPartition should be works as "min messages per partition per 
> second".
>  * 
> [https://github.com/apache/spark/blob/f97de309792f382ae823894e978f7e54f34f1a29/docs/configuration.md?plain=1#L2878-L2886]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37515) minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)

2021-12-01 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37515:


Assignee: Apache Spark

> minRatePerPartition works as "max messages per partition per a batch" (it 
> should be per seconds)
> 
>
> Key: SPARK-37515
> URL: https://issues.apache.org/jira/browse/SPARK-37515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Sungpeo Kook
>Assignee: Apache Spark
>Priority: Major
>
> {{maxRatePerPartition}} means "max messages per partition per second".
> But minRatePerPartition does not. ("max messages per partition per a batch"). 
> This is a bug.
>  
> minRatePerPartition should be works as "min messages per partition per 
> second".
>  * 
> [https://github.com/apache/spark/blob/f97de309792f382ae823894e978f7e54f34f1a29/docs/configuration.md?plain=1#L2878-L2886]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-37515) minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)

2021-12-01 Thread Sungpeo Kook (Jira)

Sungpeo Kook created SPARK-37515:


 Summary: minRatePerPartition works as "max messages per partition 
per a batch" (it should be per seconds)
 Key: SPARK-37515
 URL: https://issues.apache.org/jira/browse/SPARK-37515
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0, 2.4.8
Reporter: Sungpeo Kook


{{maxRatePerPartition}} means "max messages per partition per second".
But minRatePerPartition does not. ("max messages per partition per a batch"). 
This is a bug.

 

minRatePerPartition should be works as "min messages per partition per second".
 * 
[https://github.com/apache/spark/blob/f97de309792f382ae823894e978f7e54f34f1a29/docs/configuration.md?plain=1#L2878-L2886]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37514) Remove workarounds due to older pandas

2021-12-01 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37514:


Assignee: Apache Spark

> Remove workarounds due to older pandas
> --
>
> Key: SPARK-37514
> URL: https://issues.apache.org/jira/browse/SPARK-37514
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Now that we upgraded the minimum version of pandas to {{1.0.5}}.
> We can remove workarounds for pandas API on Spark to run with older pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37514) Remove workarounds due to older pandas

2021-12-01 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37514:


Assignee: (was: Apache Spark)

> Remove workarounds due to older pandas
> --
>
> Key: SPARK-37514
> URL: https://issues.apache.org/jira/browse/SPARK-37514
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Now that we upgraded the minimum version of pandas to {{1.0.5}}.
> We can remove workarounds for pandas API on Spark to run with older pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37514) Remove workarounds due to older pandas

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452092#comment-17452092
 ] 

Apache Spark commented on SPARK-37514:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34772

> Remove workarounds due to older pandas
> --
>
> Key: SPARK-37514
> URL: https://issues.apache.org/jira/browse/SPARK-37514
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Now that we upgraded the minimum version of pandas to {{1.0.5}}.
> We can remove workarounds for pandas API on Spark to run with older pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-37514) Remove workarounds due to older pandas

2021-12-01 Thread Takuya Ueshin (Jira)

Takuya Ueshin created SPARK-37514:
-

 Summary: Remove workarounds due to older pandas
 Key: SPARK-37514
 URL: https://issues.apache.org/jira/browse/SPARK-37514
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


Now that we upgraded the minimum version of pandas to {{1.0.5}}.
We can remove workarounds for pandas API on Spark to run with older pandas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-01 Thread Sean R. Owen (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452089#comment-17452089
 ] 

Sean R. Owen commented on SPARK-26589:
--

That approach ends up being about n log n and seems worse than just sorting? I 
think.
Or bootstrap this kind of approach with approximate median - would converge 
faster.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for median function to be implemented in 
> Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as 
> duplicate of it. The thing is that approximate quantile is a workaround for 
> lack of median function. Thus I am filling this Feature Request for proper, 
> exact, not approximation of, median function. I am aware about difficulties 
> that are caused by distributed environment when trying to compute median, 
> nevertheless I don't think those difficulties is reason good enough to drop 
> out `median` function from scope of Spark. I am not asking about efficient 
> median but exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37496) Migrate ReplaceTableAsSelectStatement to v2 command

2021-12-01 Thread Huaxin Gao (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao reassigned SPARK-37496:
--

Assignee: Huaxin Gao

> Migrate ReplaceTableAsSelectStatement to v2 command
> ---
>
> Key: SPARK-37496
> URL: https://issues.apache.org/jira/browse/SPARK-37496
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-37496) Migrate ReplaceTableAsSelectStatement to v2 command

2021-12-01 Thread Huaxin Gao (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-37496.

Resolution: Fixed

> Migrate ReplaceTableAsSelectStatement to v2 command
> ---
>
> Key: SPARK-37496
> URL: https://issues.apache.org/jira/browse/SPARK-37496
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-01 Thread Nicholas Chammas (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452081#comment-17452081
 ] 

Nicholas Chammas commented on SPARK-26589:
--

[~srowen] - I'll ask for help on the dev list if appropriate, but I'm wondering 
if you can give me some high level guidance here.

I have an outline of an approach to calculate the median that does not require 
sorting or shuffling the data. It's based on the approach I linked to in my 
previous comment (by Michael Harris). It does require, however, multiple passes 
over the data for the algorithm to converge on the median.

Here's a working sketch of the approach:
{code:python}
def spark_median(data):
total_count = data.count()
if total_count % 2 == 0:
target_positions = [total_count // 2, total_count // 2 + 1]
else:
target_positions = [total_count // 2 + 1]
target_values = [
kth_position(data, k, data_count=total_count)
for k in target_positions
]
return sum(target_values) / len(target_values)


def kth_position(data, k, data_count=None):
if data_count is None:
total_count = data.count()
else:
total_count = data_count
if k > total_count or k < 1:
return None
while True:
# This value, along with the following two counts, are the only data 
that need
# to be shared across nodes.
some_value = data.first()["id"]
# These two counts can be performed together via an aggregator.
larger_count = data.where(col("id") > some_value).count()
equal_count = data.where(col("id") == some_value).count()
value_positions = range(
total_count - larger_count - equal_count + 1,
total_count - larger_count + 1,
)
# print(some_value, total_count, k, value_positions)
if k in value_positions:
return some_value
elif k >= value_positions.stop:
k -= (value_positions.stop - 1)
data = data.where(col("id") > some_value)
total_count = larger_count
elif k < value_positions.start:
data = data.where(col("id") < some_value)
total_count -= (larger_count + equal_count)
{code}
Of course, this needs to be converted into a Catalyst Expression, but the basic 
idea is expressed there.

I am looking at the definitions of 
[DeclarativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L381-L394]
 and 
[ImperativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L267-L285]
 and trying to find an existing expression to model after, but I don't think we 
have any existing aggregates that would work like this median—specifically, 
where multiple passes over the data are required (in this case, to count 
elements matching different filters).

Do you have any advice on how to approach converting this into a Catalyst 
expression?

There is an 
[NthValue|https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L648-L675]
 window expression, but I don't think I can build on it to make my median 
expression since a) median shouldn't be limited to window expressions, and b) 
NthValue requires a complete sort of the data, which I want to avoid.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for median function to be implemented in 
> Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as 
> duplicate of it. The thing is that approximate quantile is a workaround for 
> lack of median function. Thus I am filling this Feature Request for proper, 
> exact, not approximation of, median function. I am aware about difficulties 
> that are caused by distributed environment when trying to compute median, 
> nevertheless I don't think those difficulties is reason good enough to drop 
> out `median` function from scope of Spark. I am not asking about efficient 
> median but exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-01 Thread Nicholas Chammas (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451936#comment-17451936
 ] 

Nicholas Chammas commented on SPARK-26589:
--

Just for reference, Stack Overflow provides evidence that a proper median 
function has been in high demand for some time:
 * [How can I calculate exact median with Apache 
Spark?|https://stackoverflow.com/q/28158729/877069] (14K views)
 * [How to find median and quantiles using 
Spark|https://stackoverflow.com/q/31432843/877069] (117K views)
 * [Median / quantiles within PySpark 
groupBy|https://stackoverflow.com/q/46845672/877069] (67K views)

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for median function to be implemented in 
> Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as 
> duplicate of it. The thing is that approximate quantile is a workaround for 
> lack of median function. Thus I am filling this Feature Request for proper, 
> exact, not approximation of, median function. I am aware about difficulties 
> that are caused by distributed environment when trying to compute median, 
> nevertheless I don't think those difficulties is reason good enough to drop 
> out `median` function from scope of Spark. I am not asking about efficient 
> median but exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate

2021-12-01 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37480:
--
Affects Version/s: 3.2.0
   (was: 3.3.0)

> Configurations in docs/running-on-kubernetes.md are not uptodate
> 
>
> Key: SPARK-37480
> URL: https://issues.apache.org/jira/browse/SPARK-37480
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate

2021-12-01 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37480:
--
Fix Version/s: 3.2.1

> Configurations in docs/running-on-kubernetes.md are not uptodate
> 
>
> Key: SPARK-37480
> URL: https://issues.apache.org/jira/browse/SPARK-37480
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result

2021-12-01 Thread koert kuipers (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451921#comment-17451921
 ] 

koert kuipers commented on SPARK-37476:
---

yes it works well

 

 
{code:java}
import java.lang.{Double => JDouble}

val sumAgg = new Aggregator[Double, JDouble, JDouble] {
  def zero: JDouble = null

  def reduce(b: JDouble, a: Double): JDouble =
    if (b == null) {
      a
    } else {
      b + a
    }

  def merge(b1: JDouble, b2: JDouble): JDouble =
    if (b1 == null) {
  b2
    } else if (b2 == null) {
      b1
    } else {
      b1 + b2
    }

  def finish(r: JDouble): JDouble = r

  def bufferEncoder: Encoder[JDouble] = ExpressionEncoder()
  def outputEncoder: Encoder[JDouble] = ExpressionEncoder()
}

val df = Seq.empty[Double]
  .toDF()
  .select(udaf(sumAgg).apply(col("value")))

df.printSchema()
root
 |-- $anon$3(value): double (nullable = true) 

df.show()
+--+
|$anon$3(value)|
+--+
|          null|
+--+
 {code}
it works with Option without issues too

 

> udaf doesnt work with nullable (or option of) case class result 
> 
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
>Reporter: koert kuipers
>Priority: Minor
>
> i have a need to have a dataframe aggregation return a nullable case class. 
> there seems to be no way to get this to work. the suggestion to wrap the 
> result in an option doesnt work either.
> first attempt using nulls:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double)
> val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] {
>   def zero: SumAndProduct = null
>   def reduce(b: SumAndProduct, a: Double): SumAndProduct =
>     if (b == null) {
>       SumAndProduct(a, a)
>     } else {
>       SumAndProduct(b.sum + a, b.product * a)
>     }
>   def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct =
>     if (b1 == null) {
>       b2
>     } else if (b2 == null) {
>       b1
>     } else {
>       SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>     }
>   def finish(r: SumAndProduct): SumAndProduct = r
>   def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
>   def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$3(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1491.0 (TID 1929)
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).sum AS sum#20070
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).product AS product#20071
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125)
>  {code}
> i dont really understand the error, this result is not a top-level row object.
> anyhow taking the advice to heart and using option we get to the second 
> attempt using options:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double) 
> val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], 
> Option[SumAndProduct]] {
>   def zero: Option[SumAndProduct] = None
>   def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] =
>     b
>       .map{ b => SumAndProduct(b.sum + a, b.product * a) }
>       .orElse{ Option(SumAndProduct(a, a)) }
>   def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): 
> Option[SumAndProduct] =
>     b1.map{ b1 =>
>       b2.map{ b2 =>
>         SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>       }.getOrElse(b1)
>     }.orElse(b2)
>   def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r
>   def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
>   def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$4(value

[jira] [Comment Edited] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result

2021-12-01 Thread koert kuipers (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451921#comment-17451921
 ] 

koert kuipers edited comment on SPARK-37476 at 12/1/21, 4:57 PM:
-

yes it works well
{code:java}
import java.lang.{Double => JDouble}

val sumAgg = new Aggregator[Double, JDouble, JDouble] {
  def zero: JDouble = null

  def reduce(b: JDouble, a: Double): JDouble =
    if (b == null) {
      a
    } else {
      b + a
    }

  def merge(b1: JDouble, b2: JDouble): JDouble =
    if (b1 == null) {
  b2
    } else if (b2 == null) {
      b1
    } else {
      b1 + b2
    }

  def finish(r: JDouble): JDouble = r

  def bufferEncoder: Encoder[JDouble] = ExpressionEncoder()
  def outputEncoder: Encoder[JDouble] = ExpressionEncoder()
}

val df = Seq.empty[Double]
  .toDF()
  .select(udaf(sumAgg).apply(col("value")))

df.printSchema()
root
 |-- $anon$3(value): double (nullable = true) 

df.show()
+--+
|$anon$3(value)|
+--+
|          null|
+--+
 {code}
it works with Option without issues too

 


was (Author: koert):
yes it works well

 

 
{code:java}
import java.lang.{Double => JDouble}

val sumAgg = new Aggregator[Double, JDouble, JDouble] {
  def zero: JDouble = null

  def reduce(b: JDouble, a: Double): JDouble =
    if (b == null) {
      a
    } else {
      b + a
    }

  def merge(b1: JDouble, b2: JDouble): JDouble =
    if (b1 == null) {
  b2
    } else if (b2 == null) {
      b1
    } else {
      b1 + b2
    }

  def finish(r: JDouble): JDouble = r

  def bufferEncoder: Encoder[JDouble] = ExpressionEncoder()
  def outputEncoder: Encoder[JDouble] = ExpressionEncoder()
}

val df = Seq.empty[Double]
  .toDF()
  .select(udaf(sumAgg).apply(col("value")))

df.printSchema()
root
 |-- $anon$3(value): double (nullable = true) 

df.show()
+--+
|$anon$3(value)|
+--+
|          null|
+--+
 {code}
it works with Option without issues too

 

> udaf doesnt work with nullable (or option of) case class result 
> 
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
>Reporter: koert kuipers
>Priority: Minor
>
> i have a need to have a dataframe aggregation return a nullable case class. 
> there seems to be no way to get this to work. the suggestion to wrap the 
> result in an option doesnt work either.
> first attempt using nulls:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double)
> val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] {
>   def zero: SumAndProduct = null
>   def reduce(b: SumAndProduct, a: Double): SumAndProduct =
>     if (b == null) {
>       SumAndProduct(a, a)
>     } else {
>       SumAndProduct(b.sum + a, b.product * a)
>     }
>   def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct =
>     if (b1 == null) {
>       b2
>     } else if (b2 == null) {
>       b1
>     } else {
>       SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>     }
>   def finish(r: SumAndProduct): SumAndProduct = r
>   def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
>   def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$3(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 1491.0 (TID 1929)
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).sum AS sum#20070
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, 
> true])).product AS product#20071
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125)
>  {code}
> i dont really understand the error, this result is not a top-level row object.
> anyhow taking the advice to heart and using option we get to the second 
> attempt using options:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double) 
> val sumAndProductAgg = new Aggregator[Double, Opti

[jira] [Commented] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451900#comment-17451900
 ] 

Apache Spark commented on SPARK-37326:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34771

> Support TimestampNTZ in CSV data source
> ---
>
> Key: SPARK-37326
> URL: https://issues.apache.org/jira/browse/SPARK-37326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451899#comment-17451899
 ] 

Apache Spark commented on SPARK-37326:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34771

> Support TimestampNTZ in CSV data source
> ---
>
> Key: SPARK-37326
> URL: https://issues.apache.org/jira/browse/SPARK-37326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451882#comment-17451882
 ] 

Apache Spark commented on SPARK-37480:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34770

> Configurations in docs/running-on-kubernetes.md are not uptodate
> 
>
> Key: SPARK-37480
> URL: https://issues.apache.org/jira/browse/SPARK-37480
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37501) CREATE/REPLACE TABLE should qualify location for v2 command

2021-12-01 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37501:
---

Assignee: PengLei

> CREATE/REPLACE TABLE should qualify location for v2 command
> ---
>
> Key: SPARK-37501
> URL: https://issues.apache.org/jira/browse/SPARK-37501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-37501) CREATE/REPLACE TABLE should qualify location for v2 command

2021-12-01 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37501.
-
Resolution: Fixed

Issue resolved by pull request 34758
[https://github.com/apache/spark/pull/34758]

> CREATE/REPLACE TABLE should qualify location for v2 command
> ---
>
> Key: SPARK-37501
> URL: https://issues.apache.org/jira/browse/SPARK-37501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses UTC time zone

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451767#comment-17451767
 ] 

Apache Spark commented on SPARK-37463:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34769

> Read/Write Timestamp ntz from/to Orc uses UTC time zone
> ---
>
> Key: SPARK-37463
> URL: https://issues.apache.org/jira/browse/SPARK-37463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> There are some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp 
> '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output show below looks so strange.
> Time zone is America/Los_Angeles
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +---+---+---+
> Time zone is UTC
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +---+---+---+
> Time zone is Europe/Amsterdam
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +---+---+---+



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses UTC time zone

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451766#comment-17451766
 ] 

Apache Spark commented on SPARK-37463:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34769

> Read/Write Timestamp ntz from/to Orc uses UTC time zone
> ---
>
> Key: SPARK-37463
> URL: https://issues.apache.org/jira/browse/SPARK-37463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> There are some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp 
> '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output show below looks so strange.
> Time zone is America/Los_Angeles
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +---+---+---+
> Time zone is UTC
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +---+---+---+
> Time zone is Europe/Amsterdam
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +---+---+---+



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-11150) Dynamic partition pruning

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451764#comment-17451764
 ] 

Apache Spark commented on SPARK-11150:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/34768

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Assignee: Wei Xue
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
> Attachments: image-2019-10-04-11-20-02-616.png
>
>
> Implements dynamic partition pruning by adding a dynamic-partition-pruning 
> filter if there is a partitioned table and a filter on the dimension table. 
> The filter is then planned using a heuristic approach:
>  # As a broadcast relation if it is a broadcast hash join. The broadcast 
> relation will then be transformed into a reused broadcast exchange by the 
> {{ReuseExchange}} rule; or
>  # As a subquery duplicate if the estimated benefit of partition table scan 
> being saved is greater than the estimated cost of the extra scan of the 
> duplicated subquery; otherwise
>  # As a bypassed condition ({{true}}).
>  Below shows a basic example of DPP.
> !image-2019-10-04-11-20-02-616.png|width=521,height=225!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented

2021-12-01 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37460:
---

Assignee: Yuto Akutsu

> ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
> -
>
> Key: SPARK-37460
> URL: https://issues.apache.org/jira/browse/SPARK-37460
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: Yuto Akutsu
>Assignee: Yuto Akutsu
>Priority: Minor
>
> The instruction of {color:#ff}ALTER DATABASE ... SET LOCATION ... 
> {color:#172b4d}command{color}{color} should be documented in a 
> sql-ref-syntax-ddl-create-table page.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-11150) Dynamic partition pruning

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451762#comment-17451762
 ] 

Apache Spark commented on SPARK-11150:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/34768

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Assignee: Wei Xue
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
> Attachments: image-2019-10-04-11-20-02-616.png
>
>
> Implements dynamic partition pruning by adding a dynamic-partition-pruning 
> filter if there is a partitioned table and a filter on the dimension table. 
> The filter is then planned using a heuristic approach:
>  # As a broadcast relation if it is a broadcast hash join. The broadcast 
> relation will then be transformed into a reused broadcast exchange by the 
> {{ReuseExchange}} rule; or
>  # As a subquery duplicate if the estimated benefit of partition table scan 
> being saved is greater than the estimated cost of the extra scan of the 
> duplicated subquery; otherwise
>  # As a bypassed condition ({{true}}).
>  Below shows a basic example of DPP.
> !image-2019-10-04-11-20-02-616.png|width=521,height=225!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented

2021-12-01 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37460.
-
Fix Version/s: 3.3.0
   3.2.1
   Resolution: Fixed

Issue resolved by pull request 34718
[https://github.com/apache/spark/pull/34718]

> ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
> -
>
> Key: SPARK-37460
> URL: https://issues.apache.org/jira/browse/SPARK-37460
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: Yuto Akutsu
>Assignee: Yuto Akutsu
>Priority: Minor
> Fix For: 3.3.0, 3.2.1
>
>
> The instruction of {color:#ff}ALTER DATABASE ... SET LOCATION ... 
> {color:#172b4d}command{color}{color} should be documented in a 
> sql-ref-syntax-ddl-create-table page.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37508) Add CONTAINS() function

2021-12-01 Thread Max Gekk (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-37508:


Assignee: angerszhu

> Add CONTAINS() function
> ---
>
> Key: SPARK-37508
> URL: https://issues.apache.org/jira/browse/SPARK-37508
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
>
> {{contains()}} is a common convenient function supported by a number of 
> database systems:
>  # 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
>  # [https://docs.snowflake.com/en/_static/apple-touch-icon.png!CONTAINS — 
> Snowflake 
> Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]
> Proposed syntax:
> {code:java}
> contains(haystack, needle)
> return type: boolean {code}
> It is semantically equivalent to {{haystack like '%needle%'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-37508) Add CONTAINS() function

2021-12-01 Thread Max Gekk (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-37508.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34761
[https://github.com/apache/spark/pull/34761]

> Add CONTAINS() function
> ---
>
> Key: SPARK-37508
> URL: https://issues.apache.org/jira/browse/SPARK-37508
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> {{contains()}} is a common convenient function supported by a number of 
> database systems:
>  # 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr]
>  # [https://docs.snowflake.com/en/_static/apple-touch-icon.png!CONTAINS — 
> Snowflake 
> Documentation|https://docs.snowflake.com/en/sql-reference/functions/contains.html]
> Proposed syntax:
> {code:java}
> contains(haystack, needle)
> return type: boolean {code}
> It is semantically equivalent to {{haystack like '%needle%'}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark

2021-12-01 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37511:


Assignee: Xinrong Meng

> Introduce TimedeltaIndex to pandas API on Spark
> ---
>
> Key: SPARK-37511
> URL: https://issues.apache.org/jira/browse/SPARK-37511
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Introduce TimedeltaIndex to pandas API on Spark.
> Properties, functions, and basic operations of TimedeltaIndex will be 
> supported in follow-up PRs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark

2021-12-01 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37511.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34657
[https://github.com/apache/spark/pull/34657]

> Introduce TimedeltaIndex to pandas API on Spark
> ---
>
> Key: SPARK-37511
> URL: https://issues.apache.org/jira/browse/SPARK-37511
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Introduce TimedeltaIndex to pandas API on Spark.
> Properties, functions, and basic operations of TimedeltaIndex will be 
> supported in follow-up PRs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-12-01 Thread zhangyangyang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451620#comment-17451620
 ] 

zhangyangyang edited comment on SPARK-35332 at 12/1/21, 8:49 AM:
-

[~luxianghao]  https://issues.apache.org/jira/browse/MAPREDUCE-6944, hello, 
please help me for the last comment. Thank you very much


was (Author: yangyang):
https://issues.apache.org/jira/browse/MAPREDUCE-6944, hello, please help me for 
the last comment. Thank you very much

> Not Coalesce shuffle partitions when cache table
> 
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
>Reporter: Xianghao Lu
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _linux shell command to prepare data:_
>  for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
> data.text
> _sql to reproduce the problem:_
>  * create table data_table(id int, str string, num int) row format delimited 
> fields terminated by ',';
>  * load data local inpath '/path/to/data.text' into table data_table;
>  * CACHE TABLE test_cache_table AS
>  SELECT str
>  FROM
>  (SELECT id,str FROM data_table
>  )group by str;
> Finally you will see a stage with 200 tasks and not coalesce shuffle 
> partitions, the problem will waste resource when data size is small.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-12-01 Thread zhangyangyang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451620#comment-17451620
 ] 

zhangyangyang commented on SPARK-35332:
---

https://issues.apache.org/jira/browse/MAPREDUCE-6944, hello, please help me for 
the last comment. Thank you very much

> Not Coalesce shuffle partitions when cache table
> 
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
>Reporter: Xianghao Lu
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _linux shell command to prepare data:_
>  for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
> data.text
> _sql to reproduce the problem:_
>  * create table data_table(id int, str string, num int) row format delimited 
> fields terminated by ',';
>  * load data local inpath '/path/to/data.text' into table data_table;
>  * CACHE TABLE test_cache_table AS
>  SELECT str
>  FROM
>  (SELECT id,str FROM data_table
>  )group by str;
> Finally you will see a stage with 200 tasks and not coalesce shuffle 
> partitions, the problem will waste resource when data size is small.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37461) yarn-client mode client's appid value is null

2021-12-01 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451619#comment-17451619
 ] 

Apache Spark commented on SPARK-37461:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34767

> yarn-client mode client's appid value is null
> -
>
> Key: SPARK-37461
> URL: https://issues.apache.org/jira/browse/SPARK-37461
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.3.0
>
>
> In yarn-client mode, *Client.appId* variable is not assigned, it is always 
> {*}null{*}, in cluster mode, this variable will be assigned to the true 
> value. In this patch, we assign true application id to `appId` too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index

2021-12-01 Thread Xinrong Meng (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451610#comment-17451610
 ] 

Xinrong Meng commented on SPARK-37512:
--

I am working on this.

> Support TimedeltaIndex creation given a timedelta Series/Index
> --
>
> Key: SPARK-37512
> URL: https://issues.apache.org/jira/browse/SPARK-37512
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> To solve the issues below:
> {code:java}
> >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)])
> >>> idx
> TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], 
> dtype='timedelta64[ns]', freq=None)
> >>> ps.TimedeltaIndex(idx)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  
>  
> {code:java}
> >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20])
> >>> s
> 10          1 days 00:00:00
> 20   0 days 00:00:00.02
> dtype: timedelta64[ns]
> >>> ps.TimedeltaIndex(s)
> Traceback (most recent call last):
> ...
>     raise TypeError("astype can not be applied to %s." % self.pretty_name)
> TypeError: astype can not be applied to timedelta.
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-37513) date +/- interval with only day-time fields returns different data type between Spark3.2 and Spark3.1

2021-12-01 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37513.
-
Fix Version/s: 3.3.0
   3.2.1
   Resolution: Fixed

Issue resolved by pull request 34766
[https://github.com/apache/spark/pull/34766]

> date +/- interval with only day-time fields returns different data type 
> between Spark3.2 and Spark3.1
> -
>
> Key: SPARK-37513
> URL: https://issues.apache.org/jira/browse/SPARK-37513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
>
> {code:java}
> select date '2011-11-11' + interval 12 hours;
> {code}
>  Previously returned the date type, now it returns the timestamp type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37513) date +/- interval with only day-time fields returns different data type between Spark3.2 and Spark3.1

2021-12-01 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37513:
---

Assignee: jiaan.geng

> date +/- interval with only day-time fields returns different data type 
> between Spark3.2 and Spark3.1
> -
>
> Key: SPARK-37513
> URL: https://issues.apache.org/jira/browse/SPARK-37513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> {code:java}
> select date '2011-11-11' + interval 12 hours;
> {code}
>  Previously returned the date type, now it returns the timestamp type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-37458) Remove unnecessary object serialization on foreachBatch

2021-12-01 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37458:
-

Assignee: Jungtaek Lim

> Remove unnecessary object serialization on foreachBatch
> ---
>
> Key: SPARK-37458
> URL: https://issues.apache.org/jira/browse/SPARK-37458
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> Currently, ForeachBatchSink leverages ExternalRDD with converting 
> RDD[InternalRow] to RDD[T], to provide Dataset[T] to the user function. This 
> adds SerializeFromObject in the plan, which is actually not required.
> We can leverage LogicalRDD instead, to remove SerializeFromObject from the 
> plan.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-37458) Remove unnecessary object serialization on foreachBatch

2021-12-01 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-37458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37458.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34706
[https://github.com/apache/spark/pull/34706]

> Remove unnecessary object serialization on foreachBatch
> ---
>
> Key: SPARK-37458
> URL: https://issues.apache.org/jira/browse/SPARK-37458
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, ForeachBatchSink leverages ExternalRDD with converting 
> RDD[InternalRow] to RDD[T], to provide Dataset[T] to the user function. This 
> adds SerializeFromObject in the plan, which is actually not required.
> We can leverage LogicalRDD instead, to remove SerializeFromObject from the 
> plan.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

67 matches

Mail list logo