[jira] [Resolved] (SPARK-35777) Check all year-month interval types in UDF

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35777.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32985
[https://github.com/apache/spark/pull/32985]

> Check all year-month interval types in UDF
> --
>
> Key: SPARK-35777
> URL: https://issues.apache.org/jira/browse/SPARK-35777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Check all year-month interval types in UDF:
> # INTERVAL YEAR
> # INTERVAL YEAR TO MONTH
> # INTERVAL MONTH
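
For reference, a minimal PySpark sketch (not the test the ticket adds) that produces the three year-month interval types via SQL literals; the literal values below are illustrative assumptions:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative literals only; the ticket's tests exercise these types through UDFs.
spark.sql("""
    SELECT INTERVAL '3' YEAR            AS y,
           INTERVAL '1-2' YEAR TO MONTH AS ym,
           INTERVAL '7' MONTH           AS m
""").printSchema()
{code}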



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35777) Check all year-month interval types in UDF

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35777:


Assignee: angerszhu

> Check all year-month interval types in UDF
> --
>
> Key: SPARK-35777
> URL: https://issues.apache.org/jira/browse/SPARK-35777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
>
> Check all year-month interval types in UDF:
> # INTERVAL YEAR
> # INTERVAL YEAR TO MONTH
> # INTERVAL MONTH






[jira] [Commented] (SPARK-35778) Check multiply/divide of year-month intervals of any fields by numeric

2021-06-23 Thread PengLei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368636#comment-17368636
 ] 

PengLei commented on SPARK-35778:
-

[~angerszhuuu] May I continue this work and finish the task?

> Check multiply/divide of year-month intervals of any fields by numeric
> --
>
> Key: SPARK-35778
> URL: https://issues.apache.org/jira/browse/SPARK-35778
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Write tests that check multiplication/division of the following interval types by numeric values:
> # INTERVAL YEAR
> # INTERVAL YEAR TO MONTH
> # INTERVAL MONTH
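
A minimal sketch of the kind of expression these tests would cover; the interval values are illustrative assumptions:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Multiply and divide year-month intervals by numeric values.
spark.sql("SELECT INTERVAL '1-6' YEAR TO MONTH * 2 AS doubled").show()
spark.sql("SELECT INTERVAL '3' YEAR / 2 AS halved").show()
spark.sql("SELECT INTERVAL '9' MONTH * 1.5 AS scaled").show()
{code}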






[jira] [Resolved] (SPARK-34885) Port/integrate Koalas documentation into PySpark

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34885.
--
Resolution: Done

> Port/integrate Koalas documentation into PySpark
> 
>
> Key: SPARK-34885
> URL: https://issues.apache.org/jira/browse/SPARK-34885
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to port [Koalas 
> documentation|https://koalas.readthedocs.io/en/latest/index.html] 
> appropriately to [PySpark 
> documentation|https://spark.apache.org/docs/latest/api/python/index.html].






[jira] [Resolved] (SPARK-35696) Refine the code examples in the pandas API on Spark documentation.

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35696.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33017
[https://github.com/apache/spark/pull/33017]

> Refine the code examples in the pandas API on Spark documentation.
> --
>
> Key: SPARK-35696
> URL: https://issues.apache.org/jira/browse/SPARK-35696
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> There are code examples left over from Koalas, such as `kdf = ks.DataFrame...`, in the 
> documentation.
>  
> We should fix them to match the pandas API on Spark.
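
A hedged illustration of the kind of change intended; the DataFrame contents are made up for the example:

{code:python}
# Before (leftover Koalas style in the docs):
#   import databricks.koalas as ks
#   kdf = ks.DataFrame(...)

# After (pandas API on Spark style):
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})
print(psdf.head())
{code}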






[jira] [Assigned] (SPARK-35696) Refine the code examples in the pandas API on Spark documentation.

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35696:


Assignee: Haejoon Lee

> Refine the code examples in the pandas API on Spark documentation.
> --
>
> Key: SPARK-35696
> URL: https://issues.apache.org/jira/browse/SPARK-35696
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> There are code examples left over from Koalas, such as `kdf = ks.DataFrame...`, in the 
> documentation.
>  
> We should fix them to match the pandas API on Spark.






[jira] [Resolved] (SPARK-35868) Add fs.s3a.downgrade.syncable.exceptions if not set

2021-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35868.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33044
[https://github.com/apache/spark/pull/33044]

> Add fs.s3a.downgrade.syncable.exceptions if not set
> ---
>
> Key: SPARK-35868
> URL: https://issues.apache.org/jira/browse/SPARK-35868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 3.2.0
>
>







[jira] [Assigned] (SPARK-35852) Improve the implementation for DateType +/- DayTimeIntervalType(DAY)

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35852:


Assignee: PengLei

> Improve the implementation for DateType +/- DayTimeIntervalType(DAY)
> 
>
> Key: SPARK-35852
> URL: https://issues.apache.org/jira/browse/SPARK-35852
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, `DateType +/- DayTimeIntervalType()` converts the DateType operand to 
> TimestampType and then applies TimeAdd. When the interval type is DayTimeIntervalType(DAY), 
> DateAdd can be used instead of TimeAdd.
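
A minimal sketch of the case in question; the date and interval values are illustrative:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A date plus a day-only interval can stay in the date domain (DateAdd)
# instead of being widened to a timestamp (TimeAdd).
spark.sql("SELECT DATE '2021-06-23' + INTERVAL '7' DAY AS next_week").show()
{code}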






[jira] [Resolved] (SPARK-35852) Improve the implementation for DateType +/- DayTimeIntervalType(DAY)

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35852.
--
Resolution: Fixed

Issue resolved by pull request 33033
[https://github.com/apache/spark/pull/33033]

> Improve the implementation for DateType +/- DayTimeIntervalType(DAY)
> 
>
> Key: SPARK-35852
> URL: https://issues.apache.org/jira/browse/SPARK-35852
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: PengLei
>Assignee: PengLei
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, `DateType +/- DayTimeIntervalType()` converts the DateType operand to 
> TimestampType and then applies TimeAdd. When the interval type is DayTimeIntervalType(DAY), 
> DateAdd can be used instead of TimeAdd.






[jira] [Assigned] (SPARK-35868) Add fs.s3a.downgrade.syncable.exceptions if not set

2021-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35868:
-

Assignee: Dongjoon Hyun

> Add fs.s3a.downgrade.syncable.exceptions if not set
> ---
>
> Key: SPARK-35868
> URL: https://issues.apache.org/jira/browse/SPARK-35868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
>







[jira] [Assigned] (SPARK-35301) Document migration guide from Koalas to pandas APIs on Spark

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35301:


Assignee: (was: Apache Spark)

> Document migration guide from Koalas to pandas APIs on Spark
> 
>
> Key: SPARK-35301
> URL: https://issues.apache.org/jira/browse/SPARK-35301
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to document the migration from the third-party Koalas package to the 
> pandas API on Spark in Apache Spark.
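
The core of that migration is usually just swapping the import; a rough sketch, not the guide itself:

{code:python}
# Koalas (third-party package):
#   import databricks.koalas as ks
#   kdf = ks.read_csv("data.csv")

# pandas API on Spark (bundled with PySpark since 3.2):
import pyspark.pandas as ps

psdf = ps.read_csv("data.csv")  # "data.csv" is a placeholder path
{code}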






[jira] [Commented] (SPARK-35301) Document migration guide from Koalas to pandas APIs on Spark

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368632#comment-17368632
 ] 

Apache Spark commented on SPARK-35301:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33050

> Document migration guide from Koalas to pandas APIs on Spark
> 
>
> Key: SPARK-35301
> URL: https://issues.apache.org/jira/browse/SPARK-35301
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to document the migration from the third-party Koalas package to the 
> pandas API on Spark in Apache Spark.






[jira] [Assigned] (SPARK-35301) Document migration guide from Koalas to pandas APIs on Spark

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35301:


Assignee: Apache Spark

> Document migration guide from Koalas to pandas APIs on Spark
> 
>
> Key: SPARK-35301
> URL: https://issues.apache.org/jira/browse/SPARK-35301
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA aims to document the migration from the third-party Koalas package to the 
> pandas API on Spark in Apache Spark.






[jira] [Updated] (SPARK-35783) Set the list of read columns in the task configuration to reduce reading of ORC data.

2021-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35783:
--
Fix Version/s: 3.0.4
   3.1.3

> Set the list of read columns in the task configuration to reduce reading of 
> ORC data.
> -
>
> Key: SPARK-35783
> URL: https://issues.apache.org/jira/browse/SPARK-35783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Major
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> Currently, the ORC reader reads all columns of an ORC table when the task 
> configuration does not set the list of read columns. We should therefore set the 
> list of read columns in the task configuration to reduce the amount of ORC data read.
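
From the user's side, the intent is simply that selecting a subset of columns should not read the whole ORC file; a minimal sketch of that user-facing effect (the path and column names are placeholders):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With the fix, only the requested columns should be read from the ORC files,
# because the read column list is propagated to the task configuration.
df = spark.read.orc("/tmp/example_orc")  # placeholder path
df.select("a", "b").show()
{code}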






[jira] [Updated] (SPARK-35783) Set the list of read columns in the task configuration to reduce reading of ORC data.

2021-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35783:
--
Issue Type: Bug  (was: Improvement)

> Set the list of read columns in the task configuration to reduce reading of 
> ORC data.
> -
>
> Key: SPARK-35783
> URL: https://issues.apache.org/jira/browse/SPARK-35783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, the ORC reader reads all columns of an ORC table when the task 
> configuration does not set the list of read columns. We should therefore set the 
> list of read columns in the task configuration to reduce the amount of ORC data read.






[jira] [Updated] (SPARK-35783) Set the list of read columns in the task configuration to reduce reading of ORC data.

2021-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35783:
--
Affects Version/s: 3.0.3
   3.1.2

> Set the list of read columns in the task configuration to reduce reading of 
> ORC data.
> -
>
> Key: SPARK-35783
> URL: https://issues.apache.org/jira/browse/SPARK-35783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, the ORC reader reads all columns of an ORC table when the task 
> configuration does not set the list of read columns. We should therefore set the 
> list of read columns in the task configuration to reduce the amount of ORC data read.






[jira] [Updated] (SPARK-35605) Move to_pandas_on_spark to the Spark DataFrame.

2021-06-23 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35605:

Summary: Move to_pandas_on_spark to the Spark DataFrame.  (was: Move 
pandas_on_spark accessor to the Spark DataFrame.)

> Move to_pandas_on_spark to the Spark DataFrame.
> ---
>
> Key: SPARK-35605
> URL: https://issues.apache.org/jira/browse/SPARK-35605
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Inspired by https://github.com/apache/spark/pull/32729#discussion_r643591322.
> Now that Koalas is ported into PySpark, we no longer need the auto-patching 
> ([https://github.com/apache/spark/blob/master/python/pyspark/pandas/__init__.py#L136-L150]).
> Thus, we should move to_pandas_on_spark and to_koalas (deprecated) to the PySpark 
> DataFrame and add related tests.
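
After the move, the conversion would be callable directly on a PySpark DataFrame, roughly as follows (a sketch, assuming the method keeps the name to_pandas_on_spark):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sdf = spark.range(3)             # a plain PySpark DataFrame
psdf = sdf.to_pandas_on_spark()  # pandas-on-Spark DataFrame, no auto-patching needed
print(type(psdf))
{code}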






[jira] [Commented] (SPARK-35678) add a common softmax function

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368602#comment-17368602
 ] 

Apache Spark commented on SPARK-35678:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/33049

> add a common softmax function
> -
>
> Key: SPARK-35678
> URL: https://issues.apache.org/jira/browse/SPARK-35678
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.2.0
>
>
> Add a softmax function to the shared utils so it can be reused in multiple places.
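
The ticket targets a Scala utility in ML; purely as an illustration of the function being factored out, a numerically stable softmax looks roughly like this:

{code:python}
import numpy as np

def softmax(values):
    """Numerically stable softmax: subtract the max before exponentiating."""
    values = np.asarray(values, dtype=float)
    exp = np.exp(values - values.max())
    return exp / exp.sum()

print(softmax([1.0, 2.0, 3.0]))  # [0.09003057 0.24472847 0.66524096]
{code}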






[jira] [Commented] (SPARK-35678) add a common softmax function

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368603#comment-17368603
 ] 

Apache Spark commented on SPARK-35678:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/33049

> add a common softmax function
> -
>
> Key: SPARK-35678
> URL: https://issues.apache.org/jira/browse/SPARK-35678
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.2.0
>
>
> Add a softmax function to the shared utils so it can be reused in multiple places.






[jira] [Resolved] (SPARK-35869) Cannot run program "python" error when run do-release-docker.sh

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35869.
--
Fix Version/s: 3.2.0
 Assignee: wuyi
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/33048

> Cannot run program "python" error when run do-release-docker.sh
> ---
>
> Key: SPARK-35869
> URL: https://issues.apache.org/jira/browse/SPARK-35869
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.2.0
>
>
> Moving back into docs dir.
> Moving to SQL directory and building docs.
> Generating SQL API Markdown files.
> 21/06/15 09:45:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.io.IOException: Cannot run program "python": 
> error=2, No such file or directory
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
> at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: error=2, No such file or directory
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 14 more






[jira] [Commented] (SPARK-35869) Cannot run program "python" error when run do-release-docker.sh

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368596#comment-17368596
 ] 

Apache Spark commented on SPARK-35869:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/33048

> Cannot run program "python" error when run do-release-docker.sh
> ---
>
> Key: SPARK-35869
> URL: https://issues.apache.org/jira/browse/SPARK-35869
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>Reporter: wuyi
>Priority: Major
>
> Moving back into docs dir.
> Moving to SQL directory and building docs.
> Generating SQL API Markdown files.
> 21/06/15 09:45:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.io.IOException: Cannot run program "python": 
> error=2, No such file or directory
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
> at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: error=2, No such file or directory
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 14 more






[jira] [Commented] (SPARK-35869) Cannot run program "python" error when run do-release-docker.sh

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368597#comment-17368597
 ] 

Apache Spark commented on SPARK-35869:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/33048

> Cannot run program "python" error when run do-release-docker.sh
> ---
>
> Key: SPARK-35869
> URL: https://issues.apache.org/jira/browse/SPARK-35869
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>Reporter: wuyi
>Priority: Major
>
> Moving back into docs dir.
> Moving to SQL directory and building docs.
> Generating SQL API Markdown files.
> 21/06/15 09:45:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.io.IOException: Cannot run program "python": 
> error=2, No such file or directory
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
> at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: error=2, No such file or directory
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 14 more






[jira] [Assigned] (SPARK-35869) Cannot run program "python" error when run do-release-docker.sh

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35869:


Assignee: (was: Apache Spark)

> Cannot run program "python" error when run do-release-docker.sh
> ---
>
> Key: SPARK-35869
> URL: https://issues.apache.org/jira/browse/SPARK-35869
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>Reporter: wuyi
>Priority: Major
>
> Moving back into docs dir.
> Moving to SQL directory and building docs.
> Generating SQL API Markdown files.
> 21/06/15 09:45:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.io.IOException: Cannot run program "python": 
> error=2, No such file or directory
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
> at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: error=2, No such file or directory
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 14 more






[jira] [Assigned] (SPARK-35869) Cannot run program "python" error when run do-release-docker.sh

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35869:


Assignee: Apache Spark

> Cannot run program "python" error when run do-release-docker.sh
> ---
>
> Key: SPARK-35869
> URL: https://issues.apache.org/jira/browse/SPARK-35869
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> Moving back into docs dir.
> Moving to SQL directory and building docs.
> Generating SQL API Markdown files.
> 21/06/15 09:45:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.io.IOException: Cannot run program "python": 
> error=2, No such file or directory
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
> at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: error=2, No such file or directory
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 14 more






[jira] [Assigned] (SPARK-35730) Check all day-time interval types in UDF

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35730:


Assignee: (was: Apache Spark)

> Check all day-time interval types in UDF
> 
>
> Key: SPARK-35730
> URL: https://issues.apache.org/jira/browse/SPARK-35730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Check all day-time interval types in UDF:
> # INTERVAL DAY
> # INTERVAL DAY TO HOUR
> # INTERVAL DAY TO MINUTE
> # INTERVAL HOUR
> # INTERVAL HOUR TO MINUTE
> # INTERVAL HOUR TO SECOND
> # INTERVAL MINUTE
> # INTERVAL MINUTE TO SECOND
> # INTERVAL SECOND
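
For reference, a minimal PySpark sketch (not the test the ticket adds) that produces a few of these day-time interval types via SQL literals; the literal formats and values are illustrative assumptions:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative literals only; the ticket's tests exercise these types through UDFs.
spark.sql("""
    SELECT INTERVAL '5' DAY                AS d,
           INTERVAL '3 4' DAY TO HOUR      AS d2h,
           INTERVAL '10:30' HOUR TO MINUTE AS h2m,
           INTERVAL '45.123' SECOND        AS s
""").printSchema()
{code}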






[jira] [Assigned] (SPARK-35730) Check all day-time interval types in UDF

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35730:


Assignee: Apache Spark

> Check all day-time interval types in UDF
> 
>
> Key: SPARK-35730
> URL: https://issues.apache.org/jira/browse/SPARK-35730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Check all day-time interval types in UDF:
> # INTERVAL DAY
> # INTERVAL DAY TO HOUR
> # INTERVAL DAY TO MINUTE
> # INTERVAL HOUR
> # INTERVAL HOUR TO MINUTE
> # INTERVAL HOUR TO SECOND
> # INTERVAL MINUTE
> # INTERVAL MINUTE TO SECOND
> # INTERVAL SECOND






[jira] [Commented] (SPARK-35730) Check all day-time interval types in UDF

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368594#comment-17368594
 ] 

Apache Spark commented on SPARK-35730:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33047

> Check all day-time interval types in UDF
> 
>
> Key: SPARK-35730
> URL: https://issues.apache.org/jira/browse/SPARK-35730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Check all day-time interval types in UDF:
> # INTERVAL DAY
> # INTERVAL DAY TO HOUR
> # INTERVAL DAY TO MINUTE
> # INTERVAL HOUR
> # INTERVAL HOUR TO MINUTE
> # INTERVAL HOUR TO SECOND
> # INTERVAL MINUTE
> # INTERVAL MINUTE TO SECOND
> # INTERVAL SECOND






[jira] [Created] (SPARK-35869) Cannot run program "python" error when run do-release-docker.sh

2021-06-23 Thread wuyi (Jira)
wuyi created SPARK-35869:


 Summary: Cannot run program "python" error when run 
do-release-docker.sh
 Key: SPARK-35869
 URL: https://issues.apache.org/jira/browse/SPARK-35869
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.1.2, 3.1.1, 3.1.0, 3.0.2, 3.0.0, 3.0.3
Reporter: wuyi


Moving back into docs dir.
Moving to SQL directory and building docs.
Generating SQL API Markdown files.
21/06/15 09:45:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: Cannot run program "python": 
error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 14 more






[jira] [Commented] (SPARK-35282) Support AQE side shuffled hash join formula

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368563#comment-17368563
 ] 

Apache Spark commented on SPARK-35282:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/33046

> Support AQE side shuffled hash join formula
> ---
>
> Key: SPARK-35282
> URL: https://issues.apache.org/jira/browse/SPARK-35282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.2.0
>
>
> Use AQE runtime statistics to decide whether we can use a shuffled hash join instead 
> of a sort merge join. Currently, the formula for shuffled hash join selection 
> does not work due to the dynamic shuffle partition number.
>  
> Add a new config `spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold` to 
> decide whether a join can be safely converted to a shuffled hash join.
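
A minimal sketch of how the new config would be set; the threshold value is an illustrative assumption:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         # Illustrative per-map-task size threshold below which AQE may convert
         # a sort merge join into a shuffled hash join.
         .config("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "16MB")
         .getOrCreate())
{code}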






[jira] [Resolved] (SPARK-35569) Document the deprecation of Koalas Accessor after porting the documentation.

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35569.
--
Resolution: Duplicate

I will do it in SPARK-35301

> Document the deprecation of Koalas Accessor after porting the documentation.
> 
>
> Key: SPARK-35569
> URL: https://issues.apache.org/jira/browse/SPARK-35569
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> As we changed the name from 'koalas' to 'pandas_on_spark', the name of the 
> 'Koalas Accessor' has also changed to 'Pandas API on Spark Accessor'.
> However, we keep the 'koalas' name available on the Pandas API on Spark Accessor 
> for backward compatibility.
> Thus, after porting the documentation, we should document that we 
> recommend using 'pandas_on_spark' rather than 'koalas' when users use the 
> Pandas API on Spark Accessor, since the Koalas Accessor is deprecated.
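
A short sketch of what the recommendation would look like; the accessor method apply_batch is used here purely for illustration:

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})

# Recommended accessor name:
psdf.pandas_on_spark.apply_batch(lambda pdf: pdf + 1)

# Deprecated alias kept for backward compatibility:
psdf.koalas.apply_batch(lambda pdf: pdf + 1)
{code}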






[jira] [Assigned] (SPARK-34807) Push down filter through window after TransposeWindow

2021-06-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-34807:
---

Assignee: Tanel Kiis

> Push down filter through window after TransposeWindow
> -
>
> Key: SPARK-34807
> URL: https://issues.apache.org/jira/browse/SPARK-34807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Tanel Kiis
>Priority: Major
>
> {code:scala}
>   spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS 
> d").createTempView("t1")
>   val df = spark.sql(
> """
>   |SELECT *
>   |  FROM (
>   |SELECT b,
>   |  sum(d) OVER (PARTITION BY a, b),
>   |  rank() OVER (PARTITION BY a ORDER BY c)
>   |FROM t1
>   |  ) v1
>   |WHERE b = 2
>   |""".stripMargin)
> {code}
> Current optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY 
> c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
> +- Filter (b#221L = 2)
>+- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST ROWS BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC NULLS FIRST]
>   +- Project [b#221L, a#220L, c#222L, sum(d) OVER (PARTITION BY a, b ROWS 
> BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#231L]
>  +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L]
> +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS 
> a#220L, id#218L AS c#222L]
>+- Range (0, 10, step=1, splits=Some(2))
> {noformat}
> Expected optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY 
> c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
> +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L]
>+- Project [b#221L, d#223L, a#220L, RANK() OVER (PARTITION BY a ORDER BY c 
> ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
>   +- Filter (b#221L = 2)
>  +- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC 
> NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST 
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC 
> NULLS FIRST]
> +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS 
> a#220L, id#218L AS c#222L]
>+- Range (0, 10, step=1, splits=Some(2))
> {noformat}






[jira] [Resolved] (SPARK-34807) Push down filter through window after TransposeWindow

2021-06-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34807.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31980
[https://github.com/apache/spark/pull/31980]

> Push down filter through window after TransposeWindow
> -
>
> Key: SPARK-34807
> URL: https://issues.apache.org/jira/browse/SPARK-34807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
>   spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS 
> d").createTempView("t1")
>   val df = spark.sql(
> """
>   |SELECT *
>   |  FROM (
>   |SELECT b,
>   |  sum(d) OVER (PARTITION BY a, b),
>   |  rank() OVER (PARTITION BY a ORDER BY c)
>   |FROM t1
>   |  ) v1
>   |WHERE b = 2
>   |""".stripMargin)
> {code}
> Current optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY 
> c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
> +- Filter (b#221L = 2)
>+- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC NULLS 
> FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST ROWS BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC NULLS FIRST]
>   +- Project [b#221L, a#220L, c#222L, sum(d) OVER (PARTITION BY a, b ROWS 
> BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#231L]
>  +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L]
> +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS 
> a#220L, id#218L AS c#222L]
>+- Range (0, 10, step=1, splits=Some(2))
> {noformat}
> Expected optimized plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [b#221L, sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING)#231L, RANK() OVER (PARTITION BY a ORDER BY 
> c ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
> +- Window [sum(d#223L) windowspecdefinition(a#220L, b#221L, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS sum(d) OVER (PARTITION BY a, b ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING)#231L], [a#220L, b#221L]
>+- Project [b#221L, d#223L, a#220L, RANK() OVER (PARTITION BY a ORDER BY c 
> ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232]
>   +- Filter (b#221L = 2)
>  +- Window [rank(c#222L) windowspecdefinition(a#220L, c#222L ASC 
> NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS RANK() OVER (PARTITION BY a ORDER BY c ASC NULLS FIRST 
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#232], [a#220L], [c#222L ASC 
> NULLS FIRST]
> +- Project [id#218L AS b#221L, id#218L AS d#223L, id#218L AS 
> a#220L, id#218L AS c#222L]
>+- Range (0, 10, step=1, splits=Some(2))
> {noformat}






[jira] [Commented] (SPARK-35730) Check all day-time interval types in UDF

2021-06-23 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368556#comment-17368556
 ] 

angerszhu commented on SPARK-35730:
---

working on this

> Check all day-time interval types in UDF
> 
>
> Key: SPARK-35730
> URL: https://issues.apache.org/jira/browse/SPARK-35730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Check all day-time interval types in UDF:
> # INTERVAL DAY
> # INTERVAL DAY TO HOUR
> # INTERVAL DAY TO MINUTE
> # INTERVAL HOUR
> # INTERVAL HOUR TO MINUTE
> # INTERVAL HOUR TO SECOND
> # INTERVAL MINUTE
> # INTERVAL MINUTE TO SECOND
> # INTERVAL SECOND






[jira] [Commented] (SPARK-35623) Volcano resource manager for Spark on Kubernetes

2021-06-23 Thread Kevin Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368553#comment-17368553
 ] 

Kevin Su commented on SPARK-35623:
--

[~dipanjanK] Here is my email address. pings...@gmail.com

> Volcano resource manager for Spark on Kubernetes
> 
>
> Key: SPARK-35623
> URL: https://issues.apache.org/jira/browse/SPARK-35623
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 3.1.1, 3.1.2
>Reporter: Dipanjan Kailthya
>Priority: Minor
>  Labels: kubernetes, resourcemanager
>
> Dear Spark Developers, 
>   
>  Hello from the Netherlands! Posting this here as I still haven't gotten 
> accepted to post in the spark dev mailing list.
>   
>  My team is planning to use spark with Kubernetes support on our shared 
> (multi-tenant) on premise Kubernetes cluster. However we would like to have 
> certain scheduling features like fair-share and preemption which as we 
> understand are not built into the current spark-kubernetes resource manager 
> yet. We have been working on and are close to a first successful prototype 
> integration with Volcano ([https://volcano.sh/en/docs/]). Briefly this means 
> a new resource manager component with lots in common with existing 
> spark-kubernetes resource manager, but instead of pods it launches Volcano 
> jobs which delegate the driver and executor pod creation and lifecycle 
> management to Volcano. We are interested in contributing this to open source, 
> either directly in spark or as a separate project.
>   
>  So, two questions: 
>   
>  1. Do the spark maintainers see this as a valuable contribution to the 
> mainline spark codebase? If so, can we have some guidance on how to publish 
> the changes? 
>   
>  2. Are any other developers / organizations interested to contribute to this 
> effort? If so, please get in touch.
>   
>  Best,
>  Dipanjan






[jira] [Commented] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-06-23 Thread Nasir Ali (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368550#comment-17368550
 ] 

Nasir Ali commented on SPARK-33863:
---

[~dc-heros] Could you please share the output you got when you ran the above 
code?

> Pyspark UDF wrongly changes timestamps to UTC
> -
>
> Key: SPARK-33863
> URL: https://issues.apache.org/jira/browse/SPARK-33863
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.0.2, 3.1.0, 3.1.1, 3.1.2
> Environment: MAC/Linux
> Standalone cluster / local machine
>Reporter: Nasir Ali
>Priority: Major
>
> *Problem*:
> I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
> column using udf, pyspark udf wrongly changes timestamps into UTC time. ts 
> (timestamp) column is already in UTC time. Therefore, pyspark udf should not 
> convert ts (timestamp) column into UTC timestamp. 
> I have used following configs to let spark know the timestamps are in UTC:
>  
> {code:java}
> --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
> --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
> --conf spark.sql.session.timeZone=UTC
> {code}
> Below is a code snippet to reproduce the error:
>  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql.types import StringType, TimestampType
> import datetime
> spark = SparkSession.builder.config("spark.sql.session.timeZone", 
> "UTC").getOrCreate()
> df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-02-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-02-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-02-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-02-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-02-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-02-22T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.show(truncate=False)
> def some_time_udf(i):
>     if datetime.time(5, 0) <= i.time() < datetime.time(12, 0):
>         tmp = "Morning: " + str(i)
>     elif datetime.time(12, 0) <= i.time() < datetime.time(17, 0):
>         tmp = "Afternoon: " + str(i)
>     elif datetime.time(17, 0) <= i.time() < datetime.time(21, 0):
>         tmp = "Evening"
>     elif datetime.time(21, 0) <= i.time() < datetime.time(0, 0):
>         tmp = "Night"
>     elif datetime.time(0, 0) <= i.time() < datetime.time(5, 0):
>         tmp = "Night"
>     return tmp
> udf = F.udf(some_time_udf, StringType())
> df.withColumn("day_part", udf(df.ts)).show(truncate=False)
> {code}
>  
> Below is the output of the above code:
> {code:java}
> +----+-----+-------------------+----------------------------+
> |user|id   |ts                 |day_part                    |
> +----+-----+-------------------+----------------------------+
> |usr1|17.0 |2018-02-10 15:27:18|Morning: 2018-02-10 09:27:18|
> |usr1|13.0 |2018-02-11 12:27:18|Morning: 2018-02-11 06:27:18|
> |usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 05:27:18|
> |usr1|20.0 |2018-02-13 15:27:18|Morning: 2018-02-13 09:27:18|
> |usr1|17.0 |2018-02-14 12:27:18|Morning: 2018-02-14 06:27:18|
> |usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 05:27:18|
> |usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 05:27:18|
> +----+-----+-------------------+----------------------------+
> {code}
> Above output is incorrect. You can see ts and day_part columns don't have 
> same timestamps. Below is the output I would expect:
>  
> {code:java}
> +----+-----+-------------------+------------------------------+
> |user|id   |ts                 |day_part                      |
> +----+-----+-------------------+------------------------------+
> |usr1|17.0 |2018-02-10 15:27:18|Afternoon: 2018-02-10 15:27:18|
> |usr1|13.0 |2018-02-11 12:27:18|Afternoon: 2018-02-11 12:27:18|
> |usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 11:27:18  |
> |usr1|20.0 |2018-02-13 15:27:18|Afternoon: 2018-02-13 15:27:18|
> |usr1|17.0 |2018-02-14 12:27:18|Afternoon: 2018-02-14 12:27:18|
> |usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 11:27:18  |
> |usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 11:27:18  |
> +----+-----+-------------------+------------------------------+
> {code}
>  






[jira] [Reopened] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-06-23 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali reopened SPARK-33863:
---

Issue is not resolved. 

> Pyspark UDF wrongly changes timestamps to UTC
> -
>
> Key: SPARK-33863
> URL: https://issues.apache.org/jira/browse/SPARK-33863
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.0.2, 3.1.0, 3.1.1, 3.1.2
> Environment: MAC/Linux
> Standalone cluster / local machine
>Reporter: Nasir Ali
>Priority: Major
>
> *Problem*:
> I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
> column using udf, pyspark udf wrongly changes timestamps into UTC time. ts 
> (timestamp) column is already in UTC time. Therefore, pyspark udf should not 
> convert ts (timestamp) column into UTC timestamp. 
> I have used following configs to let spark know the timestamps are in UTC:
>  
> {code:java}
> --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
> --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
> --conf spark.sql.session.timeZone=UTC
> {code}
> Below is a code snippet to reproduce the error:
>  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql.types import StringType, TimestampType
> import datetime
> spark = SparkSession.builder.config("spark.sql.session.timeZone", 
> "UTC").getOrCreate()
> df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-02-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-02-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-02-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-02-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-02-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-02-22T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.show(truncate=False)
> def some_time_udf(i):
>     if datetime.time(5, 0) <= i.time() < datetime.time(12, 0):
>         tmp = "Morning: " + str(i)
>     elif datetime.time(12, 0) <= i.time() < datetime.time(17, 0):
>         tmp = "Afternoon: " + str(i)
>     elif datetime.time(17, 0) <= i.time() < datetime.time(21, 0):
>         tmp = "Evening"
>     else:  # 21:00 - 05:00, wrapping past midnight
>         tmp = "Night"
>     return tmp
> udf = F.udf(some_time_udf, StringType())
> df.withColumn("day_part", udf(df.ts)).show(truncate=False)
> {code}
>  
> Below is the output of the above code:
> {code:java}
> +----+-----+-------------------+----------------------------+
> |user|id   |ts                 |day_part                    |
> +----+-----+-------------------+----------------------------+
> |usr1|17.0 |2018-02-10 15:27:18|Morning: 2018-02-10 09:27:18|
> |usr1|13.0 |2018-02-11 12:27:18|Morning: 2018-02-11 06:27:18|
> |usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 05:27:18|
> |usr1|20.0 |2018-02-13 15:27:18|Morning: 2018-02-13 09:27:18|
> |usr1|17.0 |2018-02-14 12:27:18|Morning: 2018-02-14 06:27:18|
> |usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 05:27:18|
> |usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 05:27:18|
> +----+-----+-------------------+----------------------------+
> {code}
> The above output is incorrect. You can see the ts and day_part columns don't 
> have the same timestamps. Below is the output I would expect:
>  
> {code:java}
> +----+-----+-------------------+------------------------------+
> |user|id   |ts                 |day_part                      |
> +----+-----+-------------------+------------------------------+
> |usr1|17.0 |2018-02-10 15:27:18|Afternoon: 2018-02-10 15:27:18|
> |usr1|13.0 |2018-02-11 12:27:18|Afternoon: 2018-02-11 12:27:18|
> |usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 11:27:18  |
> |usr1|20.0 |2018-02-13 15:27:18|Afternoon: 2018-02-13 15:27:18|
> |usr1|17.0 |2018-02-14 12:27:18|Afternoon: 2018-02-14 12:27:18|
> |usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 11:27:18  |
> |usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 11:27:18  |
> +----+-----+-------------------+------------------------------+{code}
>  
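A side note for anyone hitting this before it is resolved: the day-part bucketing itself can be expressed with built-in functions instead of a Python UDF, which keeps the comparison in the SQL engine (and the configured session time zone) rather than in the Python worker. The sketch below reuses the df and ts column from the reproduction above; it is only a workaround sketch, not a fix for the underlying conversion behavior.

{code:python}
from pyspark.sql import functions as F

hour = F.hour(F.col("ts"))  # evaluated in spark.sql.session.timeZone
day_part = (
    F.when((hour >= 5) & (hour < 12),
           F.concat(F.lit("Morning: "), F.col("ts").cast("string")))
     .when((hour >= 12) & (hour < 17),
           F.concat(F.lit("Afternoon: "), F.col("ts").cast("string")))
     .when((hour >= 17) & (hour < 21), F.lit("Evening"))
     .otherwise(F.lit("Night")))
df.withColumn("day_part", day_part).show(truncate=False)
{code}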



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-06-23 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-33863:
--
Affects Version/s: 3.1.2

> Pyspark UDF wrongly changes timestamps to UTC
> -
>
> Key: SPARK-33863
> URL: https://issues.apache.org/jira/browse/SPARK-33863
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.0.2, 3.1.0, 3.1.1, 3.1.2
> Environment: MAC/Linux
> Standalone cluster / local machine
>Reporter: Nasir Ali
>Priority: Major
>
> *Problem*:
> I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
> column using a udf, the pyspark udf wrongly changes the timestamps into UTC 
> time. The ts (timestamp) column is already in UTC time; therefore, the pyspark 
> udf should not convert the ts (timestamp) column into a UTC timestamp. 
> I have used the following configs to let spark know the timestamps are in UTC:
>  
> {code:java}
> --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
> --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
> --conf spark.sql.session.timeZone=UTC
> {code}
> Below is a code snippet to reproduce the error:
>  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql.types import StringType, TimestampType
> import datetime
> spark = SparkSession.builder.config("spark.sql.session.timeZone", 
> "UTC").getOrCreate()
> df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-02-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-02-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-02-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-02-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-02-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-02-22T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.show(truncate=False)
> def some_time_udf(i):
>     if datetime.time(5, 0) <= i.time() < datetime.time(12, 0):
>         tmp = "Morning: " + str(i)
>     elif datetime.time(12, 0) <= i.time() < datetime.time(17, 0):
>         tmp = "Afternoon: " + str(i)
>     elif datetime.time(17, 0) <= i.time() < datetime.time(21, 0):
>         tmp = "Evening"
>     else:  # 21:00 - 05:00, wrapping past midnight
>         tmp = "Night"
>     return tmp
> udf = F.udf(some_time_udf, StringType())
> df.withColumn("day_part", udf(df.ts)).show(truncate=False)
> {code}
>  
> Below is the output of the above code:
> {code:java}
> +----+-----+-------------------+----------------------------+
> |user|id   |ts                 |day_part                    |
> +----+-----+-------------------+----------------------------+
> |usr1|17.0 |2018-02-10 15:27:18|Morning: 2018-02-10 09:27:18|
> |usr1|13.0 |2018-02-11 12:27:18|Morning: 2018-02-11 06:27:18|
> |usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 05:27:18|
> |usr1|20.0 |2018-02-13 15:27:18|Morning: 2018-02-13 09:27:18|
> |usr1|17.0 |2018-02-14 12:27:18|Morning: 2018-02-14 06:27:18|
> |usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 05:27:18|
> |usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 05:27:18|
> +----+-----+-------------------+----------------------------+
> {code}
> The above output is incorrect. You can see the ts and day_part columns don't 
> have the same timestamps. Below is the output I would expect:
>  
> {code:java}
> +----+-----+-------------------+------------------------------+
> |user|id   |ts                 |day_part                      |
> +----+-----+-------------------+------------------------------+
> |usr1|17.0 |2018-02-10 15:27:18|Afternoon: 2018-02-10 15:27:18|
> |usr1|13.0 |2018-02-11 12:27:18|Afternoon: 2018-02-11 12:27:18|
> |usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 11:27:18  |
> |usr1|20.0 |2018-02-13 15:27:18|Afternoon: 2018-02-13 15:27:18|
> |usr1|17.0 |2018-02-14 12:27:18|Afternoon: 2018-02-14 12:27:18|
> |usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 11:27:18  |
> |usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 11:27:18  |
> +----+-----+-------------------+------------------------------+{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35476) Enable disallow_untyped_defs mypy check for pyspark.pandas.series.

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368546#comment-17368546
 ] 

Apache Spark commented on SPARK-35476:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33045

> Enable disallow_untyped_defs mypy check for pyspark.pandas.series.
> --
>
> Key: SPARK-35476
> URL: https://issues.apache.org/jira/browse/SPARK-35476
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35476) Enable disallow_untyped_defs mypy check for pyspark.pandas.series.

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35476:


Assignee: Apache Spark

> Enable disallow_untyped_defs mypy check for pyspark.pandas.series.
> --
>
> Key: SPARK-35476
> URL: https://issues.apache.org/jira/browse/SPARK-35476
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35476) Enable disallow_untyped_defs mypy check for pyspark.pandas.series.

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368545#comment-17368545
 ] 

Apache Spark commented on SPARK-35476:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33045

> Enable disallow_untyped_defs mypy check for pyspark.pandas.series.
> --
>
> Key: SPARK-35476
> URL: https://issues.apache.org/jira/browse/SPARK-35476
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35476) Enable disallow_untyped_defs mypy check for pyspark.pandas.series.

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35476:


Assignee: (was: Apache Spark)

> Enable disallow_untyped_defs mypy check for pyspark.pandas.series.
> --
>
> Key: SPARK-35476
> URL: https://issues.apache.org/jira/browse/SPARK-35476
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35588) Merge Binder integration and quickstart notebook

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35588.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33041
[https://github.com/apache/spark/pull/33041]

> Merge Binder integration and quickstart notebook
> 
>
> Key: SPARK-35588
> URL: https://issues.apache.org/jira/browse/SPARK-35588
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> We should merge:
> https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart.ipynb
> https://github.com/databricks/koalas/blob/master/docs/source/getting_started/10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35588) Merge Binder integration and quickstart notebook

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35588:


Assignee: Hyukjin Kwon

> Merge Binder integration and quickstart notebook
> 
>
> Key: SPARK-35588
> URL: https://issues.apache.org/jira/browse/SPARK-35588
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We should merge:
> https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart.ipynb
> https://github.com/databricks/koalas/blob/master/docs/source/getting_started/10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35868) Add fs.s3a.downgrade.syncable.exceptions if not set

2021-06-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368541#comment-17368541
 ] 

Dongjoon Hyun commented on SPARK-35868:
---

I marked this as `Blocker` because this causes `SparkContext` initialization 
failures.

> Add fs.s3a.downgrade.syncable.exceptions if not set
> ---
>
> Key: SPARK-35868
> URL: https://issues.apache.org/jira/browse/SPARK-35868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35868) Add fs.s3a.downgrade.syncable.exceptions if not set

2021-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35868:
--
Target Version/s: 3.2.0

> Add fs.s3a.downgrade.syncable.exceptions if not set
> ---
>
> Key: SPARK-35868
> URL: https://issues.apache.org/jira/browse/SPARK-35868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35868) Add fs.s3a.downgrade.syncable.exceptions if not set

2021-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35868:
--
Priority: Blocker  (was: Major)

> Add fs.s3a.downgrade.syncable.exceptions if not set
> ---
>
> Key: SPARK-35868
> URL: https://issues.apache.org/jira/browse/SPARK-35868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35868) Add fs.s3a.downgrade.syncable.exceptions if not set

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35868:


Assignee: (was: Apache Spark)

> Add fs.s3a.downgrade.syncable.exceptions if not set
> ---
>
> Key: SPARK-35868
> URL: https://issues.apache.org/jira/browse/SPARK-35868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35868) Add fs.s3a.downgrade.syncable.exceptions if not set

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35868:


Assignee: Apache Spark

> Add fs.s3a.downgrade.syncable.exceptions if not set
> ---
>
> Key: SPARK-35868
> URL: https://issues.apache.org/jira/browse/SPARK-35868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35868) Add fs.s3a.downgrade.syncable.exceptions if not set

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368537#comment-17368537
 ] 

Apache Spark commented on SPARK-35868:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33044

> Add fs.s3a.downgrade.syncable.exceptions if not set
> ---
>
> Key: SPARK-35868
> URL: https://issues.apache.org/jira/browse/SPARK-35868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35868) Add fs.s3a.downgrade.syncable.exceptions if not set

2021-06-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-35868:
-

 Summary: Add fs.s3a.downgrade.syncable.exceptions if not set
 Key: SPARK-35868
 URL: https://issues.apache.org/jira/browse/SPARK-35868
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35851) Wrong docs in function GraphGenerators.sampleLogNormal

2021-06-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-35851:
-
Description: In the GraphGenerators docs, the function sampleLogNormal uses the 
wrong variables to compute X; m and s should not be used in the formula.  (was: 
In the GraphGenerators docs, the function sampleLogNormal uses the wrong variable to 
compute X; the expression 'X = math.exp(mu + sigma*Z)' should be 'X = math.exp(m + 
s*Z)'.)

> Wrong docs in function GraphGenerators.sampleLogNormal
> --
>
> Key: SPARK-35851
> URL: https://issues.apache.org/jira/browse/SPARK-35851
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, GraphX
>Affects Versions: 3.1.2
>Reporter: zengrui
>Assignee: zengrui
>Priority: Minor
> Fix For: 3.2.0
>
>
> In the GraphGenerators docs, the function sampleLogNormal uses the wrong 
> variables to compute X; m and s should not be used in the formula.
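For reference, the standard log-normal sampling relations (not quoted from the Spark docs, just the usual definitions) are:

{code}
Z \sim \mathcal{N}(0, 1), \qquad X = e^{\mu + \sigma Z}
m = \mathbb{E}[X] = e^{\mu + \sigma^{2}/2}, \qquad
s = \sqrt{\left(e^{\sigma^{2}} - 1\right) e^{2\mu + \sigma^{2}}}
{code}

so m (the mean) and s (the standard deviation) describe the resulting distribution but do not appear in the sampling expression itself, which is what the corrected description says.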



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35851) Wrong docs in function GraphGenerators.sampleLogNormal

2021-06-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35851.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33010
[https://github.com/apache/spark/pull/33010]

> Wrong docs in function GraphGenerators.sampleLogNormal
> --
>
> Key: SPARK-35851
> URL: https://issues.apache.org/jira/browse/SPARK-35851
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, GraphX
>Affects Versions: 3.1.2
>Reporter: zengrui
>Assignee: zengrui
>Priority: Minor
> Fix For: 3.2.0
>
>
> In the GraphGenerators docs, the function sampleLogNormal uses the wrong 
> variable to compute X; the expression 'X = math.exp(mu + sigma*Z)' should be 
> 'X = math.exp(m + s*Z)'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35851) Wrong docs in function GraphGenerators.sampleLogNormal

2021-06-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35851:


Assignee: zengrui

> Wrong docs in function GraphGenerators.sampleLogNormal
> --
>
> Key: SPARK-35851
> URL: https://issues.apache.org/jira/browse/SPARK-35851
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, GraphX
>Affects Versions: 3.1.2
>Reporter: zengrui
>Assignee: zengrui
>Priority: Minor
>
> In the GraphGenerators docs, the function sampleLogNormal uses the wrong 
> variable to compute X; the expression 'X = math.exp(mu + sigma*Z)' should be 
> 'X = math.exp(m + s*Z)'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19256) Hive bucketing write support

2021-06-23 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-19256:
-
Affects Version/s: 3.2.0

> Hive bucketing write support
> 
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: Tejas Patil
>Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed table, to be compatible with 
> other compute engines (Hive and Presto).
>  
> Current status for Hive bucketed tables in Spark:
> No support for reading Hive bucketed tables: a bucketed table is read as a 
> non-bucketed table.
> Wrong behavior when writing Hive ORC and Parquet bucketed tables: an 
> ORC/Parquet bucketed table is written as a non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Writing Hive non-ORC/Parquet bucketed tables is not allowed: an exception is 
> thrown by default when writing such a table (code path: 
> InsertIntoHiveTable), and the exception can be disabled by setting the configs 
> `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will 
> write the table as non-bucketed.
>  
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
> Hive on Tez: support zero and multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-14014). And more code pointer on 
> read path - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212]
>  .
>  
> Current status for Hive bucketed table in Presto (take presto-sql here):
> Support writing bucketed table with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Support zero and multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TLDR is to achieve Hive bucketed table compatibility across Spark, Presto and 
> Hive. Here with this JIRA, we need to add support for writing Hive bucketed tables 
> with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 
> 2.x.y).
>  
> To let Spark read Hive bucketed tables efficiently, a more radical change is 
> needed, so we decided to wait until data source v2 supports bucketing and do 
> the read path on data source v2. The read path will not be covered by this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing write support

2021-06-23 Thread Pushkar Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368523#comment-17368523
 ] 

Pushkar Kumar commented on SPARK-19256:
---

Thank you [~chengsu]!!

> Hive bucketing write support
> 
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed table, to be compatible with 
> other compute engines (Hive and Presto).
>  
> Current status for Hive bucketed tables in Spark:
> No support for reading Hive bucketed tables: a bucketed table is read as a 
> non-bucketed table.
> Wrong behavior when writing Hive ORC and Parquet bucketed tables: an 
> ORC/Parquet bucketed table is written as a non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Writing Hive non-ORC/Parquet bucketed tables is not allowed: an exception is 
> thrown by default when writing such a table (code path: 
> InsertIntoHiveTable), and the exception can be disabled by setting the configs 
> `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will 
> write the table as non-bucketed.
>  
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
> Hive on Tez: support zero and multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-14014). And more code pointer on 
> read path - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212]
>  .
>  
> Current status for Hive bucketed table in Presto (take presto-sql here):
> Support writing bucketed table with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Support zero and multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TLDR is to achieve Hive bucketed table compatibility across Spark, Presto and 
> Hive. Here with this JIRA, we need to add support for writing Hive bucketed tables 
> with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 
> 2.x.y).
>  
> To let Spark read Hive bucketed tables efficiently, a more radical change is 
> needed, so we decided to wait until data source v2 supports bucketing and do 
> the read path on data source v2. The read path will not be covered by this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing write support

2021-06-23 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368522#comment-17368522
 ] 

Cheng Su commented on SPARK-19256:
--

[~pushkarcse] - we are currently working on 
https://issues.apache.org/jira/browse/SPARK-33298 . After that, I will resume 
the discussion of https://issues.apache.org/jira/browse/SPARK-32709 .

> Hive bucketing write support
> 
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed table, to be compatible with 
> other compute engines (Hive and Presto).
>  
> Current status for Hive bucketed tables in Spark:
> No support for reading Hive bucketed tables: a bucketed table is read as a 
> non-bucketed table.
> Wrong behavior when writing Hive ORC and Parquet bucketed tables: an 
> ORC/Parquet bucketed table is written as a non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Writing Hive non-ORC/Parquet bucketed tables is not allowed: an exception is 
> thrown by default when writing such a table (code path: 
> InsertIntoHiveTable), and the exception can be disabled by setting the configs 
> `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will 
> write the table as non-bucketed.
>  
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
> Hive on Tez: support zero and multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-14014). And more code pointer on 
> read path - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212]
>  .
>  
> Current status for Hive bucketed table in Presto (take presto-sql here):
> Support writing bucketed table with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Support zero and multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TLDR is to achieve Hive bucketed table compatibility across Spark, Presto and 
> Hive. Here with this JIRA, we need to add support for writing Hive bucketed tables 
> with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 
> 2.x.y).
>  
> To let Spark read Hive bucketed tables efficiently, a more radical change is 
> needed, so we decided to wait until data source v2 supports bucketing and do 
> the read path on data source v2. The read path will not be covered by this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35623) Volcano resource manager for Spark on Kubernetes

2021-06-23 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-35623:
-
Shepherd: Holden Karau

> Volcano resource manager for Spark on Kubernetes
> 
>
> Key: SPARK-35623
> URL: https://issues.apache.org/jira/browse/SPARK-35623
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 3.1.1, 3.1.2
>Reporter: Dipanjan Kailthya
>Priority: Minor
>  Labels: kubernetes, resourcemanager
>
> Dear Spark Developers, 
>   
>  Hello from the Netherlands! Posting this here as I still haven't gotten 
> accepted to post in the spark dev mailing list.
>   
>  My team is planning to use spark with Kubernetes support on our shared 
> (multi-tenant) on-premises Kubernetes cluster. However, we would like to have 
> certain scheduling features like fair-share and preemption which as we 
> understand are not built into the current spark-kubernetes resource manager 
> yet. We have been working on and are close to a first successful prototype 
> integration with Volcano ([https://volcano.sh/en/docs/]). Briefly this means 
> a new resource manager component with lots in common with existing 
> spark-kubernetes resource manager, but instead of pods it launches Volcano 
> jobs which delegate the driver and executor pod creation and lifecycle 
> management to Volcano. We are interested in contributing this to open source, 
> either directly in spark or as a separate project.
>   
>  So, two questions: 
>   
>  1. Do the spark maintainers see this as a valuable contribution to the 
> mainline spark codebase? If so, can we have some guidance on how to publish 
> the changes? 
>   
>  2. Are any other developers / organizations interested to contribute to this 
> effort? If so, please get in touch.
>   
>  Best,
>  Dipanjan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35623) Volcano resource manager for Spark on Kubernetes

2021-06-23 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368486#comment-17368486
 ] 

Holden Karau commented on SPARK-35623:
--

I'm also interested in this. I sent a message to the dev list back on Jun 17th 
about this (or, more generally, adding support for batch schedulers).

 

I know some groups inside of Spark have had a working group format where they 
sync periodically and write reports back to the mailing list. Since it seems 
like there are a few folks interested, maybe we could try that?

> Volcano resource manager for Spark on Kubernetes
> 
>
> Key: SPARK-35623
> URL: https://issues.apache.org/jira/browse/SPARK-35623
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 3.1.1, 3.1.2
>Reporter: Dipanjan Kailthya
>Priority: Minor
>  Labels: kubernetes, resourcemanager
>
> Dear Spark Developers, 
>   
>  Hello from the Netherlands! Posting this here as I still haven't gotten 
> accepted to post in the spark dev mailing list.
>   
>  My team is planning to use spark with Kubernetes support on our shared 
> (multi-tenant) on-premises Kubernetes cluster. However, we would like to have 
> certain scheduling features like fair-share and preemption which as we 
> understand are not built into the current spark-kubernetes resource manager 
> yet. We have been working on and are close to a first successful prototype 
> integration with Volcano ([https://volcano.sh/en/docs/]). Briefly this means 
> a new resource manager component with lots in common with existing 
> spark-kubernetes resource manager, but instead of pods it launches Volcano 
> jobs which delegate the driver and executor pod creation and lifecycle 
> management to Volcano. We are interested in contributing this to open source, 
> either directly in spark or as a separate project.
>   
>  So, two questions: 
>   
>  1. Do the spark maintainers see this as a valuable contribution to the 
> mainline spark codebase? If so, can we have some guidance on how to publish 
> the changes? 
>   
>  2. Are any other developers / organizations interested to contribute to this 
> effort? If so, please get in touch.
>   
>  Best,
>  Dipanjan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35476) Enable disallow_untyped_defs mypy check for pyspark.pandas.series.

2021-06-23 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368476#comment-17368476
 ] 

Takuya Ueshin commented on SPARK-35476:
---

I'm working on this.

> Enable disallow_untyped_defs mypy check for pyspark.pandas.series.
> --
>
> Key: SPARK-35476
> URL: https://issues.apache.org/jira/browse/SPARK-35476
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-06-23 Thread Chao Sun (Jira)
Chao Sun created SPARK-35867:


 Summary: Enable vectorized read for 
VectorizedPlainValuesReader.readBooleans
 Key: SPARK-35867
 URL: https://issues.apache.org/jira/browse/SPARK-35867
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently we decode PLAIN-encoded booleans as follows:
{code:java}
  public final void readBooleans(int total, WritableColumnVector c, int rowId) {
// TODO: properly vectorize this
for (int i = 0; i < total; i++) {
  c.putBoolean(rowId + i, readBoolean());
}
  }
{code}

Ideally we should vectorize this.
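For illustration only (the real change would live in the Java reader quoted above): PLAIN-encoded booleans are bit-packed, one bit per value with the least-significant bit first, so a bulk decoder can expand a whole byte into 8 values at once instead of extracting a single bit per call. A Python/numpy sketch of that idea, with made-up input bytes:

{code:python}
import numpy as np

# Hypothetical packed input: 16 PLAIN-encoded booleans (2 bytes, LSB first).
packed = np.frombuffer(bytes([0b10110101, 0b00000011]), dtype=np.uint8)

# Expand each byte into 8 booleans in one shot instead of bit-by-bit.
decoded = np.unpackbits(packed, bitorder="little").astype(bool)

total = 10  # only the first `total` values were actually requested
print(decoded[:total].tolist())
# [True, False, True, False, True, True, False, True, True, True]
{code}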



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35866) Improve error message quality

2021-06-23 Thread Karen Feng (Jira)
Karen Feng created SPARK-35866:
--

 Summary: Improve error message quality
 Key: SPARK-35866
 URL: https://issues.apache.org/jira/browse/SPARK-35866
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.2.0
Reporter: Karen Feng


In the SPIP: Standardize Exception Messages in Spark, there are three major 
improvements proposed:
 # Group error messages in dedicated files: 
[SPARK-33539|https://issues.apache.org/jira/browse/SPARK-33539]
 # Establish an error message guideline for developers 
[SPARK-35140|https://issues.apache.org/jira/browse/SPARK-35140]
 # Improve error message quality

Based on the guideline, we can start improving the error messages in the 
dedicated files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35731) Check all day-time interval types in arrow

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35731.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33039
[https://github.com/apache/spark/pull/33039]

> Check all day-time interval types in arrow
> --
>
> Key: SPARK-35731
> URL: https://issues.apache.org/jira/browse/SPARK-35731
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Add tests to check that all day-time interval types are supported in 
> (de-)serialization from/to Arrow format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35731) Check all day-time interval types in arrow

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35731:


Assignee: angerszhu

> Check all day-time interval types in arrow
> --
>
> Key: SPARK-35731
> URL: https://issues.apache.org/jira/browse/SPARK-35731
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
>
> Add tests to check that all day-time interval types are supported in 
> (de-)serialization from/to Arrow format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35729) Check all day-time interval types in aggregate expressions

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35729.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33042
[https://github.com/apache/spark/pull/33042]

> Check all day-time interval types in aggregate expressions
> --
>
> Key: SPARK-35729
> URL: https://issues.apache.org/jira/browse/SPARK-35729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> Check all supported combinations of DayTimeIntervalType fields in the 
> aggregate expressions: sum and avg.
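A check along these lines could look like the sketch below (the column name and literals are made up, and it assumes an active SparkSession named spark on Spark 3.2+, where sum and avg accept the ANSI day-time interval types this ticket is about):

{code:python}
df = spark.sql("""
    SELECT d FROM VALUES
      (INTERVAL '0 01:30:00' DAY TO SECOND),
      (INTERVAL '1 00:00:00' DAY TO SECOND) AS t(d)
""")
df.selectExpr("sum(d) AS total", "avg(d) AS mean").show(truncate=False)
{code}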



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35729) Check all day-time interval types in aggregate expressions

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35729:


Assignee: Kousuke Saruta

> Check all day-time interval types in aggregate expressions
> --
>
> Key: SPARK-35729
> URL: https://issues.apache.org/jira/browse/SPARK-35729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Kousuke Saruta
>Priority: Major
>
> Check all supported combinations of DayTimeIntervalType fields in the 
> aggregate expressions: sum and avg.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34889) Introduce MergingSessionsIterator merging elements directly which belong to the same session

2021-06-23 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34889.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31987
[https://github.com/apache/spark/pull/31987]

> Introduce MergingSessionsIterator merging elements directly which belong to 
> the same session
> 
>
> Key: SPARK-34889
> URL: https://issues.apache.org/jira/browse/SPARK-34889
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.2.0
>
>
> This issue tracks the effort of introducing MergingSessionsIterator, which 
> enables merging elements that belong to the same session directly. This would be 
> quite performant compared to UpdatingSessionIterator. Note that 
> MergingSessionsIterator can only be applied to cases where the aggregation can be 
> applied altogether, so there is still room for UpdatingSessionIterator to be 
> used.
> This issue also introduces MergingSessionsExec, which is the physical node 
> leveraging MergingSessionsIterator to sort the input rows and aggregate rows 
> according to the session windows.
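As a rough illustration of the "merge directly" idea (this is only a sketch, not the actual iterator: it assumes the rows arrive sorted by session start and that each row already carries its session window with the gap applied):

{code:python}
def merge_sessions(rows):
    """rows: iterable of (start, end, value) tuples sorted by start."""
    cur_start = cur_end = None
    acc = 0
    for start, end, value in rows:
        if cur_end is not None and start <= cur_end:
            # Overlaps the current session: merge and aggregate directly.
            cur_end = max(cur_end, end)
            acc += value
        else:
            # A new session begins: emit the finished one first.
            if cur_end is not None:
                yield (cur_start, cur_end, acc)
            cur_start, cur_end, acc = start, end, value
    if cur_end is not None:
        yield (cur_start, cur_end, acc)

print(list(merge_sessions([(0, 10, 1), (5, 12, 2), (20, 25, 3)])))
# [(0, 12, 3), (20, 25, 3)]
{code}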



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34889) Introduce MergingSessionsIterator merging elements directly which belong to the same session

2021-06-23 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-34889:
---

Assignee: Jungtaek Lim

> Introduce MergingSessionsIterator merging elements directly which belong to 
> the same session
> 
>
> Key: SPARK-34889
> URL: https://issues.apache.org/jira/browse/SPARK-34889
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> This issue tracks effort on introducing MergingSessionsIterator, which 
> enables to merge elements belong to the same session directly. This would be 
> quite performant compared to UpdatingSessionIterator. Note that 
> MergingSessionsIterator can only apply to the cases aggregation can be 
> applied altogether, so there're still rooms for UpdatingSessionIterator to be 
> used.
> This issue also introduces MergingSessionsExec which is the physical node on 
> leveraging MergingSessionsIterator to sort the input rows and aggregate rows 
> according to the session windows.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35695) QueryExecutionListener does not see any observed metrics fired before persist/cache

2021-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35695:
--
Fix Version/s: (was: 3.0.4)
   3.0.3

> QueryExecutionListener does not see any observed metrics fired before 
> persist/cache
> ---
>
> Key: SPARK-35695
> URL: https://issues.apache.org/jira/browse/SPARK-35695
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.0.3, 3.2.0, 3.1.3
>
>
> This example properly fires the event
> {code}
> spark.range(100)
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .collect()
> {code}
> But when I add persist, then no event is fired or seen (not sure which):
> {code}
> spark.range(100)
>   .observe(
> name = "my_event",
> avg($"id").cast("int").as("avg_val"))
>   .persist()
>   .collect()
> {code}
> The listener:
> {code}
> val metricMaps = ArrayBuffer.empty[Map[String, Row]]
> val listener = new QueryExecutionListener {
>   override def onSuccess(funcName: String, qe: QueryExecution, duration: 
> Long): Unit = {
> metricMaps += qe.observedMetrics
>   }
>   override def onFailure(funcName: String, qe: QueryExecution, exception: 
> Exception): Unit = {
> // No-op
>   }
> }
> spark.listenerManager.register(listener)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35695) QueryExecutionListener does not see any observed metrics fired before persist/cache

2021-06-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35695:
--
Fix Version/s: (was: 3.0.3)
   3.0.4

> QueryExecutionListener does not see any observed metrics fired before 
> persist/cache
> ---
>
> Key: SPARK-35695
> URL: https://issues.apache.org/jira/browse/SPARK-35695
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> This example properly fires the event
> {code}
> spark.range(100)
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .collect()
> {code}
> But when I add persist, then no event is fired or seen (not sure which):
> {code}
> spark.range(100)
>   .observe(
> name = "my_event",
> avg($"id").cast("int").as("avg_val"))
>   .persist()
>   .collect()
> {code}
> The listener:
> {code}
> val metricMaps = ArrayBuffer.empty[Map[String, Row]]
> val listener = new QueryExecutionListener {
>   override def onSuccess(funcName: String, qe: QueryExecution, duration: 
> Long): Unit = {
> metricMaps += qe.observedMetrics
>   }
>   override def onFailure(funcName: String, qe: QueryExecution, exception: 
> Exception): Unit = {
> // No-op
>   }
> }
> spark.listenerManager.register(listener)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Description: 
SPARK-24355 introduces syncMode to mitigate the issue of sasl timeout by 
throttling the max number of threads for sending responses of chunk fetch 
requests. But it causes severe performance degradation because the throughput 
of handling chunk fetch requests is reduced. SPARK-30623 makes the async and 
sync mode configurable and makes the async mode the default. 

SPARK-30512 uses a dedicated boss event loop to mitigate the sasl timeout issue 
and we rarely see sasl timeout issues with async mode in our production 
clusters today. 

A few days ago we accidentally turned on sync mode on one cluster and we observed 
severe shuffle performance degradation. As a result, we benchmarked the 
performance comparison between async and sync mode and *we suggest removing 
sync mode in the code base* as it seems not to provide any benefits today. We 
would like to share the benchmark result and hear the opinion from the 
community.

 

benchmark on job's run time (sync mode is 2x - 3x slower):
 YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
memory, each node manager has 1GB heap size.

shuffle stages: 5GB shuffle data (400M key-value records), 1000 map tasks and 
1000 reduce tasks.

results: shuffle read 5GB data, async mode takes 2-3 mins and sync mode takes 6 
mins.

 

benchmark on metrics of external shuffle service:
 YARN cluster setup: 4 nodes in total. I set 2 nodes as async mode and 2 nodes 
as sync mode, shuffling 2.5 GB data.

results: in openblockreuqestslatencymillis_ratemean and some other metrics, the 
nodes in sync mode are 3x - 4x higher than nodes in async mode. I attached some 
screenshots of the metrics.

  was:
SPARK-24355 introduces syncMode to mitigate the issue of sasl timeout by 
throttling the max number of threads for sending responses of chunk fetch 
requests. But it causes severe performance degradation because the throughput 
of handling chunk fetch requests is reduced. SPARK-30623 makes the async and 
sync mode configurable and makes the async mode the default. 

SPARK-30512 uses a dedicated boss event loop to mitigate the sasl timeout issue 
and we rarely see sasl timeout issues with async mode in our production 
clusters today. 

A few days ago we accidentally turned on sync mode on one cluster and we observed 
severe shuffle performance degradation. As a result, we benchmarked the 
performance comparison between async and sync mode and *we suggest removing 
sync mode in the code base* as it seems not to provide any benefits today. We 
would like to share the benchmark result and hear the opinion from the 
community.

 

benchmark on job's run time (sync mode is 2x - 3x slower):
 YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
memory, each node manager has 1GB heap size.

shuffle stages: 5GB shuffle data (400M key-value records), 1000 map tasks and 
1000 reduce tasks.

results: shuffle read 5GB data, async mode takes 2-3 mins and sync mode takes 6 
mins.

 

benchmark on metrics of external shuffle service:
 YARN cluster setup: 4 nodes in total. I set 2 nodes as async mode and 2 nodes 
as sync mode, shuffling 2.5 GB data.

results: in openblockreuqestslatencymillis_ratemean and some other metrics, the 
nodes in sync mode are 3x - 4x higher than nodes in async mode. I attached some 
screenshots of the metrics.

!openblock.png!

!openblock-compare.png!  


> Remove await (syncMode) in ChunkFetchRequestHandler
> ---
>
> Key: SPARK-35865
> URL: https://issues.apache.org/jira/browse/SPARK-35865
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: openblock-compare.png, openblock.png
>
>
> SPARK-24355 introduces syncMode to mitigate the issue of sasl timeout by 
> throttling the max number of threads for sending responses of chunk fetch 
> requests. But it causes severe performance degradation because the throughput 
> of handling chunk fetch requests is reduced. SPARK-30623 makes the async and 
> sync mode configurable and makes the async mode the default. 
> SPARK-30512 uses a dedicated boss event loop to mitigate the sasl timeout 
> issue and we rarely see sasl timeout issues with async mode in our production 
> clusters today. 
> A few days ago we accidentally turned on sync mode on one cluster and we 
> observed severe shuffle performance degradation. As a result, we benchmarked 
> the performance comparison between async and sync mode and *we suggest 
> removing sync mode in the code base* as it seems not to provide any benefits 
> today. We would like to share the benchmark result and 

[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Description: 
SPARK-24355 introduces syncMode to mitigate the issue of sasl timeout by 
throttling the max number of threads for sending responses of chunk fetch 
requests. But it causes severe performance degradation because the throughput 
of handling chunk fetch requests is reduced. SPARK-30623 makes the async and 
sync mode configurable and makes the async mode the default. 

SPARK-30512 uses a dedicated boss event loop to mitigate the sasl timeout issue 
and we rarely see sasl timeout issues with async mode in our production 
clusters today. 

A few days ago we accidentally turned on sync mode on one cluster and we observed 
severe shuffle performance degradation. As a result, we benchmarked the 
performance comparison between async and sync mode and *we suggest removing 
sync mode in the code base* as it seems not to provide any benefits today. We 
would like to share the benchmark result and hear the opinion from the 
community.

 

benchmark on job's run time (sync mode is 2x - 3x slower):
 YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
memory, each node manager has 1GB heap size.

shuffle stages: 5GB shuffle data (400M key-value records), 1000 map tasks and 
1000 reduce tasks.

results: shuffle read 5GB data, async mode takes 2-3 mins and sync mode takes 6 
mins.

 

benchmark on metrics of external shuffle service:
 YARN cluster setup: 4 nodes in total. I set 2 nodes as async mode and 2 nodes 
as sync mode, shuffling 2.5 GB data.

results: in openblockreuqestslatencymillis_ratemean and some other metrics, the 
nodes in sync mode are 3x - 4x higher than nodes in async mode. I attached some 
screenshots of the metrics.

!openblock.png!

!openblock-compare.png!  

  was:
SPARK-24355 introduces syncMode to mitigate the issue of sasl timeout by 
throttling the max number of threads for sending responses of chunk fetch 
requests. But it causes severe performance degradation because the throughput 
of handling chunk fetch requests is reduced. SPARK-30623 makes the async and 
sync mode configurable and makes the async mode the default. 

SPARK-30512 uses a dedicated boss event loop to mitigate the sasl timeout issue 
and we rarely see sasl timeout issues with async mode in our production 
clusters today. 

A few days ago we accidentally turned on sync mode on one cluster and we observed 
severe shuffle performance degradation. As a result, we benchmarked the 
performance comparison between async and sync mode and *we suggest removing 
sync mode in the code base* as it seems not to provide any benefits today. We 
would like to share the benchmark result and hear the opinion from the 
community.

 

benchmark on job's run time (sync mode is 2x - 3x slower):
YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
memory, each node manager has 1GB heap size.

shuffle stages: 5GB shuffle data (400M key-value records), 1000 map tasks and 
1000 reduce tasks.

results: shuffle read 5GB data, async mode takes 2-3 mins and sync mode takes 6 
mins.

 

benchmark on metrics of external shuffle service:
YARN cluster setup: 4 nodes in total. I set 2 nodes as async mode and 2 nodes 
as sync mode, shuffling 2.5 GB data.

results: in openblockreuqestslatencymillis_ratemean and some other metrics, the 
nodes in sync mode are 3x - 4x higher than nodes in async mode. I attached some 
screenshots of the metrics.

 


> Remove await (syncMode) in ChunkFetchRequestHandler
> ---
>
> Key: SPARK-35865
> URL: https://issues.apache.org/jira/browse/SPARK-35865
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: openblock-compare.png, openblock.png
>
>
> SPARK-24355 introduces syncMode to mitigate the issue of sasl timeout by 
> throttling the max number of threads for sending responses of chunk fetch 
> requests. But it causes severe performance degradation because the throughput 
> of handling chunk fetch requests is reduced. SPARK-30623 makes the async and 
> sync mode configurable and makes the async mode the default. 
> SPARK-30512 uses a dedicated boss event loop to mitigate the sasl timeout 
> issue and we rarely see sasl timeout issues with async mode in our production 
> clusters today. 
> A few days ago we accidentally turned on sync mode on one cluster and we 
> observed severe shuffle performance degradation. As a result, we benchmarked 
> the performance comparison between async and sync mode and *we suggest 
> removing sync mode in the code base* as it seems not to provide any benefits 
> today. We would like to share the benchmark result 

[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Attachment: openblock-compare.png

> Remove await (syncMode) in ChunkFetchRequestHandler
> ---
>
> Key: SPARK-35865
> URL: https://issues.apache.org/jira/browse/SPARK-35865
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: openblock-compare.png, openblock.png
>
>
> SPARK-24355 introduces syncMode to mitigate the issue of sasl timeout by 
> throttling the max number of threads for sending responses of chunk fetch 
> requests. But it causes severe performance degradation because the throughput 
> of handling chunk fetch requests is reduced. SPARK-30623 makes the async and 
> sync mode configurable and makes the async mode the default. 
> SPARK-30512 uses a dedicated boss event loop to mitigate the sasl timeout 
> issue and we rarely see sasl timeout issues with async mode in our production 
> clusters today. 
> A few days ago we accidentally turned on sync mode on one cluster and we 
> observed severe shuffle performance degradation. As a result, we benchmarked 
> the performance comparison between async and sync mode and *we suggest 
> removing sync mode in the code base* as it seems not to provide any benefits 
> today. We would like to share the benchmark result and hear the opinion from 
> the community.
>  
> benchmark on job's run time (sync mode is 2x - 3x slower):
> YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
> memory, each node manager has 1GB heap size.
> shuffle stages: 5GB shuffle data (400M key-value records), 1000 map tasks and 
> 1000 reduce tasks.
> results: shuffle read 5GB data, async mode takes 2-3 mins and sync mode takes 
> 6 mins.
>  
> benchmark on metrics of external shuffle service:
> YARN cluster setup: 4 nodes in total. I set 2 nodes as async mode and 2 nodes 
> as sync mode, shuffling 2.5 GB data.
> results: in openblockreuqestslatencymillis_ratemean and some other metrics, 
> the nodes in sync mode are 3x - 4x higher than nodes in async mode. I 
> attached some screenshots of the metrics.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-35865:
---

 Summary: Remove await (syncMode) in ChunkFetchRequestHandler
 Key: SPARK-35865
 URL: https://issues.apache.org/jira/browse/SPARK-35865
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.1.2, 2.4.8
Reporter: Baohe Zhang
 Attachments: openblock-compare.png, openblock.png

SPARK-24355 introduces syncMode to mitigate the issue of sasl timeout by 
throttling the max number of threads for sending responses of chunk fetch 
requests. But it causes severe performance degradation because the throughput 
of handling chunk fetch requests is reduced. SPARK-30623 makes the async and 
sync mode configurable and makes the async mode the default. 

SPARK-30512 uses a dedicated boss event loop to mitigate the sasl timeout issue 
and we rarely see sasl timeout issues with async mode in our production 
clusters today. 

A few days ago we accidentally turned on sync mode on one cluster and we observed 
severe shuffle performance degradation. As a result, we benchmarked the 
performance comparison between async and sync mode and *we suggest removing 
sync mode in the code base* as it seems not to provide any benefits today. We 
would like to share the benchmark result and hear the opinion from the 
community.

 

benchmark on job's run time (sync mode is 2x - 3x slower):
YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
memory, each node manager has 1GB heap size.

shuffle stages: 5GB shuffle data (400M key-value records), 1000 map tasks and 
1000 reduce tasks.

results: shuffle read 5GB data, async mode takes 2-3 mins and sync mode takes 6 
mins.

 

benchmark on metrics of external shuffle service:
YARN cluster setup: 4 nodes in total. I set 2 nodes as async mode and 2 nodes 
as sync mode, shuffling 2.5 GB data.

results: in openblockreuqestslatencymillis_ratemean and some other metrics, the 
nodes in sync mode are 3x - 4x higher than nodes in async mode. I attached some 
screenshots of the metrics.
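
For reference, a minimal sketch of a shuffle job with roughly the shape described above (an assumed reconstruction, not the actual benchmark code; record size and key distribution are guesses):

{code:scala}
// 1000 reduce tasks, matching the benchmark setup
spark.conf.set("spark.sql.shuffle.partitions", 1000)

// ~400M key-value records spread over 1000 map tasks
val records = spark.range(0L, 400000000L, 1L, 1000)
  .selectExpr("id % 1000000 AS key", "id AS value")

// groupBy forces a full shuffle of the generated data; the noop sink discards the result
records.groupBy("key").count()
  .write.format("noop").mode("overwrite").save()
{code}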

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Attachment: openblock.png

> Remove await (syncMode) in ChunkFetchRequestHandler
> ---
>
> Key: SPARK-35865
> URL: https://issues.apache.org/jira/browse/SPARK-35865
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: openblock-compare.png, openblock.png
>
>
> SPARK-24355 introduces syncMode to mitigate the issue of sasl timeout by 
> throttling the max number of threads for sending responses of chunk fetch 
> requests. But it causes severe performance degradation because the throughput 
> of handling chunk fetch requests is reduced. SPARK-30623 makes the async and 
> sync mode configurable and makes the async mode the default. 
> SPARK-30512 uses a dedicated boss event loop to mitigate the sasl timeout 
> issue and we rarely see sasl timeout issues with async mode in our production 
> clusters today. 
> A few days ago we accidentally turned on sync mode on one cluster and we 
> observed severe shuffle performance degradation. As a result, we benchmarked 
> the performance comparison between async and sync mode and *we suggest 
> removing sync mode in the code base* as it seems not to provide any benefits 
> today. We would like to share the benchmark result and hear the opinion from 
> the community.
>  
> benchmark on job's run time (sync mode is 2x - 3x slower):
> YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
> memory, each node manager has 1GB heap size.
> shuffle stages: 5GB shuffle data (400M key-value records), 1000 map tasks and 
> 1000 reduce tasks.
> results: shuffle read 5GB data, async mode takes 2-3 mins and sync mode takes 
> 6 mins.
>  
> benchmark on metrics of external shuffle service:
> YARN cluster setup: 4 nodes in total. I set 2 nodes as async mode and 2 nodes 
> as sync mode, shuffling 2.5 GB data.
> results: in openblockreuqestslatencymillis_ratemean and some other metrics, 
> the nodes in sync mode are 3x - 4x higher than nodes in async mode. I 
> attached some screenshots of the metrics.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35864) CatalogFileIndex only used if Partition Columns Nonempty

2021-06-23 Thread Josh (Jira)
Josh created SPARK-35864:


 Summary: CatalogFileIndex only used if Partition Columns Nonempty
 Key: SPARK-35864
 URL: https://issues.apache.org/jira/browse/SPARK-35864
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2
Reporter: Josh


Currently, when deciding whether to use a CatalogFileIndex, we gate on 
catalogTable.get.partitionColumnNames.nonEmpty ([see 
here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L398]).
I believe what we actually want is to check 
catalogTable.get.dataSchema.nonEmpty instead, as I don't think it's actually 
necessary that a table have partition columns in order to be read from a 
CatalogFileIndex. This isn't a correctness issue, just a missed optimization 
any time there are no partition columns for a table.

 

Palantir [fixed this in our 
fork|https://github.com/palantir/spark/commit/e040ff5bf4d1b2d37264ad19468e0892c63b9798]
 a long time ago, but I don't think anyone ever remembered to push it upstream.
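
A minimal sketch of the proposed condition change (illustrative only; the real gate in DataSource.scala has more context around it, and the helper name below is hypothetical):

{code:scala}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical helper showing the gate discussed above.
def shouldUseCatalogFileIndex(catalogTable: Option[CatalogTable]): Boolean =
  // current:  catalogTable.exists(_.partitionColumnNames.nonEmpty)
  // proposed: any catalog table with a non-empty data schema qualifies
  catalogTable.exists(_.dataSchema.nonEmpty)
{code}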



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35729) Check all day-time interval types in aggregate expressions

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35729:


Assignee: (was: Apache Spark)

> Check all day-time interval types in aggregate expressions
> --
>
> Key: SPARK-35729
> URL: https://issues.apache.org/jira/browse/SPARK-35729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Check all supported combination of DayTimeIntervalType fields in the 
> aggregate expression: sum and avg.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35729) Check all day-time interval types in aggregate expressions

2021-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35729:


Assignee: Apache Spark

> Check all day-time interval types in aggregate expressions
> --
>
> Key: SPARK-35729
> URL: https://issues.apache.org/jira/browse/SPARK-35729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Check all supported combination of DayTimeIntervalType fields in the 
> aggregate expression: sum and avg.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35729) Check all day-time interval types in aggregate expressions

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368344#comment-17368344
 ] 

Apache Spark commented on SPARK-35729:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33042

> Check all day-time interval types in aggregate expressions
> --
>
> Key: SPARK-35729
> URL: https://issues.apache.org/jira/browse/SPARK-35729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Check all supported combination of DayTimeIntervalType fields in the 
> aggregate expression: sum and avg.
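
For context, a tiny example of the kind of aggregate being checked (an assumed illustration using Spark 3.2 ANSI interval literals, not taken from the pull request):

{code:scala}
// sum/avg over a day-time interval column
spark.sql(
  """SELECT sum(i) AS total, avg(i) AS average
    |FROM VALUES
    |  (INTERVAL '1 10:30:00' DAY TO SECOND),
    |  (INTERVAL '0 02:15:00' DAY TO SECOND) AS t(i)
    |""".stripMargin).show(truncate = false)
{code}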



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35733) Check all day-time interval types in HiveInspectors tests

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35733.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33036
[https://github.com/apache/spark/pull/33036]

> Check all day-time interval types in HiveInspectors tests
> -
>
> Key: SPARK-35733
> URL: https://issues.apache.org/jira/browse/SPARK-35733
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Check all day-time interval types are supported by HiveInspectors:
> # INTERVAL DAY
> # INTERVAL DAY TO HOUR
> # INTERVAL DAY TO MINUTE
> # INTERVAL HOUR
> # INTERVAL HOUR TO MINUTE
> # INTERVAL HOUR TO SECOND
> # INTERVAL MINUTE
> # INTERVAL MINUTE TO SECOND
> # INTERVAL SECOND



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35733) Check all day-time interval types in HiveInspectors tests

2021-06-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35733:


Assignee: angerszhu

> Check all day-time interval types in HiveInspectors tests
> -
>
> Key: SPARK-35733
> URL: https://issues.apache.org/jira/browse/SPARK-35733
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
>
> Check all day-time interval types are supported by HiveInspectors:
> # INTERVAL DAY
> # INTERVAL DAY TO HOUR
> # INTERVAL DAY TO MINUTE
> # INTERVAL HOUR
> # INTERVAL HOUR TO MINUTE
> # INTERVAL HOUR TO SECOND
> # INTERVAL MINUTE
> # INTERVAL MINUTE TO SECOND
> # INTERVAL SECOND



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35817) Queries against wide Avro tables can be slow

2021-06-23 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-35817:
--

Assignee: Bruce Robbins

> Queries against wide Avro tables can be slow
> 
>
> Key: SPARK-35817
> URL: https://issues.apache.org/jira/browse/SPARK-35817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>
> A query against an Avro table can be quite slow when all are true:
> - There are many columns in the Avro file
> - The query contains a wide projection
> - There are many splits in the input
> - Some of the splits are read serially (e.g., fewer executors than there are 
> tasks)
> A write to an Avro table can be quite slow when all are true:
> - There are many columns in the new rows
> - The operation is creating many files
> For example, a single-threaded query against a 6000 column Avro data set with 
> 50K rows and 20 files takes less than a minute with Spark 3.0.1 but over 7 
> minutes with Spark 3.2.0-SNAPSHOT.
> The culprit appears to be this line of code:
> https://github.com/apache/spark/blob/3fb044e043a2feab01d79b30c25b93d4fd166b12/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala#L226
> For each split, AvroDeserializer will call this function once for each column 
> in the projection, resulting in a potential n^2 lookup per split.
> For each file, AvroSerializer will call this function once for each column, 
> resulting in an n^2 lookup per file.
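
A rough sketch of the lookup pattern described above (simplified and hypothetical; the actual helper in AvroUtils differs):

{code:scala}
import scala.collection.JavaConverters._
import org.apache.avro.Schema

// A field lookup that scans every Avro field on each call: O(n) per lookup.
def findAvroField(avroSchema: Schema, name: String): Option[Schema.Field] =
  avroSchema.getFields.asScala.find(_.name().equalsIgnoreCase(name))

// Calling it once per projected column (n columns) yields O(n^2) work per split
// on read and per file on write; building a name -> field map once per schema
// avoids the quadratic cost.
{code}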



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35817) Queries against wide Avro tables can be slow

2021-06-23 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35817:
---
Affects Version/s: 3.1.1
   3.1.2

> Queries against wide Avro tables can be slow
> 
>
> Key: SPARK-35817
> URL: https://issues.apache.org/jira/browse/SPARK-35817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0
>Reporter: Bruce Robbins
>Priority: Major
>
> A query against an Avro table can be quite slow when all are true:
> - There are many columns in the Avro file
> - The query contains a wide projection
> - There are many splits in the input
> - Some of the splits are read serially (e.g., fewer executors than there are 
> tasks)
> A write to an Avro table can be quite slow when all are true:
> - There are many columns in the new rows
> - The operation is creating many files
> For example, a single-threaded query against a 6000 column Avro data set with 
> 50K rows and 20 files takes less than a minute with Spark 3.0.1 but over 7 
> minutes with Spark 3.2.0-SNAPSHOT.
> The culprit appears to be this line of code:
> https://github.com/apache/spark/blob/3fb044e043a2feab01d79b30c25b93d4fd166b12/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala#L226
> For each split, AvroDeserializer will call this function once for each column 
> in the projection, resulting in a potential n^2 lookup per split.
> For each file, AvroSerializer will call this function once for each column, 
> resulting in an n^2 lookup per file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34849) SPIP: Support pandas API layer on PySpark

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34849:


Assignee: Haejoon Lee

> SPIP: Support pandas API layer on PySpark
> -
>
> Key: SPARK-34849
> URL: https://issues.apache.org/jira/browse/SPARK-34849
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Blocker
>  Labels: SPIP
>
> This is a SPIP for porting [Koalas 
> project|https://github.com/databricks/koalas] to PySpark, that is once 
> discussed on the dev-mailing list with the same title, [[DISCUSS] Support 
> pandas API layer on 
> PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html].
>  
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
>  Porting Koalas into PySpark to support the pandas API layer on PySpark for:
>  - Users can easily leverage their existing Spark cluster to scale their 
> pandas workloads.
>  - Support plot and drawing a chart in PySpark
>  - Users can easily switch between pandas APIs and PySpark APIs
> *Q2. What problem is this proposal NOT designed to solve?*
> Some APIs of pandas are explicitly unsupported. For example, {{memory_usage}} 
> in pandas will not be supported because DataFrames are not materialized in 
> memory in Spark unlike pandas.
> This does not replace the existing PySpark APIs. PySpark API has lots of 
> users and existing code in many projects, and there are still many PySpark 
> users who prefer Spark’s immutable DataFrame API to the pandas API.
> *Q3. How is it done today, and what are the limits of current practice?*
> The current practice has 2 limits as below.
>  # There are many features missing in Apache Spark that are very commonly 
> used in data science. Specifically, plotting and drawing a chart is missing 
> which is one of the most important features that almost every data scientist 
> use in their daily work.
>  # Data scientists tend to prefer pandas APIs, but it is very hard to change 
> them into PySpark APIs when they need to scale their workloads. This is 
> because PySpark APIs are difficult to learn compared to pandas' and there are 
> many missing features in PySpark.
> *Q4. What is new in your approach and why do you think it will be successful?*
> I believe this suggests a new way for both PySpark and pandas users to easily 
> scale their workloads. I think we can be successful because more and more 
> people tend to use Python and pandas. In fact, there are already similar 
> attempts such as Dask and Modin, which are all growing fast and successfully.
> *Q5. Who cares? If you are successful, what difference will it make?*
> Anyone who wants to scale their pandas workloads on their Spark cluster. It 
> will also significantly improve the usability of PySpark.
> *Q6. What are the risks?*
> Technically I don't see many risks yet given that:
> - Koalas has grown separately for more than two years, and has greatly 
> improved maturity and stability.
> - Koalas will be ported into PySpark as a separate package
> It is more about putting documentation and test cases in place properly with 
> properly handling dependencies. For example, Koalas currently uses pytest 
> with various dependencies whereas PySpark uses the plain unittest with fewer 
> dependencies.
> In addition, Koalas' default Indexing system could not be much loved because 
> it could potentially cause overhead, so applying it properly to PySpark might 
> be a challenge.
> *Q7. How long will it take?*
> Before the Spark 3.2 release.
> *Q8. What are the mid-term and final “exams” to check for success?*
> The first check for success would be to make sure that all the existing 
> Koalas APIs and tests work as they are, without affecting the existing 
> Koalas workloads on PySpark.
> The last thing to confirm is to check whether the usability and convenience 
> that we aim for is actually increased through user feedback and PySpark usage 
> statistics.
> *Also refer to:*
> - [Koalas internals 
> documentation|https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU/edit]
> - [[VOTE] SPIP: Support pandas API layer on 
> PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35805) API auditing in Pandas API on Spark

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35805:


Assignee: Haejoon Lee

> API auditing in Pandas API on Spark
> ---
>
> Key: SPARK-35805
> URL: https://issues.apache.org/jira/browse/SPARK-35805
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Blocker
>
> There are several things that need improvement in pandas on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35337) pandas APIs on Spark: Separate basic operations into data type based structures

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35337:


Assignee: Xinrong Meng

> pandas APIs on Spark: Separate basic operations into data type based 
> structures
> ---
>
> Key: SPARK-35337
> URL: https://issues.apache.org/jira/browse/SPARK-35337
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Currently, the same basic operation of all data types is defined in one 
> function, so it’s difficult to extend the behavior change based on the data 
> types. For example, the binary operation Series + Series behaves differently 
> based on the data type, e.g., just adding for numerical operands, 
> concatenating for string operands, etc. The behavior difference is done by 
> if-else in the function, so it’s messy and difficult to maintain or reuse the 
> logic.
> We should provide an infrastructure to manage the differences in these 
> operations.
> Please refer to [pandas APIs on Spark: Separate basic operations into data 
> type based 
> structures|https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing]
>  for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35464) pandas APIs on Spark: Enable mypy check "disallow_untyped_defs" for main codes.

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35464:


Assignee: Takuya Ueshin

> pandas APIs on Spark: Enable mypy check "disallow_untyped_defs" for main 
> codes.
> ---
>
> Key: SPARK-35464
> URL: https://issues.apache.org/jira/browse/SPARK-35464
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> Currently many functions in the main code are still missing type annotations, 
> and the {{mypy}} check "disallow_untyped_defs" is disabled.
> We should add more type annotations and enable the {{mypy}} check.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32185) User Guide - Monitoring

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32185.
--
Resolution: Later

> User Guide - Monitoring
> ---
>
> Key: SPARK-32185
> URL: https://issues.apache.org/jira/browse/SPARK-32185
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Abhijeet Prasad
>Priority: Major
>
> Monitoring. We should focus on how to monitor PySpark jobs.
> - Custom Worker, see also 
> https://github.com/apache/spark/tree/master/python/test_coverage to enable 
> test coverage that include worker sides too.
> - Sentry Support \(?\) 
> https://blog.sentry.io/2019/11/12/sentry-for-data-error-monitoring-with-pyspark
> - Link back https://spark.apache.org/docs/latest/monitoring.html . 
> - ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32666) Install ipython and nbsphinx in Jenkins for Binder integration

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32666:


Assignee: Shane Knapp

> Install ipython and nbsphinx in Jenkins for Binder integration
> --
>
> Key: SPARK-32666
> URL: https://issues.apache.org/jira/browse/SPARK-32666
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> Binder integration requires IPython and nbsphinx to use the notebook file as 
> the documentation in PySpark.
> See SPARK-32204 and its PR for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35588) Merge Binder integration and quickstart notebook

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368175#comment-17368175
 ] 

Apache Spark commented on SPARK-35588:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33041

> Merge Binder integration and quickstart notebook
> 
>
> Key: SPARK-35588
> URL: https://issues.apache.org/jira/browse/SPARK-35588
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should merge:
> https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart.ipynb
> https://github.com/databricks/koalas/blob/master/docs/source/getting_started/10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-35481) Create more robust link for Data Source Options

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35481:
-
Comment: was deleted

(was: User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32620)

> Create more robust link for Data Source Options
> ---
>
> Key: SPARK-35481
> URL: https://issues.apache.org/jira/browse/SPARK-35481
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Now the link for the Data Source Options uses /latest/, but it could be 
> broken once we cut branch-3.2.
> For example, [Data Source Option for 
> Avro|https://spark.apache.org/docs/latest/sql-data-sources-avro.html#data-source-option]
> It should point to the 3.2 documentation in branch-3.2, so it's better to use a 
> relative link instead of /latest/.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-35481) Create more robust link for Data Source Options

2021-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35481:
-
Comment: was deleted

(was: User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32620)

> Create more robust link for Data Source Options
> ---
>
> Key: SPARK-35481
> URL: https://issues.apache.org/jira/browse/SPARK-35481
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Now the link for the Data Source Options uses /latest/, but it could be 
> broken once we cut branch-3.2.
> For example, [Data Source Option for 
> Avro|https://spark.apache.org/docs/latest/sql-data-sources-avro.html#data-source-option]
> It should point to the 3.2 documentation in branch-3.2, so it's better to use a 
> relative link instead of /latest/.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35863) Upgrade Ivy to 2.5.0

2021-06-23 Thread Adam Binford (Jira)
Adam Binford created SPARK-35863:


 Summary: Upgrade Ivy to 2.5.0
 Key: SPARK-35863
 URL: https://issues.apache.org/jira/browse/SPARK-35863
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 3.1.2
Reporter: Adam Binford


Apache Ivy 2.5.0 was released nearly two years ago. The new bug fixes and 
features can be found here: 
[https://ant.apache.org/ivy/history/latest-milestone/release-notes.html]

Most notably, the addition of the ivy.maven.lookup.sources and 
ivy.maven.lookup.javadoc configs can significantly speed up module resolution 
time when they are turned off, especially behind a proxy. These could arguably 
be turned off by default, because when submitting jobs you probably don't care 
about the sources or javadoc jars.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35563) [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows

2021-06-23 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368098#comment-17368098
 ] 

Robert Joseph Evans commented on SPARK-35563:
-

Or just do the overflow check on the int. I personally don't see a problem with 
Spark not supporting really large windows. I just want to avoid data 
corruption/loss.
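
A minimal sketch of what an overflow check on the int counter could look like (placement and the function name are assumptions, not the actual Spark code):

{code:scala}
// Hypothetical guard for a row-number style counter kept as an Int.
def incrementRowNumber(current: Int): Int = {
  if (current == Int.MaxValue) {
    throw new ArithmeticException(
      "row_number() overflow: window partition exceeds Int.MaxValue rows")
  }
  current + 1
}
{code}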

> [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
> --
>
> Key: SPARK-35563
> URL: https://issues.apache.org/jira/browse/SPARK-35563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: data-loss
>
> I think this impacts a lot more versions of Spark, but I don't know for sure 
> because it takes a long time to test. As a part of doing corner case 
> validation testing for spark rapids I found that if a window function has 
> more than {{Int.MaxValue + 1}} rows the result is silently truncated to that 
> many rows. I have only tested this on 3.0.2 with {{row_number}}, but I 
> suspect it will impact others as well. This is a really rare corner case, but 
> because it is silent data corruption I personally think it is quite serious.
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> val windowSpec = Window.partitionBy("a").orderBy("b")
> val df = spark.range(Int.MaxValue.toLong + 100).selectExpr(s"1 as a", "id as 
> b")
> spark.time(df.select(col("a"), col("b"), 
> row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), 
> desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20))
> +-----+----------+
> |  dir|     count|
> +-----+----------+
> |false|2147483647|
> | true|         1|
> +-----+----------+
> Time taken: 1139089 ms
> Int.MaxValue.toLong + 100
> res15: Long = 2147483747
> 2147483647L + 1
> res16: Long = 2147483648
> {code}
> I had to make sure that I ran the above with at least 64GiB of heap for the 
> executor (I did it in local mode and it worked, but took forever to run)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35563) [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows

2021-06-23 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368096#comment-17368096
 ] 

Robert Joseph Evans commented on SPARK-35563:
-

Yes, technically if we switch it from an int to a long then we will have a 
similar problem with LONG_MAX, but that kicks the can down the road a very 
long way. With the current Spark memory layout for UnsafeRow, where there is a 
long for nullability followed by a long for each column (possibly more), a 
single-column dataframe would need 32 exabytes of memory to hold this window 
before we hit the problem. But yes, we should look at doing an overflow check 
as well. I just would want to measure the performance impact of it so we can 
make an informed decision.

> [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
> --
>
> Key: SPARK-35563
> URL: https://issues.apache.org/jira/browse/SPARK-35563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: data-loss
>
> I think this impacts a lot more versions of Spark, but I don't know for sure 
> because it takes a long time to test. As a part of doing corner case 
> validation testing for spark rapids I found that if a window function has 
> more than {{Int.MaxValue + 1}} rows the result is silently truncated to that 
> many rows. I have only tested this on 3.0.2 with {{row_number}}, but I 
> suspect it will impact others as well. This is a really rare corner case, but 
> because it is silent data corruption I personally think it is quite serious.
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> val windowSpec = Window.partitionBy("a").orderBy("b")
> val df = spark.range(Int.MaxValue.toLong + 100).selectExpr(s"1 as a", "id as 
> b")
> spark.time(df.select(col("a"), col("b"), 
> row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), 
> desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20))
> +-----+----------+
> |  dir|     count|
> +-----+----------+
> |false|2147483647|
> | true|         1|
> +-----+----------+
> Time taken: 1139089 ms
> Int.MaxValue.toLong + 100
> res15: Long = 2147483747
> 2147483647L + 1
> res16: Long = 2147483648
> {code}
> I had to make sure that I ran the above with at least 64GiB of heap for the 
> executor (I did it in local mode and it worked, but took forever to run)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing write support

2021-06-23 Thread Pushkar Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368089#comment-17368089
 ] 

Pushkar Kumar commented on SPARK-19256:
---

Hi [~chengsu], could you please update us here?

> Hive bucketing write support
> 
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed table, to be compatible with 
> other compute engines (Hive and Presto).
>  
> Current status for Hive bucketed table in Spark:
> No support for reading Hive bucketed tables: a bucketed table is read as a 
> non-bucketed table.
> Wrong behavior for writing Hive ORC and Parquet bucketed table: write 
> orc/parquet bucketed table as non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Do not allow for writing Hive non-ORC/Parquet bucketed table: throw exception 
> by default if writing non-orc/parquet bucketed table (code path: 
> InsertIntoHiveTable), and exception can be disabled by setting config 
> `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will 
> write as non-bucketed table.
>  
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
> Hive on Tez: support zero and multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-14014). And more code pointer on 
> read path - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212]
>  .
>  
> Current status for Hive bucketed table in Presto (take presto-sql here):
> Support writing bucketed table with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Support zero and multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TLDR is to achieve Hive bucketed table compatibility across Spark, Presto and 
> Hive. Here with this JIRA, we need to add support writing Hive bucketed table 
> with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 
> 2.x.y).
>  
> To allow Spark to efficiently read Hive bucketed tables, this needs a more 
> radical change, and we decided to wait until data source v2 supports bucketing 
> and do the read path on data source v2. The read path will not be covered by 
> this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368070#comment-17368070
 ] 

Apache Spark commented on SPARK-35290:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/33040

> unionByName with null filling fails for some nested structs
> ---
>
> Key: SPARK-35290
> URL: https://issues.apache.org/jira/browse/SPARK-35290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> We've encountered a few weird edge cases that seem to fail the new null 
> filling unionByName (which has been a great addition!). It seems to stem from 
> the fields being sorted by name and corrupted along the way. The simple 
> reproduction is:
> {code:python}
> # import added for completeness; `spark` is the active SparkSession
> from pyspark.sql import functions as F
>
> df = spark.createDataFrame([[]])
> df1 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('ba').alias('ba')
>         ).alias('b')
>     ))
> )
> df2 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('aa').alias('aa')
>         ).alias('a'),
>         F.struct(
>             F.lit('bb').alias('bb')
>         ).alias('b'),
>     ))
> )
> df1.unionByName(df2, True).printSchema()
> {code}
> This results in the exception:
> {code:java}
> pyspark.sql.utils.AnalysisException: Union can only be performed on tables 
> with the compatible column types. 
> struct,b:struct> <> 
> struct,b:struct> at the first column 
> of the second table;
> {code}
> You can see in the second schema that it has 
> {code:java}
> b:struct
> {code}
> when it should be
> {code:java}
> b:struct
> {code}
> It seems to happen somewhere during 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala#L73,]
>  as everything seems correct up to that point from my testing. Either 
> modifying one expression during the transformUp corrupts other expressions 
> that are modified later, or the ExtractValue created before the addFieldsInto 
> remembers the ordinal position in the struct, which then changes and causes 
> issues.
>  
> I found that simply using sortStructFields instead of 
> sortStructFieldsInWithFields gets things working correctly, but definitely 
> has a performance impact. The deep expr unionByName test takes ~1-2 seconds 
> normally but ~12-15 seconds with this change. I assume because the original 
> method tried to rewrite existing expressions vs the sortStructFields just 
> adds expressions on top of existing ones to project the new order.
> I'm not sure if it makes sense to take the slower method that works in the 
> edge cases (assuming it doesn't break other cases; all existing tests pass), or 
> if there's a way to fix the existing method for cases like this.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35787) Does anyone has performance issue after upgrade from 3.0 to 3.1?

2021-06-23 Thread Vidmantas Drasutis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368055#comment-17368055
 ] 

Vidmantas Drasutis commented on SPARK-35787:


[~hyukjin.kwon] maybe you know when Spark 3.1.3 will be released? I see all 
tasks are done, including memory-leak and other important fixes, which may (I 
hope) help with the performance issues we are facing.

Thanks

> Does anyone has performance issue after upgrade from 3.0 to 3.1?
> 
>
> Key: SPARK-35787
> URL: https://issues.apache.org/jira/browse/SPARK-35787
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Vidmantas Drasutis
>Priority: Major
> Attachments: Execution_plan_difference.png, 
> analysis_query_no_UDFs.png, spark_3.0_execution_plan_details_fast.txt, 
> spark_3.1_execution_plan_details_slow.txt, spark_job_info_1.png, 
> spark_job_info_2.png
>
>
> Hello.
>  
> We had been using Spark 3.0.2 and the query executed in ~100 seconds.
> After we upgraded Spark to 3.1.1 (we also tried 3.1.2 - same slow performance), 
> our query execution time grew to ~260 seconds, a 250-300 % increase.
>  
> We tried a quite simple query.
> In the query we use a UDF (*org.apache.spark.sql.functions*) which explodes 
> data and does a polygon hit test. Nothing changed in our code from the query 
> perspective.
>  It is a single-VM (1 box) cluster.
>  
> Has anyone faced a similar issue?
> Attached are some details from the Spark dashboard.
>  
> *It looks like a UDF-related slowdown: queries that do not use UDFs perform 
> the same, while queries that use UDFs got slower starting from 3.1.*
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35862) Watermark timestamp only can be format in UTC timeZone, unfriendly to users in other time zones

2021-06-23 Thread Yazhi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yazhi Wang updated SPARK-35862:
---
Description: 
Timestamp is formatted in `ProgressReporter` by `formatTimestamp` for watermark 
and eventTime stats. The timestampFormat is hardcoded to the UTC time zone.

`

private val timestampFormat = new 
SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") // ISO8601
 timestampFormat.setTimeZone(DateTimeUtils.getTimeZone("UTC"))

`

When users set a different time zone via the Java option `-Duser.timezone`, they 
may be confused by output that mixes different time zones.

e.g.

`

 {color:#FF}*2021-06-23 16:12:07*{color} [stream execution thread for [id = 
92f4f363-df85-48e9-aef9-5ea6f2b70316, runId = 
5733ef8e-11d1-46c4-95cc-219bde6e7a20]] INFO [MicroBatchExecution:54]: Streaming 
query made progress: {
 "id" : "92f4f363-df85-48e9-aef9-5ea6f2b70316",
 "runId" : "5733ef8e-11d1-46c4-95cc-219bde6e7a20",
 "name" : null,
 "timestamp" : "2021-06-23T08:11:56.790Z",
 "batchId" : 91740,
 "numInputRows" : 2577,
 "inputRowsPerSecond" : 155.33453887884266,
 "processedRowsPerSecond" : 242.29033471229786,
 "durationMs" :

{ "addBatch" : 8671, "getBatch" : 3, "getOffset" : 1139, "queryPlanning" : 79, 
"triggerExecution" : 10636, "walCommit" : 162 }

,
 "eventTime" :

{color:#FF}*{ "avg" : "2021-06-23T08:11:46.307Z", "max" : 
"2021-06-23T08:11:55.000Z", "min" : "2021-06-23T08:11:37.000Z", "watermark" : 
"2021-06-23T07:41:39.000Z" }*{color}

,

`

maybe we need to unify the time zone used for the time format

  was:
Timestamp is formatted in `ProgressReporter` by `formatTimestamp` for watermark 
and eventTime stats. the timestampFormat is hardcoded in UTC time zone.

`

private val timestampFormat = new 
SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") // ISO8601
timestampFormat.setTimeZone(DateTimeUtils.getTimeZone("UTC"))

`

When users set the different timezone by java options `-Duser.timezone` , they 
may be confused by the information mixed with different timezone.

eg

`

 2021-06-23 16:12:07 [stream execution thread for [id = 
92f4f363-df85-48e9-aef9-5ea6f2b70316, runId = 
5733ef8e-11d1-46c4-95cc-219bde6e7a20]] INFO [MicroBatchExecution:54]: Streaming 
query made progress: {
 "id" : "92f4f363-df85-48e9-aef9-5ea6f2b70316",
 "runId" : "5733ef8e-11d1-46c4-95cc-219bde6e7a20",
 "name" : null,
 "timestamp" : "2021-06-23T08:11:56.790Z",
 "batchId" : 91740,
 "numInputRows" : 2577,
 "inputRowsPerSecond" : 155.33453887884266,
 "processedRowsPerSecond" : 242.29033471229786,
 "durationMs" : {
 "addBatch" : 8671,
 "getBatch" : 3,
 "getOffset" : 1139,
 "queryPlanning" : 79,
 "triggerExecution" : 10636,
 "walCommit" : 162
 },
 "eventTime" : {
 "avg" : "2021-06-23T08:11:46.307Z",
 "max" : "2021-06-23T08:11:55.000Z",
 "min" : "2021-06-23T08:11:37.000Z",
 "watermark" : "2021-06-23T07:41:39.000Z"
 },

`

maybe we need to unify the time zone used for the time format


> Watermark timestamp only can be format in UTC timeZone, unfriendly to users 
> in other time zones
> ---
>
> Key: SPARK-35862
> URL: https://issues.apache.org/jira/browse/SPARK-35862
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.2
>Reporter: Yazhi Wang
>Priority: Minor
>
> Timestamp is formatted in `ProgressReporter` by `formatTimestamp` for 
> watermark and eventTime stats. The timestampFormat is hardcoded to the UTC 
> time zone.
> `
> private val timestampFormat = new 
> SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") // ISO8601
>  timestampFormat.setTimeZone(DateTimeUtils.getTimeZone("UTC"))
> `
> When users set a different time zone via the Java option `-Duser.timezone`, 
> they may be confused by output that mixes different time zones.
> e.g.
> `
>  {color:#FF}*2021-06-23 16:12:07*{color} [stream execution thread for [id 
> = 92f4f363-df85-48e9-aef9-5ea6f2b70316, runId = 
> 5733ef8e-11d1-46c4-95cc-219bde6e7a20]] INFO [MicroBatchExecution:54]: 
> Streaming query made progress: {
>  "id" : "92f4f363-df85-48e9-aef9-5ea6f2b70316",
>  "runId" : "5733ef8e-11d1-46c4-95cc-219bde6e7a20",
>  "name" : null,
>  "timestamp" : "2021-06-23T08:11:56.790Z",
>  "batchId" : 91740,
>  "numInputRows" : 2577,
>  "inputRowsPerSecond" : 155.33453887884266,
>  "processedRowsPerSecond" : 242.29033471229786,
>  "durationMs" :
> { "addBatch" : 8671, "getBatch" : 3, "getOffset" : 1139, "queryPlanning" : 
> 79, "triggerExecution" : 10636, "walCommit" : 162 }
> ,
>  "eventTime" :
> {color:#FF}*{ "avg" : "2021-06-23T08:11:46.307Z", "max" : 
> "2021-06-23T08:11:55.000Z", "min" : "2021-06-23T08:11:37.000Z", "watermark" : 
> "2021-06-23T07:41:39.000Z" }*{color}
> ,
> `
> maybe we need to unify the time zone used for the time format
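
A small sketch of a formatter that follows the JVM default time zone instead of hardcoded UTC (illustrative only; the actual fix in ProgressReporter may differ):

{code:scala}
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// ISO 8601 with an explicit offset, honoring -Duser.timezone
val timestampFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
timestampFormat.setTimeZone(TimeZone.getDefault)
println(timestampFormat.format(new Date()))  // e.g. 2021-06-23T16:12:07.123+08:00
{code}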



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (SPARK-35787) Does anyone has performance issue after upgrade from 3.0 to 3.1?

2021-06-23 Thread Vidmantas Drasutis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368041#comment-17368041
 ] 

Vidmantas Drasutis commented on SPARK-35787:


We are also seeing a performance decrease in aggregations where we do not use 
any UDFs.
Please check the attached "*analysis_query_no_UDFs.png*" image for more details. 
The test was run on two identical VMs.

> Does anyone has performance issue after upgrade from 3.0 to 3.1?
> 
>
> Key: SPARK-35787
> URL: https://issues.apache.org/jira/browse/SPARK-35787
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Vidmantas Drasutis
>Priority: Major
> Attachments: Execution_plan_difference.png, 
> analysis_query_no_UDFs.png, spark_3.0_execution_plan_details_fast.txt, 
> spark_3.1_execution_plan_details_slow.txt, spark_job_info_1.png, 
> spark_job_info_2.png
>
>
> Hello.
>  
> We had been using Spark 3.0.2 and the query executed in ~100 seconds.
> After we upgraded Spark to 3.1.1 (we also tried 3.1.2 - same slow performance), 
> our query execution time grew to ~260 seconds, a 250-300 % increase.
>  
> We tried a quite simple query.
> In the query we use a UDF (*org.apache.spark.sql.functions*) which explodes 
> data and does a polygon hit test. Nothing changed in our code from the query 
> perspective.
>  It is a single-VM (1 box) cluster.
>  
> Has anyone faced a similar issue?
> Attached are some details from the Spark dashboard.
>  
> *It looks like a UDF-related slowdown: queries that do not use UDFs perform 
> the same, while queries that use UDFs got slower starting from 3.1.*
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35787) Does anyone has performance issue after upgrade from 3.0 to 3.1?

2021-06-23 Thread Vidmantas Drasutis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vidmantas Drasutis updated SPARK-35787:
---
Attachment: analysis_query_no_UDFs.png

> Does anyone has performance issue after upgrade from 3.0 to 3.1?
> 
>
> Key: SPARK-35787
> URL: https://issues.apache.org/jira/browse/SPARK-35787
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Vidmantas Drasutis
>Priority: Major
> Attachments: Execution_plan_difference.png, 
> analysis_query_no_UDFs.png, spark_3.0_execution_plan_details_fast.txt, 
> spark_3.1_execution_plan_details_slow.txt, spark_job_info_1.png, 
> spark_job_info_2.png
>
>
> Hello.
>  
> We had been using Spark 3.0.2 and the query executed in ~100 seconds.
> After we upgraded Spark to 3.1.1 (we also tried 3.1.2 - same slow performance), 
> our query execution time grew to ~260 seconds, a 250-300 % increase.
>  
> We tried a quite simple query.
> In the query we use a UDF (*org.apache.spark.sql.functions*) which explodes 
> data and does a polygon hit test. Nothing changed in our code from the query 
> perspective.
>  It is a single-VM (1 box) cluster.
>  
> Has anyone faced a similar issue?
> Attached are some details from the Spark dashboard.
>  
> *It looks like a UDF-related slowdown: queries that do not use UDFs perform 
> the same, while queries that use UDFs got slower starting from 3.1.*
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


