[jira] [Resolved] (SPARK-37499) Close HiveClientImpl.sessionState when shutdown

2021-11-29 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-37499.
---
Resolution: Not A Problem

> Close HiveClientImpl.sessionState when shutdown
> ---
>
> Key: SPARK-37499
> URL: https://issues.apache.org/jira/browse/SPARK-37499
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> HiveClientImpl does not close sessionState when the application shuts down, which 
> leaves many session files behind.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37494) Unify v1 and v2 options output of `SHOW CREATE TABLE` command

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450909#comment-17450909
 ] 

Apache Spark commented on SPARK-37494:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34753

> Unify v1 and v2 options output of `SHOW CREATE TABLE` command
> -
>
> Key: SPARK-37494
> URL: https://issues.apache.org/jira/browse/SPARK-37494
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37494) Unify v1 and v2 options output of `SHOW CREATE TABLE` command

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37494:


Assignee: (was: Apache Spark)

> Unify v1 and v2 options output of `SHOW CREATE TABLE` command
> -
>
> Key: SPARK-37494
> URL: https://issues.apache.org/jira/browse/SPARK-37494
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37494) Unify v1 and v2 options output of `SHOW CREATE TABLE` command

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37494:


Assignee: Apache Spark

> Unify v1 and v2 options output of `SHOW CREATE TABLE` command
> -
>
> Key: SPARK-37494
> URL: https://issues.apache.org/jira/browse/SPARK-37494
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36396) Implement DataFrame.cov

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36396.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34213
[https://github.com/apache/spark/pull/34213]

> Implement DataFrame.cov
> ---
>
> Key: SPARK-36396
> URL: https://issues.apache.org/jira/browse/SPARK-36396
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36396) Implement DataFrame.cov

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36396:


Assignee: Xinrong Meng

> Implement DataFrame.cov
> ---
>
> Key: SPARK-36396
> URL: https://issues.apache.org/jira/browse/SPARK-36396
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37489) Skip hasnans check in numops if eager_check disable

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37489:


Assignee: Yikun Jiang

> Skip hasnans check in numops if eager_check disable
> ---
>
> Key: SPARK-37489
> URL: https://issues.apache.org/jira/browse/SPARK-37489
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37489) Skip hasnans check in numops if eager_check disable

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37489.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34746
[https://github.com/apache/spark/pull/34746]

> Skip hasnans check in numops if eager_check disable
> ---
>
> Key: SPARK-37489
> URL: https://issues.apache.org/jira/browse/SPARK-37489
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37499) Close HiveClientImpl.sessionState when shutdown

2021-11-29 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450884#comment-17450884
 ] 

angerszhu commented on SPARK-37499:
---

Will raise a PR soon.

> Close HiveClientImpl.sessionState when shutdown
> ---
>
> Key: SPARK-37499
> URL: https://issues.apache.org/jira/browse/SPARK-37499
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> HiveClientImpl does not close sessionState when the application shuts down, which 
> leaves many session files behind.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37499) Close HiveClientImpl.sessionState when shutdown

2021-11-29 Thread angerszhu (Jira)
angerszhu created SPARK-37499:
-

 Summary: Close HiveClientImpl.sessionState when shutdown
 Key: SPARK-37499
 URL: https://issues.apache.org/jira/browse/SPARK-37499
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


HiveClientImpl does not close sessionState when the application shuts down, which leaves 
many session files behind.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37497) Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450883#comment-17450883
 ] 

Apache Spark commented on SPARK-37497:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34751

> Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi
> -
>
> Key: SPARK-37497
> URL: https://issues.apache.org/jira/browse/SPARK-37497
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37497) Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37497:


Assignee: (was: Apache Spark)

> Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi
> -
>
> Key: SPARK-37497
> URL: https://issues.apache.org/jira/browse/SPARK-37497
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37497) Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450882#comment-17450882
 ] 

Apache Spark commented on SPARK-37497:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34751

> Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi
> -
>
> Key: SPARK-37497
> URL: https://issues.apache.org/jira/browse/SPARK-37497
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37497) Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37497:


Assignee: Apache Spark

> Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi
> -
>
> Key: SPARK-37497
> URL: https://issues.apache.org/jira/browse/SPARK-37497
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37498) test_reuse_worker_of_parallelize_range is flaky

2021-11-29 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-37498:
---

 Summary:  test_reuse_worker_of_parallelize_range is flaky
 Key: SPARK-37498
 URL: https://issues.apache.org/jira/browse/SPARK-37498
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Affects Versions: 3.3.0
Reporter: Yikun Jiang


 
{code:java}
ERROR [2.132s]: test_reuse_worker_of_parallelize_range 
(pyspark.tests.test_worker.WorkerReuseTest)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/tests/test_worker.py", line 195, in 
test_reuse_worker_of_parallelize_range
    self.assertTrue(pid in previous_pids)
AssertionError: False is not true
--
Ran 12 tests in 22.589s
{code}
 

 

[1] https://github.com/apache/spark/runs/1182154542?check_suite_focus=true
[2] https://github.com/apache/spark/pull/33657#issuecomment-893969310
[3] https://github.com/Yikun/spark/runs/4362783540?check_suite_focus=true
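For context, the failing assertion boils down to checking that Python worker PIDs are reused across jobs. A minimal sketch of that pattern, reconstructed from the traceback above (illustrative only; names and values are not the exact test source):

{code:python}
# Sketch of the reuse check implied by the traceback (illustrative, not the
# exact test source). Worker reuse is on by default (spark.python.worker.reuse).
import os
from pyspark import SparkContext

sc = SparkContext("local[2]", "worker-reuse-sketch")
rdd = sc.parallelize(range(20), 4)
previous_pids = rdd.map(lambda _: os.getpid()).collect()
pids = rdd.map(lambda _: os.getpid()).collect()
# Flaky part: a worker process can be replaced between the two jobs,
# so a pid may legitimately be missing from previous_pids.
assert all(pid in previous_pids for pid in pids)
sc.stop()
{code}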



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37497) Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source to DeveloperApi

2021-11-29 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37497:
-

 Summary: Promote ExecutorPods[PollingSnapshot|WatchSnapshot]Source 
to DeveloperApi
 Key: SPARK-37497
 URL: https://issues.apache.org/jira/browse/SPARK-37497
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37495) Skip identical index checking of Series.compare when config 'compute.eager_check' is disabled

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37495:


Assignee: Apache Spark

> Skip identical index checking of Series.compare when config 
> 'compute.eager_check' is disabled
> -
>
> Key: SPARK-37495
> URL: https://issues.apache.org/jira/browse/SPARK-37495
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37495) Skip identical index checking of Series.compare when config 'compute.eager_check' is disabled

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37495:


Assignee: (was: Apache Spark)

> Skip identical index checking of Series.compare when config 
> 'compute.eager_check' is disabled
> -
>
> Key: SPARK-37495
> URL: https://issues.apache.org/jira/browse/SPARK-37495
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37495) Skip identical index checking of Series.compare when config 'compute.eager_check' is disabled

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450875#comment-17450875
 ] 

Apache Spark commented on SPARK-37495:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34750

> Skip identical index checking of Series.compare when config 
> 'compute.eager_check' is disabled
> -
>
> Key: SPARK-37495
> URL: https://issues.apache.org/jira/browse/SPARK-37495
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37496) Migrate ReplaceTableAsSelectStatement to v2 command

2021-11-29 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-37496:
--

 Summary: Migrate ReplaceTableAsSelectStatement to v2 command
 Key: SPARK-37496
 URL: https://issues.apache.org/jira/browse/SPARK-37496
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-29 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-37487:
---
Description: 
It is best exemplified by this new UT in DataFrameCallbackSuite:
{code}
  test("SPARK-37487: get observable metrics with sort by callback") {
val df = spark.range(100)
  .observe(
name = "my_event",
min($"id").as("min_val"),
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
count(when($"id" % 2 === 0, 1)).as("num_even"))
  .observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
  .sort($"id".desc)

validateObservedMetrics(df)
  }
{code}

The count and sum aggregates report twice the number of rows:
{code}
[info] - SPARK-37487: get observable metrics with sort by callback *** FAILED 
*** (169 milliseconds)
[info]   [0,99,9900,100] did not equal [0,99,4950,50] 
(DataFrameCallbackSuite.scala:342)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
{code}

I could not figure out how this happens. Hopefully the UT can help with debugging.

  was:
It is best examplified by this new UT in DataFrameCallbackSuite:
{code}
  test("SPARK-X: get observable metrics with sort by callback") {
val df = spark.range(100)
  .observe(
name = "my_event",
min($"id").as("min_val"),
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
count(when($"id" % 2 === 0, 1)).as("num_even"))
  .observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
  .sort($"id".desc)

validateObservedMetrics(df)
  }
{code}

The count and sum aggregate report twice the number of rows:
{code}
[info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
*** (169 milliseconds)
[info]   [0,99,9900,100] did not equal [0,99,4950,50] 
(DataFrameCallbackSuite.scala:342)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
{code}

I could not figure out how this happes. Hopefully the UT can help with debugging


> CollectMetrics is executed twice if it is followed by a sort
> 
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: correctness
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-37487: get observable metrics with sort by callback") {
> val df = spark.range(100)
>

[jira] [Resolved] (SPARK-37465) PySpark tests failing on Pandas 0.23

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37465.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34717
[https://github.com/apache/spark/pull/34717]

> PySpark tests failing on Pandas 0.23
> 
>
> Key: SPARK-37465
> URL: https://issues.apache.org/jira/browse/SPARK-37465
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Willi Raschkowski
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> I was running Spark tests with Pandas {{0.23.4}} and got the error below. The 
> minimum Pandas version is currently {{0.23.2}} 
> [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. 
> Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix 
> (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222]
>  in Pandas.
> {code:java}
> $ python/run-tests --testnames 
> 'pyspark.pandas.tests.data_type_ops.test_boolean_ops 
> BooleanOpsTest.test_floordiv'
> ...
> ==
> ERROR [5.785s]: test_floordiv 
> (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py",
>  line 128, in test_floordiv
> self.assert_eq(b_pser // b_pser.astype(int), b_psser // 
> b_psser.astype(int))
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1069, in wrapper
> result = safe_na_op(lvalues, rvalues)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1033, in safe_na_op
> return na_op(lvalues, rvalues)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1027, in na_op
> result = missing.fill_zeros(result, x, y, op_name, fill_zeros)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py",
>  line 641, in fill_zeros
> signs = np.sign(y if name.startswith(('r', '__r')) else x)
> TypeError: ufunc 'sign' did not contain a loop with signature matching types 
> dtype('bool') dtype('bool')
> {code}
> These are my relevant package versions:
> {code:java}
> $ conda list | grep -e numpy -e pyarrow -e pandas -e python
> # packages in environment at /home/circleci/miniconda/envs/python3:
> numpy             1.16.6    py36h0a8e133_3
> numpy-base        1.16.6    py36h41b4c56_3
> pandas            0.23.4    py36h04863e7_0
> pyarrow           1.0.1     py36h6200943_36_cpu    conda-forge
> python            3.6.12    hcff3b4d_2             anaconda
> python-dateutil   2.8.1     py_0                   anaconda
> python_abi        3.6       1_cp36m                conda-forg
> {code}
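For reference, the failing operation can likely be reproduced against plain pandas 0.23.x without Spark, assuming the pandas frames in the traceback are the root cause (a sketch, not from the ticket):

{code:python}
# Standalone sketch against plain pandas 0.23.x (no Spark needed), assuming the
# pandas frames in the traceback above are the root cause.
import pandas as pd

b = pd.Series([True, False, True])
b // b.astype(int)  # raises TypeError on pandas 0.23.x; succeeds on >= 0.24.0
{code}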



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37465) PySpark tests failing on Pandas 0.23

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37465:


Assignee: Yikun Jiang  (was: Hyukjin Kwon)

> PySpark tests failing on Pandas 0.23
> 
>
> Key: SPARK-37465
> URL: https://issues.apache.org/jira/browse/SPARK-37465
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Willi Raschkowski
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> I was running Spark tests with Pandas {{0.23.4}} and got the error below. The 
> minimum Pandas version is currently {{0.23.2}} 
> [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. 
> Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix 
> (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222]
>  in Pandas.
> {code:java}
> $ python/run-tests --testnames 
> 'pyspark.pandas.tests.data_type_ops.test_boolean_ops 
> BooleanOpsTest.test_floordiv'
> ...
> ==
> ERROR [5.785s]: test_floordiv 
> (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py",
>  line 128, in test_floordiv
> self.assert_eq(b_pser // b_pser.astype(int), b_psser // 
> b_psser.astype(int))
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1069, in wrapper
> result = safe_na_op(lvalues, rvalues)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1033, in safe_na_op
> return na_op(lvalues, rvalues)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1027, in na_op
> result = missing.fill_zeros(result, x, y, op_name, fill_zeros)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py",
>  line 641, in fill_zeros
> signs = np.sign(y if name.startswith(('r', '__r')) else x)
> TypeError: ufunc 'sign' did not contain a loop with signature matching types 
> dtype('bool') dtype('bool')
> {code}
> These are my relevant package versions:
> {code:java}
> $ conda list | grep -e numpy -e pyarrow -e pandas -e python
> # packages in environment at /home/circleci/miniconda/envs/python3:
> numpy             1.16.6    py36h0a8e133_3
> numpy-base        1.16.6    py36h41b4c56_3
> pandas            0.23.4    py36h04863e7_0
> pyarrow           1.0.1     py36h6200943_36_cpu    conda-forge
> python            3.6.12    hcff3b4d_2             anaconda
> python-dateutil   2.8.1     py_0                   anaconda
> python_abi        3.6       1_cp36m                conda-forg
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37465) PySpark tests failing on Pandas 0.23

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37465:


Assignee: Hyukjin Kwon

> PySpark tests failing on Pandas 0.23
> 
>
> Key: SPARK-37465
> URL: https://issues.apache.org/jira/browse/SPARK-37465
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Willi Raschkowski
>Assignee: Hyukjin Kwon
>Priority: Major
>
> I was running Spark tests with Pandas {{0.23.4}} and got the error below. The 
> minimum Pandas version is currently {{0.23.2}} 
> [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. 
> Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix 
> (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222]
>  in Pandas.
> {code:java}
> $ python/run-tests --testnames 
> 'pyspark.pandas.tests.data_type_ops.test_boolean_ops 
> BooleanOpsTest.test_floordiv'
> ...
> ==
> ERROR [5.785s]: test_floordiv 
> (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py",
>  line 128, in test_floordiv
> self.assert_eq(b_pser // b_pser.astype(int), b_psser // 
> b_psser.astype(int))
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1069, in wrapper
> result = safe_na_op(lvalues, rvalues)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1033, in safe_na_op
> return na_op(lvalues, rvalues)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1027, in na_op
> result = missing.fill_zeros(result, x, y, op_name, fill_zeros)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py",
>  line 641, in fill_zeros
> signs = np.sign(y if name.startswith(('r', '__r')) else x)
> TypeError: ufunc 'sign' did not contain a loop with signature matching types 
> dtype('bool') dtype('bool')
> {code}
> These are my relevant package versions:
> {code:java}
> $ conda list | grep -e numpy -e pyarrow -e pandas -e python
> # packages in environment at /home/circleci/miniconda/envs/python3:
> numpy             1.16.6    py36h0a8e133_3
> numpy-base        1.16.6    py36h41b4c56_3
> pandas            0.23.4    py36h04863e7_0
> pyarrow           1.0.1     py36h6200943_36_cpu    conda-forge
> python            3.6.12    hcff3b4d_2             anaconda
> python-dateutil   2.8.1     py_0                   anaconda
> python_abi        3.6       1_cp36m                conda-forg
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37495) Skip identical index checking of Series.compare when config 'compute.eager_check' is disabled

2021-11-29 Thread dch nguyen (Jira)
dch nguyen created SPARK-37495:
--

 Summary: Skip identical index checking of Series.compare when 
config 'compute.eager_check' is disabled
 Key: SPARK-37495
 URL: https://issues.apache.org/jira/browse/SPARK-37495
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dch nguyen
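A sketch of the expected usage once this lands, assuming the 'compute.eager_check' option introduced by this ticket family (illustrative only; not the actual implementation):

{code:python}
# Illustrative usage sketch; assumes the proposed 'compute.eager_check' option.
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 4, 3]})

ps.set_option("compute.eager_check", False)
try:
    # With eager checking off, Series.compare would skip the identical-index validation.
    print(psdf["a"].compare(psdf["b"]))
finally:
    ps.reset_option("compute.eager_check")
{code}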






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37492) Optimize Orc test code with withAllNativeOrcReaders

2021-11-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37492:
-

Assignee: jiaan.geng

> Optimize Orc test code with withAllNativeOrcReaders
> ---
>
> Key: SPARK-37492
> URL: https://issues.apache.org/jira/browse/SPARK-37492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37492) Optimize Orc test code with withAllNativeOrcReaders

2021-11-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37492.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34723
[https://github.com/apache/spark/pull/34723]

> Optimize Orc test code with withAllNativeOrcReaders
> ---
>
> Key: SPARK-37492
> URL: https://issues.apache.org/jira/browse/SPARK-37492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36850) Migrate CreateTableStatement to v2 command framework

2021-11-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36850.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34060
[https://github.com/apache/spark/pull/34060]

> Migrate CreateTableStatement to v2 command framework
> 
>
> Key: SPARK-36850
> URL: https://issues.apache.org/jira/browse/SPARK-36850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36850) Migrate CreateTableStatement to v2 command framework

2021-11-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36850:
---

Assignee: Huaxin Gao

> Migrate CreateTableStatement to v2 command framework
> 
>
> Key: SPARK-36850
> URL: https://issues.apache.org/jira/browse/SPARK-36850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37452) Char and Varchar breaks backward compatibility between v3 and v2

2021-11-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37452:
--
Fix Version/s: 3.1.3

> Char and Varchar breaks backward compatibility between v3 and v2
> 
>
> Key: SPARK-37452
> URL: https://issues.apache.org/jira/browse/SPARK-37452
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> We store the table schema in table properties so that the read side can restore 
> it. Spark 3.1 added native char/varchar support. In commands such as `create 
> table` and `alter table` with these types, the char(n) or varchar(n) types are 
> stored directly in those properties. If a user then reads such a table with 
> Spark 2, it fails to parse the schema.
> The affected table can be one newly created by Spark 3.1 or later, or an 
> existing one modified by Spark 3.1 and on.
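A minimal sketch of how such a table comes about, assuming a PySpark session (`spark`) on Spark 3.1+ and an illustrative table name; the stored schema property then carries the raw char/varchar types that a Spark 2.x reader cannot parse:

{code:python}
# Sketch only: `spark` is an active SparkSession; the table name is illustrative.
spark.sql("CREATE TABLE char_compat_demo (c CHAR(5), v VARCHAR(10)) USING parquet")
# Spark 3.1+ records char(5)/varchar(10) in the table's stored schema property;
# a Spark 2.x reader then fails to parse that schema string.
{code}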



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37472) Missing functionality in spark.pandas

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37472:
-
Parent: SPARK-36394
Issue Type: Sub-task  (was: New Feature)

> Missing functionality in spark.pandas
> -
>
> Key: SPARK-37472
> URL: https://issues.apache.org/jira/browse/SPARK-37472
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Rens Jochemsen
>Priority: Major
>
> Dear maintainers,
>  
> I am missing the functionality to include local variables in the query method:
> ```
> seg = 'A'
> psdf.query("segment == @seg")
> ```
>  
> or
> ```
> seg = ['A', 'B']
> psdf.query("segment == @seg")
> ```
>  
> Furthermore, I was wondering whether date-offset functionality such as 
> pd.offsets.MonthEnd will be added in future versions.
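Possible workarounds today, assuming `psdf` is the pandas-on-Spark DataFrame from the report (a sketch, not an official recommendation):

{code:python}
# Workaround sketch: boolean indexing instead of query() with @-variables.
seg = "A"
psdf[psdf["segment"] == seg]

segs = ["A", "B"]
psdf[psdf["segment"].isin(segs)]
{code}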



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-29 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-35867.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34611
[https://github.com/apache/spark/pull/34611]

> Enable vectorized read for VectorizedPlainValuesReader.readBooleans
> ---
>
> Key: SPARK-35867
> URL: https://issues.apache.org/jira/browse/SPARK-35867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Kazuyuki Tanimura
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently we decode PLAIN-encoded booleans as follows:
> {code:java}
>   public final void readBooleans(int total, WritableColumnVector c, int 
> rowId) {
> // TODO: properly vectorize this
> for (int i = 0; i < total; i++) {
>   c.putBoolean(rowId + i, readBoolean());
> }
>   }
> {code}
> Ideally we should vectorize this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-29 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-35867:


Assignee: Kazuyuki Tanimura

> Enable vectorized read for VectorizedPlainValuesReader.readBooleans
> ---
>
> Key: SPARK-35867
> URL: https://issues.apache.org/jira/browse/SPARK-35867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Kazuyuki Tanimura
>Priority: Minor
>
> Currently we decode PLAIN-encoded booleans as follows:
> {code:java}
>   public final void readBooleans(int total, WritableColumnVector c, int 
> rowId) {
> // TODO: properly vectorize this
> for (int i = 0; i < total; i++) {
>   c.putBoolean(rowId + i, readBoolean());
> }
>   }
> {code}
> Ideally we should vectorize this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37493) expose driver gc time and duration time

2021-11-29 Thread zhoubin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450822#comment-17450822
 ] 

zhoubin commented on SPARK-37493:
-

Issue resolved by pull request 34749

https://github.com/apache/spark/pull/34749

> expose driver gc time and duration time
> ---
>
> Key: SPARK-37493
> URL: https://issues.apache.org/jira/browse/SPARK-37493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: zhoubin
>Priority: Major
>
> When we browse the executor page in the driver-side or history-server Spark UI, the 
> driver's GC statistics are not readily visible, which makes it hard to decide how to 
> size the driver's resources.
>  
> We can use the application time as the driver's task duration, and expose 
> "TotalGCTime" in ExecutorMetricType alongside JVMHeapMemory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37486) an error occurred while using the udf jars located in the lakefs, an inner filesystem in Tencent Cloud DLC.

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450820#comment-17450820
 ] 

Apache Spark commented on SPARK-37486:
--

User 'kevincmchen' has created a pull request for this issue:
https://github.com/apache/spark/pull/34742

> an error occurred while using the udf jars located in the lakefs, an inner 
> filesystem in Tencent Cloud DLC.
> --
>
> Key: SPARK-37486
> URL: https://issues.apache.org/jira/browse/SPARK-37486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kevin Pis
>Priority: Major
>
> When using Livy to execute SQL statements that call the UDF jars located 
> in lakefs, an inner filesystem in Tencent Cloud DLC, it throws the 
> following exception:
>  
> {code:java}
> 21/11/25 21:12:43 ERROR Session: Exception when executing code
> java.lang.LinkageError: loader constraint violation: loader (instance of 
> sun/misc/Launcher$AppClassLoader) previously initiated loading for a 
> different type with name "com/qcloud/cos/auth/COSCredentials"
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306)
>   at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271)
>   at 
> org.apache.hadoop.conf.Configuration.getClasses(Configuration.java:2344)
>   at 
> org.apache.hadoop.fs.CosNUtils.loadCosProviderClasses(CosNUtils.java:68)
>   at 
> org.apache.hadoop.fs.CosFileSystem.initRangerClientImpl(CosFileSystem.java:848)
>   at org.apache.hadoop.fs.CosFileSystem.initialize(CosFileSystem.java:95)
>   at 
> com.tencent.cloud.fs.CompatibleFileSystem.initialize(CompatibleFileSystem.java:20)
>   at 
> com.tencent.cloud.fs.LakeFileSystem.initialize(LakeFileSystem.java:56)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.FsUrlConnection.connect(FsUrlConnection.java:49)
>   at 
> org.apache.hadoop.fs.FsUrlConnection.getInputStream(FsUrlConnection.java:59)
>   at sun.net.www.protocol.jar.URLJarFile.retrieve(URLJarFile.java:214)
>   at sun.net.www.protocol.jar.URLJarFile.getJarFile(URLJarFile.java:71)
>   at sun.net.www.protocol.jar.JarFileFactory.get(JarFileFactory.java:84)
>   at 
> sun.net.www.protocol.jar.JarURLConnection.connect(JarURLConnection.java:122)
>   at 
> sun.net.www.protocol.jar.JarURLConnection.getJarFile(JarURLConnection.java:89)
>   at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:944)
>   at sun.misc.URLClassPath$JarLoader.access$800(URLClassPath.java:801)
>   at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:886)
>   at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:879)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:878)
>   at sun.misc.URLClassPath$JarLoader.(URLClassPath.java:829)
>   at sun.misc.URLClassPath$3.run(URLClassPath.java:575)
>   at sun.misc.URLClassPath$3.run(URLClassPath.java:565)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at sun.misc.URLClassPath.getLoader(URLClassPath.java:564)
>   at sun.misc.URLClassPath.getLoader(URLClassPath.java:529)
>   at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:494)
>   at sun.misc.URLClassPath.access$100(URLClassPath.java:66)
>   at sun.misc.URLClassPath$1.next(URLClassPath.

[jira] [Assigned] (SPARK-37493) expose driver gc time and duration time

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37493:


Assignee: (was: Apache Spark)

> expose driver gc time and duration time
> ---
>
> Key: SPARK-37493
> URL: https://issues.apache.org/jira/browse/SPARK-37493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: zhoubin
>Priority: Major
>
> When we browse the executor page in the driver-side or history-server Spark UI, the 
> driver's GC statistics are not readily visible, which makes it hard to decide how to 
> size the driver's resources.
>  
> We can use the application time as the driver's task duration, and expose 
> "TotalGCTime" in ExecutorMetricType alongside JVMHeapMemory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37493) expose driver gc time and duration time

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37493:


Assignee: Apache Spark

> expose driver gc time and duration time
> ---
>
> Key: SPARK-37493
> URL: https://issues.apache.org/jira/browse/SPARK-37493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: zhoubin
>Assignee: Apache Spark
>Priority: Major
>
> When we browse the executor page in the driver-side or history-server Spark UI, the 
> driver's GC statistics are not readily visible, which makes it hard to decide how to 
> size the driver's resources.
>  
> We can use the application time as the driver's task duration, and expose 
> "TotalGCTime" in ExecutorMetricType alongside JVMHeapMemory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37493) expose driver gc time and duration time

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450819#comment-17450819
 ] 

Apache Spark commented on SPARK-37493:
--

User 'summaryzb' has created a pull request for this issue:
https://github.com/apache/spark/pull/34749

> expose driver gc time and duration time
> ---
>
> Key: SPARK-37493
> URL: https://issues.apache.org/jira/browse/SPARK-37493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: zhoubin
>Priority: Major
>
> When we browse the executor page in the driver-side or history-server Spark UI, the 
> driver's GC statistics are not readily visible, which makes it hard to decide how to 
> size the driver's resources.
>  
> We can use the application time as the driver's task duration, and expose 
> "TotalGCTime" in ExecutorMetricType alongside JVMHeapMemory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37482) Skip check monotonic increasing for Series.asof with 'compute.eager_check'

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37482:


Assignee: dch nguyen

> Skip check monotonic increasing for Series.asof with 'compute.eager_check'
> --
>
> Key: SPARK-37482
> URL: https://issues.apache.org/jira/browse/SPARK-37482
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37482) Skip check monotonic increasing for Series.asof with 'compute.eager_check'

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37482.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34737
[https://github.com/apache/spark/pull/34737]

> Skip check monotonic increasing for Series.asof with 'compute.eager_check'
> --
>
> Key: SPARK-37482
> URL: https://issues.apache.org/jira/browse/SPARK-37482
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37484) Replace Get and getOrElse with getOrElse

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37484.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34739
[https://github.com/apache/spark/pull/34739]

> Replace Get and getOrElse with getOrElse
> 
>
> Key: SPARK-37484
> URL: https://issues.apache.org/jira/browse/SPARK-37484
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.3.0
>
>
> There are some combined calls of get and getOrElse that can be directly 
> replaced by getOrElse
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37484) Replace Get and getOrElse with getOrElse

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37484:


Assignee: Yang Jie

> Replace Get and getOrElse with getOrElse
> 
>
> Key: SPARK-37484
> URL: https://issues.apache.org/jira/browse/SPARK-37484
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
>
> There are some combined calls of get and getOrElse that can be directly 
> replaced by getOrElse
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37485) Replace map with expressions which produce no result with foreach

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37485.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34740
[https://github.com/apache/spark/pull/34740]

> Replace map with expressions which produce no result with foreach 
> --
>
> Key: SPARK-37485
> URL: https://issues.apache.org/jira/browse/SPARK-37485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.3.0
>
>
> Use foreach instead of  map with expressions which produce no result.
>  
> Before
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}  
> Seq(1, 2).map(functionWithNoReturnValue) {code}
>  
>  
> After
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}   
> Seq(1, 2).foreach(functionWithNoReturnValue) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37485) Replace map with expressions which produce no result with foreach

2021-11-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37485:


Assignee: Yang Jie

> Replace map with expressions which produce no result with foreach 
> --
>
> Key: SPARK-37485
> URL: https://issues.apache.org/jira/browse/SPARK-37485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
>
> Use foreach instead of  map with expressions which produce no result.
>  
> Before
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}  
> Seq(1, 2).map(functionWithNoReturnValue) {code}
>  
>  
> After
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}   
> Seq(1, 2).foreach(functionWithNoReturnValue) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37494) Unify v1 and v2 options output of `SHOW CREATE TABLE` command

2021-11-29 Thread PengLei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PengLei updated SPARK-37494:

Summary: Unify v1 and v2 options output of `SHOW CREATE TABLE` command  
(was: Unify v1 and v2 option output of `SHOW CREATE TABLE` command)

> Unify v1 and v2 options output of `SHOW CREATE TABLE` command
> -
>
> Key: SPARK-37494
> URL: https://issues.apache.org/jira/browse/SPARK-37494
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37483) Support push down top N to JDBC data source V2

2021-11-29 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37483:
---
Summary: Support push down top N to JDBC data source V2  (was: Support 
pushdown down top N to JDBC data source V2)

> Support push down top N to JDBC data source V2
> --
>
> Key: SPARK-37483
> URL: https://issues.apache.org/jira/browse/SPARK-37483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37493) expose driver gc time and duration time

2021-11-29 Thread zhoubin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoubin updated SPARK-37493:

Summary: expose driver gc time and duration time  (was: expose driver gc 
time)

> expose driver gc time and duration time
> ---
>
> Key: SPARK-37493
> URL: https://issues.apache.org/jira/browse/SPARK-37493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: zhoubin
>Priority: Major
>
> When we browse the executor pages of the driver-side or history-server Spark UI,
> the driver's GC statistics are not readily available, which makes it hard to
> decide how to size the driver's resources.
>  
> We can use the application time as the driver's task duration, and add
> "TotalGCTime" to ExecutorMetricType alongside JVMHeapMemory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37494) Unify v1 and v2 option output of `SHOW CREATE TABLE` command

2021-11-29 Thread PengLei (Jira)
PengLei created SPARK-37494:
---

 Summary: Unify v1 and v2 option output of `SHOW CREATE TABLE` 
command
 Key: SPARK-37494
 URL: https://issues.apache.org/jira/browse/SPARK-37494
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: PengLei
 Fix For: 3.3.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37493) expose driver gc time

2021-11-29 Thread zhoubin (Jira)
zhoubin created SPARK-37493:
---

 Summary: expose driver gc time
 Key: SPARK-37493
 URL: https://issues.apache.org/jira/browse/SPARK-37493
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.1
Reporter: zhoubin


When we browse the executor pages of the driver-side or history-server Spark UI,
the driver's GC statistics are not readily available, which makes it hard to
decide how to size the driver's resources.

We can use the application time as the driver's task duration, and add
"TotalGCTime" to ExecutorMetricType alongside JVMHeapMemory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37492) Optimize Orc test code with withAllNativeOrcReaders

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450795#comment-17450795
 ] 

Apache Spark commented on SPARK-37492:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34723

> Optimize Orc test code with withAllNativeOrcReaders
> ---
>
> Key: SPARK-37492
> URL: https://issues.apache.org/jira/browse/SPARK-37492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37492) Optimize Orc test code with withAllNativeOrcReaders

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450794#comment-17450794
 ] 

Apache Spark commented on SPARK-37492:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34723

> Optimize Orc test code with withAllNativeOrcReaders
> ---
>
> Key: SPARK-37492
> URL: https://issues.apache.org/jira/browse/SPARK-37492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37492) Optimize Orc test code with withAllNativeOrcReaders

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37492:


Assignee: Apache Spark

> Optimize Orc test code with withAllNativeOrcReaders
> ---
>
> Key: SPARK-37492
> URL: https://issues.apache.org/jira/browse/SPARK-37492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37492) Optimize Orc test code with withAllNativeOrcReaders

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37492:


Assignee: (was: Apache Spark)

> Optimize Orc test code with withAllNativeOrcReaders
> ---
>
> Key: SPARK-37492
> URL: https://issues.apache.org/jira/browse/SPARK-37492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37492) Optimize Orc test code with withAllNativeOrcReaders

2021-11-29 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-37492:
--

 Summary: Optimize Orc test code with withAllNativeOrcReaders
 Key: SPARK-37492
 URL: https://issues.apache.org/jira/browse/SPARK-37492
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37491) Fix Series.asof when values of the series are not sorted

2021-11-29 Thread dch nguyen (Jira)
dch nguyen created SPARK-37491:
--

 Summary: Fix Series.asof when values of the series are not sorted
 Key: SPARK-37491
 URL: https://issues.apache.org/jira/browse/SPARK-37491
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dch nguyen


https://github.com/apache/spark/pull/34737#discussion_r758223279



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37468) Support ANSI intervals and TimestampNTZ for UnionEstimation

2021-11-29 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-37468.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34716
[https://github.com/apache/spark/pull/34716]

> Support ANSI intervals and TimestampNTZ for UnionEstimation
> ---
>
> Key: SPARK-37468
> URL: https://issues.apache.org/jira/browse/SPARK-37468
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, UnionEstimation doesn't support ANSI intervals and TimestampNTZ. 
> But I think it can support those types because their underlying types are 
> integer or long, which UnionEstimation can compute stats for.
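
For illustration only (not from the ticket): once a column's values map onto an
integer/long representation, as ANSI year-month/day-time intervals and TimestampNTZ do,
estimating the statistics of a union reduces to combining the children's min/max. A
minimal, self-contained Scala sketch of that idea (hypothetical types, not Spark's
actual UnionEstimation code):

{code:java}
// Hypothetical per-column statistics, with values already mapped to Long.
case class ColStats(min: Long, max: Long)

// The min/max of a UNION is the min of the children's mins and the max of their maxes.
def unionStats(children: Seq[ColStats]): ColStats =
  ColStats(children.map(_.min).min, children.map(_.max).max)

// e.g. unionStats(Seq(ColStats(1L, 5L), ColStats(-3L, 2L))) == ColStats(-3L, 5L)
{code}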



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-29 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-37487:
---
Labels: correctness  (was: )

> CollectMetrics is executed twice if it is followed by a sort
> 
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>  Labels: correctness
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-X: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_event",
> min($"id").as("min_val"),
> max($"id").as("max_val"),
> // Test unresolved alias
> sum($"id"),
> count(when($"id" % 2 === 0, 1)).as("num_even"))
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .sort($"id".desc)
> validateObservedMetrics(df)
>   }
> {code}
> The count and sum aggregate report twice the number of rows:
> {code}
> [info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
> *** (169 milliseconds)
> [info]   [0,99,9900,100] did not equal [0,99,4950,50] 
> (DataFrameCallbackSuite.scala:342)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> {code}
> I could not figure out how this happens. Hopefully the UT can help with
> debugging.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37490) Show hint if analyzer fails due to ANSI type coercion

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37490:


Assignee: Gengliang Wang  (was: Apache Spark)

> Show hint if analyzer fails due to ANSI type coercion
> -
>
> Key: SPARK-37490
> URL: https://issues.apache.org/jira/browse/SPARK-37490
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Show a hint in the error message if the analysis fails only under ANSI type
> coercion:
> {code:java}
> To fix the error, you might need to add explicit type casts.
> To bypass the error with lenient type coercion rules, set 
> spark.sql.ansi.enabled as false. {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37490) Show hint if analyzer fails due to ANSI type coercion

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450593#comment-17450593
 ] 

Apache Spark commented on SPARK-37490:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34747

> Show hint if analyzer fails due to ANSI type coercion
> -
>
> Key: SPARK-37490
> URL: https://issues.apache.org/jira/browse/SPARK-37490
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Show a hint in the error message if the analysis fails only under ANSI type
> coercion:
> {code:java}
> To fix the error, you might need to add explicit type casts.
> To bypass the error with lenient type coercion rules, set 
> spark.sql.ansi.enabled as false. {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37490) Show hint if analyzer fails due to ANSI type coercion

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37490:


Assignee: Apache Spark  (was: Gengliang Wang)

> Show hint if analyzer fails due to ANSI type coercion
> -
>
> Key: SPARK-37490
> URL: https://issues.apache.org/jira/browse/SPARK-37490
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Show a hint in the error message if the analysis fails only under ANSI type
> coercion:
> {code:java}
> To fix the error, you might need to add explicit type casts.
> To bypass the error with lenient type coercion rules, set 
> spark.sql.ansi.enabled as false. {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37490) Show hint if analyzer fails due to ANSI type coercion

2021-11-29 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-37490:
--

 Summary: Show hint if analyzer fails due to ANSI type coercion
 Key: SPARK-37490
 URL: https://issues.apache.org/jira/browse/SPARK-37490
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Show a hint in the error message if the analysis fails only under ANSI type coercion:
{code:java}
To fix the error, you might need to add explicit type casts.
To bypass the error with lenient type coercion rules, set 
spark.sql.ansi.enabled as false. {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37489) Skip hasnans check in numops if eager_check disable

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450569#comment-17450569
 ] 

Apache Spark commented on SPARK-37489:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34746

> Skip hasnans check in numops if eager_check disable
> ---
>
> Key: SPARK-37489
> URL: https://issues.apache.org/jira/browse/SPARK-37489
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37489) Skip hasnans check in numops if eager_check disable

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450567#comment-17450567
 ] 

Apache Spark commented on SPARK-37489:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34746

> Skip hasnans check in numops if eager_check disable
> ---
>
> Key: SPARK-37489
> URL: https://issues.apache.org/jira/browse/SPARK-37489
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37489) Skip hasnans check in numops if eager_check disable

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37489:


Assignee: (was: Apache Spark)

> Skip hasnans check in numops if eager_check disable
> ---
>
> Key: SPARK-37489
> URL: https://issues.apache.org/jira/browse/SPARK-37489
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37489) Skip hasnans check in numops if eager_check disable

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37489:


Assignee: Apache Spark

> Skip hasnans check in numops if eager_check disable
> ---
>
> Key: SPARK-37489
> URL: https://issues.apache.org/jira/browse/SPARK-37489
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450560#comment-17450560
 ] 

Apache Spark commented on SPARK-37391:
--

User 'tdg5' has created a pull request for this issue:
https://github.com/apache/spark/pull/34745

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450559#comment-17450559
 ] 

Apache Spark commented on SPARK-37391:
--

User 'tdg5' has created a pull request for this issue:
https://github.com/apache/spark/pull/34745

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37391:


Assignee: Apache Spark

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Assignee: Apache Spark
>Priority: Major
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37391:


Assignee: (was: Apache Spark)

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> 
>
> Key: SPARK-37391
> URL: https://issues.apache.org/jira/browse/SPARK-37391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
> Environment: N/A
>Reporter: Danny Guinther
>Priority: Major
> Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered the reality that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >3.1.x because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37489) Skip hasnans check in numops if eager_check disable

2021-11-29 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-37489:
---

 Summary: Skip hasnans check in numops if eager_check disable
 Key: SPARK-37489
 URL: https://issues.apache.org/jira/browse/SPARK-37489
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37454) support expressions in time travel timestamp

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450553#comment-17450553
 ] 

Apache Spark commented on SPARK-37454:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34744

> support expressions in time travel timestamp
> 
>
> Key: SPARK-37454
> URL: https://issues.apache.org/jira/browse/SPARK-37454
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37454) support expressions in time travel timestamp

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450552#comment-17450552
 ] 

Apache Spark commented on SPARK-37454:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34744

> support expressions in time travel timestamp
> 
>
> Key: SPARK-37454
> URL: https://issues.apache.org/jira/browse/SPARK-37454
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37259) JDBC read is always going to wrap the query in a select statement

2021-11-29 Thread Kevin Appel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450525#comment-17450525
 ] 

Kevin Appel commented on SPARK-37259:
-

[~petertoth] [~akhalymon] Thank you both for working on these patches. It took me a
little while to figure out how to test them, but I got Spark 3.3.0-SNAPSHOT compiled,
added each of your changes to a separate working copy, recompiled spark-sql, and was
able to test both changes. I added comments on the GitHub pull requests describing how
the testing went so far.

> JDBC read is always going to wrap the query in a select statement
> -
>
> Key: SPARK-37259
> URL: https://issues.apache.org/jira/browse/SPARK-37259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kevin Appel
>Priority: Major
>
> The JDBC read wraps the query it sends to the database server inside a select
> statement, and there is currently no way to override this.
> Initially I ran into this issue when trying to run a CTE query against SQL
> Server, which fails; the details of the failure are in these cases:
> [https://github.com/microsoft/mssql-jdbc/issues/1340]
> [https://github.com/microsoft/mssql-jdbc/issues/1657]
> [https://github.com/microsoft/sql-spark-connector/issues/147]
> https://issues.apache.org/jira/browse/SPARK-32825
> https://issues.apache.org/jira/browse/SPARK-34928
> I started to patch the code to get the query to run and ran into a few
> different items; if there is a way to add these features to allow this code
> path to run, it would be extremely helpful for running these types of edge-case
> queries. These are basic examples; the actual queries are much more
> complex and would require significant time to rewrite.
> Inside JDBCOptions.scala the query is set to one of the following; using dbtable
> allows the query to be passed without modification
>  
> {code:java}
> name.trim
> or
> s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"
> {code}
>  
> Inside JDBCRelation.scala this tries to get the schema for the query, which
> ends up running dialect.getSchemaQuery, which does:
> {code:java}
> s"SELECT * FROM $table WHERE 1=0"{code}
> Overriding the dialect here and initially just passing back the $table gets
> past this step, which leads to the next issue, in the compute function in
> JDBCRDD.scala
>  
> {code:java}
> val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} 
> $myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause"
>  
> {code}
>  
> For these two queries (the CTE query and the temp-table query), finding out
> the schema is difficult without actually running the query, and for the temp
> table, running it during the schema check creates the table, which then makes
> the actual query fail.
>  
> The way I patched these is by doing these two items:
> JDBCRDD.scala (compute)
>  
> {code:java}
>     val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", 
> "false").toBoolean
>     val sqlText = if (runQueryAsIs) {
>       s"${options.tableOrQuery}"
>     } else {
>       s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause"
>     }
> {code}
> JDBCRelation.scala (getSchema)
> {code:java}
> val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", 
> "false").toBoolean
>     if (useCustomSchema) {
>       val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", 
> "").toString
>       val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema)
>       logInfo(s"Going to return the new $newSchema because useCustomSchema is 
> $useCustomSchema and passed in $myCustomSchema")
>       newSchema
>     } else {
>       val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
>       jdbcOptions.customSchema match {
>       case Some(customSchema) => JdbcUtils.getCustomSchema(
>         tableSchema, customSchema, resolver)
>       case None => tableSchema
>       }
>     }{code}
>  
> This allows the query to run as-is, by using the dbtable option and then
> providing a custom schema that bypasses the dialect schema check
>  
> Test queries
>  
> {code:java}
> query1 = """ 
> SELECT 1 as DummyCOL
> """
> query2 = """ 
> WITH DummyCTE AS
> (
> SELECT 1 as DummyCOL
> )
> SELECT *
> FROM DummyCTE
> """
> query3 = """
> (SELECT *
> INTO #Temp1a
> FROM
> (SELECT @@VERSION as version) data
> )
> (SELECT *
> FROM
> #Temp1a)
> """
> {code}
>  
> Test schema
>  
> {code:java}
> schema1 = """
> DummyXCOL INT
> """
> schema2 = """
> DummyXCOL STRING
> """
> {code}
>  
> Test code
>  
> {code:java}
> jdbcDFWorking = (
>     spark.read.format("jdbc")
>     .option("url", 
> f"jdbc:sqlserver://{server}:{port};dat

[jira] [Updated] (SPARK-37484) Replace Get and getOrElse with getOrElse

2021-11-29 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37484:
-
Priority: Trivial  (was: Minor)

> Replace Get and getOrElse with getOrElse
> 
>
> Key: SPARK-37484
> URL: https://issues.apache.org/jira/browse/SPARK-37484
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Trivial
>
> There are some combined calls of get and getOrElse that can be directly 
> replaced by getOrElse
>  
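
For illustration only (not from the ticket), the pattern in question on a plain Scala
Map; the key and default below are made up for the example:

{code:java}
val conf = Map("executor.cores" -> "4")

// Before: chained Option lookup
val cores1 = conf.get("executor.cores").getOrElse("1")

// After: a single call with a default
val cores2 = conf.getOrElse("executor.cores", "1")
{code}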



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37485) Replace map with expressions which produce no result with foreach

2021-11-29 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37485:
-
Priority: Trivial  (was: Minor)

> Replace map with expressions which produce no result with foreach 
> --
>
> Key: SPARK-37485
> URL: https://issues.apache.org/jira/browse/SPARK-37485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Trivial
>
> Use foreach instead of map when the mapped expression produces no result.
>  
> Before
>  
> {code:java}
> def functionWithNoReturnValue(i: Int): Unit = {}
> Seq(1, 2).map(functionWithNoReturnValue) {code}
>  
>  
> After
>  
> {code:java}
> def functionWithNoReturnValue(i: Int): Unit = {}
> Seq(1, 2).foreach(functionWithNoReturnValue) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37488) With enough resources, the task may still be permanently pending

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450492#comment-17450492
 ] 

Apache Spark commented on SPARK-37488:
--

User 'guiyanakuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34743

> With enough resources, the task may still be permanently pending
> 
>
> Key: SPARK-37488
> URL: https://issues.apache.org/jira/browse/SPARK-37488
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
> Environment: Spark 3.1.2,Default Configuration
>Reporter: Yiqun Zhang
>Priority: Major
>
> {code:java}
> // The online environment is actually hive partition data imported to tidb, 
> the code logic can be simplified as follows
> SparkSession testApp = SparkSession.builder()
> .master("local[*]")
> .appName("test app")
> .enableHiveSupport()
>     .getOrCreate();
> Dataset dataset = testApp.sql("select * from default.test where dt = 
> '20211129'");
> dataset.persist(StorageLevel.MEMORY_AND_DISK());
> dataset.count();
> {code}
> I have observed that tasks are permanently blocked, and the problem can always be
> reproduced on rerun.
> Since it is only reproducible online, I use the arthas runtime to see the 
> status of the function entries and returns within the TaskSetManager.
> https://gist.github.com/guiyanakuang/431584f191645513552a937d16ae8fbd
> At the NODE_LOCAL level, because the persist function is called,
> pendingTasks.forHost holds a collection of pending tasks, but they point to the
> machine where the block of partitioned data is located, while the only
> resource Spark gets is the driver, so the tasks cannot be scheduled there.
> getAllowedLocalityLevel returns the wrong locality level, so the tasks cannot be
> run with TaskLocality.Any.
> The task stays pending permanently because the scheduling time is very short and
> it is too late to raise the locality level via the timeout.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37488) With enough resources, the task may still be permanently pending

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37488:


Assignee: (was: Apache Spark)

> With enough resources, the task may still be permanently pending
> 
>
> Key: SPARK-37488
> URL: https://issues.apache.org/jira/browse/SPARK-37488
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
> Environment: Spark 3.1.2,Default Configuration
>Reporter: Yiqun Zhang
>Priority: Major
>
> {code:java}
> // The online environment is actually hive partition data imported to tidb, 
> the code logic can be simplified as follows
> SparkSession testApp = SparkSession.builder()
> .master("local[*]")
> .appName("test app")
> .enableHiveSupport()
> .getOrCreate();
> Dataset dataset = testApp.sql("select * from default.test where dt = 
> '20211129'");
> dataset.persist(StorageLevel.MEMORY_AND_DISK());
> dataset.count();
> {code}
> I have observed that tasks are permanently blocked, and the problem can always be
> reproduced on rerun.
> Since it is only reproducible online, I use the arthas runtime to see the 
> status of the function entries and returns within the TaskSetManager.
> https://gist.github.com/guiyanakuang/431584f191645513552a937d16ae8fbd
> At the NODE_LOCAL level, because the persist function is called,
> pendingTasks.forHost holds a collection of pending tasks, but they point to the
> machine where the block of partitioned data is located, while the only
> resource Spark gets is the driver, so the tasks cannot be scheduled there.
> getAllowedLocalityLevel returns the wrong locality level, so the tasks cannot be
> run with TaskLocality.Any.
> The task stays pending permanently because the scheduling time is very short and
> it is too late to raise the locality level via the timeout.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37488) With enough resources, the task may still be permanently pending

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37488:


Assignee: Apache Spark

> With enough resources, the task may still be permanently pending
> 
>
> Key: SPARK-37488
> URL: https://issues.apache.org/jira/browse/SPARK-37488
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
> Environment: Spark 3.1.2,Default Configuration
>Reporter: Yiqun Zhang
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> // The online environment is actually hive partition data imported to tidb, 
> the code logic can be simplified as follows
> SparkSession testApp = SparkSession.builder()
> .master("local[*]")
> .appName("test app")
> .enableHiveSupport()
> .getOrCreate();
> Dataset dataset = testApp.sql("select * from default.test where dt = 
> '20211129'");
> dataset.persist(StorageLevel.MEMORY_AND_DISK());
> dataset.count();
> {code}
> I have observed that tasks are permanently blocked, and the problem can always be
> reproduced on rerun.
> Since it is only reproducible online, I use the arthas runtime to see the 
> status of the function entries and returns within the TaskSetManager.
> https://gist.github.com/guiyanakuang/431584f191645513552a937d16ae8fbd
> At the NODE_LOCAL level, because the persist function is called,
> pendingTasks.forHost holds a collection of pending tasks, but they point to the
> machine where the block of partitioned data is located, while the only
> resource Spark gets is the driver, so the tasks cannot be scheduled there.
> getAllowedLocalityLevel returns the wrong locality level, so the tasks cannot be
> run with TaskLocality.Any.
> The task stays pending permanently because the scheduling time is very short and
> it is too late to raise the locality level via the timeout.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37488) With enough resources, the task may still be permanently pending

2021-11-29 Thread Yiqun Zhang (Jira)
Yiqun Zhang created SPARK-37488:
---

 Summary: With enough resources, the task may still be permanently 
pending
 Key: SPARK-37488
 URL: https://issues.apache.org/jira/browse/SPARK-37488
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 3.2.0, 3.1.2, 3.0.3
 Environment: Spark 3.1.2,Default Configuration
Reporter: Yiqun Zhang


{code:java}
// The online environment is actually hive partition data imported to tidb, the 
code logic can be simplified as follows
SparkSession testApp = SparkSession.builder()
.master("local[*]")
.appName("test app")
.enableHiveSupport()
.getOrCreate();
Dataset dataset = testApp.sql("select * from default.test where dt = 
'20211129'");
dataset.persist(StorageLevel.MEMORY_AND_DISK());
dataset.count();
{code}

I have observed that tasks are permanently blocked, and the problem can always be
reproduced on rerun.

Since it is only reproducible online, I use the arthas runtime to see the 
status of the function entries and returns within the TaskSetManager.
https://gist.github.com/guiyanakuang/431584f191645513552a937d16ae8fbd

At the NODE_LOCAL level, because the persist function is called,
pendingTasks.forHost holds a collection of pending tasks, but they point to the
machine where the block of partitioned data is located, while the only resource
Spark gets is the driver, so the tasks cannot be scheduled there.
getAllowedLocalityLevel returns the wrong locality level, so the tasks cannot be
run with TaskLocality.Any.

The task stays pending permanently because the scheduling time is very short and
it is too late to raise the locality level via the timeout.
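
For illustration only (not Spark's actual TaskSetManager code): the timeout-based
escalation referred to above can be modeled as an allowed locality level that is only
raised when no task has launched for longer than a configured wait. A minimal sketch;
the names and signature below are made up for the example:

{code:java}
// Locality levels ordered from most local to least local.
val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")

// Raise the allowed level by one step only if the wait has elapsed since the
// last task launch; otherwise stay at the current level.
def allowedLevel(lastLaunchMs: Long, nowMs: Long, waitMs: Long, current: Int): Int =
  if (nowMs - lastLaunchMs >= waitMs && current < levels.size - 1) current + 1
  else current
{code}

If the level never reaches ANY, tasks whose preferred host offers no resources can
stay pending, which is the behaviour this report describes.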




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-29 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-37487:
---
Summary: CollectMetrics is executed twice if it is followed by a sort  
(was: CollectMetrics is executed twice if it is followed by an sort)

> CollectMetrics is executed twice if it is followed by a sort
> 
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-X: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_event",
> min($"id").as("min_val"),
> max($"id").as("max_val"),
> // Test unresolved alias
> sum($"id"),
> count(when($"id" % 2 === 0, 1)).as("num_even"))
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .sort($"id".desc)
> validateObservedMetrics(df)
>   }
> {code}
> The count and sum aggregate report twice the number of rows:
> {code}
> [info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
> *** (169 milliseconds)
> [info]   [0,99,9900,100] did not equal [0,99,4950,50] 
> (DataFrameCallbackSuite.scala:342)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> {code}
> I could not figure out how this happens. Hopefully the UT can help with
> debugging.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37487) CollectMetrics is executed twice if it is followed by an sort

2021-11-29 Thread Tanel Kiis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450452#comment-17450452
 ] 

Tanel Kiis commented on SPARK-37487:


[~cloud_fan] and [~sarutak], you helped with the last CollectMetrics bug. 
Perhaps you have some idea, why this is happening.


> CollectMetrics is executed twice if it is followed by an sort
> -
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-X: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_event",
> min($"id").as("min_val"),
> max($"id").as("max_val"),
> // Test unresolved alias
> sum($"id"),
> count(when($"id" % 2 === 0, 1)).as("num_even"))
>   .observe(
> name = "other_event",
> avg($"id").cast("int").as("avg_val"))
>   .sort($"id".desc)
> validateObservedMetrics(df)
>   }
> {code}
> The count and sum aggregate report twice the number of rows:
> {code}
> [info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
> *** (169 milliseconds)
> [info]   [0,99,9900,100] did not equal [0,99,4950,50] 
> (DataFrameCallbackSuite.scala:342)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
> [info]   at 
> org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> {code}
> I could not figure out how this happens. Hopefully the UT can help with
> debugging.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37487) CollectMetrics is executed twice if it is followed by an sort

2021-11-29 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-37487:
---
Description: 
It is best exemplified by this new UT in DataFrameCallbackSuite:
{code}
  test("SPARK-X: get observable metrics with sort by callback") {
val df = spark.range(100)
  .observe(
name = "my_event",
min($"id").as("min_val"),
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
count(when($"id" % 2 === 0, 1)).as("num_even"))
  .observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
  .sort($"id".desc)

validateObservedMetrics(df)
  }
{code}

The count and sum aggregate reports twice the number of rows:
{code}
[info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
*** (169 milliseconds)
[info]   [0,99,9900,100] did not equal [0,99,4950,50] 
(DataFrameCallbackSuite.scala:342)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
{code}

I could not figure out how this happens. Hopefully the UT can help with debugging.

  was:
It is best examplified by this new UT in DataFrameCallbackSuite:
{code}
  test("SPARK-X: get observable metrics with sort by callback") {
val df = spark.range(100)
  .observe(
name = "my_event",
min($"id").as("min_val"),
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
count(when($"id" % 2 === 0, 1)).as("num_even"))
  .observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
  .sort($"id".desc)

validateObservedMetrics(df)
  }
{code}

The count aggregate reports twice the number of rows:
{code}
[info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
*** (169 milliseconds)
[info]   [0,99,9900,100] did not equal [0,99,4950,50] 
(DataFrameCallbackSuite.scala:342)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
{code}

I could not figure out how this happes. Hopefully the UT can help with debugging


> CollectMetrics is executed twice if it is followed by an sort
> -
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-X: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_e

[jira] [Updated] (SPARK-37487) CollectMetrics is executed twice if it is followed by an sort

2021-11-29 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-37487:
---
Description: 
It is best exemplified by this new UT in DataFrameCallbackSuite:
{code}
  test("SPARK-X: get observable metrics with sort by callback") {
val df = spark.range(100)
  .observe(
name = "my_event",
min($"id").as("min_val"),
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
count(when($"id" % 2 === 0, 1)).as("num_even"))
  .observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
  .sort($"id".desc)

validateObservedMetrics(df)
  }
{code}

The count aggregate reports twice the number of rows:
{code}
[info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
*** (169 milliseconds)
[info]   [0,99,9900,100] did not equal [0,99,4950,50] 
(DataFrameCallbackSuite.scala:342)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
{code}

I could not figure out how this happens. Hopefully the UT can help with debugging.

  was:
It is best exemplified by this new UT in DataFrameCallbackSuite:
{code}
  test("SPARK-X: get observable metrics with sort by callback") {
val df = spark.range(100)
  .observe(
name = "my_event",
min($"id").as("min_val"),
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
count(when($"id" % 2 === 0, 1)).as("num_even"))
  .observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
  .sort($"id".desc)

validateObservedMetrics(df)
  }
{code}

The count aggregate reports twice the number of rows:
{code}
[info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
*** (169 milliseconds)
[info]   [0,99,9900,100] did not equal [0,99,4950,50] 
(DataFrameCallbackSuite.scala:342)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
{code}

I could not figure out how this happens. Hopefully the UT can help with debugging.


> CollectMetrics is executed twice if it is followed by a sort
> -
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-X: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name = "my_event",
>

[jira] [Updated] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-29 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-37487:
---
Description: 
It is best exemplified by this new UT in DataFrameCallbackSuite:
{code}
  test("SPARK-X: get observable metrics with sort by callback") {
val df = spark.range(100)
  .observe(
name = "my_event",
min($"id").as("min_val"),
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
count(when($"id" % 2 === 0, 1)).as("num_even"))
  .observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
  .sort($"id".desc)

validateObservedMetrics(df)
  }
{code}

The count and sum aggregates report twice the number of rows:
{code}
[info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
*** (169 milliseconds)
[info]   [0,99,9900,100] did not equal [0,99,4950,50] 
(DataFrameCallbackSuite.scala:342)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
{code}

I could not figure out how this happens. Hopefully the UT can help with debugging.

  was:
It is best exemplified by this new UT in DataFrameCallbackSuite:
{code}
  test("SPARK-X: get observable metrics with sort by callback") {
val df = spark.range(100)
  .observe(
name = "my_event",
min($"id").as("min_val"),
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
count(when($"id" % 2 === 0, 1)).as("num_even"))
  .observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
  .sort($"id".desc)

validateObservedMetrics(df)
  }
{code}

The count and sum aggregates report twice the number of rows:
{code}
[info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
*** (169 milliseconds)
[info]   [0,99,9900,100] did not equal [0,99,4950,50] 
(DataFrameCallbackSuite.scala:342)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
{code}

I could not figure out how this happens. Hopefully the UT can help with debugging.


> CollectMetrics is executed twice if it is followed by a sort
> -
>
> Key: SPARK-37487
> URL: https://issues.apache.org/jira/browse/SPARK-37487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Major
>
> It is best exemplified by this new UT in DataFrameCallbackSuite:
> {code}
>   test("SPARK-X: get observable metrics with sort by callback") {
> val df = spark.range(100)
>   .observe(
> name 

[jira] [Created] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort

2021-11-29 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-37487:
--

 Summary: CollectMetrics is executed twice if it is followed by a sort
 Key: SPARK-37487
 URL: https://issues.apache.org/jira/browse/SPARK-37487
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Tanel Kiis


It is best exemplified by this new UT in DataFrameCallbackSuite:
{code}
  test("SPARK-X: get observable metrics with sort by callback") {
val df = spark.range(100)
  .observe(
name = "my_event",
min($"id").as("min_val"),
max($"id").as("max_val"),
// Test unresolved alias
sum($"id"),
count(when($"id" % 2 === 0, 1)).as("num_even"))
  .observe(
name = "other_event",
avg($"id").cast("int").as("avg_val"))
  .sort($"id".desc)

validateObservedMetrics(df)
  }
{code}

The count aggregate reports twice the number of rows:
{code}
[info] - SPARK-X: get observable metrics with sort by callback *** FAILED 
*** (169 milliseconds)
[info]   [0,99,9900,100] did not equal [0,99,4950,50] 
(DataFrameCallbackSuite.scala:342)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350)
[info]   at 
org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
{code}

I could not figure out how this happens. Hopefully the UT can help with debugging.
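
For anyone who wants to watch the metric outside the test suite, here is a minimal spark-shell style sketch (not taken from DataFrameCallbackSuite; the observation name and listener here are illustrative) that surfaces the observed metrics through a QueryExecutionListener:

{code}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.functions._
import org.apache.spark.sql.util.QueryExecutionListener

// Print the observed metrics of every successful action.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    qe.observedMetrics.get("my_event").foreach(row => println(s"my_event -> $row"))
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
})

spark.range(100)
  .observe("my_event", count(lit(1)).as("rows_seen"))
  .sort(col("id").desc)
  .collect()
// Expected: rows_seen = 100. If CollectMetrics is executed twice, 200 is reported instead.
{code}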



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses UTC time zone

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450396#comment-17450396
 ] 

Apache Spark commented on SPARK-37463:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34741

> Read/Write Timestamp ntz from/to Orc uses UTC time zone
> ---
>
> Key: SPARK-37463
> URL: https://issues.apache.org/jira/browse/SPARK-37463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Here is some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp 
> '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output shown below looks strange.
> Time zone is America/Los_Angeles
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +-------+-------------------+-------------------+
> Time zone is UTC
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +-------+-------------------+-------------------+
> Time zone is Europe/Amsterdam
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +-------+-------------------+-------------------+
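
As extra context, a minimal spark-shell style sketch (not part of the report; it assumes the ts_ntz_orc directory written by the snippet above, and the helper name is made up) of the invariant that is being violated: the TIMESTAMP_NTZ column should read back from ORC unchanged no matter which session time zone is active.

{code}
import java.util.TimeZone
import org.apache.spark.sql.functions.col

// Illustrative helper: read the ntz column back under a given session time zone.
def ntzReadBack(path: String, tz: String): String = {
  TimeZone.setDefault(TimeZone.getTimeZone(tz))
  spark.conf.set("spark.sql.session.timeZone", tz)
  spark.read.orc(path).select(col("ts_ntz").cast("string")).head().getString(0)
}

val readings = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
  .map(tz => ntzReadBack("ts_ntz_orc", tz))

// Expected: all three readings are "2021-06-01 00:00:00".
// With the bug, the ORC reading shifts with the zone (e.g. "2021-05-31 17:00:00" under UTC).
assert(readings.distinct.size == 1)
{code}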



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses UTC time zone

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450395#comment-17450395
 ] 

Apache Spark commented on SPARK-37463:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34741

> Read/Write Timestamp ntz from/to Orc uses UTC time zone
> ---
>
> Key: SPARK-37463
> URL: https://issues.apache.org/jira/browse/SPARK-37463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Here is some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp 
> '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output shown below looks strange.
> Time zone is America/Los_Angeles
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +-------+-------------------+-------------------+
> Time zone is UTC
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +-------+-------------------+-------------------+
> Time zone is Europe/Amsterdam
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +-------+-------------------+-------------------+



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses UTC timestamp

2021-11-29 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37463:
---
Summary: Read/Write Timestamp ntz from/to Orc uses UTC timestamp  (was: 
Read/Write Timestamp ntz to Orc uses UTC timestamp)

> Read/Write Timestamp ntz from/to Orc uses UTC timestamp
> ---
>
> Key: SPARK-37463
> URL: https://issues.apache.org/jira/browse/SPARK-37463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Here is some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp 
> '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output shown below looks strange.
> Time zone is America/Los_Angeles
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +-------+-------------------+-------------------+
> Time zone is UTC
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +-------+-------------------+-------------------+
> Time zone is Europe/Amsterdam
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +-------+-------------------+-------------------+



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses UTC time zone

2021-11-29 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37463:
---
Summary: Read/Write Timestamp ntz from/to Orc uses UTC time zone  (was: 
Read/Write Timestamp ntz from/to Orc uses UTC timestamp)

> Read/Write Timestamp ntz from/to Orc uses UTC time zone
> ---
>
> Key: SPARK-37463
> URL: https://issues.apache.org/jira/browse/SPARK-37463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Here is some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp 
> '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output shown below looks strange.
> Time zone is America/Los_Angeles
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +-------+-------------------+-------------------+
> Time zone is UTC
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +-------+-------------------+-------------------+
> Time zone is Europe/Amsterdam
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +-------+-------------------+-------------------+



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37463) Read/Write Timestamp ntz to Orc uses UTC timestamp

2021-11-29 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37463:
---
Summary: Read/Write Timestamp ntz to Orc uses UTC timestamp  (was: 
Read/Write Timestamp ntz or ltz to Orc uses UTC timestamp)

> Read/Write Timestamp ntz to Orc uses UTC timestamp
> --
>
> Key: SPARK-37463
> URL: https://issues.apache.org/jira/browse/SPARK-37463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Here is some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp 
> '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output shown below looks strange.
> Time zone is America/Los_Angeles
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +-------+-------------------+-------------------+
> Time zone is UTC
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +-------+-------------------+-------------------+
> Time zone is Europe/Amsterdam
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +-------+-------------------+-------------------+



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37485) Replace map with expressions which produce no result with foreach

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37485:


Assignee: (was: Apache Spark)

> Replace map with expressions which produce no result with foreach 
> --
>
> Key: SPARK-37485
> URL: https://issues.apache.org/jira/browse/SPARK-37485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Use foreach instead of map for expressions which produce no result.
>  
> Before
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}  
> Seq(1, 2).map(functionWithNoReturnValue) {code}
>  
>  
> After
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}   
> Seq(1, 2).foreach(functionWithNoReturnValue) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37485) Replace map with expressions which produce no result with foreach

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450390#comment-17450390
 ] 

Apache Spark commented on SPARK-37485:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34740

> Replace map with expressions which produce no result with foreach 
> --
>
> Key: SPARK-37485
> URL: https://issues.apache.org/jira/browse/SPARK-37485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Use foreach instead of map for expressions which produce no result.
>  
> Before
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}  
> Seq(1, 2).map(functionWithNoReturnValue) {code}
>  
>  
> After
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}   
> Seq(1, 2).foreach(functionWithNoReturnValue) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37485) Replace map with expressions which produce no result with foreach

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37485:


Assignee: Apache Spark

> Replace map with expressions which produce no result with foreach 
> --
>
> Key: SPARK-37485
> URL: https://issues.apache.org/jira/browse/SPARK-37485
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Use foreach instead of map for expressions which produce no result.
>  
> Before
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}  
> Seq(1, 2).map(functionWithNoReturnValue) {code}
>  
>  
> After
>  
> {code:java}
> def functionWithNoReturnValue: Unit = {}   
> Seq(1, 2).foreach(functionWithNoReturnValue) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37485) Replace map with expressions which produce no result with foreach

2021-11-29 Thread Yang Jie (Jira)
Yang Jie created SPARK-37485:


 Summary: Replace map with expressions which produce no result with 
foreach 
 Key: SPARK-37485
 URL: https://issues.apache.org/jira/browse/SPARK-37485
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yang Jie


Use foreach instead of map for expressions which produce no result.

 

Before

 
{code:java}
def functionWithNoReturnValue: Unit = {}  
Seq(1, 2).map(functionWithNoReturnValue) {code}
 

 

After

 
{code:java}
def functionWithNoReturnValue: Unit = {}   
Seq(1, 2).foreach(functionWithNoReturnValue) {code}
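
(A note on the motivation: Seq#map builds and returns a new collection of the mapped results, so mapping a Unit-returning function materializes a throwaway Seq[Unit]; foreach states the side-effecting intent directly and avoids that extra collection.)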
 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37484) Replace Get and getOrElse with getOrElse

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37484:


Assignee: (was: Apache Spark)

> Replace Get and getOrElse with getOrElse
> 
>
> Key: SPARK-37484
> URL: https://issues.apache.org/jira/browse/SPARK-37484
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are some combined calls of get and getOrElse that can be directly 
> replaced by getOrElse
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37484) Replace Get and getOrElse with getOrElse

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450375#comment-17450375
 ] 

Apache Spark commented on SPARK-37484:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34739

> Replace Get and getOrElse with getOrElse
> 
>
> Key: SPARK-37484
> URL: https://issues.apache.org/jira/browse/SPARK-37484
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> There are some combined calls of get and getOrElse that can be directly 
> replaced by getOrElse
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37484) Replace Get and getOrElse with getOrElse

2021-11-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37484:


Assignee: Apache Spark

> Replace Get and getOrElse with getOrElse
> 
>
> Key: SPARK-37484
> URL: https://issues.apache.org/jira/browse/SPARK-37484
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> There are some combined calls of get and getOrElse that can be directly 
> replaced by getOrElse
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37484) Replace Get and getOrElse with getOrElse

2021-11-29 Thread Yang Jie (Jira)
Yang Jie created SPARK-37484:


 Summary: Replace Get and getOrElse with getOrElse
 Key: SPARK-37484
 URL: https://issues.apache.org/jira/browse/SPARK-37484
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.3.0
Reporter: Yang Jie


There are some combined calls of get and getOrElse that can be directly 
replaced by getOrElse
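
A small illustrative sketch of the pattern (the map and key below are made-up examples, not taken from the Spark code this ticket touches):

{code:java}
val settings: Map[String, String] = Map("spark.app.name" -> "demo")

// Before: Map#get returns an Option, and Option#getOrElse supplies the default
val before = settings.get("spark.app.name").getOrElse("unknown")

// After: Map#getOrElse does the lookup and the default in a single call
val after = settings.getOrElse("spark.app.name", "unknown")

assert(before == after)
{code}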

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37483) Support push down top N to JDBC data source V2

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450316#comment-17450316
 ] 

Apache Spark commented on SPARK-37483:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34738

> Support push down top N to JDBC data source V2
> --
>
> Key: SPARK-37483
> URL: https://issues.apache.org/jira/browse/SPARK-37483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
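
The ticket has no description yet. As a sketch of the query shape the feature targets ("top N" meaning an ORDER BY followed by a LIMIT; the connection options, table and column names below are assumed for illustration only):

{code}
import org.apache.spark.sql.functions.col

val salesDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/shop")  // assumed connection details
  .option("dbtable", "sales")
  .load()

// Top N: a sort followed by a limit. With pushdown the remote database could execute
// roughly SELECT * FROM sales ORDER BY amount DESC LIMIT 10, instead of Spark
// fetching the whole table and sorting it itself.
val top10 = salesDF.orderBy(col("amount").desc).limit(10)
top10.show()
{code}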




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37482) Skip check monotonic increasing for Series.asof with 'compute.eager_check'

2021-11-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450315#comment-17450315
 ] 

Apache Spark commented on SPARK-37482:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34737

> Skip check monotonic increasing for Series.asof with 'compute.eager_check'
> --
>
> Key: SPARK-37482
> URL: https://issues.apache.org/jira/browse/SPARK-37482
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


