[jira] [Created] (SPARK-47991) Arrange the test cases for window frames and window functions.

2024-04-25 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-47991:
--

 Summary: Arrange the test cases for window frames and window 
functions.
 Key: SPARK-47991
 URL: https://issues.apache.org/jira/browse/SPARK-47991
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47795) Supplement the doc of job schedule for K8S

2024-04-10 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-47795:
--

 Summary: Supplement the doc of job schedule for K8S
 Key: SPARK-47795
 URL: https://issues.apache.org/jira/browse/SPARK-47795
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47391) Remove the test case workaround for JDK 8

2024-03-14 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-47391:
--

 Summary: Remove the test case workaround for JDK 8
 Key: SPARK-47391
 URL: https://issues.apache.org/jira/browse/SPARK-47391
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng


The Spark SQL test case in ExpressionEncoderSuite fails on the Windows operating system.

{code:java}
Internal error (java.io.FileNotFoundException): 
D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class
 (The filename, directory name, or volume label syntax is incorrect.)
java.io.FileNotFoundException: 
D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class
 (The filename, directory name, or volume label syntax is incorrect.)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:216)
at java.base/java.io.FileInputStream.(FileInputStream.java:157)
at 
com.intellij.openapi.util.io.FileUtil.loadFileBytes(FileUtil.java:211)
at 
org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.$anonfun$getContent$1(LazyCompiledClass.scala:18)
at scala.Option.getOrElse(Option.scala:201)
at 
org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.getContent(LazyCompiledClass.scala:17)
at 
org.jetbrains.jps.incremental.instrumentation.BaseInstrumentingBuilder.performBuild(BaseInstrumentingBuilder.java:38)
at 
org.jetbrains.jps.incremental.instrumentation.ClassProcessingBuilder.build(ClassProcessingBuilder.java:80)
at 
org.jetbrains.jps.incremental.IncProjectBuilder.runModuleLevelBuilders(IncProjectBuilder.java:1569)
at 
org.jetbrains.jps.incremental.IncProjectBuilder.runBuildersForChunk(IncProjectBuilder.java:1198)
at 
org.jetbrains.jps.incremental.IncProjectBuilder.buildTargetsChunk(IncProjectBuilder.java:1349)
at 
org.jetbrains.jps.incremental.IncProjectBuilder.buildChunkIfAffected(IncProjectBuilder.java:1163)
at 
org.jetbrains.jps.incremental.IncProjectBuilder$BuildParallelizer$1.run(IncProjectBuilder.java:1129)
at 
com.intellij.util.concurrency.BoundedTaskExecutor.doRun(BoundedTaskExecutor.java:244)
at 
com.intellij.util.concurrency.BoundedTaskExecutor.access$200(BoundedTaskExecutor.java:30)
at 
com.intellij.util.concurrency.BoundedTaskExecutor$1.executeFirstTaskAndHelpQueue(BoundedTaskExecutor.java:222)
at 
com.intellij.util.ConcurrencyUtil.runUnderThreadName(ConcurrencyUtil.java:218)
at 
com.intellij.util.concurrency.BoundedTaskExecutor$1.run(BoundedTaskExecutor.java:210)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:842)
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46929) Use ThreadUtils.shutdown to close thread pools

2024-01-30 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46929:
---
Component/s: Connect
 Spark Core
 SS
 (was: SQL)

> Use ThreadUtils.shutdown to close thread pools
> --
>
> Key: SPARK-46929
> URL: https://issues.apache.org/jira/browse/SPARK-46929
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Spark Core, SS
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46929) Use ThreadUtils.shutdown to close thread pools

2024-01-30 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46929:
--

 Summary: Use ThreadUtils.shutdown to close thread pools
 Key: SPARK-46929
 URL: https://issues.apache.org/jira/browse/SPARK-46929
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46895) Replace Timer with single thread scheduled executor

2024-01-28 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46895:
--

 Summary: Replace Timer with single thread scheduled executor
 Key: SPARK-46895
 URL: https://issues.apache.org/jira/browse/SPARK-46895
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng


Spark still uses java.util.Timer in several places.
We should replace Timer with a single-thread scheduled executor, as sketched below.
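
A minimal, hedged sketch of the intended replacement (not Spark's actual code; the task and timings are illustrative):

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

// java.util.Timer cancels all scheduled tasks if one task throws an unchecked
// exception. A single-thread ScheduledExecutorService covers the same use case
// and has a standard shutdown path.
val scheduler = Executors.newSingleThreadScheduledExecutor()

// Analogous to Timer.scheduleAtFixedRate(task, delay, period).
val handle = scheduler.scheduleAtFixedRate(
  () => println("tick"), 0L, 1L, TimeUnit.SECONDS)

// ... later, on shutdown:
handle.cancel(false)
scheduler.shutdown()
{code}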



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46882) Remove unnecessary AtomicInteger

2024-01-26 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46882:
---
Summary: Remove unnecessary AtomicInteger  (was: Remove unnessary 
AtomicInteger)

> Remove unnecessary AtomicInteger
> 
>
> Key: SPARK-46882
> URL: https://issues.apache.org/jira/browse/SPARK-46882
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46882) Remove unnessary AtomicInteger

2024-01-26 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46882:
--

 Summary: Remove unnessary AtomicInteger
 Key: SPARK-46882
 URL: https://issues.apache.org/jira/browse/SPARK-46882
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46760) Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer

2024-01-18 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46760:
--

 Summary: Make the document of 
spark.sql.adaptive.coalescePartitions.parallelismFirst clearer
 Key: SPARK-46760
 URL: https://issues.apache.org/jira/browse/SPARK-46760
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46611) Remove ThreadLocal by replace SimpleDateFormat with DateTimeFormatter

2024-01-06 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46611:
--

 Summary: Remove ThreadLocal by replace SimpleDateFormat with 
DateTimeFormatter
 Key: SPARK-46611
 URL: https://issues.apache.org/jira/browse/SPARK-46611
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46494) Remove the parse rule of First, Last and Any_value

2023-12-24 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-46494.

Resolution: Won't Fix

> Remove the parse rule of First, Last and Any_value
> --
>
> Key: SPARK-46494
> URL: https://issues.apache.org/jira/browse/SPARK-46494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Spark has separate parse rules for First, Last and Any_value.
> In fact, the general function-call parse rule already supports them well.
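
For illustration, a hedged spark-shell check (assumes a SparkSession named `spark`) showing these functions going through the general function-call syntax, including the IGNORE NULLS clause:

{code:scala}
// Parsed by the general functionCall rule; no dedicated grammar rule needed.
spark.sql(
  """SELECT first(v) IGNORE NULLS, last(v) IGNORE NULLS, any_value(v)
    |FROM VALUES (NULL), (1), (2) AS t(v)""".stripMargin).show()
{code}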



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46494) Remove the parse rule of First, Last and Any_value

2023-12-23 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46494:
---
Description: 
Spark has separate parse rules for First, Last and Any_value.
In fact, the general function-call parse rule already supports them well.

  was:Spark have separate parse rule for Merge the parse rule of PercentileCont 
and PercentileDisc into functionCall


> Remove the parse rule of First, Last and Any_value
> --
>
> Key: SPARK-46494
> URL: https://issues.apache.org/jira/browse/SPARK-46494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Spark has separate parse rules for First, Last and Any_value.
> In fact, the general function-call parse rule already supports them well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46494) Remove the parse rule of First, Last and Any_value

2023-12-23 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46494:
---
Description: Spark have separate parse rule for Merge the parse rule of 
PercentileCont and PercentileDisc into functionCall  (was: Spark have separate 
parse rule for )

> Remove the parse rule of First, Last and Any_value
> --
>
> Key: SPARK-46494
> URL: https://issues.apache.org/jira/browse/SPARK-46494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Spark have separate parse rule for Merge the parse rule of PercentileCont and 
> PercentileDisc into functionCall



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46494) Remove the parse rule of First, Last and Any_value

2023-12-23 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46494:
---
Description: Spark have separate parse rule for 

> Remove the parse rule of First, Last and Any_value
> --
>
> Key: SPARK-46494
> URL: https://issues.apache.org/jira/browse/SPARK-46494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Spark have separate parse rule for 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46494) Remove the parse rule of First, Last and Any_value

2023-12-23 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46494:
--

 Summary: Remove the parse rule of First, Last and Any_value
 Key: SPARK-46494
 URL: https://issues.apache.org/jira/browse/SPARK-46494
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46491) Eliminate the aggregation if the group keys is the subset of the partition keys

2023-12-23 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46491:
--

 Summary: Eliminate the aggregation if the group keys is the subset 
of the partition keys
 Key: SPARK-46491
 URL: https://issues.apache.org/jira/browse/SPARK-46491
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46207) Support MergeInto in DataFrameWriterV2

2023-12-20 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng reassigned SPARK-46207:
--

Assignee: Huaxin Gao

> Support MergeInto in DataFrameWriterV2
> --
>
> Key: SPARK-46207
> URL: https://issues.apache.org/jira/browse/SPARK-46207
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46207) Support MergeInto in DataFrameWriterV2

2023-12-20 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-46207.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44119
[https://github.com/apache/spark/pull/44119]

> Support MergeInto in DataFrameWriterV2
> --
>
> Key: SPARK-46207
> URL: https://issues.apache.org/jira/browse/SPARK-46207
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46443) Decimal precision and scale should decided by JDBC dialect.

2023-12-18 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46443:
---
Summary: Decimal precision and scale should decided by JDBC dialect.  (was: 
Ensure Decimal precision and scale should decided by JDBC dialect.)

> Decimal precision and scale should decided by JDBC dialect.
> ---
>
> Key: SPARK-46443
> URL: https://issues.apache.org/jira/browse/SPARK-46443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46442) DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC

2023-12-17 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46442:
--

 Summary: DS V2 supports push down PERCENTILE_CONT and 
PERCENTILE_DISC
 Key: SPARK-46442
 URL: https://issues.apache.org/jira/browse/SPARK-46442
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46406) Assign a name to the error class _LEGACY_ERROR_TEMP_1023

2023-12-16 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-46406.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44355
[https://github.com/apache/spark/pull/44355]

> Assign a name to the error class _LEGACY_ERROR_TEMP_1023
> 
>
> Key: SPARK-46406
> URL: https://issues.apache.org/jira/browse/SPARK-46406
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45795) DS V2 supports push down Mode

2023-12-16 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45795.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43661
[https://github.com/apache/spark/pull/43661]

> DS V2 supports push down Mode
> -
>
> Key: SPARK-45795
> URL: https://issues.apache.org/jira/browse/SPARK-45795
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Many databases support the aggregate function MODE, so DS V2 could push it
> down.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46403) Decode parquet binary with getBytesUnsafe method

2023-12-15 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng reassigned SPARK-46403:
--

Assignee: Wan Kun

> Decode parquet binary with getBytesUnsafe method
> 
>
> Key: SPARK-46403
> URL: https://issues.apache.org/jira/browse/SPARK-46403
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wan Kun
>Assignee: Wan Kun
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2023-12-14-16-30-39-104.png
>
>
> Spark currently decodes a Parquet binary object with the getBytes() method.
> The *Binary.getBytes()* method always makes a new copy of the internal bytes.
> We can use the *Binary.getBytesUnsafe()* method instead: it returns the cached
> bytes when getBytes() has already been called and the copy is cached.
> !image-2023-12-14-16-30-39-104.png!
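
A hedged illustration of the difference (not the actual Spark reader code; the helper name is made up, and it assumes the running parquet-mr version provides Binary.getBytesUnsafe):

{code:scala}
import java.nio.charset.StandardCharsets
import org.apache.parquet.io.api.Binary

// getBytes() always returns a fresh copy of the backing bytes.
// getBytesUnsafe() may return the internal/cached array, so the result must be
// treated as read-only; that is what makes it cheaper when decoding.
def toUtf8String(binary: Binary): String =
  new String(binary.getBytesUnsafe, StandardCharsets.UTF_8)
{code}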



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46403) Decode parquet binary with getBytesUnsafe method

2023-12-15 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-46403.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44351
[https://github.com/apache/spark/pull/44351]

> Decode parquet binary with getBytesUnsafe method
> 
>
> Key: SPARK-46403
> URL: https://issues.apache.org/jira/browse/SPARK-46403
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wan Kun
>Assignee: Wan Kun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: image-2023-12-14-16-30-39-104.png
>
>
> Spark currently decodes a Parquet binary object with the getBytes() method.
> The *Binary.getBytes()* method always makes a new copy of the internal bytes.
> We can use the *Binary.getBytesUnsafe()* method instead: it returns the cached
> bytes when getBytes() has already been called and the copy is cached.
> !image-2023-12-14-16-30-39-104.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46406) Assign a name to the error class _LEGACY_ERROR_TEMP_1023

2023-12-14 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46406:
--

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_1023
 Key: SPARK-46406
 URL: https://issues.apache.org/jira/browse/SPARK-46406
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45796) Support MODE() WITHIN GROUP (ORDER BY col)

2023-12-14 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45796.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44184
[https://github.com/apache/spark/pull/44184]

> Support MODE() WITHIN GROUP (ORDER BY col) 
> ---
>
> Key: SPARK-45796
> URL: https://issues.apache.org/jira/browse/SPARK-45796
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Many mainstream databases support the syntax shown below.
> { MODE() WITHIN GROUP (ORDER BY sortSpecification) }
> [FILTER (WHERE expression)] [OVER windowNameOrSpecification]
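
A small usage example of the added syntax (assumes a SparkSession named `spark`; the sample data is illustrative):

{code:scala}
spark.sql(
  """SELECT mode() WITHIN GROUP (ORDER BY v)
    |FROM VALUES (1), (2), (2), (3) AS t(v)""".stripMargin).show()
// Returns 2, the most frequent value of v.
{code}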



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46363) Improve the java text block with java15 feature

2023-12-11 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46363:
--

 Summary: Improve the java text block with java15 feature
 Key: SPARK-46363
 URL: https://issues.apache.org/jira/browse/SPARK-46363
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45649) Unify the prepare framework for `OffsetWindowFunctionFrame`

2023-12-11 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45649.

Resolution: Fixed

> Unify the prepare framework for `OffsetWindowFunctionFrame`
> ---
>
> Key: SPARK-45649
> URL: https://issues.apache.org/jira/browse/SPARK-45649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Attachments: test_table.parquet.zip
>
>
> Currently, the `prepare` implementations of all the
> `OffsetWindowFunctionFrame` subclasses share the same code logic, shown below.
> ```
>   override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
> if (offset > rows.length) {
>   fillDefaultValue(EmptyRow)
> } else {
>   resetStates(rows)
>   if (ignoreNulls) {
> ...
>   } else {
> ...
>   }
> }
>   }
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46270) Use java14 instanceof expressions to replace the java8 instanceof statement

2023-12-05 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46270:
--

 Summary: Use java14 instanceof expressions to replace the java8 
instanceof statement
 Key: SPARK-46270
 URL: https://issues.apache.org/jira/browse/SPARK-46270
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46009) Merge the parse rule of PercentileCont and PercentileDisc into functionCall

2023-12-04 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-46009.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43910
[https://github.com/apache/spark/pull/43910]

> Merge the parse rule of PercentileCont and PercentileDisc into functionCall
> ---
>
> Key: SPARK-46009
> URL: https://issues.apache.org/jira/browse/SPARK-46009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The Spark SQL parser has a special rule to parse
> [percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
> We should merge this rule into the general functionCall rule.
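
The syntax covered by the special rule, which the general functionCall rule should handle after this change (assumes a SparkSession named `spark`; the sample data is illustrative):

{code:scala}
spark.sql(
  """SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v),
    |       percentile_disc(0.5) WITHIN GROUP (ORDER BY v)
    |FROM VALUES (1), (2), (3), (4) AS t(v)""".stripMargin).show()
{code}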



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46101) Replace (string|array).size with (string|array).length in all the modules

2023-11-27 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46101:
---
Summary: Replace (string|array).size with (string|array).length in all the 
modules  (was: Replace (string|array).size with (string|array).length in module 
SQL)

> Replace (string|array).size with (string|array).length in all the modules
> -
>
> Key: SPARK-46101
> URL: https://issues.apache.org/jira/browse/SPARK-46101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46100) Replace (string|array).size with (string|array).length in all the modules

2023-11-27 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46100:
---
Summary: Replace (string|array).size with (string|array).length in all the 
modules  (was: Replace (string|array).size with (string|array).length in module 
core)

> Replace (string|array).size with (string|array).length in all the modules
> -
>
> Key: SPARK-46100
> URL: https://issues.apache.org/jira/browse/SPARK-46100
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46100) Replace (string|array).size with (string|array).length in module core

2023-11-27 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46100:
---
Summary: Replace (string|array).size with (string|array).length in module 
core  (was: Replace (string|array).size with (string|array).length in all the 
modules)

> Replace (string|array).size with (string|array).length in module core
> -
>
> Key: SPARK-46100
> URL: https://issues.apache.org/jira/browse/SPARK-46100
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46101) Fix these issue in module sql

2023-11-25 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46101:
--

 Summary: Fix these issue in module sql
 Key: SPARK-46101
 URL: https://issues.apache.org/jira/browse/SPARK-46101
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46100) Fix these issue in module core

2023-11-25 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46100:
--

 Summary: Fix these issue in module core
 Key: SPARK-46100
 URL: https://issues.apache.org/jira/browse/SPARK-46100
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46098) Reduce stack depth by replace (string|array).size with (string|array).length

2023-11-25 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46098:
---
Description: 
There are a lot of calls to (string|array).size.
In fact, size just delegates to the underlying length, which adds an extra frame to
the call stack.
We should call (string|array).length directly.

We also get the compile warning: Replace .size with .length on arrays and strings

  was:
There are a lot of (string|array).size called.
In fact, the size calls the underlying length, this behavior increase the stack 
length.
We should call 

# Replace .size with .length on arrays and strings


> Reduce stack depth by replace (string|array).size with (string|array).length
> 
>
> Key: SPARK-46098
> URL: https://issues.apache.org/jira/browse/SPARK-46098
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> There are a lot of calls to (string|array).size.
> In fact, size just delegates to the underlying length, which adds an extra
> frame to the call stack.
> We should call (string|array).length directly.
> We also get the compile warning: Replace .size with .length on arrays and
> strings
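
A minimal sketch of the mechanical change (illustrative values only):

{code:scala}
val xs = Array(1, 2, 3)
val s  = "spark"

// .size on Array/String goes through a wrapper (ArrayOps/StringOps) that simply
// forwards to .length, adding an extra call frame and a compile warning.
val before = xs.size + s.size
// .length is the direct accessor and returns the same value.
val after  = xs.length + s.length
assert(before == after)
{code}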



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46098) Reduce stack depth by replace (string|array).size with (string|array).length

2023-11-25 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46098:
---
Description: 
There are a lot of (string|array).size called.
In fact, the size calls the underlying length, this behavior increase the stack 
length.
We should call 

# Replace .size with .length on arrays and strings

  was:
There are a lot of # Replace .size with .length on arrays and strings

# Replace .size with .length on arrays and strings


> Reduce stack depth by replace (string|array).size with (string|array).length
> 
>
> Key: SPARK-46098
> URL: https://issues.apache.org/jira/browse/SPARK-46098
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> There are a lot of (string|array).size called.
> In fact, the size calls the underlying length, this behavior increase the 
> stack length.
> We should call 
> # Replace .size with .length on arrays and strings



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46098) Reduce stack depth by replace (string|array).size with (string|array).length

2023-11-25 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46098:
---
Description: 
There are a lot of # Replace .size with .length on arrays and strings

# Replace .size with .length on arrays and strings

  was:
There are a lot of 

# Replace .size with .length on arrays and strings


> Reduce stack depth by replace (string|array).size with (string|array).length
> 
>
> Key: SPARK-46098
> URL: https://issues.apache.org/jira/browse/SPARK-46098
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> There are a lot of # Replace .size with .length on arrays and strings
> # Replace .size with .length on arrays and strings



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45649) Unify the prepare framework for `OffsetWindowFunctionFrame`

2023-11-21 Thread Jiaan Geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788329#comment-17788329
 ] 

Jiaan Geng commented on SPARK-45649:


[~cloud_fan] I see. I will investigate this bug.

> Unify the prepare framework for `OffsetWindowFunctionFrame`
> ---
>
> Key: SPARK-45649
> URL: https://issues.apache.org/jira/browse/SPARK-45649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: test_table.parquet.zip
>
>
> Currently, the `prepare` implementations of all the
> `OffsetWindowFunctionFrame` subclasses share the same code logic, shown below.
> ```
>   override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
> if (offset > rows.length) {
>   fillDefaultValue(EmptyRow)
> } else {
>   resetStates(rows)
>   if (ignoreNulls) {
> ...
>   } else {
> ...
>   }
> }
>   }
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46029) Escape the single quote, _ and % for DS V2 pushdown

2023-11-21 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46029:
---
Summary: Escape the single quote, _ and % for DS V2 pushdown  (was: Escape 
the ', _ and % for DS V2 pushdown)

> Escape the single quote, _ and % for DS V2 pushdown
> ---
>
> Key: SPARK-46029
> URL: https://issues.apache.org/jira/browse/SPARK-46029
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Spark supports pushing down startsWith, endsWith and contains to JDBC databases
> with DS V2 pushdown.
> But V2ExpressionSQLBuilder didn't escape the single quote, _ and %, which
> can cause unexpected results.
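
A hedged sketch of the kind of escaping that is needed (not the actual V2ExpressionSQLBuilder code; the helper name is made up):

{code:scala}
// User text is embedded into a SQL LIKE pattern when startsWith/endsWith/contains
// are pushed down, so the LIKE wildcards and the string-literal quote must be
// escaped first.
def escapeForLike(value: String): String =
  value.flatMap {
    case '\\' => "\\\\"  // escape the escape character itself
    case '_'  => "\\_"   // LIKE single-character wildcard
    case '%'  => "\\%"   // LIKE multi-character wildcard
    case '\'' => "''"    // SQL string-literal quote
    case c    => c.toString
  }

// startsWith("50%") could then render as: col LIKE '50\%%' ESCAPE '\'
{code}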



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46009) Merge the parse rule of PercentileCont and PercentileDisc into functionCall

2023-11-20 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46009:
---
Description: 
The Spark SQL parser has a special rule to parse
[percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
We should merge this rule into the general functionCall rule.

  was:
Spark SQL parse have a special rule to parse 
[percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
We should merge this rule into the functionCall.


> Merge the parse rule of PercentileCont and PercentileDisc into functionCall
> ---
>
> Key: SPARK-46009
> URL: https://issues.apache.org/jira/browse/SPARK-46009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> The Spark SQL parser has a special rule to parse
> [percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
> We should merge this rule into the general functionCall rule.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46009) Merge the parse rule of PercentileCont and PercentileDisc into functionCall

2023-11-20 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46009:
---
Description: 
Spark SQL parse have a special rule to parse 
[percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
We should merge this rule into the functionCall.

  was:
Spark SQL parse have a special rule to parse 
[percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
We should merge this rule into the 


> Merge the parse rule of PercentileCont and PercentileDisc into functionCall
> ---
>
> Key: SPARK-46009
> URL: https://issues.apache.org/jira/browse/SPARK-46009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Spark SQL parse have a special rule to parse 
> [percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
> We should merge this rule into the functionCall.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46009) Merge the parse rule of PercentileCont and PercentileDisc into functionCall

2023-11-20 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-46009:
---
Description: 
Spark SQL parse have a special rule to parse 
[percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
We should merge this rule into the 

> Merge the parse rule of PercentileCont and PercentileDisc into functionCall
> ---
>
> Key: SPARK-46009
> URL: https://issues.apache.org/jira/browse/SPARK-46009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Spark SQL parse have a special rule to parse 
> [percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
> We should merge this rule into the 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46009) Merge the parse rule of PercentileCont and PercentileDisc into functionCall

2023-11-20 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-46009:
--

 Summary: Merge the parse rule of PercentileCont and PercentileDisc 
into functionCall
 Key: SPARK-46009
 URL: https://issues.apache.org/jira/browse/SPARK-46009
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45933) Runtime filter should infers more application side.

2023-11-15 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45933:
--

 Summary: Runtime filter should infers more application side.
 Key: SPARK-45933
 URL: https://issues.apache.org/jira/browse/SPARK-45933
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45904) Mode function supports sort direction

2023-11-13 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45904:
--

 Summary: Mode function supports sort direction
 Key: SPARK-45904
 URL: https://issues.apache.org/jira/browse/SPARK-45904
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng


Currently, the mode function doesn't support a sort direction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45840) Fix these issue in module sql/hive, sql/hive-thriftserver

2023-11-08 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45840:
---
Summary: Fix these issue in module sql/hive, sql/hive-thriftserver  (was: 
Fix these issue in module sql/hive)

> Fix these issue in module sql/hive, sql/hive-thriftserver
> -
>
> Key: SPARK-45840
> URL: https://issues.apache.org/jira/browse/SPARK-45840
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45840) Fix these issue in module sql/hive

2023-11-08 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45840:
--

 Summary: Fix these issue in module sql/hive
 Key: SPARK-45840
 URL: https://issues.apache.org/jira/browse/SPARK-45840
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45839) Fix these issue in module sql/api

2023-11-08 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45839:
--

 Summary: Fix these issue in module sql/api
 Key: SPARK-45839
 URL: https://issues.apache.org/jira/browse/SPARK-45839
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45838) Fix these issue in module sql/core

2023-11-08 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45838:
--

 Summary: Fix these issue in module sql/core
 Key: SPARK-45838
 URL: https://issues.apache.org/jira/browse/SPARK-45838
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45825) Fix these issue in module sql/catalyst

2023-11-08 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45825:
---
Summary: Fix these issue in module sql/catalyst  (was: Fix these issue in 
package sql/catalyst)

> Fix these issue in module sql/catalyst
> --
>
> Key: SPARK-45825
> URL: https://issues.apache.org/jira/browse/SPARK-45825
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45816) Return null when overflowing during casting from timestamp to integers

2023-11-08 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45816.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43694
[https://github.com/apache/spark/pull/43694]

> Return null when overflowing during casting from timestamp to integers
> --
>
> Key: SPARK-45816
> URL: https://issues.apache.org/jira/browse/SPARK-45816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Spark cast works in two modes: ANSI and non-ANSI. When a cast overflows, the
> common behavior under non-ANSI mode is to return null. However, casting from
> Timestamp to Int/Short/Byte currently returns a wrapped value. Silently
> overflowing doesn't make sense.
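
A hedged illustration of the behavior change under non-ANSI mode (assumes a SparkSession named `spark`; the timestamp literal is just one value whose epoch seconds exceed Int range):

{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "false")
// The epoch-second value of this timestamp does not fit in an Int.
spark.sql("SELECT CAST(TIMESTAMP'9999-12-31 23:59:59' AS INT)").show()
// Before the fix: a silently wrapped integer; after the fix: NULL.
{code}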



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45816) Return null when overflowing during casting from timestamp to integers

2023-11-08 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng reassigned SPARK-45816:
--

Assignee: L. C. Hsieh

> Return null when overflowing during casting from timestamp to integers
> --
>
> Key: SPARK-45816
> URL: https://issues.apache.org/jira/browse/SPARK-45816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
>
> Spark cast works in two modes: ANSI and non-ANSI. When a cast overflows, the
> common behavior under non-ANSI mode is to return null. However, casting from
> Timestamp to Int/Short/Byte currently returns a wrapped value. Silently
> overflowing doesn't make sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45606) Release restrictions on multi-layer runtime filter

2023-11-08 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45606.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43449
[https://github.com/apache/spark/pull/43449]

> Release restrictions on multi-layer runtime filter
> --
>
> Key: SPARK-45606
> URL: https://issues.apache.org/jira/browse/SPARK-45606
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Before https://issues.apache.org/jira/browse/SPARK-41674, Spark only supported
> inserting a runtime filter on the application side of a shuffle join at a single
> layer. Since it was considered not worthwhile to insert more runtime filters when
> one side of the shuffle join already had one, Spark restricted it.
> After https://issues.apache.org/jira/browse/SPARK-41674, Spark supports inserting
> runtime filters for one side of any shuffle join across multiple layers, but the
> restriction on multi-layer runtime filters looks outdated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45832) Fix 'Super method + is deprecated.'

2023-11-07 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45832:
--

 Summary: Fix 'Super method + is deprecated.'
 Key: SPARK-45832
 URL: https://issues.apache.org/jira/browse/SPARK-45832
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng


  @deprecated("Consider requiring an immutable Map or fall back to 
Map.concat.", "2.13.0")
  def + [V1 >: V](kv: (K, V1)): CC[K, V1] =
mapFactory.from(new View.Appended(this, kv))
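
A hedged example of the fix direction suggested by the deprecation message (illustrative values; not a specific Spark call site):

{code:scala}
val m: scala.collection.Map[String, Int] = Map("a" -> 1)

// `m + ("b" -> 2)` resolves to the deprecated generic Map `+` shown above.
// Either require scala.collection.immutable.Map, or use concat:
val updated = m.concat(Map("b" -> 2))
{code}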



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45825) Fix these issue in package sql/catalyst

2023-11-07 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45825:
--

 Summary: Fix these issue in package sql/catalyst
 Key: SPARK-45825
 URL: https://issues.apache.org/jira/browse/SPARK-45825
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix some scala compile warnings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Description: 
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head
# Replace with .nonEmpty
# Replace with .isDefined
# Unnecessary parentheses
# Replace with .isEmpty

  was:
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head
# Replace with .nonEmpty
# Replace with .isDefined
# Unnecessary parentheses


> Fix some scala compile warnings
> ---
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> # Replace .size with .length on arrays and strings
> # The enclosing block is redundant
> # Replace with .head
> # Replace with .nonEmpty
> # Replace with .isDefined
> # Unnecessary parentheses
> # Replace with .isEmpty



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix some scala compile warnings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Description: 
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head
# Replace with .nonEmpty
# Replace with .isDefined
# Unnecessary parentheses

  was:
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head
# Replace with .nonEmpty
# Replace with .isDefined


> Fix some scala compile warnings
> ---
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> # Replace .size with .length on arrays and strings
> # The enclosing block is redundant
> # Replace with .head
> # Replace with .nonEmpty
> # Replace with .isDefined
> # Unnecessary parentheses



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix some scala compile warnings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Description: 
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head
# Replace with .nonEmpty
# Replace with .isDefined

  was:
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head
# Replace with .nonEmpty


> Fix some scala compile warnings
> ---
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> # Replace .size with .length on arrays and strings
> # The enclosing block is redundant
> # Replace with .head
> # Replace with .nonEmpty
> # Replace with .isDefined



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix some scala compile warnings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Description: 
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head
# Replace with .nonEmpty

  was:
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head
# Replace with .contains


> Fix some scala compile warnings
> ---
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> # Replace .size with .length on arrays and strings
> # The enclosing block is redundant
> # Replace with .head
> # Replace with .nonEmpty



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix some scala compile warnings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Description: 
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head
# Replace with .contains

  was:
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head


> Fix some scala compile warnings
> ---
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> # Replace .size with .length on arrays and strings
> # The enclosing block is redundant
> # Replace with .head
> # Replace with .contains



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix some scala compile warnings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Description: 
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# Replace with .head

  was:
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# 


> Fix some scala compile warnings
> ---
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> # Replace .size with .length on arrays and strings
> # The enclosing block is redundant
> # Replace with .head



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix some scala compile warnings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Description: 
# Replace .size with .length on arrays and strings
# The enclosing block is redundant
# 

  was:
# Replace .size with .length on arrays and strings
# 


> Fix some scala compile warnings
> ---
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> # Replace .size with .length on arrays and strings
> # The enclosing block is redundant
> # 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix some scala compile warnings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Summary: Fix some scala compile warnings  (was: Fix some compile warnings)

> Fix some scala compile warnings
> ---
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> # Replace .size with .length on arrays and strings
> # 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix some compile warnings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Summary: Fix some compile warnings  (was: Fix Replace .size with .length on 
arrays and strings)

> Fix some compile warnings
> -
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> # Replace .size with .length on arrays and strings
> # 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45823) Fix Replace .size with .length on arrays and strings

2023-11-07 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45823:
---
Summary: Fix Replace .size with .length on arrays and strings  (was: Fix 
The enclosing block is redundant)

> Fix Replace .size with .length on arrays and strings
> 
>
> Key: SPARK-45823
> URL: https://issues.apache.org/jira/browse/SPARK-45823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45823) Fix The enclosing block is redundant

2023-11-07 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45823:
--

 Summary: Fix The enclosing block is redundant
 Key: SPARK-45823
 URL: https://issues.apache.org/jira/browse/SPARK-45823
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45793) Improve the built-in compression codecs

2023-11-06 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45793.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43659
[https://github.com/apache/spark/pull/43659]

> Improve the built-in compression codecs
> ---
>
> Key: SPARK-45793
> URL: https://issues.apache.org/jira/browse/SPARK-45793
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, Spark supports many built-in compression codecs for I/O and
> storage.
> There are a lot of magic strings copied from the built-in compression codec
> names, so developers have to keep them consistent by hand. That is error-prone
> and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45758) Introduce a mapper for hadoop compression codecs

2023-11-06 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45758.

Resolution: Resolved

> Introduce a mapper for hadoop compression codecs
> 
>
> Key: SPARK-45758
> URL: https://issues.apache.org/jira/browse/SPARK-45758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark supports a subset of the Hadoop compression codecs, but the
> Hadoop-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces two pseudo codecs: none and uncompressed.
> There are a lot of magic strings copied from the Hadoop compression codec
> names, so developers have to keep them consistent by hand. That is error-prone
> and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45758) Introduce a mapper for hadoop compression codecs

2023-11-06 Thread Jiaan Geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783169#comment-17783169
 ] 

Jiaan Geng commented on SPARK-45758:


Resolved by https://github.com/apache/spark/pull/43620

> Introduce a mapper for hadoop compression codecs
> 
>
> Key: SPARK-45758
> URL: https://issues.apache.org/jira/browse/SPARK-45758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark supports a subset of the Hadoop compression codecs, but the
> Hadoop-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces two pseudo codecs: none and uncompressed.
> There are a lot of magic strings copied from the Hadoop compression codec
> names, so developers have to keep them consistent by hand. That is error-prone
> and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45796) Support MODE() WITHIN GROUP (ORDER BY col)

2023-11-05 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45796:
---
Description: 
Many mainstream databases support the syntax shown below.
{ MODE() WITHIN GROUP (ORDER BY sortSpecification) }
[FILTER (WHERE expression)] [OVER windowNameOrSpecification]
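As a purely hypothetical sketch of how the proposed syntax might be invoked once it is supported (the table and column names are made up, and a SparkSession named `spark` is assumed):
{code:scala}
// Most frequent salary per department, using the proposed inverse-distribution syntax.
spark.sql(
  """SELECT dept,
    |       mode() WITHIN GROUP (ORDER BY salary) AS most_common_salary
    |FROM employees
    |GROUP BY dept""".stripMargin).show()
{code}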

> Support MODE() WITHIN GROUP (ORDER BY col) 
> ---
>
> Key: SPARK-45796
> URL: https://issues.apache.org/jira/browse/SPARK-45796
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Many mainstream databases support the syntax shown below.
> { MODE() WITHIN GROUP (ORDER BY sortSpecification) }
> [FILTER (WHERE expression)] [OVER windowNameOrSpecification]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45796) Support MODE() WITHIN GROUP (ORDER BY col)

2023-11-05 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45796:
--

 Summary: Support MODE() WITHIN GROUP (ORDER BY col) 
 Key: SPARK-45796
 URL: https://issues.apache.org/jira/browse/SPARK-45796
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45795) DS V2 supports push down Mode

2023-11-05 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45795:
--

 Summary: DS V2 supports push down Mode
 Key: SPARK-45795
 URL: https://issues.apache.org/jira/browse/SPARK-45795
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45793) Improve the built-in compression codecs

2023-11-04 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45793:
---
Description: 
Currently, Spark supports many built-in compression codecs for I/O and storage.
There are a lot of magic strings copied from the built-in compression codec
names, so developers have to keep them consistent by hand. That is error-prone
and reduces development efficiency.
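A minimal sketch of the intended direction (names are illustrative, not Spark's actual API): keep the codec names in a single place instead of repeating string literals.
{code:scala}
object IoCompressionCodec extends Enumeration {
  // Built-in I/O codec names (sketch only; the real list lives in Spark core).
  val LZ4, LZF, SNAPPY, ZSTD = Value

  // Resolve a configured name case-insensitively instead of comparing magic strings inline.
  def fromName(name: String): IoCompressionCodec.Value =
    values.find(_.toString.equalsIgnoreCase(name)).getOrElse(
      throw new IllegalArgumentException(s"Unsupported codec: $name"))
}
{code}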

> Improve the built-in compression codecs
> ---
>
> Key: SPARK-45793
> URL: https://issues.apache.org/jira/browse/SPARK-45793
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports many built-in compression codecs for I/O and
> storage.
> There are a lot of magic strings copied from the built-in compression codec
> names, so developers have to keep them consistent by hand. That is error-prone
> and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45758) Introduce a mapper for hadoop compression codecs

2023-11-01 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45758:
---
Description: 
Currently, Spark supports a subset of the Hadoop compression codecs, but the
Hadoop-supported codecs and the Spark-supported ones do not map one-to-one,
because Spark introduces two pseudo codecs: none and uncompressed.

There are a lot of magic strings copied from the Hadoop compression codec names,
so developers have to keep them consistent by hand. That is error-prone and
reduces development efficiency.
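A rough sketch of the mapper idea (illustrative only, not Spark's actual implementation): the two Spark-only pseudo codecs map to "no Hadoop codec", and the class names below are the standard Hadoop ones.
{code:scala}
object HadoopCodecMapper {
  private val codecClassByName: Map[String, Option[String]] = Map(
    "none"         -> None,
    "uncompressed" -> None,
    "gzip"         -> Some("org.apache.hadoop.io.compress.GzipCodec"),
    "snappy"       -> Some("org.apache.hadoop.io.compress.SnappyCodec"),
    "lz4"          -> Some("org.apache.hadoop.io.compress.Lz4Codec"),
    "bzip2"        -> Some("org.apache.hadoop.io.compress.BZip2Codec"))

  // One lookup replaces magic strings scattered across the code base.
  def codecClassName(name: String): Option[String] =
    codecClassByName.getOrElse(name.toLowerCase(java.util.Locale.ROOT),
      throw new IllegalArgumentException(s"Unknown codec: $name"))
}
{code}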

> Introduce a mapper for hadoop compression codecs
> 
>
> Key: SPARK-45758
> URL: https://issues.apache.org/jira/browse/SPARK-45758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports a subset of the Hadoop compression codecs, but the
> Hadoop-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces two pseudo codecs: none and uncompressed.
> There are a lot of magic strings copied from the Hadoop compression codec
> names, so developers have to keep them consistent by hand. That is error-prone
> and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45758) Introduce a mapper for hadoop compression codecs

2023-11-01 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45758:
--

 Summary: Introduce a mapper for hadoop compression codecs
 Key: SPARK-45758
 URL: https://issues.apache.org/jira/browse/SPARK-45758
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45755) Push down limit through Dataset.isEmpty()

2023-11-01 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng reassigned SPARK-45755:
--

Assignee: Yuming Wang

> Push down limit through Dataset.isEmpty()
> -
>
> Key: SPARK-45755
> URL: https://issues.apache.org/jira/browse/SPARK-45755
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> Pushing down LocalLimit cannot optimize the distinct case.
> {code:scala}
>   def isEmpty: Boolean = withAction("isEmpty",
> withTypedPlan { LocalLimit(Literal(1), select().logicalPlan) 
> }.queryExecution) { plan =>
> plan.executeTake(1).isEmpty
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45755) Push down limit through Dataset.isEmpty()

2023-11-01 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45755.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43617
[https://github.com/apache/spark/pull/43617]

> Push down limit through Dataset.isEmpty()
> -
>
> Key: SPARK-45755
> URL: https://issues.apache.org/jira/browse/SPARK-45755
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Pushing down LocalLimit cannot optimize the distinct case.
> {code:scala}
>   def isEmpty: Boolean = withAction("isEmpty",
> withTypedPlan { LocalLimit(Literal(1), select().logicalPlan) 
> }.queryExecution) { plan =>
> plan.executeTake(1).isEmpty
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45711) Introduce a mapper for avro compression codecs

2023-10-28 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45711.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43562
[https://github.com/apache/spark/pull/43562]

> Introduce a mapper for avro compression codecs
> --
>
> Key: SPARK-45711
> URL: https://issues.apache.org/jira/browse/SPARK-45711
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, Spark supports all the avro compression codecs, but the
> avro-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces the extra codec UNCOMPRESSED.
> There are a lot of magic strings copied from the avro compression codec names,
> so developers have to keep them consistent by hand. That is error-prone and
> reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45711) Introduce a mapper for avro compression codecs

2023-10-27 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45711:
---
Description: 
Currently, Spark supports all the avro compression codecs, but the
avro-supported codecs and the Spark-supported ones do not map one-to-one,
because Spark introduces the extra codec UNCOMPRESSED.

There are a lot of magic strings copied from the avro compression codec names,
so developers have to keep them consistent by hand. That is error-prone and
reduces development efficiency.

  was:
Currently, Spark supported all the avro compression codecs, but the avro 
supported compression codecs and spark supported are not completely one-on-one 
due to Spark introduce the compression codecs UNCOMPRESSED.

There are a lot of magic strings copy from orc compression codecs. This issue 
lead to developers need to manually maintain its consistency. It is easy to 
make mistakes and reduce development efficiency.


> Introduce a mapper for avro compression codecs
> --
>
> Key: SPARK-45711
> URL: https://issues.apache.org/jira/browse/SPARK-45711
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports all the avro compression codecs, but the
> avro-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces the extra codec UNCOMPRESSED.
> There are a lot of magic strings copied from the avro compression codec names,
> so developers have to keep them consistent by hand. That is error-prone and
> reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45711) Introduce a mapper for avro compression codecs

2023-10-27 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45711:
---
Description: 
Currently, Spark supports all the avro compression codecs, but the
avro-supported codecs and the Spark-supported ones do not map one-to-one,
because Spark introduces the extra codec UNCOMPRESSED.

There are a lot of magic strings copied from the orc compression codec names,
so developers have to keep them consistent by hand. That is error-prone and
reduces development efficiency.

  was:
Currently, Spark supported all the orc compression codecs, but the orc 
supported compression codecs and spark supported are not completely one-on-one 
due to Spark introduce two compression codecs none and UNCOMPRESSED.

There are a lot of magic strings copy from orc compression codecs. This issue 
lead to developers need to manually maintain its consistency. It is easy to 
make mistakes and reduce development efficiency.


> Introduce a mapper for avro compression codecs
> --
>
> Key: SPARK-45711
> URL: https://issues.apache.org/jira/browse/SPARK-45711
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports all the avro compression codecs, but the
> avro-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces the extra codec UNCOMPRESSED.
> There are a lot of magic strings copied from the orc compression codec names,
> so developers have to keep them consistent by hand. That is error-prone and
> reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45711) Introduce a mapper for avro compression codecs

2023-10-27 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45711:
---
Description: 
Currently, Spark supports all the orc compression codecs, but the orc-supported
codecs and the Spark-supported ones do not map one-to-one, because Spark
introduces two extra codecs: none and UNCOMPRESSED.

There are a lot of magic strings copied from the orc compression codec names, so
developers have to keep them consistent by hand. That is error-prone and reduces
development efficiency.

> Introduce a mapper for avro compression codecs
> --
>
> Key: SPARK-45711
> URL: https://issues.apache.org/jira/browse/SPARK-45711
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports all the orc compression codecs, but the
> orc-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces two extra codecs: none and UNCOMPRESSED.
> There are a lot of magic strings copied from the orc compression codec names,
> so developers have to keep them consistent by hand. That is error-prone and
> reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45481) Introduce a mapper for parquet compression codecs

2023-10-26 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45481.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43308
[https://github.com/apache/spark/pull/43308]

> Introduce a mapper for parquet compression codecs
> -
>
> Key: SPARK-45481
> URL: https://issues.apache.org/jira/browse/SPARK-45481
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, Spark supports all the parquet compression codecs, but the
> parquet-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces a pseudo codec: none.
> There are a lot of magic strings copied from the parquet compression codec
> names, so developers have to keep them consistent by hand. That is error-prone
> and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45664) Introduce a mapper for orc compression codecs

2023-10-25 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45664:
---
Description: 
Currently, Spark supports all the orc compression codecs, but the orc-supported
codecs and the Spark-supported ones do not map one-to-one, because Spark
introduces two extra codecs: none and UNCOMPRESSED.

There are a lot of magic strings copied from the orc compression codec names, so
developers have to keep them consistent by hand. That is error-prone and reduces
development efficiency.

  was:
Currently, Spark supported all the orc compression codecs, but the orc 
supported compression codecs and spark supported are not completely one-on-one 
due to Spark introduce two compression codecs none and UNCOMPRESSED.

There are a lot of magic strings copy from parquet compression codecs. This 
issue lead to developers need to manually maintain its consistency. It is easy 
to make mistakes and reduce development efficiency.


> Introduce a mapper for orc compression codecs
> -
>
> Key: SPARK-45664
> URL: https://issues.apache.org/jira/browse/SPARK-45664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports all the orc compression codecs, but the
> orc-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces two extra codecs: none and UNCOMPRESSED.
> There are a lot of magic strings copied from the orc compression codec names,
> so developers have to keep them consistent by hand. That is error-prone and
> reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45664) Introduce a mapper for orc compression codecs

2023-10-25 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45664:
---
Description: 
Currently, Spark supports all the orc compression codecs, but the orc-supported
codecs and the Spark-supported ones do not map one-to-one, because Spark
introduces two extra codecs: none and UNCOMPRESSED.

There are a lot of magic strings copied from the parquet compression codec
names, so developers have to keep them consistent by hand. That is error-prone
and reduces development efficiency.

> Introduce a mapper for orc compression codecs
> -
>
> Key: SPARK-45664
> URL: https://issues.apache.org/jira/browse/SPARK-45664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports all the orc compression codecs, but the
> orc-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces two extra codecs: none and UNCOMPRESSED.
> There are a lot of magic strings copied from the parquet compression codec
> names, so developers have to keep them consistent by hand. That is error-prone
> and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45481) Introduce a mapper for parquet compression codecs

2023-10-25 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45481:
---
Description: 
Currently, Spark supports all the parquet compression codecs, but the
parquet-supported codecs and the Spark-supported ones do not map one-to-one,
because Spark introduces a pseudo codec: none.

There are a lot of magic strings copied from the parquet compression codec
names, so developers have to keep them consistent by hand. That is error-prone
and reduces development efficiency.

  was:
Currently, Spark supported most of all parquet compression codecs, the parquet 
supported compression codecs and spark supported are not completely one-on-one.

There are a lot of magic strings copy from parquet compression codecs. This 
issue lead to developers need to manually maintain its consistency. It is easy 
to make mistakes and reduce development efficiency.


> Introduce a mapper for parquet compression codecs
> -
>
> Key: SPARK-45481
> URL: https://issues.apache.org/jira/browse/SPARK-45481
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark supports all the parquet compression codecs, but the
> parquet-supported codecs and the Spark-supported ones do not map one-to-one,
> because Spark introduces a pseudo codec: none.
> There are a lot of magic strings copied from the parquet compression codec
> names, so developers have to keep them consistent by hand. That is error-prone
> and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45649) Unify the prepare framework for `OffsetWindowFunctionFrame`

2023-10-24 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45649:
---
Summary: Unify the prepare framework for `OffsetWindowFunctionFrame`  (was: 
Unified the prepare framework for `OffsetWindowFunctionFrame`)

> Unify the prepare framework for `OffsetWindowFunctionFrame`
> ---
>
> Key: SPARK-45649
> URL: https://issues.apache.org/jira/browse/SPARK-45649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, the `prepare` implementations of all the
> `OffsetWindowFunctionFrame` subclasses share the same code logic, shown below.
> ```
>   override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
> if (offset > rows.length) {
>   fillDefaultValue(EmptyRow)
> } else {
>   resetStates(rows)
>   if (ignoreNulls) {
> ...
>   } else {
> ...
>   }
> }
>   }
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45649) Unified the prepare framework for `OffsetWindowFunctionFrame`

2023-10-24 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45649:
--

 Summary: Unified the prepare framework for 
`OffsetWindowFunctionFrame`
 Key: SPARK-45649
 URL: https://issues.apache.org/jira/browse/SPARK-45649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng


Currently, the `prepare` implementations of all the
`OffsetWindowFunctionFrame` subclasses share the same code logic, shown below.
```
  override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
if (offset > rows.length) {
  fillDefaultValue(EmptyRow)
} else {
  resetStates(rows)
  if (ignoreNulls) {
...
  } else {
...
  }
}
  }
```
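A simplified sketch of the unification (the class and type names below are stand-ins for the real Spark internals): pull the shared guard into a final template method and leave only the two branches to subclasses.
```
abstract class OffsetFrameSketch(offset: Int, ignoreNulls: Boolean) {
  protected def fillDefaultValue(): Unit
  protected def resetStates(rows: IndexedSeq[AnyRef]): Unit
  protected def prepareIgnoringNulls(rows: IndexedSeq[AnyRef]): Unit
  protected def prepareRespectingNulls(rows: IndexedSeq[AnyRef]): Unit

  // The logic currently duplicated in every OffsetWindowFunctionFrame.prepare, written once.
  final def prepare(rows: IndexedSeq[AnyRef]): Unit = {
    if (offset > rows.length) {
      fillDefaultValue()
    } else {
      resetStates(rows)
      if (ignoreNulls) prepareIgnoringNulls(rows) else prepareRespectingNulls(rows)
    }
  }
}
```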



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45543) InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions

2023-10-19 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45543.

Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43385
[https://github.com/apache/spark/pull/43385]

> InferWindowGroupLimit causes bug if the other window functions haven't the 
> same window frame as the rank-like functions
> ---
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss, pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> First, it's my first bug, so I'm hoping I'm doing it right, also, as I'm not 
> very knowledgeable about spark internals, I hope I diagnosed the problem 
> correctly
> I found the degradation in spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window holds a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> for the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`, however, the same result happens with 
> other functions, such as max, min, count, etc)
>  
> Now, if instead of using the two windows we used, I will use the first window 
> and a window with different ordering, or create a completely new window with 
> same partition but no ordering, it will work fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> windowspecdefinition(row_id#1L, name#0, year#3L DESC NULLS 

[jira] [Created] (SPARK-45606) Release restrictions on multi-layer runtime filter

2023-10-19 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45606:
--

 Summary: Release restrictions on multi-layer runtime filter
 Key: SPARK-45606
 URL: https://issues.apache.org/jira/browse/SPARK-45606
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng


Before https://issues.apache.org/jira/browse/SPARK-41674, Spark only supported
inserting a runtime filter for the application side of a shuffle join at a
single layer. Because it was considered not worthwhile to insert another
runtime filter when one side of the shuffle join already had one, Spark
restricted it.

After https://issues.apache.org/jira/browse/SPARK-41674, Spark supports
inserting runtime filters for one side of any shuffle join across multiple
layers, so the restriction on multi-layer runtime filters looks outdated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45543) InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions

2023-10-18 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: InferWindowGroupLimit causes bug if the other window functions 
haven't the same window frame as the rank-like functions  (was: 
InferWindowGroupLimit causes bug if the window frame is different between 
rank-like functions and others)

> InferWindowGroupLimit causes bug if the other window functions haven't the 
> same window frame as the rank-like functions
> ---
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss, pull-request-available
>
> First, it's my first bug, so I'm hoping I'm doing it right, also, as I'm not 
> very knowledgeable about spark internals, I hope I diagnosed the problem 
> correctly
> I found the degradation in spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window holds a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> for the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`, however, the same result happens with 
> other functions, such as max, min, count, etc)
>  
> Now, if instead of using the two windows we used, I will use the first window 
> and a window with different ordering, or create a completely new window with 
> same partition but no ordering, it will work fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> 

[jira] [Updated] (SPARK-45543) InferWindowGroupLimit causes bug if the window frame is different between rank-like functions and others

2023-10-18 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: InferWindowGroupLimit causes bug if the window frame is different 
between rank-like functions and others  (was: InferWindowGroupLimit causes bug 
if the other window functions haven't the same window frame as the rank-like 
functions)

> InferWindowGroupLimit causes bug if the window frame is different between 
> rank-like functions and others
> 
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss, pull-request-available
>
> First, it's my first bug, so I'm hoping I'm doing it right, also, as I'm not 
> very knowledgeable about spark internals, I hope I diagnosed the problem 
> correctly
> I found the degradation in spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window holds a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> for the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`, however, the same result happens with 
> other functions, such as max, min, count, etc)
>  
> Now, if instead of using the two windows we used, I will use the first window 
> and a window with different ordering, or create a completely new window with 
> same partition but no ordering, it will work fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> windowspecdefinition(row_id#1L, name#0, 

[jira] [Updated] (SPARK-45543) InferWindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions

2023-10-18 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: InferWindowGroupLimit causes bug if the other window functions 
haven't the same window frame as the rank-like functions  (was: 
WindowGroupLimit causes bug if the other window functions haven't the same 
window frame as the rank-like functions)

> InferWindowGroupLimit causes bug if the other window functions haven't the 
> same window frame as the rank-like functions
> ---
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss, pull-request-available
>
> First, it's my first bug, so I'm hoping I'm doing it right, also, as I'm not 
> very knowledgeable about spark internals, I hope I diagnosed the problem 
> correctly
> I found the degradation in spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window holds a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> for the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`, however, the same result happens with 
> other functions, such as max, min, count, etc)
>  
> Now, if instead of using the two windows we used, I will use the first window 
> and a window with different ordering, or create a completely new window with 
> same partition but no ordering, it will work fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> 

[jira] [Resolved] (SPARK-45484) Fix the bug that uses incorrect parquet compression codec lz4raw

2023-10-16 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45484.

Fix Version/s: 3.5.1
   Resolution: Fixed

Issue resolved by pull request 43330
[https://github.com/apache/spark/pull/43330]

> Fix the bug that uses incorrect parquet compression codec lz4raw
> 
>
> Key: SPARK-45484
> URL: https://issues.apache.org/jira/browse/SPARK-45484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1
>
>
> lz4raw is not a correct parquet compression codec name.
> We should use lz4_raw as its name.
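For illustration (assuming a SparkSession named `spark`; the config key is the usual Parquet compression option, and the output path is made up), the corrected name would be set like this:
{code:scala}
spark.conf.set("spark.sql.parquet.compression.codec", "lz4_raw")
spark.range(10).write.parquet("/tmp/lz4_raw_example")  // hypothetical output path
{code}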



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45543) WindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions

2023-10-16 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: WindowGroupLimit causes bug if the other window functions haven't 
the same window frame as the rank-like functions  (was: WindowGroupLimit causes 
bug if window expressions exists non-rank function)
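(Editorial sketch for readers following the summary renames above: the condition can be checked empirically by inspecting the physical plan for a WindowGroupLimit node, reusing the reporter's {{df2}} from the quoted description below.)

{code:python}
# Sketch only; assumes df2 from the reproduction quoted below has been built.
# On 3.5.0 the optimizer inserts a WindowGroupLimit node into this plan once
# the rank-like column is filtered, even though collect_list still needs the
# whole partition; that truncation is what produces the wrong all_scores values.
df2.filter("rn=1").explain(mode="formatted")
{code}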

> WindowGroupLimit causes bug if the other window functions haven't the same 
> window frame as the rank-like functions
> --
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss, pull-request-available
>
> First, this is my first bug report, so I hope I'm doing it right. Also, as I'm
> not very knowledgeable about Spark internals, I hope I diagnosed the problem
> correctly.
> I found this regression in Spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window is a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> by the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see, just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`; however, the same result happens with
> other functions, such as max, min, count, etc.)
>  
> Now, if instead of the two windows above I use the first window together with
> a window that has a different ordering, or create a completely new window with
> the same partition but no ordering, it works fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> windowspecdefinition(row_id#1L, name#0, year#3L DESC 

[jira] [Commented] (SPARK-45543) WindowGroupLimit causes bug if the other window functions haven't the same window frame as the rank-like functions

2023-10-16 Thread Jiaan Geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775974#comment-17775974
 ] 

Jiaan Geng commented on SPARK-45543:


[~ronserruya] Thank you.

> WindowGroupLimit causes bug if the other window functions haven't the same 
> window frame as the rank-like functions
> --
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss, pull-request-available
>
> First, this is my first bug report, so I hope I'm doing it right. Also, as I'm
> not very knowledgeable about Spark internals, I hope I diagnosed the problem
> correctly.
> I found this regression in Spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window is a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> by the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see, just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`; however, the same result happens with
> other functions, such as max, min, count, etc.)
>  
> Now, if instead of the two windows above I use the first window together with
> a window that has a different ordering, or create a completely new window with
> the same partition but no ordering, it works fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> windowspecdefinition(row_id#1L, name#0, year#3L DESC NULLS LAST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS all_scores#16], [row_id#1L, name#0], [year#3L DESC 

[jira] [Updated] (SPARK-45543) WindowGroupLimit causes bug if window expressions exists non-rank function

2023-10-16 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: WindowGroupLimit causes bug if window expressions exists non-rank 
function  (was: WindowGroupLimit causes bug if window expressions exists 
non-rank window function)

> WindowGroupLimit causes bug if window expressions exists non-rank function
> --
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss
>
> First, this is my first bug report, so I hope I'm doing it right. Also, as I'm
> not very knowledgeable about Spark internals, I hope I diagnosed the problem
> correctly.
> I found this regression in Spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window is a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> by the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see, just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`; however, the same result happens with
> other functions, such as max, min, count, etc.)
>  
> Now, if instead of the two windows above I use the first window together with
> a window that has a different ordering, or create a completely new window with
> the same partition but no ordering, it works fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> windowspecdefinition(row_id#1L, name#0, year#3L DESC NULLS LAST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS all_scores#16], [row_id#1L, name#0], 

[jira] [Updated] (SPARK-45543) WindowGroupLimit causes bug if window expressions exists non-rank window function

2023-10-16 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: WindowGroupLimit causes bug if window expressions exists non-rank 
window function  (was: Cannot insert WindowGroupLimit if window expressions 
exists non-rank window function)

> WindowGroupLimit causes bug if window expressions exists non-rank window 
> function
> -
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss
>
> First, this is my first bug report, so I hope I'm doing it right. Also, as I'm
> not very knowledgeable about Spark internals, I hope I diagnosed the problem
> correctly.
> I found this regression in Spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window is a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> by the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see, just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`; however, the same result happens with
> other functions, such as max, min, count, etc.)
>  
> Now, if instead of the two windows above I use the first window together with
> a window that has a different ordering, or create a completely new window with
> the same partition but no ordering, it works fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> windowspecdefinition(row_id#1L, name#0, year#3L DESC NULLS LAST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS all_scores#16], 

[jira] [Updated] (SPARK-45543) Cannot insert WindowGroupLimit if window expressions exists non-rank window function

2023-10-16 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: Cannot insert WindowGroupLimit if window expressions exists 
non-rank window function  (was: WindowGroupLimit causes incorrect results if 
window expressions exists non-rank window function)

> Cannot insert WindowGroupLimit if window expressions exists non-rank window 
> function
> 
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss
>
> First, this is my first bug report, so I hope I'm doing it right. Also, as I'm
> not very knowledgeable about Spark internals, I hope I diagnosed the problem
> correctly.
> I found this regression in Spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window is a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> by the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see, just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`; however, the same result happens with
> other functions, such as max, min, count, etc.)
>  
> Now, if instead of the two windows above I use the first window together with
> a window that has a different ordering, or create a completely new window with
> the same partition but no ordering, it works fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> windowspecdefinition(row_id#1L, name#0, year#3L DESC NULLS LAST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> 

[jira] [Updated] (SPARK-45543) WindowGroupLimit Causes incorrect results when window expressions exists non-rank function

2023-10-16 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45543:
---
Summary: WindowGroupLimit Causes incorrect results when window expressions 
exists non-rank function  (was: WindowGroupLimit Causes incorrect results when 
multiple windows are used)

> WindowGroupLimit Causes incorrect results when window expressions exists 
> non-rank function
> --
>
> Key: SPARK-45543
> URL: https://issues.apache.org/jira/browse/SPARK-45543
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0
>Reporter: Ron Serruya
>Assignee: Jiaan Geng
>Priority: Critical
>  Labels: correctness, data-loss
>
> First, this is my first bug report, so I hope I'm doing it right. Also, as I'm
> not very knowledgeable about Spark internals, I hope I diagnosed the problem
> correctly.
> I found this regression in Spark version 3.5.0:
> When using multiple windows that share the same partition and ordering (but
> with different "frame boundaries"), where one window is a ranking function,
> "WindowGroupLimit" is added to the plan, causing wrong values to be produced
> by the other windows.
> *This behavior didn't exist in versions 3.3 and 3.4.*
> Example:
>  
> {code:python}
> import pyspark
> from pyspark.sql import functions as F, Window
> df = spark.createDataFrame([
> {'row_id': 1, 'name': 'Dave', 'score': 1, 'year': 2020},
> {'row_id': 1, 'name': 'Dave', 'score': 2, 'year': 2022},
> {'row_id': 1, 'name': 'Dave', 'score': 3, 'year': 2023},
> {'row_id': 2, 'name': 'Amy', 'score': 6, 'year': 2021},
> ])
> # Create first window for row number
> window_spec = Window.partitionBy('row_id', 'name').orderBy(F.desc('year'))
> # Create additional window from the first window with unbounded frame
> unbound_spec = window_spec.rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)
> # Try to keep the first row by year, and also collect all scores into a list
> df2 = df.withColumn(
> 'rn', 
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores', 
> F.collect_list('score').over(unbound_spec)
> ){code}
> So far everything works, and if we display df2:
>  
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Dave|1 |2|2022|2  |[3, 2, 1] |
> |Dave|1 |1|2020|3  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
>  
> However, once we filter to keep only the first row number:
>  
> {noformat}
> df2.filter("rn=1").show(truncate=False)
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3]   |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+{noformat}
> As you can see, just filtering changed the "all_scores" array for Dave.
> (This example uses `collect_list`; however, the same result happens with
> other functions, such as max, min, count, etc.)
>  
> Now, if instead of the two windows above I use the first window together with
> a window that has a different ordering, or create a completely new window with
> the same partition but no ordering, it works fine:
> {code:python}
> new_window = Window.partitionBy('row_id', 
> 'name').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
> df3 = df.withColumn(
> 'rn',
> F.row_number().over(window_spec)
> ).withColumn(
> 'all_scores',
> F.collect_list('score').over(new_window)
> )
> df3.filter("rn=1").show(truncate=False){code}
> {noformat}
> ++--+-++---+--+
> |name|row_id|score|year|rn |all_scores|
> ++--+-++---+--+
> |Dave|1 |3|2023|1  |[3, 2, 1] |
> |Amy |2 |6|2021|1  |[6]   |
> ++--+-++---+--+
> {noformat}
> In addition, if we use all 3 windows to create 3 different columns, it will 
> also work ok. So it seems the issue happens only when all the windows used 
> share the same partition and ordering.
> Here is the final plan for the faulty dataframe:
> {noformat}
> df2.filter("rn=1").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Filter (rn#9 = 1)
>    +- Window [row_number() windowspecdefinition(row_id#1L, name#0, year#3L 
> DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), 
> currentrow$())) AS rn#9, collect_list(score#2L, 0, 0) 
> windowspecdefinition(row_id#1L, name#0, year#3L DESC NULLS LAST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) 
> AS 
