[jira] [Created] (SPARK-42498) reduce spark connect service retry time

2023-02-19 Thread Niranjan Jayakar (Jira)
Niranjan Jayakar created SPARK-42498:


 Summary: reduce spark connect service retry time
 Key: SPARK-42498
 URL: https://issues.apache.org/jira/browse/SPARK-42498
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.3.2
Reporter: Niranjan Jayakar


https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411

 

Currently, 15 retries with the current backoff strategy result in the client 
sitting in the retry loop for ~400 seconds in the worst case. This means 
applications and users of the Spark Connect client can hang for more than 
6 minutes with no response.
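
For a rough sense of where that ~400 seconds comes from, here is a minimal sketch 
that sums a capped exponential backoff over 15 attempts. The initial delay, 
multiplier, and cap below are illustrative assumptions, not the exact constants 
used in client.py.

{code:python}
# Hypothetical sketch: worst-case time spent in the Spark Connect retry loop.
# The constants (15 attempts, 50 ms initial delay, 4x multiplier, 30 s cap)
# are illustrative assumptions, not the exact values from client.py.

def worst_case_retry_seconds(retries=15, initial_s=0.05, multiplier=4.0, cap_s=30.0):
    """Sum a capped exponential backoff over all retry attempts."""
    total, delay = 0.0, initial_s
    for _ in range(retries):
        total += min(delay, cap_s)
        delay *= multiplier
    return total

print(f"worst case ~ {worst_case_retry_seconds():.0f} s")  # several hundred seconds
{code}

Reducing the number of retries or lowering the backoff cap shrinks that worst 
case directly.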



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42497) Basic support of pandas API on Spark for Spark Connect.

2023-02-19 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-42497:
---

 Summary: Basic support of pandas API on Spark for Spark Connect.
 Key: SPARK-42497
 URL: https://issues.apache.org/jira/browse/SPARK-42497
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should enable `pandas API on Spark` on Spark Connect.
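
As a rough illustration of what this would enable, here is a minimal sketch of 
using the pandas API on Spark against a Spark Connect session; the 
`sc://localhost:15002` endpoint is an assumed placeholder.

{code:python}
# Hypothetical usage once pandas API on Spark is supported over Spark Connect.
# The connect endpoint below is a placeholder assumption.
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

psdf = ps.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(psdf.describe())
{code}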



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42497) Basic support of pandas API on Spark for Spark Connect.

2023-02-19 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691018#comment-17691018
 ] 

Haejoon Lee commented on SPARK-42497:
-

I'm working on this.

Will submit a PR soon.

> Basic support of pandas API on Spark for Spark Connect.
> ---
>
> Key: SPARK-42497
> URL: https://issues.apache.org/jira/browse/SPARK-42497
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should enable `pandas API on Spark` on Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42496) Introduction Spark Connect at main page.

2023-02-19 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691017#comment-17691017
 ] 

Haejoon Lee commented on SPARK-42496:
-

I'm working on it.

> Introduction Spark Connect at main page.
> 
>
> Key: SPARK-42496
> URL: https://issues.apache.org/jira/browse/SPARK-42496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Documentation
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should document an introduction to Spark Connect on the PySpark main 
> documentation page to give users a summary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42496) Introduction Spark Connect at main page.

2023-02-19 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-42496:
---

 Summary: Introduction Spark Connect at main page.
 Key: SPARK-42496
 URL: https://issues.apache.org/jira/browse/SPARK-42496
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Documentation
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We should document an introduction to Spark Connect on the PySpark main 
documentation page to give users a summary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42475) Getting Started: Live Notebook for Spark Connect

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42475:


Assignee: (was: Apache Spark)

> Getting Started: Live Notebook for Spark Connect
> 
>
> Key: SPARK-42475
> URL: https://issues.apache.org/jira/browse/SPARK-42475
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> It would be great to have a Live Notebook for Spark Connect in the [Getting 
> Started|https://spark.apache.org/docs/latest/api/python/getting_started/index.html]
> section to help users quick-start with Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42475) Getting Started: Live Notebook for Spark Connect

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691016#comment-17691016
 ] 

Apache Spark commented on SPARK-42475:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/40092

> Getting Started: Live Notebook for Spark Connect
> 
>
> Key: SPARK-42475
> URL: https://issues.apache.org/jira/browse/SPARK-42475
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> It would be great to have a Live Notebook for Spark Connect in the [Getting 
> Started|https://spark.apache.org/docs/latest/api/python/getting_started/index.html]
> section to help users quick-start with Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42475) Getting Started: Live Notebook for Spark Connect

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42475:


Assignee: Apache Spark

> Getting Started: Live Notebook for Spark Connect
> 
>
> Key: SPARK-42475
> URL: https://issues.apache.org/jira/browse/SPARK-42475
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> It would be great to have a Live Notebook for Spark Connect in the [Getting 
> Started|https://spark.apache.org/docs/latest/api/python/getting_started/index.html]
> section to help users quick-start with Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41952:


Assignee: Apache Spark

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> --
>
> Key: SPARK-41952
> URL: https://issues.apache.org/jira/browse/SPARK-41952
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.3, 3.3.1, 3.2.3
>Reporter: Alexey Kudinkin
>Assignee: Apache Spark
>Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet in conjunction with 
> its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is problematic to the point where we can't use Parquet with Zstd, due to 
> pervasive OOMs taking down our executors and disrupting our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> [https://github.com/apache/parquet-mr/pull/982]
>  
> Now, we just need to make sure that
>  # an updated version of Parquet is released in a timely manner, and
>  # Spark is upgraded to this new version in the upcoming release.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691015#comment-17691015
 ] 

Apache Spark commented on SPARK-41952:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/40091

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> --
>
> Key: SPARK-41952
> URL: https://issues.apache.org/jira/browse/SPARK-41952
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.3, 3.3.1, 3.2.3
>Reporter: Alexey Kudinkin
>Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet in conjunction with 
> its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is problematic to the point where we can't use Parquet with Zstd, due to 
> pervasive OOMs taking down our executors and disrupting our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> [https://github.com/apache/parquet-mr/pull/982]
>  
> Now, we just need to make sure that
>  # an updated version of Parquet is released in a timely manner, and
>  # Spark is upgraded to this new version in the upcoming release.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41952:


Assignee: (was: Apache Spark)

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> --
>
> Key: SPARK-41952
> URL: https://issues.apache.org/jira/browse/SPARK-41952
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.3, 3.3.1, 3.2.3
>Reporter: Alexey Kudinkin
>Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet in conjunction with 
> its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is problematic to the point where we can't use Parquet with Zstd, due to 
> pervasive OOMs taking down our executors and disrupting our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> [https://github.com/apache/parquet-mr/pull/982]
>  
> Now, we just need to make sure that
>  # an updated version of Parquet is released in a timely manner, and
>  # Spark is upgraded to this new version in the upcoming release.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-02-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691014#comment-17691014
 ] 

Dongjoon Hyun commented on SPARK-41952:
---

May I ask why you put me in the `Shepherd` field, [~alexey.kudinkin]? Let me 
remove myself from there first.

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> --
>
> Key: SPARK-41952
> URL: https://issues.apache.org/jira/browse/SPARK-41952
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.3, 3.3.1, 3.2.3
>Reporter: Alexey Kudinkin
>Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet in conjunction with 
> its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is problematic to the point where we can't use Parquet with Zstd, due to 
> pervasive OOMs taking down our executors and disrupting our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> [https://github.com/apache/parquet-mr/pull/982]
>  
> Now, we just need to make sure that
>  # an updated version of Parquet is released in a timely manner, and
>  # Spark is upgraded to this new version in the upcoming release.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-02-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41952:
--
Shepherd:   (was: Dongjoon Hyun)

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> --
>
> Key: SPARK-41952
> URL: https://issues.apache.org/jira/browse/SPARK-41952
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.3, 3.3.1, 3.2.3
>Reporter: Alexey Kudinkin
>Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet in conjunction with 
> its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is problematic to the point where we can't use Parquet with Zstd, due to 
> pervasive OOMs taking down our executors and disrupting our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> [https://github.com/apache/parquet-mr/pull/982]
>  
> Now, we just need to make sure that
>  # an updated version of Parquet is released in a timely manner, and
>  # Spark is upgraded to this new version in the upcoming release.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-02-19 Thread Cheng Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691012#comment-17691012
 ] 

Cheng Pan commented on SPARK-41952:
---

A fix on the Spark side is feasible; I'm working on it.

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> --
>
> Key: SPARK-41952
> URL: https://issues.apache.org/jira/browse/SPARK-41952
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.3, 3.3.1, 3.2.3
>Reporter: Alexey Kudinkin
>Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet in conjunction with 
> its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is problematic to the point where we can't use Parquet with Zstd, due to 
> pervasive OOMs taking down our executors and disrupting our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> [https://github.com/apache/parquet-mr/pull/982]
>  
> Now, we just need to make sure that
>  # an updated version of Parquet is released in a timely manner, and
>  # Spark is upgraded to this new version in the upcoming release.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42286) Fix internal error for valid CASE WHEN expression with CAST when inserting into a table

2023-02-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-42286:
---

Fix Version/s: 3.4.0
 Assignee: Runyao.Chen

> Fix internal error for valid CASE WHEN expression with CAST when inserting 
> into a table
> ---
>
> Key: SPARK-42286
> URL: https://issues.apache.org/jira/browse/SPARK-42286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Runyao.Chen
>Assignee: Runyao.Chen
>Priority: Major
> Fix For: 3.4.0
>
>
> ```
> spark-sql> create or replace table es570639t1 as select x FROM values (1), 
> (2), (3) as tab(x);
> spark-sql> create or replace table es570639t2 (x Decimal(9, 0));
> spark-sql> insert into es570639t2 select 0 - (case when x = 1 then 1 else x 
> end) from es570639t1 where x = 1;
> ```
> hits the following internal error
> org.apache.spark.SparkException: [INTERNAL_ERROR] Child is not Cast or 
> ExpressionProxy of Cast
>  
> Stack trace:
> org.apache.spark.SparkException: [INTERNAL_ERROR] Child is not Cast or 
> ExpressionProxy of Cast at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:78) at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:82) at 
> org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.checkChild(Cast.scala:2693)
>  at 
> org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2697)
>  at 
> org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2683)
>  at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.$anonfun$mapChildren$5(TreeNode.scala:1315)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:106)
>  at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314)
>  at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1309)
>  at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:636)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:570)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:570)
>  
> This internal error comes from `CheckOverflowInTableInsert`'s `checkChild`, 
> which covers only the `Cast` and `ExpressionProxy` expressions, but not the 
> `CaseWhen` expression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42473) An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL

2023-02-19 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691011#comment-17691011
 ] 

Yuming Wang commented on SPARK-42473:
-

It seems we should backport https://github.com/apache/spark/pull/39855.

> An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL
> --
>
> Key: SPARK-42473
> URL: https://issues.apache.org/jira/browse/SPARK-42473
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.1
> Environment: spark 3.3.1
>Reporter: kevinshin
>Priority: Major
>
> *When 'union all' combines one select statement that uses a Literal as a column 
> value with another select statement that has a computed expression in the same 
> column, the whole statement fails to compile. An explicit cast is needed.*
> for example:
> {color:#4c9aff}explain{color}
> {color:#4c9aff}*INSERT* OVERWRITE *TABLE* test.spark33_decimal_orc{color}
> {color:#4c9aff}*select* *null* *as* amt1, {*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2{color}
> {color:#4c9aff}*union* *all*{color}
> {color:#4c9aff}*select* {*}cast{*}('200.99' *as* 
> {*}decimal{*}(20,8)){*}/{*}100 *as* amt1,{*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2;{color}
> *will get the error:* 
> org.apache.spark.{*}sql{*}.catalyst.expressions.Literal cannot be *cast* *to* 
> org.apache.spark.{*}sql{*}.catalyst.expressions.AnsiCast
> The SQL needs to change to: 
> {color:#4c9aff}explain{color}
> {color:#4c9aff}*INSERT* OVERWRITE *TABLE* test.spark33_decimal_orc{color}
> {color:#4c9aff}*select* *null* *as* amt1,{*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2{color}
> {color:#4c9aff}*union* *all*{color}
> {color:#4c9aff}*select* {color:#de350b}{*}cast{*}({color}{*}cast{*}('200.99' 
> *as* {*}decimal{*}(20,8)){*}/{*}100 *as* 
> {*}decimal{*}(20,8){color:#de350b}){color} *as* amt1,{*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2;{color}
>  
> *but this is not needed in Spark 3.2.1; is this a bug in Spark 3.3.1?* 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41741:


Assignee: (was: Apache Spark)

> [SQL] ParquetFilters StringStartsWith push down matching string do not use 
> UTF-8
> 
>
> Key: SPARK-41741
> URL: https://issues.apache.org/jira/browse/SPARK-41741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiale He
>Priority: Major
> Attachments: image-2022-12-28-18-00-00-861.png, 
> image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, 
> image-2023-01-09-18-27-53-479.png, 
> part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
>
> Hello ~
>  
> I found a problem, and there are two ways to work around it.
>  
> With Parquet filter pushdown enabled, querying with a like '***%' predicate 
> may cause an error if the system default encoding is not UTF-8.
>  
> There are two ways to bypass this problem, as far as I know:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>  
> The following is the information to reproduce this problem
> The parquet sample file is in the attachment
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>  
>   !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>  
> I think the correct code should be:
> {code:java}
> private val strToBinary = 
> Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41741:


Assignee: Apache Spark

> [SQL] ParquetFilters StringStartsWith push down matching string do not use 
> UTF-8
> 
>
> Key: SPARK-41741
> URL: https://issues.apache.org/jira/browse/SPARK-41741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiale He
>Assignee: Apache Spark
>Priority: Major
> Attachments: image-2022-12-28-18-00-00-861.png, 
> image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, 
> image-2023-01-09-18-27-53-479.png, 
> part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
>
> Hello ~
>  
> I found a problem, and there are two ways to work around it.
>  
> With Parquet filter pushdown enabled, querying with a like '***%' predicate 
> may cause an error if the system default encoding is not UTF-8.
>  
> There are two ways to bypass this problem, as far as I know:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>  
> The following is the information to reproduce this problem
> The parquet sample file is in the attachment
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>  
>   !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>  
> I think the correct code should be:
> {code:java}
> private val strToBinary = 
> Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691010#comment-17691010
 ] 

Apache Spark commented on SPARK-41741:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/40090

> [SQL] ParquetFilters StringStartsWith push down matching string do not use 
> UTF-8
> 
>
> Key: SPARK-41741
> URL: https://issues.apache.org/jira/browse/SPARK-41741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiale He
>Priority: Major
> Attachments: image-2022-12-28-18-00-00-861.png, 
> image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, 
> image-2023-01-09-18-27-53-479.png, 
> part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
>
> Hello ~
>  
> I found a problem, and there are two ways to work around it.
>  
> With Parquet filter pushdown enabled, querying with a like '***%' predicate 
> may cause an error if the system default encoding is not UTF-8.
>  
> There are two ways to bypass this problem, as far as I know:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>  
> The following is the information to reproduce this problem
> The parquet sample file is in the attachment
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>  
>   !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>  
> I think the correct code should be:
> {code:java}
> private val strToBinary = 
> Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42488) Upgrade commons-crypto from 1.1.0 to 1.2.0

2023-02-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42488:
-

Assignee: Yang Jie

> Upgrade commons-crypto from 1.1.0 to 1.2.0
> --
>
> Key: SPARK-42488
> URL: https://issues.apache.org/jira/browse/SPARK-42488
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> https://github.com/apache/commons-crypto/compare/rel/commons-crypto-1.1.0...rel/commons-crypto-1.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42488) Upgrade commons-crypto from 1.1.0 to 1.2.0

2023-02-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42488.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40082
[https://github.com/apache/spark/pull/40082]

> Upgrade commons-crypto from 1.1.0 to 1.2.0
> --
>
> Key: SPARK-42488
> URL: https://issues.apache.org/jira/browse/SPARK-42488
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> https://github.com/apache/commons-crypto/compare/rel/commons-crypto-1.1.0...rel/commons-crypto-1.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42495) Scala Client: Add 2nd batch of functions

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691001#comment-17691001
 ] 

Apache Spark commented on SPARK-42495:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/40089

> Scala Client: Add 2nd batch of functions
> 
>
> Key: SPARK-42495
> URL: https://issues.apache.org/jira/browse/SPARK-42495
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42495) Scala Client: Add 2nd batch of functions

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42495:


Assignee: Herman van Hövell  (was: Apache Spark)

> Scala Client: Add 2nd batch of functions
> 
>
> Key: SPARK-42495
> URL: https://issues.apache.org/jira/browse/SPARK-42495
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42495) Scala Client: Add 2nd batch of functions

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42495:


Assignee: Apache Spark  (was: Herman van Hövell)

> Scala Client: Add 2nd batch of functions
> 
>
> Key: SPARK-42495
> URL: https://issues.apache.org/jira/browse/SPARK-42495
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42495) Scala Client: Add 2nd batch of functions

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691002#comment-17691002
 ] 

Apache Spark commented on SPARK-42495:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/40089

> Scala Client: Add 2nd batch of functions
> 
>
> Key: SPARK-42495
> URL: https://issues.apache.org/jira/browse/SPARK-42495
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42495) Scala Client: Add 2nd batch of functions

2023-02-19 Thread Jira
Herman van Hövell created SPARK-42495:
-

 Summary: Scala Client: Add 2nd batch of functions
 Key: SPARK-42495
 URL: https://issues.apache.org/jira/browse/SPARK-42495
 Project: Spark
  Issue Type: Task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42494) Add official image Dockerfile for Spark v3.3.2

2023-02-19 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-42494:
---

 Summary: Add official image Dockerfile for Spark v3.3.2
 Key: SPARK-42494
 URL: https://issues.apache.org/jira/browse/SPARK-42494
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Docker
Affects Versions: 3.3.2
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42467) Spark Connect Scala Client: GroupBy and Aggregation

2023-02-19 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-42467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690997#comment-17690997
 ] 

Herman van Hövell commented on SPARK-42467:
---

Small reminder: please add tests for the grouping/grouping_id functions as soon as 
we implement cube/rollup/groupingsets.

> Spark Connect Scala Client: GroupBy and Aggregation
> ---
>
> Key: SPARK-42467
> URL: https://issues.apache.org/jira/browse/SPARK-42467
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first code example tab

2023-02-19 Thread Allan Folting (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Folting updated SPARK-42493:
--
Summary: Spark SQL, DataFrames and Datasets Guide - make Python the first 
code example tab  (was: Spark SQL, DataFrames and Datasets Guide - make Python 
the first example tab)

> Spark SQL, DataFrames and Datasets Guide - make Python the first code example 
> tab
> -
>
> Key: SPARK-42493
> URL: https://issues.apache.org/jira/browse/SPARK-42493
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Allan Folting
>Priority: Major
>
> Python is the most approachable and most popular language, so it should be 
> the primary language in examples, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38427) DataFilter pushed down with PartitionFilter for Orc

2023-02-19 Thread Jackey Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690971#comment-17690971
 ] 

Jackey Lee commented on SPARK-38427:


[~LuciferYang] 

> DataFilter pushed down with PartitionFilter for Orc
> ---
>
> Key: SPARK-38427
> URL: https://issues.apache.org/jira/browse/SPARK-38427
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, for the ORC data source, the filter is split into a DataFilter and 
> a PartitionFilter when it is pushed down. Because the PartitionFilter is removed 
> from the pushed-down filter, all partitions scan with all DataFilter conditions, 
> which may cause a full data scan.
> Based on SPARK-38041, we can push the DataFilter down to ORC together with the 
> PartitionFilter, and remove the PartitionFilter at runtime.
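
To make the two filter kinds concrete, here is a minimal sketch under assumed 
paths and column names: `dt` plays the role of the partition column 
(PartitionFilter) and `status` the role of a data column (DataFilter).

{code:python}
# Hypothetical sketch illustrating DataFilter vs. PartitionFilter on ORC.
# The path and the column names (dt, status) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.orc("/data/events_orc")  # assumed to be partitioned by dt

# dt = '2023-02-19' is a PartitionFilter: it prunes partition directories.
# status = 'ERROR' is a DataFilter: it is pushed to the ORC reader.
# The ticket proposes pushing both down together and dropping the partition
# predicate at runtime instead of re-evaluating it for every partition.
result = df.filter((df.dt == "2023-02-19") & (df.status == "ERROR"))
result.show()
{code}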



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-38427) DataFilter pushed down with PartitionFilter for Orc

2023-02-19 Thread Jackey Lee (Jira)


[ https://issues.apache.org/jira/browse/SPARK-38427 ]


Jackey Lee deleted comment on SPARK-38427:


was (Author: jackey lee):
[~LuciferYang] 

> DataFilter pushed down with PartitionFilter for Orc
> ---
>
> Key: SPARK-38427
> URL: https://issues.apache.org/jira/browse/SPARK-38427
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, for the ORC data source, the filter is split into a DataFilter and 
> a PartitionFilter when it is pushed down. Because the PartitionFilter is removed 
> from the pushed-down filter, all partitions scan with all DataFilter conditions, 
> which may cause a full data scan.
> Based on SPARK-38041, we can push the DataFilter down to ORC together with the 
> PartitionFilter, and remove the PartitionFilter at runtime.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42427) Conv should return an error if the internal conversion overflows

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690970#comment-17690970
 ] 

Apache Spark commented on SPARK-42427:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/40088

> Conv should return an error if the internal conversion overflows
> 
>
> Key: SPARK-42427
> URL: https://issues.apache.org/jira/browse/SPARK-42427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38427) DataFilter pushed down with PartitionFilter for Orc

2023-02-19 Thread Jackey Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690969#comment-17690969
 ] 

Jackey Lee commented on SPARK-38427:


[~LuciferYang] 

> DataFilter pushed down with PartitionFilter for Orc
> ---
>
> Key: SPARK-38427
> URL: https://issues.apache.org/jira/browse/SPARK-38427
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jackey Lee
>Priority: Major
>
> At present, for the ORC data source, the filter is split into a DataFilter and 
> a PartitionFilter when it is pushed down. Because the PartitionFilter is removed 
> from the pushed-down filter, all partitions scan with all DataFilter conditions, 
> which may cause a full data scan.
> Based on SPARK-38041, we can push the DataFilter down to ORC together with the 
> PartitionFilter, and remove the PartitionFilter at runtime.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42473) An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL

2023-02-19 Thread kevinshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690965#comment-17690965
 ] 

kevinshin edited comment on SPARK-42473 at 2/20/23 1:47 AM:


[~yumwang]  'What is your test.spark33_decimal_orc column type?'

{color:#4c9aff}*CREATE* *TABLE* *IF* *NOT* *EXISTS* 
test.spark33_decimal_orc({color}

{color:#4c9aff}   amt1        {*}decimal{*}(20,8),{color}

{color:#4c9aff}   amt2        {*}decimal{*}(20,8){color}

{color:#4c9aff})STORED *AS* ORC;{color}


was (Author: JIRAUSER281772):
[~yumwang]  'What is your test.spark33_decimal_orc column type?'

*CREATE* *TABLE* *IF* *NOT* *EXISTS* test.spark33_decimal_orc(

   amt1        *decimal*(20,8),

   amt2        *decimal*(20,8)

)STORED *AS* ORC;

> An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL
> --
>
> Key: SPARK-42473
> URL: https://issues.apache.org/jira/browse/SPARK-42473
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.1
> Environment: spark 3.3.1
>Reporter: kevinshin
>Priority: Major
>
> *When 'union all' combines one select statement that uses a Literal as a column 
> value with another select statement that has a computed expression in the same 
> column, the whole statement fails to compile. An explicit cast is needed.*
> for example:
> {color:#4c9aff}explain{color}
> {color:#4c9aff}*INSERT* OVERWRITE *TABLE* test.spark33_decimal_orc{color}
> {color:#4c9aff}*select* *null* *as* amt1, {*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2{color}
> {color:#4c9aff}*union* *all*{color}
> {color:#4c9aff}*select* {*}cast{*}('200.99' *as* 
> {*}decimal{*}(20,8)){*}/{*}100 *as* amt1,{*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2;{color}
> *will get the error:* 
> org.apache.spark.{*}sql{*}.catalyst.expressions.Literal cannot be *cast* *to* 
> org.apache.spark.{*}sql{*}.catalyst.expressions.AnsiCast
> The SQL needs to change to: 
> {color:#4c9aff}explain{color}
> {color:#4c9aff}*INSERT* OVERWRITE *TABLE* test.spark33_decimal_orc{color}
> {color:#4c9aff}*select* *null* *as* amt1,{*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2{color}
> {color:#4c9aff}*union* *all*{color}
> {color:#4c9aff}*select* {color:#de350b}{*}cast{*}({color}{*}cast{*}('200.99' 
> *as* {*}decimal{*}(20,8)){*}/{*}100 *as* 
> {*}decimal{*}(20,8){color:#de350b}){color} *as* amt1,{*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2;{color}
>  
> *but this is not needed in Spark 3.2.1; is this a bug in Spark 3.3.1?* 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42473) An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL

2023-02-19 Thread kevinshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690965#comment-17690965
 ] 

kevinshin commented on SPARK-42473:
---

[~yumwang]  'What is your test.spark33_decimal_orc column type?'

*CREATE* *TABLE* *IF* *NOT* *EXISTS* test.spark33_decimal_orc(

   amt1        *decimal*(20,8),

   amt2        *decimal*(20,8)

)STORED *AS* ORC;

> An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL
> --
>
> Key: SPARK-42473
> URL: https://issues.apache.org/jira/browse/SPARK-42473
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.1
> Environment: spark 3.3.1
>Reporter: kevinshin
>Priority: Major
>
> *When 'union all' combines one select statement that uses a Literal as a column 
> value with another select statement that has a computed expression in the same 
> column, the whole statement fails to compile. An explicit cast is needed.*
> for example:
> {color:#4c9aff}explain{color}
> {color:#4c9aff}*INSERT* OVERWRITE *TABLE* test.spark33_decimal_orc{color}
> {color:#4c9aff}*select* *null* *as* amt1, {*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2{color}
> {color:#4c9aff}*union* *all*{color}
> {color:#4c9aff}*select* {*}cast{*}('200.99' *as* 
> {*}decimal{*}(20,8)){*}/{*}100 *as* amt1,{*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2;{color}
> *will get the error:* 
> org.apache.spark.{*}sql{*}.catalyst.expressions.Literal cannot be *cast* *to* 
> org.apache.spark.{*}sql{*}.catalyst.expressions.AnsiCast
> The SQL needs to change to: 
> {color:#4c9aff}explain{color}
> {color:#4c9aff}*INSERT* OVERWRITE *TABLE* test.spark33_decimal_orc{color}
> {color:#4c9aff}*select* *null* *as* amt1,{*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2{color}
> {color:#4c9aff}*union* *all*{color}
> {color:#4c9aff}*select* {color:#de350b}{*}cast{*}({color}{*}cast{*}('200.99' 
> *as* {*}decimal{*}(20,8)){*}/{*}100 *as* 
> {*}decimal{*}(20,8){color:#de350b}){color} *as* amt1,{*}cast{*}('256.99' *as* 
> {*}decimal{*}(20,8)) *as* amt2;{color}
>  
> *but this is not needed in Spark 3.2.1; is this a bug in Spark 3.3.1?* 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first example tab

2023-02-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42493:
--
Affects Version/s: 3.5.0
   (was: 3.4.0)

> Spark SQL, DataFrames and Datasets Guide - make Python the first example tab
> 
>
> Key: SPARK-42493
> URL: https://issues.apache.org/jira/browse/SPARK-42493
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Allan Folting
>Priority: Major
>
> Python is the most approachable and most popular language, so it should be 
> the primary language in examples, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first example tab

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42493:


Assignee: (was: Apache Spark)

> Spark SQL, DataFrames and Datasets Guide - make Python the first example tab
> 
>
> Key: SPARK-42493
> URL: https://issues.apache.org/jira/browse/SPARK-42493
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Allan Folting
>Priority: Major
>
> Python is the most approachable and most popular language, so it should be 
> the primary language in examples, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first example tab

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690958#comment-17690958
 ] 

Apache Spark commented on SPARK-42493:
--

User 'allanf-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/40087

> Spark SQL, DataFrames and Datasets Guide - make Python the first example tab
> 
>
> Key: SPARK-42493
> URL: https://issues.apache.org/jira/browse/SPARK-42493
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Allan Folting
>Priority: Major
>
> Python is the most approachable and most popular language, so it should be 
> the primary language in examples, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first example tab

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42493:


Assignee: Apache Spark

> Spark SQL, DataFrames and Datasets Guide - make Python the first example tab
> 
>
> Key: SPARK-42493
> URL: https://issues.apache.org/jira/browse/SPARK-42493
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Allan Folting
>Assignee: Apache Spark
>Priority: Major
>
> Python is the most approachable and most popular language, so it should be 
> the primary language in examples, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42487) Upgrade Netty to 4.1.89

2023-02-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42487:
-

Assignee: Yang Jie

> Upgrade Netty to 4.1.89
> ---
>
> Key: SPARK-42487
> URL: https://issues.apache.org/jira/browse/SPARK-42487
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> This release contains fixes for two regressions that were introduced by 
> 4.1.88.Final:
>  * Don't fail on HttpObjectDecoder's maxHeaderSize greater than 
> (Integer.MAX_VALUE - 2) ([#13216|https://github.com/netty/netty/pull/13216])
>  * dyld: Symbol not found: _netty_jni_util_JNI_OnLoad when upgrading from 
> 4.1.87.Final to 4.1.88.Final 
> ([#13214|https://github.com/netty/netty/pull/13214])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42487) Upgrade Netty to 4.1.89

2023-02-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42487.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40081
[https://github.com/apache/spark/pull/40081]

> Upgrade Netty to 4.1.89
> ---
>
> Key: SPARK-42487
> URL: https://issues.apache.org/jira/browse/SPARK-42487
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> This release contains fixes for two regressions that were introduced by 
> 4.1.88.Final:
>  * Don't fail on HttpObjectDecoder's maxHeaderSize greater than 
> (Integer.MAX_VALUE - 2) ([#13216|https://github.com/netty/netty/pull/13216])
>  * dyld: Symbol not found: _netty_jni_util_JNI_OnLoad when upgrading from 
> 4.1.87.Final to 4.1.88.Final 
> ([#13214|https://github.com/netty/netty/pull/13214])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42493) Spark SQL, DataFrames and Datasets Guide - make Python the first example tab

2023-02-19 Thread Allan Folting (Jira)
Allan Folting created SPARK-42493:
-

 Summary: Spark SQL, DataFrames and Datasets Guide - make Python 
the first example tab
 Key: SPARK-42493
 URL: https://issues.apache.org/jira/browse/SPARK-42493
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Allan Folting


Python is the most approachable and most popular language, so it should be 
the primary language in examples, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42048) Different column name of lit(np.int8)

2023-02-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42048:


Assignee: Takuya Ueshin

> Different column name of lit(np.int8)
> -
>
> Key: SPARK-42048
> URL: https://issues.apache.org/jira/browse/SPARK-42048
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Takuya Ueshin
>Priority: Minor
>
> {code:java}
> ('1', 'tinyint')
> ('CAST(1 AS TINYINT)', 'tinyint')
> - [('1', 'tinyint')]
> + [('CAST(1 AS TINYINT)', 'tinyint')]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42048) Different column name of lit(np.int8)

2023-02-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42048.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40076
[https://github.com/apache/spark/pull/40076]

> Different column name of lit(np.int8)
> -
>
> Key: SPARK-42048
> URL: https://issues.apache.org/jira/browse/SPARK-42048
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Takuya Ueshin
>Priority: Minor
> Fix For: 3.4.0
>
>
> {code:java}
> ('1', 'tinyint')
> ('CAST(1 AS TINYINT)', 'tinyint')
> - [('1', 'tinyint')]
> + [('CAST(1 AS TINYINT)', 'tinyint')]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42485) SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-19 Thread Mich Talebzadeh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690947#comment-17690947
 ] 

Mich Talebzadeh commented on SPARK-42485:
-

Done, thanks.

> SPIP: Shutting down spark structured streaming when the streaming process 
> completed current process
> ---
>
> Key: SPARK-42485
> URL: https://issues.apache.org/jira/browse/SPARK-42485
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Mich Talebzadeh
>Priority: Major
>  Labels: SPIP
>
> Spark Structured Streaming is a very useful tool in dealing with Event Driven 
> Architecture. In an Event Driven Architecture, there is generally a main loop 
> that listens for events and then triggers a call-back function when one of 
> those events is detected. In a streaming application the application waits to 
> receive the source messages in a set interval or whenever they happen and 
> reacts accordingly.
> There are occasions when you may want to stop the Spark program gracefully, 
> meaning that the Spark application handles the last streaming message 
> completely and then terminates the application. This is different from 
> invoking interrupts such as CTRL-C.
> Of course, one can terminate the process based on the following:
>  # query.awaitTermination() # Waits for the termination of this query, with 
> stop() or with an error
>  # query.awaitTermination(timeoutMs) # Returns true if this query is 
> terminated within the timeout in milliseconds.
> So the first one waits until an interrupt signal is received, and the second 
> one counts down the timeout and exits when the timeout in milliseconds is 
> reached.
> The issue is that one needs to predict how long the streaming job needs to 
> run. Clearly, any interrupt at the terminal or OS level (kill process) may 
> leave the processing terminated without proper completion of the streaming 
> process.
> I have devised a method that allows one to terminate the Spark application 
> internally after processing the last received message. Within, say, 2 seconds 
> of the confirmation of shutdown, the process will invoke a graceful shutdown.
> This new feature proposes a solution to handle the topic doing work for the 
> message being processed gracefully, wait for it to complete, and shut down 
> the streaming process for a given topic without loss of data or orphaned 
> transactions.
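
As a rough illustration of the pattern being described, here is a minimal sketch (an assumption, not the SPIP implementation): the query is stopped from inside the application when an external shutdown marker appears, and the marker-file path plus the rate/console source and sink are placeholders.

{code:python}
# Sketch only: stop a streaming query from within the application when an
# external shutdown marker appears, instead of killing the process.
import os
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (
    spark.readStream.format("rate").load()
         .writeStream.format("console").start()
)

SHUTDOWN_MARKER = "/tmp/stop_streaming"  # hypothetical control file

while query.isActive:
    if os.path.exists(SHUTDOWN_MARKER):
        query.stop()   # ends the query from within the application
    else:
        time.sleep(2)  # roughly the "within say 2 seconds" window above

query.awaitTermination()
{code}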



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42485) SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-19 Thread Mich Talebzadeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mich Talebzadeh updated SPARK-42485:

Affects Version/s: 3.2.2

> SPIP: Shutting down spark structured streaming when the streaming process 
> completed current process
> ---
>
> Key: SPARK-42485
> URL: https://issues.apache.org/jira/browse/SPARK-42485
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.2, 3.3.2
>Reporter: Mich Talebzadeh
>Priority: Major
>  Labels: SPIP
>
> Spark Structured Streaming is a very useful tool in dealing with Event Driven 
> Architecture. In an Event Driven Architecture, there is generally a main loop 
> that listens for events and then triggers a call-back function when one of 
> those events is detected. In a streaming application the application waits to 
> receive the source messages in a set interval or whenever they happen and 
> reacts accordingly.
> There are occasions when you may want to stop the Spark program gracefully, 
> meaning that the Spark application handles the last streaming message 
> completely and then terminates the application. This is different from 
> invoking interrupts such as CTRL-C.
> Of course, one can terminate the process based on the following:
>  # query.awaitTermination() # Waits for the termination of this query, with 
> stop() or with an error
>  # query.awaitTermination(timeoutMs) # Returns true if this query is 
> terminated within the timeout in milliseconds.
> So the first one waits until an interrupt signal is received, and the second 
> one counts down the timeout and exits when the timeout in milliseconds is 
> reached.
> The issue is that one needs to predict how long the streaming job needs to 
> run. Clearly, any interrupt at the terminal or OS level (kill process) may 
> leave the processing terminated without proper completion of the streaming 
> process.
> I have devised a method that allows one to terminate the Spark application 
> internally after processing the last received message. Within, say, 2 seconds 
> of the confirmation of shutdown, the process will invoke a graceful shutdown.
> This new feature proposes a solution to handle the topic doing work for the 
> message being processed gracefully, wait for it to complete, and shut down 
> the streaming process for a given topic without loss of data or orphaned 
> transactions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42485) SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-19 Thread Mich Talebzadeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mich Talebzadeh updated SPARK-42485:

Affects Version/s: (was: 3.3.2)

> SPIP: Shutting down spark structured streaming when the streaming process 
> completed current process
> ---
>
> Key: SPARK-42485
> URL: https://issues.apache.org/jira/browse/SPARK-42485
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.2
>Reporter: Mich Talebzadeh
>Priority: Major
>  Labels: SPIP
>
> Spark Structured Streaming is a very useful tool in dealing with Event Driven 
> Architecture. In an Event Driven Architecture, there is generally a main loop 
> that listens for events and then triggers a call-back function when one of 
> those events is detected. In a streaming application the application waits to 
> receive the source messages in a set interval or whenever they happen and 
> reacts accordingly.
> There are occasions when you may want to stop the Spark program gracefully, 
> meaning that the Spark application handles the last streaming message 
> completely and then terminates the application. This is different from 
> invoking interrupts such as CTRL-C.
> Of course, one can terminate the process based on the following:
>  # query.awaitTermination() # Waits for the termination of this query, with 
> stop() or with an error
>  # query.awaitTermination(timeoutMs) # Returns true if this query is 
> terminated within the timeout in milliseconds.
> So the first one waits until an interrupt signal is received, and the second 
> one counts down the timeout and exits when the timeout in milliseconds is 
> reached.
> The issue is that one needs to predict how long the streaming job needs to 
> run. Clearly, any interrupt at the terminal or OS level (kill process) may 
> leave the processing terminated without proper completion of the streaming 
> process.
> I have devised a method that allows one to terminate the Spark application 
> internally after processing the last received message. Within, say, 2 seconds 
> of the confirmation of shutdown, the process will invoke a graceful shutdown.
> This new feature proposes a solution to handle the topic doing work for the 
> message being processed gracefully, wait for it to complete, and shut down 
> the streaming process for a given topic without loss of data or orphaned 
> transactions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42485) SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690944#comment-17690944
 ] 

Dongjoon Hyun commented on SPARK-42485:
---

Thank you for the removal of Target Version. The remaining item among my 
comments is to change `Affects Versions` from 3.2.2 to 3.5.0.

> SPIP: Shutting down spark structured streaming when the streaming process 
> completed current process
> ---
>
> Key: SPARK-42485
> URL: https://issues.apache.org/jira/browse/SPARK-42485
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Mich Talebzadeh
>Priority: Major
>  Labels: SPIP
>
> Spark Structured Streaming is a very useful tool in dealing with Event Driven 
> Architecture. In an Event Driven Architecture, there is generally a main loop 
> that listens for events and then triggers a call-back function when one of 
> those events is detected. In a streaming application the application waits to 
> receive the source messages in a set interval or whenever they happen and 
> reacts accordingly.
> There are occasions when you may want to stop the Spark program gracefully, 
> meaning that the Spark application handles the last streaming message 
> completely and then terminates the application. This is different from 
> invoking interrupts such as CTRL-C.
> Of course, one can terminate the process based on the following:
>  # query.awaitTermination() # Waits for the termination of this query, with 
> stop() or with an error
>  # query.awaitTermination(timeoutMs) # Returns true if this query is 
> terminated within the timeout in milliseconds.
> So the first one waits until an interrupt signal is received, and the second 
> one counts down the timeout and exits when the timeout in milliseconds is 
> reached.
> The issue is that one needs to predict how long the streaming job needs to 
> run. Clearly, any interrupt at the terminal or OS level (kill process) may 
> leave the processing terminated without proper completion of the streaming 
> process.
> I have devised a method that allows one to terminate the Spark application 
> internally after processing the last received message. Within, say, 2 seconds 
> of the confirmation of shutdown, the process will invoke a graceful shutdown.
> This new feature proposes a solution to handle the topic doing work for the 
> message being processed gracefully, wait for it to complete, and shut down 
> the streaming process for a given topic without loss of data or orphaned 
> transactions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42482) Scala client Write API V1

2023-02-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-42482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-42482.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

> Scala client Write API V1
> -
>
> Key: SPARK-42482
> URL: https://issues.apache.org/jira/browse/SPARK-42482
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Zhen Li
>Priority: Major
> Fix For: 3.4.0
>
>
> Add basic Dataset#write API for Scala client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42492) Add new function filter_value

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42492:


Assignee: Apache Spark

> Add new function filter_value
> -
>
> Key: SPARK-42492
> URL: https://issues.apache.org/jira/browse/SPARK-42492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Adam Binford
>Assignee: Apache Spark
>Priority: Major
>
> Doing data validation in Spark can lead to a lot of extra evaluations of 
> expressions. This is because conditionally evaluated expressions aren't 
> candidates for subexpression elimination. For example a simple expression 
> such as 
> {{when(validate(col), col)}}
> to only keep col if it matches some condition, will lead to col being 
> evaluated twice. And if the call itself is made up of a series of expensive 
> expressions, like regular expression checks, this can lead to a lot of 
> wasted computation time.
> The initial attempt to resolve this was 
> https://issues.apache.org/jira/browse/SPARK-35564, adding support for 
> subexpression elimination to conditionally evaluated expressions. However I 
> have not been able to get that merged, so this is an alternative (though I 
> believe that is still useful on top of this).
> We can add a new higher order function "filter_value" that takes the column 
> you want to validate as an argument, and then a function that runs a lambda 
> expression returning a boolean on whether to keep that column or not. It 
> would have the same semantics as the above when expression, except it would 
> guarantee to only evaluate the initial column once.
> An alternative would be to implement a real definition for the NullIf 
> expression, but that would only support exact equals checks and not any 
> generic condition.
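
To make the proposed semantics concrete, the sketch below shows today's double-evaluation pattern next to the intended filter_value call; filter_value does not exist in Spark, so it is only shown commented out, and the column name and regex are illustrative.

{code:python}
# Sketch of the proposal; filter_value is NOT an existing Spark function.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("12345",), ("abc",)], ["raw"])

# Today: the value column appears in both the predicate and the value branch,
# so it can be evaluated twice.
validated = df.select(
    F.when(F.col("raw").rlike(r"^[0-9]+$"), F.col("raw")).alias("validated")
)

# Proposed: keep the value only when the lambda returns true, with the input
# column guaranteed to be evaluated a single time.
# validated = df.select(
#     filter_value(F.col("raw"), lambda c: c.rlike(r"^[0-9]+$")).alias("validated")
# )
{code}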



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42492) Add new function filter_value

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690894#comment-17690894
 ] 

Apache Spark commented on SPARK-42492:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/40085

> Add new function filter_value
> -
>
> Key: SPARK-42492
> URL: https://issues.apache.org/jira/browse/SPARK-42492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Adam Binford
>Priority: Major
>
> Doing data validation in Spark can lead to a lot of extra evaluations of 
> expressions. This is because conditionally evaluated expressions aren't 
> candidates for subexpression elimination. For example a simple expression 
> such as 
> {{when(validate(col), col)}}
> to only keep col if it matches some condition, will lead to col being 
> evaluated twice. And if the call itself is made up of a series of expensive 
> expressions, like regular expression checks, this can lead to a lot of 
> wasted computation time.
> The initial attempt to resolve this was 
> https://issues.apache.org/jira/browse/SPARK-35564, adding support for 
> subexpression elimination to conditionally evaluated expressions. However I 
> have not been able to get that merged, so this is an alternative (though I 
> believe that is still useful on top of this).
> We can add a new higher order function "filter_value" that takes the column 
> you want to validate as an argument, and then a function that runs a lambda 
> expression returning a boolean on whether to keep that column or not. It 
> would have the same semantics as the above when expression, except it would 
> guarantee to only evaluate the initial column once.
> An alternative would be to implement a real definition for the NullIf 
> expression, but that would only support exact equals checks and not any 
> generic condition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42492) Add new function filter_value

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42492:


Assignee: (was: Apache Spark)

> Add new function filter_value
> -
>
> Key: SPARK-42492
> URL: https://issues.apache.org/jira/browse/SPARK-42492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Adam Binford
>Priority: Major
>
> Doing data validation in Spark can lead to a lot of extra evaluations of 
> expressions. This is because conditionally evaluated expressions aren't 
> candidates for subexpression elimination. For example a simple expression 
> such as 
> {{when(validate(col), col)}}
> to only keep col if it matches some condition, will lead to col being 
> evaluated twice. And if the call itself is made up of a series of expensive 
> expressions, like regular expression checks, this can lead to a lot of 
> wasted computation time.
> The initial attempt to resolve this was 
> https://issues.apache.org/jira/browse/SPARK-35564, adding support for 
> subexpression elimination to conditionally evaluated expressions. However I 
> have not been able to get that merged, so this is an alternative (though I 
> believe that is still useful on top of this).
> We can add a new higher order function "filter_value" that takes the column 
> you want to validate as an argument, and then a function that runs a lambda 
> expression returning a boolean on whether to keep that column or not. It 
> would have the same semantics as the above when expression, except it would 
> guarantee to only evaluate the initial column once.
> An alternative would be to implement a real definition for the NullIf 
> expression, but that would only support exact equals checks and not any 
> generic condition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42492) Add new function filter_value

2023-02-19 Thread Adam Binford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Binford updated SPARK-42492:
-
Description: 
Doing data validation in Spark can lead to a lot of extra evaluations of 
expressions. This is because conditionally evaluated expressions aren't 
candidates for subexpression elimination. For example a simple expression such 
as 

{{when(validate(col), col)}}

to only keep col if it matches some condition, will lead to col being evaluated 
twice. And if the call itself is made up of a series of expensive expressions, 
like regular expression checks, this can lead to a lot of wasted 
computation time.

The initial attempt to resolve this was 
https://issues.apache.org/jira/browse/SPARK-35564, adding support for 
subexpression elimination to conditionally evaluated expressions. However I 
have not been able to get that merged, so this is an alternative (though I 
believe that is still useful on top of this).

We can add a new higher order function "filter_value" that takes the column you 
want to validate as an argument, and then a function that runs a lambda 
expression returning a boolean on whether to keep that column or not. It would 
have the same semantics as the above when expression, except it would guarantee 
to only evaluate the initial column once.

An alternative would be to implement a real definition for the NullIf 
expression, but that would only support exact equals checks and not any generic 
condition.

  was:
Doing data validation in Spark can lead to a lot of extra evaluations of 
expressions. This is because conditionally evaluated expressions aren't 
candidates for subexpression elimination. For example a simple expression such 
as 

{{when(validate(col), col)}}

to only keep col if it matches some condition, will lead to col being evaluated 
twice. And if call itself is made up of a series of expensive expressions 
itself, like regular expression checks, this can lead to a lot of wasted 
computation time.

The initial attempt to resolve this was 
https://issues.apache.org/jira/browse/SPARK-35564, adding support for 
subexpression elimination to conditionally evaluated expressions. However I 
have not been able to get that merged, so this is an alternative (though I 
believe that is still useful on top of this).

We can add a new lambda function "filter_value" that takes the column you want 
to validate as an argument, and then a function that runs a lambda expression 
returning a boolean on whether to keep that column or not. It would have the 
same semantics as the above when expression, except it would guarantee to only 
evaluate the initial column once.

An alternative would be to implement a real definition for the NullIf 
expression, but that would only support exact equals checks and not any generic 
condition.


> Add new function filter_value
> -
>
> Key: SPARK-42492
> URL: https://issues.apache.org/jira/browse/SPARK-42492
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Adam Binford
>Priority: Major
>
> Doing data validation in Spark can lead to a lot of extra evaluations of 
> expressions. This is because conditionally evaluated expressions aren't 
> candidates for subexpression elimination. For example a simple expression 
> such as 
> {{when(validate(col), col)}}
> to only keep col if it matches some condition, will lead to col being 
> evaluated twice. And if the call itself is made up of a series of expensive 
> expressions, like regular expression checks, this can lead to a lot of 
> wasted computation time.
> The initial attempt to resolve this was 
> https://issues.apache.org/jira/browse/SPARK-35564, adding support for 
> subexpression elimination to conditionally evaluated expressions. However I 
> have not been able to get that merged, so this is an alternative (though I 
> believe that is still useful on top of this).
> We can add a new higher order function "filter_value" that takes the column 
> you want to validate as an argument, and then a function that runs a lambda 
> expression returning a boolean on whether to keep that column or not. It 
> would have the same semantics as the above when expression, except it would 
> guarantee to only evaluate the initial column once.
> An alternative would be to implement a real definition for the NullIf 
> expression, but that would only support exact equals checks and not any 
> generic condition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42492) Add new function filter_value

2023-02-19 Thread Adam Binford (Jira)
Adam Binford created SPARK-42492:


 Summary: Add new function filter_value
 Key: SPARK-42492
 URL: https://issues.apache.org/jira/browse/SPARK-42492
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.2
Reporter: Adam Binford


Doing data validation in Spark can lead to a lot of extra evaluations of 
expressions. This is because conditionally evaluated expressions aren't 
candidates for subexpression elimination. For example a simple expression such 
as 

{{when(validate(col), col)}}

to only keep col if it matches some condition, will lead to col being evaluated 
twice. And if the call itself is made up of a series of expensive expressions, 
like regular expression checks, this can lead to a lot of wasted 
computation time.

The initial attempt to resolve this was 
https://issues.apache.org/jira/browse/SPARK-35564, adding support for 
subexpression elimination to conditionally evaluated expressions. However I 
have not been able to get that merged, so this is an alternative (though I 
believe that is still useful on top of this).

We can add a new lambda function "filter_value" that takes the column you want 
to validate as an argument, and then a function that runs a lambda expression 
returning a boolean on whether to keep that column or not. It would have the 
same semantics as the above when expression, except it would guarantee to only 
evaluate the initial column once.

An alternative would be to implement a real definition for the NullIf 
expression, but that would only support exact equals checks and not any generic 
condition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42490) Upgrade protobuf-java to 3.22.0

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690879#comment-17690879
 ] 

Apache Spark commented on SPARK-42490:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40084

> Upgrade protobuf-java to 3.22.0
> ---
>
> Key: SPARK-42490
> URL: https://issues.apache.org/jira/browse/SPARK-42490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> https://github.com/protocolbuffers/protobuf/releases/tag/v22.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42487) Upgrade Netty to 4.1.89

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690878#comment-17690878
 ] 

Apache Spark commented on SPARK-42487:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40081

> Upgrade Netty to 4.1.89
> ---
>
> Key: SPARK-42487
> URL: https://issues.apache.org/jira/browse/SPARK-42487
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> This release contains fixes for two regressions that were introduced by 
> 4.1.88.Final:
>  * Don't fail on HttpObjectDecoder's maxHeaderSize greater than 
> (Integer.MAX_VALUE - 2) ([#13216|https://github.com/netty/netty/pull/13216])
>  * dyld: Symbol not found: _netty_jni_util_JNI_OnLoad when upgrading from 
> 4.1.87.Final to 4.1.88.Final 
> ([#13214|https://github.com/netty/netty/pull/13214])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42490) Upgrade protobuf-java to 3.22.0

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42490:


Assignee: Apache Spark

> Upgrade protobuf-java to 3.22.0
> ---
>
> Key: SPARK-42490
> URL: https://issues.apache.org/jira/browse/SPARK-42490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> https://github.com/protocolbuffers/protobuf/releases/tag/v22.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42491) Upgrade jetty to 9.4.51.v20230217

2023-02-19 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690877#comment-17690877
 ] 

Yang Jie commented on SPARK-42491:
--

It has not been published to the central repository yet.

> Upgrade jetty to 9.4.51.v20230217
> --
>
> Key: SPARK-42491
> URL: https://issues.apache.org/jira/browse/SPARK-42491
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.51.v20230217



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42490) Upgrade protobuf-java to 3.22.0

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42490:


Assignee: (was: Apache Spark)

> Upgrade protobuf-java to 3.22.0
> ---
>
> Key: SPARK-42490
> URL: https://issues.apache.org/jira/browse/SPARK-42490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> https://github.com/protocolbuffers/protobuf/releases/tag/v22.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42487) Upgrade Netty to 4.1.89

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42487:


Assignee: (was: Apache Spark)

> Upgrade Netty to 4.1.89
> ---
>
> Key: SPARK-42487
> URL: https://issues.apache.org/jira/browse/SPARK-42487
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> This release contains fixes for two regressions that were introduced by 
> 4.1.88.Final:
>  * Don't fail on HttpObjectDecoder's maxHeaderSize greater than 
> (Integer.MAX_VALUE - 2) ([#13216|https://github.com/netty/netty/pull/13216])
>  * dyld: Symbol not found: _netty_jni_util_JNI_OnLoad when upgrading from 
> 4.1.87.Final to 4.1.88.Final 
> ([#13214|https://github.com/netty/netty/pull/13214])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42487) Upgrade Netty to 4.1.89

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42487:


Assignee: Apache Spark

> Upgrade Netty to 4.1.89
> ---
>
> Key: SPARK-42487
> URL: https://issues.apache.org/jira/browse/SPARK-42487
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> This release contains fixes for two regressions that were introduced by 
> 4.1.88.Final:
>  * Don't fail on HttpObjectDecoder's maxHeaderSize greater than 
> (Integer.MAX_VALUE - 2) ([#13216|https://github.com/netty/netty/pull/13216])
>  * dyld: Symbol not found: _netty_jni_util_JNI_OnLoad when upgrading from 
> 4.1.87.Final to 4.1.88.Final 
> ([#13214|https://github.com/netty/netty/pull/13214])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42490) Upgrade protobuf-java to 3.22.0

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690876#comment-17690876
 ] 

Apache Spark commented on SPARK-42490:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40084

> Upgrade protobuf-java to 3.22.0
> ---
>
> Key: SPARK-42490
> URL: https://issues.apache.org/jira/browse/SPARK-42490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> https://github.com/protocolbuffers/protobuf/releases/tag/v22.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42489) Upgrade scala-parser-combinators from 2.1.1 to 2.2.0

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42489:


Assignee: Apache Spark

> Upgrade scala-parser-combinators from 2.1.1 to 2.2.0
> 
>
> Key: SPARK-42489
> URL: https://issues.apache.org/jira/browse/SPARK-42489
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> https://github.com/scala/scala-parser-combinators/releases



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42489) Upgrade scala-parser-combinators from 2.1.1 to 2.2.0

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42489:


Assignee: (was: Apache Spark)

> Upgrade scala-parser-combinators from 2.1.1 to 2.2.0
> 
>
> Key: SPARK-42489
> URL: https://issues.apache.org/jira/browse/SPARK-42489
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/scala/scala-parser-combinators/releases



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42489) Upgrade scala-parser-combinators from 2.1.1 to 2.2.0

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690875#comment-17690875
 ] 

Apache Spark commented on SPARK-42489:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40083

> Upgrade scala-parser-combinators from 2.1.1 to 2.2.0
> 
>
> Key: SPARK-42489
> URL: https://issues.apache.org/jira/browse/SPARK-42489
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/scala/scala-parser-combinators/releases



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42488) Upgrade commons-crypto from 1.1.0 to 1.2.0

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42488:


Assignee: (was: Apache Spark)

> Upgrade commons-crypto from 1.1.0 to 1.2.0
> --
>
> Key: SPARK-42488
> URL: https://issues.apache.org/jira/browse/SPARK-42488
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/apache/commons-crypto/compare/rel/commons-crypto-1.1.0...rel/commons-crypto-1.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42488) Upgrade commons-crypto from 1.1.0 to 1.2.0

2023-02-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42488:


Assignee: Apache Spark

> Upgrade commons-crypto from 1.1.0 to 1.2.0
> --
>
> Key: SPARK-42488
> URL: https://issues.apache.org/jira/browse/SPARK-42488
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> https://github.com/apache/commons-crypto/compare/rel/commons-crypto-1.1.0...rel/commons-crypto-1.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42488) Upgrade commons-crypto from 1.1.0 to 1.2.0

2023-02-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690874#comment-17690874
 ] 

Apache Spark commented on SPARK-42488:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40082

> Upgrade commons-crypto from 1.1.0 to 1.2.0
> --
>
> Key: SPARK-42488
> URL: https://issues.apache.org/jira/browse/SPARK-42488
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/apache/commons-crypto/compare/rel/commons-crypto-1.1.0...rel/commons-crypto-1.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42491) Upgrade jetty to 9.4.51.v20230217

2023-02-19 Thread Yang Jie (Jira)
Yang Jie created SPARK-42491:


 Summary: Upgrade jetty to 9.4.51.v20230217
 Key: SPARK-42491
 URL: https://issues.apache.org/jira/browse/SPARK-42491
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42491) Upgrade jetty to 9.4.51.v20230217

2023-02-19 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42491:
-
Description: 
https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.51.v20230217

> Upgrade jetty to 9.4.51.v20230217
> --
>
> Key: SPARK-42491
> URL: https://issues.apache.org/jira/browse/SPARK-42491
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.51.v20230217



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42490) Upgrade protobuf-java to 3.22.0

2023-02-19 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42490:
-
Description: https://github.com/protocolbuffers/protobuf/releases/tag/v22.0

> Upgrade protobuf-java to 3.22.0
> ---
>
> Key: SPARK-42490
> URL: https://issues.apache.org/jira/browse/SPARK-42490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> https://github.com/protocolbuffers/protobuf/releases/tag/v22.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42490) Upgrade protobuf-java to 3.22.0

2023-02-19 Thread Yang Jie (Jira)
Yang Jie created SPARK-42490:


 Summary: Upgrade protobuf-java to 3.22.0
 Key: SPARK-42490
 URL: https://issues.apache.org/jira/browse/SPARK-42490
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42489) Upgrade scala-parser-combinators from 2.1.1 to 2.2.0

2023-02-19 Thread Yang Jie (Jira)
Yang Jie created SPARK-42489:


 Summary: Upgrade scala-parser-combinators from 2.1.1 to 2.2.0
 Key: SPARK-42489
 URL: https://issues.apache.org/jira/browse/SPARK-42489
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


https://github.com/scala/scala-parser-combinators/releases



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42488) Upgrade commons-crypto from 1.1.0 to 1.2.0

2023-02-19 Thread Yang Jie (Jira)
Yang Jie created SPARK-42488:


 Summary: Upgrade commons-crypto from 1.1.0 to 1.2.0
 Key: SPARK-42488
 URL: https://issues.apache.org/jira/browse/SPARK-42488
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


https://github.com/apache/commons-crypto/compare/rel/commons-crypto-1.1.0...rel/commons-crypto-1.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42487) Upgrade Netty to 4.1.89

2023-02-19 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42487:
-
Summary: Upgrade Netty to 4.1.89  (was: Upgrade Netty from 4.1.89)

> Upgrade Netty to 4.1.89
> ---
>
> Key: SPARK-42487
> URL: https://issues.apache.org/jira/browse/SPARK-42487
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> This release contains fixes for two regressions that were introduced by 
> 4.1.88.Final:
>  * Don't fail on HttpObjectDecoder's maxHeaderSize greater than 
> (Integer.MAX_VALUE - 2) ([#13216|https://github.com/netty/netty/pull/13216])
>  * dyld: Symbol not found: _netty_jni_util_JNI_OnLoad when upgrading from 
> 4.1.87.Final to 4.1.88.Final 
> ([#13214|https://github.com/netty/netty/pull/13214])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42487) Upgrade Netty from 4.1.89

2023-02-19 Thread Yang Jie (Jira)
Yang Jie created SPARK-42487:


 Summary: Upgrade Netty from 4.1.89
 Key: SPARK-42487
 URL: https://issues.apache.org/jira/browse/SPARK-42487
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


This release contains fixes for two regressions that were introduced by 
4.1.88.Final:
 * Don't fail on HttpObjectDecoder's maxHeaderSize greater than 
(Integer.MAX_VALUE - 2) ([#13216|https://github.com/netty/netty/pull/13216])
 * dyld: Symbol not found: _netty_jni_util_JNI_OnLoad when upgrading from 
4.1.87.Final to 4.1.88.Final 
([#13214|https://github.com/netty/netty/pull/13214])



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42323) Assign name to _LEGACY_ERROR_TEMP_2332

2023-02-19 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42323.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39977
[https://github.com/apache/spark/pull/39977]

> Assign name to _LEGACY_ERROR_TEMP_2332
> --
>
> Key: SPARK-42323
> URL: https://issues.apache.org/jira/browse/SPARK-42323
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42323) Assign name to _LEGACY_ERROR_TEMP_2332

2023-02-19 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42323:


Assignee: Haejoon Lee

> Assign name to _LEGACY_ERROR_TEMP_2332
> --
>
> Key: SPARK-42323
> URL: https://issues.apache.org/jira/browse/SPARK-42323
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-02-19 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690853#comment-17690853
 ] 

Yang Jie edited comment on SPARK-41952 at 2/19/23 8:54 AM:
---

For the old Spark versions, is it possible that upgrading Parquet introduces 
other costs? Should we instead bring parquet.hadoop.CodecFactory directly into 
the old Spark versions and fix them there?

After that, we can also revert the changes in the Spark versions (for example, 
master and Spark 3.4) that can be solved by upgrading Parquet.


was (Author: luciferyang):
For the old Spark versions, is it possible to introduce other costs by 
upgrading parquet?  Should we directly introduce parquet.hadoop.CodecFactory to 
old Spark version and fix them accordingly? 

After that, we can also revert the changes to the Spark version(for example, 
master and Spark 3.4) that can be solved by upgrading parquet

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> --
>
> Key: SPARK-41952
> URL: https://issues.apache.org/jira/browse/SPARK-41952
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.3, 3.3.1, 3.2.3
>Reporter: Alexey Kudinkin
>Priority: Critical
>
> Recently, a native memory leak has been discovered in Parquet in conjunction 
> with its use of the Zstd decompressor from the luben/zstd-jni library 
> (PARQUET-2160). This is very problematic, to the point where we can't use 
> Parquet with Zstd due to pervasive OOMs taking down our executors and 
> disrupting our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> [https://github.com/apache/parquet-mr/pull/982]
>  
> Now, we just need two things:
>  # An updated version of Parquet is released in a timely manner
>  # Spark is upgraded onto this new version in the upcoming release
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec

2023-02-19 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690853#comment-17690853
 ] 

Yang Jie commented on SPARK-41952:
--

For the old Spark versions, is it possible that upgrading Parquet introduces 
other costs? Should we instead bring parquet.hadoop.CodecFactory directly into 
the old Spark versions and fix them there?

After that, we can also revert the changes in the Spark versions (for example, 
master and Spark 3.4) that can be solved by upgrading Parquet.

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> --
>
> Key: SPARK-41952
> URL: https://issues.apache.org/jira/browse/SPARK-41952
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.3, 3.3.1, 3.2.3
>Reporter: Alexey Kudinkin
>Priority: Critical
>
> Recently, a native memory leak has been discovered in Parquet in conjunction 
> with its use of the Zstd decompressor from the luben/zstd-jni library 
> (PARQUET-2160). This is very problematic, to the point where we can't use 
> Parquet with Zstd due to pervasive OOMs taking down our executors and 
> disrupting our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> [https://github.com/apache/parquet-mr/pull/982]
>  
> Now, we just need two things:
>  # An updated version of Parquet is released in a timely manner
>  # Spark is upgraded onto this new version in the upcoming release
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42485) SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-19 Thread Mich Talebzadeh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690847#comment-17690847
 ] 

Mich Talebzadeh commented on SPARK-42485:
-

How about Target Version?

> SPIP: Shutting down spark structured streaming when the streaming process 
> completed current process
> ---
>
> Key: SPARK-42485
> URL: https://issues.apache.org/jira/browse/SPARK-42485
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Mich Talebzadeh
>Priority: Major
>  Labels: SPIP
>
> Spark Structured Streaming is a very useful tool in dealing with Event Driven 
> Architecture. In an Event Driven Architecture, there is generally a main loop 
> that listens for events and then triggers a call-back function when one of 
> those events is detected. In a streaming application the application waits to 
> receive the source messages in a set interval or whenever they happen and 
> reacts accordingly.
> There are occasions when you may want to stop the Spark program gracefully, 
> meaning that the Spark application handles the last streaming message 
> completely and then terminates the application. This is different from 
> invoking interrupts such as CTRL-C.
> Of course, one can terminate the process based on the following:
>  # query.awaitTermination() # Waits for the termination of this query, with 
> stop() or with an error
>  # query.awaitTermination(timeoutMs) # Returns true if this query is 
> terminated within the timeout in milliseconds.
> So the first one waits until an interrupt signal is received, and the second 
> one counts down the timeout and exits when the timeout in milliseconds is 
> reached.
> The issue is that one needs to predict how long the streaming job needs to 
> run. Clearly, any interrupt at the terminal or OS level (kill process) may 
> leave the processing terminated without proper completion of the streaming 
> process.
> I have devised a method that allows one to terminate the Spark application 
> internally after processing the last received message. Within, say, 2 seconds 
> of the confirmation of shutdown, the process will invoke a graceful shutdown.
> This new feature proposes a solution to handle the topic doing work for the 
> message being processed gracefully, wait for it to complete, and shut down 
> the streaming process for a given topic without loss of data or orphaned 
> transactions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42485) SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-19 Thread Mich Talebzadeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mich Talebzadeh updated SPARK-42485:

Target Version/s:   (was: 3.3.2)

> SPIP: Shutting down spark structured streaming when the streaming process 
> completed current process
> ---
>
> Key: SPARK-42485
> URL: https://issues.apache.org/jira/browse/SPARK-42485
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Mich Talebzadeh
>Priority: Major
>  Labels: SPIP
>
> Spark Structured Streaming is a very useful tool in dealing with Event Driven 
> Architecture. In an Event Driven Architecture, there is generally a main loop 
> that listens for events and then triggers a call-back function when one of 
> those events is detected. In a streaming application the application waits to 
> receive the source messages in a set interval or whenever they happen and 
> reacts accordingly.
> There are occasions when you may want to stop the Spark program gracefully, 
> meaning that the Spark application handles the last streaming message 
> completely and then terminates the application. This is different from 
> invoking interrupts such as CTRL-C.
> Of course, one can terminate the process based on the following:
>  # query.awaitTermination() # Waits for the termination of this query, with 
> stop() or with an error
>  # query.awaitTermination(timeoutMs) # Returns true if this query is 
> terminated within the timeout in milliseconds.
> So the first one waits until an interrupt signal is received, and the second 
> one counts down the timeout and exits when the timeout in milliseconds is 
> reached.
> The issue is that one needs to predict how long the streaming job needs to 
> run. Clearly, any interrupt at the terminal or OS level (kill process) may 
> leave the processing terminated without proper completion of the streaming 
> process.
> I have devised a method that allows one to terminate the Spark application 
> internally after processing the last received message. Within, say, 2 seconds 
> of the confirmation of shutdown, the process will invoke a graceful shutdown.
> This new feature proposes a solution to handle the topic doing work for the 
> message being processed gracefully, wait for it to complete, and shut down 
> the streaming process for a given topic without loss of data or orphaned 
> transactions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org