[jira] [Assigned] (SPARK-42992) Introduce PySparkRuntimeError

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42992:
-

Assignee: Haejoon Lee

> Introduce PySparkRuntimeError
> -
>
> Key: SPARK-42992
> URL: https://issues.apache.org/jira/browse/SPARK-42992
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Introduce PySparkRuntimeError to cover RuntimeError in a PySpark-specific way.
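
For context, a minimal self-contained sketch of the error-class idea (a
hypothetical, simplified shape, not the actual pyspark.errors implementation;
the error class name is illustrative):

{code:java}
# A PySpark-specific error that carries a structured error class and message
# parameters instead of a bare RuntimeError with a free-form string.
class PySparkRuntimeError(RuntimeError):
    def __init__(self, error_class: str, message_parameters: dict):
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(f"[{error_class}] {message_parameters}")

# Callers raise it with a greppable identifier plus parameters:
raise PySparkRuntimeError(
    error_class="UNEXPECTED_RESPONSE_FROM_SERVER",
    message_parameters={},
)
{code}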






[jira] [Resolved] (SPARK-42992) Introduce PySparkRuntimeError

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42992.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40617
[https://github.com/apache/spark/pull/40617]

> Introduce PySparkRuntimeError
> -
>
> Key: SPARK-42992
> URL: https://issues.apache.org/jira/browse/SPARK-42992
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Introduce PySparkRuntimeError to cover RuntimeError in a PySpark-specific way.






[jira] [Resolved] (SPARK-43275) Migrate Spark Connect GroupedData error into error class

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43275.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40938
[https://github.com/apache/spark/pull/40938]

> Migrate Spark Connect GroupedData error into error class
> 
>
> Key: SPARK-43275
> URL: https://issues.apache.org/jira/browse/SPARK-43275
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Migrate Spark Connect GroupedData error into error class
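
For context, a hedged before/after sketch of what this kind of migration
typically looks like (the raised message, error class, and parameters are
illustrative, not taken from the actual PR):

{code:java}
# Before: a plain built-in error with a free-form message
raise ValueError("at least one column must be specified")

# After: a structured PySpark error class with message parameters
from pyspark.errors import PySparkValueError

raise PySparkValueError(
    error_class="CANNOT_BE_EMPTY",
    message_parameters={"item": "columns"},
)
{code}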






[jira] [Assigned] (SPARK-43275) Migrate Spark Connect GroupedData error into error class

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43275:
-

Assignee: Haejoon Lee

> Migrate Spark Connect GroupedData error into error class
> 
>
> Key: SPARK-43275
> URL: https://issues.apache.org/jira/browse/SPARK-43275
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Migrate Spark Connect GroupedData error into error class






[jira] [Resolved] (SPARK-43274) Introduce `PySparkNotImplementError`

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43274.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40938
[https://github.com/apache/spark/pull/40938]

> Introduce `PySparkNotImplementError`
> 
>
> Key: SPARK-43274
> URL: https://issues.apache.org/jira/browse/SPARK-43274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Introduce `PySparkNotImplementError` corresponding to Python's built-in `NotImplementedError`
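
Following the same error-class pattern, a hedged usage sketch (the import path
and final class name assume the eventual implementation in pyspark.errors and
may differ from the ticket title's spelling; the error class and feature name
are illustrative):

{code:java}
from pyspark.errors import PySparkNotImplementedError

raise PySparkNotImplementedError(
    error_class="NOT_IMPLEMENTED",
    message_parameters={"feature": "DataFrame.toJSON"},
)
{code}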






[jira] [Assigned] (SPARK-43274) Introduce `PySparkNotImplementError`

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43274:
-

Assignee: Haejoon Lee

> Introduce `PySparkNotImplementError`
> 
>
> Key: SPARK-43274
> URL: https://issues.apache.org/jira/browse/SPARK-43274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Introduce `PySparkNotImplementError` corresponding to Python's built-in `NotImplementedError`






[jira] [Updated] (SPARK-43291) Match behavior for DataFrame.cov on string DataFrame

2023-04-25 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43291:

Summary: Match behavior for DataFrame.cov on string DataFrame  (was: 
Re-enable test for DataFrame.cov on string DataFrame.)

> Match behavior for DataFrame.cov on string DataFrame
> 
>
> Key: SPARK-43291
> URL: https://issues.apache.org/jira/browse/SPARK-43291
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Should enable the test below:
> {code:java}
> pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], 
> columns=["a", "b"])
> psdf = ps.from_pandas(pdf)
> self.assert_eq(pdf.cov(), psdf.cov()) {code}






[jira] [Created] (SPARK-43291) Re-enable test for DataFrame.cov on string DataFrame.

2023-04-25 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43291:
---

 Summary: Re-enable test for DataFrame.cov on string DataFrame.
 Key: SPARK-43291
 URL: https://issues.apache.org/jira/browse/SPARK-43291
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Should enable the test below:
{code:java}
pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], 
columns=["a", "b"])
psdf = ps.from_pandas(pdf)
self.assert_eq(pdf.cov(), psdf.cov()) {code}






[jira] [Created] (SPARK-43290) Support IV and AAD optional parameters for aes_encrypt

2023-04-25 Thread Steve Weis (Jira)
Steve Weis created SPARK-43290:
--

 Summary: Support IV and AAD optional parameters for aes_encrypt
 Key: SPARK-43290
 URL: https://issues.apache.org/jira/browse/SPARK-43290
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Steve Weis


There are some use cases where callers to aes_encrypt may want to provide 
initialization vectors (IVs) or additional authenticated data (AAD). The most 
common cases will be:
1. Ensuring that ciphertext matches values that have been encrypted by external 
tools. In those cases, the caller will need to provide an identical IV value.
2. For AES-CBC mode, there are some cases where callers want to generate 
deterministic encrypted output.
3. For AES-GCM mode, providing AAD fields allows callers to bind additional 
data to an encrypted ciphertext so that it can only be decrypted by a caller 
providing the same value. This is often used to enforce some context.

The proposed new API is the following:
 * aes_encrypt(expr, key [, mode [, padding [, iv [, aad]]]])

 * aes_decrypt(expr, key [, mode [, padding [, aad]]])

These fields are only supported for specific modes:
 * ECB: Does not support either IV or AAD and will return an error if either 
is provided.
 * CBC: Only supports an IV and will return an error if an AAD is provided.
 * GCM: Supports either IV, AAD, or both.

If a caller is only providing an AAD to GCM mode, they would need to pass a 
null value in the IV field.
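
To make the proposal concrete, a hedged sketch of calls under the proposed
signature (the IV and AAD parameters do not exist yet; the key and IV bytes
below are illustrative placeholders):

{code:java}
# GCM with an explicit 12-byte IV and an AAD value bound to the ciphertext
spark.sql("""
    SELECT aes_encrypt('Spark', '0000111122223333', 'GCM', 'DEFAULT',
                       unhex('000000000000000000000000'), 'some context')
""").show()

# CBC with an explicit 16-byte IV; supplying an AAD here would be an error
spark.sql("""
    SELECT aes_encrypt('Spark', '0000111122223333', 'CBC', 'DEFAULT',
                       unhex('00000000000000000000000000000000'))
""").show()
{code}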






[jira] [Assigned] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

2023-04-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-43156:
---

Assignee: Jack Chen

> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> 
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) 
> from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
> {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so 
> count(1) is null is always false. The correct evaluation of the subquery is 
> always false.
> We incorrectly evaluate it to null for empty groups. The reason is that 
> NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
> [false] - this rewrite would be correct normally, but in the context of a 
> scalar subquery it breaks our count bug handling in 
> RewriteCorrelatedScalarSubquery.constructLeftJoins . By the time we get 
> there, the query appears to not have the count bug - it looks the same as if 
> the original query had a subquery with select any_value(false) from r..., and 
> that case is _not_ subject to the count bug.
>  
> Postgres comparison shows the correct always-false result: 
> [http://sqlfiddle.com/#!17/67822/5]
> DDL for the example:
> {code:java}
> create or replace temp view t0 (a, b)
> as values
>     (1, 1.0),
>     (2, 2.0);
> create or replace temp view t1 (c, d)
> as values
>     (2, 3.0); {code}






[jira] [Resolved] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

2023-04-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43156.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40946
[https://github.com/apache/spark/pull/40946]

> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> 
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) 
> from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
> {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so 
> count(1) is null is always false. The correct evaluation of the subquery is 
> always false.
> We incorrectly evaluate it to null for empty groups. The reason is that 
> NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
> [false] - this rewrite would be correct normally, but in the context of a 
> scalar subquery it breaks our count bug handling in 
> RewriteCorrelatedScalarSubquery.constructLeftJoins . By the time we get 
> there, the query appears to not have the count bug - it looks the same as if 
> the original query had a subquery with select any_value(false) from r..., and 
> that case is _not_ subject to the count bug.
>  
> Postgres comparison shows the correct always-false result: 
> [http://sqlfiddle.com/#!17/67822/5]
> DDL for the example:
> {code:java}
> create or replace temp view t0 (a, b)
> as values
>     (1, 1.0),
>     (2, 2.0);
> create or replace temp view t1 (c, d)
> as values
>     (2, 3.0); {code}






[jira] [Resolved] (SPARK-43276) Migrate Spark Connect Window errors into error class

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43276.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40939
[https://github.com/apache/spark/pull/40939]

> Migrate Spark Connect Window errors into error class
> 
>
> Key: SPARK-43276
> URL: https://issues.apache.org/jira/browse/SPARK-43276
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Migrate Spark Connect Window errors into error class






[jira] [Assigned] (SPARK-43276) Migrate Spark Connect Window errors into error class

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43276:
-

Assignee: Haejoon Lee

> Migrate Spark Connect Window errors into error class
> 
>
> Key: SPARK-43276
> URL: https://issues.apache.org/jira/browse/SPARK-43276
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Migrate Spark Connect Window errors into error class






[jira] [Assigned] (SPARK-43289) PySpark UDF supports python package dependencies

2023-04-25 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-43289:
--

Assignee: Weichen Xu

> PySpark UDF supports python package dependencies
> 
>
> Key: SPARK-43289
> URL: https://issues.apache.org/jira/browse/SPARK-43289
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> h3. Requirements
>  
> Make PySpark UDFs support declaring Python package dependencies so that, when 
> a UDF executes, the UDF worker creates a new Python environment with the 
> provided dependencies.
> h3. Motivation
>  
> We have two major cases:
>  
>  * For the Spark Connect case, the client Python environment is very likely 
> to differ from the PySpark server-side Python environment, which causes the 
> user's UDF to fail on the server side.
>  * Some third-party machine learning libraries (e.g. MLflow) require PySpark 
> UDFs to support dependencies, because in ML cases we need to run model 
> inference via a PySpark UDF in exactly the same Python environment that the 
> model was trained in. Currently MLflow supports this by creating a child 
> Python process in the PySpark UDF worker and redirecting all UDF input data 
> to that child process for inference, which causes significant overhead. If 
> PySpark UDFs supported built-in Python dependency management, we would not 
> need such a poorly performing approach.
>  
> h3. Proposed API
> ```
> @pandas_udf("string", pip_requirements=...)
> ```
> The `pip_requirements` argument is either an iterable of pip requirement 
> strings (e.g. ``["scikit-learn", "-r /path/to/req2.txt", "-c 
> /path/to/constraints.txt"]``) or a path to a pip requirements file on the 
> local filesystem (e.g. ``"/path/to/requirements.txt"``), representing the pip 
> requirements for the Python UDF.






[jira] [Created] (SPARK-43289) PySpark UDF supports python package dependencies

2023-04-25 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-43289:
--

 Summary: PySpark UDF supports python package dependencies
 Key: SPARK-43289
 URL: https://issues.apache.org/jira/browse/SPARK-43289
 Project: Spark
  Issue Type: New Feature
  Components: Connect, ML, PySpark
Affects Versions: 3.5.0
Reporter: Weichen Xu


h3. Requirements

 

Make PySpark UDFs support declaring Python package dependencies so that, when a 
UDF executes, the UDF worker creates a new Python environment with the provided 
dependencies.
h3. Motivation

 

We have two major cases:

 
 * For the Spark Connect case, the client Python environment is very likely to 
differ from the PySpark server-side Python environment, which causes the user's 
UDF to fail on the server side.
 * Some third-party machine learning libraries (e.g. MLflow) require PySpark 
UDFs to support dependencies, because in ML cases we need to run model 
inference via a PySpark UDF in exactly the same Python environment that the 
model was trained in. Currently MLflow supports this by creating a child Python 
process in the PySpark UDF worker and redirecting all UDF input data to that 
child process for inference, which causes significant overhead. If PySpark UDFs 
supported built-in Python dependency management, we would not need such a 
poorly performing approach.

 
h3. Proposed API

```

@pandas_udf("string", pip_requirements=...)

```

The `pip_requirements` argument is either an iterable of pip requirement 
strings (e.g. ``["scikit-learn", "-r /path/to/req2.txt", "-c 
/path/to/constraints.txt"]``) or a path to a pip requirements file on the local 
filesystem (e.g. ``"/path/to/requirements.txt"``), representing the pip 
requirements for the Python UDF.
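
For illustration, a hedged sketch of the proposed usage (`pip_requirements` 
does not exist yet; the package pins and model path are placeholders):

{code:java}
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Proposed: dependencies are declared on the UDF, and the worker builds a
# matching Python environment before executing it.
@pandas_udf("double", pip_requirements=["mlflow", "scikit-learn==1.2.2"])
def predict(features: pd.Series) -> pd.Series:
    import mlflow.pyfunc
    model = mlflow.pyfunc.load_model("/path/to/model")  # placeholder path
    return pd.Series(model.predict(features.to_frame()))
{code}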






[jira] [Resolved] (SPARK-43277) Clean up deprecation hadoop api usage in Yarn module

2023-04-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-43277.
--
Fix Version/s: 3.5.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40940

> Clean up deprecation hadoop api usage in Yarn module
> 
>
> Key: SPARK-43277
> URL: https://issues.apache.org/jira/browse/SPARK-43277
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Created] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE

2023-04-25 Thread John Zhuge (Jira)
John Zhuge created SPARK-43288:
--

 Summary: DataSourceV2: CREATE TABLE LIKE
 Key: SPARK-43288
 URL: https://issues.apache.org/jira/browse/SPARK-43288
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: John Zhuge


Support CREATE TABLE LIKE in DSv2.
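
For reference, a hedged sketch of the statement in question (table names are 
placeholders; this syntax works for v1 tables today, and the ticket is about 
supporting it for DSv2 catalogs):

{code:java}
# Creates an empty table with src_tbl's schema; no data is copied
spark.sql("CREATE TABLE tgt_tbl LIKE src_tbl")
{code}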






[jira] [Updated] (SPARK-43020) Refactoring similar error classes such as `NOT_XXX`.

2023-04-25 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43020:

Description: 
We should consolidate the error classes that have similar error messages into a 
single error class, or classify them into a main-sub error class structure.

NOTE: This refactoring should start after all other initial migration work is 
done.

  was:We'd better add a main error class for type errors and switch the 
type-related errors to sub-error classes.


> Refactoring similar error classes such as `NOT_XXX`.
> -
>
> Key: SPARK-43020
> URL: https://issues.apache.org/jira/browse/SPARK-43020
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should consolidate the error classes that have similar error messages into 
> a single error class, or classify them into a main-sub error class structure.
> NOTE: This refactoring should start after all other initial migration work is 
> done.
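
For illustration, a hedged sketch of the consolidation idea (class names and 
message templates are illustrative, not the final structure):

{code:java}
# Before: several near-duplicate top-level error classes, e.g.
#   NOT_BOOL, NOT_STR, NOT_COLUMN, NOT_COLUMN_OR_STR, ...
# each carrying an almost identical message.

# After: a single parameterized main class (or a main class with sub-classes):
NOT_AN_INSTANCE_OF = {
    "message": [
        "Argument `<arg_name>` should be <expected_types>, got <arg_type>."
    ]
}
{code}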






[jira] [Updated] (SPARK-43020) Refactoring similar error classes such as `NOT_XXX`.

2023-04-25 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43020:

Description: 
We should consolidate the error classes that have similar error messages into a 
single error class, or classify them into a main-sub error class structure.

*NOTE:* This refactoring should start after all other initial migration work is 
done.

  was:
We should consolidate the error classes that have similar error messages into a 
single error class, or classify them into a main-sub error class structure.

NOTE: This refactoring should start after all other initial migration work is 
done.


> Refactoring similar error classes such as `NOT_XXX`.
> -
>
> Key: SPARK-43020
> URL: https://issues.apache.org/jira/browse/SPARK-43020
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should consolidate the error classes that have similar error messages into 
> a single error class, or classify them into a main-sub error class structure.
> *NOTE:* This refactoring should start after all other initial migration work 
> is done.






[jira] [Updated] (SPARK-43020) Refactoring similar error classes such as `NOT_XXX`.

2023-04-25 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43020:

Summary: Refactoring similar error classes such as `NOT_XXX`.  (was: Add 
main error class for type errors)

> Refactoring similar error classes such as `NOT_XXX`.
> -
>
> Key: SPARK-43020
> URL: https://issues.apache.org/jira/browse/SPARK-43020
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We'd better add a main error class for type errors and switch the 
> type-related errors to sub-error classes.






[jira] [Updated] (SPARK-43280) Reimplement the protobuf breaking change checker script

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-43280:
--
Summary: Reimplement the protobuf breaking change checker script  (was: 
Improve the protobuf breaking change checker script)

> Reimplement the protobuf breaking change checker script
> ---
>
> Key: SPARK-43280
> URL: https://issues.apache.org/jira/browse/SPARK-43280
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-43136) Scala mapGroup, coGroup

2023-04-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-43136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-43136:
-

Assignee: Zhen Li

> Scala mapGroup, coGroup
> ---
>
> Key: SPARK-43136
> URL: https://issues.apache.org/jira/browse/SPARK-43136
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Major
>
> Add basic Dataset#groupByKey -> KeyValueGroupedDataset support






[jira] [Resolved] (SPARK-43136) Scala mapGroup, coGroup

2023-04-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-43136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-43136.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Scala mapGroup, coGroup
> ---
>
> Key: SPARK-43136
> URL: https://issues.apache.org/jira/browse/SPARK-43136
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Major
> Fix For: 3.5.0
>
>
> Add basic Dataset#groupByKey -> KeyValueGroupedDataset support






[jira] [Updated] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed

2023-04-25 Thread Wei Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Liu updated SPARK-43287:

Description: 
How to reproduce:
 # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
 # in another terminal, kill the process `kill `
 # Back to the client terminal, you can't see anything you type, but the 
command still works

 

 
{code:java}
Spark session available as 'spark'.
   _                  __      __                            __
  / ___/   __/ /__   / /___      ___  _/ /_
  \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/
 ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_
// .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/
    /_/


@ wei.liu:~/oss-spark$ CONTRIBUTING.md  appveyor.yml  conf                      
  examples            logs         resource-managers                    target
LICENSE          artifacts     connector                   graphx              
mllib        sbin                                 tools
LICENSE-binary   assembly      core                        hadoop-cloud        
mllib-local  scalastyle-config.xml
NOTICE           bin           data                        hs_err_pid9062.log  
pom.xml      scalastyle-on-compile.generated.xml
NOTICE-binary    binder        dependency-reduced-pom.xml  launcher            
project      spark-warehouse
R                build         dev                         licenses            
python       sql
README.md        common        docs                        licenses-binary     
repl         streaming
wei.liu:~/oss-spark$ wei.liu:~/oss-spark$ wei.liu:~/oss-spark$ 

{code}
 

I ran 'ls' above, and clicked return multiple times

 

 

 

 

  was:
How to reproduce:
 # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
 # in another terminal, kill the process `kill `
 # Back to the client terminal, you can't see anything you type, but the 
command still works

 

 
{code:java}
Spark session available as 'spark'.
   _                  __      __                            __
  / ___/   __/ /__   / /___      ___  _/ /_
  \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/
 ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_
// .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/
    /_/


@ wei.liu:~/oss-spark$ CONTRIBUTING.md  appveyor.yml  conf                      
  examples            logs         resource-managers                    target
LICENSE          artifacts     connector                   graphx              
mllib        sbin                                 tools
LICENSE-binary   assembly      core                        hadoop-cloud        
mllib-local  scalastyle-config.xml
NOTICE           bin           data                        hs_err_pid9062.log  
pom.xml      scalastyle-on-compile.generated.xml
NOTICE-binary    binder        dependency-reduced-pom.xml  launcher            
project      spark-warehouse
R                build         dev                         licenses            
python       sql
README.md        common        docs                        licenses-binary     
repl         streaming
wei.liu:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu:~/oss-spark$ 

{code}
 

I ran 'ls' above, and clicked return multiple times

 

 

 


> Connect JVM client REPL not correctly shut down if killed
> -
>
> Key: SPARK-43287
> URL: https://issues.apache.org/jira/browse/SPARK-43287
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Priority: Major
>
> How to reproduce:
>  # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
>  # in another terminal, kill the process `kill `
>  # Back to the client terminal, you can't see anything you type, but the 
> command still works
>  
>  
> {code:java}
> Spark session available as 'spark'.
>    _                  __      __                            __
>   / ___/   __/ /__   / /___      ___  _/ /_
>   \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/
>  ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_
> // .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/
>     /_/
> @ wei.liu:~/oss-spark$ CONTRIBUTING.md  appveyor.yml  conf                    
>     examples            logs         resource-managers                    
> target
> LICENSE          artifacts     connector                   graphx             
>  mllib        sbin                                 tools
> LICENSE-binary   assembly      core                        hadoop-cloud       

[jira] [Updated] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed

2023-04-25 Thread Wei Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Liu updated SPARK-43287:

Description: 
How to reproduce:
 # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
 # in another terminal, kill the process `kill `
 # Back to the client terminal, you can't see anything you type, but the 
command still works

 

 
{code:java}
Spark session available as 'spark'.
   _                  __      __                            __
  / ___/   __/ /__   / /___      ___  _/ /_
  \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/
 ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_
// .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/
    /_/


@ wei.liu:~/oss-spark$ CONTRIBUTING.md  appveyor.yml  conf                      
  examples            logs         resource-managers                    target
LICENSE          artifacts     connector                   graphx              
mllib        sbin                                 tools
LICENSE-binary   assembly      core                        hadoop-cloud        
mllib-local  scalastyle-config.xml
NOTICE           bin           data                        hs_err_pid9062.log  
pom.xml      scalastyle-on-compile.generated.xml
NOTICE-binary    binder        dependency-reduced-pom.xml  launcher            
project      spark-warehouse
R                build         dev                         licenses            
python       sql
README.md        common        docs                        licenses-binary     
repl         streaming
wei.liu:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu:~/oss-spark$ 

{code}
 

I ran 'ls' above, and clicked return multiple times

 

 

 

  was:
How to reproduce:
 # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
 # in another terminal, kill the process `kill `
 # Back to the client terminal, you can't see anything you type, but the 
command still works

 

 
{code:java}
Spark session available as 'spark'.
   _                  __      __                            __
  / ___/   __/ /__   / /___      ___  _/ /_
  \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/
 ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_
// .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/
    /_/


@ wei.liu@ip-10-110-19-234:~/oss-spark$ CONTRIBUTING.md  appveyor.yml  conf     
                   examples            logs         resource-managers           
         target
LICENSE          artifacts     connector                   graphx              
mllib        sbin                                 tools
LICENSE-binary   assembly      core                        hadoop-cloud        
mllib-local  scalastyle-config.xml
NOTICE           bin           data                        hs_err_pid9062.log  
pom.xml      scalastyle-on-compile.generated.xml
NOTICE-binary    binder        dependency-reduced-pom.xml  launcher            
project      spark-warehouse
R                build         dev                         licenses            
python       sql
README.md        common        docs                        licenses-binary     
repl         streaming
wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ 
wei.liu@ip-10-110-19-234:~/oss-spark$ 

{code}
 

I ran 'ls' above, and clicked return multiple times

 

 


> Connect JVM client REPL not correctly shut down if killed
> -
>
> Key: SPARK-43287
> URL: https://issues.apache.org/jira/browse/SPARK-43287
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Priority: Major
>
> How to reproduce:
>  # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
>  # in another terminal, kill the process `kill `
>  # Back to the client terminal, you can't see anything you type, but the 
> command still works
>  
>  
> {code:java}
> Spark session available as 'spark'.
>    _                  __      __                            __
>   / ___/   __/ /__   / /___      ___  _/ /_
>   \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/
>  ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_
> // .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/
>     /_/
> @ wei.liu:~/oss-spark$ CONTRIBUTING.md  appveyor.yml  conf                    
>     examples            logs         resource-managers                    
> target
> LICENSE          artifacts     connector                   graphx             
>  mllib        sbin                                 tools
> LICENSE-binary 

[jira] [Updated] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed

2023-04-25 Thread Wei Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Liu updated SPARK-43287:

Description: 
How to reproduce:
 # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
 # in another terminal, kill the process `kill `
 # Back to the client terminal, you can't see anything you type, but the 
command still works

 

 
{code:java}
Spark session available as 'spark'.
   _                  __      __                            __
  / ___/   __/ /__   / /___      ___  _/ /_
  \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/
 ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_
// .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/
    /_/


@ wei.liu@ip-10-110-19-234:~/oss-spark$ CONTRIBUTING.md  appveyor.yml  conf     
                   examples            logs         resource-managers           
         target
LICENSE          artifacts     connector                   graphx              
mllib        sbin                                 tools
LICENSE-binary   assembly      core                        hadoop-cloud        
mllib-local  scalastyle-config.xml
NOTICE           bin           data                        hs_err_pid9062.log  
pom.xml      scalastyle-on-compile.generated.xml
NOTICE-binary    binder        dependency-reduced-pom.xml  launcher            
project      spark-warehouse
R                build         dev                         licenses            
python       sql
README.md        common        docs                        licenses-binary     
repl         streaming
wei.liu@ip-10-110-19-234:~/oss-spark$ wei.liu@ip-10-110-19-234:~/oss-spark$ 
wei.liu@ip-10-110-19-234:~/oss-spark$ 

{code}
 

I ran 'ls' above, and clicked return multiple times

 

 

  was:
How to reproduce:
 # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
 # in another terminal, kill the process `kill `
 # Back to the client terminal, you can't see anything you type, but the 
command still works

```

Spark session available as 'spark'.

   _                  __      __                            __

  / ___/   __/ /__   / /___      ___  _/ /_

  \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/

 ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_

// .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/

    /_/

 

@ *wei.liu*:*~/oss-spark*$ CONTRIBUTING.md  appveyor.yml  *conf*                
        *examples*            *logs*         *resource-managers*                
    *target*

LICENSE          *artifacts*     *connector*                   *graphx*         
     *mllib*        *sbin*                                 *tools*

LICENSE-binary   *assembly*      *core*                        *hadoop-cloud*   
     *mllib-local*  scalastyle-config.xml

NOTICE           *bin*           *data*                        
hs_err_pid9062.log  pom.xml      scalastyle-on-compile.generated.xml

NOTICE-binary    *binder*        dependency-reduced-pom.xml  *launcher*         
   *project*      *spark-warehouse*

*R*                *build*         *dev*                         *licenses*     
       *python*       *sql*

README.md        *common*        *docs*                        
*licenses-binary*     *repl*         *streaming*

*wei.liu*:*~/oss-spark*$ *wei.liu*:*~/oss-spark*$ *wei.liu*:*~/oss-spark*$ 

```

I ran 'ls' above, and clicked return multiple times

 


> Connect JVM client REPL not correctly shut down if killed
> -
>
> Key: SPARK-43287
> URL: https://issues.apache.org/jira/browse/SPARK-43287
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Priority: Major
>
> How to reproduce:
>  # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
>  # in another terminal, kill the process `kill `
>  # Back to the client terminal, you can't see anything you type, but the 
> command still works
>  
>  
> {code:java}
> Spark session available as 'spark'.
>    _                  __      __                            __
>   / ___/   __/ /__   / /___      ___  _/ /_
>   \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/
>  ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_
> // .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/
>     /_/
> @ wei.liu@ip-10-110-19-234:~/oss-spark$ CONTRIBUTING.md  appveyor.yml  conf   
>                      examples            logs         resource-managers       
>              target
> LICENSE          artifacts     connector                   graphx             
>  

[jira] [Created] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed

2023-04-25 Thread Wei Liu (Jira)
Wei Liu created SPARK-43287:
---

 Summary: Connect JVM client REPL not correctly shut down if killed
 Key: SPARK-43287
 URL: https://issues.apache.org/jira/browse/SPARK-43287
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: Wei Liu


How to reproduce:
 # Start a scala client `./connector/connect/bin/spark-connect-scala-client`
 # in another terminal, kill the process `kill `
 # Back to the client terminal, you can't see anything you type, but the 
command still works

```

Spark session available as 'spark'.

   _                  __      __                            __

  / ___/   __/ /__   / /___      ___  _/ /_

  \__ \/ __ \/ __ `/ ___/ //_/  / /   / __ \/ __ \/ __ \/ _ \/ ___/ __/

 ___/ / /_/ / /_/ / /  / ,<    / /___/ /_/ / / / / / / /  __/ /__/ /_

// .___/\__,_/_/  /_/|_|   \/\/_/ /_/_/ /_/\___/\___/\__/

    /_/

 

@ *wei.liu*:*~/oss-spark*$ CONTRIBUTING.md  appveyor.yml  *conf*                
        *examples*            *logs*         *resource-managers*                
    *target*

LICENSE          *artifacts*     *connector*                   *graphx*         
     *mllib*        *sbin*                                 *tools*

LICENSE-binary   *assembly*      *core*                        *hadoop-cloud*   
     *mllib-local*  scalastyle-config.xml

NOTICE           *bin*           *data*                        
hs_err_pid9062.log  pom.xml      scalastyle-on-compile.generated.xml

NOTICE-binary    *binder*        dependency-reduced-pom.xml  *launcher*         
   *project*      *spark-warehouse*

*R*                *build*         *dev*                         *licenses*     
       *python*       *sql*

README.md        *common*        *docs*                        
*licenses-binary*     *repl*         *streaming*

*wei.liu*:*~/oss-spark*$ *wei.liu*:*~/oss-spark*$ *wei.liu*:*~/oss-spark*$ 

```

I ran 'ls' above, and clicked return multiple times

 






[jira] [Resolved] (SPARK-43285) ReplE2ESuite consistently fails with JDK 17

2023-04-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-43285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-43285.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> ReplE2ESuite consistently fails with JDK 17
> ---
>
> Key: SPARK-43285
> URL: https://issues.apache.org/jira/browse/SPARK-43285
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.5.0
>
>
> [[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] 
> from [~gurwls223]]
> This test consistently fails with JDK 17:
> {code:java}
> [info] ReplE2ESuite:
> [info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds)
> [info] java.lang.RuntimeException: REPL Timed out while running command: 
> [info] spark.sql("select 1").collect()
> [info] 
> [info] Console output: 
> [info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc
> [info] at 
> org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87)
> [info] at 
> org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102)
> [info] at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
> [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
> [info] at 
> org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224){code}
> [https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647]
> [https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907]
> [https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802]
> [https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201]
> [https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414]






[jira] [Assigned] (SPARK-43285) ReplE2ESuite consistently fails with JDK 17

2023-04-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-43285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-43285:
-

Assignee: Venkata Sai Akhil Gudesa

> ReplE2ESuite consistently fails with JDK 17
> ---
>
> Key: SPARK-43285
> URL: https://issues.apache.org/jira/browse/SPARK-43285
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
>
> [[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] 
> from [~gurwls223]]
> This test consistently fails with JDK 17:
> {code:java}
> [info] ReplE2ESuite:
> [info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds)
> [info] java.lang.RuntimeException: REPL Timed out while running command: 
> [info] spark.sql("select 1").collect()
> [info] 
> [info] Console output: 
> [info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc
> [info] at 
> org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87)
> [info] at 
> org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102)
> [info] at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
> [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
> [info] at 
> org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
> [info] at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224){code}
> [https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647]
> [https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907]
> [https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802]
> [https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201]
> [https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414]






[jira] [Created] (SPARK-43286) Update CBC mode in aes_encrypt()/aes_decrypt() to not use KDF

2023-04-25 Thread Steve Weis (Jira)
Steve Weis created SPARK-43286:
--

 Summary: Update CBC mode in aes_encrypt()/aes_decrypt() to not use 
KDF
 Key: SPARK-43286
 URL: https://issues.apache.org/jira/browse/SPARK-43286
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Steve Weis


The current implementation of AES-CBC mode, called via {{aes_encrypt}} and 
{{aes_decrypt}}, uses a key derivation function (KDF) based on OpenSSL's 
[EVP_BytesToKey|https://www.openssl.org/docs/man3.0/man3/EVP_BytesToKey.html]. 
That function is intended for generating keys from passwords, and OpenSSL's 
documentation discourages its use: _"Newer applications should use a more 
modern algorithm"._

{{aes_encrypt}} and {{aes_decrypt}} should use the key directly in CBC mode, as 
they do for both GCM and ECB modes. The output should then be the 
initialization vector (IV) prepended to the ciphertext, as is done with GCM 
mode:
{{(16-byte randomly generated IV | AES-CBC encrypted ciphertext)}}
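
For illustration, a hedged sketch of the proposed behavior in plain Python 
using the cryptography package (this mirrors the intended output layout; it is 
not Spark's actual implementation):

{code:java}
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def aes_cbc_encrypt(key: bytes, plaintext: bytes) -> bytes:
    iv = os.urandom(16)  # randomly generated 16-byte IV; key used as-is, no KDF
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    # Output layout: (IV | AES-CBC ciphertext), matching the GCM convention
    return iv + encryptor.update(padded) + encryptor.finalize()
{code}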






[jira] [Updated] (SPARK-43284) _metadata.file_path regression

2023-04-25 Thread David Lewis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lewis updated SPARK-43284:

Summary: _metadata.file_path regression  (was: _metadata.file_path)

> _metadata.file_path regression
> --
>
> Key: SPARK-43284
> URL: https://issues.apache.org/jira/browse/SPARK-43284
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: David Lewis
>Priority: Major
>
> As part of the [SparkPath 
> refactor](https://issues.apache.org/jira/browse/SPARK-41970) the behavior of 
> `_metadata.file_path` was inadvertently changed. In Spark 3.4+ it now returns 
> a non-encoded path string, as opposed to a url-encoded path string.
> This ticket is to fix that regression.






[jira] [Updated] (SPARK-43285) ReplE2ESuite consistently fails with JDK 17

2023-04-25 Thread Venkata Sai Akhil Gudesa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Sai Akhil Gudesa updated SPARK-43285:
-
Description: 
[[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] 
from [~gurwls223]]

This test consistently fails with JDK 17:
{code:java}
[info] ReplE2ESuite:
[info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds)
[info] java.lang.RuntimeException: REPL Timed out while running command: 
[info] spark.sql("select 1").collect()
[info] 
[info] Console output: 
[info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc
[info] at 
org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87)
[info] at 
org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info] at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224){code}

[https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647]
[https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907]
[https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802]
[https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201]
[https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414]

  was:
[[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] 
from [~gurwls223]]

This test consistently fails with JDK 17:
[info] ReplE2ESuite:
[info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds)
[info]   java.lang.RuntimeException: REPL Timed out while running command: 
[info] spark.sql("select 1").collect()
[info]   
[info] Console output: 
[info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc
[info]   at 
org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87)
[info]   at 
org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at 
org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647]
[https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907]
[https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802]
[https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201]
[https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414]


> ReplE2ESuite consistently fails with JDK 17
> ---
>
> Key: SPARK-43285
> URL: https://issues.apache.org/jira/browse/SPARK-43285
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> [[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] 
> from [~gurwls223]]
> This test consistently fails with JDK 17:
> {code:java}
> [info] ReplE2ESuite:
> [info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds)
> [info] java.lang.RuntimeException: REPL Timed out while running command: 
> [info] spark.sql("select 1").collect()
> [info] 
> [info] Console output: 
> [info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc
> [info] at 
> org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87)
> [info] at 
> 

[jira] [Created] (SPARK-43285) ReplE2ESuite consistently fails with JDK 17

2023-04-25 Thread Venkata Sai Akhil Gudesa (Jira)
Venkata Sai Akhil Gudesa created SPARK-43285:


 Summary: ReplE2ESuite consistently fails with JDK 17
 Key: SPARK-43285
 URL: https://issues.apache.org/jira/browse/SPARK-43285
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: Venkata Sai Akhil Gudesa


[[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] 
from [~gurwls223]]

This test consistently fails with JDK 17:
[info] ReplE2ESuite:
[info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds)
[info]   java.lang.RuntimeException: REPL Timed out while running command: 
[info] spark.sql("select 1").collect()
[info]   
[info] Console output: 
[info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc
[info]   at 
org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87)
[info]   at 
org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at 
org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647]
[https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907]
[https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802]
[https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201]
[https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43284) _metadata.file_path

2023-04-25 Thread David Lewis (Jira)
David Lewis created SPARK-43284:
---

 Summary: _metadata.file_path
 Key: SPARK-43284
 URL: https://issues.apache.org/jira/browse/SPARK-43284
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: David Lewis


As part of the [SparkPath 
refactor](https://issues.apache.org/jira/browse/SPARK-41970) the behavior of 
`_metadata.file_path` was inadvertently changed. In Spark 3.4+ it now returns a 
non-encoded path string, as opposed to a url-encoded path string.

This ticket is to fix that regression.
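
A minimal sketch of the observable difference (the directory name is hypothetical, chosen so the encoding is visible):

{code:scala}
val df = spark.read.parquet("/data/dir with space")
df.select("_metadata.file_path").distinct().show(truncate = false)
// Spark <= 3.3 (URL-encoded):  file:/data/dir%20with%20space/part-...parquet
// Spark 3.4+   (non-encoded):  file:/data/dir with space/part-...parquet
{code}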



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43283) _metadata.file_path returns unescaped URLs

2023-04-25 Thread David Lewis (Jira)
David Lewis created SPARK-43283:
---

 Summary: _metadata.file_path returns unescaped URLs
 Key: SPARK-43283
 URL: https://issues.apache.org/jira/browse/SPARK-43283
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: David Lewis


As part of https://issues.apache.org/jira/browse/SPARK-41970 we changed the 
encoding of the string returned by `_metadata.file_path` from url-encoded to 
hadoop-path encoded (i.e. not encoded).

 

This ticket is to undo that behavior change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`

2023-04-25 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716312#comment-17716312
 ] 

Ignite TC Bot commented on SPARK-43156:
---

User 'jchen5' has created a pull request for this issue:
https://github.com/apache/spark/pull/40946

> Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
> 
>
> Key: SPARK-43156
> URL: https://issues.apache.org/jira/browse/SPARK-43156
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jack Chen
>Priority: Major
>
> Example query:
> {code:java}
> spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) 
> from t0").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false])  
> {code}
> In this subquery, count(1) always evaluates to a non-null integer value, so 
> count(1) is null is always false. The correct evaluation of the subquery is 
> always false.
> We incorrectly evaluate it to null for empty groups. The reason is that 
> NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] 
> [false] - this rewrite would be correct normally, but in the context of a 
> scalar subquery it breaks our count bug handling in 
> RewriteCorrelatedScalarSubquery.constructLeftJoins . By the time we get 
> there, the query appears to not have the count bug - it looks the same as if 
> the original query had a subquery with select any_value(false) from r..., and 
> that case is _not_ subject to the count bug.
>  
> Postgres comparison show correct always-false result: 
> [http://sqlfiddle.com/#!17/67822/5]
> DDL for the example:
> {code:java}
> create or replace temp view t0 (a, b)
> as values
>     (1, 1.0),
>     (2, 2.0);
> create or replace temp view t1 (c, d)
> as values
>     (2, 3.0); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43098) Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty

2023-04-25 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716313#comment-17716313
 ] 

Ignite TC Bot commented on SPARK-43098:
---

User 'jchen5' has created a pull request for this issue:
https://github.com/apache/spark/pull/40946

> Should not handle the COUNT bug when the GROUP BY clause of a correlated 
> scalar subquery is non-empty
> -
>
> Key: SPARK-43098
> URL: https://issues.apache.org/jira/browse/SPARK-43098
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.4.1, 3.5.0
>
>
> From [~allisonwang-db] :
> There is no COUNT bug when the correlated equality predicates are also in the 
> group by clause. However, the current logic to handle the COUNT bug still 
> adds a default aggregate function value and returns incorrect results.
>  
> {code:java}
> create view t1(c1, c2) as values (0, 1), (1, 2);
> create view t2(c1, c2) as values (0, 2), (0, 3);
> select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from 
> t1;
> -- Correct answer: [(0, 1, 2), (1, 2, null)]
> +---+---+--+
> |c1 |c2 |scalarsubquery(c1)|
> +---+---+--+
> |0  |1  |2 |
> |1  |2  |0 |
> +---+---+--+
>  {code}
>  
> This bug affects scalar subqueries in RewriteCorrelatedScalarSubquery, but 
> lateral subqueries handle it correctly in DecorrelateInnerQuery. Related: 
> https://issues.apache.org/jira/browse/SPARK-36113 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36112) Enable DecorrelateInnerQuery for IN/EXISTS subqueries

2023-04-25 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716289#comment-17716289
 ] 

Jia Fan commented on SPARK-36112:
-

Hi, [~allisonwang-db] I checked the code. It seems the only work needed is to 
change the code in `PullupCorrelatedPredicates` to make sure Exists invokes 
`decorrelate`.

!image-2023-04-25-21-51-55-961.png|width=617,height=275!

`DecorrelateInnerQuery` already supports Filter in the subquery, and Exists is 
also supported in `RewritePredicateSubquery`. Should I change just one line, or 
is there something else I don't understand?
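
A rough sketch of what that one-line change could look like (constructor fields 
and helper names are assumed from memory, not copied from the Spark source):

{code:scala}
// In PullupCorrelatedPredicates: route Exists through the decorrelate() path
// that ScalarSubquery already uses, instead of pullOutCorrelatedPredicates().
case Exists(sub, children, exprId, conditions, hint) if children.nonEmpty =>
  val (newPlan, newCond) = decorrelate(sub, plan) // was: pullOutCorrelatedPredicates(sub, plan)
  Exists(newPlan, children, exprId, getJoinCondition(newCond, conditions), hint)
{code}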

> Enable DecorrelateInnerQuery for IN/EXISTS subqueries
> -
>
> Key: SPARK-36112
> URL: https://issues.apache.org/jira/browse/SPARK-36112
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: image-2023-04-25-21-51-55-961.png
>
>
> Currently, `DecorrelateInnerQuery` is only enabled for scalar and lateral 
> subqueries. We should enable `DecorrelateInnerQuery` for IN/EXISTS 
> subqueries. Note we need to add the logic to rewrite domain joins in 
> `RewritePredicateSubquery`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43225) Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution

2023-04-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43225:
-
Issue Type: Improvement  (was: Bug)

> Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
> --
>
> Key: SPARK-43225
> URL: https://issues.apache.org/jira/browse/SPARK-43225
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Minor
>
> To fix CVE issue: https://github.com/apache/spark/security/dependabot/50



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43225) Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution

2023-04-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43225:
-
Priority: Minor  (was: Major)

> Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
> --
>
> Key: SPARK-43225
> URL: https://issues.apache.org/jira/browse/SPARK-43225
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Minor
>
> To fix CVE issue: https://github.com/apache/spark/security/dependabot/50



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43225) Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution

2023-04-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-43225.
--
Fix Version/s: 3.5.0
 Assignee: Yuming Wang
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40893

> Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
> --
>
> Key: SPARK-43225
> URL: https://issues.apache.org/jira/browse/SPARK-43225
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.5.0
>
>
> To fix CVE issue: https://github.com/apache/spark/security/dependabot/50



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42798) Upgrade protobuf-java to 3.22.2

2023-04-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-42798:
-
Priority: Minor  (was: Major)

> Upgrade protobuf-java to 3.22.2
> ---
>
> Key: SPARK-42798
> URL: https://issues.apache.org/jira/browse/SPARK-42798
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> * [https://github.com/protocolbuffers/protobuf/releases/tag/v22.1]
>  * [https://github.com/protocolbuffers/protobuf/releases/tag/v22.2]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42798) Upgrade protobuf-java to 3.22.2

2023-04-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42798.
--
Fix Version/s: 3.5.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40430

> Upgrade protobuf-java to 3.22.2
> ---
>
> Key: SPARK-42798
> URL: https://issues.apache.org/jira/browse/SPARK-42798
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> * [https://github.com/protocolbuffers/protobuf/releases/tag/v22.1]
>  * [https://github.com/protocolbuffers/protobuf/releases/tag/v22.2]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36112) Enable DecorrelateInnerQuery for IN/EXISTS subqueries

2023-04-25 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-36112:

Attachment: image-2023-04-25-21-51-55-961.png

> Enable DecorrelateInnerQuery for IN/EXISTS subqueries
> -
>
> Key: SPARK-36112
> URL: https://issues.apache.org/jira/browse/SPARK-36112
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: image-2023-04-25-21-51-55-961.png
>
>
> Currently, `DecorrelateInnerQuery` is only enabled for scalar and lateral 
> subqueries. We should enable `DecorrelateInnerQuery` for IN/EXISTS 
> subqueries. Note we need to add the logic to rewrite domain joins in 
> `RewritePredicateSubquery`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39753) Broadcast joins should pushdown join constraints as Filter to the larger relation

2023-04-25 Thread Jean-Christophe Lefebvre (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716268#comment-17716268
 ] 

Jean-Christophe Lefebvre edited comment on SPARK-39753 at 4/25/23 1:40 PM:
---

Any development on this ticket?


was (Author: JIRAUSER300051):
Any developpement on this ticket?

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-39753
> URL: https://issues.apache.org/jira/browse/SPARK-39753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Victor Delépine
>Priority: Major
>
> SPARK-19609 was bulk-closed a while ago, but not fixed. I've decided to 
> re-open it here for more visibility, since I believe this bug has a major 
> impact and that fixing it could drastically improve the performance of many 
> pipelines.
> Allow me to paste the initial description again here:
> _For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} 
> clause. An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks._
> _This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results._
>  
> Essentially, when doing a Broadcast join, the smaller side can be used to 
> filter down the bigger side before performing the join. As of today, the join 
> will read all partitions of the bigger side, without pruning partitions.
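
A hedged sketch of the idea in user-space form (smallDf/bigDf and column "a" are 
placeholders; this is a manual workaround, not what Spark does automatically 
today):

{code:scala}
import org.apache.spark.sql.functions.{broadcast, col}

// Collect the distinct join keys of the small (broadcastable) side...
val keys = smallDf.select("a").distinct().collect().map(_.get(0))
// ...and push them down as an IN filter on the big side before the join,
// so the scan of bigDf can prune partitions and files.
val pruned = bigDf.where(col("a").isin(keys: _*))
pruned.join(broadcast(smallDf), "a")
{code}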



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39753) Broadcast joins should pushdown join constraints as Filter to the larger relation

2023-04-25 Thread Jean-Christophe Lefebvre (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716268#comment-17716268
 ] 

Jean-Christophe Lefebvre commented on SPARK-39753:
--

Any development on this ticket?

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-39753
> URL: https://issues.apache.org/jira/browse/SPARK-39753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Victor Delépine
>Priority: Major
>
> SPARK-19609 was bulk-closed a while ago, but not fixed. I've decided to 
> re-open it here for more visibility, since I believe this bug has a major 
> impact and that fixing it could drastically improve the performance of many 
> pipelines.
> Allow me to paste the initial description again here:
> _For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} 
> clause. An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks._
> _This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results._
>  
> Essentially, when doing a Broadcast join, the smaller side can be used to 
> filter down the bigger side before performing the join. As of today, the join 
> will read all partitions of the bigger side, without pruning partitions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43282) Investigate DataFrame.sort_values with pandas behavior.

2023-04-25 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43282:
---

 Summary: Investigate DataFrame.sort_values with pandas behavior.
 Key: SPARK-43282
 URL: https://issues.apache.org/jira/browse/SPARK-43282
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


{code:java}
import pandas as pd
pdf = pd.DataFrame(
    {
        "a": pd.Categorical([1, 2, 3, 1, 2, 3]),
        "b": pd.Categorical(
            ["b", "a", "c", "c", "b", "a"], categories=["c", "b", "d", "a"]
        ),
    },
)
pdf.groupby("a").apply(lambda x: x).sort_values(["a"])

Traceback (most recent call last):
...
ValueError: 'a' is both an index level and a column label, which is ambiguous. 
{code}
We should investigate whether this is intended behavior or just a bug in 
pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716135#comment-17716135
 ] 

Andrew Grigorev edited comment on SPARK-43273 at 4/25/23 12:56 PM:
---

Just as icing on the cake, ClickHouse accidentally started to use LZ4_RAW by 
default for their Parquet output format :).

https://github.com/ClickHouse/ClickHouse/issues/49141


was (Author: ei-grad):
Just as a icing on the cake - Clickhouse accidently started to use LZ4_RAW by 
default for their Parquet output format :).

> Spark can't read parquet files with a newer LZ4_RAW compression
> ---
>
> Key: SPARK-43273
> URL: https://issues.apache.org/jira/browse/SPARK-43273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0
>Reporter: Andrew Grigorev
>Priority: Trivial
>
> parquet-hadoop version should be updated to 1.13.0 (together with the other 
> parquet-mr libs)
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
> failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
>     at java.base/java.lang.Enum.valueOf(Enum.java:273)
>     at 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
>     at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
> ... {code}
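
For context, a minimal repro sketch (the path is hypothetical; any Parquet file 
written with the LZ4_RAW codec, e.g. by a recent ClickHouse, reproduces it on 
affected versions):

{code:scala}
spark.read.parquet("/tmp/clickhouse_lz4_raw.parquet").show()
// => java.lang.IllegalArgumentException: No enum constant
//    org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
{code}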



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43281) Fix concurrent writer does not update file metrics

2023-04-25 Thread XiDuo You (Jira)
XiDuo You created SPARK-43281:
-

 Summary: Fix concurrent writer does not update file metrics
 Key: SPARK-43281
 URL: https://issues.apache.org/jira/browse/SPARK-43281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


The concurrent writer uses the temp file path to look up the file status after 
the task commits. However, the temp file has already been moved to its final 
path during task commit, so the file metrics are never updated.
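
A minimal sketch of the broken sequence (all identifiers are hypothetical, for 
illustration only):

{code:scala}
val tempPath = new Path(writer.currentTempFilePath) // path while the task runs
committer.commitTask(taskAttemptContext)            // moves the file to its final path
val status = fs.getFileStatus(tempPath)             // too late: tempPath is gone,
                                                    // so file metrics never update
{code}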



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43272) Replace reflection w/ direct calling for `SparkHadoopUtil#createFile`

2023-04-25 Thread Nikita Awasthi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716236#comment-17716236
 ] 

Nikita Awasthi commented on SPARK-43272:


User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40945

> Replace reflection w/ direct calling for  `SparkHadoopUtil#createFile`
> --
>
> Key: SPARK-43272
> URL: https://issues.apache.org/jira/browse/SPARK-43272
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43142) DSL expressions fail on attribute with special characters

2023-04-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43142.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40902
[https://github.com/apache/spark/pull/40902]

> DSL expressions fail on attribute with special characters
> -
>
> Key: SPARK-43142
> URL: https://issues.apache.org/jira/browse/SPARK-43142
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Willi Raschkowski
>Assignee: Willi Raschkowski
>Priority: Major
> Fix For: 3.5.0
>
>
> Expressions on implicitly converted attributes fail if the attributes have 
> names containing special characters. They fail even if the attributes are 
> backtick-quoted:
> {code:java}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> import org.apache.spark.sql.catalyst.dsl.expressions._
> scala> "`slashed/col`".attr
> res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = 
> 'slashed/col
> scala> "`slashed/col`".attr.asc
> org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '/' expecting {<EOF>, '.', '-'}(line 1, pos 7)
> == SQL ==
> slashed/col
> ---^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Grigorev updated SPARK-43273:

Description: 
parquet-hadoop version should be updated to 1.13.0 (together with the other 
parquet-mr libs)


{code:java}
java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
java.lang.IllegalArgumentException: No enum constant 
org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
    at java.base/java.lang.Enum.valueOf(Enum.java:273)
    at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
    at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
... {code}

  was:
hadoop-parquet version should be updated to 1.3.0

 
{code:java}
java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
java.lang.IllegalArgumentException: No enum constant 
org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
    at java.base/java.lang.Enum.valueOf(Enum.java:273)
    at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
    at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
... {code}


> Spark can't read parquet files with a newer LZ4_RAW compression
> ---
>
> Key: SPARK-43273
> URL: https://issues.apache.org/jira/browse/SPARK-43273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0
>Reporter: Andrew Grigorev
>Priority: Trivial
>
> parquet-hadoop version should be updated to 1.13.0 (together with the other 
> parquet-mr libs)
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
> failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
>     at java.base/java.lang.Enum.valueOf(Enum.java:273)
>     at 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
>     at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
> ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43280) Improve the protobuf breaking change checker script

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-43280:
--
Summary: Improve the protobuf breaking change checker script  (was: Improve 
the protobuf breaking change script)

> Improve the protobuf breaking change checker script
> ---
>
> Key: SPARK-43280
> URL: https://issues.apache.org/jira/browse/SPARK-43280
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43280) Improve the protobuf breaking change script

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-43280:
--
Priority: Major  (was: Blocker)

> Improve the protobuf breaking change script
> ---
>
> Key: SPARK-43280
> URL: https://issues.apache.org/jira/browse/SPARK-43280
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43280) Improve the protobuf breaking change script

2023-04-25 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-43280:
-

 Summary: Improve the protobuf breaking change script
 Key: SPARK-43280
 URL: https://issues.apache.org/jira/browse/SPARK-43280
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43279) Cleanup unused members from `SparkHadoopUtil`

2023-04-25 Thread Yang Jie (Jira)
Yang Jie created SPARK-43279:


 Summary: Cleanup unused members from `SparkHadoopUtil`
 Key: SPARK-43279
 URL: https://issues.apache.org/jira/browse/SPARK-43279
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43278) Exception in thread "main" java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;

2023-04-25 Thread jiangjiguang0719 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiangjiguang0719 updated SPARK-43278:
-
Description: 
Java version: 1.8.0_331, Apache Maven 3.8.4

I ran the following steps:
 # git clone [https://github.com/apache/spark.git]
 # git checkout -b v3.3.0 3.3.0
 #  mvn clean install -DskipTests
 # copy hive-site.xml to examples/src/main/resources/
 # execute TPC-H Q6 

 
{code:java}
public static void main(String[] args) throws InterruptedException {
    SparkConf sparkConf = new SparkConf()
            .setAppName("demo")
            .setMaster("local[1]");
    SparkSession sparkSession = SparkSession.builder()
            .config(sparkConf)
            .enableHiveSupport()
            .getOrCreate();
    sparkSession.sql("use local_tpch_sf10_uncompressed_etl");
    sparkSession.sql(TPCH.SQL6).show();
} {code}
 

 

get the error info:

Exception in thread "main" java.lang.NoSuchMethodError: 
java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
    at 
org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:115)
    at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:325)
    at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
    at 
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
    at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:75)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1529)
    at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:235)
    at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:457)
    at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:448)
    at 
org.apache.spark.sql.execution.FileSourceScanExec.doExecuteColumnar(DataSourceScanExec.scala:547)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:221)
    at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
    at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
    at 
org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:217)

  was:
Java version: 1.8.0_331, Apache Maven 3.8.4

I run next steps:
 # git clone [https://github.com/apache/spark.git]
 # git checkout -b v3.3.0 3.3.0
 #  mvn clean install -DskipTests
 # copy hive-site.xml to examples/src/main/resources/
 # execute TPC-H Q6 

!image-2023-04-25-17-14-50-392.png|width=437,height=246!

get the error info

!image-2023-04-25-17-15-57-874.png|width=466,height=161!


> Exception in thread "main" java.lang.NoSuchMethodError: 
> java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
> ---
>
> Key: SPARK-43278
> URL: https://issues.apache.org/jira/browse/SPARK-43278
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0
>Reporter: jiangjiguang0719
>Priority: Major
>
> Java version: 1.8.0_331, Apache Maven 3.8.4
> I ran the following steps:
>  # git clone [https://github.com/apache/spark.git]
>  # git checkout -b v3.3.0 3.3.0
>  #  mvn clean install -DskipTests
>  # copy hive-site.xml to examples/src/main/resources/
>  # execute TPC-H Q6 
>  
> {code:java}
> public static void main(String[] args) throws InterruptedException {
>     SparkConf sparkConf = new SparkConf()
>             .setAppName("demo")
>             .setMaster("local[1]");
>     SparkSession sparkSession = SparkSession.builder()
>             .config(sparkConf)
>             .enableHiveSupport()
>             .getOrCreate();
>     sparkSession.sql("use local_tpch_sf10_uncompressed_etl");
>     sparkSession.sql(TPCH.SQL6).show();
> } {code}
>  
>  
> get the error info:
> Exception in thread "main" java.lang.NoSuchMethodError: 
> java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
>     at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:115)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:325)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
>     at 
> 

[jira] [Created] (SPARK-43278) Exception in thread "main" java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;

2023-04-25 Thread jiangjiguang0719 (Jira)
jiangjiguang0719 created SPARK-43278:


 Summary: Exception in thread "main" java.lang.NoSuchMethodError: 
java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
 Key: SPARK-43278
 URL: https://issues.apache.org/jira/browse/SPARK-43278
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 3.3.0
Reporter: jiangjiguang0719


Java version: 1.8.0_331, Apache Maven 3.8.4

I ran the following steps:
 # git clone [https://github.com/apache/spark.git]
 # git checkout -b v3.3.0 3.3.0
 #  mvn clean install -DskipTests
 # copy hive-site.xml to examples/src/main/resources/
 # execute TPC-H Q6 

!image-2023-04-25-17-14-50-392.png|width=437,height=246!

get the error info

!image-2023-04-25-17-15-57-874.png|width=466,height=161!
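
A likely cause, offered here as an assumption rather than a confirmed diagnosis: 
ByteBuffer.flip() returns java.nio.Buffer on Java 8 but ByteBuffer on Java 9+, 
so classes compiled on JDK 9+ without --release 8 record the narrowed descriptor 
flip()Ljava/nio/ByteBuffer;, which a Java 8 runtime cannot resolve. A minimal 
sketch of the portable calling pattern:

{code:scala}
import java.nio.ByteBuffer

// Calling flip() through the Buffer supertype makes the emitted descriptor
// flip()Ljava/nio/Buffer; which resolves on Java 8 and Java 9+ alike.
val buf = ByteBuffer.allocate(16)
buf.asInstanceOf[java.nio.Buffer].flip()
{code}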



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42940) Session management support streaming connect

2023-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716181#comment-17716181
 ] 

ASF GitHub Bot commented on SPARK-42940:


User 'rangadi' has created a pull request for this issue:
https://github.com/apache/spark/pull/40937

> Session management support streaming connect
> 
>
> Key: SPARK-42940
> URL: https://issues.apache.org/jira/browse/SPARK-42940
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Priority: Major
>
> Add session support for streaming jobs. 
> E.g. a session should stay alive when a streaming job is alive. 
> More complex scenarios, like what happens when the client loses track of the 
> session, might differ. Such semantics would be handled as part of session 
> semantics across Spark Connect (including streaming). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43204) Align MERGE assignments with table attributes

2023-04-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43204.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40919
[https://github.com/apache/spark/pull/40919]

> Align MERGE assignments with table attributes
> -
>
> Key: SPARK-43204
> URL: https://issues.apache.org/jira/browse/SPARK-43204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.5.0
>
>
> Similar to SPARK-42151, we need to do the same for MERGE assignments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43204) Align MERGE assignments with table attributes

2023-04-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-43204:
---

Assignee: Anton Okolnychyi

> Align MERGE assignments with table attributes
> -
>
> Key: SPARK-43204
> URL: https://issues.apache.org/jira/browse/SPARK-43204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>
> Similar to SPARK-42151, we need to do the same for MERGE assignments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43277) Clean up deprecation hadoop api usage in Yarn module

2023-04-25 Thread Yang Jie (Jira)
Yang Jie created SPARK-43277:


 Summary: Clean up deprecation hadoop api usage in Yarn module
 Key: SPARK-43277
 URL: https://issues.apache.org/jira/browse/SPARK-43277
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43226) Define extractors for file-constant metadata columns

2023-04-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-43226.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40885
[https://github.com/apache/spark/pull/40885]

> Define extractors for file-constant metadata columns
> 
>
> Key: SPARK-43226
> URL: https://issues.apache.org/jira/browse/SPARK-43226
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ryan Johnson
>Assignee: Ryan Johnson
>Priority: Major
> Fix For: 3.5.0
>
>
> File-source constant metadata columns are often derived indirectly from 
> file-level metadata values rather than exposing those values directly. For 
> example, {{_metadata.file_name}} is currently hard-coded in 
> {{FileFormat.updateMetadataInternalRow}} as:
>  
> {code:java}
> UTF8String.fromString(filePath.getName){code}
>  
> We should add support for metadata extractors, functions that map from 
> {{PartitionedFile}} to {{Literal}}, so that we can express such columns 
> in a generic way instead of hard-coding them.
> We can't just add them to the metadata map because then they have to be 
> pre-computed even if it turns out the query does not select that field.
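
A rough sketch of the extractor shape this implies (types and field names are 
assumed from the description, not the merged API):

{code:scala}
// An extractor maps the file being read to a Literal and is evaluated lazily,
// only when the query actually selects the corresponding metadata field.
val fileNameExtractor: PartitionedFile => Literal =
  pf => Literal(pf.filePath.toPath.getName)
{code}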



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43226) Define extractors for file-constant metadata columns

2023-04-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-43226:
---

Assignee: Ryan Johnson

> Define extractors for file-constant metadata columns
> 
>
> Key: SPARK-43226
> URL: https://issues.apache.org/jira/browse/SPARK-43226
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ryan Johnson
>Assignee: Ryan Johnson
>Priority: Major
>
> File-source constant metadata columns are often derived indirectly from 
> file-level metadata values rather than exposing those values directly. For 
> example, {{_metadata.file_name}} is currently hard-coded in 
> {{FileFormat.updateMetadataInternalRow}} as:
>  
> {code:java}
> UTF8String.fromString(filePath.getName){code}
>  
> We should add support for metadata extractors, functions that map from 
> {{PartitionedFile}} to {{Literal}}, so that we can express such columns 
> in a generic way instead of hard-coding them.
> We can't just add them to the metadata map because then they have to be 
> pre-computed even if it turns out the query does not select that field.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43243) Add Level param to df.printSchema for Python API

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43243:
-

Assignee: Khalid Mammadov

> Add Level param to df.printSchema for Python API
> 
>
> Key: SPARK-43243
> URL: https://issues.apache.org/jira/browse/SPARK-43243
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Khalid Mammadov
>Assignee: Khalid Mammadov
>Priority: Major
>
> The Python DataFrame API's printSchema is missing the level parameter that is 
> available in the Scala API. This ticket adds it.
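
For reference, the Scala side already supports this (a minimal usage sketch; df 
is any DataFrame with a nested schema):

{code:scala}
df.printSchema(1) // print only top-level columns
df.printSchema(2) // top-level columns plus one level of nested fields
{code}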



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43243) Add Level param to df.printSchema for Python API

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43243.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40916
[https://github.com/apache/spark/pull/40916]

> Add Level param to df.printSchema for Python API
> 
>
> Key: SPARK-43243
> URL: https://issues.apache.org/jira/browse/SPARK-43243
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Khalid Mammadov
>Assignee: Khalid Mammadov
>Priority: Major
> Fix For: 3.5.0
>
>
> The Python DataFrame API's printSchema is missing the level parameter that is 
> available in the Scala API. This ticket adds it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43276) Migrate Spark Connect Window errors into error class

2023-04-25 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43276:
---

 Summary: Migrate Spark Connect Window errors into error class
 Key: SPARK-43276
 URL: https://issues.apache.org/jira/browse/SPARK-43276
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Migrate Spark Connect Window errors into error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43275) Migrate Spark Connect GroupedData error into error class

2023-04-25 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43275:
---

 Summary: Migrate Spark Connect GroupedData error into error class
 Key: SPARK-43275
 URL: https://issues.apache.org/jira/browse/SPARK-43275
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Migrate Spark Connect GroupedData error into error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43274) Introduce `PySparkNotImplementError`

2023-04-25 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43274:
---

 Summary: Introduce `PySparkNotImplementError`
 Key: SPARK-43274
 URL: https://issues.apache.org/jira/browse/SPARK-43274
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Introduce `PySparkNotImplementError`, corresponding to Python's built-in `NotImplementedError`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43231) Reduce the memory requirement in torch-related tests

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43231.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40874
[https://github.com/apache/spark/pull/40874]

> Reduce the memory requirement in torch-related tests
> 
>
> Key: SPARK-43231
> URL: https://issues.apache.org/jira/browse/SPARK-43231
> Project: Spark
>  Issue Type: Test
>  Components: Connect, ML, PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43231) Reduce the memory requirement in torch-related tests

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43231:
-

Assignee: Ruifeng Zheng

> Reduce the memory requirement in torch-related tests
> 
>
> Key: SPARK-43231
> URL: https://issues.apache.org/jira/browse/SPARK-43231
> Project: Spark
>  Issue Type: Test
>  Components: Connect, ML, PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716135#comment-17716135
 ] 

Andrew Grigorev commented on SPARK-43273:
-

Just as icing on the cake, ClickHouse accidentally started to use LZ4_RAW by 
default for their Parquet output format :).

> Spark can't read parquet files with a newer LZ4_RAW compression
> ---
>
> Key: SPARK-43273
> URL: https://issues.apache.org/jira/browse/SPARK-43273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0
>Reporter: Andrew Grigorev
>Priority: Trivial
>
> parquet-hadoop version should be updated to 1.13.0
>  
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
> failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
>     at java.base/java.lang.Enum.valueOf(Enum.java:273)
>     at 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
>     at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
> ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Grigorev updated SPARK-43273:

Description: 
parquet-hadoop version should be updated to 1.13.0

 
{code:java}
java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
java.lang.IllegalArgumentException: No enum constant 
org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
    at java.base/java.lang.Enum.valueOf(Enum.java:273)
    at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
    at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
... {code}

  was:hadoop-parquet version should be updated to 1.3.0


> Spark can't read parquet files with a newer LZ4_RAW compression
> ---
>
> Key: SPARK-43273
> URL: https://issues.apache.org/jira/browse/SPARK-43273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0
>Reporter: Andrew Grigorev
>Priority: Trivial
>
> parquet-hadoop version should be updated to 1.13.0
>  
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job 
> aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent 
> failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): 
> java.lang.IllegalArgumentException: No enum constant 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW
>     at java.base/java.lang.Enum.valueOf(Enum.java:273)
>     at 
> org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
>     at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636)
> ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression

2023-04-25 Thread Andrew Grigorev (Jira)
Andrew Grigorev created SPARK-43273:
---

 Summary: Spark can't read parquet files with a newer LZ4_RAW 
compression
 Key: SPARK-43273
 URL: https://issues.apache.org/jira/browse/SPARK-43273
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0, 3.3.2, 3.2.4, 3.3.3
Reporter: Andrew Grigorev


parquet-hadoop version should be updated to 1.13.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43250) Assign a name to the error class _LEGACY_ERROR_TEMP_2014

2023-04-25 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716118#comment-17716118
 ] 

Max Gekk commented on SPARK-43250:
--

[~amousavigourabi] Sure, go ahead.

> Assign a name to the error class _LEGACY_ERROR_TEMP_2014
> 
>
> Key: SPARK-43250
> URL: https://issues.apache.org/jira/browse/SPARK-43250
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2014* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the examples in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't 
> already exist. Check the exception fields by using {*}checkError(){*} (a 
> hedged sketch follows after the PR list below). That function checks only the 
> valuable error fields and avoids depending on the error text message, so tech 
> editors can modify the error format in error-classes.json without worrying 
> about Spark's internal tests. Migrate other tests that might trigger the error 
> onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear, and propose a solution that tells users how to avoid and fix such 
> errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
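
A hedged sketch of the kind of test the ticket asks for. The suite setup, the 
triggering query, the error class name NEW_ERROR_NAME, and the parameters map 
are all placeholders, since the actual rename for _LEGACY_ERROR_TEMP_2014 has 
not been chosen yet.

{code:scala}
// Assumes a test suite with a SparkSession, e.g. one extending QueryTest
// with SharedSparkSession.
test("user-facing query raises the renamed error class") {
  val e = intercept[SparkException] {
    sql("SELECT ...").collect()  // a query known to reach the error path
  }
  checkError(
    exception = e,
    errorClass = "NEW_ERROR_NAME",
    parameters = Map("paramName" -> "paramValue"))
}
{code}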



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43272) Replace reflection w/ direct calling for `SparkHadoopUtil#createFile`

2023-04-25 Thread Yang Jie (Jira)
Yang Jie created SPARK-43272:


 Summary: Replace reflection w/ direct calling for  
`SparkHadoopUtil#createFile`
 Key: SPARK-43272
 URL: https://issues.apache.org/jira/browse/SPARK-43272
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Yang Jie
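
The ticket body is empty, so here is a hedged sketch of the general 
before/after pattern the summary describes; the helper names are invented and 
the real SparkHadoopUtil code may differ.

{code:scala}
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

// Before: the builder API was resolved reflectively, which was necessary
// while Spark still compiled against Hadoop versions that lacked it.
def createViaReflection(fs: FileSystem, path: Path): FSDataOutputStream = {
  val builder = fs.getClass.getMethod("createFile", classOf[Path]).invoke(fs, path)
  builder.getClass.getMethod("build").invoke(builder).asInstanceOf[FSDataOutputStream]
}

// After: every Hadoop release Spark still supports has the builder API,
// so it can be called directly.
def createDirectly(fs: FileSystem, path: Path): FSDataOutputStream =
  fs.createFile(path).build()
{code}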






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43260) Migrate the Spark SQL pandas arrow type errors into error class.

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43260:
-

Assignee: Haejoon Lee

> Migrate the Spark SQL pandas arrow type errors into error class.
> 
>
> Key: SPARK-43260
> URL: https://issues.apache.org/jira/browse/SPARK-43260
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> from pyspark/sql/pandas/types.py



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43260) Migrate the Spark SQL pandas arrow type errors into error class.

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43260.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40924
[https://github.com/apache/spark/pull/40924]

> Migrate the Spark SQL pandas arrow type errors into error class.
> 
>
> Key: SPARK-43260
> URL: https://issues.apache.org/jira/browse/SPARK-43260
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> from pyspark/sql/pandas/types.py



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43144) Scala: DataStreamReader table() API

2023-04-25 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43144.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40887
[https://github.com/apache/spark/pull/40887]

> Scala: DataStreamReader table() API
> ---
>
> Key: SPARK-43144
> URL: https://issues.apache.org/jira/browse/SPARK-43144
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Priority: Major
> Fix For: 3.5.0
>
>
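
The ticket body is empty; below is a hedged sketch of the API it tracks, 
mirroring the long-standing DataStreamReader.table() in non-Connect Spark. 
The table name "events" is a placeholder.

{code:scala}
// Start a streaming read from a catalog table via the table() API.
val stream = spark.readStream.table("events")

// Write the stream somewhere to start the query, e.g. to the console sink.
val query = stream.writeStream
  .format("console")
  .start()
{code}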




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org