[jira] [Created] (SPARK-43405) Remove useless code in `ScriptInputOutputSchema`
Jia Fan created SPARK-43405: --- Summary: Remove useless code in `ScriptInputOutputSchema` Key: SPARK-43405 URL: https://issues.apache.org/jira/browse/SPARK-43405 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Jia Fan In case class `ScriptInputOutputSchema`, some methods like `getRowFormatSQL` are never used, so we can remove them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43406) enable spark sql to drop multiple partitions in one call
chenruotao created SPARK-43406: -- Summary: enable spark sql to drop multiple partitions in one call Key: SPARK-43406 URL: https://issues.apache.org/jira/browse/SPARK-43406 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0, 3.3.2, 3.2.1 Reporter: chenruotao Fix For: 3.5.0 Currently Spark SQL cannot drop multiple partitions in one call, so this patch fixes that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenruotao updated SPARK-43406: --- Description: Currently Spark SQL cannot drop multiple partitions in one call, so this patch fixes that. With this patch we can drop multiple partitions like this: alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31') was:Now spark sql cannot drop multiple partitions in one call, so I fix it > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > Fix For: 3.5.0 > > > Currently Spark SQL cannot drop multiple partitions in one call, so this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
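For illustration, a minimal sketch of invoking the proposed syntax from Scala (assuming a SparkSession named spark; the table and partition column come from the example above, and the range-comparator partition spec is exactly the new behavior this patch adds, so it does not work on releases without it):
{code:scala}
// Proposed syntax only: drop every partition whose dt falls in the range.
spark.sql("""
  ALTER TABLE test.table_partition
  DROP PARTITION (dt <= '2023-04-02', dt >= '2023-03-31')
""")
{code}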
[jira] [Updated] (SPARK-43400) create table support the PRIMARY KEY keyword
[ https://issues.apache.org/jira/browse/SPARK-43400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin updated SPARK-43400: -- Description: Apache Paimon and Hudi support primary key definitions. It is necessary to support the primary key definition syntax. [~gurwls223] was:apache paimon and hudi support primary key definitions. It is necessary to support the primary key definition syntax > create table support the PRIMARY KEY keyword > > > Key: SPARK-43400 > URL: https://issues.apache.org/jira/browse/SPARK-43400 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > Apache Paimon and Hudi support primary key definitions. It is necessary to > support the primary key definition syntax. > [~gurwls223] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43407) Can executors recover/reuse shuffle files upon failure?
Faiz Halde created SPARK-43407: -- Summary: Can executors recover/reuse shuffle files upon failure? Key: SPARK-43407 URL: https://issues.apache.org/jira/browse/SPARK-43407 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 3.3.1 Reporter: Faiz Halde Hello, We've been in touch with a few Spark specialists who suggested a potential solution to improve the reliability of our shuffle-heavy jobs. Here is what our setup looks like: * Spark version: 3.3.1 * Java version: 1.8 * We do not use the external shuffle service * We use spot instances We run Spark jobs on clusters that use Amazon EBS volumes, and spark.local.dir is mounted on this EBS volume. One of the offerings from the service we use is EBS migration, which basically means that if a host is about to get evicted, a new host is created and the EBS volume is attached to it. The claim is that when Spark assigns a new executor to the newly created instance, it can recover all the shuffle files that are already persisted on the migrated EBS volume. Is this how it works? Do executors recover / re-register the shuffle files that they find? So far I have not come across any recovery mechanism. I can only see {noformat} KubernetesLocalDiskShuffleDataIO{noformat} which has a pre-init step where it tries to register the available shuffle files to itself. A natural follow-up on this: if what they claim is true, then ideally we should expect that when an executor is killed/OOM'd and a new executor is spawned on the same host, the new executor registers the shuffle files to itself. Is that so? Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
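For reference, a hedged sketch of the only recovery path mentioned above: opting into the pre-init shuffle-file registration of KubernetesLocalDiskShuffleDataIO through the shuffle IO plugin hook, with spark.local.dir pointing at the EBS-backed mount. The class only exists in the Kubernetes backend of newer releases, and the mount point and fully qualified class name below are assumptions:
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: spark.shuffle.sort.io.plugin.class is the existing hook for a
// ShuffleDataIO implementation; KubernetesLocalDiskShuffleDataIO tries to
// re-register shuffle files it finds under spark.local.dir when an executor
// starts. Whether this covers the EBS-migration setup described above is the
// open question in this ticket.
val spark = SparkSession.builder()
  .appName("shuffle-recovery-sketch")
  .config("spark.local.dir", "/mnt/ebs/spark-local") // assumed EBS mount point
  .config("spark.shuffle.sort.io.plugin.class",
    "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO") // assumed FQCN
  .getOrCreate()
{code}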
[jira] [Updated] (SPARK-43400) create table support the PRIMARY KEY keyword
[ https://issues.apache.org/jira/browse/SPARK-43400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin updated SPARK-43400: -- Description: Apache Paimon and Hudi support primary key definitions. It is necessary to support the primary key definition syntax: https://docs.snowflake.com/en/sql-reference/sql/create-table-constraint#constraint-properties [~gurwls223] was: apache paimon and hudi support primary key definitions. It is necessary to support the primary key definition syntax [~gurwls223] > create table support the PRIMARY KEY keyword > > > Key: SPARK-43400 > URL: https://issues.apache.org/jira/browse/SPARK-43400 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > Apache Paimon and Hudi support primary key definitions. It is necessary to > support the primary key definition syntax: > https://docs.snowflake.com/en/sql-reference/sql/create-table-constraint#constraint-properties > [~gurwls223] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
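For reference, a sketch of the kind of DDL being requested, loosely modeled on the linked Snowflake constraint syntax and on Paimon/Hudi DDL. The table, columns, and USING clause are illustrative, and the statement does not parse in current Spark, which is the point of this ticket:
{code:scala}
// Proposed/illustrative only: Spark's parser rejects PRIMARY KEY today, so
// this currently fails with a ParseException.
spark.sql("""
  CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10, 2),
    PRIMARY KEY (order_id)
  ) USING paimon
""")
{code}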
[jira] [Created] (SPARK-43408) Spark caching in the context of a single job
Faiz Halde created SPARK-43408: -- Summary: Spark caching in the context of a single job Key: SPARK-43408 URL: https://issues.apache.org/jira/browse/SPARK-43408 Project: Spark Issue Type: Question Components: Shuffle Affects Versions: 3.3.1 Reporter: Faiz Halde Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720446#comment-17720446 ] ASF GitHub Bot commented on SPARK-43406: User 'chenruotao' has created a pull request for this issue: https://github.com/apache/spark/pull/41090 > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > Fix For: 3.5.0 > > > Now spark sql cannot drop multiple partitions in one call, so I fix it > With this patch we can drop multiple partitions like this : > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Description: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. * Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| h1. ** * When performing gen code of the input expression, if the input expression is in the common expressions of the current CodeGenContext, the corresponding subExpr function will be called. After the first function call, *subExprInit* will be set to true, and the subsequent function calls will be skipped. h2. Why should we support whole-stage subexpression elimination Right now each spark physical operator shares nothing but the input row, so the same expressions may be evaluated multiple times across different operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter operators. We can reuse the expression results across different operators such as Project and Filter. h2. How to support whole-stage subexpression elimination * Add two properties in CodegenSupport trait, the reusable expressions and the the output attributes, we can reuse the expression results only if the output attributes are the same. * Visit all operators from top to bottom, bound the candidate expressions with the output attributes and add to the current candidate reusable expressions. * Visit all operators from bottom to top, collect all the common expressions to the current operator, and add the initialize code to the current operator if the common expressions have not been initialized. * Replace the common expressions code when generating codes for the physical operators. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* {code:java} SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10 {code} We can reuse the result of expression *v* * *v* *+* *1* * h2. *Support subexpression elimination in JoinExec* {code:java} SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3{code} We can reuse the result of expression *b* * *y* * h2. 
*Support subexpression elimination in ExpandExec* {code:java} SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c was: h1. *Design Sketch* * Get all common expressions from input expressions. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit_n* to indicate whether we have already evaluated the common expression, and reset it to *false* at the start of operator.consume() * Add a new wrapper subExpr function for common subexpression. {code:java} private void subExpr_n(${argList.mkString(", ")}) { if (!subExprInit_n) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } } {code} * Replace all the common subexpression with the wrapper function *subExpr_n(argList)*. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Description: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. * Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| * When performing gen code of the input expression, if the input expression is in the common expressions of the current CodeGenContext, the corresponding subExpr function will be called. After the first function call, *subExprInit* will be set to true, and the subsequent function calls will be skipped. h2. Why should we support whole-stage subexpression elimination Right now each spark physical operator shares nothing but the input row, so the same expressions may be evaluated multiple times across different operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter operators. We can reuse the expression results across different operators such as Project and Filter. h2. How to support whole-stage subexpression elimination * Add two properties in CodegenSupport trait, the reusable expressions and the the output attributes, we can reuse the expression results only if the output attributes are the same. * Visit all operators from top to bottom, bound the candidate expressions with the output attributes and add to the current candidate reusable expressions. * Visit all operators from bottom to top, collect all the common expressions to the current operator, and add the initialize code to the current operator if the common expressions have not been initialized. * Replace the common expressions code when generating codes for the physical operators. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* {code:java} SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10 {code} We can reuse the result of expression *v* * *v* *+* *1* * h2. *Support subexpression elimination in JoinExec* {code:java} SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3{code} We can reuse the result of expression *b* * *y* * h2. 
*Support subexpression elimination in ExpandExec* {code:java} SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c was: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. * Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| h1. ** * When performing gen code of the input expression, if the input expression is in the common expressions of the current CodeGenContext, the corresponding subExpr function will be called. After the first function call, *subExprInit* will be set to true, and the sub
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Description: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: ** Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. ** Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. ** Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| * When generating the input expression code, if the input expression is a common expression, the expression code will be replaced with the corresponding subExpr function. When the subExpr function is called for the first time, *subExprInit* will be set to true, and the subsequent function calls will do nothing. h2. Why should we support whole-stage subexpression elimination Right now each spark physical operator shares nothing but the input row, so the same expressions may be evaluated multiple times across different operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter operators. We can reuse the expression results across different operators such as Project and Filter. h2. How to support whole-stage subexpression elimination * Add two properties in CodegenSupport trait, the reusable expressions and the the output attributes, we can reuse the expression results only if the output attributes are the same. * Visit all operators from top to bottom, bound the candidate expressions with the output attributes and add to the current candidate reusable expressions. * Visit all operators from bottom to top, collect all the common expressions to the current operator, and add the initialize code to the current operator if the common expressions have not been initialized. * Replace the common expressions code when generating codes for the physical operators. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* {code:java} SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10 {code} We can reuse the result of expression *v* * *v* *+* *1* * h2. *Support subexpression elimination in JoinExec* {code:java} SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3{code} We can reuse the result of expression *b* * *y* * h2. 
*Support subexpression elimination in ExpandExec* {code:java} SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c was: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. * Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| * When performing gen code of the input expression, if the input expression is in the common expressions of the current CodeGenContext, the corresponding subExpr function will be called. After the first function call, *subExprInit* will be set to true, and
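One way to evaluate the proposal above is to inspect the whole-stage generated code directly; a sketch using the existing debugCodegen helper and one of the example queries from the description (whether the guarded wrapper shows up literally as subExpr_n is an implementation detail of the proposal, not current behavior):
{code:scala}
import org.apache.spark.sql.execution.debug._ // provides debugCodegen()

// Prints the Java code produced by whole-stage codegen; with the proposed
// elimination, "b + c" should be evaluated once behind a guarded wrapper
// instead of once per conditional branch.
val df = spark.sql("""
  SELECT a,
         max(if(a > 0, b + c, null)) AS max_bc,
         min(if(a > 1, b + c, null)) AS min_bc
  FROM VALUES (1, 1, 1) AS t(a, b, c)
  GROUP BY a
""")
df.debugCodegen()
{code}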
[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job
[ https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-43408: --- Description: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API was: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself > Spark caching in the context of a single job > > > Key: SPARK-43408 > URL: https://issues.apache.org/jira/browse/SPARK-43408 > Project: Spark > Issue Type: Question > Components: Shuffle >Affects Versions: 3.3.1 >Reporter: Faiz Halde >Priority: Trivial > > Does caching benefit a spark job with only a single action in it? Spark IIRC > already optimizes shuffles by persisting them onto the disk > I am unable to find a counter-example where caching would benefit a job with > a single action. In every case I can think of, the shuffle checkpoint acts as > a good enough caching mechanism in itself > FWIW, I am talking specifically in the context of the Dataframe API -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job
[ https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-43408: --- Description: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API. The StorageLevel allowed in my case would only be DISK_ONLY i.e. I am not looking to speed up by caching data in memory was: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API > Spark caching in the context of a single job > > > Key: SPARK-43408 > URL: https://issues.apache.org/jira/browse/SPARK-43408 > Project: Spark > Issue Type: Question > Components: Shuffle >Affects Versions: 3.3.1 >Reporter: Faiz Halde >Priority: Trivial > > Does caching benefit a spark job with only a single action in it? Spark IIRC > already optimizes shuffles by persisting them onto the disk > I am unable to find a counter-example where caching would benefit a job with > a single action. In every case I can think of, the shuffle checkpoint acts as > a good enough caching mechanism in itself > FWIW, I am talking specifically in the context of the Dataframe API. The > StorageLevel allowed in my case would only be DISK_ONLY i.e. I am not looking > to speed up by caching data in memory -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job
[ https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-43408: --- Description: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API. The StorageLevel allowed in my case is DISK_ONLY i.e. I am not looking to speed up by caching data in memory To rephrase, is DISK_ONLY caching better or same as shuffle checkpointing in the context of a single action was: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API. The StorageLevel allowed in my case would only be DISK_ONLY i.e. I am not looking to speed up by caching data in memory > Spark caching in the context of a single job > > > Key: SPARK-43408 > URL: https://issues.apache.org/jira/browse/SPARK-43408 > Project: Spark > Issue Type: Question > Components: Shuffle >Affects Versions: 3.3.1 >Reporter: Faiz Halde >Priority: Trivial > > Does caching benefit a spark job with only a single action in it? Spark IIRC > already optimizes shuffles by persisting them onto the disk > I am unable to find a counter-example where caching would benefit a job with > a single action. In every case I can think of, the shuffle checkpoint acts as > a good enough caching mechanism in itself > FWIW, I am talking specifically in the context of the Dataframe API. The > StorageLevel allowed in my case is DISK_ONLY i.e. I am not looking to speed > up by caching data in memory > To rephrase, is DISK_ONLY caching better or same as shuffle checkpointing in > the context of a single action -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
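For concreteness, a sketch of the trade-off being asked about, using a hypothetical single-action job (the paths and column names are illustrative):
{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Hypothetical single-action job: one shuffle feeding two branches that are
// unioned into a single write, so there is still only one action at the end.
val base = spark.read.parquet("/data/events") // illustrative input path
  .groupBy("user_id")
  .count()                                    // introduces a shuffle

// The comparison in question: rely on the shuffle files Spark already keeps
// on disk, or additionally persist the aggregated result with DISK_ONLY.
val cached = base.persist(StorageLevel.DISK_ONLY)

val heavy = cached.filter(col("count") > 100)
val light = cached.filter(col("count") <= 100)
heavy.union(light).write.mode("overwrite").parquet("/data/out") // the single action
{code}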
[jira] [Updated] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Fan updated SPARK-39281: Description: The optimization of {{DefaultTimestampFormatter}} has been implemented in [#36562|https://github.com/apache/spark/pull/36562] , this ticket adds the optimization of legacy format. The basic logic is to prevent the formatter from throwing exceptions, and then use catch to determine whether the parsing is successful. > Speed up Timestamp type inference of legacy format in JSON/CSV data source > -- > > Key: SPARK-39281 > URL: https://issues.apache.org/jira/browse/SPARK-39281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > > The optimization of {{DefaultTimestampFormatter}} has been implemented in > [#36562|https://github.com/apache/spark/pull/36562] , this ticket adds the > optimization of legacy format. The basic logic is to prevent the formatter > from throwing exceptions, and then use catch to determine whether the parsing > is successful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
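For context, a sketch of the code path being sped up: timestamp inference during CSV schema inference under the legacy time parser policy (the input path is illustrative):
{code:scala}
// Schema inference has to try parsing string columns as timestamps; the
// legacy-format path is the one this ticket optimizes.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/events.csv") // illustrative path

df.printSchema() // timestamp-like columns should be inferred as TimestampType
{code}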
[jira] [Updated] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Fan updated SPARK-39281: Summary: Speed up Timestamp type inference of legacy format in JSON/CSV data source (was: Fasten Timestamp type inference of legacy format in JSON/CSV data source) > Speed up Timestamp type inference of legacy format in JSON/CSV data source > -- > > Key: SPARK-39281 > URL: https://issues.apache.org/jira/browse/SPARK-39281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43409) Add documentation for protobuf deserialization options
Parth Upadhyay created SPARK-43409: -- Summary: Add documentation for protobuf deserialization options Key: SPARK-43409 URL: https://issues.apache.org/jira/browse/SPARK-43409 Project: Spark Issue Type: Improvement Components: Documentation, Protobuf Affects Versions: 3.4.0 Reporter: Parth Upadhyay Follow-up from PR comment: [https://github.com/apache/spark/pull/41075#discussion_r1186958551] Currently there's no documentation here [https://github.com/apache/spark/blob/master/docs/sql-data-sources-protobuf.md#deploying] for the full range of options available for `from_protobuf`, so users would need to discover these options by reading the Scala code directly. Let's update the documentation to include information about these options. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
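For context, a sketch of how such an options map is passed today from Scala; the DataFrame, descriptor path, message name, and option keys below are assumptions, and pinning down the real set of supported keys is exactly the gap this documentation update should close:
{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

// Assumes an existing DataFrame `df` with a binary protobuf payload in a
// column named "value". The option keys are examples only.
val options = new java.util.HashMap[String, String]()
options.put("mode", "PERMISSIVE")              // assumed parse-mode option
options.put("recursive.fields.max.depth", "2") // assumed recursion-depth option

val parsed = df.select(
  from_protobuf(col("value"), "Person", "/path/to/descriptor.desc", options)
    .as("person"))
{code}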
[jira] [Assigned] (SPARK-41532) DF operations that involve multiple data frames should fail if sessions don't match
[ https://issues.apache.org/jira/browse/SPARK-41532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-41532: - Assignee: Jia Fan > DF operations that involve multiple data frames should fail if sessions don't > match > --- > > Key: SPARK-41532 > URL: https://issues.apache.org/jira/browse/SPARK-41532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Jia Fan >Priority: Major > > We do not support joining for example two data frames from different Spark > Connect Sessions. To avoid exceptions, the client should clearly fail when it > tries to construct such a composition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41532) DF operations that involve multiple data frames should fail if sessions don't match
[ https://issues.apache.org/jira/browse/SPARK-41532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-41532. --- Fix Version/s: 3.5.0 Resolution: Fixed > DF operations that involve multiple data frames should fail if sessions don't > match > --- > > Key: SPARK-41532 > URL: https://issues.apache.org/jira/browse/SPARK-41532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Jia Fan >Priority: Major > Fix For: 3.5.0 > > > We do not support joining for example two data frames from different Spark > Connect Sessions. To avoid exceptions, the client should clearly fail when it > tries to construct such a composition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43410) Improve vectorized loop for Packed skipValues
xiaochen zhou created SPARK-43410: - Summary: Improve vectorized loop for Packed skipValues Key: SPARK-43410 URL: https://issues.apache.org/jira/browse/SPARK-43410 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: xiaochen zhou Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720525#comment-17720525 ] xiaochen zhou commented on SPARK-43410: --- https://github.com/apache/spark/pull/41092 > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Priority: Minor > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43292) ArtifactManagerSuite can't run using maven
[ https://issues.apache.org/jira/browse/SPARK-43292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-43292. --- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed > ArtifactManagerSuite can't run using maven > -- > > Key: SPARK-43292 > URL: https://issues.apache.org/jira/browse/SPARK-43292 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > run > {code:java} > build/mvn clean install -DskipTests -Phive > build/mvn test -pl connector/connect/server {code} > ArtifactManagerSuite failed due to > > {code:java} > 23/04/26 16:00:07.666 ScalaTest-main-running-DiscoverySuite ERROR Executor: > Could not find org.apache.spark.repl.ExecutorClassLoader on classpath! {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43389) spark.read.csv throws NullPointerException when lineSep is set to None
[ https://issues.apache.org/jira/browse/SPARK-43389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720568#comment-17720568 ] Zach Liu commented on SPARK-43389: -- that's why i set the type as "Improvement" and the priority as "Trivial". but still, it's an unnecessary confusion > spark.read.csv throws NullPointerException when lineSep is set to None > -- > > Key: SPARK-43389 > URL: https://issues.apache.org/jira/browse/SPARK-43389 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.1 >Reporter: Zach Liu >Priority: Trivial > > lineSep was defined as Optional[str] yet i'm unable to explicitly set it as > None: > reader = spark.read.format("csv") > read_options={'inferSchema': False, 'header': True, 'mode': 'DROPMALFORMED', > 'sep': '\t', 'escape': '\\', 'multiLine': False, 'lineSep': None} > for option, option_value in read_options.items(): > reader = reader.option(option, option_value) > df = reader.load("s3://") > raises exception: > py4j.protocol.Py4JJavaError: An error occurred while calling o126.load. > : java.lang.NullPointerException > at > scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51) > at scala.collection.immutable.StringOps.length(StringOps.scala:51) > at > scala.collection.IndexedSeqOptimized.isEmpty(IndexedSeqOptimized.scala:30) > at > scala.collection.IndexedSeqOptimized.isEmpty$(IndexedSeqOptimized.scala:30) > at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:33) > at scala.collection.TraversableOnce.nonEmpty(TraversableOnce.scala:143) > at scala.collection.TraversableOnce.nonEmpty$(TraversableOnce.scala:143) > at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:33) > at > org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:216) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:215) > at > org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:47) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60) > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:210) > at scala.Option.orElse(Option.scala:447) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:411) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.lang.Thread.run(Thread.java:750) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaochen zhou updated SPARK-43410: -- Description: Improve vectorized loop for Packed skipValues (was: Improve vectorized loop for Packed skipValues {{}}) > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Priority: Minor > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43410. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41092 [https://github.com/apache/spark/pull/41092] > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Priority: Minor > Fix For: 3.5.0 > > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43410: Assignee: xiaochen zhou > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Assignee: xiaochen zhou >Priority: Minor > Fix For: 3.5.0 > > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43411) Can't union dataframes with # in subcolumn name
Rudhra Raveendran created SPARK-43411: - Summary: Can't union dataframes with # in subcolumn name Key: SPARK-43411 URL: https://issues.apache.org/jira/browse/SPARK-43411 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Environment: * Azure Synapse Notebooks * Apache Spark Pool: [Azure Synapse Runtime for Apache Spark 3.1 (EOLA) - Azure Synapse Analytics | Microsoft Learn|https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-3-runtime] ** Spark 3.1.2 ** Ubuntu 18.04 ** Python 3.8 ** Scala 2.12.10 ** Hadoop 3.1.1 ** Java 1.8.0_282 ** .NET Core 3.1 ** .NET for Apache Spark 2.0.0 ** Delta Lake 1.0 Reporter: Rudhra Raveendran I was using Spark within an Azure Synapse notebook to load dataframes from various storage accounts and union them into a single dataframe, but it seems to fail as the SQL internal to union doesn't handle special characters properly. Here is a code example of what I was running: {code:java} val data1 = spark.read.parquet("abfss://PATH1") val data2 = spark.read.parquet("abfss://PATH2") val data3 = spark.read.parquet("abfss://PATH3") val data4 = spark.read.parquet("abfss://PATH4") val data = data1 .unionByName(data2, allowMissingColumns=true) .unionByName(data3, allowMissingColumns=true) .unionByName(data4, allowMissingColumns=true) data.printSchema() {code} The issue arose due to having a StructType column, e.g. ABC, that has a subcolumn with # in the name, e.g. #XYZ#. This doesn't seem to be a problem outright, as other Spark functions like select work fine: {code:java} data1.select("ABC.#XYZ#").where(col("#XYZ#").isNotNull).show(5, truncate = false) {code} However, when I ran the earlier snippet with the union statements, I get this error: {code:java} org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '#' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 
'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TIME', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) == SQL == #XYZ# ^^^ {code} This seems to indicate the issue is with the implementation of unionByName/union etc. however I'm not familiar enough with the codebase to figure out where this would be an issue (I was able to trace that UnionByName calls on Union which I think is defined here: [spark/basicLogicalOperators.scala at master · apache/spark
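A minimal local sketch of the failure shape, assuming the abfss data above reduces to a struct column ABC with a field literally named #XYZ# and one side carrying an extra top-level column, so that allowMissingColumns has to reconcile the schemas. On 3.1.2 this may hit the same ParseException, while selecting the nested field works:
{code:scala}
// Struct column ABC with a field named #XYZ#; df2 has an extra column.
val df1 = spark.sql("SELECT named_struct('#XYZ#', 1) AS ABC")
val df2 = spark.sql("SELECT named_struct('#XYZ#', 2) AS ABC, 'x' AS extra")

// Direct nested access is fine (backticks quote the special characters).
df1.select("ABC.`#XYZ#`").show()

// The union by name is the call reported to fail on Spark 3.1.2.
df1.unionByName(df2, allowMissingColumns = true).printSchema()
{code}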
[jira] [Commented] (SPARK-38471) Use error classes in org.apache.spark.rdd
[ https://issues.apache.org/jira/browse/SPARK-38471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720628#comment-17720628 ] xin chen commented on SPARK-38471: -- awesome, thanks. > Use error classes in org.apache.spark.rdd > - > > Key: SPARK-38471 > URL: https://issues.apache.org/jira/browse/SPARK-38471 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38471) Use error classes in org.apache.spark.rdd
[ https://issues.apache.org/jira/browse/SPARK-38471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720631#comment-17720631 ] xin chen commented on SPARK-38471: -- Hey [~maxgekk], I'm currently working on this ticket. Could you assign this ticket to me? Thanks > Use error classes in org.apache.spark.rdd > - > > Key: SPARK-38471 > URL: https://issues.apache.org/jira/browse/SPARK-38471 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43412) Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs
Xinrong Meng created SPARK-43412: Summary: Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs Key: SPARK-43412 URL: https://issues.apache.org/jira/browse/SPARK-43412 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Xinrong Meng We are about to improve nested non-atomic input/output support for Arrow-optimized Python UDFs. However, an Arrow-optimized Python UDF currently shares the same EvalType as a pickled Python UDF, but the same implementation as a Pandas UDF. Introducing a dedicated EvalType makes it possible to isolate the changes to Arrow-optimized Python UDFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43411) Can't union dataframes with # in subcolumn name
[ https://issues.apache.org/jira/browse/SPARK-43411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rudhra Raveendran resolved SPARK-43411. --- Resolution: Won't Fix Turns out this issue isn't present in later versions of Spark (I just tested on the latest version in Synapse, which is 3.3.1, and it worked). I'm assuming since this version is EOL this will be a wontfix, hence closing this out. > Can't union dataframes with # in subcolumn name > --- > > Key: SPARK-43411 > URL: https://issues.apache.org/jira/browse/SPARK-43411 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 > Environment: * Azure Synapse Notebooks > * Apache Spark Pool: [Azure Synapse Runtime for Apache Spark 3.1 (EOLA) - > Azure Synapse Analytics | Microsoft > Learn|https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-3-runtime] > ** Spark 3.1.2 > ** Ubuntu 18.04 > ** Python 3.8 > ** Scala 2.12.10 > ** Hadoop 3.1.1 > ** Java 1.8.0_282 > ** .NET Core 3.1 > ** .NET for Apache Spark 2.0.0 > ** Delta Lake 1.0 >Reporter: Rudhra Raveendran >Priority: Major > > I was using Spark within an Azure Synapse notebook to load dataframes from > various storage accounts and union them into a single dataframe, but it seems > to fail as the SQL internal to union doesn't handle special characters > properly. Here is a code example of what I was running: > {code:java} > val data1 = spark.read.parquet("abfss://PATH1") > val data2 = spark.read.parquet("abfss://PATH2") > val data3 = spark.read.parquet("abfss://PATH3") > val data4 = spark.read.parquet("abfss://PATH4") > val data = data1 > .unionByName(data2, allowMissingColumns=true) > .unionByName(data3, allowMissingColumns=true) > .unionByName(data4, allowMissingColumns=true) > data.printSchema() {code} > The issue arose due to having a StructType column, e.g. ABC, that has a > subcolumn with # in the name, e.g. #XYZ#. 
This doesn't seem to be a problem > outright, as other Spark functions like select work fine: > {code:java} > data1.select("ABC.#XYZ#").where(col("#XYZ#").isNotNull).show(5, truncate = > false) {code} > However, when I ran the earlier snippet with the union statements, I get this > error: > {code:java} > org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '#' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', > 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', > 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', > 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', > 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', > 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', > 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', > 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, > 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', > 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', > 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', > 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', > 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', > 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', > 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', > 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', > 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', > 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', > 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', > 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', > 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', > 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', > 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', > 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', > 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', > 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', > 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', > 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', > 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', > 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', > 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, > 'TERMINATED', 'THEN', 'TIME', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', > 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARC
[jira] [Created] (SPARK-43413) IN subquery ListQuery has wrong nullability
Jack Chen created SPARK-43413: - Summary: IN subquery ListQuery has wrong nullability Key: SPARK-43413 URL: https://issues.apache.org/jira/browse/SPARK-43413 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen IN subquery expressions currently are marked as nullable if and only if the left-hand-side is nullable - because the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43413: -- Description: IN subquery expressions currently are marked as nullable if and only if the left-hand-side is nullable - because the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. was: IN subquery expressions currently are marked as nullable if and only if the left-hand-side is nullable - because the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions currently are marked as nullable if and only if the > left-hand-side is nullable - because the right-hand-side of a IN subquery, > the ListQuery, is currently defined with nullability = false always. This is > incorrect and can lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. 
> This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43414) Fix flakiness in Kafka RDD suites due to port binding configuration issue
Josh Rosen created SPARK-43414: -- Summary: Fix flakiness in Kafka RDD suites due to port binding configuration issue Key: SPARK-43414 URL: https://issues.apache.org/jira/browse/SPARK-43414 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Josh Rosen Assignee: Josh Rosen In SPARK-36837 we updated Kafka to 3.10, which uses a different set of configuration options for configuring the broker listener port. That PR only updated one of two KafkaTestUtils files (the SQL one), so the other one (used by Core tests) had an ineffective port binding configuration and would bind to the default 9092 port. This could lead to flakiness if multiple suites binding to that port ran in parallel. To fix this, we just need to copy the updated port binding configuration from the other suite. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
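As a rough illustration of the ephemeral-port approach mentioned above, the sketch below binds a test broker to port 0 so the OS picks a free port; the property keys are standard Kafka broker settings, but the helper itself is invented and is not the actual KafkaTestUtils code:
{code:java}
import java.util.Properties

// Hard-coding 9092 is what let parallel suites collide; with port 0 the OS picks
// a free port, and the test harness is assumed to read the bound port back from
// the broker after startup.
def testBrokerProps(host: String = "127.0.0.1"): Properties = {
  val props = new Properties()
  props.put("listeners", s"PLAINTEXT://$host:0")
  props.put("broker.id", "0")
  props.put("offsets.topic.replication.factor", "1")
  props
}
{code}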
[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43413: -- Description: IN subquery expressions are incorrectly always marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. was: IN subquery expressions currently are marked as nullable if and only if the left-hand-side is nullable - because the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions are incorrectly always marked as non-nullable, even > when they are actually nullable. They correctly check the nullability of the > left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is > currently defined with nullability = false always. This is incorrect and can > lead to incorrect query transformations. 
> Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues
[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43413: -- Description: IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. was: IN subquery expressions are incorrectly always marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions are incorrectly marked as non-nullable, even when > they are actually nullable. They correctly check the nullability of the > left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is > currently defined with nullability = false always. 
This is incorrect and can > lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) -
[jira] [Created] (SPARK-43415) Impl mapValues for KVGDS#mapValues
Zhen Li created SPARK-43415: --- Summary: Impl mapValues for KVGDS#mapValues Key: SPARK-43415 URL: https://issues.apache.org/jira/browse/SPARK-43415 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Zhen Li Use a resolved function to pass the mapValues along with all the aggExprs. Then, on the server side, unfold it to apply mapValues first before running the aggregate. e.g. https://github.com/apache/spark/commit/a234a9b0851ebce87c0ef831b24866f94f0c0d36 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
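For reference, the user-facing surface this ticket brings to the Connect Scala client mirrors the existing Dataset API; the sketch below (with invented data, runnable in the Spark shell) shows that surface rather than the Connect plumbing itself:
{code:java}
import org.apache.spark.sql.SparkSession

case class Sale(store: String, amount: Long)

val spark = SparkSession.builder().master("local[*]").appName("mapValues-sketch").getOrCreate()
import spark.implicits._

val sales = Seq(Sale("a", 10L), Sale("a", 5L), Sale("b", 7L)).toDS()

// mapValues rewrites the value side of each group before the aggregation runs;
// per the description above, the server is expected to unfold this into a
// projection applied ahead of the aggregate.
val totals = sales
  .groupByKey(_.store)
  .mapValues(_.amount)
  .reduceGroups(_ + _)

totals.show()
{code}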
[jira] [Resolved] (SPARK-43414) Fix flakiness in Kafka RDD suites due to port binding configuration issue
[ https://issues.apache.org/jira/browse/SPARK-43414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43414. --- Fix Version/s: 3.4.1 Resolution: Fixed Issue resolved by pull request 41095 [https://github.com/apache/spark/pull/41095] > Fix flakiness in Kafka RDD suites due to port binding configuration issue > - > > Key: SPARK-43414 > URL: https://issues.apache.org/jira/browse/SPARK-43414 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.4.1 > > > In SPARK-36837 we updated Kafka to 3.10, which uses a different set of > configuration options for configuring the broker listener port. That PR only > updated one of two KafkaTestUtils files (the SQL one), so the other one (used > by Core tests) had an ineffective port binding configuration and would bind > to the default 9092 port. This could lead to flakiness if multiple suites > binding to that port ran in parallel. > To fix this, we just need to copy the updated port binding configuration from > the other suite. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43398) Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
[ https://issues.apache.org/jira/browse/SPARK-43398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43398: -- Affects Version/s: 3.4.0 3.3.2 3.2.4 3.1.3 > Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout > --- > > Key: SPARK-43398 > URL: https://issues.apache.org/jira/browse/SPARK-43398 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.3, 3.2.4, 3.3.2, 3.4.0 >Reporter: Zhongwei Zhu >Priority: Major > > When dynamic allocation enabled, Executor timeout should be max of > idleTimeout, rddTimeout and shuffleTimeout. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
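A minimal sketch of the selection rule named in the title, using invented names rather than the actual ExecutorMonitor fields: the effective timeout should be the maximum of the timeouts that apply to what the executor still holds.
{code:java}
// Under dynamic allocation, an executor holding cached RDD blocks or shuffle data
// should only be removed after the largest applicable timeout, not after the plain
// idle timeout.
def effectiveTimeoutMs(
    idleTimeoutMs: Long,
    rddTimeoutMs: Long,
    shuffleTimeoutMs: Long,
    hasCachedBlocks: Boolean,
    hasShuffleData: Boolean): Long = {
  var timeout = idleTimeoutMs
  if (hasCachedBlocks) timeout = math.max(timeout, rddTimeoutMs)
  if (hasShuffleData) timeout = math.max(timeout, shuffleTimeoutMs)
  timeout
}
{code}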
[jira] [Created] (SPARK-43416) Fix the bug where the ProductEncoder#tuple field names are different from the server
Zhen Li created SPARK-43416: --- Summary: Fix the bug where the ProductEncoder#tuple field names are different from the server Key: SPARK-43416 URL: https://issues.apache.org/jira/browse/SPARK-43416 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Zhen Li The fields are named _1, _2, ... etc. However, on the server side the columns can have descriptive names from agg operations, such as key and value. Fix this if possible. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
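The mismatch is easy to observe with the classic Dataset API, shown below purely for illustration; the ticket itself concerns the Connect Scala client's tuple encoder:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("tuple-names-sketch").getOrCreate()
import spark.implicits._

// The typed result is a Dataset[(String, Long)], so the tuple encoder expects
// fields named _1 and _2, while the aggregation produces descriptive column
// names on the server side (e.g. key and count(1)).
val counts = Seq("a", "a", "b").toDS().groupByKey(identity).count()
counts.printSchema()
{code}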
[jira] [Resolved] (SPARK-43404) Filter current version while reusing sst files for RocksDB state store provider while uploading to DFS to prevent id mismatch
[ https://issues.apache.org/jira/browse/SPARK-43404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-43404. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41089 [https://github.com/apache/spark/pull/41089] > Filter current version while reusing sst files for RocksDB state store > provider while uploading to DFS to prevent id mismatch > - > > Key: SPARK-43404 > URL: https://issues.apache.org/jira/browse/SPARK-43404 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Fix For: 3.5.0 > > > Filter current version while reusing sst files for RocksDB state store > provider while uploading to DFS to prevent id mismatch -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43404) Filter current version while reusing sst files for RocksDB state store provider while uploading to DFS to prevent id mismatch
[ https://issues.apache.org/jira/browse/SPARK-43404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-43404: Assignee: Anish Shrigondekar > Filter current version while reusing sst files for RocksDB state store > provider while uploading to DFS to prevent id mismatch > - > > Key: SPARK-43404 > URL: https://issues.apache.org/jira/browse/SPARK-43404 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > > Filter current version while reusing sst files for RocksDB state store > provider while uploading to DFS to prevent id mismatch -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43417) Improve CBO stats
Huaxin Gao created SPARK-43417: -- Summary: Improve CBO stats Key: SPARK-43417 URL: https://issues.apache.org/jira/browse/SPARK-43417 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.5.0 Reporter: Huaxin Gao While experimenting with the DS V2 column stats, we identified areas where we could potentially improve. For instance, we can probably propagate the NDV through Union, and add min/max for varchar columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43397) Log executor decommission duration
[ https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43397: - Assignee: Zhongwei Zhu > Log executor decommission duration > -- > > Key: SPARK-43397 > URL: https://issues.apache.org/jira/browse/SPARK-43397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > > Log executor decommission duration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43397) Log executor decommission duration in executorLost method
[ https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43397: -- Affects Version/s: 3.5.0 (was: 3.4.0) > Log executor decommission duration in executorLost method > - > > Key: SPARK-43397 > URL: https://issues.apache.org/jira/browse/SPARK-43397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.5.0 > > > Log executor decommission duration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43397) Log executor decommission duration in executorLost method
[ https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43397: -- Summary: Log executor decommission duration in executorLost method (was: Log executor decommission duration) > Log executor decommission duration in executorLost method > - > > Key: SPARK-43397 > URL: https://issues.apache.org/jira/browse/SPARK-43397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.5.0 > > > Log executor decommission duration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43397) Log executor decommission duration
[ https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43397. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41077 [https://github.com/apache/spark/pull/41077] > Log executor decommission duration > -- > > Key: SPARK-43397 > URL: https://issues.apache.org/jira/browse/SPARK-43397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.5.0 > > > Log executor decommission duration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43401) Upgrade buf to v1.18.0
[ https://issues.apache.org/jira/browse/SPARK-43401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43401. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41087 [https://github.com/apache/spark/pull/41087] > Upgrade buf to v1.18.0 > -- > > Key: SPARK-43401 > URL: https://issues.apache.org/jira/browse/SPARK-43401 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43401) Upgrade buf to v1.18.0
[ https://issues.apache.org/jira/browse/SPARK-43401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43401: - Assignee: BingKun Pan > Upgrade buf to v1.18.0 > -- > > Key: SPARK-43401 > URL: https://issues.apache.org/jira/browse/SPARK-43401 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43418) add SparkSession.Builder.getOrCreate()
Herman van Hövell created SPARK-43418: - Summary: add SparkSession.Builder.getOrCreate() Key: SPARK-43418 URL: https://issues.apache.org/jira/browse/SPARK-43418 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell Assignee: Herman van Hövell Add SparkSession.Builder.getOrCreate. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
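A sketch of the intended usage from the Connect Scala client once the builder gains getOrCreate; the connection string is a placeholder:
{code:java}
import org.apache.spark.sql.SparkSession

// "sc://localhost:15002" is an example Spark Connect endpoint, not a required value.
val spark = SparkSession
  .builder()
  .remote("sc://localhost:15002")
  .getOrCreate()

spark.range(5).show()
{code}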
[jira] [Updated] (SPARK-42514) Scala Client add partition transforms functions
[ https://issues.apache.org/jira/browse/SPARK-42514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42514: - Epic Link: SPARK-42554 > Scala Client add partition transforms functions > --- > > Key: SPARK-42514 > URL: https://issues.apache.org/jira/browse/SPARK-42514 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43296) Migrate Spark Connect session errors into error class
[ https://issues.apache.org/jira/browse/SPARK-43296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43296: - Assignee: Haejoon Lee > Migrate Spark Connect session errors into error class > - > > Key: SPARK-43296 > URL: https://issues.apache.org/jira/browse/SPARK-43296 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > Migrate Spark Connect session errors into error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43296) Migrate Spark Connect session errors into error class
[ https://issues.apache.org/jira/browse/SPARK-43296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43296. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40964 [https://github.com/apache/spark/pull/40964] > Migrate Spark Connect session errors into error class > - > > Key: SPARK-43296 > URL: https://issues.apache.org/jira/browse/SPARK-43296 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.5.0 > > > Migrate Spark Connect session errors into error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43357) Spark AWS Glue date partition push down broken
[ https://issues.apache.org/jira/browse/SPARK-43357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720763#comment-17720763 ] Snoot.io commented on SPARK-43357: -- User 'stijndehaes' has created a pull request for this issue: https://github.com/apache/spark/pull/41035 > Spark AWS Glue date partition push down broken > -- > > Key: SPARK-43357 > URL: https://issues.apache.org/jira/browse/SPARK-43357 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, > 3.3.1, 3.2.3, 3.2.4, 3.3.2 >Reporter: Stijn De Haes >Priority: Major > > When using the following project: > [https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore] > To have glue supported as as a hive metastore for spark there is an issue > when reading a date-partitioned data set. Writing is fine. > You get the following error: > {quote}org.apache.hadoop.hive.metastore.api.InvalidObjectException: > Unsupported expression '2023 - 05 - 03' (Service: AWSGlue; Status Code: 400; > Error Code: InvalidInputException; Request ID: > beed68c6-b228-442e-8783-52c25b9d2243; Proxy: null) > {quote} > > A fix for this is making sure the date passed to glue is quoted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
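Based on the description above, the failure comes down to how the partition predicate string handed to Glue is rendered; the strings below are illustrative only:
{code:java}
// Unquoted, Glue parses the date as the arithmetic expression 2023 - 05 - 03
// and rejects it with InvalidInputException.
val broken = "dt=2023-05-03"

// Quoting the literal, as the proposed fix does, gives Glue a valid expression.
val fixed = "dt='2023-05-03'"
{code}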
[jira] [Resolved] (SPARK-43343) Spark Streaming is not able to read a .txt file whose name has [] special character
[ https://issues.apache.org/jira/browse/SPARK-43343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-43343. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41022 [https://github.com/apache/spark/pull/41022] > Spark Streaming is not able to read a .txt file whose name has [] special > character > --- > > Key: SPARK-43343 > URL: https://issues.apache.org/jira/browse/SPARK-43343 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Minor > Fix For: 3.5.0 > > > * For example, If a directory contains a following file: > /path/abc[123] > and users would load spark.readStream.format("text").load("/path") as stream > input. It throws an exception, saying no matching path /path/abc[123]. Spark > thinks abc[123] is a regex that only matches file named abc1, abc2 and abc3. > * Upon investigation this is due to how we > [getBatch|https://github.com/databricks/runtime/blob/3af402d23620a0952e151d96c3184d2233217c87/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L269] > in the FileStreamSource. In `FileStreamSource` we already check file pattern > matching and find all match file names. However, in DataSource we check for > glob characters again and try to expend it > [here|https://github.com/databricks/runtime/blob/3af402d23620a0952e151d96c3184d2233217c87/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L274]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43343) Spark Streaming is not able to read a .txt file whose name has [] special character
[ https://issues.apache.org/jira/browse/SPARK-43343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-43343: Assignee: Siying Dong > Spark Streaming is not able to read a .txt file whose name has [] special > character > --- > > Key: SPARK-43343 > URL: https://issues.apache.org/jira/browse/SPARK-43343 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Minor > > * For example, If a directory contains a following file: > /path/abc[123] > and users would load spark.readStream.format("text").load("/path") as stream > input. It throws an exception, saying no matching path /path/abc[123]. Spark > thinks abc[123] is a regex that only matches file named abc1, abc2 and abc3. > * Upon investigation this is due to how we > [getBatch|https://github.com/databricks/runtime/blob/3af402d23620a0952e151d96c3184d2233217c87/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L269] > in the FileStreamSource. In `FileStreamSource` we already check file pattern > matching and find all match file names. However, in DataSource we check for > glob characters again and try to expend it > [here|https://github.com/databricks/runtime/blob/3af402d23620a0952e151d96c3184d2233217c87/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L274]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42668) Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort
[ https://issues.apache.org/jira/browse/SPARK-42668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720769#comment-17720769 ] Snoot.io commented on SPARK-42668: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/41098 > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > --- > > Key: SPARK-42668 > URL: https://issues.apache.org/jira/browse/SPARK-42668 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Fix For: 3.5.0 > > > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > We have seen some cases where the task exits as cancelled/failed which > triggers the abort in the task completion listener for > HDFSStateStoreProvider. As part of this, we cancel the backing stream and > close the compressed stream. However, different stores such as Azure blob > store could throw exceptions which are not caught in the current path, > leading to job failures. This change proposes to fix this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
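A minimal sketch of the defensive close described above, with an invented helper name rather than the actual HDFSStateStoreProvider code:
{code:java}
import java.io.OutputStream
import scala.util.control.NonFatal

// During abort, closing the compressed stream is best-effort cleanup: the backing
// stream may already be cancelled, and stores such as Azure blob storage can throw
// here. Swallow non-fatal errors instead of failing the task.
def closeQuietlyOnAbort(compressed: OutputStream): Unit = {
  try {
    compressed.close()
  } catch {
    case NonFatal(e) =>
      Console.err.println(s"Ignoring error while closing compressed stream during abort: $e")
  }
}
{code}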
[jira] [Commented] (SPARK-43014) spark.app.submitTime is not right in k8s cluster mode
[ https://issues.apache.org/jira/browse/SPARK-43014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720770#comment-17720770 ] Snoot.io commented on SPARK-43014: -- User 'zhouyifan279' has created a pull request for this issue: https://github.com/apache/spark/pull/40645 > spark.app.submitTime is not right in k8s cluster mode > - > > Key: SPARK-43014 > URL: https://issues.apache.org/jira/browse/SPARK-43014 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2, 3.4.0 >Reporter: Zhou Yifan >Priority: Major > > If submit Spark in k8s cluster mode, `spark.app.submitTime` will be > overwritten when driver starts. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores
Fei Wang created SPARK-43419: Summary: [K8S] Make limit.cores be able to be fallen back to request.cores Key: SPARK-43419 URL: https://issues.apache.org/jira/browse/SPARK-43419 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.4.0 Reporter: Fei Wang make limit.cores be able to be fallen back to request.cores If spark.kubernetes.executor/driver.limit.cores, treat request.cores as limit.cores. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores
[ https://issues.apache.org/jira/browse/SPARK-43419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43419: - Description: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores, treat request.cores as limit.cores. was: make limit.cores be able to be fallen back to request.cores If spark.kubernetes.executor/driver.limit.cores, treat request.cores as limit.cores. > [K8S] Make limit.cores be able to be fallen back to request.cores > - > > Key: SPARK-43419 > URL: https://issues.apache.org/jira/browse/SPARK-43419 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > make limit.cores be able to be fallen back to request.cores > now without limit.cores, we will meet below issue: > {code:java} > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: > Forbidden!Configured service account doesn't have access. Service account may > have been revoked. pods "" is forbidden: failed quota: > high-qos-limit-requests: must specify limits.cpu. {code} > If spark.kubernetes.executor/driver.limit.cores, treat request.cores as > limit.cores. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores
[ https://issues.apache.org/jira/browse/SPARK-43419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43419: - Description: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores, how about treat request.cores as limit.cores? was: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores, treat request.cores as limit.cores. > [K8S] Make limit.cores be able to be fallen back to request.cores > - > > Key: SPARK-43419 > URL: https://issues.apache.org/jira/browse/SPARK-43419 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > make limit.cores be able to be fallen back to request.cores > now without limit.cores, we will meet below issue: > {code:java} > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: > Forbidden!Configured service account doesn't have access. Service account may > have been revoked. pods "" is forbidden: failed quota: > high-qos-limit-requests: must specify limits.cpu. {code} > If spark.kubernetes.executor/driver.limit.cores, how about treat > request.cores as limit.cores? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720780#comment-17720780 ] ci-cassandra.apache.org commented on SPARK-43413: - User 'jchen5' has created a pull request for this issue: https://github.com/apache/spark/pull/41094 > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions are incorrectly marked as non-nullable, even when > they are actually nullable. They correctly check the nullability of the > left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is > currently defined with nullability = false always. This is incorrect and can > lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores
[ https://issues.apache.org/jira/browse/SPARK-43419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43419: - Description: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores is not specified, how about treating request.cores as limit.cores? was: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores, how about treat request.cores as limit.cores? > [K8S] Make limit.cores be able to be fallen back to request.cores > - > > Key: SPARK-43419 > URL: https://issues.apache.org/jira/browse/SPARK-43419 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > make limit.cores be able to be fallen back to request.cores > now without limit.cores, we will meet below issue: > {code:java} > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: > Forbidden!Configured service account doesn't have access. Service account may > have been revoked. pods "" is forbidden: failed quota: > high-qos-limit-requests: must specify limits.cpu. {code} > If spark.kubernetes.executor/driver.limit.cores is not specified, how about > treating request.cores as limit.cores? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
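A sketch of the proposed fallback, with invented names rather than the actual Kubernetes conf code:
{code:java}
// If spark.kubernetes.{driver,executor}.limit.cores is unset, reuse request.cores
// as the CPU limit so quotas that require limits.cpu are satisfied.
def effectiveLimitCores(requestCores: Option[String], limitCores: Option[String]): Option[String] =
  limitCores.orElse(requestCores)

// effectiveLimitCores(Some("2"), None) == Some("2")
// effectiveLimitCores(Some("2"), Some("4")) == Some("4")
{code}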
[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720788#comment-17720788 ] Kent Yao commented on SPARK-43338: -- Support multiple Hive metastore servers is a big feature, which can not be achieved by making SESSION_CATALOG_NAME variable cc [~cloud_fan] > Support modify the SESSION_CATALOG_NAME value > -- > > Key: SPARK-43338 > URL: https://issues.apache.org/jira/browse/SPARK-43338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > {code:java} > private[sql] object CatalogManager { > val SESSION_CATALOG_NAME: String = "spark_catalog" > }{code} > > The SESSION_CATALOG_NAME value cannot be modified。 > If multiple Hive Metastores exist, the platform manages multiple hms metadata > and classifies them by catalogName. A different catalog name is required > [~yao] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720794#comment-17720794 ] melin commented on SPARK-43338: --- That would be a bigger feature than what I am asking for. Only one HMS is accessed in a SparkSession; I just want the spark_catalog name to be modifiable. For example, if you have two Hadoop clusters, there would be two HMS instances. A metadata management platform (similar to Databricks Unity Catalog) collects the HMS metadata and, to keep identifiers unique, adds a catalogName (table id: catalogName.schemaName.tableName). When Spark accesses Hive tables, the catalog name should match the catalogName of the table id instead of spark_catalog. > Support modify the SESSION_CATALOG_NAME value > -- > > Key: SPARK-43338 > URL: https://issues.apache.org/jira/browse/SPARK-43338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > {code:java} > private[sql] object CatalogManager { > val SESSION_CATALOG_NAME: String = "spark_catalog" > }{code} > > The SESSION_CATALOG_NAME value cannot be modified。 > If multiple Hive Metastores exist, the platform manages multiple hms metadata > and classifies them by catalogName. A different catalog name is required > [~yao] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43420) Make DisableUnnecessaryBucketedScan smart with table cache
XiDuo You created SPARK-43420: - Summary: Make DisableUnnecessaryBucketedScan smart with table cache Key: SPARK-43420 URL: https://issues.apache.org/jira/browse/SPARK-43420 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You If a bucketed scan has no interesting partition or contains a shuffle exchange, we would disable it. But if the bucketed scan is inside a table cache, the cached plan may be accessed multiple times, so we should not disable it: keeping it preserves the output partitioning and makes the cached plan more likely to be reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
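An illustrative scenario for this change (table names invented, SparkSession `spark` assumed): a bucketed table whose scan sits inside a cached plan that several queries reuse.
{code:java}
// Assumes `events` is a table bucketed by `id`. If the bucketed scan inside the
// cached plan were disabled, the cached data would lose its output partitioning
// and each reuse could pay an extra shuffle on the bucket column.
spark.sql("CACHE TABLE cached_events AS SELECT * FROM events")

// Both queries reuse the cached, bucketed data and can benefit from the preserved
// partitioning.
spark.sql("SELECT id, COUNT(*) FROM cached_events GROUP BY id").show()
spark.sql("SELECT c.id, d.name FROM cached_events c JOIN dim d ON c.id = d.id").show()
{code}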
[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720798#comment-17720798 ] Kent Yao commented on SPARK-43338: -- Why not just use the Catalog V2 API to implement a separate catalog extension with Hive support? Or use an existing one: https://kyuubi.readthedocs.io/en/v1.7.1-rc0/connector/spark/hive.html > Support modify the SESSION_CATALOG_NAME value > -- > > Key: SPARK-43338 > URL: https://issues.apache.org/jira/browse/SPARK-43338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > {code:java} > private[sql] object CatalogManager { > val SESSION_CATALOG_NAME: String = "spark_catalog" > }{code} > > The SESSION_CATALOG_NAME value cannot be modified。 > If multiple Hive Metastores exist, the platform manages multiple hms metadata > and classifies them by catalogName. A different catalog name is required > [~yao] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
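To make the suggestion concrete, here is a hedged sketch of registering an additional Hive catalog under its own name through the v2 catalog API; the catalog name, metastore URI, and table are placeholders, and the connector class name is taken from the Kyuubi Spark Hive connector and should be verified against the linked documentation:
{code:java}
import org.apache.spark.sql.SparkSession

// Each extra metastore gets its own catalog name; spark_catalog stays untouched.
val spark = SparkSession.builder()
  .appName("multi-hms-sketch")
  .config("spark.sql.catalog.hive_prod", "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
  .config("spark.sql.catalog.hive_prod.hive.metastore.uris", "thrift://metastore-prod:9083")
  .getOrCreate()

// Tables are then addressed as catalogName.schemaName.tableName.
spark.sql("SELECT * FROM hive_prod.sales_db.orders").show()
{code}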
[jira] [Created] (SPARK-43422) Tags are lost on LogicalRelation when adding _metadata
Jan-Ole Sasse created SPARK-43422: - Summary: Tags are lost on LogicalRelation when adding _metadata Key: SPARK-43422 URL: https://issues.apache.org/jira/browse/SPARK-43422 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.4.0 Reporter: Jan-Ole Sasse The AddMetadataColumns rule does not copy the tags of the LogicalRelation when adding the metadata output in addMetadataCol. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org