[jira] [Created] (SPARK-43405) Remove useless code in `ScriptInputOutputSchema`
Jia Fan created SPARK-43405: --- Summary: Remove useless code in `ScriptInputOutputSchema` Key: SPARK-43405 URL: https://issues.apache.org/jira/browse/SPARK-43405 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Jia Fan In case class `ScriptInputOutputSchema`, some methods like `getRowFormatSQL` are never used, so we can remove them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43406) enable spark sql to drop multiple partitions in one call
chenruotao created SPARK-43406: -- Summary: enable spark sql to drop multiple partitions in one call Key: SPARK-43406 URL: https://issues.apache.org/jira/browse/SPARK-43406 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0, 3.3.2, 3.2.1 Reporter: chenruotao Fix For: 3.5.0 Currently Spark SQL cannot drop multiple partitions in one call, so this patch fixes that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenruotao updated SPARK-43406: --- Description: Currently Spark SQL cannot drop multiple partitions in one call, so this patch fixes that. With this patch we can drop multiple partitions like this: alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31') was:Now spark sql cannot drop multiple partitions in one call, so I fix it > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > Fix For: 3.5.0 > > > Currently Spark SQL cannot drop multiple partitions in one call, so this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
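For illustration, a minimal sketch of invoking the proposed syntax from Scala (assuming a SparkSession named spark; the table and partition column come from the example above, and the range-comparator partition spec is exactly the new behavior this patch adds, so it does not work on releases without it):
{code:scala}
// Proposed syntax only: drop every partition whose dt falls in the range.
spark.sql("""
  ALTER TABLE test.table_partition
  DROP PARTITION (dt <= '2023-04-02', dt >= '2023-03-31')
""")
{code}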
[jira] [Updated] (SPARK-43400) create table support the PRIMARY KEY keyword
[ https://issues.apache.org/jira/browse/SPARK-43400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin updated SPARK-43400: -- Description: Apache Paimon and Hudi support primary key definitions. It is necessary to support the primary key definition syntax. [~gurwls223] was:apache paimon and hudi support primary key definitions. It is necessary to support the primary key definition syntax > create table support the PRIMARY KEY keyword > > > Key: SPARK-43400 > URL: https://issues.apache.org/jira/browse/SPARK-43400 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > Apache Paimon and Hudi support primary key definitions. It is necessary to > support the primary key definition syntax. > [~gurwls223] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43407) Can executors recover/reuse shuffle files upon failure?
Faiz Halde created SPARK-43407: -- Summary: Can executors recover/reuse shuffle files upon failure? Key: SPARK-43407 URL: https://issues.apache.org/jira/browse/SPARK-43407 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 3.3.1 Reporter: Faiz Halde Hello, We've been in touch with a few Spark specialists who suggested a potential solution to improve the reliability of our shuffle-heavy jobs. Here is what our setup looks like: * Spark version: 3.3.1 * Java version: 1.8 * We do not use the external shuffle service * We use spot instances We run Spark jobs on clusters that use Amazon EBS volumes, and spark.local.dir is mounted on this EBS volume. One of the offerings from the service we use is EBS migration, which basically means that if a host is about to get evicted, a new host is created and the EBS volume is attached to it. The claim is that when Spark assigns a new executor to the newly created instance, it can recover all the shuffle files that are already persisted on the migrated EBS volume. Is this how it works? Do executors recover / re-register the shuffle files that they find? So far I have not come across any recovery mechanism. I can only see {noformat} KubernetesLocalDiskShuffleDataIO{noformat} which has a pre-init step where it tries to register the available shuffle files to itself. A natural follow-up on this: if what they claim is true, then ideally we should expect that when an executor is killed/OOM'd and a new executor is spawned on the same host, the new executor registers the shuffle files to itself. Is that so? Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
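For reference, a hedged sketch of the only recovery path mentioned above: opting into the pre-init shuffle-file registration of KubernetesLocalDiskShuffleDataIO through the shuffle IO plugin hook, with spark.local.dir pointing at the EBS-backed mount. The class only exists in the Kubernetes backend of newer releases, and the mount point and fully qualified class name below are assumptions:
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: spark.shuffle.sort.io.plugin.class is the existing hook for a
// ShuffleDataIO implementation; KubernetesLocalDiskShuffleDataIO tries to
// re-register shuffle files it finds under spark.local.dir when an executor
// starts. Whether this covers the EBS-migration setup described above is the
// open question in this ticket.
val spark = SparkSession.builder()
  .appName("shuffle-recovery-sketch")
  .config("spark.local.dir", "/mnt/ebs/spark-local") // assumed EBS mount point
  .config("spark.shuffle.sort.io.plugin.class",
    "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO") // assumed FQCN
  .getOrCreate()
{code}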
[jira] [Updated] (SPARK-43400) create table support the PRIMARY KEY keyword
[ https://issues.apache.org/jira/browse/SPARK-43400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] melin updated SPARK-43400: -- Description: Apache Paimon and Hudi support primary key definitions. It is necessary to support the primary key definition syntax: https://docs.snowflake.com/en/sql-reference/sql/create-table-constraint#constraint-properties [~gurwls223] was: apache paimon and hudi support primary key definitions. It is necessary to support the primary key definition syntax [~gurwls223] > create table support the PRIMARY KEY keyword > > > Key: SPARK-43400 > URL: https://issues.apache.org/jira/browse/SPARK-43400 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > Apache Paimon and Hudi support primary key definitions. It is necessary to > support the primary key definition syntax: > https://docs.snowflake.com/en/sql-reference/sql/create-table-constraint#constraint-properties > [~gurwls223] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
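For reference, a sketch of the kind of DDL being requested, loosely modeled on the linked Snowflake constraint syntax and on Paimon/Hudi DDL. The table, columns, and USING clause are illustrative, and the statement does not parse in current Spark, which is the point of this ticket:
{code:scala}
// Proposed/illustrative only: Spark's parser rejects PRIMARY KEY today, so
// this currently fails with a ParseException.
spark.sql("""
  CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10, 2),
    PRIMARY KEY (order_id)
  ) USING paimon
""")
{code}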
[jira] [Created] (SPARK-43408) Spark caching in the context of a single job
Faiz Halde created SPARK-43408: -- Summary: Spark caching in the context of a single job Key: SPARK-43408 URL: https://issues.apache.org/jira/browse/SPARK-43408 Project: Spark Issue Type: Question Components: Shuffle Affects Versions: 3.3.1 Reporter: Faiz Halde Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720446#comment-17720446 ] ASF GitHub Bot commented on SPARK-43406: User 'chenruotao' has created a pull request for this issue: https://github.com/apache/spark/pull/41090 > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > Fix For: 3.5.0 > > > Now spark sql cannot drop multiple partitions in one call, so I fix it > With this patch we can drop multiple partitions like this : > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Description: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. * Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| h1. ** * When performing gen code of the input expression, if the input expression is in the common expressions of the current CodeGenContext, the corresponding subExpr function will be called. After the first function call, *subExprInit* will be set to true, and the subsequent function calls will be skipped. h2. Why should we support whole-stage subexpression elimination Right now each spark physical operator shares nothing but the input row, so the same expressions may be evaluated multiple times across different operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter operators. We can reuse the expression results across different operators such as Project and Filter. h2. How to support whole-stage subexpression elimination * Add two properties in CodegenSupport trait, the reusable expressions and the the output attributes, we can reuse the expression results only if the output attributes are the same. * Visit all operators from top to bottom, bound the candidate expressions with the output attributes and add to the current candidate reusable expressions. * Visit all operators from bottom to top, collect all the common expressions to the current operator, and add the initialize code to the current operator if the common expressions have not been initialized. * Replace the common expressions code when generating codes for the physical operators. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* {code:java} SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10 {code} We can reuse the result of expression *v* * *v* *+* *1* * h2. *Support subexpression elimination in JoinExec* {code:java} SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3{code} We can reuse the result of expression *b* * *y* * h2. 
*Support subexpression elimination in ExpandExec* {code:java} SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c was: h1. *Design Sketch* * Get all common expressions from input expressions. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit_n* to indicate whether we have already evaluated the common expression, and reset it to *false* at the start of operator.consume() * Add a new wrapper subExpr function for common subexpression. {code:java} private void subExpr_n(${argList.mkString(", ")}) { if (!subExprInit_n) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } } {code} * Replace all the common subexpression with the wrapper function *subExpr_n(argList)*. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Description: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. * Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| * When performing gen code of the input expression, if the input expression is in the common expressions of the current CodeGenContext, the corresponding subExpr function will be called. After the first function call, *subExprInit* will be set to true, and the subsequent function calls will be skipped. h2. Why should we support whole-stage subexpression elimination Right now each spark physical operator shares nothing but the input row, so the same expressions may be evaluated multiple times across different operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter operators. We can reuse the expression results across different operators such as Project and Filter. h2. How to support whole-stage subexpression elimination * Add two properties in CodegenSupport trait, the reusable expressions and the the output attributes, we can reuse the expression results only if the output attributes are the same. * Visit all operators from top to bottom, bound the candidate expressions with the output attributes and add to the current candidate reusable expressions. * Visit all operators from bottom to top, collect all the common expressions to the current operator, and add the initialize code to the current operator if the common expressions have not been initialized. * Replace the common expressions code when generating codes for the physical operators. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* {code:java} SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10 {code} We can reuse the result of expression *v* * *v* *+* *1* * h2. *Support subexpression elimination in JoinExec* {code:java} SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3{code} We can reuse the result of expression *b* * *y* * h2. 
*Support subexpression elimination in ExpandExec* {code:java} SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c was: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. * Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| h1. ** * When performing gen code of the input expression, if the input expression is in the common expressions of the current CodeGenContext, the corresponding subExpr function will be called. After the first function call, *subExprInit* will be set to true, and the sub
[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases
[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-42551: Description: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: ** Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. ** Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. ** Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| * When generating the input expression code, if the input expression is a common expression, the expression code will be replaced with the corresponding subExpr function. When the subExpr function is called for the first time, *subExprInit* will be set to true, and the subsequent function calls will do nothing. h2. Why should we support whole-stage subexpression elimination Right now each spark physical operator shares nothing but the input row, so the same expressions may be evaluated multiple times across different operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter operators. We can reuse the expression results across different operators such as Project and Filter. h2. How to support whole-stage subexpression elimination * Add two properties in CodegenSupport trait, the reusable expressions and the the output attributes, we can reuse the expression results only if the output attributes are the same. * Visit all operators from top to bottom, bound the candidate expressions with the output attributes and add to the current candidate reusable expressions. * Visit all operators from bottom to top, collect all the common expressions to the current operator, and add the initialize code to the current operator if the common expressions have not been initialized. * Replace the common expressions code when generating codes for the physical operators. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* {code:java} SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10 {code} We can reuse the result of expression *v* * *v* *+* *1* * h2. *Support subexpression elimination in JoinExec* {code:java} SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3{code} We can reuse the result of expression *b* * *y* * h2. 
*Support subexpression elimination in ExpandExec* {code:java} SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c was: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. * Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| * When performing gen code of the input expression, if the input expression is in the common expressions of the current CodeGenContext, the corresponding subExpr function will be called. After the first function call, *subExprInit* will be set to true, and
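One way to evaluate the proposal above is to inspect the whole-stage generated code directly; a sketch using the existing debugCodegen helper and one of the example queries from the description (whether the guarded wrapper shows up literally as subExpr_n is an implementation detail of the proposal, not current behavior):
{code:scala}
import org.apache.spark.sql.execution.debug._ // provides debugCodegen()

// Prints the Java code produced by whole-stage codegen; with the proposed
// elimination, "b + c" should be evaluated once behind a guarded wrapper
// instead of once per conditional branch.
val df = spark.sql("""
  SELECT a,
         max(if(a > 0, b + c, null)) AS max_bc,
         min(if(a > 1, b + c, null)) AS min_bc
  FROM VALUES (1, 1, 1) AS t(a, b, c)
  GROUP BY a
""")
df.debugCodegen()
{code}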
[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job
[ https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-43408: --- Description: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API was: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself > Spark caching in the context of a single job > > > Key: SPARK-43408 > URL: https://issues.apache.org/jira/browse/SPARK-43408 > Project: Spark > Issue Type: Question > Components: Shuffle >Affects Versions: 3.3.1 >Reporter: Faiz Halde >Priority: Trivial > > Does caching benefit a spark job with only a single action in it? Spark IIRC > already optimizes shuffles by persisting them onto the disk > I am unable to find a counter-example where caching would benefit a job with > a single action. In every case I can think of, the shuffle checkpoint acts as > a good enough caching mechanism in itself > FWIW, I am talking specifically in the context of the Dataframe API -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job
[ https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-43408: --- Description: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API. The StorageLevel allowed in my case would only be DISK_ONLY i.e. I am not looking to speed up by caching data in memory was: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API > Spark caching in the context of a single job > > > Key: SPARK-43408 > URL: https://issues.apache.org/jira/browse/SPARK-43408 > Project: Spark > Issue Type: Question > Components: Shuffle >Affects Versions: 3.3.1 >Reporter: Faiz Halde >Priority: Trivial > > Does caching benefit a spark job with only a single action in it? Spark IIRC > already optimizes shuffles by persisting them onto the disk > I am unable to find a counter-example where caching would benefit a job with > a single action. In every case I can think of, the shuffle checkpoint acts as > a good enough caching mechanism in itself > FWIW, I am talking specifically in the context of the Dataframe API. The > StorageLevel allowed in my case would only be DISK_ONLY i.e. I am not looking > to speed up by caching data in memory -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job
[ https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Faiz Halde updated SPARK-43408: --- Description: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API. The StorageLevel allowed in my case is DISK_ONLY i.e. I am not looking to speed up by caching data in memory To rephrase, is DISK_ONLY caching better or same as shuffle checkpointing in the context of a single action was: Does caching benefit a spark job with only a single action in it? Spark IIRC already optimizes shuffles by persisting them onto the disk I am unable to find a counter-example where caching would benefit a job with a single action. In every case I can think of, the shuffle checkpoint acts as a good enough caching mechanism in itself FWIW, I am talking specifically in the context of the Dataframe API. The StorageLevel allowed in my case would only be DISK_ONLY i.e. I am not looking to speed up by caching data in memory > Spark caching in the context of a single job > > > Key: SPARK-43408 > URL: https://issues.apache.org/jira/browse/SPARK-43408 > Project: Spark > Issue Type: Question > Components: Shuffle >Affects Versions: 3.3.1 >Reporter: Faiz Halde >Priority: Trivial > > Does caching benefit a spark job with only a single action in it? Spark IIRC > already optimizes shuffles by persisting them onto the disk > I am unable to find a counter-example where caching would benefit a job with > a single action. In every case I can think of, the shuffle checkpoint acts as > a good enough caching mechanism in itself > FWIW, I am talking specifically in the context of the Dataframe API. The > StorageLevel allowed in my case is DISK_ONLY i.e. I am not looking to speed > up by caching data in memory > To rephrase, is DISK_ONLY caching better or same as shuffle checkpointing in > the context of a single action -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
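For concreteness, a sketch of the trade-off being asked about, using a hypothetical single-action job (the paths and column names are illustrative):
{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Hypothetical single-action job: one shuffle feeding two branches that are
// unioned into a single write, so there is still only one action at the end.
val base = spark.read.parquet("/data/events") // illustrative input path
  .groupBy("user_id")
  .count()                                    // introduces a shuffle

// The comparison in question: rely on the shuffle files Spark already keeps
// on disk, or additionally persist the aggregated result with DISK_ONLY.
val cached = base.persist(StorageLevel.DISK_ONLY)

val heavy = cached.filter(col("count") > 100)
val light = cached.filter(col("count") <= 100)
heavy.union(light).write.mode("overwrite").parquet("/data/out") // the single action
{code}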
[jira] [Updated] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Fan updated SPARK-39281: Description: The optimization of {{DefaultTimestampFormatter}} has been implemented in [#36562|https://github.com/apache/spark/pull/36562] , this ticket adds the optimization of legacy format. The basic logic is to prevent the formatter from throwing exceptions, and then use catch to determine whether the parsing is successful. > Speed up Timestamp type inference of legacy format in JSON/CSV data source > -- > > Key: SPARK-39281 > URL: https://issues.apache.org/jira/browse/SPARK-39281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > > The optimization of {{DefaultTimestampFormatter}} has been implemented in > [#36562|https://github.com/apache/spark/pull/36562] , this ticket adds the > optimization of legacy format. The basic logic is to prevent the formatter > from throwing exceptions, and then use catch to determine whether the parsing > is successful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
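For context, a sketch of the code path being sped up: timestamp inference during CSV schema inference under the legacy time parser policy (the input path is illustrative):
{code:scala}
// Schema inference has to try parsing string columns as timestamps; the
// legacy-format path is the one this ticket optimizes.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/events.csv") // illustrative path

df.printSchema() // timestamp-like columns should be inferred as TimestampType
{code}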
[jira] [Updated] (SPARK-39281) Speed up Timestamp type inference of legacy format in JSON/CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Fan updated SPARK-39281: Summary: Speed up Timestamp type inference of legacy format in JSON/CSV data source (was: Fasten Timestamp type inference of legacy format in JSON/CSV data source) > Speed up Timestamp type inference of legacy format in JSON/CSV data source > -- > > Key: SPARK-39281 > URL: https://issues.apache.org/jira/browse/SPARK-39281 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43409) Add documentation for protobuf deserialization options
Parth Upadhyay created SPARK-43409: -- Summary: Add documentation for protobuf deserialization options Key: SPARK-43409 URL: https://issues.apache.org/jira/browse/SPARK-43409 Project: Spark Issue Type: Improvement Components: Documentation, Protobuf Affects Versions: 3.4.0 Reporter: Parth Upadhyay Follow-up from PR comment: [https://github.com/apache/spark/pull/41075#discussion_r1186958551] Currently there's no documentation here [https://github.com/apache/spark/blob/master/docs/sql-data-sources-protobuf.md#deploying] for the full range of options available for `from_protobuf`, so users would need to discover these options by reading the Scala code directly. Let's update the documentation to include information about these options. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
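For context, a sketch of how such an options map is passed today from Scala; the DataFrame, descriptor path, message name, and option keys below are assumptions, and pinning down the real set of supported keys is exactly the gap this documentation update should close:
{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

// Assumes an existing DataFrame `df` with a binary protobuf payload in a
// column named "value". The option keys are examples only.
val options = new java.util.HashMap[String, String]()
options.put("mode", "PERMISSIVE")              // assumed parse-mode option
options.put("recursive.fields.max.depth", "2") // assumed recursion-depth option

val parsed = df.select(
  from_protobuf(col("value"), "Person", "/path/to/descriptor.desc", options)
    .as("person"))
{code}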
[jira] [Assigned] (SPARK-41532) DF operations that involve multiple data frames should fail if sessions don't match
[ https://issues.apache.org/jira/browse/SPARK-41532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-41532: - Assignee: Jia Fan > DF operations that involve multiple data frames should fail if sessions don't > match > --- > > Key: SPARK-41532 > URL: https://issues.apache.org/jira/browse/SPARK-41532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Jia Fan >Priority: Major > > We do not support joining for example two data frames from different Spark > Connect Sessions. To avoid exceptions, the client should clearly fail when it > tries to construct such a composition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41532) DF operations that involve multiple data frames should fail if sessions don't match
[ https://issues.apache.org/jira/browse/SPARK-41532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-41532. --- Fix Version/s: 3.5.0 Resolution: Fixed > DF operations that involve multiple data frames should fail if sessions don't > match > --- > > Key: SPARK-41532 > URL: https://issues.apache.org/jira/browse/SPARK-41532 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Jia Fan >Priority: Major > Fix For: 3.5.0 > > > We do not support joining for example two data frames from different Spark > Connect Sessions. To avoid exceptions, the client should clearly fail when it > tries to construct such a composition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43410) Improve vectorized loop for Packed skipValues
xiaochen zhou created SPARK-43410: - Summary: Improve vectorized loop for Packed skipValues Key: SPARK-43410 URL: https://issues.apache.org/jira/browse/SPARK-43410 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: xiaochen zhou Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720525#comment-17720525 ] xiaochen zhou commented on SPARK-43410: --- https://github.com/apache/spark/pull/41092 > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Priority: Minor > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43292) ArtifactManagerSuite can't run using maven
[ https://issues.apache.org/jira/browse/SPARK-43292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-43292. --- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed > ArtifactManagerSuite can't run using maven > -- > > Key: SPARK-43292 > URL: https://issues.apache.org/jira/browse/SPARK-43292 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > run > {code:java} > build/mvn clean install -DskipTests -Phive > build/mvn test -pl connector/connect/server {code} > ArtifactManagerSuite failed due to > > {code:java} > 23/04/26 16:00:07.666 ScalaTest-main-running-DiscoverySuite ERROR Executor: > Could not find org.apache.spark.repl.ExecutorClassLoader on classpath! {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43389) spark.read.csv throws NullPointerException when lineSep is set to None
[ https://issues.apache.org/jira/browse/SPARK-43389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720568#comment-17720568 ] Zach Liu commented on SPARK-43389: -- that's why i set the type as "Improvement" and the priority as "Trivial". but still, it's an unnecessary confusion > spark.read.csv throws NullPointerException when lineSep is set to None > -- > > Key: SPARK-43389 > URL: https://issues.apache.org/jira/browse/SPARK-43389 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.1 >Reporter: Zach Liu >Priority: Trivial > > lineSep was defined as Optional[str] yet i'm unable to explicitly set it as > None: > reader = spark.read.format("csv") > read_options={'inferSchema': False, 'header': True, 'mode': 'DROPMALFORMED', > 'sep': '\t', 'escape': '\\', 'multiLine': False, 'lineSep': None} > for option, option_value in read_options.items(): > reader = reader.option(option, option_value) > df = reader.load("s3://") > raises exception: > py4j.protocol.Py4JJavaError: An error occurred while calling o126.load. > : java.lang.NullPointerException > at > scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51) > at scala.collection.immutable.StringOps.length(StringOps.scala:51) > at > scala.collection.IndexedSeqOptimized.isEmpty(IndexedSeqOptimized.scala:30) > at > scala.collection.IndexedSeqOptimized.isEmpty$(IndexedSeqOptimized.scala:30) > at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:33) > at scala.collection.TraversableOnce.nonEmpty(TraversableOnce.scala:143) > at scala.collection.TraversableOnce.nonEmpty$(TraversableOnce.scala:143) > at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:33) > at > org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:216) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:215) > at > org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:47) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60) > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:210) > at scala.Option.orElse(Option.scala:447) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:411) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.lang.Thread.run(Thread.java:750) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaochen zhou updated SPARK-43410: -- Description: Improve vectorized loop for Packed skipValues (was: Improve vectorized loop for Packed skipValues {{}}) > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Priority: Minor > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-43410. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41092 [https://github.com/apache/spark/pull/41092] > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Priority: Minor > Fix For: 3.5.0 > > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43410) Improve vectorized loop for Packed skipValues
[ https://issues.apache.org/jira/browse/SPARK-43410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-43410: Assignee: xiaochen zhou > Improve vectorized loop for Packed skipValues > - > > Key: SPARK-43410 > URL: https://issues.apache.org/jira/browse/SPARK-43410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: xiaochen zhou >Assignee: xiaochen zhou >Priority: Minor > Fix For: 3.5.0 > > > Improve vectorized loop for Packed skipValues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43411) Can't union dataframes with # in subcolumn name
Rudhra Raveendran created SPARK-43411: - Summary: Can't union dataframes with # in subcolumn name Key: SPARK-43411 URL: https://issues.apache.org/jira/browse/SPARK-43411 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Environment: * Azure Synapse Notebooks * Apache Spark Pool: [Azure Synapse Runtime for Apache Spark 3.1 (EOLA) - Azure Synapse Analytics | Microsoft Learn|https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-3-runtime] ** Spark 3.1.2 ** Ubuntu 18.04 ** Python 3.8 ** Scala 2.12.10 ** Hadoop 3.1.1 ** Java 1.8.0_282 ** .NET Core 3.1 ** .NET for Apache Spark 2.0.0 ** Delta Lake 1.0 Reporter: Rudhra Raveendran I was using Spark within an Azure Synapse notebook to load dataframes from various storage accounts and union them into a single dataframe, but it seems to fail as the SQL internal to union doesn't handle special characters properly. Here is a code example of what I was running: {code:java} val data1 = spark.read.parquet("abfss://PATH1") val data2 = spark.read.parquet("abfss://PATH2") val data3 = spark.read.parquet("abfss://PATH3") val data4 = spark.read.parquet("abfss://PATH4") val data = data1 .unionByName(data2, allowMissingColumns=true) .unionByName(data3, allowMissingColumns=true) .unionByName(data4, allowMissingColumns=true) data.printSchema() {code} The issue arose due to having a StructType column, e.g. ABC, that has a subcolumn with # in the name, e.g. #XYZ#. This doesn't seem to be a problem outright, as other Spark functions like select work fine: {code:java} data1.select("ABC.#XYZ#").where(col("#XYZ#").isNotNull).show(5, truncate = false) {code} However, when I ran the earlier snippet with the union statements, I get this error: {code:java} org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '#' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 
'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TIME', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) == SQL == #XYZ# ^^^ {code} This seems to indicate the issue is with the implementation of unionByName/union etc. however I'm not familiar enough with the codebase to figure out where this would be an issue (I was able to trace that UnionByName calls on Union which I think is defined here: [spark/basicLogicalOperators.scala at master · apache/spark
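A minimal local sketch of the failure shape, assuming the abfss data above reduces to a struct column ABC with a field literally named #XYZ# and one side carrying an extra top-level column, so that allowMissingColumns has to reconcile the schemas. On 3.1.2 this may hit the same ParseException, while selecting the nested field works:
{code:scala}
// Struct column ABC with a field named #XYZ#; df2 has an extra column.
val df1 = spark.sql("SELECT named_struct('#XYZ#', 1) AS ABC")
val df2 = spark.sql("SELECT named_struct('#XYZ#', 2) AS ABC, 'x' AS extra")

// Direct nested access is fine (backticks quote the special characters).
df1.select("ABC.`#XYZ#`").show()

// The union by name is the call reported to fail on Spark 3.1.2.
df1.unionByName(df2, allowMissingColumns = true).printSchema()
{code}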
[jira] [Commented] (SPARK-38471) Use error classes in org.apache.spark.rdd
[ https://issues.apache.org/jira/browse/SPARK-38471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720628#comment-17720628 ] xin chen commented on SPARK-38471: -- awesome, thanks. > Use error classes in org.apache.spark.rdd > - > > Key: SPARK-38471 > URL: https://issues.apache.org/jira/browse/SPARK-38471 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38471) Use error classes in org.apache.spark.rdd
[ https://issues.apache.org/jira/browse/SPARK-38471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720631#comment-17720631 ] xin chen commented on SPARK-38471: -- Hey [~maxgekk], I'm currently working on this ticket. Could you assign this ticket to me? Thanks > Use error classes in org.apache.spark.rdd > - > > Key: SPARK-38471 > URL: https://issues.apache.org/jira/browse/SPARK-38471 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43412) Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs
Xinrong Meng created SPARK-43412: Summary: Introduce `SQL_ARROW_BATCHED_UDF` EvalType for Arrow-optimized Python UDFs Key: SPARK-43412 URL: https://issues.apache.org/jira/browse/SPARK-43412 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Xinrong Meng We are about to improve nested non-atomic input/output support for Arrow-optimized Python UDFs. However, an Arrow-optimized Python UDF currently shares the same EvalType as a pickled Python UDF, but the same implementation as a Pandas UDF. Introducing a dedicated EvalType makes it possible to isolate the changes to Arrow-optimized Python UDFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43411) Can't union dataframes with # in subcolumn name
[ https://issues.apache.org/jira/browse/SPARK-43411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rudhra Raveendran resolved SPARK-43411. --- Resolution: Won't Fix Turns out this issue isn't present in later versions of Spark (I just tested on the latest version in Synapse, which is 3.3.1, and it worked). I'm assuming since this version is EOL this will be a wontfix, hence closing this out. > Can't union dataframes with # in subcolumn name > --- > > Key: SPARK-43411 > URL: https://issues.apache.org/jira/browse/SPARK-43411 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 > Environment: * Azure Synapse Notebooks > * Apache Spark Pool: [Azure Synapse Runtime for Apache Spark 3.1 (EOLA) - > Azure Synapse Analytics | Microsoft > Learn|https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-3-runtime] > ** Spark 3.1.2 > ** Ubuntu 18.04 > ** Python 3.8 > ** Scala 2.12.10 > ** Hadoop 3.1.1 > ** Java 1.8.0_282 > ** .NET Core 3.1 > ** .NET for Apache Spark 2.0.0 > ** Delta Lake 1.0 >Reporter: Rudhra Raveendran >Priority: Major > > I was using Spark within an Azure Synapse notebook to load dataframes from > various storage accounts and union them into a single dataframe, but it seems > to fail as the SQL internal to union doesn't handle special characters > properly. Here is a code example of what I was running: > {code:java} > val data1 = spark.read.parquet("abfss://PATH1") > val data2 = spark.read.parquet("abfss://PATH2") > val data3 = spark.read.parquet("abfss://PATH3") > val data4 = spark.read.parquet("abfss://PATH4") > val data = data1 > .unionByName(data2, allowMissingColumns=true) > .unionByName(data3, allowMissingColumns=true) > .unionByName(data4, allowMissingColumns=true) > data.printSchema() {code} > The issue arose due to having a StructType column, e.g. ABC, that has a > subcolumn with # in the name, e.g. #XYZ#. 
This doesn't seem to be a problem > outright, as other Spark functions like select work fine: > {code:java} > data1.select("ABC.#XYZ#").where(col("#XYZ#").isNotNull).show(5, truncate = > false) {code} > However, when I ran the earlier snippet with the union statements, I get this > error: > {code:java} > org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '#' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', > 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', > 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', > 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', > 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', > 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', > 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', > 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, > 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', > 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', > 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', > 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', > 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', > 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', > 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', > 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', > 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', > 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', > 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', > 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', > 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', > 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', > 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', > 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', > 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', > 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', > 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', > 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', > 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', > 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', > 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, > 'TERMINATED', 'THEN', 'TIME', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', > 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARC
[jira] [Created] (SPARK-43413) IN subquery ListQuery has wrong nullability
Jack Chen created SPARK-43413: - Summary: IN subquery ListQuery has wrong nullability Key: SPARK-43413 URL: https://issues.apache.org/jira/browse/SPARK-43413 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Jack Chen IN subquery expressions currently are marked as nullable if and only if the left-hand-side is nullable - because the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43413: -- Description: IN subquery expressions currently are marked as nullable if and only if the left-hand-side is nullable - because the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. was: IN subquery expressions currently are marked as nullable if and only if the left-hand-side is nullable - because the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions currently are marked as nullable if and only if the > left-hand-side is nullable - because the right-hand-side of a IN subquery, > the ListQuery, is currently defined with nullability = false always. This is > incorrect and can lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. 
> This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43414) Fix flakiness in Kafka RDD suites due to port binding configuration issue
Josh Rosen created SPARK-43414: -- Summary: Fix flakiness in Kafka RDD suites due to port binding configuration issue Key: SPARK-43414 URL: https://issues.apache.org/jira/browse/SPARK-43414 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Josh Rosen Assignee: Josh Rosen In SPARK-36837 we updated Kafka to 3.10, which uses a different set of configuration options for configuring the broker listener port. That PR only updated one of two KafkaTestUtils files (the SQL one), so the other one (used by Core tests) had an ineffective port binding configuration and would bind to the default 9092 port. This could lead to flakiness if multiple suites binding to that port ran in parallel. To fix this, we just need to copy the updated port binding configuration from the other suite. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
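As a rough illustration of the ephemeral-port approach mentioned above, the sketch below binds a test broker to port 0 so the OS picks a free port; the property keys are standard Kafka broker settings, but the helper itself is invented and is not the actual KafkaTestUtils code:
{code:java}
import java.util.Properties

// Hard-coding 9092 is what let parallel suites collide; with port 0 the OS picks
// a free port, and the test harness is assumed to read the bound port back from
// the broker after startup.
def testBrokerProps(host: String = "127.0.0.1"): Properties = {
  val props = new Properties()
  props.put("listeners", s"PLAINTEXT://$host:0")
  props.put("broker.id", "0")
  props.put("offsets.topic.replication.factor", "1")
  props
}
{code}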
[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43413: -- Description: IN subquery expressions are incorrectly always marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. was: IN subquery expressions currently are marked as nullable if and only if the left-hand-side is nullable - because the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions are incorrectly always marked as non-nullable, even > when they are actually nullable. They correctly check the nullability of the > left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is > currently defined with nullability = false always. This is incorrect and can > lead to incorrect query transformations. 
> Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues
[jira] [Updated] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-43413: -- Description: IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. was: IN subquery expressions are incorrectly always marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions are incorrectly marked as non-nullable, even when > they are actually nullable. They correctly check the nullability of the > left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is > currently defined with nullability = false always. 
This is incorrect and can > lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) -
[jira] [Created] (SPARK-43415) Impl mapValues for KVGDS#mapValues
Zhen Li created SPARK-43415: --- Summary: Impl mapValues for KVGDS#mapValues Key: SPARK-43415 URL: https://issues.apache.org/jira/browse/SPARK-43415 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Zhen Li Use a resolved function to pass the mapValues along with all the aggExprs. Then, on the server side, unfold it to apply mapValues first before running the aggregate. e.g. https://github.com/apache/spark/commit/a234a9b0851ebce87c0ef831b24866f94f0c0d36 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
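For reference, the user-facing surface this ticket brings to the Connect Scala client mirrors the existing Dataset API; the sketch below (with invented data, runnable in the Spark shell) shows that surface rather than the Connect plumbing itself:
{code:java}
import org.apache.spark.sql.SparkSession

case class Sale(store: String, amount: Long)

val spark = SparkSession.builder().master("local[*]").appName("mapValues-sketch").getOrCreate()
import spark.implicits._

val sales = Seq(Sale("a", 10L), Sale("a", 5L), Sale("b", 7L)).toDS()

// mapValues rewrites the value side of each group before the aggregation runs;
// per the description above, the server is expected to unfold this into a
// projection applied ahead of the aggregate.
val totals = sales
  .groupByKey(_.store)
  .mapValues(_.amount)
  .reduceGroups(_ + _)

totals.show()
{code}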
[jira] [Resolved] (SPARK-43414) Fix flakiness in Kafka RDD suites due to port binding configuration issue
[ https://issues.apache.org/jira/browse/SPARK-43414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43414. --- Fix Version/s: 3.4.1 Resolution: Fixed Issue resolved by pull request 41095 [https://github.com/apache/spark/pull/41095] > Fix flakiness in Kafka RDD suites due to port binding configuration issue > - > > Key: SPARK-43414 > URL: https://issues.apache.org/jira/browse/SPARK-43414 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.4.1 > > > In SPARK-36837 we updated Kafka to 3.10, which uses a different set of > configuration options for configuring the broker listener port. That PR only > updated one of two KafkaTestUtils files (the SQL one), so the other one (used > by Core tests) had an ineffective port binding configuration and would bind > to the default 9092 port. This could lead to flakiness if multiple suites > binding to that port ran in parallel. > To fix this, we just need to copy the updated port binding configuration from > the other suite. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43398) Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout
[ https://issues.apache.org/jira/browse/SPARK-43398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43398: -- Affects Version/s: 3.4.0 3.3.2 3.2.4 3.1.3 > Executor timeout should be max of idleTimeout rddTimeout shuffleTimeout > --- > > Key: SPARK-43398 > URL: https://issues.apache.org/jira/browse/SPARK-43398 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.3, 3.2.4, 3.3.2, 3.4.0 >Reporter: Zhongwei Zhu >Priority: Major > > When dynamic allocation enabled, Executor timeout should be max of > idleTimeout, rddTimeout and shuffleTimeout. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
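A minimal sketch of the selection rule named in the title, using invented names rather than the actual ExecutorMonitor fields: the effective timeout should be the maximum of the timeouts that apply to what the executor still holds.
{code:java}
// Under dynamic allocation, an executor holding cached RDD blocks or shuffle data
// should only be removed after the largest applicable timeout, not after the plain
// idle timeout.
def effectiveTimeoutMs(
    idleTimeoutMs: Long,
    rddTimeoutMs: Long,
    shuffleTimeoutMs: Long,
    hasCachedBlocks: Boolean,
    hasShuffleData: Boolean): Long = {
  var timeout = idleTimeoutMs
  if (hasCachedBlocks) timeout = math.max(timeout, rddTimeoutMs)
  if (hasShuffleData) timeout = math.max(timeout, shuffleTimeoutMs)
  timeout
}
{code}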
[jira] [Created] (SPARK-43416) Fix the bug where the ProductEncoder#tuple field names are different from the server
Zhen Li created SPARK-43416: --- Summary: Fix the bug where the ProductEncoder#tuple field names are different from the server Key: SPARK-43416 URL: https://issues.apache.org/jira/browse/SPARK-43416 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Zhen Li The fields are named _1, _2, ... etc. However, on the server side the columns can have descriptive names from agg operations, such as key and value. Fix this if possible. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
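The mismatch is easy to observe with the classic Dataset API, shown below purely for illustration; the ticket itself concerns the Connect Scala client's tuple encoder:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("tuple-names-sketch").getOrCreate()
import spark.implicits._

// The typed result is a Dataset[(String, Long)], so the tuple encoder expects
// fields named _1 and _2, while the aggregation produces descriptive column
// names on the server side (e.g. key and count(1)).
val counts = Seq("a", "a", "b").toDS().groupByKey(identity).count()
counts.printSchema()
{code}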
[jira] [Resolved] (SPARK-43404) Filter current version while reusing sst files for RocksDB state store provider while uploading to DFS to prevent id mismatch
[ https://issues.apache.org/jira/browse/SPARK-43404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-43404. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41089 [https://github.com/apache/spark/pull/41089] > Filter current version while reusing sst files for RocksDB state store > provider while uploading to DFS to prevent id mismatch > - > > Key: SPARK-43404 > URL: https://issues.apache.org/jira/browse/SPARK-43404 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Fix For: 3.5.0 > > > Filter current version while reusing sst files for RocksDB state store > provider while uploading to DFS to prevent id mismatch -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43404) Filter current version while reusing sst files for RocksDB state store provider while uploading to DFS to prevent id mismatch
[ https://issues.apache.org/jira/browse/SPARK-43404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-43404: Assignee: Anish Shrigondekar > Filter current version while reusing sst files for RocksDB state store > provider while uploading to DFS to prevent id mismatch > - > > Key: SPARK-43404 > URL: https://issues.apache.org/jira/browse/SPARK-43404 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > > Filter current version while reusing sst files for RocksDB state store > provider while uploading to DFS to prevent id mismatch -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43417) Improve CBO stats
Huaxin Gao created SPARK-43417: -- Summary: Improve CBO stats Key: SPARK-43417 URL: https://issues.apache.org/jira/browse/SPARK-43417 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.5.0 Reporter: Huaxin Gao While experimenting with the DS V2 column stats, we identified areas where we could potentially improve. For instance, we can probably propagate the NDV through Union, and add min/max for varchar columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43397) Log executor decommission duration
[ https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43397: - Assignee: Zhongwei Zhu > Log executor decommission duration > -- > > Key: SPARK-43397 > URL: https://issues.apache.org/jira/browse/SPARK-43397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > > Log executor decommission duration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43397) Log executor decommission duration in executorLost method
[ https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43397: -- Affects Version/s: 3.5.0 (was: 3.4.0) > Log executor decommission duration in executorLost method > - > > Key: SPARK-43397 > URL: https://issues.apache.org/jira/browse/SPARK-43397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.5.0 > > > Log executor decommission duration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43397) Log executor decommission duration in executorLost method
[ https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-43397: -- Summary: Log executor decommission duration in executorLost method (was: Log executor decommission duration) > Log executor decommission duration in executorLost method > - > > Key: SPARK-43397 > URL: https://issues.apache.org/jira/browse/SPARK-43397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.5.0 > > > Log executor decommission duration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43397) Log executor decommission duration
[ https://issues.apache.org/jira/browse/SPARK-43397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43397. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41077 [https://github.com/apache/spark/pull/41077] > Log executor decommission duration > -- > > Key: SPARK-43397 > URL: https://issues.apache.org/jira/browse/SPARK-43397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.5.0 > > > Log executor decommission duration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43401) Upgrade buf to v1.18.0
[ https://issues.apache.org/jira/browse/SPARK-43401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-43401. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41087 [https://github.com/apache/spark/pull/41087] > Upgrade buf to v1.18.0 > -- > > Key: SPARK-43401 > URL: https://issues.apache.org/jira/browse/SPARK-43401 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43401) Upgrade buf to v1.18.0
[ https://issues.apache.org/jira/browse/SPARK-43401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-43401: - Assignee: BingKun Pan > Upgrade buf to v1.18.0 > -- > > Key: SPARK-43401 > URL: https://issues.apache.org/jira/browse/SPARK-43401 > Project: Spark > Issue Type: Improvement > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43418) add SparkSession.Builder.getOrCreate()
Herman van Hövell created SPARK-43418: - Summary: add SparkSession.Builder.getOrCreate() Key: SPARK-43418 URL: https://issues.apache.org/jira/browse/SPARK-43418 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell Assignee: Herman van Hövell Add SparkSession.Builder.getOrCreate. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
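A sketch of the intended usage from the Connect Scala client once the builder gains getOrCreate; the connection string is a placeholder:
{code:java}
import org.apache.spark.sql.SparkSession

// "sc://localhost:15002" is an example Spark Connect endpoint, not a required value.
val spark = SparkSession
  .builder()
  .remote("sc://localhost:15002")
  .getOrCreate()

spark.range(5).show()
{code}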
[jira] [Updated] (SPARK-42514) Scala Client add partition transforms functions
[ https://issues.apache.org/jira/browse/SPARK-42514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-42514: - Epic Link: SPARK-42554 > Scala Client add partition transforms functions > --- > > Key: SPARK-42514 > URL: https://issues.apache.org/jira/browse/SPARK-42514 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43296) Migrate Spark Connect session errors into error class
[ https://issues.apache.org/jira/browse/SPARK-43296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43296: - Assignee: Haejoon Lee > Migrate Spark Connect session errors into error class > - > > Key: SPARK-43296 > URL: https://issues.apache.org/jira/browse/SPARK-43296 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > Migrate Spark Connect session errors into error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43296) Migrate Spark Connect session errors into error class
[ https://issues.apache.org/jira/browse/SPARK-43296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43296. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40964 [https://github.com/apache/spark/pull/40964] > Migrate Spark Connect session errors into error class > - > > Key: SPARK-43296 > URL: https://issues.apache.org/jira/browse/SPARK-43296 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.5.0 > > > Migrate Spark Connect session errors into error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43357) Spark AWS Glue date partition push down broken
[ https://issues.apache.org/jira/browse/SPARK-43357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720763#comment-17720763 ] Snoot.io commented on SPARK-43357: -- User 'stijndehaes' has created a pull request for this issue: https://github.com/apache/spark/pull/41035 > Spark AWS Glue date partition push down broken > -- > > Key: SPARK-43357 > URL: https://issues.apache.org/jira/browse/SPARK-43357 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, > 3.3.1, 3.2.3, 3.2.4, 3.3.2 >Reporter: Stijn De Haes >Priority: Major > > When using the following project: > [https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore] > To have glue supported as as a hive metastore for spark there is an issue > when reading a date-partitioned data set. Writing is fine. > You get the following error: > {quote}org.apache.hadoop.hive.metastore.api.InvalidObjectException: > Unsupported expression '2023 - 05 - 03' (Service: AWSGlue; Status Code: 400; > Error Code: InvalidInputException; Request ID: > beed68c6-b228-442e-8783-52c25b9d2243; Proxy: null) > {quote} > > A fix for this is making sure the date passed to glue is quoted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
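Based on the description above, the failure comes down to how the partition predicate string handed to Glue is rendered; the strings below are illustrative only:
{code:java}
// Unquoted, Glue parses the date as the arithmetic expression 2023 - 05 - 03
// and rejects it with InvalidInputException.
val broken = "dt=2023-05-03"

// Quoting the literal, as the proposed fix does, gives Glue a valid expression.
val fixed = "dt='2023-05-03'"
{code}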
[jira] [Resolved] (SPARK-43343) Spark Streaming is not able to read a .txt file whose name has [] special character
[ https://issues.apache.org/jira/browse/SPARK-43343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-43343. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41022 [https://github.com/apache/spark/pull/41022] > Spark Streaming is not able to read a .txt file whose name has [] special > character > --- > > Key: SPARK-43343 > URL: https://issues.apache.org/jira/browse/SPARK-43343 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Minor > Fix For: 3.5.0 > > > * For example, If a directory contains a following file: > /path/abc[123] > and users would load spark.readStream.format("text").load("/path") as stream > input. It throws an exception, saying no matching path /path/abc[123]. Spark > thinks abc[123] is a regex that only matches file named abc1, abc2 and abc3. > * Upon investigation this is due to how we > [getBatch|https://github.com/databricks/runtime/blob/3af402d23620a0952e151d96c3184d2233217c87/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L269] > in the FileStreamSource. In `FileStreamSource` we already check file pattern > matching and find all match file names. However, in DataSource we check for > glob characters again and try to expend it > [here|https://github.com/databricks/runtime/blob/3af402d23620a0952e151d96c3184d2233217c87/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L274]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43343) Spark Streaming is not able to read a .txt file whose name has [] special character
[ https://issues.apache.org/jira/browse/SPARK-43343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-43343: Assignee: Siying Dong > Spark Streaming is not able to read a .txt file whose name has [] special > character > --- > > Key: SPARK-43343 > URL: https://issues.apache.org/jira/browse/SPARK-43343 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Minor > > * For example, If a directory contains a following file: > /path/abc[123] > and users would load spark.readStream.format("text").load("/path") as stream > input. It throws an exception, saying no matching path /path/abc[123]. Spark > thinks abc[123] is a regex that only matches file named abc1, abc2 and abc3. > * Upon investigation this is due to how we > [getBatch|https://github.com/databricks/runtime/blob/3af402d23620a0952e151d96c3184d2233217c87/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L269] > in the FileStreamSource. In `FileStreamSource` we already check file pattern > matching and find all match file names. However, in DataSource we check for > glob characters again and try to expend it > [here|https://github.com/databricks/runtime/blob/3af402d23620a0952e151d96c3184d2233217c87/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L274]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42668) Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort
[ https://issues.apache.org/jira/browse/SPARK-42668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720769#comment-17720769 ] Snoot.io commented on SPARK-42668: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/41098 > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > --- > > Key: SPARK-42668 > URL: https://issues.apache.org/jira/browse/SPARK-42668 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Fix For: 3.5.0 > > > Catch exception while trying to close compressed stream in > HDFSStateStoreProvider abort > We have seen some cases where the task exits as cancelled/failed which > triggers the abort in the task completion listener for > HDFSStateStoreProvider. As part of this, we cancel the backing stream and > close the compressed stream. However, different stores such as Azure blob > store could throw exceptions which are not caught in the current path, > leading to job failures. This change proposes to fix this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
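A minimal sketch of the defensive close described above, with an invented helper name rather than the actual HDFSStateStoreProvider code:
{code:java}
import java.io.OutputStream
import scala.util.control.NonFatal

// During abort, closing the compressed stream is best-effort cleanup: the backing
// stream may already be cancelled, and stores such as Azure blob storage can throw
// here. Swallow non-fatal errors instead of failing the task.
def closeQuietlyOnAbort(compressed: OutputStream): Unit = {
  try {
    compressed.close()
  } catch {
    case NonFatal(e) =>
      Console.err.println(s"Ignoring error while closing compressed stream during abort: $e")
  }
}
{code}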
[jira] [Commented] (SPARK-43014) spark.app.submitTime is not right in k8s cluster mode
[ https://issues.apache.org/jira/browse/SPARK-43014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720770#comment-17720770 ] Snoot.io commented on SPARK-43014: -- User 'zhouyifan279' has created a pull request for this issue: https://github.com/apache/spark/pull/40645 > spark.app.submitTime is not right in k8s cluster mode > - > > Key: SPARK-43014 > URL: https://issues.apache.org/jira/browse/SPARK-43014 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2, 3.4.0 >Reporter: Zhou Yifan >Priority: Major > > If submit Spark in k8s cluster mode, `spark.app.submitTime` will be > overwritten when driver starts. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores
Fei Wang created SPARK-43419: Summary: [K8S] Make limit.cores be able to be fallen back to request.cores Key: SPARK-43419 URL: https://issues.apache.org/jira/browse/SPARK-43419 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.4.0 Reporter: Fei Wang make limit.cores be able to be fallen back to request.cores If spark.kubernetes.executor/driver.limit.cores, treat request.cores as limit.cores. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores
[ https://issues.apache.org/jira/browse/SPARK-43419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43419: - Description: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores, treat request.cores as limit.cores. was: make limit.cores be able to be fallen back to request.cores If spark.kubernetes.executor/driver.limit.cores, treat request.cores as limit.cores. > [K8S] Make limit.cores be able to be fallen back to request.cores > - > > Key: SPARK-43419 > URL: https://issues.apache.org/jira/browse/SPARK-43419 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > make limit.cores be able to be fallen back to request.cores > now without limit.cores, we will meet below issue: > {code:java} > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: > Forbidden!Configured service account doesn't have access. Service account may > have been revoked. pods "" is forbidden: failed quota: > high-qos-limit-requests: must specify limits.cpu. {code} > If spark.kubernetes.executor/driver.limit.cores, treat request.cores as > limit.cores. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores
[ https://issues.apache.org/jira/browse/SPARK-43419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43419: - Description: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores, how about treat request.cores as limit.cores? was: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores, treat request.cores as limit.cores. > [K8S] Make limit.cores be able to be fallen back to request.cores > - > > Key: SPARK-43419 > URL: https://issues.apache.org/jira/browse/SPARK-43419 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > make limit.cores be able to be fallen back to request.cores > now without limit.cores, we will meet below issue: > {code:java} > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: > Forbidden!Configured service account doesn't have access. Service account may > have been revoked. pods "" is forbidden: failed quota: > high-qos-limit-requests: must specify limits.cpu. {code} > If spark.kubernetes.executor/driver.limit.cores, how about treat > request.cores as limit.cores? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43413) IN subquery ListQuery has wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720780#comment-17720780 ] ci-cassandra.apache.org commented on SPARK-43413: - User 'jchen5' has created a pull request for this issue: https://github.com/apache/spark/pull/41094 > IN subquery ListQuery has wrong nullability > --- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > IN subquery expressions are incorrectly marked as non-nullable, even when > they are actually nullable. They correctly check the nullability of the > left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is > currently defined with nullability = false always. This is incorrect and can > lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43419) [K8S] Make limit.cores be able to be fallen back to request.cores
[ https://issues.apache.org/jira/browse/SPARK-43419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-43419: - Description: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores is not specified, how about treating request.cores as limit.cores? was: make limit.cores be able to be fallen back to request.cores now without limit.cores, we will meet below issue: {code:java} io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "" is forbidden: failed quota: high-qos-limit-requests: must specify limits.cpu. {code} If spark.kubernetes.executor/driver.limit.cores, how about treat request.cores as limit.cores? > [K8S] Make limit.cores be able to be fallen back to request.cores > - > > Key: SPARK-43419 > URL: https://issues.apache.org/jira/browse/SPARK-43419 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Fei Wang >Priority: Major > > make limit.cores be able to be fallen back to request.cores > now without limit.cores, we will meet below issue: > {code:java} > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https:///api/v1/namespaces/hadooptessns/pods. Message: > Forbidden!Configured service account doesn't have access. Service account may > have been revoked. pods "" is forbidden: failed quota: > high-qos-limit-requests: must specify limits.cpu. {code} > If spark.kubernetes.executor/driver.limit.cores is not specified, how about > treating request.cores as limit.cores? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
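A sketch of the proposed fallback, with invented names rather than the actual Kubernetes conf code:
{code:java}
// If spark.kubernetes.{driver,executor}.limit.cores is unset, reuse request.cores
// as the CPU limit so quotas that require limits.cpu are satisfied.
def effectiveLimitCores(requestCores: Option[String], limitCores: Option[String]): Option[String] =
  limitCores.orElse(requestCores)

// effectiveLimitCores(Some("2"), None) == Some("2")
// effectiveLimitCores(Some("2"), Some("4")) == Some("4")
{code}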
[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720788#comment-17720788 ] Kent Yao commented on SPARK-43338: -- Support multiple Hive metastore servers is a big feature, which can not be achieved by making SESSION_CATALOG_NAME variable cc [~cloud_fan] > Support modify the SESSION_CATALOG_NAME value > -- > > Key: SPARK-43338 > URL: https://issues.apache.org/jira/browse/SPARK-43338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > {code:java} > private[sql] object CatalogManager { > val SESSION_CATALOG_NAME: String = "spark_catalog" > }{code} > > The SESSION_CATALOG_NAME value cannot be modified。 > If multiple Hive Metastores exist, the platform manages multiple hms metadata > and classifies them by catalogName. A different catalog name is required > [~yao] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720794#comment-17720794 ] melin commented on SPARK-43338: --- That would be a bigger feature than what I am asking for. Only one HMS is accessed in a SparkSession; I just want the spark_catalog name to be modifiable. For example, if you have two Hadoop clusters, there would be two HMS instances. A metadata management platform (similar to Databricks Unity Catalog) collects the HMS metadata and, to keep identifiers unique, adds a catalogName (table id: catalogName.schemaName.tableName). When Spark accesses Hive tables, the catalog name should match the catalogName of the table id instead of spark_catalog. > Support modify the SESSION_CATALOG_NAME value > -- > > Key: SPARK-43338 > URL: https://issues.apache.org/jira/browse/SPARK-43338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > {code:java} > private[sql] object CatalogManager { > val SESSION_CATALOG_NAME: String = "spark_catalog" > }{code} > > The SESSION_CATALOG_NAME value cannot be modified。 > If multiple Hive Metastores exist, the platform manages multiple hms metadata > and classifies them by catalogName. A different catalog name is required > [~yao] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43420) Make DisableUnnecessaryBucketedScan smart with table cache
XiDuo You created SPARK-43420: - Summary: Make DisableUnnecessaryBucketedScan smart with table cache Key: SPARK-43420 URL: https://issues.apache.org/jira/browse/SPARK-43420 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You If a bucketed scan has no interesting partition or contains a shuffle exchange, we would disable it. But if the bucketed scan is inside a table cache, the cached plan may be accessed multiple times, so we should not disable it: keeping it preserves the output partitioning and makes the cached plan more likely to be reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
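An illustrative scenario for this change (table names invented, SparkSession `spark` assumed): a bucketed table whose scan sits inside a cached plan that several queries reuse.
{code:java}
// Assumes `events` is a table bucketed by `id`. If the bucketed scan inside the
// cached plan were disabled, the cached data would lose its output partitioning
// and each reuse could pay an extra shuffle on the bucket column.
spark.sql("CACHE TABLE cached_events AS SELECT * FROM events")

// Both queries reuse the cached, bucketed data and can benefit from the preserved
// partitioning.
spark.sql("SELECT id, COUNT(*) FROM cached_events GROUP BY id").show()
spark.sql("SELECT c.id, d.name FROM cached_events c JOIN dim d ON c.id = d.id").show()
{code}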
[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value
[ https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720798#comment-17720798 ] Kent Yao commented on SPARK-43338: -- Why not just use the Catalog V2 API to implement a separate catalog extension with Hive support? Or use an existing one: https://kyuubi.readthedocs.io/en/v1.7.1-rc0/connector/spark/hive.html > Support modify the SESSION_CATALOG_NAME value > -- > > Key: SPARK-43338 > URL: https://issues.apache.org/jira/browse/SPARK-43338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > {code:java} > private[sql] object CatalogManager { > val SESSION_CATALOG_NAME: String = "spark_catalog" > }{code} > > The SESSION_CATALOG_NAME value cannot be modified。 > If multiple Hive Metastores exist, the platform manages multiple hms metadata > and classifies them by catalogName. A different catalog name is required > [~yao] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
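To make the suggestion concrete, here is a hedged sketch of registering an additional Hive catalog under its own name through the v2 catalog API; the catalog name, metastore URI, and table are placeholders, and the connector class name is taken from the Kyuubi Spark Hive connector and should be verified against the linked documentation:
{code:java}
import org.apache.spark.sql.SparkSession

// Each extra metastore gets its own catalog name; spark_catalog stays untouched.
val spark = SparkSession.builder()
  .appName("multi-hms-sketch")
  .config("spark.sql.catalog.hive_prod", "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
  .config("spark.sql.catalog.hive_prod.hive.metastore.uris", "thrift://metastore-prod:9083")
  .getOrCreate()

// Tables are then addressed as catalogName.schemaName.tableName.
spark.sql("SELECT * FROM hive_prod.sales_db.orders").show()
{code}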
[jira] [Created] (SPARK-43422) Tags are lost on LogicalRelation when adding _metadata
Jan-Ole Sasse created SPARK-43422: - Summary: Tags are lost on LogicalRelation when adding _metadata Key: SPARK-43422 URL: https://issues.apache.org/jira/browse/SPARK-43422 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.4.0 Reporter: Jan-Ole Sasse The AddMetadataColumns rule does not copy the tags of the LogicalRelation when adding the metadata output in addMetadataCol. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org