[jira] [Commented] (SPARK-42650) link issue SPARK-42550
[ https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696070#comment-17696070 ] XiDuo You commented on SPARK-42650: --- To be clear, this is an issue in Spark 3.2.3; Spark 3.2.1, 3.3.x and master are fine. It can be reproduced by: {code:java} CREATE TABLE IF NOT EXISTS spark32_overwrite(amt1 int) STORED AS ORC; CREATE TABLE IF NOT EXISTS spark32_overwrite2(amt1 long) STORED AS ORC; INSERT OVERWRITE TABLE spark32_overwrite2 select 644164; set spark.sql.ansi.enabled=true; INSERT OVERWRITE TABLE spark32_overwrite select amt1 from (select cast(amt1 as int) as amt1 from spark32_overwrite2 distribute by amt1); {code} > link issue SPARK-42550 > -- > > Key: SPARK-42650 > URL: https://issues.apache.org/jira/browse/SPARK-42650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: kevinshin >Priority: Major > > When using > [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/], > if an `insert overwrite` statement hits an exception, a non-partitioned table's > home directory will be lost, and a partitioned table will lose its partition > directories. > > My spark-defaults.conf config: > spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension > > Because I can't reopen SPARK-42550, please see that issue for details and > reproduction steps: > https://issues.apache.org/jira/browse/SPARK-42550 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42660) Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)
[ https://issues.apache.org/jira/browse/SPARK-42660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696059#comment-17696059 ] Apache Spark commented on SPARK-42660: -- User 'mskapilks' has created a pull request for this issue: https://github.com/apache/spark/pull/40266 > Infer filters for Join produced by IN and EXISTS clause > (RewritePredicateSubquery rule) > --- > > Key: SPARK-42660 > URL: https://issues.apache.org/jira/browse/SPARK-42660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Kapil Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42660) Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)
[ https://issues.apache.org/jira/browse/SPARK-42660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42660: Assignee: Apache Spark > Infer filters for Join produced by IN and EXISTS clause > (RewritePredicateSubquery rule) > --- > > Key: SPARK-42660 > URL: https://issues.apache.org/jira/browse/SPARK-42660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Kapil Singh >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42660) Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)
[ https://issues.apache.org/jira/browse/SPARK-42660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42660: Assignee: (was: Apache Spark) > Infer filters for Join produced by IN and EXISTS clause > (RewritePredicateSubquery rule) > --- > > Key: SPARK-42660 > URL: https://issues.apache.org/jira/browse/SPARK-42660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Kapil Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42609) Add tests for grouping() and grouping_id() functions
[ https://issues.apache.org/jira/browse/SPARK-42609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42609. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40259 [https://github.com/apache/spark/pull/40259] > Add tests for grouping() and grouping_id() functions > > > Key: SPARK-42609 > URL: https://issues.apache.org/jira/browse/SPARK-42609 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.
[ https://issues.apache.org/jira/browse/SPARK-42556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42556: Assignee: Apache Spark > Dataset.colregex should link a plan_id when it only matches a single column. > > > Key: SPARK-42556 > URL: https://issues.apache.org/jira/browse/SPARK-42556 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > > When colregex returns a single column it should link the plan's plan_id. For > reference, here is the non-connect Dataset code that does this: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512] > This also needs to be fixed for the Python client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42660) Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)
Kapil Singh created SPARK-42660: --- Summary: Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule) Key: SPARK-42660 URL: https://issues.apache.org/jira/browse/SPARK-42660 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.1 Reporter: Kapil Singh -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.
[ https://issues.apache.org/jira/browse/SPARK-42556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42556: Assignee: (was: Apache Spark) > Dataset.colregex should link a plan_id when it only matches a single column. > > > Key: SPARK-42556 > URL: https://issues.apache.org/jira/browse/SPARK-42556 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > When colregex returns a single column it should link the plan's plan_id. For > reference, here is the non-connect Dataset code that does this: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512] > This also needs to be fixed for the Python client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.
[ https://issues.apache.org/jira/browse/SPARK-42556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696046#comment-17696046 ] Apache Spark commented on SPARK-42556: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40265 > Dataset.colregex should link a plan_id when it only matches a single column. > > > Key: SPARK-42556 > URL: https://issues.apache.org/jira/browse/SPARK-42556 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > When colregex returns a single column it should link the plan's plan_id. For > reference, here is the non-connect Dataset code that does this: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512] > This also needs to be fixed for the Python client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42473) An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL
[ https://issues.apache.org/jira/browse/SPARK-42473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-42473. - Fix Version/s: 3.3.3 Assignee: Runyao.Chen Resolution: Fixed > An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL > -- > > Key: SPARK-42473 > URL: https://issues.apache.org/jira/browse/SPARK-42473 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.3.1 > Environment: spark 3.3.1 >Reporter: kevinshin >Assignee: Runyao.Chen >Priority: Major > Fix For: 3.3.3 > > > *When a 'union all' query uses a literal as the column value in one select > statement and a computed expression in the same column of the other select > statement, the whole statement fails to compile. An explicit cast is needed.* > For example: > explain > INSERT OVERWRITE TABLE test.spark33_decimal_orc > select null as amt1, cast('256.99' as decimal(20,8)) as amt2 > union all > select cast('200.99' as decimal(20,8))/100 as amt1, cast('256.99' as decimal(20,8)) as amt2; > *will produce the error:* > org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to > org.apache.spark.sql.catalyst.expressions.AnsiCast > The SQL needs to be changed to: > explain > INSERT OVERWRITE TABLE test.spark33_decimal_orc > select null as amt1, cast('256.99' as decimal(20,8)) as amt2 > union all > select cast(cast('200.99' as decimal(20,8))/100 as decimal(20,8)) as amt1, cast('256.99' as decimal(20,8)) as amt2; > > *But this is not needed in Spark 3.2.1; is this a bug in Spark 3.3.1?* -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696042#comment-17696042 ] Apache Spark commented on SPARK-42635: -- User 'chenhao-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40264 > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Assignee: Chenhao Li >Priority: Major > Fix For: 3.4.1 > > > 1. When the time is close to a daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second sets the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is that the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. > The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. > {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking > overflow. Note that we do have overflow checking when adding the amount to the > timestamp, so the behavior is inconsistent. > This can cause counter-intuitive results like this: > {code:scala} > scala> spark.sql("select timestampadd(quarter, 1431655764, > timestamp'1970-01-01')").show > +--+ > |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')| > +--+ > | 1969-09-01 00:00:00| > +--+{code} > 3. Adding sub-month units (week, day, hour, minute, second, millisecond, > microsecond) silently ignores Long overflow during unit conversion. > This is similar to the previous problem: > {code:scala} > scala> spark.sql("select timestampadd(day, 106751992, > timestamp'1970-01-01')").show(false) > +-+ > |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')| > +-+ > |-290308-12-22 15:58:10.448384| > +-+{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
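[Editor's note] For context on problem 1 above, here is a minimal sketch of DST-safe arithmetic, not the actual patch merged for SPARK-42635: java.time's ZonedDateTime adds time-based units on the instant time-line, so it never assumes a fixed day length, and Math.multiplyExact addresses the silent-overflow problems 2 and 3. All names below are illustrative only.

{code:scala}
import java.time.{ZoneId, ZonedDateTime}

// Illustrative sketch only, not the fix that was merged for SPARK-42635.
object DstSafeAddSketch {
  // Time-based addition on ZonedDateTime operates on the instant time-line,
  // so daylight saving transitions are handled correctly.
  def addSeconds(ts: ZonedDateTime, seconds: Long): ZonedDateTime =
    ts.plusSeconds(seconds)

  // For problems 2 and 3: multiplyExact throws ArithmeticException on
  // overflow instead of silently wrapping.
  def quartersToMonths(quarters: Int): Int = Math.multiplyExact(quarters, 3)

  def main(args: Array[String]): Unit = {
    val la = ZoneId.of("America/Los_Angeles")
    val start = ZonedDateTime.of(2011, 3, 12, 3, 0, 0, 0, la)
    // 2011-03-13 in Los Angeles has only 23 hours, so adding 24 * 3600
    // seconds lands at 04:00 local time, one hour after 86399 seconds:
    println(addSeconds(start, 24L * 3600 - 1)) // 2011-03-13T03:59:59-07:00[America/Los_Angeles]
    println(addSeconds(start, 24L * 3600))     // 2011-03-13T04:00-07:00[America/Los_Angeles]
  }
}
{code}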
[jira] [Commented] (SPARK-42650) link issue SPARK-42550
[ https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696037#comment-17696037 ] kevinshin commented on SPARK-42650: --- Spark and Kyuubi both belong to Apache. Could the Apache community help figure out the details of this issue? Will this issue persist in the next releases? > link issue SPARK-42550 > -- > > Key: SPARK-42650 > URL: https://issues.apache.org/jira/browse/SPARK-42650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: kevinshin >Priority: Major > > When using > [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/], > if an `insert overwrite` statement hits an exception, a non-partitioned table's > home directory will be lost, and a partitioned table will lose its partition > directories. > > My spark-defaults.conf config: > spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension > > Because I can't reopen SPARK-42550, please see that issue for details and > reproduction steps: > https://issues.apache.org/jira/browse/SPARK-42550 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42635. -- Fix Version/s: 3.4.1 Resolution: Fixed Issue resolved by pull request 40237 [https://github.com/apache/spark/pull/40237] > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Assignee: Chenhao Li >Priority: Major > Fix For: 3.4.1 > > > 1. When the time is close to a daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second sets the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is that the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. > The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. > {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking > overflow. Note that we do have overflow checking when adding the amount to the > timestamp, so the behavior is inconsistent. > This can cause counter-intuitive results like this: > {code:scala} > scala> spark.sql("select timestampadd(quarter, 1431655764, > timestamp'1970-01-01')").show > +--+ > |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')| > +--+ > | 1969-09-01 00:00:00| > +--+{code} > 3. Adding sub-month units (week, day, hour, minute, second, millisecond, > microsecond) silently ignores Long overflow during unit conversion. > This is similar to the previous problem: > {code:scala} > scala> spark.sql("select timestampadd(day, 106751992, > timestamp'1970-01-01')").show(false) > +-+ > |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')| > +-+ > |-290308-12-22 15:58:10.448384| > +-+{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression
[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42635: Assignee: Chenhao Li > Several counter-intuitive behaviours in the TimestampAdd expression > --- > > Key: SPARK-42635 > URL: https://issues.apache.org/jira/browse/SPARK-42635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Chenhao Li >Assignee: Chenhao Li >Priority: Major > > > 1. When the time is close to a daylight saving time transition, the result may > be discontinuous and not monotonic. > We currently have: > {code:scala} > scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, > timestamp'2011-03-12 03:00:00')").show > ++ > |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')| > ++ > | 2011-03-13 03:59:59| > ++ > scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 > 03:00:00')").show > +--+ > |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')| > +--+ > | 2011-03-13 03:00:00| > +--+ {code} > > In the second query, adding one more second sets the time back one hour > instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12 > 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due > to the daylight saving time transition. > The root cause of the problem is that the Spark code at > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790] > wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and does the day > and time-in-day split before looking at the timezone. > 2. Adding month, quarter, and year silently ignores Int overflow during unit > conversion. > The root cause is > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246]. > {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking > overflow. Note that we do have overflow checking when adding the amount to the > timestamp, so the behavior is inconsistent. > This can cause counter-intuitive results like this: > {code:scala} > scala> spark.sql("select timestampadd(quarter, 1431655764, > timestamp'1970-01-01')").show > +--+ > |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')| > +--+ > | 1969-09-01 00:00:00| > +--+{code} > 3. Adding sub-month units (week, day, hour, minute, second, millisecond, > microsecond) silently ignores Long overflow during unit conversion. > This is similar to the previous problem: > {code:scala} > scala> spark.sql("select timestampadd(day, 106751992, > timestamp'1970-01-01')").show(false) > +-+ > |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')| > +-+ > |-290308-12-22 15:58:10.448384| > +-+{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42656) Spark Connect Scala Client Shell Script
[ https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42656. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40257 [https://github.com/apache/spark/pull/40257] > Spark Connect Scala Client Shell Script > --- > > Key: SPARK-42656 > URL: https://issues.apache.org/jira/browse/SPARK-42656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Major > Fix For: 3.4.0 > > > Adding a shell script to run the Scala client in a Scala REPL, allowing users to > connect to Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42656) Spark Connect Scala Client Shell Script
[ https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42656: Assignee: Zhen Li > Spark Connect Scala Client Shell Script > --- > > Key: SPARK-42656 > URL: https://issues.apache.org/jira/browse/SPARK-42656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Major > > Adding a shell script to run the Scala client in a Scala REPL, allowing users to > connect to Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42659) Reimplement `FPGrowthModel.transform` with dataframe operations
[ https://issues.apache.org/jira/browse/SPARK-42659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42659: Assignee: (was: Apache Spark) > Reimplement `FPGrowthModel.transform` with dataframe operations > --- > > Key: SPARK-42659 > URL: https://issues.apache.org/jira/browse/SPARK-42659 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42659) Reimplement `FPGrowthModel.transform` with dataframe operations
[ https://issues.apache.org/jira/browse/SPARK-42659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42659: Assignee: Apache Spark > Reimplement `FPGrowthModel.transform` with dataframe operations > --- > > Key: SPARK-42659 > URL: https://issues.apache.org/jira/browse/SPARK-42659 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42659) Reimplement `FPGrowthModel.transform` with dataframe operations
[ https://issues.apache.org/jira/browse/SPARK-42659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696019#comment-17696019 ] Apache Spark commented on SPARK-42659: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40263 > Reimplement `FPGrowthModel.transform` with dataframe operations > --- > > Key: SPARK-42659 > URL: https://issues.apache.org/jira/browse/SPARK-42659 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42659) Reimplement `FPGrowthModel.transform` with dataframe operations
Ruifeng Zheng created SPARK-42659: - Summary: Reimplement `FPGrowthModel.transform` with dataframe operations Key: SPARK-42659 URL: https://issues.apache.org/jira/browse/SPARK-42659 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42651) Optimize global sort to driver sort
[ https://issues.apache.org/jira/browse/SPARK-42651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42651: Assignee: (was: Apache Spark) > Optimize global sort to driver sort > --- > > Key: SPARK-42651 > URL: https://issues.apache.org/jira/browse/SPARK-42651 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > If the size of the plan is small enough, it's more efficient to sort all rows at > the driver side, which saves one shuffle -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
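[Editor's note] As a rough illustration of the trade-off SPARK-42651 targets (a sketch only, not the proposed optimizer rule): a global sort plans a range-partitioning exchange, whereas a small result can simply be collected and sorted on the driver. The `sizeThresholdBytes` value below is a made-up stand-in for whatever size estimate such a rule would consult.

{code:scala}
import org.apache.spark.sql.SparkSession

object DriverSortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("driver-sort-sketch")
      .getOrCreate()

    val df = spark.range(0, 1000).toDF("k")

    // A global sort requires a range-partitioning Exchange before the Sort.
    df.orderBy("k").explain()

    // If the plan's estimated size is below some threshold, the same rows can
    // be collected and sorted on the driver, saving that shuffle.
    val sizeThresholdBytes = 1L << 20 // hypothetical threshold
    val sortedOnDriver = df.collect().sortBy(_.getLong(0))
    println(sortedOnDriver.take(5).mkString(", "))

    spark.stop()
  }
}
{code}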
[jira] [Commented] (SPARK-42651) Optimize global sort to driver sort
[ https://issues.apache.org/jira/browse/SPARK-42651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695991#comment-17695991 ] Apache Spark commented on SPARK-42651: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/40262 > Optimize global sort to driver sort > --- > > Key: SPARK-42651 > URL: https://issues.apache.org/jira/browse/SPARK-42651 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > If the size of the plan is small enough, it's more efficient to sort all rows at > the driver side, which saves one shuffle -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42651) Optimize global sort to driver sort
[ https://issues.apache.org/jira/browse/SPARK-42651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42651: Assignee: Apache Spark > Optimize global sort to driver sort > --- > > Key: SPARK-42651 > URL: https://issues.apache.org/jira/browse/SPARK-42651 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > If the size of the plan is small enough, it's more efficient to sort all rows at > the driver side, which saves one shuffle -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42650) link issue SPARK-42550
[ https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695989#comment-17695989 ] Yuming Wang commented on SPARK-42650: - It seems like a Kyuubi bug? > link issue SPARK-42550 > -- > > Key: SPARK-42650 > URL: https://issues.apache.org/jira/browse/SPARK-42650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: kevinshin >Priority: Major > > When using > [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/], > if an `insert overwrite` statement hits an exception, a non-partitioned table's > home directory will be lost, and a partitioned table will lose its partition > directories. > > My spark-defaults.conf config: > spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension > > Because I can't reopen SPARK-42550, please see that issue for details and > reproduction steps: > https://issues.apache.org/jira/browse/SPARK-42550 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.
[ https://issues.apache.org/jira/browse/SPARK-42556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695965#comment-17695965 ] jiaan.geng commented on SPARK-42556: I'm working on it. > Dataset.colregex should link a plan_id when it only matches a single column. > > > Key: SPARK-42556 > URL: https://issues.apache.org/jira/browse/SPARK-42556 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > When colregex returns a single column it should link the plan's plan_id. For > reference, here is the non-connect Dataset code that does this: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512] > This also needs to be fixed for the Python client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42604) Implement functions.typedlit
[ https://issues.apache.org/jira/browse/SPARK-42604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695964#comment-17695964 ] jiaan.geng commented on SPARK-42604: I will take a look! > Implement functions.typedlit > > > Key: SPARK-42604 > URL: https://issues.apache.org/jira/browse/SPARK-42604 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > We need to add functions.typedlit. This requires a change to the connect > protocol. See SPARK-42579 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42647: - Priority: Minor (was: Major) > Remove aliases from deprecated numpy data types > --- > > Key: SPARK-42647 > URL: https://issues.apache.org/jira/browse/SPARK-42647 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1 >Reporter: Aimilios Tsouvelekakis >Assignee: Aimilios Tsouvelekakis >Priority: Minor > Fix For: 3.3.3, 3.4.1 > > > Numpy has started changing the aliases of some of its data types. This means > that users on the latest version of numpy will face either warnings or > errors, depending on the types they are using. This affects all users > running numpy > 1.20.0. One of the types was fixed back in September with this > [pull request|https://github.com/apache/spark/pull/37817]. > The problem can be split into 2 types: > [numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type > aliases ending in a 0 bit size: np.object0, np.str0, np.bytes0, np.void0, > np.int0, np.uint0, as well as np.bool8, are now deprecated and will eventually > be removed. As of numpy 1.25.0 they give a warning. > [numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using aliases > of builtin types like np.int has been deprecated since numpy 1.20.0 and > removed in numpy 1.24.0. > The changes are needed so pyspark can be compatible with the latest numpy and > avoid > * attribute errors on data types deprecated since version 1.20.0: > [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] > * warnings on data types deprecated as of version 1.24.0: > [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations] > > From my research I see the following: > The only functional changes are related to the conversion.py file. > The rest of the changes are inside tests, in the user guide, or in some > docstrings describing specific functions. Since I am not an expert in these > tests, I will wait for the reviewer and people with more experience in the > pyspark code. > These types are aliases for classic Python types, so they should work with > all numpy versions > [1|https://numpy.org/devdocs/release/1.20.0-notes.html], > [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python]; > the error or warning comes from the call to numpy. > > For the affected versions I chose to include 3.3 and onwards, but I see that 3.2 > is also still in the 18-month maintenance cadence, as it was released in > October 2021. > > The pull request: [https://github.com/apache/spark/pull/40220] > Best Regards, > Aimilios -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-42647: Assignee: Aimilios Tsouvelekakis > Remove aliases from deprecated numpy data types > --- > > Key: SPARK-42647 > URL: https://issues.apache.org/jira/browse/SPARK-42647 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1 >Reporter: Aimilios Tsouvelekakis >Assignee: Aimilios Tsouvelekakis >Priority: Major > > Numpy has started changing the aliases of some of its data types. This means > that users on the latest version of numpy will face either warnings or > errors, depending on the types they are using. This affects all users > running numpy > 1.20.0. One of the types was fixed back in September with this > [pull request|https://github.com/apache/spark/pull/37817]. > The problem can be split into 2 types: > [numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type > aliases ending in a 0 bit size: np.object0, np.str0, np.bytes0, np.void0, > np.int0, np.uint0, as well as np.bool8, are now deprecated and will eventually > be removed. As of numpy 1.25.0 they give a warning. > [numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using aliases > of builtin types like np.int has been deprecated since numpy 1.20.0 and > removed in numpy 1.24.0. > The changes are needed so pyspark can be compatible with the latest numpy and > avoid > * attribute errors on data types deprecated since version 1.20.0: > [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] > * warnings on data types deprecated as of version 1.24.0: > [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations] > > From my research I see the following: > The only functional changes are related to the conversion.py file. > The rest of the changes are inside tests, in the user guide, or in some > docstrings describing specific functions. Since I am not an expert in these > tests, I will wait for the reviewer and people with more experience in the > pyspark code. > These types are aliases for classic Python types, so they should work with > all numpy versions > [1|https://numpy.org/devdocs/release/1.20.0-notes.html], > [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python]; > the error or warning comes from the call to numpy. > > For the affected versions I chose to include 3.3 and onwards, but I see that 3.2 > is also still in the 18-month maintenance cadence, as it was released in > October 2021. > > The pull request: [https://github.com/apache/spark/pull/40220] > Best Regards, > Aimilios -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42647) Remove aliases from deprecated numpy data types
[ https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42647. -- Fix Version/s: 3.3.3 3.4.1 Resolution: Fixed Issue resolved by pull request 40220 [https://github.com/apache/spark/pull/40220] > Remove aliases from deprecated numpy data types > --- > > Key: SPARK-42647 > URL: https://issues.apache.org/jira/browse/SPARK-42647 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1 >Reporter: Aimilios Tsouvelekakis >Assignee: Aimilios Tsouvelekakis >Priority: Major > Fix For: 3.3.3, 3.4.1 > > > Numpy has started changing the aliases of some of its data types. This means > that users on the latest version of numpy will face either warnings or > errors, depending on the types they are using. This affects all users > running numpy > 1.20.0. One of the types was fixed back in September with this > [pull request|https://github.com/apache/spark/pull/37817]. > The problem can be split into 2 types: > [numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type > aliases ending in a 0 bit size: np.object0, np.str0, np.bytes0, np.void0, > np.int0, np.uint0, as well as np.bool8, are now deprecated and will eventually > be removed. As of numpy 1.25.0 they give a warning. > [numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using aliases > of builtin types like np.int has been deprecated since numpy 1.20.0 and > removed in numpy 1.24.0. > The changes are needed so pyspark can be compatible with the latest numpy and > avoid > * attribute errors on data types deprecated since version 1.20.0: > [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations] > * warnings on data types deprecated as of version 1.24.0: > [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations] > > From my research I see the following: > The only functional changes are related to the conversion.py file. > The rest of the changes are inside tests, in the user guide, or in some > docstrings describing specific functions. Since I am not an expert in these > tests, I will wait for the reviewer and people with more experience in the > pyspark code. > These types are aliases for classic Python types, so they should work with > all numpy versions > [1|https://numpy.org/devdocs/release/1.20.0-notes.html], > [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python]; > the error or warning comes from the call to numpy. > > For the affected versions I chose to include 3.3 and onwards, but I see that 3.2 > is also still in the 18-month maintenance cadence, as it was released in > October 2021. > > The pull request: [https://github.com/apache/spark/pull/40220] > Best Regards, > Aimilios -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41718) Numpy 1.24 breaks PySpark due to use of `np.bool` instead of `np.bool_` in many places
[ https://issues.apache.org/jira/browse/SPARK-41718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-41718. -- Resolution: Duplicate > Numpy 1.24 breaks PySpark due to use of `np.bool` instead of `np.bool_` in > many places > -- > > Key: SPARK-41718 > URL: https://issues.apache.org/jira/browse/SPARK-41718 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Phillip Cloud >Priority: Major > > In numpy 1.24, `numpy.bool` was removed (it was deprecated prior to 1.24). > This causes many APIs in pyspark to stop working because an AttributeError is > raised. The alternative is to use `numpy.bool_` (trailing underscore). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42615) Refactor the AnalyzePlan RPC and add `session.version`
[ https://issues.apache.org/jira/browse/SPARK-42615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695943#comment-17695943 ] Apache Spark commented on SPARK-42615: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40261 > Refactor the AnalyzePlan RPC and add `session.version` > -- > > Key: SPARK-42615 > URL: https://issues.apache.org/jira/browse/SPARK-42615 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache
[ https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-41497. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39459 [https://github.com/apache/spark/pull/39459] > Accumulator undercounting in the case of retry task with rdd cache > -- > > Key: SPARK-41497 > URL: https://issues.apache.org/jira/browse/SPARK-41497 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1 >Reporter: wuyi >Assignee: Tengfei Huang >Priority: Major > Fix For: 3.5.0 > > > An accumulator can be undercounted when the retried task has an rdd cache. See > the example below; you can also find the complete, reproducible > example at > [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc] > > {code:scala} > test("SPARK-XXX") { > // Set up a cluster with 2 executors > val conf = new SparkConf() > .setMaster("local-cluster[2, 1, > 1024]").setAppName("TaskSchedulerImplSuite") > sc = new SparkContext(conf) > // Set up a custom task scheduler. The scheduler will fail the first task > attempt of the job > // submitted below. In particular, the failed first task attempt would > succeed in its computation > // (accumulator accounting, result caching) but fail to report its > success status due > // to the concurrent executor loss. The second task attempt would succeed. > taskScheduler = setupSchedulerWithCustomStatusUpdate(sc) > val myAcc = sc.longAccumulator("myAcc") > // Create an rdd with only one partition so there's only one task, and > specify the storage level > // with MEMORY_ONLY_2 so that the rdd result will be cached on both > executors. > val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter => > myAcc.add(100) > iter.map(x => x + 1) > }.persist(StorageLevel.MEMORY_ONLY_2) > // This will pass since the second task attempt will succeed > assert(rdd.count() === 10) > // This will fail because `myAcc.add(100)` won't be executed during the > second task attempt: > // the second attempt will load the rdd cache > directly instead of > // executing the task function, so `myAcc.add(100)` is skipped. > assert(myAcc.value === 100) > } {code} > > We could also hit this issue with decommissioning even if the rdd has only one > copy. For example, decommissioning could migrate the rdd cache block to another > executor (the result is effectively the same as with 2 copies) and the > decommissioned executor could be lost before the task reports its success status to > the driver. > > The issue is more complicated to fix than expected. I have tried > several fixes, but none of them is ideal: > Option 1: Clean up any rdd cache related to the failed task: in practice, > this option already fixes the issue in most cases. However, theoretically, > the rdd cache could be reported to the driver right after the driver cleans up > the failed task's caches, due to asynchronous communication. So this option > can't resolve the issue thoroughly; > Option 2: Disallow rdd cache reuse across the task attempts for the same > task: this option can 100% fix the issue. The problem is that it can also > affect cases where the rdd cache can be reused across attempts (e.g., when > there is no accumulator operation in the task), which can cause a perf > regression; > Option 3: Introduce an accumulator cache: first, this requires a new framework > for supporting accumulator caching; second, the driver should improve its logic > to distinguish whether the accumulator cache value should be reported to the > user, to avoid overcounting. For example, in the case of a task retry, the value > should be reported. However, in the case of rdd cache reuse, the value > shouldn't be reported (should it?); > Option 4: Do task success validation when a task tries to load the rdd > cache: this defines an rdd cache as only valid/accessible if the task has > succeeded. This could be either overkill or a bit complex (because > currently Spark cleans up the task state once it's finished, so we need > to maintain a structure to know whether a task once succeeded or not.) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache
[ https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-41497: --- Assignee: Tengfei Huang > Accumulator undercounting in the case of retry task with rdd cache > -- > > Key: SPARK-41497 > URL: https://issues.apache.org/jira/browse/SPARK-41497 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1 >Reporter: wuyi >Assignee: Tengfei Huang >Priority: Major > > An accumulator can be undercounted when the retried task has an rdd cache. See > the example below; you can also find the complete, reproducible > example at > [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc] > > {code:scala} > test("SPARK-XXX") { > // Set up a cluster with 2 executors > val conf = new SparkConf() > .setMaster("local-cluster[2, 1, > 1024]").setAppName("TaskSchedulerImplSuite") > sc = new SparkContext(conf) > // Set up a custom task scheduler. The scheduler will fail the first task > attempt of the job > // submitted below. In particular, the failed first task attempt would > succeed in its computation > // (accumulator accounting, result caching) but fail to report its > success status due > // to the concurrent executor loss. The second task attempt would succeed. > taskScheduler = setupSchedulerWithCustomStatusUpdate(sc) > val myAcc = sc.longAccumulator("myAcc") > // Create an rdd with only one partition so there's only one task, and > specify the storage level > // with MEMORY_ONLY_2 so that the rdd result will be cached on both > executors. > val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter => > myAcc.add(100) > iter.map(x => x + 1) > }.persist(StorageLevel.MEMORY_ONLY_2) > // This will pass since the second task attempt will succeed > assert(rdd.count() === 10) > // This will fail because `myAcc.add(100)` won't be executed during the > second task attempt: > // the second attempt will load the rdd cache > directly instead of > // executing the task function, so `myAcc.add(100)` is skipped. > assert(myAcc.value === 100) > } {code} > > We could also hit this issue with decommissioning even if the rdd has only one > copy. For example, decommissioning could migrate the rdd cache block to another > executor (the result is effectively the same as with 2 copies) and the > decommissioned executor could be lost before the task reports its success status to > the driver. > > The issue is more complicated to fix than expected. I have tried > several fixes, but none of them is ideal: > Option 1: Clean up any rdd cache related to the failed task: in practice, > this option already fixes the issue in most cases. However, theoretically, > the rdd cache could be reported to the driver right after the driver cleans up > the failed task's caches, due to asynchronous communication. So this option > can't resolve the issue thoroughly; > Option 2: Disallow rdd cache reuse across the task attempts for the same > task: this option can 100% fix the issue. The problem is that it can also > affect cases where the rdd cache can be reused across attempts (e.g., when > there is no accumulator operation in the task), which can cause a perf > regression; > Option 3: Introduce an accumulator cache: first, this requires a new framework > for supporting accumulator caching; second, the driver should improve its logic > to distinguish whether the accumulator cache value should be reported to the > user, to avoid overcounting. For example, in the case of a task retry, the value > should be reported. However, in the case of rdd cache reuse, the value > shouldn't be reported (should it?); > Option 4: Do task success validation when a task tries to load the rdd > cache: this defines an rdd cache as only valid/accessible if the task has > succeeded. This could be either overkill or a bit complex (because > currently Spark cleans up the task state once it's finished, so we need > to maintain a structure to know whether a task once succeeded or not.) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42630) Make `parse_data_type` use new proto message `DDLParse`
[ https://issues.apache.org/jira/browse/SPARK-42630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695928#comment-17695928 ] Apache Spark commented on SPARK-42630: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40260 > Make `parse_data_type` use new proto message `DDLParse` > --- > > Key: SPARK-42630 > URL: https://issues.apache.org/jira/browse/SPARK-42630 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42658) Handle timeouts and CRC failures during artifact transfer
[ https://issues.apache.org/jira/browse/SPARK-42658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Venkata Sai Akhil Gudesa updated SPARK-42658: - Description: We would need a retry mechanism on the client side to handle CRC failures during artifact transfer because the server would discard data that fails CRC and hence, may lead to missing artifacts during UDF execution. We also require a timeout policy to prevent indefinitely waiting for the server reply. was:We would need a retry mechanism on the client side to handle CRC failures during artifact transfer. The server would discard data that fails CRC and hence, may lead to missing artifacts during UDF execution. > Handle timeouts and CRC failures during artifact transfer > - > > Key: SPARK-42658 > URL: https://issues.apache.org/jira/browse/SPARK-42658 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > We would need a retry mechanism on the client side to handle CRC failures > during artifact transfer because the server would discard data that fails CRC > and hence, may lead to missing artifacts during UDF execution. > We also require a timeout policy to prevent indefinitely waiting for the > server reply. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42658) Handle timeouts and CRC failures during artifact transfer
[ https://issues.apache.org/jira/browse/SPARK-42658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Venkata Sai Akhil Gudesa updated SPARK-42658: - Summary: Handle timeouts and CRC failures during artifact transfer (was: Handle CRC failures during artifact transfer) > Handle timeouts and CRC failures during artifact transfer > - > > Key: SPARK-42658 > URL: https://issues.apache.org/jira/browse/SPARK-42658 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > We would need a retry mechanism on the client side to handle CRC failures > during artifact transfer. The server would discard data that fails CRC, > which may lead to missing artifacts during UDF execution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42658) Handle CRC failures during artifact transfer
Venkata Sai Akhil Gudesa created SPARK-42658: Summary: Handle CRC failures during artifact transfer Key: SPARK-42658 URL: https://issues.apache.org/jira/browse/SPARK-42658 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.0 Reporter: Venkata Sai Akhil Gudesa We would need a retry mechanism on the client side to handle CRC failures during artifact transfer. The server would discard data that fails CRC, which may lead to missing artifacts during UDF execution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42657) Support to find and transfer client-side REPL classfiles to server as artifacts
Venkata Sai Akhil Gudesa created SPARK-42657: Summary: Support to find and transfer client-side REPL classfiles to server as artifacts Key: SPARK-42657 URL: https://issues.apache.org/jira/browse/SPARK-42657 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.0 Reporter: Venkata Sai Akhil Gudesa To run UDFs defined in the client-side REPL, we require a mechanism that can find the local REPL classfiles and then utilise the mechanism from https://issues.apache.org/jira/browse/SPARK-42653 to transfer them to the server as artifacts. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
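A sketch of the discovery half, under the assumption that the REPL writes generated classes to a known output directory (the directory itself is hypothetical); the transfer half would then reuse the SPARK-42653 mechanism:

{code:scala}
import java.io.File

// Recursively collect the .class files a REPL has written to its output
// directory; the directory path is an assumption, not a real Spark setting.
def findReplClassFiles(dir: File): Seq[File] = {
  val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
  val (dirs, files) = children.partition(_.isDirectory)
  files.filter(_.getName.endsWith(".class")).toSeq ++ dirs.flatMap(findReplClassFiles)
}

// e.g. findReplClassFiles(new File("/tmp/repl-classes"))
{code}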
[jira] [Updated] (SPARK-42657) Support to find and transfer client-side REPL classfiles to server as artifacts
[ https://issues.apache.org/jira/browse/SPARK-42657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Venkata Sai Akhil Gudesa updated SPARK-42657: - Epic Link: SPARK-42554 > Support to find and transfer client-side REPL classfiles to server as > artifacts > - > > Key: SPARK-42657 > URL: https://issues.apache.org/jira/browse/SPARK-42657 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > To run UDFs defined in the client-side REPL, we require a mechanism > that can find the local REPL classfiles and then utilise the mechanism from > https://issues.apache.org/jira/browse/SPARK-42653 to transfer them to the > server as artifacts. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42609) Add tests for grouping() and grouping_id() functions
[ https://issues.apache.org/jira/browse/SPARK-42609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42609: Assignee: Apache Spark (was: Rui Wang) > Add tests for grouping() and grouping_id() functions > > > Key: SPARK-42609 > URL: https://issues.apache.org/jira/browse/SPARK-42609 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42609) Add tests for grouping() and grouping_id() functions
[ https://issues.apache.org/jira/browse/SPARK-42609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42609: Assignee: Rui Wang (was: Apache Spark) > Add tests for grouping() and grouping_id() functions > > > Key: SPARK-42609 > URL: https://issues.apache.org/jira/browse/SPARK-42609 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42609) Add tests for grouping() and grouping_id() functions
[ https://issues.apache.org/jira/browse/SPARK-42609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695912#comment-17695912 ] Apache Spark commented on SPARK-42609: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40259 > Add tests for grouping() and grouping_id() functions > > > Key: SPARK-42609 > URL: https://issues.apache.org/jira/browse/SPARK-42609 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
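For reference, a small example of the two functions the tests cover; `grouping` and `grouping_id` are the standard functions from `org.apache.spark.sql.functions`, and the DataFrame is made up:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{grouping, grouping_id, sum}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("US", "web", 10), ("US", "store", 5), ("EU", "web", 7))
  .toDF("region", "channel", "amount")

// grouping(c) is 1 in rows where `c` was aggregated away by the cube;
// grouping_id() packs all grouping indicators into one bit vector.
sales.cube($"region", $"channel")
  .agg(sum($"amount"), grouping($"region"), grouping_id())
  .show()
{code}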
[jira] [Resolved] (SPARK-42640) Remove stale entries from the excluding rules for CompatibilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42640. --- Fix Version/s: 3.4.1 Resolution: Fixed > Remove stale entries from the excluding rules for CompatibilitySuite > -- > > Key: SPARK-42640 > URL: https://issues.apache.org/jira/browse/SPARK-42640 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42640) Remove stale entries from the excluding rules for CompatibilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-42640: -- Epic Link: SPARK-42554 > Remove stale entries from the excluding rules for CompatibilitySuite > -- > > Key: SPARK-42640 > URL: https://issues.apache.org/jira/browse/SPARK-42640 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42655) Incorrect ambiguous column reference error
[ https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42655: Assignee: Apache Spark > Incorrect ambiguous column reference error > -- > > Key: SPARK-42655 > URL: https://issues.apache.org/jira/browse/SPARK-42655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant Prasad >Assignee: Apache Spark >Priority: Major > > val df1 = > sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", > "col5") > val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") > val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*) > df2.select("id").show() > > This query runs fine. > > But when we change the casing of the op_cols to have mix of upper & lower > case ("id" & "ID") it throws an ambiguous col ref error: > > val df1 = > sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", > "col5") > val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID") > val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*) > df3.select("id").show() > org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could > be: id, id. > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:112) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpressionByPlanChildren$1(Analyzer.scala:1857) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpression$2(Analyzer.scala:1787) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.innerResolve$1(Analyzer.scala:1794) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpression(Analyzer.scala:1812) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionByPlanChildren(Analyzer.scala:1863) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$17.$anonfun$applyOrElse$94(Analyzer.scala:1577) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209) > > Since, Spark is case insensitive, it should work for second case also when we > have upper and lower case column names in the column list. > It also works fine in Spark 2.3. 
> -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42655) Incorrect ambiguous column reference error
[ https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695892#comment-17695892 ] Apache Spark commented on SPARK-42655: -- User 'shrprasa' has created a pull request for this issue: https://github.com/apache/spark/pull/40258 > Incorrect ambiguous column reference error > -- > > Key: SPARK-42655 > URL: https://issues.apache.org/jira/browse/SPARK-42655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant Prasad >Priority: Major > > val df1 = > sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", > "col5") > val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") > val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*) > df2.select("id").show() > > This query runs fine. > > But when we change the casing of the op_cols to have mix of upper & lower > case ("id" & "ID") it throws an ambiguous col ref error: > > val df1 = > sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", > "col5") > val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID") > val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*) > df3.select("id").show() > org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could > be: id, id. > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:112) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpressionByPlanChildren$1(Analyzer.scala:1857) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpression$2(Analyzer.scala:1787) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.innerResolve$1(Analyzer.scala:1794) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpression(Analyzer.scala:1812) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionByPlanChildren(Analyzer.scala:1863) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$17.$anonfun$applyOrElse$94(Analyzer.scala:1577) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209) > > Since, Spark is case insensitive, it should work for second case also when we > have upper and lower case column names in the column list. 
> It also works fine in Spark 2.3. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42655) Incorrect ambiguous column reference error
[ https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42655: Assignee: (was: Apache Spark) > Incorrect ambiguous column reference error > -- > > Key: SPARK-42655 > URL: https://issues.apache.org/jira/browse/SPARK-42655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant Prasad >Priority: Major > > val df1 = > sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", > "col5") > val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") > val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*) > df2.select("id").show() > > This query runs fine. > > But when we change the casing of the op_cols to have mix of upper & lower > case ("id" & "ID") it throws an ambiguous col ref error: > > val df1 = > sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", > "col5") > val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID") > val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*) > df3.select("id").show() > org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could > be: id, id. > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:112) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpressionByPlanChildren$1(Analyzer.scala:1857) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpression$2(Analyzer.scala:1787) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.innerResolve$1(Analyzer.scala:1794) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpression(Analyzer.scala:1812) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionByPlanChildren(Analyzer.scala:1863) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$17.$anonfun$applyOrElse$94(Analyzer.scala:1577) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209) > > Since, Spark is case insensitive, it should work for second case also when we > have upper and lower case column names in the column list. > It also works fine in Spark 2.3. 
> -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42656) Spark Connect Scala Client Shell Script
[ https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42656: Assignee: (was: Apache Spark) > Spark Connect Scala Client Shell Script > --- > > Key: SPARK-42656 > URL: https://issues.apache.org/jira/browse/SPARK-42656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > Adding a shell script to run the Scala client in a Scala REPL, allowing users > to connect to Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42656) Spark Connect Scala Client Shell Script
[ https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42656: Assignee: Apache Spark > Spark Connect Scala Client Shell Script > --- > > Key: SPARK-42656 > URL: https://issues.apache.org/jira/browse/SPARK-42656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Assignee: Apache Spark >Priority: Major > > Adding a shell script to run the Scala client in a Scala REPL, allowing users > to connect to Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42656) Spark Connect Scala Client Shell Script
[ https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695879#comment-17695879 ] Apache Spark commented on SPARK-42656: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/40257 > Spark Connect Scala Client Shell Script > --- > > Key: SPARK-42656 > URL: https://issues.apache.org/jira/browse/SPARK-42656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > Adding a shell script to run the Scala client in a Scala REPL, allowing users > to connect to Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42656) Spark Connect Scala Client Shell Script
Zhen Li created SPARK-42656: --- Summary: Spark Connect Scala Client Shell Script Key: SPARK-42656 URL: https://issues.apache.org/jira/browse/SPARK-42656 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.0 Reporter: Zhen Li Adding a shell script to run the Scala client in a Scala REPL, allowing users to connect to Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36604) timestamp type column analyze result is wrong
[ https://issues.apache.org/jira/browse/SPARK-36604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695869#comment-17695869 ] Ritika Maheshwari commented on SPARK-36604: --- This seems to be working correctly in Spark 3.3.0: spark-sql> insert into a values(cast('2021-08-15 15:30:01' as timestamp) > ); 23/03/02 11:04:11 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Time taken: 3.278 seconds spark-sql> select * from a; 2021-08-15 15:30:01 Time taken: 0.782 seconds, Fetched 1 row(s) spark-sql> analyze table a compute statistics for columns a; Time taken: 1.882 seconds spark-sql> desc formatted a a; col_name a data_type timestamp comment NULL min 2021-08-15 15:30:01.00 -0700 max 2021-08-15 15:30:01.00 -0700 num_nulls 0 distinct_count 1 avg_col_len 8 max_col_len 8 histogram NULL Time taken: 0.095 seconds, Fetched 10 row(s) spark-sql> desc a; a timestamp Time taken: 0.059 seconds, Fetched 1 row(s) spark-sql> > timestamp type column analyze result is wrong > - > > Key: SPARK-36604 > URL: https://issues.apache.org/jira/browse/SPARK-36604 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.1.2 > Environment: Spark 3.1.1 >Reporter: YuanGuanhu >Priority: Major > > when we create a table with a timestamp column, the min and max values in the > analyze result for that column are wrong > e.g.: > {code} > > select * from a; > {code} > {code} > 2021-08-15 15:30:01 > Time taken: 2.789 seconds, Fetched 1 row(s) > spark-sql> desc formatted a a; > col_name a > data_type timestamp > comment NULL > min 2021-08-15 07:30:01.00 > max 2021-08-15 07:30:01.00 > num_nulls 0 > distinct_count 1 > avg_col_len 8 > max_col_len 8 > histogram NULL > Time taken: 0.278 seconds, Fetched 10 row(s) > spark-sql> desc a; > a timestamp NULL > Time taken: 1.432 seconds, Fetched 1 row(s) > {code} > > reproduce steps: > {code} > create table a(a timestamp); > insert into a select '2021-08-15 15:30:01'; > analyze table a compute statistics for columns a; > desc formatted a a; > select * from a; > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42655) Incorrect ambiguous column reference error
[ https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shrikant Prasad updated SPARK-42655: Description: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*) df2.select("id").show() This query runs fine. But when we change the casing of the op_cols to have mix of upper & lower case ("id" & "ID") it throws an ambiguous col ref error: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID") val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*) df3.select("id").show() org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id. at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:112) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpressionByPlanChildren$1(Analyzer.scala:1857) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpression$2(Analyzer.scala:1787) at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60) at org.apache.spark.sql.catalyst.analysis.Analyzer.innerResolve$1(Analyzer.scala:1794) at org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpression(Analyzer.scala:1812) at org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionByPlanChildren(Analyzer.scala:1863) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$17.$anonfun$applyOrElse$94(Analyzer.scala:1577) at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209) Since, Spark is case insensitive, it should work for second case also when we have upper and lower case column names in the column list. It also works fine in Spark 2.3. was: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) df2.select("id").show() This query runs fine. 
But when we change the casing of the op_cols to have mix of upper & lower case ("id" & "ID") it throws an ambiguous col ref error: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID") val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) df2.select("id").show() Since, Spark is case insensitive, it should work for second case also when we have upper and lower case column names in the column list. It also works fine in Spark 2.3. > Incorrect ambiguous column reference error > -- > > Key: SPARK-42655 > URL: https://issues.apache.org/jira/browse/SPARK-42655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant Prasad >Priority: Major > > val df1 = > sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", > "col5") > val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") > val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*) > df2.select("id").show() > > This query runs fine. > > But when we change the casing of the op_cols to have mix of upper & lower > case ("id" & "ID") it throws an ambiguous col ref error: > > val df1 = >
[jira] [Updated] (SPARK-42655) Incorrect ambiguous column reference error
[ https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shrikant Prasad updated SPARK-42655: Description: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) df2.select("id").show() This query runs fine. But when we change the casing of the op_cols to have mix of upper & lower case ("id" & "ID") it throws an ambiguous col ref error: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID") val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) df2.select("id").show() Since, Spark is case insensitive, it should work for second case also when we have upper and lower case column names in the column list. It also works fine in Spark 2.3. was: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) df2.select("id").show() This query runs fine. But when we change the casing of the op_cols to have mix of upper & lower case ("id" & "ID") it throws an ambiguous col ref error: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID") val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) df2.select("id").show() Since, Spark is case insensitive, it should work for second case also when we have upper and lower case column names in the column list. > Incorrect ambiguous column reference error > -- > > Key: SPARK-42655 > URL: https://issues.apache.org/jira/browse/SPARK-42655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant Prasad >Priority: Major > > val df1 = > sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", > "col5") > val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") > val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) > df2.select("id").show() > > This query runs fine. > > But when we change the casing of the op_cols to have mix of upper & lower > case ("id" & "ID") it throws an ambiguous col ref error: > > val df1 = > sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", > "col5") > val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID") > val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) > df2.select("id").show() > > Since, Spark is case insensitive, it should work for second case also when we > have upper and lower case column names in the column list. > It also works fine in Spark 2.3. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42655) Incorrect ambiguous column reference error
Shrikant Prasad created SPARK-42655: --- Summary: Incorrect ambiguous column reference error Key: SPARK-42655 URL: https://issues.apache.org/jira/browse/SPARK-42655 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: Shrikant Prasad val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "id") val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) df2.select("id").show() This query runs fine. But when we change the casing of the op_cols to have mix of upper & lower case ("id" & "ID") it throws an ambiguous col ref error: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID") val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*) df2.select("id").show() Since, Spark is case insensitive, it should work for second case also when we have upper and lower case column names in the column list. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
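Until a fix lands, one possible client-side workaround (an assumption-level sketch, not part of the ticket) is to deduplicate the column list case-insensitively before the select. Note this changes the output shape, since the duplicate column is dropped:

{code:scala}
// Drop later duplicates of a column name, comparing case-insensitively, so
// the projection never contains both "id" and "ID".
val opCols = List("id", "col2", "col3", "col4", "col5", "ID")

val deduped = opCols.foldLeft(List.empty[String]) { (acc, c) =>
  if (acc.exists(_.equalsIgnoreCase(c))) acc else acc :+ c
}
// deduped == List(id, col2, col3, col4, col5)
// df1.select(deduped.head, deduped.tail: _*) then resolves "id" unambiguously
{code}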
[jira] [Resolved] (SPARK-42599) Make `CompatibilitySuite` as a tool like `dev/mima`
[ https://issues.apache.org/jira/browse/SPARK-42599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42599. --- Fix Version/s: 3.4.0 Assignee: Yang Jie Resolution: Fixed > Make `CompatibilitySuite` as a tool like `dev/mima` > --- > > Key: SPARK-42599 > URL: https://issues.apache.org/jira/browse/SPARK-42599 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > > Using maven to test `CompatibilitySuite` requires some pre-work (maven needs > to build the sql & > connect-client-jvm modules before the test), so when we run `mvn package test`, > there will be the following errors: > > {code:java} > CompatibilitySuite: > - compatibility MiMa tests *** FAILED *** > java.lang.AssertionError: assertion failed: Failed to find the jar inside > folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > ... > - compatibility API tests: Dataset *** FAILED *** > java.lang.AssertionError: assertion failed: Failed to find the jar inside > folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$7(CompatibilitySuite.scala:110) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26314) support Confluent encoded Avro in Spark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695828#comment-17695828 ] Gustavo Martin commented on SPARK-26314: My team just stumbled upon this problem :( I was hoping Spark would be making use of the AVRO capabilities for finding the right schema associated with some event when using a Schema Registry. > support Confluent encoded Avro in Spark Structured Streaming > > > Key: SPARK-26314 > URL: https://issues.apache.org/jira/browse/SPARK-26314 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: David Ahern >Priority: Major > > As Avro has now been added as a first class citizen, > [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] > please make Confluent-encoded Avro work out of the box with Spark Structured > Streaming > as described in this link, Avro messages on Kafka encoded with the Confluent > serializer also need to be decoded with Confluent. It would be great if this > worked out of the box > [https://developer.ibm.com/answers/questions/321440/ibm-iidr-cdc-db2-to-kafka.html?smartspace=blockchain] > here are details on the Confluent encoding > [https://www.sderosiaux.com/articles/2017/03/02/serializing-data-efficiently-with-apache-avro-and-dealing-with-a-schema-registry/#encodingdecoding-the-messages-with-the-schema-id] > It's been a year since I worked on anything to do with Avro and Spark > Structured Streaming, but I had to take an approach such as this when getting > it to work. This is what I used as a reference at that time > [https://github.com/tubular/confluent-spark-avro] > Also, here is another link I found that someone has done in the meantime > [https://github.com/AbsaOSS/ABRiS] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
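A common hedged workaround until this is supported natively: strip the 5-byte Confluent wire-format prefix (one magic byte plus a 4-byte schema id) and decode with a fixed schema via `from_avro` from the spark-avro module. This ignores per-record schema ids, which is exactly the gap the ticket describes; the topic and schema below are made up:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}
import org.apache.spark.sql.avro.functions.from_avro

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical topic whose values are Confluent-framed Avro records.
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Assumed writer schema; a real registry lookup would use the schema id instead.
val schemaJson =
  """{"type":"record","name":"Event","fields":[{"name":"id","type":"long"}]}"""

val decoded = kafka
  // Drop the magic byte (1) and the schema id (4) from the value bytes.
  .withColumn("payload", expr("substring(value, 6, length(value) - 5)"))
  .select(from_avro(col("payload"), schemaJson).as("event"))
{code}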
[jira] [Updated] (SPARK-42653) Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Venkata Sai Akhil Gudesa updated SPARK-42653: - Epic Link: SPARK-42554 > Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-42653 > URL: https://issues.apache.org/jira/browse/SPARK-42653 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > In the decoupled client-server architecture of Spark Connect, a remote client > may use a local JAR or a new class in their UDF that may not be present on > the server. To handle these cases of missing "artifacts", we need to > implement a mechanism to transfer artifacts from the client side over to the > server side as per the protocol defined in > https://github.com/apache/spark/pull/40147 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42653) Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695820#comment-17695820 ] Apache Spark commented on SPARK-42653: -- User 'vicennial' has created a pull request for this issue: https://github.com/apache/spark/pull/40256 > Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-42653 > URL: https://issues.apache.org/jira/browse/SPARK-42653 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > In the decoupled client-server architecture of Spark Connect, a remote client > may use a local JAR or a new class in their UDF that may not be present on > the server. To handle these cases of missing "artifacts", we need to > implement a mechanism to transfer artifacts from the client side over to the > server side as per the protocol defined in > https://github.com/apache/spark/pull/40147 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42653) Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42653: Assignee: (was: Apache Spark) > Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-42653 > URL: https://issues.apache.org/jira/browse/SPARK-42653 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > In the decoupled client-server architecture of Spark Connect, a remote client > may use a local JAR or a new class in their UDF that may not be present on > the server. To handle these cases of missing "artifacts", we need to > implement a mechanism to transfer artifacts from the client side over to the > server side as per the protocol defined in > https://github.com/apache/spark/pull/40147 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42653) Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42653: Assignee: Apache Spark > Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-42653 > URL: https://issues.apache.org/jira/browse/SPARK-42653 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Apache Spark >Priority: Major > > In the decoupled client-server architecture of Spark Connect, a remote client > may use a local JAR or a new class in their UDF that may not be present on > the server. To handle these cases of missing "artifacts", we need to > implement a mechanism to transfer artifacts from the client side over to the > server side as per the protocol defined in > https://github.com/apache/spark/pull/40147 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42653) Artifact transfer from Scala/JVM client to Server
[ https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695819#comment-17695819 ] Apache Spark commented on SPARK-42653: -- User 'vicennial' has created a pull request for this issue: https://github.com/apache/spark/pull/40256 > Artifact transfer from Scala/JVM client to Server > - > > Key: SPARK-42653 > URL: https://issues.apache.org/jira/browse/SPARK-42653 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > In the decoupled client-server architecture of Spark Connect, a remote client > may use a local JAR or a new class in their UDF that may not be present on > the server. To handle these cases of missing "artifacts", we need to > implement a mechanism to transfer artifacts from the client side over to the > server side as per the protocol defined in > https://github.com/apache/spark/pull/40147 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42654) Upgrade dropwizard metrics 4.2.17
[ https://issues.apache.org/jira/browse/SPARK-42654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42654: Assignee: Apache Spark > Upgrade dropwizard metrics 4.2.17 > - > > Key: SPARK-42654 > URL: https://issues.apache.org/jira/browse/SPARK-42654 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > * [https://github.com/dropwizard/metrics/releases/tag/v4.2.16] > * [https://github.com/dropwizard/metrics/releases/tag/v4.2.17] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42558) Implement DataFrameStatFunctions
[ https://issues.apache.org/jira/browse/SPARK-42558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42558: Assignee: Apache Spark > Implement DataFrameStatFunctions > > > Key: SPARK-42558 > URL: https://issues.apache.org/jira/browse/SPARK-42558 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > > Implement DataFrameStatFunctions for connect, and hook it up to Dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42558) Implement DataFrameStatFunctions
[ https://issues.apache.org/jira/browse/SPARK-42558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695816#comment-17695816 ] Apache Spark commented on SPARK-42558: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40255 > Implement DataFrameStatFunctions > > > Key: SPARK-42558 > URL: https://issues.apache.org/jira/browse/SPARK-42558 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Implement DataFrameStatFunctions for connect, and hook it up to Dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42654) Upgrade dropwizard metrics 4.2.17
[ https://issues.apache.org/jira/browse/SPARK-42654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695814#comment-17695814 ] Apache Spark commented on SPARK-42654: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40254 > Upgrade dropwizard metrics 4.2.17 > - > > Key: SPARK-42654 > URL: https://issues.apache.org/jira/browse/SPARK-42654 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > > * [https://github.com/dropwizard/metrics/releases/tag/v4.2.16] > * [https://github.com/dropwizard/metrics/releases/tag/v4.2.17] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42558) Implement DataFrameStatFunctions
[ https://issues.apache.org/jira/browse/SPARK-42558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42558: Assignee: (was: Apache Spark) > Implement DataFrameStatFunctions > > > Key: SPARK-42558 > URL: https://issues.apache.org/jira/browse/SPARK-42558 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Implement DataFrameStatFunctions for connect, and hook it up to Dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42654) Upgrade dropwizard metrics 4.2.17
[ https://issues.apache.org/jira/browse/SPARK-42654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42654: Assignee: (was: Apache Spark) > Upgrade dropwizard metrics 4.2.17 > - > > Key: SPARK-42654 > URL: https://issues.apache.org/jira/browse/SPARK-42654 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > > * [https://github.com/dropwizard/metrics/releases/tag/v4.2.16] > * [https://github.com/dropwizard/metrics/releases/tag/v4.2.17] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42654) Upgrade dropwizard metrics 4.2.17
Yang Jie created SPARK-42654: Summary: Upgrade dropwizard metrics 4.2.17 Key: SPARK-42654 URL: https://issues.apache.org/jira/browse/SPARK-42654 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Yang Jie * [https://github.com/dropwizard/metrics/releases/tag/v4.2.16] * [https://github.com/dropwizard/metrics/releases/tag/v4.2.17] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42653) Artifact transfer from Scala/JVM client to Server
Venkata Sai Akhil Gudesa created SPARK-42653: Summary: Artifact transfer from Scala/JVM client to Server Key: SPARK-42653 URL: https://issues.apache.org/jira/browse/SPARK-42653 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.0 Reporter: Venkata Sai Akhil Gudesa In the decoupled client-server architecture of Spark Connect, a remote client may use a local JAR or a new class in their UDF that may not be present on the server. To handle these cases of missing "artifacts", we need to implement a mechanism to transfer artifacts from the client side over to the server side as per the protocol defined in https://github.com/apache/spark/pull/40147 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
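A rough sketch of the client-side chunking such a transfer implies, with a CRC32 per chunk so the server can detect corruption chunk by chunk; the `Chunk` shape and chunk size are assumptions, not the actual protobuf messages from the linked PR:

{code:scala}
import java.nio.file.{Files, Paths}
import java.util.zip.CRC32

final case class Chunk(seq: Int, data: Array[Byte], crc: Long)

// Split a local artifact (e.g. a JAR) into fixed-size chunks, each carrying
// its own CRC32 checksum.
def chunkArtifact(path: String, chunkSize: Int = 32 * 1024): Seq[Chunk] = {
  val bytes = Files.readAllBytes(Paths.get(path))
  bytes.grouped(chunkSize).zipWithIndex.map { case (data, i) =>
    val crc = new CRC32()
    crc.update(data)
    Chunk(i, data, crc.getValue)
  }.toSeq
}
{code}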
[jira] [Updated] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-42553: - Fix Version/s: 3.3.3 > NonReserved keyword "interval" can't be column name > --- > > Key: SPARK-42553 > URL: https://issues.apache.org/jira/browse/SPARK-42553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.2.3, 3.3.2 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Assignee: jiang13021 >Priority: Major > Fix For: 3.3.3, 3.4.1 > > > INTERVAL is a Non-Reserved keyword in spark. "Non-Reserved keywords" have a > special meaning in particular contexts and can be used as identifiers in > other contexts. So by design, interval can be used as a column name. > {code:java} > scala> spark.sql("select interval from mytable") > org.apache.spark.sql.catalyst.parser.ParseException: > at least one time unit should be given for interval literal(line 1, pos 7)== > SQL == > select interval from mytable > ---^^^ at > org.apache.spark.sql.errors.QueryParsingErrors$.invalidIntervalLiteralError(QueryParsingErrors.scala:196) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$parseIntervalLiteral$1(AstBuilder.scala:2481) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.parseIntervalLiteral(AstBuilder.scala:2466) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitInterval$1(AstBuilder.scala:2432) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:2431) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalContext.accept(SqlBaseParser.java:17308) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitIntervalLiteral(SqlBaseBaseVisitor.java:1581) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalLiteralContext.accept(SqlBaseParser.java:16929) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitConstantDefault(SqlBaseBaseVisitor.java:1511) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:15905) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitValueExpressionDefault(SqlBaseBaseVisitor.java:1392) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:15298) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPredicated$1(AstBuilder.scala:1548) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:1547) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:57) > at > 
org.apache.spark.sql.catalyst.parser.SqlBaseParser$PredicatedContext.accept(SqlBaseParser.java:14745) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitExpression(SqlBaseBaseVisitor.java:1343) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExpressionContext.accept(SqlBaseParser.java:14606) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitNamedExpression$1(AstBuilder.scala:1434) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:1433) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$NamedExpressionContext.accept(SqlBaseParser.java:14124) > at >
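For anyone hitting this on the affected versions, backtick-quoting is a likely workaround, since quoted identifiers parse as column references rather than keywords; a minimal sketch against the `mytable` from the report:

{code:scala}
// Backticks force identifier parsing, sidestepping the interval-literal path.
spark.sql("SELECT `interval` FROM mytable").show()
{code}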
[jira] [Resolved] (SPARK-42622) StackOverflowError reading json that does not conform to schema
[ https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42622. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40219 [https://github.com/apache/spark/pull/40219] > StackOverflowError reading json that does not conform to schema > --- > > Key: SPARK-42622 > URL: https://issues.apache.org/jira/browse/SPARK-42622 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.0 >Reporter: Jelmer Kuperus >Assignee: Jelmer Kuperus >Priority: Critical > Fix For: 3.4.0 > > > Databricks runtime 12.1 uses a pre-release version of Spark 3.4.x; we > encountered the following problem > > !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42622) StackOverflowError reading json that does not conform to schema
[ https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-42622: Assignee: Jelmer Kuperus > StackOverflowError reading json that does not conform to schema > --- > > Key: SPARK-42622 > URL: https://issues.apache.org/jira/browse/SPARK-42622 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.0 >Reporter: Jelmer Kuperus >Assignee: Jelmer Kuperus >Priority: Critical > > Databricks Runtime 12.1 uses a pre-release version of Spark 3.4.x; we > encountered the following problem: > > !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42622) StackOverflowError reading json that does not conform to schema
[ https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42622: - Fix Version/s: 3.4.1 (was: 3.4.0) Priority: Major (was: Critical) > StackOverflowError reading json that does not conform to schema > --- > > Key: SPARK-42622 > URL: https://issues.apache.org/jira/browse/SPARK-42622 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.0 >Reporter: Jelmer Kuperus >Assignee: Jelmer Kuperus >Priority: Major > Fix For: 3.4.1 > > > Databricks Runtime 12.1 uses a pre-release version of Spark 3.4.x; we > encountered the following problem: > > !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42652) Improve the multiple watermarking policy documentation
Sandeep Chandran created SPARK-42652: Summary: Improve the multiple watermarking policy documentation Key: SPARK-42652 URL: https://issues.apache.org/jira/browse/SPARK-42652 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 3.3.2 Reporter: Sandeep Chandran It would be better to add some examples to the documentation on handling multiple watermarks; a sketch of one possible example follows below. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#policy-for-handling-multiple-watermarks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
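For instance, the page could show how the global watermark policy is selected when two watermarked streams are combined. The rate sources below are illustrative, not from the ticket; the config name is the one the linked page documents.
{code:scala}
// In spark-shell, where `spark` is predefined.
// The default policy "min" advances the global watermark at the pace of
// the slowest stream; "max" opts into the fastest, at the risk of dropping
// data that is late relative to the slower input.
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")

val left = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "10 minutes")
val right = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "2 hours")

// The union carries a single global watermark chosen by the policy above.
val unioned = left.union(right)
{code}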
[jira] [Commented] (SPARK-42558) Implement DataFrameStatFunctions
[ https://issues.apache.org/jira/browse/SPARK-42558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695706#comment-17695706 ] Yang Jie commented on SPARK-42558: -- It seems DataFrameStatFunctions can only be partially implemented: bloomFilter and countMinSketch have no corresponding protocol yet. > Implement DataFrameStatFunctions > > > Key: SPARK-42558 > URL: https://issues.apache.org/jira/browse/SPARK-42558 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Implement DataFrameStatFunctions for Connect, and hook it up to Dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
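Roughly, the split looks like this. The calls are illustrative, against a local session; the plan-versus-sketch distinction is an inference from the comment above, not a statement from the ticket.
{code:scala}
// In spark-shell, where `spark` is predefined.
val df = spark.range(1000).selectExpr("id % 10 AS a", "id % 7 AS b")

// Expressible as query plans, so a Connect client can cover them:
df.stat.approxQuantile("a", Array(0.25, 0.5, 0.75), 0.01)
df.stat.crosstab("a", "b")
df.stat.freqItems(Array("a"))

// No corresponding protocol yet: these return sketch objects, not plans.
df.stat.bloomFilter("a", expectedNumItems = 1000L, fpp = 0.03)
df.stat.countMinSketch("a", eps = 0.01, confidence = 0.95, seed = 42)
{code}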
[jira] [Commented] (SPARK-42553) NonReserved keyword "interval" can't be column name
[ https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695696#comment-17695696 ] Apache Spark commented on SPARK-42553: -- User 'jiang13021' has created a pull request for this issue: https://github.com/apache/spark/pull/40253 > NonReserved keyword "interval" can't be column name > --- > > Key: SPARK-42553 > URL: https://issues.apache.org/jira/browse/SPARK-42553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.2.3, 3.3.2 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Assignee: jiang13021 >Priority: Major > Fix For: 3.4.1 > > > INTERVAL is a Non-Reserved keyword in spark. "Non-Reserved keywords" have a > special meaning in particular contexts and can be used as identifiers in > other contexts. So by design, interval can be used as a column name. > {code:java} > scala> spark.sql("select interval from mytable") > org.apache.spark.sql.catalyst.parser.ParseException: > at least one time unit should be given for interval literal(line 1, pos 7)== > SQL == > select interval from mytable > ---^^^ at > org.apache.spark.sql.errors.QueryParsingErrors$.invalidIntervalLiteralError(QueryParsingErrors.scala:196) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$parseIntervalLiteral$1(AstBuilder.scala:2481) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.parseIntervalLiteral(AstBuilder.scala:2466) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitInterval$1(AstBuilder.scala:2432) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:2431) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalContext.accept(SqlBaseParser.java:17308) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitIntervalLiteral(SqlBaseBaseVisitor.java:1581) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalLiteralContext.accept(SqlBaseParser.java:16929) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitConstantDefault(SqlBaseBaseVisitor.java:1511) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:15905) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitValueExpressionDefault(SqlBaseBaseVisitor.java:1392) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:15298) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPredicated$1(AstBuilder.scala:1548) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:1547) > at > 
org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$PredicatedContext.accept(SqlBaseParser.java:14745) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitExpression(SqlBaseBaseVisitor.java:1343) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExpressionContext.accept(SqlBaseParser.java:14606) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitNamedExpression$1(AstBuilder.scala:1434) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:1433) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:57) > at >
[jira] [Commented] (SPARK-42555) Add JDBC to DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695691#comment-17695691 ] Apache Spark commented on SPARK-42555: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40252 > Add JDBC to DataFrameReader > --- > > Key: SPARK-42555 > URL: https://issues.apache.org/jira/browse/SPARK-42555 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
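A Connect implementation would presumably mirror the existing DataFrameReader JDBC surface; the connection details below are placeholders.
{code:scala}
// In spark-shell, where `spark` is predefined. URL, table, and credentials
// are placeholders, and the JDBC driver must be on the classpath.
val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")
  .option("dbtable", "public.people")
  .option("user", "spark")
  .option("password", "secret")
  .load()
{code}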
[jira] [Assigned] (SPARK-42555) Add JDBC to DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42555: Assignee: Apache Spark > Add JDBC to DataFrameReader > --- > > Key: SPARK-42555 > URL: https://issues.apache.org/jira/browse/SPARK-42555 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42555) Add JDBC to DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42555: Assignee: (was: Apache Spark) > Add JDBC to DataFrameReader > --- > > Key: SPARK-42555 > URL: https://issues.apache.org/jira/browse/SPARK-42555 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41725) Remove the workaround of sql(...).collect back in PySpark tests
[ https://issues.apache.org/jira/browse/SPARK-41725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41725: Assignee: Hyukjin Kwon > Remove the workaround of sql(...).collect back in PySpark tests > --- > > Key: SPARK-41725 > URL: https://issues.apache.org/jira/browse/SPARK-41725 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > See https://github.com/apache/spark/pull/39224/files#r1057436437 > We don't have to `collect` for every `sql`, but Spark Connect requires it. We > should remove them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42651) Optimize global sort to driver sort
[ https://issues.apache.org/jira/browse/SPARK-42651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-42651: -- Description: If the size of the plan is small enough, it's more efficient to sort all rows on the driver side, which saves one shuffle > Optimize global sort to driver sort > --- > > Key: SPARK-42651 > URL: https://issues.apache.org/jira/browse/SPARK-42651 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > If the size of the plan is small enough, it's more efficient to sort all rows > on the driver side, which saves one shuffle -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
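Conceptually, this is the trade being proposed; the snippet below illustrates the idea at the user level and is not the optimizer rule itself.
{code:scala}
// In spark-shell, where `spark` is predefined.
val small = spark.range(100).toDF("id")

// Today a global ORDER BY plans a Sort preceded by an
// Exchange rangepartitioning, i.e. one full shuffle.
small.orderBy("id").explain()

// Driver-side equivalent for small inputs (assumed to fit in driver
// memory): collect once, sort locally, and skip the shuffle entirely.
val sortedRows = small.collect().sortBy(_.getLong(0))
{code}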
[jira] [Assigned] (SPARK-42393) Support for Pandas/Arrow Functions API
[ https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42393: - Assignee: Xinrong Meng > Support for Pandas/Arrow Functions API > -- > > Key: SPARK-42393 > URL: https://issues.apache.org/jira/browse/SPARK-42393 > Project: Spark > Issue Type: Umbrella > Components: Connect, PySpark >Affects Versions: 3.4.0, 3.5.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42499) Support for Runtime SQL configuration
[ https://issues.apache.org/jira/browse/SPARK-42499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42499: - Assignee: Takuya Ueshin > Support for Runtime SQL configuration > - > > Key: SPARK-42499 > URL: https://issues.apache.org/jira/browse/SPARK-42499 > Project: Spark > Issue Type: Umbrella > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
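The surface in question is the standard RuntimeConfig API; the usage below is illustrative.
{code:scala}
// In spark-shell, where `spark` is predefined. Set, read, and unset a
// SQL conf at runtime through spark.conf.
spark.conf.set("spark.sql.shuffle.partitions", "64")
val n = spark.conf.get("spark.sql.shuffle.partitions")
spark.conf.unset("spark.sql.shuffle.partitions")
{code}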
[jira] [Created] (SPARK-42651) Optimize global sort to driver sort
XiDuo You created SPARK-42651: - Summary: Optimize global sort to driver sort Key: SPARK-42651 URL: https://issues.apache.org/jira/browse/SPARK-42651 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33628) Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-33628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695654#comment-17695654 ] Maxim Martynov commented on SPARK-33628: Can anyone review this pull request? > Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the > HiveClientImpl > > > Key: SPARK-33628 > URL: https://issues.apache.org/jira/browse/SPARK-33628 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: jinhai >Priority: Major > Attachments: image-2020-12-02-16-57-43-619.png, > image-2020-12-03-14-38-19-221.png > > > When partitions are tracked by the catalog, Spark computes all custom > partition locations, especially with dynamic partitions, when the > staticPartitions field is empty. > The poor performance of the listPartitions method results in a long period > of no response at the driver. > When reading 12253 partitions, getPartitionsByNames takes 2 seconds, while > getPartitions takes 457 seconds, nearly 8 minutes. > !image-2020-12-02-16-57-43-619.png|width=783,height=54! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
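A simplified sketch of the API difference; the actual HiveClientImpl change differs, and `fetchByNames` is a hypothetical helper.
{code:scala}
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}

// Slow path: hive.getPartitions(table) materializes every partition
// object of the table, even when only a few are needed.
// Fast path: ask the metastore only for the named partitions.
def fetchByNames(hive: Hive, table: Table, names: Seq[String]): Seq[Partition] =
  hive.getPartitionsByNames(table, names.asJava).asScala.toSeq
{code}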
[jira] [Commented] (SPARK-42578) Add JDBC to DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-42578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695650#comment-17695650 ] jiaan.geng commented on SPARK-42578: I will take a look! > Add JDBC to DataFrameWriter > --- > > Key: SPARK-42578 > URL: https://issues.apache.org/jira/browse/SPARK-42578 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
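As with the reader side (SPARK-42555), a Connect implementation would presumably mirror the existing DataFrameWriter JDBC surface; the connection details are again placeholders.
{code:scala}
// In spark-shell, where `spark` is predefined. URL, table, and credentials
// are placeholders, and the JDBC driver must be on the classpath.
val df = spark.range(10).toDF("id")
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")
  .option("dbtable", "public.ids_copy")
  .option("user", "spark")
  .option("password", "secret")
  .mode("append")
  .save()
{code}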
[jira] [Commented] (SPARK-41725) Remove the workaround of sql(...).collect back in PySpark tests
[ https://issues.apache.org/jira/browse/SPARK-41725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695643#comment-17695643 ] Apache Spark commented on SPARK-41725: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40251 > Remove the workaround of sql(...).collect back in PySpark tests > --- > > Key: SPARK-41725 > URL: https://issues.apache.org/jira/browse/SPARK-41725 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > See https://github.com/apache/spark/pull/39224/files#r1057436437 > We don't have to `collect` for every `sql`, but Spark Connect requires it. We > should remove them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42555) Add JDBC to DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695633#comment-17695633 ] jiaan.geng commented on SPARK-42555: I will take a look! > Add JDBC to DataFrameReader > --- > > Key: SPARK-42555 > URL: https://issues.apache.org/jira/browse/SPARK-42555 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42641) Upgrade buf to v1.15.0
[ https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42641. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40243 [https://github.com/apache/spark/pull/40243] > Upgrade buf to v1.15.0 > -- > > Key: SPARK-42641 > URL: https://issues.apache.org/jira/browse/SPARK-42641 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42641) Upgrade buf to v1.15.0
[ https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42641: - Assignee: Ruifeng Zheng > Upgrade buf to v1.15.0 > -- > > Key: SPARK-42641 > URL: https://issues.apache.org/jira/browse/SPARK-42641 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42650) link issue SPARK-42550
[ https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kevinshin updated SPARK-42650: -- Description: When using [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/] and an `insert overwrite` statement hits an exception, a non-partitioned table's home directory will be lost, and a partitioned table will lose its partition directories. spark-defaults.conf: spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension Because I can't reopen SPARK-42550, for details and a reproduction please see: https://issues.apache.org/jira/browse/SPARK-42550 was: When using [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/] and an {{insert overwrite}} statement hits an exception, a non-partitioned table's home directory will be lost, and a partitioned table will lose its partition directories. spark-defaults.conf: spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension Because I can't reopen SPARK-42550, for details and a reproduction please see: https://issues.apache.org/jira/browse/SPARK-42550 > link issue SPARK-42550 > -- > > Key: SPARK-42650 > URL: https://issues.apache.org/jira/browse/SPARK-42650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: kevinshin >Priority: Major > > When using > [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/] > and an `insert overwrite` statement hits an exception, a non-partitioned > table's home directory will be lost, and a partitioned table will lose its > partition directories. > > spark-defaults.conf: > spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension > > Because I can't reopen SPARK-42550, for details and a reproduction please > see: > https://issues.apache.org/jira/browse/SPARK-42550 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42650) link issue SPARK-42550
[ https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kevinshin updated SPARK-42650: -- Description: When using [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/] and an `insert overwrite` statement hits an exception, a non-partitioned table's home directory will be lost, and a partitioned table will lose its partition directories. My spark-defaults.conf config: spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension Because I can't reopen SPARK-42550, for details and a reproduction please see: https://issues.apache.org/jira/browse/SPARK-42550 was: When using [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/] and an `insert overwrite` statement hits an exception, a non-partitioned table's home directory will be lost, and a partitioned table will lose its partition directories. spark-defaults.conf: spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension Because I can't reopen SPARK-42550, for details and a reproduction please see: https://issues.apache.org/jira/browse/SPARK-42550 > link issue SPARK-42550 > -- > > Key: SPARK-42650 > URL: https://issues.apache.org/jira/browse/SPARK-42650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: kevinshin >Priority: Major > > When using > [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/] > and an `insert overwrite` statement hits an exception, a non-partitioned > table's home directory will be lost, and a partitioned table will lose its > partition directories. > > My spark-defaults.conf config: > spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension > > Because I can't reopen SPARK-42550, for details and a reproduction please > see: > https://issues.apache.org/jira/browse/SPARK-42550 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42650) link issue SPARK-42550
[ https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kevinshin updated SPARK-42650: -- Description: When using [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/] and an {{insert overwrite}} statement hits an exception, a non-partitioned table's home directory will be lost, and a partitioned table will lose its partition directories. spark-defaults.conf: spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension Because I can't reopen SPARK-42550, for details and a reproduction please see: https://issues.apache.org/jira/browse/SPARK-42550 was:https://issues.apache.org/jira/browse/SPARK-42550 > link issue SPARK-42550 > -- > > Key: SPARK-42650 > URL: https://issues.apache.org/jira/browse/SPARK-42650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: kevinshin >Priority: Major > > When using > [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/] > and an {{insert overwrite}} statement hits an exception, a non-partitioned > table's home directory will be lost, and a partitioned table will lose its > partition directories. > > spark-defaults.conf: > spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension > > Because I can't reopen SPARK-42550, for details and a reproduction please > see: > https://issues.apache.org/jira/browse/SPARK-42550 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42650) link issue SPARK-42550
kevinshin created SPARK-42650: - Summary: link issue SPARK-42550 Key: SPARK-42650 URL: https://issues.apache.org/jira/browse/SPARK-42650 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.3 Reporter: kevinshin https://issues.apache.org/jira/browse/SPARK-42550 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42642) Make Python the first code example tab in the Spark documentation
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42642. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40250 [https://github.com/apache/spark/pull/40250] > Make Python the first code example tab in the Spark documentation > - > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Assignee: Allan Folting >Priority: Major > Fix For: 3.5.0 > > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language, so it should be > the default language in code examples. This change makes Python the first > code example tab consistently across the documentation, where applicable. > This continues the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > where these two pages were updated: > [https://spark.apache.org/docs/latest/sql-getting-started.html] > [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] > > Pages being updated now: > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-migration-guide.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > > [https://spark.apache.org/docs/latest/quick-start.html] > > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > > [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] > [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html] > [https://spark.apache.org/docs/latest/sql-data-sources-csv.html] > [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html] > [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html] > [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] > [https://spark.apache.org/docs/latest/sql-data-sources-json.html] > [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] > sql-data-sources-protobuf.html > [https://spark.apache.org/docs/latest/sql-data-sources-text.html] > [https://spark.apache.org/docs/latest/sql-migration-guide.html] > [https://spark.apache.org/docs/latest/sql-performance-tuning.html] > [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] > > [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > > [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42642) Make Python the first code example tab in the Spark documentation
[ https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42642: Assignee: Allan Folting > Make Python the first code example tab in the Spark documentation > - > > Key: SPARK-42642 > URL: https://issues.apache.org/jira/browse/SPARK-42642 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Allan Folting >Assignee: Allan Folting >Priority: Major > Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot > 2023-03-01 at 8.10.22 PM.png > > > Python is the most approachable and most popular language, so it should be > the default language in code examples. This change makes Python the first > code example tab consistently across the documentation, where applicable. > This continues the work started with: > https://issues.apache.org/jira/browse/SPARK-42493 > where these two pages were updated: > [https://spark.apache.org/docs/latest/sql-getting-started.html] > [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] > > Pages being updated now: > [https://spark.apache.org/docs/latest/ml-classification-regression.html] > [https://spark.apache.org/docs/latest/ml-clustering.html] > [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/ml-datasource.html] > [https://spark.apache.org/docs/latest/ml-features.html] > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/ml-migration-guide.html] > [https://spark.apache.org/docs/latest/ml-pipeline.html] > [https://spark.apache.org/docs/latest/ml-statistics.html] > [https://spark.apache.org/docs/latest/ml-tuning.html] > > [https://spark.apache.org/docs/latest/mllib-clustering.html] > [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html] > [https://spark.apache.org/docs/latest/mllib-data-types.html] > [https://spark.apache.org/docs/latest/mllib-decision-tree.html] > [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html] > [https://spark.apache.org/docs/latest/mllib-ensembles.html] > [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html] > [https://spark.apache.org/docs/latest/mllib-feature-extraction.html] > [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html] > [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html] > [https://spark.apache.org/docs/latest/mllib-linear-methods.html] > [https://spark.apache.org/docs/latest/mllib-naive-bayes.html] > [https://spark.apache.org/docs/latest/mllib-statistics.html] > > [https://spark.apache.org/docs/latest/quick-start.html] > > [https://spark.apache.org/docs/latest/rdd-programming-guide.html] > > [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] > [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html] > [https://spark.apache.org/docs/latest/sql-data-sources-csv.html] > [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html] > [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html] > [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] > [https://spark.apache.org/docs/latest/sql-data-sources-json.html] > [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] > sql-data-sources-protobuf.html > [https://spark.apache.org/docs/latest/sql-data-sources-text.html] > [https://spark.apache.org/docs/latest/sql-migration-guide.html] > [https://spark.apache.org/docs/latest/sql-performance-tuning.html] > [https://spark.apache.org/docs/latest/sql-ref-datatypes.html] > > [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html] > [https://spark.apache.org/docs/latest/streaming-programming-guide.html] > > [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html] > [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org